Privacy Gaps
This document is the inverse of docs/forensic/: instead of "what we proved we can clean," it lists "what we cannot claim to clean." It exists because honesty is more important than perfection — users need to know where the privacy guarantee bends.
For the design rationale, see .claude/rules/privacy-invariants.md §3 ("Forensic > unit tests for any 'did we remove the metadata?' claim").
Last updated: 2026-05-15 (dropped stale Electron timestamp/xattr subsection alongside preserveTimestamps removal in #99). Status: scaffold. Filled in incrementally as gaps are accepted, formally documented, or closed.
Format-by-format coverage
The current shipping state. Expect this table to drift; the README's Format Support Matrix is the canonical source — this column adds a "what doesn't get cleaned even when the format is 'supported'" view that the matrix doesn't show.
| Format | Status | Known gaps |
|---|---|---|
| JPEG | Supported (full) | None known. See forensic/jpeg.md. |
| PDF | Supported (best-effort) | Embedded files + AcroForm data not touched (see forensic/pdf.md §"caveats"). |
| PNG | Supported (full) | None known. See forensic/png.md. |
| MP4 / MOV | Supported (partial) | Timed-metadata tracks, hdlr names, compressorname, mdat orphans, sidecar files — see §MP4 video gaps below. |
| DOCX / XLSX / PPTX / ODT | Supported (partial) | Tracked changes/comments, RSIDs, embedded media EXIF, customXml/, file paths in *.rels — see Office Phase 2 hardening (issue #62). |
| HEIC / AVIF | Unsupported (in flight) | Strategy tracked in issue #48. |
| GIF / WebP / BMP / TIFF | Unsupported in web build | Hand-rolled walkers planned (see README §"Format Support Matrix"). |
| MKV | Unsupported | Strategy tracked in issue #43, deferred to v6. |
| RAW | Unsupported (v5+) | See §RAW unsupported below. |
| SVG, JXL, JPEG 2000, AVI | Unsupported | No strategy planned for v5 (#44 closed wontfix — small audience, no demand signal). |
macOS extended file attributes (xattr) — lost in Phase G
Decided 2026-05-11, shipped 2026-05-14 (issue #80). Prior to Phase G, the Electron desktop build scrubbed macOS-specific extended attributes (kMDItemContentCreationDate, kMDItemDateAdded, kMDItemFSContentChangeDate, kMDItemFSCreatorCode, com.apple.quarantine, com.apple.metadata:*, etc.) via the XattrCommand running against the system xattr binary. With the Electron shell retired in Phase G, that code path is gone.
What may leak: Spotlight-indexed timestamps, the "Where from" download origin URL, Finder tags, the Quarantine flag (which records the application and the date it was downloaded). These survive on the file even after metadata stripping, because the browser cannot reach the filesystem's xattr namespace from sandboxed JavaScript.
What you can do:
- On the file you saved from MetaScrub, run xattr -c <file> from Terminal. That clears all extended attributes in one command.
- For a directory of cleaned files: find <dir> -type f -exec xattr -c {} +
- If you need a deeper sweep (Spotlight metadata stores, recent-items lists), ExifTool standalone and dedicated forensic tooling go further than the simple xattr -c strip.
Why this trade-off: Phase G retired the Electron shell because the maintenance cost of a full electron-builder + code-signing + release-matrix story exceeded the privacy advantage it added over the PWA. The xattr scrubbing was the only privacy-relevant capability the shell had that the PWA cannot reach; documenting it as a gap and pointing users at xattr -c is the right trade given the audience.
Android INTERNET permission — granted but never used
Decided 2026-05-17 (issue #153). The Capacitor APK declares <uses-permission android:name="android.permission.INTERNET" /> in AndroidManifest.xml. Anyone running aapt dump permissions on the APK will see this and reasonably ask: "but the README says zero network traffic?"
Why the permission is there: Capacitor's WebView interceptor serves bundled assets via https://localhost/ so that the renderer runs in a secure context (required for WASM and service workers). Android's WebView still routes those loads, like loads for any other scheme, through the platform's HTTP plumbing, which requires the app to hold INTERNET. There is no Capacitor configuration that loads local assets without it.
What actually enforces no outbound traffic: the CSP meta tag injected by vite.config.web.ts:
<meta http-equiv="Content-Security-Policy" content="... connect-src 'self' ...">
connect-src 'self' means the WebView refuses any fetch(), XHR, WebSocket, EventSource, or beacon to any origin other than the https://localhost/ Capacitor scheme. The renderer has no analytics SDK, no error reporting, no auto-update check, no font CDN, and no call sites that would hit a remote origin in the first place.
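As a rough illustration of what the directive restricts, the sketch below parses a CSP string and reports whether connect-src resolves to 'self' only. The helper name connectSrcIsSelfOnly and both sample policies are assumptions made for the example; the real policy is whatever vite.config.web.ts injects, and this check is not part of MetaScrub.

```typescript
// Hypothetical helper (not MetaScrub code): reports whether a CSP string
// limits connect-src to 'self'. When connect-src is absent it falls back to
// default-src, which is how browsers resolve fetch directives.
function connectSrcIsSelfOnly(csp: string): boolean {
  const directives = new Map<string, string[]>();
  for (const part of csp.split(";")) {
    const tokens = part.trim().split(/\s+/).filter((t) => t.length > 0);
    const [name, ...values] = tokens;
    if (name === undefined) continue; // empty segment, e.g. trailing ";"
    const key = name.toLowerCase(); // directive names are case-insensitive
    // Duplicate directives are ignored: the first occurrence wins.
    if (!directives.has(key)) directives.set(key, values);
  }
  const sources = directives.get("connect-src") ?? directives.get("default-src");
  if (sources === undefined) return false; // nothing restricts connections
  return sources.length === 1 && sources[0] === "'self'";
}

// Sample policies invented for the example.
const strictPolicy =
  "default-src 'none'; script-src 'self' 'wasm-unsafe-eval'; connect-src 'self'";
const loosePolicy = "default-src 'self'; connect-src 'self' https://example.com";

console.log(connectSrcIsSelfOnly(strictPolicy)); // true
console.log(connectSrcIsSelfOnly(loosePolicy)); // false
```

A check of this shape could run against the built index.html as a cheap regression guard, but it only inspects the policy text; it does not prove the absence of network call sites.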
What you can verify yourself:
- aapt dump permissions app-debug.apk shows only INTERNET (no READ_EXTERNAL_STORAGE, ACCESS_NETWORK_STATE, ACCESS_COARSE_LOCATION, etc.).
- Run the APK behind a host-side firewall or mitmproxy with the device on a captive Wi-Fi network: there should be no outbound DNS or TCP from the app's UID after launch.
- (Future) the no-network Playwright test (issue #67) will exercise this in CI.
What MetaScrub could do (won't): ship a WebView shim that loads assets via content:// instead of https://localhost/, removing the need for INTERNET. The drawback is losing the secure-context guarantee (no WASM, no service worker). The trade-off isn't worth it — the CSP layer + the absence of fetch sites already gives the same outcome with a much smaller change.
Android Downloads/ — world-readable on Android ≤9
Decided 2026-05-17 (issue #153). The Capacitor APK uses the system WebView; <a download> invocations route through Android's DownloadManager, which writes the cleaned file to the public Downloads/ folder. On Android 9 and below, that folder is world-readable — any other app on the device with READ_EXTERNAL_STORAGE (granted at install for legacy apps) can enumerate and read the cleaned file.
What may leak: the cleaned file's existence and contents to other installed apps with broad storage permissions. The metadata strip itself is unaffected — the file is byte-equivalent to what the user would get from the web build. The leak is "another app on the same device can read the file you just saved," not "the metadata wasn't actually stripped."
Affected versions: Android 6.0–9.0 (API 23–28). Android 10+ (API 29+) introduced scoped storage; the cleaned file is isolated to the app's Downloads entry and not world-readable.
What you can do:
- Use Android 10 or later if available (scoped storage applies automatically).
- Move the cleaned file to your app-private storage immediately after the strip (e.g. into Signal's media folder before sharing).
- This isn't an APK-specific regression; it's a property of the platform's pre-scoped-storage filesystem model. A self-hosted PWA install (Add to Home Screen + the browser's download flow) shows the same behaviour on Android ≤9, but the APK is the recommended Android distribution either way (see .claude/rules/project-direction.md).
What MetaScrub could do (deferred): a Capacitor-aware adapter in web_api.ts (detectable via typeof Capacitor !== 'undefined') using @capacitor/filesystem to write into the app's private external storage with an explicit location picker for export. This is a one-issue follow-up; not implemented in the v1 APK pass to keep the patch small.
RAW unsupported
Decided 2026-05-09 (issue #16). Previously, RAW formats (CR2, CR3, NEF, ARW, RAF, ORF, DNG, RW2, X3F, and dozens of vendor variants) were processed by the bundled Perl ExifTool inside the Electron desktop build. Phase D removes that wrapper entirely; v5 ships a single WASM/pure-TS code path that does not cover proprietary RAW.
What may leak: Everything ExifTool's RAW support previously stripped — IFD0 EXIF, GPSInfo, MakerNotes, embedded JPEG previews with their own EXIF, XMP, and IPTC. Dropping a RAW into v5 returns "unsupported"; the file is not modified, so any pre-existing metadata remains.
What you can do:
- Use ExifTool standalone — the canonical reference implementation, far more thorough than any wrapper.
- Convert RAW → JPEG/TIFF in your photo editor first, then process with MetaScrub. Prefer JPEG while TIFF remains unsupported in the web build (see the coverage table above). (Note: the JPEG/TIFF often inherits a subset of the RAW's metadata; verify with exiftool before assuming a clean output.)
- For photo libraries, native OS share-sheets often offer "Remove location" before sharing, which is coarser than MetaScrub but server-free.
Why this trade-off: ExifTool's RAW support represents roughly two decades of reverse-engineering on undocumented proprietary containers. No production-ready WASM library covers that surface (see docs/poc/little-exif-wasm.md and docs/poc/exiv2-wasm.md for evaluations of the closest candidates). Keeping the Perl runtime alive in Electron solely for RAW added complexity disproportionate to the audience size; once the convergence-on-one-code-path direction was committed (project-direction.md), it became dead weight.
ZIP archives
The ZipStrategy (issue #184, shipped 2026-05) cleans ZIP archive metadata and recursively re-cleans every supported inner file. Three known gaps remain:
Encrypted ZIPs are refused, not cleaned
What this means: if your .zip contains entries encrypted with a password (ZipCrypto or AES-via-WinZip), MetaScrub refuses to process the archive and surfaces an "Encrypted ZIP archives aren't supported" message.
Why: the bundled ZIP library (JSZip, already a production dep for Office) rejects any archive containing encrypted entries in loadAsync. Handling such archives would need a parallel byte-level walker, which is significant additional code that we deferred for v1.
Workaround: decrypt the archive with a dedicated tool (7-Zip, unzip from the command line, mat2's archive backend) and re-drop the decrypted contents into MetaScrub. We may add a byte-level fallback in a follow-up if demand surfaces.
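For readers who want to tell in advance whether an archive will be refused, here is a minimal sketch of an encryption pre-check. It relies on the ZIP format's general-purpose bit flag, where bit 0 marks an encrypted entry, and walks the central directory. The function name hasEncryptedEntry is hypothetical, the sketch ignores ZIP64 and archives with leading stub bytes, and it is not how MetaScrub or JSZip detects encryption.

```typescript
// Hypothetical pre-check (not MetaScrub code): true when any central-directory
// entry has general-purpose flag bit 0 (encrypted) set.
// Simplifications: no ZIP64, and offsets are assumed to start at byte 0.
function hasEncryptedEntry(zip: Uint8Array): boolean {
  const view = new DataView(zip.buffer, zip.byteOffset, zip.byteLength);

  // The end-of-central-directory record is at least 22 bytes; scan backwards
  // for its little-endian signature 0x06054b50.
  let eocd = -1;
  for (let i = zip.length - 22; i >= 0; i--) {
    if (view.getUint32(i, true) === 0x06054b50) {
      eocd = i;
      break;
    }
  }
  if (eocd < 0) throw new Error("not a ZIP archive (no EOCD record)");

  const entryCount = view.getUint16(eocd + 10, true);
  let pos = view.getUint32(eocd + 16, true); // central directory offset

  for (let n = 0; n < entryCount; n++) {
    if (pos + 46 > zip.length || view.getUint32(pos, true) !== 0x02014b50) {
      throw new Error("corrupt central directory");
    }
    const flags = view.getUint16(pos + 8, true);
    if ((flags & 0x0001) !== 0) return true;
    // Skip the fixed 46-byte header plus the variable-length fields.
    const nameLen = view.getUint16(pos + 28, true);
    const extraLen = view.getUint16(pos + 30, true);
    const commentLen = view.getUint16(pos + 32, true);
    pos += 46 + nameLen + extraLen + commentLen;
  }
  return false;
}
```

Because the check reads only headers, it works without the password and without touching entry data.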
Self-extracting EXE stub bytes are preserved
What this means: if a .zip is wrapped in a self-extracting Windows executable (the bytes before the first local file header form a PE stub), MetaScrub preserves those bytes verbatim. The stub itself may carry the original creator's identifying metadata (PE timestamps, OriginalFilename string, etc.).
Why: modifying the stub would break the SFX behavior. Distinguishing "intentional SFX stub" from "arbitrary leading garbage" reliably from the byte stream isn't reasonable.
Workaround: repackage the contents as a plain .zip (without the SFX wrapper) before dropping it into MetaScrub.
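To see whether an archive carries leading bytes at all, the sketch below returns whatever precedes the first local file header signature and checks it for the DOS "MZ" magic that opens a PE executable. Both function names are hypothetical. The scan is deliberately naive: the signature bytes can occur by chance inside a stub, which is one reason the distinction cannot be made reliably from the byte stream.

```typescript
// Hypothetical helper (not MetaScrub code): returns the bytes that precede
// the first local file header signature "PK\x03\x04". An empty result means
// the archive has no leading stub.
function leadingStub(zip: Uint8Array): Uint8Array {
  for (let i = 0; i + 4 <= zip.length; i++) {
    if (
      zip[i] === 0x50 &&
      zip[i + 1] === 0x4b &&
      zip[i + 2] === 0x03 &&
      zip[i + 3] === 0x04
    ) {
      return zip.subarray(0, i);
    }
  }
  throw new Error("no local file header found");
}

// A PE executable begins with the two-byte DOS magic "MZ". This is only a
// hint: it says nothing about what else the stub contains.
function looksLikePeStub(stub: Uint8Array): boolean {
  return stub.length >= 2 && stub[0] === 0x4d && stub[1] === 0x5a;
}
```

A non-empty stub that passes looksLikePeStub is a signal to repackage the contents before cleaning.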
Multi-disk / spanned archives are refused
.zip archives split across multiple .z01/.z02/… files are rejected with a parse-failed error. JSZip does not support multi-disk reads. Reassemble the archive locally (e.g. zip -F) before processing.
MP4 / MOV video gaps
The current VideoStrategy (mp4box.js-based box-tree rewriter) drops udta, meta, and Xtra containers but does not cover several known sources of leak. These are tracked individually; this section is the user-facing summary.
Timed-metadata tracks (GoPro GPS, DJI telemetry, CAMM, tmcd)
Status: ✅ Track blanking shipped in PR #120 (#35 closed). Remaining gap: mdat orphan bytes — see below.
What was fixed: Handler-type-gated trak blanking. The strategy now peeks at mdia/hdlr.handler_type before deciding to recurse or blank each track. Kept types: vide, soun, subt, sbtl, clcp, hint. Blanked types (replaced with same-size free box): meta (GoPro GPMF), tmcd (timecode), data, camm, text (DJI telemetry), and any other unrecognised handler.
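The gating rule can be sketched as follows. This is an illustration under stated assumptions rather than the VideoStrategy source: the names KEPT_HANDLERS, shouldBlankTrack, and blankBoxInPlace are hypothetical, the box is assumed to use a plain 32-bit size field, and zeroing the payload is an assumption about what "blanking" does to the bytes.

```typescript
// Handler types the section lists as kept; any other handler is blanked.
const KEPT_HANDLERS = new Set(["vide", "soun", "subt", "sbtl", "clcp", "hint"]);

// Allow-list logic: unknown handlers are treated as metadata and blanked.
function shouldBlankTrack(handlerType: string): boolean {
  return !KEPT_HANDLERS.has(handlerType);
}

// Overwrites one box in place with a same-size "free" box. The size field is
// left untouched so every later box keeps its offset; the type becomes "free"
// and the payload is zeroed (assumption). Boxes using size == 1 (64-bit
// largesize) or size == 0 (extends to end of file) are not handled here.
function blankBoxInPlace(file: Uint8Array, boxStart: number): void {
  const view = new DataView(file.buffer, file.byteOffset, file.byteLength);
  const size = view.getUint32(boxStart, false); // ISOBMFF fields are big-endian
  if (size < 8 || boxStart + size > file.length) {
    throw new Error("unsupported or truncated box");
  }
  file.set([0x66, 0x72, 0x65, 0x65], boxStart + 4); // ASCII "free"
  file.fill(0, boxStart + 8, boxStart + size);
}
```

Keeping the box the same size is what allows the rewrite to leave chunk offsets in the remaining tracks valid, since nothing after the blanked box moves.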
Known side-effect gap: QuickTime .mov files may embed subtitle or chapter tracks with handler_type = "text" — the same four-byte code DJI uses for flight-log telemetry. There is no way to distinguish them at the hdlr level alone. These subtitle tracks are blanked (privacy-first). If you have .mov files with embedded subtitles you need to preserve, extract them before stripping (ffmpeg -i input.mov -map 0:s subtitles.srt). This is documented in docs/gap-analysis/video.md §handler-type-text.
What still leaks: The raw sample bytes the timed-metadata track's stco/co64 pointed at remain in mdat. See Orphaned mdat bytes below.
Orphaned mdat bytes after track blanking
Status: issue #42 (privacy gap — partial mitigation now shipped via #35 trak blanking).
What it is: When a metadata track is blanked at the box-tree level, the underlying mdat (media data) atom still contains the raw sample bytes the track referenced. A forensic walk can carve those bytes back out via strings | grep or structural carving for GPMF magic bytes.
What may leak: GPS coordinates as ASCII or packed binary, gyro readings — the same data that was in the blanked telemetry track.
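What you can do now: Remux with ffmpeg -i input.mp4 -map_metadata -1 -map 0:v -map 0:a -c copy output.mp4 to keep only the video and audio tracks (write -map 0:a? if the file may have no audio track, since a plain -map 0:a fails when nothing matches). The remux copies the encoded samples without re-encoding and writes a fresh mdat containing only the samples those tracks reference, which eliminates the orphan bytes.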
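A quick way to check a file for this kind of residue is to scan it for a telemetry marker. The sketch below looks for a four-character code in a byte buffer; GPMF payloads typically open with the key "DEVC", so hits inside mdat after cleaning suggest orphaned samples. The function name findFourCC is hypothetical, and a match is only a hint, because four arbitrary bytes of compressed video can coincide with the marker.

```typescript
// Hypothetical forensic check (not MetaScrub code): returns every offset at
// which a four-character code occurs in the buffer.
function findFourCC(bytes: Uint8Array, fourcc: string): number[] {
  if (fourcc.length !== 4) throw new Error("fourcc must be exactly 4 characters");
  const needle = Array.from(fourcc, (c) => c.charCodeAt(0));
  const hits: number[] = [];
  for (let i = 0; i + 4 <= bytes.length; i++) {
    if (
      bytes[i] === needle[0] &&
      bytes[i + 1] === needle[1] &&
      bytes[i + 2] === needle[2] &&
      bytes[i + 3] === needle[3]
    ) {
      hits.push(i);
    }
  }
  return hits;
}

// Usage sketch: `cleaned` would hold the bytes of a cleaned MP4.
// const residue = findFourCC(cleaned, "DEVC");
// if (residue.length > 0) console.warn("possible GPMF residue at", residue);
```

Running the same scan on the remuxed output is a simple way to confirm that the orphan bytes are gone.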
hdlr handler name strings (encoder fingerprint)
Status: issue #38 (fingerprint hardening, lower priority).
What it is: ISOBMFF hdlr boxes carry a human-readable name string (often "VideoHandler", "Apple Video Media Handler", "GoPro AVC Encoder", etc.). The current strategy doesn't zero these.
What may leak: Encoder/device family — not direct identity but useful for fingerprinting.
compressorname in avc1/hvc1 codec sample entries
Status: issue #39 (fingerprint hardening, lower priority).
What it is: Sample entries for H.264/H.265 codecs carry a 32-byte compressorname field (commonly "H.264", "x264 - core 152", etc.). Same fingerprinting concern as hdlr names.
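To inspect what a given file exposes here, the field can be read straight from the sample entry. The sketch below assumes the standard ISOBMFF VisualSampleEntry layout, in which the 32-byte compressorname starts 50 bytes after the start of the box (8-byte box header plus 42 bytes of fixed fields) and stores its length in the first byte. The function name readCompressorName is hypothetical, and the caller is assumed to have already located the avc1/hvc1 box.

```typescript
// Hypothetical reader (not MetaScrub code). `entry` must start at the first
// byte of an avc1/hvc1 sample entry box, i.e. at its 32-bit size field.
const COMPRESSORNAME_OFFSET = 50; // 8 (box header) + 42 (fixed fields)
const COMPRESSORNAME_SIZE = 32; // 1 length byte + up to 31 characters

function readCompressorName(entry: Uint8Array): string {
  if (entry.length < COMPRESSORNAME_OFFSET + COMPRESSORNAME_SIZE) {
    throw new Error("sample entry too short");
  }
  const view = new DataView(entry.buffer, entry.byteOffset, entry.byteLength);
  // Clamp the declared length so a malformed file cannot read past the field.
  const declared = view.getUint8(COMPRESSORNAME_OFFSET);
  const length = Math.min(declared, COMPRESSORNAME_SIZE - 1);
  let name = "";
  for (let i = 0; i < length; i++) {
    name += String.fromCharCode(view.getUint8(COMPRESSORNAME_OFFSET + 1 + i));
  }
  return name;
}
```

An empty string means the field carries no name, which is what a hardened output would be expected to show once issue #39 is addressed.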
H.264/H.265 SEI NAL units
Status: issue #41 (known limitation — accepted, not fixable without re-encoding).
What it is: H.264/H.265 video streams can carry SEI (Supplemental Enhancement Information) NAL units inside the encoded bitstream itself. These can encode timestamps, GPS coordinates, recording-device identifiers, and arbitrary user data.
What may leak: Same surface as timed-metadata tracks, but baked into the video stream. Removing them requires re-encoding the video — which violates the project's "no quality loss" promise and breaks the core forensic invariant ("we don't decode and re-encode").
What you can do: Re-encode with a tool that strips SEI: ffmpeg -i input.mp4 -map_metadata -1 -bsf:v "filter_units=remove_types=6" -c:v libx264 -c:a copy output.mp4. Note: this DOES re-encode (lossy) and is therefore explicitly out of scope for MetaScrub's pure-strip approach.
Sidecar files (DJI .SRT, GoPro .THM/.LRV, Insta360 .LRV/.LRF)
Status: issue #46 (priority-1, privacy must-fix).
What it is: Many cameras drop sidecar files alongside the video — most notably DJI's .SRT files, which contain literal lat,lon,alt,timestamp per frame as plain ASCII. MetaScrub only processes the file the user dropped; the sidecar sitting in the same folder is untouched.
What may leak: The entire flight path / route in plain text. Worse than the in-file gaps because the data is outside the file the user thinks they cleaned.
What you can do (until #46 ships): Manually delete sidecar files. Common patterns: *.SRT next to *.MP4 (DJI); *.THM and *.LRV (GoPro); *.LRV / *.LRF (Insta360). The planned UX will detect these and offer to delete them at the same time as the parent file is cleaned.
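The manual check can be approximated with a small script. The sketch below matches the extensions named above against a folder listing. It assumes each sidecar shares the video's base name, which is a simplification: some cameras may name their low-resolution proxies with a different prefix, so an empty result does not prove that no sidecar exists. All names in the code are hypothetical, and this is not the detection planned for #46.

```typescript
// Sidecar extensions named in this section (DJI, GoPro, Insta360).
const SIDECAR_EXTENSIONS = ["SRT", "THM", "LRV", "LRF"];

// Builds the candidate sidecar names for one video, assuming a shared
// base name (assumption, see the lead-in).
function sidecarCandidates(videoName: string): string[] {
  const dot = videoName.lastIndexOf(".");
  const base = dot > 0 ? videoName.slice(0, dot) : videoName;
  return SIDECAR_EXTENSIONS.map((ext) => `${base}.${ext}`);
}

// Returns the entries of `folderListing` that look like sidecars of
// `videoName`. Comparison is case-insensitive because cameras and
// filesystems disagree on extension case.
function findSidecars(videoName: string, folderListing: string[]): string[] {
  const wanted = new Set(
    sidecarCandidates(videoName).map((name) => name.toUpperCase()),
  );
  return folderListing.filter((name) => wanted.has(name.toUpperCase()));
}

// Usage sketch with an invented folder listing.
const listing = ["DJI_0001.MP4", "DJI_0001.SRT", "dji_0001.lrf", "notes.txt"];
console.log(findSidecars("DJI_0001.MP4", listing));
```

The script only lists candidates; deleting them remains a deliberate manual step.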
Filesystem timestamps
Decided 2026-05-12 (issue #83). The app zeros every in-file timestamp it can reach (see privacy-invariants.md §6 for the full policy) — but the filesystem timestamps on the cleaned output file are partially out of reach. Two known platform gaps, both documented here rather than fixed in code:
Web build — output download mtime/atime is OS-clock time
Status: platform-inherent. No code fix available without violating the "no server-side processing" invariant.
What it is: When the browser writes a downloaded file to disk via <a download> or the File System Access API, the OS sets the file's mtime and atime from the system clock. There is no utimes-equivalent API exposed to web pages. The data inside the file is timestamp-free (see privacy-invariants.md §6 and the per-format forensic writeups), but the file's "Date modified" in the user's Downloads folder reflects when they downloaded it.
What may leak: Time-of-download — not time-of-content. An adversary inspecting a downloaded file's filesystem mtime learns roughly when the user saved it, not when the photo was taken / video was recorded. The correlation is weaker the longer the user holds the file before sharing.
Partial mitigation: Every File the web build constructs sets lastModified: 0. <a download> ignores this (the OS clock wins), but Web Share Target consumers (#23) may honor it — meaning files shared via iOS/Android share sheets carry a zero lastModified to the receiving app. The disk-write gap remains.
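A minimal sketch of that construction, using the standard File constructor: the function name toZeroedFile and its parameters are placeholders, not the web build's actual helper, and the example assumes a runtime that exposes File and Blob as globals (browsers, or a recent Node.js).

```typescript
// Hypothetical wrapper (not MetaScrub code): builds a File whose
// lastModified is the Unix epoch instead of the default "now".
function toZeroedFile(cleaned: Blob, name: string, mimeType: string): File {
  return new File([cleaned], name, { type: mimeType, lastModified: 0 });
}

// Usage sketch with placeholder content.
const zeroed = toZeroedFile(new Blob(["cleaned bytes"]), "cleaned.jpg", "image/jpeg");
console.log(zeroed.lastModified); // 0
```

Omitting lastModified would make the constructor default to the current time, which is the value a receiving app would otherwise see.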
What you can do: If filesystem-timestamp parity matters for your threat model:
- Touch the file to a known time after download: touch -t 197001010000 cleaned.jpg.
- Move the file through tmpfs / a virtual disk to a different filesystem to break the original mtime.
- For Linux/macOS users: a one-liner find ~/Downloads/cleaned*.jpg -exec touch -t 197001010000 {} \; after a batch.
File creation time (birthtime / crtime) cannot be set from a sandboxed renderer
Status: platform-inherent across both builds.
What it is: macOS HFS+/APFS, Linux ext4, and NTFS all store a separate file-creation time alongside mtime/atime. No portable cross-platform API exists to set creation time after a file has been written — the OS records it once at write time and reading-back tools (stat, Finder "Created", File Explorer) report that value.
What may leak: Time-of-write — same shape as the mtime gap above, but resistant to the touch workaround (which only affects mtime/atime).
What you can do:
- macOS: SetFile -d "01/01/1970 00:00:00" file.jpg (requires Xcode Command Line Tools).
- Linux ext4: no portable post-write API; debugfs is offline-only and root-required.
- Windows: PowerShell (Get-Item file.jpg).CreationTime = '1970-01-01 00:00:00' works for NTFS.
- Cross-platform: copy the file across a filesystem boundary (e.g., to a tmpfs or an exFAT USB). Many copy operations recreate birthtime as "now," but you can then touch mtime/atime to epoch.
How this document gets updated
- A new gap is identified → file an issue → add a section here pointing to the issue.
- An existing gap is fixed → the section moves to the relevant docs/forensic/<format>.md writeup (proving the gap is closed) and is removed from here.
- A gap is permanently accepted (e.g. SEI NAL units, where the fix violates a project invariant) → the section stays here permanently with the workaround.
This file is required reading for anyone touching format strategies. The contrast between forensic/ (proven clean) and PRIVACY_GAPS.md (known dirty) is the project's honesty surface.