Generic ZIP metadata-stripping gap analysis
Date: 2026-05-22

Goal: Document the gap between (a) no ZIP support in v5 today, (b) ExifTool's generic-ZIP write (documented as "very limited"), (c) a theoretical thorough rewrite, and (d) the policy that ships in this PR: recursive cleaning of inner files via the strategy registry, plus epoch-normalized per-entry timestamps and scrubbed comments/extra fields. Closes issue #184.
Methodology
Read:
- PKWARE APPNOTE 6.3.10 (ZIP file format specification): sections 4.1 (overall format), 4.3 (local file header + central directory record + EOCD), 4.4 (per-field semantics), 4.5 (extensible data fields, i.e. extra-field records), 4.6 (third-party extra-field mappings).
- ExifTool documentation at https://exiftool.org/TagNames/ZIP.html and the Perl source `lib/Image/ExifTool/ZIP.pm` for the read path; ExifTool docs at https://exiftool.org/index.html#limitations for the "Writing of ZIP files is not yet supported" caveat.
- JSZip source (the `generateAsync` writer and the `loadAsync` parser): confirmed it accepts the `date` option and writes the corresponding DOS-time fields in both the LFH and the central directory record. Confirmed the default is `new Date()` (a privacy bug if relied on; see invariants §6).
- mat2 source (`libmat2/archive.py`, `ZipParser`): confirmed mat2 recurses into archive entries, calling format-specific parsers per entry, and rewrites the archive with epoch timestamps. The single archive-level field, the comment, is set to empty on re-emit. This is much closer to our shipping policy than ExifTool's.
Verified empirically (POC in /tmp/zip-poc/, not committed):
- Built a 3-entry archive: `photo.jpg` with sentinel EXIF `Artist=SENTINEL-A`, `doc.pdf` with sentinel `/Author=SENTINEL-B`, `note.txt` with no metadata. Archive comment set to `SENTINEL-ARCHIVE`. Per-entry comment on `photo.jpg` set to `SENTINEL-ENTRY`.
- Ran `exiftool -all= -Time:All= -overwrite_original archive.zip`. Recovery battery (`unzip -z` / `unzip -lv` / `exiftool` on extracted entries):
  - Archive comment: dropped (sole writable field).
  - Per-entry comment: survived (ExifTool doesn't touch).
  - Per-entry timestamps: survived (ExifTool doesn't touch).
  - Inner JPEG EXIF: survived (ExifTool reads through but does not re-write inner entries).
  - Inner PDF Info: survived (same).
- Ran `mat2 archive.zip` (output: `archive.cleaned.zip`). Same battery:
  - Archive comment: dropped.
  - Per-entry comment: dropped.
  - Per-entry timestamps: normalized to 1980-01-01 00:00:00 (DOS epoch).
  - Inner JPEG EXIF: dropped (mat2 recurses to its JPEG parser).
  - Inner PDF Info: partial. The Info dict is cleared, but mat2's PDF backend leaves the same residue that ExifTool's PDF backend does (see `docs/forensic/pdf.md` for the analogous gap in our PDF strategy's reference comparison).
The mat2 result is the closer reference: ExifTool's generic-ZIP write is too thin to constitute a baseline.
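The 1980-01-01 00:00:00 value in the battery above is the smallest timestamp the packed MS-DOS date/time fields in the LFH and central directory can represent. A minimal sketch of that packing follows; `toDosDateTime`, `fromDosDateTime`, and `DOS_EPOCH` are illustrative names, not identifiers from this PR, and the function takes plain calendar components so that no timezone handling is implied.

```typescript
// Packed MS-DOS date/time, as stored in the ZIP "last mod file time/date" fields.
interface DosDateTime {
  time: number; // bits 15-11 hours, 10-5 minutes, 4-0 seconds / 2
  date: number; // bits 15-9 years since 1980, 8-5 month (1-12), 4-0 day (1-31)
}

function toDosDateTime(
  year: number,
  month: number,
  day: number,
  hours = 0,
  minutes = 0,
  seconds = 0,
): DosDateTime {
  // Seven bits of year offset: 1980 + 0 .. 1980 + 127.
  if (year < 1980 || year > 2107) {
    throw new RangeError("DOS dates cover 1980 through 2107 only");
  }
  const time = (hours << 11) | (minutes << 5) | (seconds >> 1);
  const date = ((year - 1980) << 9) | (month << 5) | day;
  return { time, date };
}

function fromDosDateTime({ time, date }: DosDateTime) {
  return {
    year: 1980 + (date >> 9),
    month: (date >> 5) & 0x0f,
    day: date & 0x1f,
    hours: time >> 11,
    minutes: (time >> 5) & 0x3f,
    seconds: (time & 0x1f) * 2, // 2-second resolution
  };
}

// The normalized value: { time: 0x0000, date: 0x0021 }.
// The date field is not zero because month 0 and day 0 are not valid.
const DOS_EPOCH = toDosDateTime(1980, 1, 1);
```

The 2-second resolution is one reason the extended-timestamp and NTFS extra fields count as separate, higher-fidelity leaks: normalizing the DOS fields alone does not touch them.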
Per-source policy table
| Source | Surface | Current v5 | ExifTool -all= -Time:All= | mat2 | Theoretical | Ships in this PR |
|---|---|---|---|---|---|---|
| Archive comment | EOCD .ZIP file comment | (no strategy) | dropped | dropped | dropped | dropped (empty string on re-emit) |
| Zip64 EOCD comment | Zip64 EOCD .ZIP file comment | (no strategy) | not touched | dropped (rewrite) | dropped | dropped (re-emit rebuilds EOCD without comment) |
| Per-entry comment | CD entry file comment | (no strategy) | not touched | dropped | dropped | dropped |
| Per-entry timestamp (LFH) | LFH last mod file time + date | (no strategy) | not touched | epoch | epoch (1980-01-01) | epoch |
| Per-entry timestamp (CD) | CD entry last mod file time + date | (no strategy) | not touched | epoch | epoch | epoch |
| Per-entry extra field — UT (0x5455, extended timestamp) | LFH/CD extra field | (no strategy) | not touched | dropped (mat2 strips all non-Zip64 extras) | dropped (UT records leak mtime/atime/ctime) | dropped |
| Per-entry extra field — UID/GID (0x7875, Info-ZIP Unix v1) | LFH/CD extra field | (no strategy) | not touched | dropped | dropped (uid/gid identify the creator's user account) | dropped |
| Per-entry extra field — NTFS (0x000a, NTFS times) | LFH/CD extra field | (no strategy) | not touched | dropped | dropped (100-ns NTFS times are higher-fidelity than DOS) | dropped |
| Per-entry extra field — Zip64 (0x0001) | LFH/CD extra field | n/a | preserved | preserved (structural) | preserved (required for archives > 4 GB) | preserved (structural; required for round-trip) |
| Inner JPEG metadata | nested EXIF / XMP / IPTC / Photoshop / Comment | (no strategy) | not touched (no recursion) | dropped (mat2 recurses) | dropped | dropped (recursion via selectStrategy → JpegStrategy) |
| Inner PNG metadata | nested tEXt/zTXt/iTXt/eXIf | (no strategy) | not touched | dropped (mat2 recurses) | dropped | dropped (recursion via selectStrategy → PngStrategy) |
| Inner PDF metadata | nested Info dict + XMP | (no strategy) | not touched | partial (same gap as ours) | dropped (theoretical) | dropped to the same bar as docs/forensic/pdf.md for standalone PDFs |
| Inner Office docProps | nested docProps/core.xml etc. | (no strategy) | not touched | dropped (mat2 recurses) | dropped | dropped (recursion via selectStrategy → OfficeStrategy) |
| Inner MP4 metadata | nested moov/udta/meta | (no strategy) | not touched | partial (mat2's video coverage) | dropped | dropped (recursion via selectStrategy → VideoStrategy) |
| Inner HEIC/AVIF/WebP/GIF metadata | nested boxes/chunks | (no strategy) | not touched | depends on mat2 backend | dropped | dropped for formats with a registered strategy (currently HEIC unsupported; AVIF/WebP/GIF via ExifToolFallbackStrategy) |
| Inner nested .zip | recursive archive | (no strategy) | not touched | dropped (recursive) | dropped | dropped (recursion: selectStrategy → ZipStrategy → walks again) |
| Encrypted entry content | LFH GP-bit 0 set | n/a | n/a | refused (mat2 fails on encrypted entries) | not strippable without password | refused with invalid-file-format directing user to a decryption-capable tool — original direction was "pass-through with per-file warning" but JSZip's loadAsync refuses any archive containing encrypted entries, blocking the partial-passthrough path at the library level. Implementation note: a byte-level walker bypassing JSZip would unblock passthrough; deferred to a follow-up. |
| Self-extracting EXE stub | bytes before first LFH | n/a | preserved | refused (mat2 won't process SFX) | preserved (modifying breaks SFX) | preserved (gap; documented in PRIVACY_GAPS.md) |
| Per-entry filename | CD entry file name | n/a | preserved | preserved (content, not metadata) | preserved | preserved (content, not metadata) |
| Per-entry CRC32 | LFH/CD crc-32 | n/a | preserved | preserved | preserved (structural) | preserved (structural; JSZip recomputes) |
| Per-entry compression method + level | LFH/CD compression method | n/a | preserved | normalized to DEFLATE | preserved (don't surprise users with size profile changes) | preserved (match input method per-entry) |
| Per-entry internal/external file attributes | CD entry internal/external file attributes | n/a | preserved | preserved (filesystem permissions) | preserved (Unix mode bits + DOS attributes are filesystem-level, not user identity) | preserved |
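The extra-field rows in the table reduce to one rule: keep the Zip64 record (0x0001) and drop everything else. An extra-field block is a sequence of records, each a 2-byte little-endian header ID, a 2-byte little-endian data size, and then the data. The sketch below shows that filter on a raw block; `keepOnlyZip64` is a hypothetical helper written for illustration, and it is an assumption that a byte-level filter like this is how the policy would be applied (the PR may instead simply re-emit entries without extras).

```typescript
const ZIP64_EXTRA_ID = 0x0001;

// Returns a new extra-field block containing only Zip64 records.
// Records with any other header ID (0x5455 UT, 0x7875 UID/GID,
// 0x000a NTFS, unknown vendor IDs) are dropped.
function keepOnlyZip64(extra: Uint8Array): Uint8Array {
  const view = new DataView(extra.buffer, extra.byteOffset, extra.byteLength);
  const kept: Uint8Array[] = [];
  let offset = 0;

  while (offset + 4 <= extra.byteLength) {
    const id = view.getUint16(offset, true);
    const size = view.getUint16(offset + 2, true);
    const end = offset + 4 + size;
    // A record that claims more bytes than remain is malformed;
    // stop here, which drops the malformed tail.
    if (end > extra.byteLength) break;
    if (id === ZIP64_EXTRA_ID) kept.push(extra.subarray(offset, end));
    offset = end;
  }

  const total = kept.reduce((n, record) => n + record.byteLength, 0);
  const out = new Uint8Array(total);
  let pos = 0;
  for (const record of kept) {
    out.set(record, pos);
    pos += record.byteLength;
  }
  return out;
}
```

An allow-list is used rather than a deny-list of known leaky IDs, so an unrecognized vendor record is dropped by default instead of surviving by omission.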
Honest gap summary
Current v5 (no strategy) vs reference (mat2): total gap. Generic ZIPs route to "unsupported" today, bypassing every privacy guarantee MetaScrub makes for the inner files. Recursive cleaning is the only architecturally coherent fix.
ExifTool -all= vs mat2: ExifTool is not a viable reference. It writes only the archive-level comment for generic ZIPs and refuses to recurse into entries (it special-cases Office/EPUB/APK/JAR for read-only metadata enumeration only). The ~95% of the surface that matters (inner-file metadata + per-entry timestamps) is untouched. The ExifTool comparison row exists in the forensic battery to make this visible, not as a target to match.
mat2 vs theoretical: mat2 is genuinely close to the theoretical maximum on the recursive case. Its weaknesses are inherited from per-format backends (its PDF clean is partial in the same way ExifTool's is; same applies to MP4). On the ZIP-level work (per-entry epoch timestamps, scrubbed comments/extras), mat2 is essentially equivalent to a thorough rewrite. The shipping policy in this PR matches mat2 at the ZIP level and meets-or-exceeds it on inner formats where MetaScrub has dedicated hand-rolled walkers (JPEG, PNG, Office, MP4, PDF beat ExifTool at the per-format level — see the per-format forensic docs).
This PR vs mat2: identical at the ZIP layer; identical or better at the inner-format layer (we re-use our existing strategies). Encrypted-entry handling ends up equivalent: mat2 refuses encrypted archives, and this PR refuses them with invalid-file-format. The maintainer's original direction was pass-through-with-warning (see spec §3), on the rationale that refusing the whole archive is a worse user outcome when most entries are unencrypted. JSZip's loadAsync rejects any archive containing encrypted entries, so that path is blocked at the library level and deferred to a follow-up (see the encrypted-entry row in the policy table).
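At the byte level, an encrypted entry is one whose general-purpose flag has bit 0 set (the "GP-bit 0" in the policy table). The sketch below reads that bit from the central directory, where every record states its own name, extra-field, and comment lengths. It is a hypothetical illustration of what a byte-level walker could look like, not the detection this PR uses; `encryptedEntryNames` is an invented name, Zip64 archives are rejected rather than handled, and file names are decoded as single-byte characters only.

```typescript
const EOCD_SIGNATURE = 0x06054b50;
const CD_SIGNATURE = 0x02014b50;

// Lists the names of central-directory entries whose GP flag bit 0 is set.
function encryptedEntryNames(zip: Uint8Array): string[] {
  const view = new DataView(zip.buffer, zip.byteOffset, zip.byteLength);

  // The EOCD is 22 bytes plus an optional trailing comment, so scan
  // backwards from the last position where it could start.
  let eocd = -1;
  for (let i = zip.byteLength - 22; i >= 0; i--) {
    if (view.getUint32(i, true) === EOCD_SIGNATURE) {
      eocd = i;
      break;
    }
  }
  if (eocd < 0) throw new Error("EOCD record not found");

  const totalEntries = view.getUint16(eocd + 10, true);
  const cdOffset = view.getUint32(eocd + 16, true);
  if (totalEntries === 0xffff || cdOffset === 0xffffffff) {
    throw new Error("Zip64 archives are outside this sketch");
  }

  const names: string[] = [];
  let p = cdOffset;
  for (let n = 0; n < totalEntries; n++) {
    if (p + 46 > zip.byteLength || view.getUint32(p, true) !== CD_SIGNATURE) {
      throw new Error("malformed central directory record");
    }
    const flags = view.getUint16(p + 8, true);
    const nameLen = view.getUint16(p + 28, true);
    const extraLen = view.getUint16(p + 30, true);
    const commentLen = view.getUint16(p + 32, true);
    if (p + 46 + nameLen > zip.byteLength) {
      throw new Error("file name runs past end of archive");
    }

    if ((flags & 0x0001) !== 0) {
      let name = "";
      for (let i = 0; i < nameLen; i++) {
        name += String.fromCharCode(view.getUint8(p + 46 + i));
      }
      names.push(name);
    }
    // Fixed 46-byte header plus the three variable-length fields.
    p += 46 + nameLen + extraLen + commentLen;
  }
  return names;
}
```

A production walker would also need Zip64 EOCD handling and a guard against an EOCD signature that happens to appear inside the archive comment.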
Recommendation
Hand-rolled walker over JSZip:
- JSZip is already a production dep (`OfficeStrategy` uses it); no new dependency.
- The library handles the structural concerns we don't want to re-implement (DEFLATE round-trip, central-directory rebuild, Zip64 promotion when needed).
- The metadata we care about (timestamps, comments, extra fields) is reachable via JSZip's per-entry options (`date`, `comment`) or by re-emitting the entry without the metadata payload.
- Encrypted-entry detection: JSZip's `loadAsync` throws `"Encrypted zip are not supported"` on any archive containing encrypted entries. We catch that error and surface a structured `invalid-file-format` result. An earlier implementation used a hand-rolled LFH GP-flag scanner (~30 lines), but it had two blind spots: ZIP64 entries (compressed size = 0xFFFFFFFF in the LFH, real size in the Zip64 extra field) desynchronised the stride math, and data-descriptor entries (GP-flag bit 3) broke the scan on any streaming entry preceding an encrypted one. JSZip's detection runs on the same bytes it parses and has neither blind spot.
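The catch-and-classify step in the last bullet could look roughly like the sketch below. The error text and the `invalid-file-format` and `parse-failed` codes come from this document; the `ZipFailure` shape, the `classifyZipLoadError` name, and the user-facing message are assumptions made for illustration, as is routing every non-encryption failure to `parse-failed`.

```typescript
type ZipFailureCode = "invalid-file-format" | "parse-failed";

interface ZipFailure {
  ok: false;
  code: ZipFailureCode;
  message: string;
}

// Message JSZip's loadAsync is documented above as throwing.
const ENCRYPTED_MARKER = "Encrypted zip are not supported";

// Maps whatever loadAsync rejected with onto a structured result.
function classifyZipLoadError(err: unknown): ZipFailure {
  const text = err instanceof Error ? err.message : String(err);
  if (text.includes(ENCRYPTED_MARKER)) {
    return {
      ok: false,
      code: "invalid-file-format",
      message:
        "Archive contains encrypted entries; decrypt it with a password-capable tool first.",
    };
  }
  // Everything else is treated as a generic parse failure in this sketch.
  return { ok: false, code: "parse-failed", message: text };
}

// Intended call site (not executed here, since it needs JSZip):
//   try { zip = await JSZip.loadAsync(bytes); }
//   catch (err) { return classifyZipLoadError(err); }
```

Matching on message text is brittle across library upgrades, so a regression test that feeds a real encrypted fixture through `loadAsync` would be a sensible companion to this mapping.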
Library evaluations explicitly ruled out:
- Rewriting ZIP from scratch. ~3000 lines of bytewise PKWARE APPNOTE compliance, including DEFLATE, Zip64 promotion thresholds, and central-directory rebuild. Not worth it when JSZip handles the structural surface for us.
- fflate (alternative JS zip lib, ~12 KB gzip). Smaller than JSZip but doesn't expose the per-entry comment or extra-field options we need; we'd be writing the same byte-walking code we'd write anyway, just on top of a less-featured library. Adding a second zip library is also a fresh prod dep against the 4-dep ceiling.
Phase plan
This PR ships the full shipping policy in §"Per-source policy table" plus the per-leaf diff UI tree. Deferred items:
- Streaming MP4/Office strip for large archives: out of scope; tracked in #34 (which would also benefit standalone large-file processing).
- Self-extracting EXE stub scrubbing: documented gap, requires distinguishing stub-PE bytes from arbitrary leading garbage. Not worth the engineering cost for the audience.
- Decryption of encrypted entries: out of scope (no password prompts; see invariants).
- Multi-disk / spanned archives: JSZip rejects them; surfaces as `parse-failed`.
- ZIP64 archives > 4 GB: Zip64 is supported in pass-through; not sentinel-tested at that scale (would need a multi-GB fixture).