exifcleaner-web/docs/gap-analysis/zip.md
forgejo_admin d9e763b76e
feat(zip): generic ZIP support with recursive inner-file cleaning (#184) (#188)
2026-05-22 20:32:03 +04:00

Generic ZIP metadata-stripping gap analysis

Date: 2026-05-22

Goal: Document the gap between (a) no ZIP support in v5 today, (b) ExifTool's generic-ZIP write (documented as "very limited"), (c) a theoretical thorough rewrite, and (d) the policy that ships in this PR: recursive cleaning of inner files via the strategy registry, plus epoch-normalized per-entry timestamps and scrubbed comments and extra fields. Closes issue #184.

Methodology

Read:

  • PKWARE APPNOTE 6.3.10 (ZIP file format specification) — sections 4.1 (overall format), 4.3 (local file header + central directory record + EOCD), 4.4 (per-field semantics), 4.5 (extra-field records), 4.6 (extensible data fields).
  • ExifTool documentation at https://exiftool.org/TagNames/ZIP.html and the Perl source lib/Image/ExifTool/ZIP.pm for the read path; ExifTool docs at https://exiftool.org/index.html#limitations for the "Writing of ZIP files is not yet supported" caveat.
  • JSZip source (the generateAsync writer and the loadAsync parser) — confirmed it accepts the date option and writes the corresponding DOS-time fields in both the LFH and the central directory record. Confirmed default behavior is new Date() (a privacy bug if relied on — see invariants §6).
  • mat2 source (libmat2/archive.py ZipParser): confirmed mat2 recurses into archive entries, calling format-specific parsers per entry, and rewrites the archive with epoch timestamps. The only archive-level field is the comment, which mat2 sets to empty on re-emit. This is much closer to our shipping policy than ExifTool's.
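The JSZip finding above (an explicit date option, with new Date() as the default) suggests pinning both per-entry fields on every write. The following is a minimal sketch, not the code in this PR: ZipWriter is a hypothetical structural stand-in for the one JSZip method used (file(name, data, options) with its date and comment options), and addScrubbedEntry and DOS_EPOCH are illustrative names. Building the epoch in UTC is an assumption about how the writer converts the Date to DOS fields.

```typescript
// Hypothetical stand-in for the slice of JSZip's `file(name, data, options)` API used here.
interface EntryOptions {
  date: Date;
  comment: string;
}
interface ZipWriter {
  file(name: string, data: Uint8Array, options: EntryOptions): unknown;
}

// 1980-01-01 00:00:00, the earliest instant a DOS timestamp can represent.
// Assumption: the writer reads the Date with UTC getters.
const DOS_EPOCH = new Date(Date.UTC(1980, 0, 1, 0, 0, 0));

// Add an entry with an explicit epoch date and an empty comment, so the
// writer never falls back to its `new Date()` default.
function addScrubbedEntry(zip: ZipWriter, name: string, data: Uint8Array): void {
  zip.file(name, data, { date: DOS_EPOCH, comment: "" });
}
```

Passing the options on every call, rather than relying on a default, keeps the wall-clock time of the cleaning run out of the output.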

Verified empirically (POC in /tmp/zip-poc/, not committed):

  • Built a 3-entry archive: photo.jpg with sentinel EXIF Artist=SENTINEL-A, doc.pdf with sentinel /Author=SENTINEL-B, note.txt with no metadata. Archive comment set to SENTINEL-ARCHIVE. Per-entry comment on photo.jpg set to SENTINEL-ENTRY.
  • Ran exiftool -all= -Time:All= -overwrite_original archive.zip. Recovery battery (unzip -z / unzip -lv / exiftool on extracted entries):
    • Archive comment: dropped (sole writable field).
    • Per-entry comment: survived (ExifTool doesn't touch).
    • Per-entry timestamps: survived (ExifTool doesn't touch).
    • Inner JPEG EXIF: survived (ExifTool reads through but does not re-write inner entries).
    • Inner PDF Info: survived (same).
  • Ran mat2 archive.zip (output: archive.cleaned.zip). Same battery:
    • Archive comment: dropped.
    • Per-entry comment: dropped.
    • Per-entry timestamps: normalized to 1980-01-01 00:00:00 (DOS epoch).
    • Inner JPEG EXIF: dropped (mat2 recurses to its JPEG parser).
    • Inner PDF Info: partial — Info dict cleared but mat2's PDF backend leaves the same residue that ExifTool's PDF backend does (see docs/forensic/pdf.md for the analogous gap in our PDF strategy's reference comparison).

The mat2 result is the closer reference: ExifTool's generic-ZIP write is too thin to constitute a baseline.
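Part of the recovery battery can be approximated with a raw byte scan for the sentinel strings. This is a minimal sketch under stated assumptions: survivingSentinels and indexOfBytes are illustrative helpers, not part of the POC, and a raw scan only sees bytes stored uncompressed (archive and per-entry comments, stored entries). Sentinels inside DEFLATE-compressed entries still need the extract-then-inspect step described above.

```typescript
// Return the first index of `needle` in `haystack`, or -1 if absent.
function indexOfBytes(haystack: Uint8Array, needle: Uint8Array): number {
  if (needle.length === 0) return 0;
  outer: for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    return i;
  }
  return -1;
}

// List the sentinel strings whose UTF-8 bytes still appear verbatim in the archive.
function survivingSentinels(archive: Uint8Array, sentinels: string[]): string[] {
  const encoder = new TextEncoder();
  return sentinels.filter((s) => indexOfBytes(archive, encoder.encode(s)) !== -1);
}
```

An empty result from this scan is necessary but not sufficient: it rules out leaks in uncompressed regions only.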

Per-source policy table

| Source | Surface | Current v5 | ExifTool -all= -Time:All= | mat2 | Theoretical | Ships in this PR |
| --- | --- | --- | --- | --- | --- | --- |
| Archive comment | EOCD .ZIP file comment | (no strategy) | dropped | dropped | dropped | dropped (empty string on re-emit) |
| Zip64 EOCD comment | Zip64 EOCD .ZIP file comment | (no strategy) | not touched | dropped (rewrite) | dropped | dropped (re-emit rebuilds EOCD without comment) |
| Per-entry comment | CD entry file comment | (no strategy) | not touched | dropped | dropped | dropped |
| Per-entry timestamp (LFH) | LFH last mod file time + date | (no strategy) | not touched | epoch | epoch (1980-01-01) | epoch |
| Per-entry timestamp (CD) | CD entry last mod file time + date | (no strategy) | not touched | epoch | epoch | epoch |
| Per-entry extra field: UT (0x5455, extended timestamp) | LFH/CD extra field | (no strategy) | not touched | dropped (mat2 strips all non-Zip64 extras) | dropped (UT records leak mtime/atime/ctime) | dropped |
| Per-entry extra field: UID/GID (0x7875, Info-ZIP New Unix) | LFH/CD extra field | (no strategy) | not touched | dropped | dropped (uid/gid identify the creator's user account) | dropped |
| Per-entry extra field: NTFS (0x000a, NTFS times) | LFH/CD extra field | (no strategy) | not touched | dropped | dropped (100-ns NTFS times are higher-fidelity than DOS) | dropped |
| Per-entry extra field: Zip64 (0x0001) | LFH/CD extra field | n/a | preserved | preserved (structural) | preserved (required for archives > 4 GB) | preserved (structural; required for round-trip) |
| Inner JPEG metadata | nested EXIF / XMP / IPTC / Photoshop / Comment | (no strategy) | not touched (no recursion) | dropped (mat2 recurses) | dropped | dropped (recursion via selectStrategy → JpegStrategy) |
| Inner PNG metadata | nested tEXt/zTXt/iTXt/eXIf | (no strategy) | not touched | dropped (mat2 recurses) | dropped | dropped (recursion via selectStrategy → PngStrategy) |
| Inner PDF metadata | nested Info dict + XMP | (no strategy) | not touched | partial (same gap as ours) | dropped (theoretical) | dropped to the same bar as docs/forensic/pdf.md for standalone PDFs |
| Inner Office docProps | nested docProps/core.xml etc. | (no strategy) | not touched | dropped (mat2 recurses) | dropped | dropped (recursion via selectStrategy → OfficeStrategy) |
| Inner MP4 metadata | nested moov/udta/meta | (no strategy) | not touched | partial (mat2's video coverage) | dropped | dropped (recursion via selectStrategy → VideoStrategy) |
| Inner HEIC/AVIF/WebP/GIF metadata | nested boxes/chunks | (no strategy) | not touched | depends on mat2 backend | dropped | dropped for formats with a registered strategy (currently HEIC unsupported; AVIF/WebP/GIF via ExifToolFallbackStrategy) |
| Inner nested .zip | recursive archive | (no strategy) | not touched | dropped (recursive) | dropped | dropped (recursion: selectStrategy → ZipStrategy → walks again) |
| Encrypted entry content | LFH GP-bit 0 set | n/a | n/a | refused (mat2 fails on encrypted entries) | not strippable without password | refused with invalid-file-format, directing the user to a decryption-capable tool. The original direction was "pass-through with per-file warning", but JSZip's loadAsync refuses any archive containing encrypted entries, blocking the partial-passthrough path at the library level. Implementation note: a byte-level walker bypassing JSZip would unblock passthrough; deferred to a follow-up. |
| Self-extracting EXE stub | bytes before first LFH | n/a | preserved | refused (mat2 won't process SFX) | preserved (modifying breaks SFX) | preserved (gap; documented in PRIVACY_GAPS.md) |
| Per-entry filename | CD entry file name | n/a | preserved | preserved (content, not metadata) | preserved | preserved (content, not metadata) |
| Per-entry CRC32 | LFH/CD crc-32 | n/a | preserved | preserved | preserved (structural) | preserved (structural; JSZip recomputes) |
| Per-entry compression method + level | LFH/CD compression method | n/a | preserved | normalized to DEFLATE | preserved (don't surprise users with size profile changes) | preserved (match input method per-entry) |
| Per-entry internal/external file attributes | CD entry internal/external file attributes | n/a | preserved | preserved (filesystem permissions) | preserved (Unix mode bits + DOS attributes are filesystem-level, not user identity) | preserved |
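Two of the ZIP-level rows can be made concrete at the byte level: what "epoch" means in the DOS date/time fields, and what "drop every extra field except Zip64" means for an extra-field blob. This is a sketch for illustration only. This PR reaches the same result by re-emitting entries through JSZip rather than editing bytes, and toDosDateTime and keepOnlyZip64Extras are hypothetical helper names. The record layout (2-byte little-endian header ID, 2-byte little-endian size, payload) follows the APPNOTE extra-field sections cited in the methodology.

```typescript
// Pack a calendar date/time into the two 16-bit DOS fields stored in LFH and CD records.
function toDosDateTime(
  year: number, month: number, day: number,
  hour = 0, minute = 0, second = 0,
): { date: number; time: number } {
  const date = ((year - 1980) << 9) | (month << 5) | day;
  const time = (hour << 11) | (minute << 5) | (second >> 1); // 2-second resolution
  return { date, time };
}

const ZIP64_EXTRA_ID = 0x0001;

// Walk the extra-field records and keep only the structural Zip64 record.
function keepOnlyZip64Extras(extra: Uint8Array): Uint8Array {
  const view = new DataView(extra.buffer, extra.byteOffset, extra.byteLength);
  const kept: Uint8Array[] = [];
  let offset = 0;
  while (offset + 4 <= extra.length) {
    const id = view.getUint16(offset, true);
    const size = view.getUint16(offset + 2, true);
    const end = offset + 4 + size;
    if (end > extra.length) break; // truncated record: drop the remainder
    if (id === ZIP64_EXTRA_ID) kept.push(extra.subarray(offset, end));
    offset = end;
  }
  const out = new Uint8Array(kept.reduce((n, r) => n + r.length, 0));
  let pos = 0;
  for (const record of kept) {
    out.set(record, pos);
    pos += record.length;
  }
  return out;
}
```

With these definitions, toDosDateTime(1980, 1, 1) yields date 0x0021 and time 0x0000, the bit pattern behind the "epoch" cells in the table.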

Honest gap summary

Current v5 (no strategy) vs reference (mat2): total gap. Generic ZIPs route to "unsupported" today, bypassing every privacy guarantee MetaScrub makes for the inner files. Recursive cleaning is the only architecturally coherent fix.
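The shape of that recursion can be sketched on a toy in-memory model. Everything here is hypothetical: Entry models an already-parsed archive (children stands in for a nested .zip), SelectCleaner stands in for the registry lookup that selectStrategy performs, and passing through entries without a registered strategy is an assumption, not a statement about the shipped code.

```typescript
// Hypothetical model: a leaf carries bytes, a nested archive carries child entries.
type Entry =
  | { name: string; data: Uint8Array }
  | { name: string; children: Entry[] };

type Cleaner = (data: Uint8Array) => Uint8Array;

// Hypothetical registry lookup; returns undefined when no strategy is registered.
type SelectCleaner = (name: string) => Cleaner | undefined;

function cleanEntries(entries: Entry[], select: SelectCleaner): Entry[] {
  return entries.map((entry): Entry => {
    if ("children" in entry) {
      // Nested archive: walk again, mirroring the ZIP strategy re-entering itself.
      return { name: entry.name, children: cleanEntries(entry.children, select) };
    }
    const cleaner = select(entry.name);
    // Assumption: entries with no registered strategy pass through unchanged.
    return { name: entry.name, data: cleaner ? cleaner(entry.data) : entry.data };
  });
}
```

The point of the sketch is that the ZIP layer adds no per-format logic of its own: each leaf is handed to whichever strategy the registry would have chosen for a standalone file.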

ExifTool -all= vs mat2: ExifTool is not a viable reference. It writes only the archive-level comment for generic ZIPs and refuses to recurse into entries (it special-cases Office/EPUB/APK/JAR for read-only metadata enumeration only). The ~95% of the surface that matters (inner-file metadata + per-entry timestamps) is untouched. The ExifTool comparison row exists in the forensic battery to make this visible, not as a target to match.

mat2 vs theoretical: mat2 is genuinely close to the theoretical maximum on the recursive case. Its weaknesses are inherited from per-format backends (its PDF clean is partial in the same way ExifTool's is; the same applies to MP4). On the ZIP-level work (per-entry epoch timestamps, scrubbed comments/extras), mat2 is essentially equivalent to a thorough rewrite. The shipping policy in this PR matches mat2 at the ZIP level and meets or exceeds it on inner formats where MetaScrub has dedicated hand-rolled walkers: the JPEG, PNG, Office, MP4, and PDF strategies beat ExifTool at the per-format level (see the per-format forensic docs).

This PR vs mat2: identical at the ZIP layer; identical or better at the inner-format layer (we re-use our existing strategies). Encrypted-entry handling is the one place where the shipped behavior falls short of the original plan. The maintainer explicitly chose pass-through-with-warning (see spec §3), on the rationale that refusing the whole archive is a worse user outcome when most entries are unencrypted. JSZip's loadAsync rejects any archive containing encrypted entries, so this PR, like mat2, refuses such archives and returns invalid-file-format. Partial pass-through is deferred to a follow-up that would need a byte-level walker bypassing JSZip.

Recommendation

Hand-rolled walker layered on top of JSZip:

  • JSZip is already a production dep (OfficeStrategy uses it); no new dependency.
  • The library handles the structural concerns we don't want to re-implement (DEFLATE round-trip, central-directory rebuild, Zip64 promotion when needed).
  • The metadata we care about (timestamps, comments, extra fields) is reachable via JSZip's per-entry options (date, comment) or by re-emitting the entry without the metadata payload.
  • Encrypted-entry detection: JSZip's loadAsync throws "Encrypted zip are not supported" on any archive containing encrypted entries. We catch that error and surface a structured invalid-file-format result. An earlier implementation used a hand-rolled LFH GP-flag scanner (~30 lines), but it had two blind spots — ZIP64 entries (compressed size = 0xFFFFFFFF in the LFH, real size in the Zip64 extra field) desynchronised the stride math, and data-descriptor entries (GP-flag bit 3) broke the scan on any streaming entry preceding an encrypted one. JSZip's detection runs on the same bytes it parses and has neither blind spot.
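The catch-and-classify step in the last bullet could look roughly like this. It is a sketch under assumptions: loadOrClassify and LoadResult are illustrative names, the load callback stands in for a JSZip loadAsync call, and matching on the error message text is an assumed detection mechanism based on the message quoted above. The parse-failed fallback code is borrowed from the wording used elsewhere in this document.

```typescript
type LoadResult<T> =
  | { ok: true; value: T }
  | { ok: false; code: "invalid-file-format" | "parse-failed"; message: string };

// Run a loader and map its failure modes onto structured results.
async function loadOrClassify<T>(load: () => Promise<T>): Promise<LoadResult<T>> {
  try {
    return { ok: true, value: await load() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // Assumption: the library signals encryption via the quoted message text.
    if (/encrypted zip/i.test(message)) {
      return {
        ok: false,
        code: "invalid-file-format",
        message: "Archive contains encrypted entries; decrypt it with another tool first.",
      };
    }
    return { ok: false, code: "parse-failed", message };
  }
}
```

Matching on message text is brittle across library versions, so a sentinel test with a known-encrypted fixture would be a sensible guard for this path.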

Library evaluations explicitly ruled out:

  • Rewriting ZIP from scratch. ~3000 lines of bytewise PKWARE APPNOTE compliance, including DEFLATE, Zip64 promotion thresholds, and central-directory rebuild. Not worth it when JSZip handles the structural surface for us.
  • fflate (alternative JS zip lib, ~12 KB gzip). Smaller than JSZip but doesn't expose the per-entry comment or extra-field options we need; we'd be writing the same byte-walking code we'd write anyway, just on top of a less-featured library. Adding a second zip library is also a fresh prod dep against the 4-dep ceiling.

Phase plan

This PR ships the full policy from the "Ships in this PR" column of §"Per-source policy table", plus the per-leaf diff UI tree. Deferred items:

  • Streaming MP4/Office strip for large archives — out of scope; tracked in #34 (which would also benefit standalone large-file processing).
  • Self-extracting EXE stub scrubbing — documented gap, requires distinguishing stub-PE bytes from arbitrary leading garbage. Not worth the engineering cost for the audience.
  • Decryption of encrypted entries — out of scope (no password prompts; see invariants).
  • Multi-disk / spanned archives — JSZip rejects them; surfaces as parse-failed.
  • ZIP64 archives > 4 GB — Zip64 is supported in pass-through; not sentinel-tested at that scale (would need a multi-GB fixture).