exifcleaner-web/docs/forensic/zip.md
forgejo_admin d9e763b76e
feat(zip): generic ZIP support with recursive inner-file cleaning (#184) (#188)
2026-05-22 20:32:03 +04:00


ZIP forensic recovery test

Date: 2026-05-22

Goal: Verify that metadata stripped by ZipStrategy cannot be recovered by an attacker with standard ZIP forensic tooling, across nine surfaces that span the ZIP container itself (archive comment, per-entry comment, per-entry extra field, per-entry timestamp) plus the inner files commonly carried inside archives (JPEG EXIF, PDF Info, DOCX docProps, nested ZIPs, encrypted entries). Compare against exiftool -all= -Time:All= and mat2 as reference points.

Reproducible at: tools/forensic/zip.ts. Run npx tsx tools/forensic/zip.ts from the project root.

Methodology

The runner builds two synthetic ZIP fixtures programmatically. The primary fixture exercises eight metadata surfaces; a separate encrypted-archive fixture exercises the ninth.

The byte-level ZIP builder is independent of JSZip — per .claude/rules/format-strategy-workflow.md, adversarial-independence between the fixture builder and the production strategy's library protects the test from a shared-quirk class of false negatives. Same rationale as tools/forensic/video.ts's walkAtoms (independent of parseBoxes).
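As an illustration of what a JSZip-independent builder has to emit for surface 1, the sketch below writes an End of Central Directory (EOCD) record with a trailing comment and reads the comment back. It is a minimal sketch, not the runner's actual builder: `buildEocd`, `readEocdComment`, and `asciiBytes` are hypothetical names, and the field layout follows the standard 22-byte EOCD record.

```typescript
// Hypothetical helpers; the real fixture builder in tools/forensic/zip.ts may differ.
const EOCD_SIGNATURE = 0x06054b50; // bytes "PK\x05\x06" when written little-endian

function asciiBytes(text: string): Uint8Array {
  const out = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) out[i] = text.charCodeAt(i) & 0xff;
  return out;
}

// Build a single-disk EOCD record followed by the archive comment.
function buildEocd(entryCount: number, cdSize: number, cdOffset: number, comment: string): Uint8Array {
  const commentBytes = asciiBytes(comment);
  const out = new Uint8Array(22 + commentBytes.length);
  const view = new DataView(out.buffer);
  view.setUint32(0, EOCD_SIGNATURE, true);
  view.setUint16(4, 0, true); // number of this disk
  view.setUint16(6, 0, true); // disk where the central directory starts
  view.setUint16(8, entryCount, true); // entries on this disk
  view.setUint16(10, entryCount, true); // entries in total
  view.setUint32(12, cdSize, true); // central directory size in bytes
  view.setUint32(16, cdOffset, true); // central directory offset
  view.setUint16(20, commentBytes.length, true); // .ZIP file comment length
  out.set(commentBytes, 22);
  return out;
}

// Read the comment back from a buffer that starts at the EOCD signature.
function readEocdComment(eocd: Uint8Array): string {
  const view = new DataView(eocd.buffer, eocd.byteOffset, eocd.byteLength);
  if (eocd.length < 22 || view.getUint32(0, true) !== EOCD_SIGNATURE) {
    throw new Error("not an EOCD record");
  }
  const length = view.getUint16(20, true);
  let comment = "";
  for (let i = 0; i < length; i++) comment += String.fromCharCode(eocd[22 + i]);
  return comment;
}

const eocd = buildEocd(0, 0, 0, "SENTINEL-ARCHIVE-CMNT-A1B2C3");
console.log(readEocdComment(eocd));
```

Writing the record by hand keeps the fixture's byte layout under the test's control, so a quirk in how a library serialises comments cannot hide a leak on both sides of the comparison.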

Sentinels embedded across nine surfaces:

| # | Sentinel | Surface | Where it lives in the fixture |
|---|----------|---------|-------------------------------|
| 1 | SENTINEL-ARCHIVE-CMNT-A1B2C3 | Archive comment | EOCD .ZIP file comment |
| 2 | SENTINEL-ENTRY-CMNT-D4E5F6 | Per-entry comment | Central directory entry file comment on notes.txt |
| 3a | SENTINEL-EXTRA-7G8H9I | Per-entry extra field: custom (0x7878) | Arbitrary unregistered ID; tests the "unknown record = strip anyway" path on notes.txt |
| 3b | SENTINEL-EXTRA-UT-K2L3M4 | Per-entry extra field: UT extended timestamp (0x5455) | Info-ZIP UT record (the wall-clock mtime/atime/ctime trio commonly written by Linux zip, macOS Finder, Word's "Save as zip") on notes.txt |
| 3c | SENTINEL-EXTRA-UIDGID-N5O6P7 | Per-entry extra field: UID/GID Unix v1 (0x7875) | Info-ZIP Unix v1 record (creator's uid/gid; identifies the user account that created the archive) on notes.txt |
| 3d | SENTINEL-EXTRA-NTFS-Q8R9S0 | Per-entry extra field: NTFS times (0x000a) | Windows NTFS record (100-ns mtime/atime/ctime, higher-fidelity than DOS) on notes.txt |
| 4 | 2023-04-15 14:32:11 | Per-entry timestamp | DOS-encoded last-mod date/time in LFH + CD of every entry |
| 5 | SENTINEL-JPEG-EXIF-J1K2L3 | Inner JPEG EXIF Artist | EXIF/APP1 IFD0 tag 0x013b (Artist) in photo.jpg |
| 6 | SENTINEL-PDF-INFO-M4N5O6 | Inner PDF /Author | /Info /Author in report.pdf (pdf-lib setAuthor()) |
| 7 | SENTINEL-DOCX-P7Q8R9 | Inner DOCX <dc:creator> | docProps/core.xml dc:creator + cp:lastModifiedBy in memo.docx |
| 8 | SENTINEL-NESTED-S1T2U3 | Nested-zip archive comment (recursion test) | EOCD .ZIP file comment of inner.zip (carried as an entry) |
| 9 | SENTINEL-ENCRYPTED-V4W5X6 | Encrypted entry inner content (KNOWN GAP) | Cleartext payload of secret.txt in the separate encrypted-archive fixture, with general-purpose-flag bit 0 (encrypted) set on the LFH/CD |

The primary fixture is a 5-entry ZIP: notes.txt (carrying surfaces 2-4), photo.jpg (surface 5), report.pdf (surface 6), memo.docx (surface 7), and inner.zip (surface 8 — itself carrying a nested-readme entry and the nested archive comment). Every entry's last-mod timestamp is 2023-04-15 14:32:11 (surface 4); the EOCD carries the archive comment (surface 1). The encrypted-archive fixture is a 1-entry ZIP whose secret.txt LFH has GP-bit 0 set; ZipStrategy refuses encrypted archives at the magic-byte check, so surface 9 is documented as a known gap rather than tested for byte-level stripping.
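For reference, the surface 4 timestamp is stored as two 16-bit MS-DOS fields. The sketch below shows the standard packing, using hypothetical helper names (`encodeDosDateTime`, `decodeDosDate`) that are not taken from the runner.

```typescript
// Hypothetical helpers illustrating the MS-DOS date/time packing used in LFH and CD entries.
interface DosDateTime {
  date: number; // bits 15-9: year - 1980, bits 8-5: month, bits 4-0: day
  time: number; // bits 15-11: hour, bits 10-5: minute, bits 4-0: seconds / 2
}

function encodeDosDateTime(
  year: number, month: number, day: number,
  hour: number, minute: number, second: number,
): DosDateTime {
  return {
    date: ((year - 1980) << 9) | (month << 5) | day,
    time: (hour << 11) | (minute << 5) | (second >> 1),
  };
}

function decodeDosDate(date: number): string {
  const year = 1980 + (date >> 9);
  const month = (date >> 5) & 0x0f;
  const day = date & 0x1f;
  const pad = (n: number) => String(n).padStart(2, "0");
  return `${year}-${pad(month)}-${pad(day)}`;
}

const sentinel = encodeDosDateTime(2023, 4, 15, 14, 32, 11);
const epoch = encodeDosDateTime(1980, 1, 1, 0, 0, 0);
console.log(sentinel.date.toString(16), sentinel.time.toString(16));
console.log(decodeDosDate(sentinel.date), decodeDosDate(epoch.date));
```

The sentinel date packs to 0x568f and the 1980-01-01 epoch to 0x0021. DOS time has 2-second resolution, so the odd second in 14:32:11 is stored as 5 and reads back as :10; this is one reason the recovery battery matches on the date literal rather than on the full timestamp.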

The fixtures are then stripped three ways:

  1. ZipStrategy — our JSZip-based implementation, invoked in-process. The runner inlines the production routing (OfficeStrategy → JpegStrategy → PngStrategy → PdfStrategy → ZipStrategy) and wires it into setZipStrategyRouter so inner-entry recursion goes through the same selectStrategy() path as the production renderer.
  2. exiftool -all= -Time:All= -overwrite_original — the canonical reference for image-metadata tools. ExifTool's documentation explicitly states "Writing of ZIP files is not yet supported"; the runner records this refusal as REFUSED rather than treating it as a runner failure. This is the documented finding from the gap analysis, surfaced directly in the matrix.
  3. mat2 — the FOSS reference used by Tails OS. mat2's libmat2/archive.py ZipParser recurses into archive entries, calls format-specific parsers per entry, and rewrites the archive with epoch timestamps + scrubbed comments. This is the meaningful comparison reference — ExifTool isn't a viable baseline because it doesn't write generic ZIPs at all.
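The routing order in step 1 matters because a DOCX file is itself a ZIP: a first-match router has to try the Office strategy before the generic ZIP strategy. The sketch below illustrates that ordering only. The `StrategySketch` interface, the `selectStrategySketch` function, and the extension-based Office check are assumptions made for illustration; the production selectStrategy() may detect formats differently.

```typescript
// Hypothetical first-match router; names and detection rules are illustrative only.
interface StrategySketch {
  name: string;
  matches(fileName: string, bytes: Uint8Array): boolean;
}

const hasMagic = (bytes: Uint8Array, magic: number[]): boolean =>
  magic.every((b, i) => bytes[i] === b);

const PK_LOCAL_HEADER = [0x50, 0x4b, 0x03, 0x04];

// Same order as the documented routing: Office, JPEG, PNG, PDF, then generic ZIP.
const routingOrder: StrategySketch[] = [
  {
    name: "OfficeStrategy",
    // Assumption: Office files are recognised by PK magic plus a known extension.
    matches: (n, b) => hasMagic(b, PK_LOCAL_HEADER) && /\.(docx|xlsx|pptx)$/i.test(n),
  },
  { name: "JpegStrategy", matches: (_n, b) => hasMagic(b, [0xff, 0xd8, 0xff]) },
  { name: "PngStrategy", matches: (_n, b) => hasMagic(b, [0x89, 0x50, 0x4e, 0x47]) },
  { name: "PdfStrategy", matches: (_n, b) => hasMagic(b, [0x25, 0x50, 0x44, 0x46]) },
  { name: "ZipStrategy", matches: (_n, b) => hasMagic(b, PK_LOCAL_HEADER) },
];

// Return the name of the first strategy that claims the entry, or null to pass it through.
function selectStrategySketch(fileName: string, bytes: Uint8Array): string | null {
  const hit = routingOrder.find((s) => s.matches(fileName, bytes));
  return hit ? hit.name : null;
}

const pkBytes = new Uint8Array([0x50, 0x4b, 0x03, 0x04, 0x14, 0x00]);
console.log(selectStrategySketch("memo.docx", pkBytes));
console.log(selectStrategySketch("inner.zip", pkBytes));
```

If the generic ZIP check ran first, memo.docx would be cleaned as a plain archive and its docProps metadata (surface 7) would only be handled if the ZIP path happened to recognise the inner XML.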

For each cleaned output, the recovery battery applies six techniques:

  1. unzip -z <file> — prints the archive comment. Catches surface 1.
  2. zipinfo -v <file> — verbose listing including per-entry comments and extra fields. Catches surfaces 2 + 3.
  3. unzip -l <file> — listing including per-entry timestamps. Catches surface 4 (looks for the literal 2023-04-15 or 1980-01-01).
  4. Inner-file extraction + exiftool -a -G1 -s per entry — surfaces structured metadata in extracted JPEG / PDF / DOCX. Catches surfaces 5 + 6 + 7.
  5. Inner-file extraction + strings per entry — catches any sentinel left in plain-text bytes anywhere in the extracted entry tree, including the nested-zip archive comment when the nested archive is itself extracted. Catches surface 8 (and provides a cross-check for surfaces 5-7).
  6. Raw strings over the cleaned ZIP bytes — catches any leakage of sentinels into the outer ZIP's central directory or LFH stream that wouldn't surface through the per-entry channels.
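Technique 6 amounts to a byte-level substring search. A minimal in-process equivalent, assuming ASCII sentinels and using hypothetical names (`containsAscii`, `findSurvivors`, `toBytes`), could look like this:

```typescript
// Hypothetical in-process equivalent of `strings <file> | grep SENTINEL`.
function toBytes(text: string): Uint8Array {
  const out = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) out[i] = text.charCodeAt(i) & 0xff;
  return out;
}

// Naive scan: true if the ASCII needle occurs anywhere in the byte buffer.
function containsAscii(haystack: Uint8Array, needle: string): boolean {
  if (needle.length === 0) return true;
  outer: for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle.charCodeAt(j)) continue outer;
    }
    return true;
  }
  return false;
}

// Return every sentinel that is still present in the buffer.
function findSurvivors(bytes: Uint8Array, sentinels: string[]): string[] {
  return sentinels.filter((s) => containsAscii(bytes, s));
}

const cleaned = toBytes("PK\x05\x06 no comment here");
const dirty = toBytes("PK\x05\x06 SENTINEL-ARCHIVE-CMNT-A1B2C3");
const watchlist = ["SENTINEL-ARCHIVE-CMNT-A1B2C3", "SENTINEL-ENTRY-CMNT-D4E5F6"];
console.log(findSurvivors(cleaned, watchlist), findSurvivors(dirty, watchlist));
```

A raw scan only sees bytes that are stored uncompressed in the container (comments, extra fields, stored entries). Sentinels inside deflated entries are invisible to it, which is why techniques 4 and 5 extract the inner files first.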

Verdict per surface per strip path: DROPPED (sentinel absent), NORMALIZED (timestamp is 1980-01-01 instead of the input's 2023-04-15), SURVIVED (sentinel found anywhere), REFUSED (the tool declines to process ZIP), SKIP (channel not collected), KNOWN_GAP (documented gap, not tested against this output).

Bar: zero sentinel survivors across every recovery technique for ZipStrategy on surfaces 1-8 (counting 3a-3d as one). Surface 4 is NORMALIZED rather than DROPPED: the format requires some timestamp, and ZIP's 1980-01-01 epoch is the minimum DOS-time per .claude/rules/privacy-invariants.md §6. Surface 9 is a documented KNOWN_GAP: ZipStrategy refuses encrypted archives outright in v1, so the encrypted-inner sentinel is unaddressable through normal flow. The runner exits non-zero on UNEXPECTED survivors for surfaces 1-8 (counting 3a-3d as one).
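The pass/fail rule can be sketched as a small check over per-surface verdicts. This is an assumption about how such a gate might be written, not the runner's actual code: the `SurfaceResult` shape and the `unexpectedSurvivors` and `exitCodeFor` names are hypothetical.

```typescript
// Hypothetical verdict model; the real report.json schema is not shown in this document.
type Verdict = "DROPPED" | "NORMALIZED" | "SURVIVED" | "REFUSED" | "SKIP" | "KNOWN_GAP";

interface SurfaceResult {
  surface: string;
  expected: Verdict;
  actual: Verdict;
}

// A survivor is unexpected when the surface was supposed to be DROPPED or NORMALIZED.
function unexpectedSurvivors(results: SurfaceResult[]): SurfaceResult[] {
  return results.filter(
    (r) => (r.expected === "DROPPED" || r.expected === "NORMALIZED") && r.actual === "SURVIVED",
  );
}

// Non-zero exit code if any strict surface still leaks its sentinel.
function exitCodeFor(results: SurfaceResult[]): number {
  return unexpectedSurvivors(results).length === 0 ? 0 : 1;
}

const sample: SurfaceResult[] = [
  { surface: "1. Archive comment", expected: "DROPPED", actual: "DROPPED" },
  { surface: "4. Per-entry timestamp", expected: "NORMALIZED", actual: "NORMALIZED" },
  { surface: "9. Encrypted entry inner content", expected: "KNOWN_GAP", actual: "KNOWN_GAP" },
];
console.log(exitCodeFor(sample));
```

Under this rule a KNOWN_GAP row never fails the run, which matches the treatment of surface 9 as documented rather than tested.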

Results

Captured 2026-05-22 from npx tsx tools/forensic/zip.ts. Tools: exiftool 13.30, mat2 0.13.4, unzip 6.0, zipinfo 6.0.

| # | Surface | Expected | Input (sanity) | ZipStrategy | exiftool -all= -Time:All= | mat2 |
|---|---------|----------|----------------|-------------|---------------------------|------|
| 1 | Archive comment | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 2 | Per-entry comment | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 3a | Extra field: custom (0x7878) | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 3b | Extra field: UT timestamp (0x5455) | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 3c | Extra field: UID/GID (0x7875) | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 3d | Extra field: NTFS times (0x000a) | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 4 | Per-entry timestamp | NORMALIZED → 1980-01-01 | 2023-04-15 | NORMALIZED | REFUSED¹ | NORMALIZED |
| 5 | Inner JPEG EXIF Artist | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 6 | Inner PDF /Author | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 7 | Inner DOCX <dc:creator> | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 8 | Nested-zip archive comment | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 9 | Encrypted entry inner content | KNOWN_GAP | n/a² | KNOWN_GAP³ | REFUSED¹ | KNOWN_GAP⁴ |

¹ ExifTool exits with Error: Writing of ZIP files is not yet supported. Documented limitation per exiftool.org/#limitations — ExifTool reads metadata from ZIPs (and recognises Office/EPUB/APK as special cases for read-only enumeration) but does not write back. The matrix records REFUSED rather than SKIP because this is the central finding from the gap analysis: ExifTool is not a meaningful reference for stripping generic ZIPs.

² Surface 9 lives in a separate encrypted-archive fixture, not the primary fixture, so the "input sanity" column does not apply.

³ ZipStrategy returns { ok: false, error: { code: "invalid-file-format", detail: "Encrypted ZIP archives aren't supported — use a dedicated tool (7-Zip, ExifTool standalone) that can decrypt to clean inner content." } } when given the encrypted-archive fixture. Inner content not addressable through this code path; documented in spec §3 and docs/PRIVACY_GAPS.md.

⁴ mat2 refuses the encrypted-archive fixture (its archive parser errors on encrypted entries). Parity with ZipStrategy. Result class is "refused with clear error" for both tools; neither leaks the inner sentinel because neither produces a cleaned output.

Aggregate verdicts:

  • ZipStrategy: 11/11 strict surfaces (DROPPED or NORMALIZED — counting 3a/3b/3c/3d separately). 1 documented gap (encrypted entries).
  • exiftool -all= -Time:All=: 0/11. Tool refuses to write ZIPs entirely.
  • mat2: 11/11 strict surfaces. 1 gap (encrypted entries — parity).

Runner exit code: 0 (PASS — no UNEXPECTED survivors).

Interpretation

ZipStrategy and mat2 are equivalent on this fixture; ExifTool is not a viable reference at all.

  • ZipStrategy scrubs the four ZIP-level surfaces (archive comment, per-entry comments, per-entry extra fields, per-entry timestamps → epoch) by re-emitting via JSZip.generateAsync({ comment: "" }) with every entry passed date: new Date(Date.UTC(1980, 0, 1, 12, 0, 0)) and comment: "". The inner-file surfaces are scrubbed via recursion through selectStrategy(): each decompressed entry's bytes are routed back through the strategy registry. A photo.jpg entry hits JpegStrategy; a report.pdf hits PdfStrategy; a memo.docx hits OfficeStrategy; an inner.zip hits ZipStrategy recursively. The output ZIP carries the cleaned-leaf bytes under the original entry names. Surface 8 (nested archive) confirms recursion works structurally: the inner archive's EOCD comment is dropped just like the outer's.

  • ExifTool is not a meaningful reference for generic ZIPs. Its documentation says so: "Writing of ZIP files is not yet supported." The matrix records REFUSED across every surface so the comparison row is honest. The gap analysis (docs/gap-analysis/zip.md) makes the same finding from the read-side — ExifTool special-cases Office/EPUB/APK/JAR for read-only metadata enumeration only, never re-writes the archive. The ~95% of the surface that matters (inner-file metadata + per-entry timestamps + per-entry comments) is untouched.

  • mat2 is the meaningful reference. Its libmat2/archive.py ZipParser recurses into each archive entry, dispatches to a format-specific backend, and rewrites the archive with epoch DOS timestamps + scrubbed archive comment. On this fixture mat2 achieves the same surface-by-surface result as ZipStrategy: 8/8 DROPPED-or-NORMALIZED, refuses the encrypted fixture. The outputs differ in size (ZipStrategy 2494 bytes vs mat2 4106 bytes — mat2 re-encodes the inner PDF via Cairo, producing a larger rasterised PDF; we keep the original PDF structure via pdf-lib's targeted scrub). For users who care about preserving inner-document fidelity (text remains text, not bitmap), ZipStrategy's approach is strictly preferable. For users who care about maximum sentinel destruction at the cost of inner-file fidelity, the two tools are equivalent on this fixture.

Where ZipStrategy beats ExifTool outright: every surface. ExifTool cannot strip generic ZIPs at all; it refuses the input. Even the surfaces ExifTool can read (archive comment, inner-file structured metadata) are surfaced read-only, not stripped.

Where ZipStrategy matches mat2: all 8 strict surfaces. Per-entry epoch + comment scrub + inner-file recursion via the strategy registry produces the same result as mat2's per-backend recursion model.

Where mat2 nominally beats ZipStrategy: none on this fixture. The previous direction note in the gap analysis described the design as "encrypted entries pass through with warning, mat2 refuses outright" — but the shipping policy was changed to "refuse encrypted archives" (spec §3) because JSZip's loadAsync refuses encrypted entries at the library level. As a result, ZipStrategy and mat2 are now at parity on encrypted-archive handling: both refuse cleanly.
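For context on what "encrypted" means at the byte level, the flag both tools react to is bit 0 of the general-purpose flag field in each local file header. The sketch below checks that bit directly. It is an illustration under the standard LFH layout, not how ZipStrategy detects encryption (per the text above, the refusal comes from JSZip's loadAsync); `isLocalHeaderEncrypted` and `makeHeader` are hypothetical names.

```typescript
// Hypothetical byte-level check; ZipStrategy itself relies on JSZip's refusal instead.
const LFH_SIGNATURE = 0x04034b50; // bytes "PK\x03\x04" when written little-endian

// True if the local file header at `offset` has general-purpose flag bit 0 (encrypted) set.
function isLocalHeaderEncrypted(bytes: Uint8Array, offset = 0): boolean {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  if (offset + 30 > bytes.length || view.getUint32(offset, true) !== LFH_SIGNATURE) {
    throw new Error("no local file header at offset");
  }
  const flags = view.getUint16(offset + 6, true); // general-purpose bit flag
  return (flags & 0x0001) !== 0;
}

// Build a bare 30-byte local file header with the given flags, for demonstration only.
function makeHeader(flags: number): Uint8Array {
  const header = new Uint8Array(30);
  const view = new DataView(header.buffer);
  view.setUint32(0, LFH_SIGNATURE, true);
  view.setUint16(4, 20, true); // version needed to extract
  view.setUint16(6, flags, true);
  return header;
}

console.log(isLocalHeaderEncrypted(makeHeader(0x0001)), isLocalHeaderEncrypted(makeHeader(0x0000)));
```

A check like this is the kind of building block a byte-level walker would need in order to clean ZIP-level metadata around encrypted content instead of refusing the whole archive.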

Caveats and limits of this test

  • Encrypted archives are refused, not stripped. JSZip's loadAsync won't load an archive with any encrypted entry ("Encrypted zip are not supported"), so v1 surfaces invalid-file-format and directs the user to a decryption-capable tool. A byte-level walker bypassing JSZip would unblock partial-passthrough cleaning of zip-level metadata around encrypted content; deferred. Surface 9 is documented as a known gap in docs/PRIVACY_GAPS.md.
  • Self-extracting EXE stubs are preserved. Bytes before the first local file header (the PE stub) are not touched — modifying them breaks the SFX behavior. Documented gap. Not tested here.
  • Multi-disk / spanned archives — JSZip rejects them; surface as parse-failed. Not tested.
  • ZIP64 (0x0001) extra-field records are the structural exception we preserve, but not sentinel-tested here. Zip64 only triggers on archives or entries exceeding the 32-bit size fields (~4 GB), which a synthetic fixture can't reach cheaply. The gap-analysis policy table is explicit: "preserved (structural; required for archives > 4 GB)." Multi-GB Zip64 verification is deferred.
  • The fixture is synthetic but exercises a realistic extra-field profile. Surfaces 3a-3d embed the four record IDs common to Word's "Save as zip", macOS Finder, 7-Zip, and Info-ZIP: custom 0x7878, UT extended timestamp (0x5455), UID/GID Unix v1 (0x7875), and NTFS times (0x000a). Each record carries its own ASCII sentinel inside its data payload, so the recovery battery verifies per-record-id stripping. ZipStrategy's policy ("strip every extra-field record except 0x0001") is confirmed against each ID independently, matching mat2's behavior.
  • The PDF inner sentinel is /Info /Author only. The PDF strategy's full sentinel battery (Title / Author / Subject / Producer / Creator / XMP / Annotations / Lang — 10 sentinels) is exercised in docs/forensic/pdf.md. Here we use one sentinel because the point is to verify the recursion path through ZipStrategy → PdfStrategy, not to re-test the PDF strategy's depth (which is already covered).
  • The DOCX inner sentinel covers <dc:creator> + <cp:lastModifiedBy>. The Office strategy's full gap battery is exercised in docs/forensic/office.md. Same rationale as the PDF case.
  • The mat2 encrypted-archive refusal is detected via mat2 exit status, not by reading mat2's stderr. Surfacing the structured refusal reason from mat2 would require parsing its Python traceback; that's a fragility we deliberately avoid.
  • No unzip -P or AES-decryption attempt. The encrypted fixture uses a bogus ZipCrypto payload (12-byte header + 1 byte of "ciphertext"); the password is unknown and not material to the test. Surface 9 is about refusal behavior, not about whether the encryption is "real."
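The extra-field policy quoted in the caveats ("strip every extra-field record except 0x0001") can be expressed as a filter over the record stream, where each record is a 2-byte ID, a 2-byte length, and a payload. The sketch below illustrates the policy only. Per the Interpretation section, ZipStrategy gets this result by regenerating the archive through JSZip rather than by filtering bytes, and every name here (`parseExtraField`, `serializeExtraField`, `stripExtraField`, `ascii`) is hypothetical.

```typescript
// Hypothetical illustration of the "keep only ZIP64 (0x0001)" extra-field policy.
interface ExtraRecord {
  id: number;
  data: Uint8Array;
}

const ZIP64_ID = 0x0001;

function ascii(text: string): Uint8Array {
  const out = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) out[i] = text.charCodeAt(i) & 0xff;
  return out;
}

// Split an extra-field blob into (id, data) records.
function parseExtraField(extra: Uint8Array): ExtraRecord[] {
  const view = new DataView(extra.buffer, extra.byteOffset, extra.byteLength);
  const records: ExtraRecord[] = [];
  let pos = 0;
  while (pos + 4 <= extra.length) {
    const id = view.getUint16(pos, true);
    const size = view.getUint16(pos + 2, true);
    if (pos + 4 + size > extra.length) throw new Error("truncated extra-field record");
    records.push({ id, data: extra.subarray(pos + 4, pos + 4 + size) });
    pos += 4 + size;
  }
  return records;
}

// Re-encode records into a contiguous extra-field blob.
function serializeExtraField(records: ExtraRecord[]): Uint8Array {
  const total = records.reduce((sum, r) => sum + 4 + r.data.length, 0);
  const out = new Uint8Array(total);
  const view = new DataView(out.buffer);
  let pos = 0;
  for (const r of records) {
    view.setUint16(pos, r.id, true);
    view.setUint16(pos + 2, r.data.length, true);
    out.set(r.data, pos + 4);
    pos += 4 + r.data.length;
  }
  return out;
}

// Drop every record except the structural ZIP64 one.
function stripExtraField(extra: Uint8Array): Uint8Array {
  return serializeExtraField(parseExtraField(extra).filter((r) => r.id === ZIP64_ID));
}

const original = serializeExtraField([
  { id: 0x7878, data: ascii("SENTINEL-EXTRA-7G8H9I") },
  { id: ZIP64_ID, data: new Uint8Array(8) },
  { id: 0x5455, data: ascii("SENTINEL-EXTRA-UT-K2L3M4") },
]);
console.log(parseExtraField(stripExtraField(original)).map((r) => r.id.toString(16)));
```

An allow-list of one ID is easier to reason about than a deny-list: an unregistered record such as 0x7878 is removed without the code needing to know what it is.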

Reproducing

# From the project root
npx tsx tools/forensic/zip.ts

Outputs go to /tmp/zip-forensic/:

  • input-primary.zip: the 5-entry fixture (surfaces 1-8, counting 3a-3d as one)
  • input-encrypted.zip: the 1-entry encrypted fixture (surface 9)
  • output-ours.zip: ZipStrategy output
  • output-exiftool.zip: exiftool-cleaned copy (empty on refusal, since exiftool didn't write anything new)
  • output-mat2.zip: mat2-cleaned copy
  • output-*-encrypted.zip: encrypted-fixture outputs (mostly refusal copies)
  • output-ours.zip.extracted/, etc.: per-output extraction tree used by the inner-exiftool + inner-strings channels
  • report.json: structured per-surface verdict per strip path

Required tools: exiftool (libimage-exiftool-perl), mat2, unzip, zipinfo, strings (binutils). All available on Debian/Ubuntu via apt.

Debian/Ubuntu one-liner: sudo apt install libimage-exiftool-perl mat2 unzip binutils.

What this directory is for

docs/forensic/ documents adversarial recovery tests run after implementation lands, complementing docs/gap-analysis/ (which runs before implementation to scope what should be removed). The pattern: implement → unit-test correctness → forensic-test unrecoverability → document the result.

Each format gets its own writeup as we go: zip.md here, pdf.md / jpeg.md / office.md / png.md / video.md for the formats shipped earlier. The runner scripts at tools/forensic/<format>.ts stay in the repo so the tests can be re-run any time the strategy changes.

Captured runner output (2026-05-22)

Sentinels embedded in fixture:
  ARCHIVE_CMNT       SENTINEL-ARCHIVE-CMNT-A1B2C3
  ENTRY_CMNT         SENTINEL-ENTRY-CMNT-D4E5F6
  EXTRA_FIELD        SENTINEL-EXTRA-7G8H9I
  JPEG_EXIF          SENTINEL-JPEG-EXIF-J1K2L3
  PDF_INFO           SENTINEL-PDF-INFO-M4N5O6
  DOCX_CREATOR       SENTINEL-DOCX-P7Q8R9
  NESTED_ARCHIVE     SENTINEL-NESTED-S1T2U3
  ENCRYPTED_INNER    SENTINEL-ENCRYPTED-V4W5X6
  TIMESTAMP_LITERAL  2023-04-15 (non-epoch)

Primary fixture:   /tmp/zip-forensic/input-primary.zip (3793 bytes)
Encrypted fixture: /tmp/zip-forensic/input-encrypted.zip (144 bytes)

=== Stripping primary fixture ===
  ZipStrategy:  ok (2494 bytes)
  exiftool:     refused-by-design — ExifTool: 'Writing of ZIP files is not yet supported' — documented limitation per https://exiftool.org/#limitations
  mat2:         ok (4106 bytes)

=== Results matrix (9 surfaces × 3 strip paths) ===
| Surface                                          | Expected                | input    | ZipStrategy  | exiftool | mat2       |
|--------------------------------------------------|-------------------------|----------|--------------|----------|------------|
| 1. Archive comment                               | DROPPED                 | SURVIVED | DROPPED      | REFUSED  | DROPPED    |
| 2. Per-entry comment                             | DROPPED                 | SURVIVED | DROPPED      | REFUSED  | DROPPED    |
| 3a. Extra field — custom (0x7878)                | DROPPED                 | SURVIVED | DROPPED      | REFUSED  | DROPPED    |
| 3b. Extra field — UT extended timestamp (0x5455) | DROPPED                 | SURVIVED | DROPPED      | REFUSED  | DROPPED    |
| 3c. Extra field — UID/GID (0x7875)               | DROPPED                 | SURVIVED | DROPPED      | REFUSED  | DROPPED    |
| 3d. Extra field — NTFS times (0x000a)            | DROPPED                 | SURVIVED | DROPPED      | REFUSED  | DROPPED    |
| 4. Per-entry timestamp                           | NORMALIZED → 1980-01-01 | SURVIVED | NORMALIZED   | REFUSED  | NORMALIZED |
| 5. Inner JPEG EXIF Artist                        | DROPPED                 | SURVIVED | DROPPED      | REFUSED  | DROPPED    |
| 6. Inner PDF /Author                             | DROPPED                 | SURVIVED | DROPPED      | REFUSED  | DROPPED    |
| 7. Inner DOCX <dc:creator>                       | DROPPED                 | SURVIVED | DROPPED      | REFUSED  | DROPPED    |
| 8. Nested-zip archive comment                    | DROPPED                 | SURVIVED | DROPPED      | REFUSED  | DROPPED    |
| 9. Encrypted entry inner content                 | KNOWN_GAP               | DROPPED  | KNOWN_GAP    | REFUSED  | KNOWN_GAP  |

PASS — all 11 (DROPPED/NORMALIZED) strict surfaces verified for ZipStrategy.