ZIP forensic recovery test
Date: 2026-05-22
Goal: Verify that metadata stripped by ZipStrategy cannot be recovered by an attacker with standard ZIP forensic tooling, across nine surfaces that span the ZIP container itself (archive comment, per-entry comment, per-entry extra field, per-entry timestamp) plus the inner files commonly carried inside archives (JPEG EXIF, PDF Info, DOCX docProps, nested ZIPs, encrypted entries). Compare against exiftool -all= -Time:All= and mat2 as reference points.
Reproducible at: tools/forensic/zip.ts — npx tsx tools/forensic/zip.ts from the project root.
Methodology
The runner builds two synthetic ZIP fixtures programmatically. The primary fixture exercises eight metadata surfaces; a separate encrypted-archive fixture exercises the ninth.
The byte-level ZIP builder is independent of JSZip — per .claude/rules/format-strategy-workflow.md, adversarial-independence between the fixture builder and the production strategy's library protects the test from a shared-quirk class of false negatives. Same rationale as tools/forensic/video.ts's walkAtoms (independent of parseBoxes).
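To make "byte-level" concrete, here is a minimal sketch of writing and reading the archive comment (surface 1) without any ZIP library. It is not the runner's builder: the function names `buildEmptyZipWithComment` and `readArchiveComment` are hypothetical, and the archive it emits is the smallest legal one (an End of Central Directory record with zero entries).

```typescript
// Hypothetical helpers, not taken from tools/forensic/zip.ts.
const EOCD_SIGNATURE = 0x06054b50; // "PK\x05\x06" when written little-endian
const EOCD_FIXED_SIZE = 22; // bytes before the variable-length comment

// Emit an empty ZIP (EOCD only) whose archive comment carries `comment`.
function buildEmptyZipWithComment(comment: string): Uint8Array {
  const commentBytes = new TextEncoder().encode(comment);
  if (commentBytes.length > 0xffff) {
    throw new RangeError("archive comment length must fit in 16 bits");
  }
  const out = new Uint8Array(EOCD_FIXED_SIZE + commentBytes.length);
  const view = new DataView(out.buffer);
  view.setUint32(0, EOCD_SIGNATURE, true);
  // Bytes 4..19 (disk numbers, entry counts, central directory size and
  // offset) stay zero because the archive has no entries.
  view.setUint16(20, commentBytes.length, true);
  out.set(commentBytes, EOCD_FIXED_SIZE);
  return out;
}

// Scan backwards for the EOCD signature and decode the comment after it.
function readArchiveComment(zip: Uint8Array): string {
  const view = new DataView(zip.buffer, zip.byteOffset, zip.byteLength);
  for (let i = zip.length - EOCD_FIXED_SIZE; i >= 0; i--) {
    if (view.getUint32(i, true) === EOCD_SIGNATURE) {
      const len = view.getUint16(i + 20, true);
      return new TextDecoder().decode(
        zip.subarray(i + EOCD_FIXED_SIZE, i + EOCD_FIXED_SIZE + len),
      );
    }
  }
  throw new Error("EOCD record not found");
}
```

Because neither function touches JSZip, a sentinel planted this way cannot be hidden by a quirk that the fixture builder and the strategy under test happen to share.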
Sentinels embedded across nine surfaces:
| # | Sentinel | Surface | Where it lives in the fixture |
|---|---|---|---|
| 1 | SENTINEL-ARCHIVE-CMNT-A1B2C3 | Archive comment | EOCD .ZIP file comment |
| 2 | SENTINEL-ENTRY-CMNT-D4E5F6 | Per-entry comment | Central directory entry file comment on notes.txt |
| 3a | SENTINEL-EXTRA-7G8H9I | Per-entry extra field, custom (0x7878) | Arbitrary unregistered ID; tests the "unknown record = strip anyway" path on notes.txt |
| 3b | SENTINEL-EXTRA-UT-K2L3M4 | Per-entry extra field, UT extended timestamp (0x5455) | Info-ZIP UT record (the wall-clock mtime/atime/ctime trio commonly written by Linux zip, macOS Finder, Word's "Save as zip") on notes.txt |
| 3c | SENTINEL-EXTRA-UIDGID-N5O6P7 | Per-entry extra field, UID/GID Unix v1 (0x7875) | Info-ZIP Unix v1 record (creator's uid/gid; identifies the user account that created the archive) on notes.txt |
| 3d | SENTINEL-EXTRA-NTFS-Q8R9S0 | Per-entry extra field, NTFS times (0x000a) | Windows NTFS record (100-ns mtime/atime/ctime, higher-fidelity than DOS) on notes.txt |
| 4 | 2023-04-15 14:32:11 | Per-entry timestamp | DOS-encoded last-mod date/time in LFH + CD of every entry |
| 5 | SENTINEL-JPEG-EXIF-J1K2L3 | Inner JPEG EXIF Artist | EXIF/APP1 IFD0 tag 0x013b (Artist) in photo.jpg |
| 6 | SENTINEL-PDF-INFO-M4N5O6 | Inner PDF /Author | /Info /Author in report.pdf (pdf-lib setAuthor()) |
| 7 | SENTINEL-DOCX-P7Q8R9 | Inner DOCX <dc:creator> | docProps/core.xml dc:creator + cp:lastModifiedBy in memo.docx |
| 8 | SENTINEL-NESTED-S1T2U3 | Nested-zip archive comment (recursion test) | EOCD .ZIP file comment of inner.zip (carried as an entry) |
| 9 | SENTINEL-ENCRYPTED-V4W5X6 | Encrypted entry inner content (KNOWN GAP) | Cleartext payload of secret.txt in the separate encrypted-archive fixture, with general-purpose-flag bit 0 (encrypted) set on the LFH/CD |
The primary fixture is a 5-entry ZIP: notes.txt (carrying surfaces 2-4), photo.jpg (surface 5), report.pdf (surface 6), memo.docx (surface 7), and inner.zip (surface 8 — itself carrying a nested-readme entry and the nested archive comment). Every entry's last-mod timestamp is 2023-04-15 14:32:11 (surface 4); the EOCD carries the archive comment (surface 1). The encrypted-archive fixture is a 1-entry ZIP whose secret.txt LFH has GP-bit 0 set; ZipStrategy refuses encrypted archives at the magic-byte check, so surface 9 is documented as a known gap rather than tested for byte-level stripping.
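For surface 4, it helps to see how a wall-clock time becomes the two 16-bit DOS fields stored in each LFH and CD entry. The sketch below follows the standard MS-DOS packing (7-bit year offset from 1980, 4-bit month, 5-bit day; 5-bit hour, 6-bit minute, 5-bit seconds/2). The function names are hypothetical and are not taken from the runner.

```typescript
interface DosDateTime {
  date: number; // 16-bit packed date
  time: number; // 16-bit packed time
}

// Pack calendar fields into DOS date/time words.
function encodeDosDateTime(
  year: number, month: number, day: number,
  hour: number, minute: number, second: number,
): DosDateTime {
  if (year < 1980 || year > 2107) {
    throw new RangeError("DOS dates only cover 1980 through 2107");
  }
  return {
    date: ((year - 1980) << 9) | (month << 5) | day,
    // Seconds are stored halved, so odd seconds lose their low bit.
    time: (hour << 11) | (minute << 5) | (second >> 1),
  };
}

// Unpack DOS date/time words into "YYYY-MM-DD hh:mm:ss".
function decodeDosDateTime({ date, time }: DosDateTime): string {
  const pad = (n: number, width = 2) => String(n).padStart(width, "0");
  const year = 1980 + (date >> 9);
  const month = (date >> 5) & 0x0f;
  const day = date & 0x1f;
  const hour = time >> 11;
  const minute = (time >> 5) & 0x3f;
  const second = (time & 0x1f) * 2;
  return `${pad(year, 4)}-${pad(month)}-${pad(day)} ` +
    `${pad(hour)}:${pad(minute)}:${pad(second)}`;
}
```

Encoding 2023-04-15 14:32:11 yields date 0x568F and time 0x7405. Decoding gives 14:32:10, because DOS time has 2-second resolution; this is one reason the recovery battery matches on the date literal rather than the full timestamp. The minimum value, 1980-01-01 00:00:00, packs to date 0x0021 and time 0x0000.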
The fixtures are then stripped three ways:
- ZipStrategy: our JSZip-based implementation, invoked in-process. The runner inlines the production routing (OfficeStrategy → JpegStrategy → PngStrategy → PdfStrategy → ZipStrategy) and wires it into setZipStrategyRouter so inner-entry recursion goes through the same selectStrategy() path as the production renderer.
- exiftool -all= -Time:All= -overwrite_original: the canonical reference for image-metadata tools. ExifTool's documentation explicitly states "Writing of ZIP files is not yet supported"; the runner records this refusal as REFUSED rather than treating it as a runner failure. This is the documented finding from the gap analysis, surfaced directly in the matrix.
- mat2: the FOSS reference used by Tails OS. mat2's libmat2/archive.py ZipParser recurses into archive entries, calls format-specific parsers per entry, and rewrites the archive with epoch timestamps + scrubbed comments. This is the meaningful comparison reference; ExifTool isn't a viable baseline because it doesn't write generic ZIPs at all.
For each cleaned output, the recovery battery applies six techniques:
- unzip -z <file>: prints the archive comment. Catches surface 1.
- zipinfo -v <file>: verbose listing including per-entry comments and extra fields. Catches surfaces 2 + 3.
- unzip -l <file>: listing including per-entry timestamps. Catches surface 4 (looks for the literal 2023-04-15 or 1980-01-01).
- Inner-file extraction + exiftool -a -G1 -s per entry: surfaces structured metadata in extracted JPEG / PDF / DOCX. Catches surfaces 5 + 6 + 7.
- Inner-file extraction + strings per entry: catches any sentinel left in plain-text bytes anywhere in the extracted entry tree, including the nested-zip archive comment when the nested archive is itself extracted. Catches surface 8 (and provides a cross-check for surfaces 5-7).
- Raw strings over the cleaned ZIP bytes: catches any leakage of sentinels into the outer ZIP's central directory or LFH stream that wouldn't surface through the per-entry channels.
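The last technique in the battery (a raw scan over the cleaned bytes) can be approximated in-process. The sketch below is an assumption about how such a check could look, not the runner's code: `findSentinels` is a hypothetical name, and it does a plain byte search for each ASCII sentinel instead of shelling out to strings.

```typescript
// Return the index of `needle` inside `haystack`, or -1 if absent.
function indexOfBytes(haystack: Uint8Array, needle: Uint8Array): number {
  if (needle.length === 0) return 0;
  outer: for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    return i;
  }
  return -1;
}

// Report which sentinels still appear as literal bytes in `bytes`.
// Like strings(1), this only sees uncompressed text; a sentinel inside a
// deflated entry is invisible here and needs the extraction channels.
function findSentinels(bytes: Uint8Array, sentinels: string[]): string[] {
  const encoder = new TextEncoder();
  return sentinels.filter(
    (s) => indexOfBytes(bytes, encoder.encode(s)) !== -1,
  );
}
```

An empty result from this channel alone does not prove a sentinel is gone, which is why the battery pairs it with per-entry extraction.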
Verdict per surface per strip path: DROPPED (sentinel absent), NORMALIZED (timestamp is 1980-01-01 instead of the input's 2023-04-15), SURVIVED (sentinel found anywhere), REFUSED (the tool declines to process ZIP), SKIP (channel not collected), KNOWN_GAP (documented gap, not tested against this output).
Bar: zero sentinel survivors across every recovery technique for ZipStrategy on surfaces 1-8 (counting 3a–3d as one). Surface 4 is NORMALIZED (not DROPPED — the format requires some timestamp, and ZIP's 1980-01-01 epoch is the minimum DOS-time per .claude/rules/privacy-invariants.md §6). Surface 9 is a documented KNOWN_GAP — ZipStrategy refuses encrypted archives outright in v1, so the encrypted-inner sentinel is unaddressable through normal flow. The runner exits non-zero on UNEXPECTED survivors for surfaces 1-8 (counting 3a–3d as one).
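The verdict rules and the exit-code bar can be sketched as a few small functions. This is an illustration under stated assumptions, not the runner's implementation: the names are hypothetical, and returning SKIP when neither date appears in a listing is our assumption, since the text does not define that case.

```typescript
type Verdict =
  | "DROPPED" | "NORMALIZED" | "SURVIVED"
  | "REFUSED" | "SKIP" | "KNOWN_GAP";

// A sentinel survives if any collected recovery channel still contains it.
function classifySentinel(channels: string[], sentinel: string): Verdict {
  return channels.some((text) => text.includes(sentinel))
    ? "SURVIVED"
    : "DROPPED";
}

// Timestamps cannot be dropped, only normalized to the DOS minimum date.
function classifyTimestamp(
  listing: string, original: string, epoch: string,
): Verdict {
  if (listing.includes(original)) return "SURVIVED";
  if (listing.includes(epoch)) return "NORMALIZED";
  return "SKIP"; // assumption: no recognizable date means nothing was collected
}

// Non-zero exit only for survivors that were not expected to survive.
function exitCode(rows: { expected: Verdict; actual: Verdict }[]): number {
  const unexpected = rows.some(
    (row) => row.actual === "SURVIVED" && row.expected !== "SURVIVED",
  );
  return unexpected ? 1 : 0;
}
```

KNOWN_GAP and REFUSED rows never trip the exit code in this sketch, which matches the bar above: only an unexpected survivor on surfaces 1-8 fails the run.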
Results
Captured 2026-05-22 from npx tsx tools/forensic/zip.ts. Tools: exiftool 13.30, mat2 0.13.4, unzip 6.0, zipinfo 6.0.
| # | Surface | Expected | Input (sanity) | ZipStrategy | exiftool -all= -Time:All= | mat2 |
|---|---|---|---|---|---|---|
| 1 | Archive comment | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 2 | Per-entry comment | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 3a | Extra field — custom (0x7878) | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 3b | Extra field — UT timestamp (0x5455) | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 3c | Extra field — UID/GID (0x7875) | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 3d | Extra field — NTFS times (0x000a) | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 4 | Per-entry timestamp | NORMALIZED → 1980-01-01 | 2023-04-15 | NORMALIZED | REFUSED¹ | NORMALIZED |
| 5 | Inner JPEG EXIF Artist | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 6 | Inner PDF /Author | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 7 | Inner DOCX <dc:creator> | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 8 | Nested-zip archive comment | DROPPED | present | DROPPED | REFUSED¹ | DROPPED |
| 9 | Encrypted entry inner content | KNOWN_GAP | n/a² | KNOWN_GAP³ | REFUSED¹ | KNOWN_GAP⁴ |
¹ ExifTool exits with Error: Writing of ZIP files is not yet supported. Documented limitation per exiftool.org/#limitations — ExifTool reads metadata from ZIPs (and recognises Office/EPUB/APK as special cases for read-only enumeration) but does not write back. The matrix records REFUSED rather than SKIP because this is the central finding from the gap analysis: ExifTool is not a meaningful reference for stripping generic ZIPs.
² Surface 9 lives in a separate encrypted-archive fixture, not the primary fixture, so the "input sanity" column does not apply.
³ ZipStrategy returns { ok: false, error: { code: "invalid-file-format", detail: "Encrypted ZIP archives aren't supported — use a dedicated tool (7-Zip, ExifTool standalone) that can decrypt to clean inner content." } } when given the encrypted-archive fixture. Inner content not addressable through this code path; documented in spec §3 and docs/PRIVACY_GAPS.md.
⁴ mat2 refuses the encrypted-archive fixture (its archive parser errors on encrypted entries). Parity with ZipStrategy. Result class is "refused with clear error" for both tools; neither leaks the inner sentinel because neither produces a cleaned output.
Aggregate verdicts:
- ZipStrategy: 11/11 strict surfaces (DROPPED or NORMALIZED, counting 3a/3b/3c/3d separately). 1 documented gap (encrypted entries).
- exiftool -all= -Time:All=: 0/11. Tool refuses to write ZIPs entirely.
- mat2: 11/11 strict surfaces. 1 gap (encrypted entries, parity with ZipStrategy).
Runner exit code: 0 (PASS — no UNEXPECTED survivors).
Interpretation
ZipStrategy and mat2 are equivalent on this fixture; ExifTool is not a viable reference at all.
- ZipStrategy scrubs the four ZIP-level surfaces (archive comment, per-entry comments, per-entry extra fields, per-entry timestamps → epoch) by re-emitting via JSZip.generateAsync({ comment: "" }) with every entry passed date: new Date(Date.UTC(1980, 0, 1, 12, 0, 0)) and comment: "". The inner-file surfaces are scrubbed via recursion through selectStrategy(): each decompressed entry's bytes are routed back through the strategy registry. A photo.jpg entry hits JpegStrategy; a report.pdf hits PdfStrategy; a memo.docx hits OfficeStrategy; an inner.zip hits ZipStrategy recursively. The output ZIP carries the cleaned-leaf bytes under the original entry names. Surface 8 (nested archive) confirms recursion works structurally: the inner archive's EOCD comment is dropped just like the outer's.
- ExifTool is not a meaningful reference for generic ZIPs. Its documentation says so: "Writing of ZIP files is not yet supported." The matrix records REFUSED across every surface so the comparison row is honest. The gap analysis (docs/gap-analysis/zip.md) makes the same finding from the read side: ExifTool special-cases Office/EPUB/APK/JAR for read-only metadata enumeration and never rewrites the archive. The ~95% of the surface that matters (inner-file metadata + per-entry timestamps + per-entry comments) is untouched.
- mat2 is the meaningful reference. Its libmat2/archive.py ZipParser recurses into each archive entry, dispatches to a format-specific backend, and rewrites the archive with epoch DOS timestamps + scrubbed archive comment. On this fixture mat2 achieves the same surface-by-surface result as ZipStrategy: 8/8 DROPPED-or-NORMALIZED (counting 3a–3d as one), and it refuses the encrypted fixture. The outputs differ in size (ZipStrategy 2494 bytes vs mat2 4106 bytes): mat2 re-encodes the inner PDF via Cairo, producing a larger rasterised PDF, while we keep the original PDF structure via pdf-lib's targeted scrub. For users who care about preserving inner-document fidelity (text remains text, not bitmap), ZipStrategy's approach is strictly preferable. For users who care about maximum sentinel destruction at the cost of inner-file fidelity, the two tools are equivalent on this fixture.
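The routing order described for inner entries (Office before JPEG, PNG, PDF, and generic ZIP) matters because a DOCX is itself a ZIP. The following is a first-match-wins sketch of that idea only. Everything in it is hypothetical: `selectRoute` is not the production selectStrategy(), and the matchers (magic bytes, plus a search for the uncompressed entry name [Content_Types].xml to spot Office packages) are our assumptions about how such detection could work.

```typescript
interface Route {
  name: string;
  matches: (bytes: Uint8Array) => boolean;
}

const startsWith = (bytes: Uint8Array, prefix: number[]): boolean =>
  prefix.every((b, i) => bytes[i] === b);

// Naive search for an ASCII string anywhere in the byte stream.
function containsAscii(bytes: Uint8Array, text: string): boolean {
  const needle = new TextEncoder().encode(text);
  outer: for (let i = 0; i + needle.length <= bytes.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (bytes[i + j] !== needle[j]) continue outer;
    }
    return true;
  }
  return false;
}

const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04]; // "PK\x03\x04"

// Order matters: the Office check must run before the generic ZIP check.
const routes: Route[] = [
  {
    name: "OfficeStrategy",
    matches: (b) =>
      startsWith(b, ZIP_MAGIC) && containsAscii(b, "[Content_Types].xml"),
  },
  { name: "JpegStrategy", matches: (b) => startsWith(b, [0xff, 0xd8, 0xff]) },
  {
    name: "PngStrategy",
    matches: (b) =>
      startsWith(b, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]),
  },
  {
    name: "PdfStrategy",
    matches: (b) => startsWith(b, [0x25, 0x50, 0x44, 0x46, 0x2d]), // "%PDF-"
  },
  { name: "ZipStrategy", matches: (b) => startsWith(b, ZIP_MAGIC) },
];

// Return the first matching route name, or null for unrecognized bytes.
function selectRoute(bytes: Uint8Array): string | null {
  const hit = routes.find((route) => route.matches(bytes));
  return hit ? hit.name : null;
}
```

If the generic ZIP matcher ran first, memo.docx would be treated as a plain archive and its docProps/core.xml would depend on whatever the generic path does with XML entries, so the ordering is part of what surface 7 exercises.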
Where ZipStrategy beats ExifTool outright: every surface. ExifTool cannot strip generic ZIPs at all; it refuses the input. Even the surfaces ExifTool can read (archive comment, inner-file structured metadata) are surfaced read-only, not stripped.
Where ZipStrategy matches mat2: all 8 strict surfaces. Per-entry epoch + comment scrub + inner-file recursion via the strategy registry produces the same result as mat2's per-backend recursion model.
Where mat2 nominally beats ZipStrategy: none on this fixture. The previous direction note in the gap analysis described the design as "encrypted entries pass through with warning, mat2 refuses outright" — but the shipping policy was changed to "refuse encrypted archives" (spec §3) because JSZip's loadAsync refuses encrypted entries at the library level. As a result, ZipStrategy and mat2 are now at parity on encrypted-archive handling: both refuse cleanly.
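For reference, the encrypted marker that both tools react to is bit 0 of the general-purpose flag word, which sits at byte offset 6 of a local file header. The sketch below reads it directly. It is illustrative only: `firstEntryIsEncrypted` is a hypothetical name, it inspects only the first LFH, and it assumes the archive starts at offset 0 (no self-extracting stub), which is not how JSZip or mat2 perform their own checks.

```typescript
const LFH_SIGNATURE = 0x04034b50; // "PK\x03\x04" when written little-endian
const LFH_FIXED_SIZE = 30; // fixed part of a local file header
const FLAG_ENCRYPTED = 0x0001; // general-purpose flag bit 0

// True when the first local file header has the encrypted bit set.
function firstEntryIsEncrypted(zip: Uint8Array): boolean {
  if (zip.length < LFH_FIXED_SIZE) {
    throw new Error("too short to contain a local file header");
  }
  const view = new DataView(zip.buffer, zip.byteOffset, zip.byteLength);
  if (view.getUint32(0, true) !== LFH_SIGNATURE) {
    throw new Error("no local file header at offset 0");
  }
  const flags = view.getUint16(6, true);
  return (flags & FLAG_ENCRYPTED) !== 0;
}
```

A refusal policy only needs this one bit; it never needs the password or the ciphertext.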
Caveats and limits of this test
- Encrypted archives are refused, not stripped. JSZip's loadAsync won't load an archive with any encrypted entry ("Encrypted zip are not supported"), so v1 surfaces invalid-file-format and directs the user to a decryption-capable tool. A byte-level walker bypassing JSZip would unblock partial-passthrough cleaning of zip-level metadata around encrypted content; deferred. Surface 9 is documented as a known gap in docs/PRIVACY_GAPS.md.
- Self-extracting EXE stubs are preserved. Bytes before the first local file header (the PE stub) are not touched, because modifying them breaks the SFX behavior. Documented gap. Not tested here.
- Multi-disk / spanned archives: JSZip rejects them, and we surface this as parse-failed. Not tested.
- ZIP64 (0x0001) extra-field records are the structural exception we preserve, but not sentinel-tested here. Zip64 only triggers on archives or entries exceeding the 32-bit size fields (~4 GB), which a synthetic fixture can't reach cheaply. The gap-analysis policy table is explicit: "preserved (structural; required for archives > 4 GB)." Multi-GB Zip64 verification is deferred.
- The fixture is synthetic but exercises a realistic extra-field profile. Surfaces 3a–3d embed the four record IDs common to Word's "Save as zip", macOS Finder, 7-Zip, and Info-ZIP: custom 0x7878, UT extended timestamp (0x5455), UID/GID Unix v1 (0x7875), and NTFS times (0x000a). Each record carries its own ASCII sentinel inside its data payload, so the recovery battery verifies per-record-id stripping. ZipStrategy's policy ("strip every extra-field record except 0x0001") is confirmed against each ID independently, matching mat2's behavior.
- The PDF inner sentinel is /Info /Author only. The PDF strategy's full sentinel battery (Title / Author / Subject / Producer / Creator / XMP / Annotations / Lang, 10 sentinels) is exercised in docs/forensic/pdf.md. Here we use one sentinel because the point is to verify the recursion path through ZipStrategy → PdfStrategy, not to re-test the PDF strategy's depth (which is already covered).
- The DOCX inner sentinel covers <dc:creator> + <cp:lastModifiedBy>. The Office strategy's full gap battery is exercised in docs/forensic/office.md. Same rationale as the PDF case.
- The mat2 encrypted-archive refusal is detected via mat2 exit status, not by reading mat2's stderr. Surfacing the structured refusal reason from mat2 would require parsing its Python traceback; that's a fragility we deliberately avoid.
- No unzip -P or AES-decryption attempt. The encrypted fixture uses a bogus ZipCrypto payload (12-byte header + 1 byte of "ciphertext"); the password is unknown and not material to the test. Surface 9 is about refusal behavior, not about whether the encryption is "real."
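The extra-field policy mentioned in the caveats ("strip every extra-field record except 0x0001") can be illustrated with a small filter over the record stream. Each record is a 2-byte little-endian header ID, a 2-byte data size, then the data. This is a sketch of the policy, not ZipStrategy's code (which delegates re-emission to JSZip); both function names are hypothetical.

```typescript
// Build one extra-field record: header ID, data size, data.
function buildExtraRecord(id: number, data: Uint8Array): Uint8Array {
  const out = new Uint8Array(4 + data.length);
  const view = new DataView(out.buffer);
  view.setUint16(0, id, true);
  view.setUint16(2, data.length, true);
  out.set(data, 4);
  return out;
}

// Keep only records whose ID is in `keepIds` (default: Zip64, 0x0001).
function stripExtraFields(
  extra: Uint8Array,
  keepIds: ReadonlySet<number> = new Set([0x0001]),
): Uint8Array {
  const view = new DataView(extra.buffer, extra.byteOffset, extra.byteLength);
  const kept: Uint8Array[] = [];
  let pos = 0;
  while (pos + 4 <= extra.length) {
    const id = view.getUint16(pos, true);
    const size = view.getUint16(pos + 2, true);
    const end = pos + 4 + size;
    if (end > extra.length) break; // truncated record: drop it and stop
    if (keepIds.has(id)) kept.push(extra.subarray(pos, end));
    pos = end;
  }
  const out = new Uint8Array(kept.reduce((n, rec) => n + rec.length, 0));
  let offset = 0;
  for (const rec of kept) {
    out.set(rec, offset);
    offset += rec.length;
  }
  return out;
}
```

An allow-list is the safer shape for this policy: an unregistered ID such as 0x7878 is dropped without the filter needing to know what it means, which is the "unknown record = strip anyway" path that surface 3a tests.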
Reproducing
# From the project root
npx tsx tools/forensic/zip.ts
Outputs go to /tmp/zip-forensic/:
- input-primary.zip: the 5-entry fixture (surfaces 1-8, counting 3a–3d as one)
- input-encrypted.zip: the 1-entry encrypted fixture (surface 9)
- output-ours.zip: ZipStrategy output
- output-exiftool.zip: exiftool-cleaned copy (empty on refusal, since exiftool didn't write anything new)
- output-mat2.zip: mat2-cleaned copy
- output-*-encrypted.zip: encrypted-fixture outputs (mostly refusal copies)
- output-ours.zip.extracted/, etc.: per-output extraction tree used by the inner-exiftool + inner-strings channels
- report.json: structured per-surface verdict per strip path
Required tools: exiftool (libimage-exiftool-perl), mat2, unzip, zipinfo, strings (binutils). All available on Debian/Ubuntu via apt.
Debian/Ubuntu one-liner: sudo apt install libimage-exiftool-perl mat2 unzip binutils.
What this directory is for
docs/forensic/ documents adversarial recovery tests run after implementation lands, complementing docs/gap-analysis/ (which runs before implementation to scope what should be removed). The pattern: implement → unit-test correctness → forensic-test unrecoverability → document the result.
Each format gets its own writeup as we go: zip.md here, pdf.md / jpeg.md / office.md / png.md / video.md for the formats shipped earlier. The runner scripts at tools/forensic/<format>.ts stay in the repo so the tests can be re-run any time the strategy changes.
Captured runner output (2026-05-22)
Sentinels embedded in fixture:
ARCHIVE_CMNT SENTINEL-ARCHIVE-CMNT-A1B2C3
ENTRY_CMNT SENTINEL-ENTRY-CMNT-D4E5F6
EXTRA_FIELD SENTINEL-EXTRA-7G8H9I
JPEG_EXIF SENTINEL-JPEG-EXIF-J1K2L3
PDF_INFO SENTINEL-PDF-INFO-M4N5O6
DOCX_CREATOR SENTINEL-DOCX-P7Q8R9
NESTED_ARCHIVE SENTINEL-NESTED-S1T2U3
ENCRYPTED_INNER SENTINEL-ENCRYPTED-V4W5X6
TIMESTAMP_LITERAL 2023-04-15 (non-epoch)
Primary fixture: /tmp/zip-forensic/input-primary.zip (3793 bytes)
Encrypted fixture: /tmp/zip-forensic/input-encrypted.zip (144 bytes)
=== Stripping primary fixture ===
ZipStrategy: ok (2494 bytes)
exiftool: refused-by-design — ExifTool: 'Writing of ZIP files is not yet supported' — documented limitation per https://exiftool.org/#limitations
mat2: ok (4106 bytes)
=== Results matrix (9 surfaces × 3 strip paths) ===
| Surface | Expected | input | ZipStrategy | exiftool | mat2 |
|--------------------------------------------------|-------------------------|----------|--------------|----------|------------|
| 1. Archive comment | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 2. Per-entry comment | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 3a. Extra field — custom (0x7878) | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 3b. Extra field — UT extended timestamp (0x5455) | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 3c. Extra field — UID/GID (0x7875) | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 3d. Extra field — NTFS times (0x000a) | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 4. Per-entry timestamp | NORMALIZED → 1980-01-01 | SURVIVED | NORMALIZED | REFUSED | NORMALIZED |
| 5. Inner JPEG EXIF Artist | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 6. Inner PDF /Author | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 7. Inner DOCX <dc:creator> | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 8. Nested-zip archive comment | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 9. Encrypted entry inner content | KNOWN_GAP | DROPPED | KNOWN_GAP | REFUSED | KNOWN_GAP |
PASS — all 11 (DROPPED/NORMALIZED) strict surfaces verified for ZipStrategy.