feat(zip): generic ZIP support with recursive inner-file cleaning (#184) #188
No reviewers
Labels
No labels
bug
documentation
duplicate
e-copy
e-features
e-mobile
enhancement
f-coverage
f-forensic
f-perf
f-privacy
forensic
good first issue
help wanted
infra
invalid
phase-a
phase-b
phase-c
phase-d
phase-e
phase-f
phase-g
phase-h
priority-1
priority-2
priority-3
privacy
question
v5
v6
video-hardening
wontfix
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: forgejo_admin/exifcleaner-web#188
Loading…
Add table
Reference in a new issue
No description provided.
Delete branch "feat/issue-184-zip-support"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Closes #184.
New
ZipStrategy— walks each archive entry via JSZip, normalizes per-entry timestamps to DOS epoch (UTC noon 1980-01-01 — local-time construction would underflow under negative UTC offsets), scrubs archive + per-entry comments, recursively dispatches every supported inner file throughselectStrategy()(including nested.zip→ ZipStrategy itself via a module-level injected router slot that breaks the circular static import). Encrypted archives are refused with a structuredinvalid-file-formaterror directing users to a decryption-capable tool — deviation from the brainstormed pass-through direction documented in spec §3 and gap-analysis (JSZip refusesloadAsyncon encrypted archives, blocking the partial passthrough at the library level).New
ZipExpansionUI tree — recursive component renders thearchiveEntriesfromStripResult; each cleaned-leaf row lazy-loads its diff on first chevron click via a newwindow.api.wasm.buildArchiveLeafDiffmethod backed byWasmProcessor.pendingLeafDiffs+ExifToolDiffStrategy. Nested-zip rows recurse via the same component withdepth+1. Pagination at 100 rows, indent caps at depth 5, render-time depth limit at 20 guards against adversarial archives. ExtractedMetadataDiffTable+DiffSkeletonfromMetadataDiffExpansionso leaves can reuse the two-pane table without inheriting the file-table expansion chrome.Forensic verification (ExifTool + mat2 cross-tool) —
tools/forensic/zip.tswith adversarial-independent ZIP builder, 9-surface sentinel battery, three strip paths. All 8 strict surfaces pass for ZipStrategy. ExifTool refuses all 8 because it doesn't write generic ZIPs ("Writing of ZIP files is not yet supported" — the central finding of the gap analysis). mat2 matches us on all 8. Surface 9 (encrypted-entry inner content) is a documentedKNOWN_GAPfor both ZipStrategy and mat2 — parity.Test plan
npx tsc --noEmit— cleanyarn lint— cleanyarn check:deps— no circular depsyarn test— 524/524 passing (14 new ZipStrategy unit + 12 new ZipExpansion component tests; all existing tests unchanged)yarn build:web— cleanyarn build:web:standalone— clean; standalone 33.8 MB (baseline unchanged; non-WASM bundle ~1 MB)npx tsx tools/forensic/zip.ts— exit 0, no UNEXPECTED survivors across all 8 strict surfacesyarn test:e2e:web— added two specs intests/e2e/web/zip-archive.spec.ts; CI will exercise themForensic results matrix
¹ ExifTool documented limitation: "Writing of ZIP files is not yet supported".
² v1 refuses encrypted archives outright (
invalid-file-format).³ mat2 also refuses encrypted archives — parity.
Docs
docs/superpowers/specs/2026-05-22-issue-184-zip-support-design.mddocs/superpowers/plans/2026-05-22-issue-184-zip-rollout.mddocs/gap-analysis/zip.mddocs/forensic/zip.mddocs/PRIVACY_GAPS.mdOut of scope (closes
.mdhalf of #184).mdis closed as out-of-scope: markdown has no embedded metadata in the EXIF/format-structure sense. The only conceivable target would be YAML/TOML frontmatter, which is content, not metadata; stripping it would change the file's meaning. There is nothing for aFormatStrategyto do. Confirmed with maintainer 2026-05-22.Co-Authored-By: Claude Opus 4.7 (1M context)
Phase-3 forensic writeup at docs/forensic/zip.md. Records the sentinel-survival matrix across nine surfaces (archive comment, per-entry comment, per-entry extra field, per-entry timestamp, inner JPEG/PDF/DOCX, nested-zip archive comment, encrypted entry content) for three strip paths: ZipStrategy, exiftool -all= -Time:All=, mat2. Findings: ZipStrategy passes 8/8 strict surfaces (DROPPED or NORMALIZED → 1980-01-01); equivalent to mat2 on this fixture. ExifTool refuses to write generic ZIPs entirely ("Writing of ZIP files is not yet supported"), so the comparison row records REFUSED across the board — this is the central finding of the gap analysis, surfaced directly in the matrix. Surface 9 (encrypted entry inner content) is a documented KNOWN_GAP for both ZipStrategy and mat2 (both refuse encrypted archives). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>Addresses ~10 findings from the multi-agent code review: - **pendingLeafDiffs key collision (HIGH).** Nested ZIPs with same-named inner entries collided on the same key, causing the wrong diff to load on click. Fix: compose `entryId\0fullPath` where fullPath is built recursively in stashArchiveLeaves with each outer archive's path prepended. ZipExpansion now threads a `pathPrefix` through the recursive tree so the UI's buildArchiveLeafDiff call passes the same full path the processor stashed. NUL separator matches the established MetadataDiffTable.makeKey convention. - **stashArchiveLeaves leaked nested-zip parents (MEDIUM-HIGH).** Every cleaned entry was stashed regardless of whether it was a leaf, but the UI's `isNestedZip` branch never calls buildArchiveLeafDiff. Fix: gate the stash with `entry.entries === null` so only actual leaves consume the per-leaf slot. - **Top-level diff built but never rendered for ZIPs (MEDIUM).** WasmProcessor.process stashed the full archive bytes into pendingDiffs even though FileRow always routes ZIPs to ZipExpansion. Fix: when archiveEntries is populated, skip the per-entry stash entirely — only the per-leaf stashes carry useful diff inputs. - **Race on rapid-click during diff load (MEDIUM).** Closing a leaf row while its buildArchiveLeafDiff was awaiting would re-open with stale content when the promise resolved. Fix: use a functional setState guard (`fromKind: "loading"`) that ignores stale writes. - **State-key collision on duplicate filenames (LOW).** Two entries with the same path in the same archive (legal per ZIP spec) shared ZipExpansion's expand state. Fix: include the entry index in the state-map key. - **Directory entry date fragility (MEDIUM-HIGH).** `outputZip.folder()` + post-mutation of `.date` had a null-lookup gap that could leak JSZip's default `new Date()` into the directory record (privacy invariant §6 violation). Fix: write directory entries via `outputZip.file(name, "", { dir: true, date, comment })` — JSZip takes the options directly, no post-mutation. - **Dead `passed-through-encrypted` status variant (MEDIUM).** Declared in ArchiveEntryStatus + rendered in ZipExpansion + styled in CSS + i18n key, but no code path ever produces it (encrypted archives are refused outright in v1). Removed across all 5 surfaces; the corresponding test in zip_expansion.test.tsx becomes a comment explaining the deferral. - **Stale `warnings` doc-comment (LOW).** Claimed "encrypted-entry pass-through messages" but no current strategy emits them. Updated to describe the field as reserved for the byte-level walker follow-up. - **Stale FileRow header (LOW).** Listed two expansion branches when there are now four. Updated to enumerate ZipExpansion + Diff mutual-exclusion. - **Nested-BEM CSS classes (LOW).** `.zip-expansion__row__chevron`/path/ status flattened to `.zip-expansion__chevron`/`__path`/`__status` per CLAUDE.md "BEM (mandated for all new CSS)" — single-level elements match every other file in `src/web/styles/`. Skipped (below threshold or design choice): - findEncryptedEntries early-break on streaming descriptor — encrypted archives are refused at the loadAsync layer regardless, so the parse- failed fallback is acceptable. - Inner-strategy refusal silently passing through unsupported — by design per gap-analysis §"Per-source policy table". - MAX_RENDER_DEPTH visual at depth 20 — edge case, deferred. Quality gates: typecheck clean, lint clean, madge no cycles, 523/523 vitest (1 removed test for the dead encrypted-status path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>Self code review
Ran a multi-agent code review (5 parallel reviewers covering CLAUDE.md compliance, shallow bug scan, git history, prior PR comments, code-comment compliance). Surfaced 10 issues above the >35 threshold; all fixed in c74c106.
Critical
pendingLeafDiffskey collision in nested ZIPs. Two nested zips with same-named leaves (e.g.a.zip/photo.jpg+b.zip/photo.jpg) keyed to the same stash slot, causing the wrong diff to load on click. Fix: NUL-separated${entryId}\0${fullPath}wherefullPathis composed recursively from the outer-archive prefix.Fixed at:
wasm_processor.ts#L55-L84,ZipExpansion.tsx#L42-L57stashArchiveLeavesleaked nested-ZIP parent entries. Every cleaned entry was stashed including nested-zip parents that the UI never drains. Fix: gate stash withentry.entries === null.Fixed at:
wasm_processor.ts#L300-L350Directory-entry date fallthrough (privacy invariant §6).
outputZip.folder()+ post-mutation could leak JSZip's defaultnew Date()into the directory record. Fix: usefile(name, "", { dir: true, date, comment }).Fixed at:
zip_strategy.ts#L94-L118Medium
Top-level pendingDiffs stashed-then-wasted for ZIPs. ExifTool diff ran on full archive bytes even though
FileRowalways routes ZIPs toZipExpansion. Fix: skip the per-entry stash whenarchiveEntriesis populated.Fixed at:
wasm_processor.ts#L148-L176Race on close-during-load. Clicking a leaf while its diff was loading and then again to close → resolve forced the row back open with stale content. Fix: functional setState guard
fromKind: "loading".Fixed at:
ZipExpansion.tsx#L75-L94Dead
passed-through-encryptedstatus variant. Declared inArchiveEntryStatus, rendered in UI, styled in CSS, with an i18n key — but no code path ever produces it (encrypted archives are refused outright in v1). Removed across all 5 surfaces.Fixed at:
archive_entry.ts#L11-L18Low (polish)
warningsdoc-comment — said "encrypted-entry pass-through messages" but no current strategy emits them. Updated to "reserved for the byte-level walker follow-up".FileRow.tsxheader — listed two expansion branches; now lists all four (Error, NoMetadata, ZipExpansion, MetadataDiffExpansion) + mutual-exclusion note..zip-expansion__row__chevronetc.) — flattened to single-element form per CLAUDE.md "BEM (mandated for all new CSS)".Skipped (below threshold or by-design)
findEncryptedEntriesearly-break on streaming-descriptor flag — encrypted archives are refused at the loadAsync layer either way, so theparse-failedfallback is acceptable.passed-through-unsupported— by design per gap-analysis §"Per-source policy table" (extension-claimed-but-magic-mismatch is a user-input fact, not a privacy event).MAX_RENDER_DEPTHvisual at depth 20 — edge case (zip quine territory), deferred.Quality gates: typecheck clean, lint clean, madge no cycles, 523/523 vitest (1 test removed for the dead encrypted-status path).
🤖 Generated with Claude Code
c74c106be5to7e144d86e2Rebased onto master + screenshot
Rebased the branch onto the latest
master(which now includes #183 FfmpegFallbackStrategy). One conflict insrc/infrastructure/wasm/wasm_processor.tsbecause master added adiffChain: Promise<unknown>field to serialize calls into@uswriting/exiftool's shared singleton state — my newbuildArchiveLeafDiffhad the same race, so I extended the chain to cover it too (per-leaf calls now serialize against top-level diffs and against each other).Quality gates (post-rebase, with the new ffmpeg dep installed):
yarn typecheck— cleanyarn lint— cleanyarn check:deps— no circular depsyarn test— 537/537 passing (up from 523 — master's FfmpegFallbackStrategy adds tests)yarn build:web— cleanyarn build:web:standalone— clean; ~50 MB now (includes ffmpeg-core.wasm at 32 MB + zeroperl.wasm at 25 MB; my ZIP changes contribute ~5 KB)Screenshot of the ZIP diff UI (against the standalone build via Playwright):
The screenshot shows:
sample-zip.zipcleaned (303 B → 215 B)photo.jpg, expanded —MetadataDiffTablerendering insidezip-expansion__leaf-diff. TheIFD0 · 1 REMOVEDgroup header shows the EXIFArtistfield's sentinel value (E2E-SENTINEL-INNER-JPEG) struck through in the Before column with a placeholder in the After column — confirming the recursive JPEG strip + the lazy-on-expand diff loadnotes/rendered as non-expandable with the "Directory" status labelToast at the bottom is the
dispatchExifToolDiffLoadingcue that fires on first WASM warm-up (the screenshot was captured during the warm-up window before the toast cleared).🤖 Generated with Claude Code
7e144d86e2to45be688702Follow-up
Two follow-ups on the rebase:
i18n gap. The 8 new
zipExpansion.*keys shipped withenonly, which would fall back to English for users on the Arabic-default locale. Added Spanish + Arabic translations to match the existing pattern.Screenshot showed
BeforeAftercollapsed. The grid layout for the two-pane header is scoped to.file-table__diff--two-paneinfile_table.css; my leaf-diff wrapper only hadzip-expansion__leaf-diff, so the labels stacked next to each other without gap or columns. Fix: includefile-table__diff--two-paneon the leaf wrapper so the cascade hits the existing rule. New screenshot below.🤖 Generated with Claude Code
Surface 3 (per-entry extra field) was previously a single sentinel embedded in a custom 0x7878 record. Real-world ZIPs from Word's "Save as zip", macOS Finder, 7-Zip, and Info-ZIP commonly carry several record IDs alongside file data: - UT extended timestamp (0x5455) — wall-clock mtime/atime/ctime - UID/GID Unix v1 (0x7875) — creator's user/group IDs - NTFS times (0x000a) — 100-ns high-resolution Windows timestamps - custom 0x7878 — original sub-surface ZipStrategy's policy ("strip every extra-field record except Zip64 0x0001") is now verified per-record-id: 3a/3b/3c/3d each carry a distinct ASCII sentinel embedded in their data payload, and the recovery battery confirms each is independently stripped. BuilderEntry shape changes from a single optional extraFieldPayload to an array of {headerId, data} records. The builder concatenates records in LFH/CD order. All four record IDs are added to the notes.txt entry via the new buildRealWorldExtraFields helper. Runner output: 11/11 strict surfaces DROPPED-or-NORMALIZED on ZipStrategy. ExifTool refuses all 11 (writes generic ZIPs not supported). mat2 matches at 11/11. Surface 9 (encrypted inner content) remains KNOWN_GAP — parity with mat2. Drops the "extending the fixture is a worthwhile follow-up" caveat from docs/forensic/zip.md and replaces it with the new per-record sub-surface explanation + the documented Zip64 deferral (Zip64 is only triggered by >4 GB archives; a multi-GB fixture is out of scope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>Forensic extra-field coverage extended
You called out the "real-world extra-field profiles ... worthwhile follow-up" caveat — done. Surface 3 was a single custom 0x7878 record; now four sub-surfaces, one per record ID common in real-world archives:
SENTINEL-EXTRA-7G8H9I(existing, exercises the "unknown record = strip anyway" path)SENTINEL-EXTRA-UT-K2L3M4(the wall-clock mtime/atime/ctime trio Linuxzip/ Finder / Word's "Save as zip" write)SENTINEL-EXTRA-UIDGID-N5O6P7(creator's uid/gid; identifies the user account that built the archive)SENTINEL-EXTRA-NTFS-Q8R9S0(Windows high-resolution timestamps)Each record carries its own distinct ASCII sentinel in its
datapayload; the recovery battery confirms per-record-id stripping. All four DROPPED for ZipStrategy. mat2 matches at 4/4. ExifTool refuses all 4 (consistent with the documented "writes generic ZIPs not supported" finding).Final runner output:
Zip64 (0x0001) — the structural exception we preserve — is not sentinel-tested because Zip64 is only triggered by >4 GB archives or entries, and a multi-GB fixture is out of scope. Documented in the writeup's "Caveats and limits" section.
🤖 Generated with Claude Code
Updated ZIP diff — inner-leaf metadata diff
Fixes the empty-diff condition (#184 follow-up):