feat(zip): generic ZIP support with recursive inner-file cleaning (#184) (#188)
All checks were successful
CI / Lint, Typecheck & Unit Tests (push) Successful in 32s
CI / Smoke build (VITE_ENABLE_FFMPEG_FALLBACK=false) (push) Successful in 45s
CI / E2E (Standalone single-file) (push) Successful in 1m35s
CI / E2E (Web) (push) Successful in 3m23s

forgejo_admin 2026-05-22 20:32:03 +04:00
parent a5546afa71
commit d9e763b76e
35 changed files with 6051 additions and 327 deletions


@ -1493,5 +1493,50 @@
"en": "Loading metadata reader…",
"es": "Cargando lector de metadatos…",
"ar": "جارٍ تحميل قارئ البيانات الوصفية…"
},
"zipExpansion.statusCleaned": {
"en": "Cleaned",
"es": "Limpiado",
"ar": "تم التنظيف"
},
"zipExpansion.statusAlreadyClean": {
"en": "Already clean",
"es": "Ya limpio",
"ar": "نظيف بالفعل"
},
"zipExpansion.statusUnsupported": {
"en": "Unsupported — passed through",
"es": "No compatible — sin modificar",
"ar": "غير مدعوم — تم التمرير دون تعديل"
},
"zipExpansion.statusDirectory": {
"en": "Directory",
"es": "Carpeta",
"ar": "مجلد"
},
"zipExpansion.showMore": {
"en": "Show {count} more entries",
"es": "Mostrar {count} entradas más",
"ar": "إظهار {count} عنصر إضافي"
},
"zipExpansion.depthLimit": {
"en": "Depth limit reached — drop the inner file directly",
"es": "Límite de profundidad alcanzado — arrastra el archivo interno directamente",
"ar": "تم بلوغ حد العمق — أفلت الملف الداخلي مباشرةً"
},
"zipExpansion.noMetadata": {
"en": "No metadata detected",
"es": "No se detectaron metadatos",
"ar": "لم يتم اكتشاف بيانات وصفية"
},
"zipExpansion.alreadyClean": {
"en": "No metadata — file appears already clean",
"es": "Sin metadatos — el archivo parece estar ya limpio",
"ar": "لا توجد بيانات وصفية — يبدو الملف نظيفاً بالفعل"
},
"zipExpansion.diffFailed": {
"en": "Couldn't load diff — internal error",
"es": "No se pudo cargar la comparación — error interno",
"ar": "تعذّر تحميل المقارنة — خطأ داخلي"
}
}


@ -45,6 +45,7 @@ For "what's *partially* cleaned even when supported", see [`docs/PRIVACY_GAPS.md
| PDF | Best-effort³ |
| DOCX, XLSX, PPTX, ODT | Partial⁴ (WASM strategy) |
| MP4, MOV, M4V, 3GP, 3G2 | Partial⁵ (WASM strategy) |
| ZIP | Full⁷ (recursive inner-file cleaning) |
| MKV | Unsupported (issue #43, deferred to v6) |
| RAW (CR2/CR3/NEF/ARW/RAF/ORF/DNG/...) | Unsupported⁶ |
| SVG, JXL, JPEG 2000, AVI | Unsupported |
@ -57,6 +58,7 @@ Footnotes:
4. Office: clears `docProps/{core,app,custom}.xml` and a thumbnail. Known partial coverage of tracked changes/comments, RSIDs, embedded media EXIF, `customXml/` parts, and file paths in `*.rels` — tracked under issue #62 (Office Phase 2 hardening). See [`docs/PRIVACY_GAPS.md`](docs/PRIVACY_GAPS.md) for the user-facing summary.
5. MP4/MOV: drops `udta`, `meta`, and `Xtra` containers via mp4box.js box-tree rewrite (no re-encoding, lossless). Known gaps in timed-metadata tracks, `hdlr` names, `compressorname`, mdat orphans, and sidecar files — see [`docs/PRIVACY_GAPS.md`](docs/PRIVACY_GAPS.md#mp4--mov-video-gaps) for the user-facing summary.
6. RAW: removed in v5 (decided 2026-05-09, shipped 2026-05-10). No production-ready WASM library covers proprietary RAW. RAW workflows should use [ExifTool standalone](https://exiftool.org/) or a dedicated RAW tool — see [`docs/PRIVACY_GAPS.md#raw-unsupported`](docs/PRIVACY_GAPS.md#raw-unsupported).
7. ZIP: per-entry timestamps normalized to DOS epoch (1980-01-01); per-entry comments and extra fields scrubbed; archive comment scrubbed. Each supported inner file is re-dispatched through `selectStrategy()` and cleaned with its native walker (JPEG/PNG/PDF/Office/MP4/etc.); nested `.zip` entries recurse. UI shows a per-entry tree with lazy on-expand diff loads. Encrypted archives are refused with a clear message directing users to a decryption-capable tool (see [`docs/PRIVACY_GAPS.md`](docs/PRIVACY_GAPS.md#zip-archives)). Full analysis: [`docs/gap-analysis/zip.md`](docs/gap-analysis/zip.md). Forensic verification: [`docs/forensic/zip.md`](docs/forensic/zip.md).
## Running the web app locally


@ -97,6 +97,32 @@ The current shipping state. Expect this table to drift; the README's Format Supp
---
## ZIP archives
The `ZipStrategy` (issue #184, shipped 2026-05) cleans ZIP archive metadata and recursively re-cleans every supported inner file. Three known gaps remain:
### Encrypted ZIPs are refused, not cleaned
**What this means:** if your `.zip` contains entries encrypted with a password (ZipCrypto or AES-via-WinZip), MetaScrub refuses to process the archive and surfaces an "Encrypted ZIP archives aren't supported" message.
**Why:** the bundled ZIP library (JSZip, already a production dep for Office) refuses `loadAsync` on any archive containing encrypted entries. Working around that would require a parallel byte-level walker, a significant amount of additional code that we deferred for v1.
**Workaround:** decrypt the archive with a dedicated tool (7-Zip, `unzip` from the command line, `mat2`'s archive backend) and re-drop the decrypted contents into MetaScrub. We may add a byte-level fallback in a follow-up if demand surfaces.
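The refusal condition above can be checked before handing bytes to any ZIP library. The following is a minimal sketch, not MetaScrub's actual code: `findEocd` and `hasEncryptedEntry` are hypothetical names, and the sketch assumes a single-disk, non-Zip64 archive. It walks the central directory and reports whether any entry has general-purpose flag bit 0 (encrypted) set.

```typescript
// Hypothetical helpers for illustration; assumes a single-disk, non-Zip64 archive.
const EOCD_SIG = 0x06054b50; // end of central directory record
const CDH_SIG = 0x02014b50; // central directory file header

function findEocd(view: DataView): number {
  // The EOCD is 22 bytes plus an optional comment of at most 65535 bytes.
  const min = Math.max(0, view.byteLength - 22 - 0xffff);
  for (let i = view.byteLength - 22; i >= min; i--) {
    if (view.getUint32(i, true) === EOCD_SIG) return i;
  }
  return -1;
}

function hasEncryptedEntry(bytes: Uint8Array): boolean {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const eocd = findEocd(view);
  if (eocd < 0) throw new Error("not a ZIP: EOCD not found");
  const count = view.getUint16(eocd + 10, true); // total central directory entries
  let p = view.getUint32(eocd + 16, true); // offset of the central directory
  for (let i = 0; i < count; i++) {
    if (p + 46 > view.byteLength || view.getUint32(p, true) !== CDH_SIG) {
      throw new Error("corrupt central directory");
    }
    const flags = view.getUint16(p + 8, true); // general-purpose bit flag
    if ((flags & 0x0001) !== 0) return true; // bit 0: entry is encrypted
    const nameLen = view.getUint16(p + 28, true);
    const extraLen = view.getUint16(p + 30, true);
    const commentLen = view.getUint16(p + 32, true);
    p += 46 + nameLen + extraLen + commentLen; // advance to the next header
  }
  return false;
}
```

A pre-check like this only identifies the condition; it does not decrypt anything, so the workaround above still applies.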
### Self-extracting EXE stub bytes are preserved
**What this means:** if a `.zip` is wrapped in a self-extracting Windows executable (the bytes before the first local file header form a PE stub), MetaScrub preserves those bytes verbatim. The stub itself may carry the original creator's identifying metadata (PE timestamps, OriginalFilename string, etc.).
**Why:** modifying the stub would break the SFX behavior, and there is no reliable way to distinguish an "intentional SFX stub" from "arbitrary leading garbage" using the byte stream alone.
**Workaround:** repackage the contents as a plain `.zip` (without the SFX wrapper) before dropping it into MetaScrub.
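To see whether an archive carries leading bytes of this kind, one can measure how much data precedes the first local file header. This is a rough sketch with hypothetical helper names (`leadingStubLength`, `looksLikePeStub`), not part of MetaScrub. It assumes the stub itself never contains the `PK\x03\x04` byte sequence, which a real executable could violate; that fragility is one reason the stub is left alone.

```typescript
// Hypothetical helpers for illustration only.
// Returns the number of bytes before the first local file header signature
// ("PK\x03\x04"), or -1 if no such signature exists.
function leadingStubLength(bytes: Uint8Array): number {
  for (let i = 0; i + 4 <= bytes.length; i++) {
    if (
      bytes[i] === 0x50 &&
      bytes[i + 1] === 0x4b &&
      bytes[i + 2] === 0x03 &&
      bytes[i + 3] === 0x04
    ) {
      return i;
    }
  }
  return -1;
}

// A DOS/PE executable starts with the two bytes "MZ". This is a heuristic:
// it says nothing about whether the prefix is a working SFX stub.
function looksLikePeStub(bytes: Uint8Array): boolean {
  const prefix = leadingStubLength(bytes);
  return prefix >= 2 && bytes[0] === 0x4d && bytes[1] === 0x5a;
}
```

A non-zero prefix length on a file you expected to be a plain `.zip` is a hint to repackage it before cleaning.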
### Multi-disk / spanned archives are refused
`.zip` archives split across multiple `.z01`/`.z02`/… files are rejected with a `parse-failed` error. JSZip does not support multi-disk reads. Reassemble the archive locally (e.g. `zip -F`) before processing.
---
## MP4 / MOV video gaps
The current `VideoStrategy` (mp4box.js-based box-tree rewriter) drops `udta`, `meta`, and `Xtra` containers but does not cover several known sources of leak. These are tracked individually; this section is the user-facing summary.

docs/forensic/zip.md Normal file

@ -0,0 +1,182 @@
# ZIP forensic recovery test
**Date:** 2026-05-22
**Goal:** Verify that metadata stripped by `ZipStrategy` cannot be recovered by an attacker with standard ZIP forensic tooling, across nine surfaces that span the ZIP container itself (archive comment, per-entry comment, per-entry extra field, per-entry timestamp) plus the inner files commonly carried inside archives (JPEG EXIF, PDF Info, DOCX docProps, nested ZIPs, encrypted entries). Compare against `exiftool -all= -Time:All=` and [mat2](https://0xacab.org/jvoisin/mat2) as reference points.
**Reproducible at:** [`tools/forensic/zip.ts`](../../tools/forensic/zip.ts) — `npx tsx tools/forensic/zip.ts` from the project root.
## Methodology
The runner builds two synthetic ZIP fixtures programmatically. The primary fixture exercises eight metadata surfaces; a separate encrypted-archive fixture exercises the ninth.
The byte-level ZIP builder is **independent of JSZip** — per [`.claude/rules/format-strategy-workflow.md`](../../.claude/rules/format-strategy-workflow.md), adversarial-independence between the fixture builder and the production strategy's library protects the test from a shared-quirk class of false negatives. Same rationale as `tools/forensic/video.ts`'s `walkAtoms` (independent of `parseBoxes`).
**Sentinels embedded across nine surfaces:**
| # | Sentinel | Surface | Where it lives in the fixture |
| --- | ------------------------------ | ------------------------------------------------------------- | ---------------------------------------------------------------------- |
| 1 | `SENTINEL-ARCHIVE-CMNT-A1B2C3` | Archive comment | EOCD `.ZIP file comment` |
| 2 | `SENTINEL-ENTRY-CMNT-D4E5F6` | Per-entry comment | Central directory entry `file comment` on `notes.txt` |
| 3a | `SENTINEL-EXTRA-7G8H9I` | Per-entry extra field — custom (0x7878) | Arbitrary unregistered ID; tests the "unknown record = strip anyway" path on `notes.txt` |
| 3b | `SENTINEL-EXTRA-UT-K2L3M4` | Per-entry extra field — UT extended timestamp (0x5455) | Info-ZIP UT record (the wall-clock mtime/atime/ctime trio commonly written by Linux `zip`, macOS Finder, Word's "Save as zip") on `notes.txt` |
| 3c | `SENTINEL-EXTRA-UIDGID-N5O6P7` | Per-entry extra field — UID/GID Unix v1 (0x7875) | Info-ZIP Unix v1 record (creator's uid/gid; identifies the user account that created the archive) on `notes.txt` |
| 3d | `SENTINEL-EXTRA-NTFS-Q8R9S0` | Per-entry extra field — NTFS times (0x000a) | Windows NTFS record (100-ns mtime/atime/ctime, higher-fidelity than DOS) on `notes.txt` |
| 4 | `2023-04-15 14:32:11` | Per-entry timestamp | DOS-encoded last-mod date/time in LFH + CD of every entry |
| 5 | `SENTINEL-JPEG-EXIF-J1K2L3` | Inner JPEG EXIF Artist | EXIF/APP1 IFD0 tag 0x013b (Artist) in `photo.jpg` |
| 6 | `SENTINEL-PDF-INFO-M4N5O6` | Inner PDF /Author | `/Info /Author` in `report.pdf` (pdf-lib `setAuthor()`) |
| 7 | `SENTINEL-DOCX-P7Q8R9` | Inner DOCX `<dc:creator>` | `docProps/core.xml` `dc:creator` + `cp:lastModifiedBy` in `memo.docx` |
| 8 | `SENTINEL-NESTED-S1T2U3` | Nested-zip archive comment (recursion test) | EOCD `.ZIP file comment` of `inner.zip` (carried as an entry) |
| 9 | `SENTINEL-ENCRYPTED-V4W5X6` | Encrypted entry inner content (KNOWN GAP) | Cleartext payload of `secret.txt` in the separate encrypted-archive fixture, with general-purpose-flag bit 0 (encrypted) set on the LFH/CD |
The primary fixture is a 5-entry ZIP: `notes.txt` (carrying surfaces 2-4), `photo.jpg` (surface 5), `report.pdf` (surface 6), `memo.docx` (surface 7), and `inner.zip` (surface 8 — itself carrying a nested-readme entry and the nested archive comment). Every entry's last-mod timestamp is 2023-04-15 14:32:11 (surface 4); the EOCD carries the archive comment (surface 1). The encrypted-archive fixture is a 1-entry ZIP whose `secret.txt` LFH has GP-bit 0 set; ZipStrategy refuses encrypted archives at the magic-byte check, so surface 9 is documented as a known gap rather than tested for byte-level stripping.
The fixtures are then stripped three ways:
1. **`ZipStrategy`** — our JSZip-based implementation, invoked in-process. The runner inlines the production routing (`OfficeStrategy → JpegStrategy → PngStrategy → PdfStrategy → ZipStrategy`) and wires it into `setZipStrategyRouter` so inner-entry recursion goes through the same `selectStrategy()` path as the production renderer.
2. **`exiftool -all= -Time:All= -overwrite_original`** — the canonical reference for image-metadata tools. ExifTool's documentation explicitly states ["Writing of ZIP files is not yet supported"](https://exiftool.org/#limitations); the runner records this refusal as `REFUSED` rather than treating it as a runner failure. This is the documented finding from the gap analysis, surfaced directly in the matrix.
3. **mat2** — the FOSS reference used by Tails OS. mat2's `libmat2/archive.py` `ZipParser` recurses into archive entries, calls format-specific parsers per entry, and rewrites the archive with epoch timestamps + scrubbed comments. This is the meaningful comparison reference — ExifTool isn't a viable baseline because it doesn't write generic ZIPs at all.
For each cleaned output, the recovery battery applies six techniques:
1. **`unzip -z <file>`** — prints the archive comment. Catches surface 1.
2. **`zipinfo -v <file>`** — verbose listing including per-entry comments and extra fields. Catches surfaces 2 + 3.
3. **`unzip -l <file>`** — listing including per-entry timestamps. Catches surface 4 (looks for the literal `2023-04-15` or `1980-01-01`).
4. **Inner-file extraction + `exiftool -a -G1 -s` per entry** — surfaces structured metadata in extracted JPEG / PDF / DOCX. Catches surfaces 5 + 6 + 7.
5. **Inner-file extraction + `strings` per entry** — catches any sentinel left in plain-text bytes anywhere in the extracted entry tree, including the nested-zip archive comment when the nested archive is itself extracted. Catches surface 8 (and provides a cross-check for surfaces 5-7).
6. **Raw `strings` over the cleaned ZIP bytes** — catches any leakage of sentinels into the outer ZIP's central directory or LFH stream that wouldn't surface through the per-entry channels.
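Technique 6 amounts to a byte-level substring search for each sentinel. A minimal sketch of that check follows; the function names are hypothetical and are not taken from `tools/forensic/zip.ts`. Note that a raw scan only sees bytes that are stored uncompressed, which is why the battery also extracts inner files before searching them.

```typescript
// Hypothetical helpers for illustration; not the runner's actual code.
function asciiBytes(s: string): Uint8Array {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) out[i] = s.charCodeAt(i) & 0xff;
  return out;
}

// Naive search for `needle` inside `haystack`; returns the first index or -1.
function indexOfBytes(haystack: Uint8Array, needle: Uint8Array): number {
  if (needle.length === 0) return 0;
  outer: for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    return i;
  }
  return -1;
}

// Returns the sentinels that are still present in the cleaned bytes.
function survivingSentinels(cleaned: Uint8Array, sentinels: string[]): string[] {
  return sentinels.filter((s) => indexOfBytes(cleaned, asciiBytes(s)) >= 0);
}
```

An empty result from `survivingSentinels` on the cleaned archive corresponds to a `DROPPED` verdict for this channel.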
Verdict per surface per strip path: `DROPPED` (sentinel absent), `NORMALIZED` (timestamp is 1980-01-01 instead of the input's 2023-04-15), `SURVIVED` (sentinel found anywhere), `REFUSED` (the tool declines to process ZIP), `SKIP` (channel not collected), `KNOWN_GAP` (documented gap, not tested against this output).
**Bar:** zero sentinel survivors across every recovery technique for `ZipStrategy` on surfaces 1-8 (counting 3a-3d as one). Surface 4 is NORMALIZED (not DROPPED — the format requires *some* timestamp, and ZIP's 1980-01-01 epoch is the minimum DOS-time per [`.claude/rules/privacy-invariants.md`](../../.claude/rules/privacy-invariants.md) §6). Surface 9 is a documented `KNOWN_GAP` — ZipStrategy refuses encrypted archives outright in v1, so the encrypted-inner sentinel is unaddressable through normal flow. The runner exits non-zero on UNEXPECTED survivors for surfaces 1-8 (counting 3a-3d as one).
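The reason 1980-01-01 is the floor follows from the MS-DOS date/time packing used in the LFH and central directory: the year is stored as an offset from 1980, and month and day cannot be zero. A small worked sketch (the function names are illustrative, not from the codebase):

```typescript
// Illustrative encoder/decoder for the 16-bit DOS date and time fields.
interface DosDateTime {
  date: number; // bits 15-9: year - 1980, bits 8-5: month, bits 4-0: day
  time: number; // bits 15-11: hour, bits 10-5: minute, bits 4-0: seconds / 2
}

function toDosDateTime(
  year: number, month: number, day: number,
  hour: number, minute: number, second: number,
): DosDateTime {
  if (year < 1980 || year > 2107) throw new RangeError("year outside DOS range");
  return {
    date: ((year - 1980) << 9) | (month << 5) | day,
    time: (hour << 11) | (minute << 5) | (second >> 1),
  };
}

function fromDosDateTime(d: DosDateTime): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  const year = 1980 + (d.date >> 9);
  const month = (d.date >> 5) & 0x0f;
  const day = d.date & 0x1f;
  const hour = d.time >> 11;
  const minute = (d.time >> 5) & 0x3f;
  const second = (d.time & 0x1f) * 2; // two-second resolution
  return `${year}-${pad(month)}-${pad(day)} ${pad(hour)}:${pad(minute)}:${pad(second)}`;
}
```

With this packing, the epoch 1980-01-01 00:00:00 encodes as date `0x0021` and time `0x0000`, while the fixture's 2023-04-15 14:32:11 encodes as date `0x568f` and time `0x7405` and decodes back as `2023-04-15 14:32:10`, since seconds are stored in two-second units.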
## Results
Captured 2026-05-22 from `npx tsx tools/forensic/zip.ts`. Tools: exiftool 13.30, mat2 0.13.4, unzip 6.0, zipinfo 6.0.
| # | Surface | Expected | Input (sanity) | `ZipStrategy` | `exiftool -all= -Time:All=` | mat2 |
| --- | ------------------------------------ | ------------------------- | ---------------- | ------------- | --------------------------- | ------------- |
| 1 | Archive comment | DROPPED | present | **DROPPED** | REFUSED¹ | DROPPED |
| 2 | Per-entry comment | DROPPED | present | **DROPPED** | REFUSED¹ | DROPPED |
| 3a | Extra field — custom (0x7878) | DROPPED | present | **DROPPED** | REFUSED¹ | DROPPED |
| 3b | Extra field — UT timestamp (0x5455) | DROPPED | present | **DROPPED** | REFUSED¹ | DROPPED |
| 3c | Extra field — UID/GID (0x7875) | DROPPED | present | **DROPPED** | REFUSED¹ | DROPPED |
| 3d | Extra field — NTFS times (0x000a) | DROPPED | present | **DROPPED** | REFUSED¹ | DROPPED |
| 4 | Per-entry timestamp | NORMALIZED → 1980-01-01 | 2023-04-15 | **NORMALIZED**| REFUSED¹ | NORMALIZED |
| 5 | Inner JPEG EXIF Artist | DROPPED | present | **DROPPED** | REFUSED¹ | DROPPED |
| 6 | Inner PDF /Author | DROPPED | present | **DROPPED** | REFUSED¹ | DROPPED |
| 7 | Inner DOCX `<dc:creator>` | DROPPED | present | **DROPPED** | REFUSED¹ | DROPPED |
| 8 | Nested-zip archive comment | DROPPED | present | **DROPPED** | REFUSED¹ | DROPPED |
| 9 | Encrypted entry inner content | KNOWN_GAP | n/a² | **KNOWN_GAP**³| REFUSED¹ | KNOWN_GAP⁴ |
¹ ExifTool exits with `Error: Writing of ZIP files is not yet supported`. Documented limitation per [exiftool.org/#limitations](https://exiftool.org/#limitations) — ExifTool reads metadata from ZIPs (and recognises Office/EPUB/APK as special cases for read-only enumeration) but does not write back. The matrix records `REFUSED` rather than `SKIP` because this is the central finding from the gap analysis: ExifTool is not a meaningful reference for stripping generic ZIPs.
² Surface 9 lives in a separate encrypted-archive fixture, not the primary fixture, so the "input sanity" column does not apply.
³ ZipStrategy returns `{ ok: false, error: { code: "invalid-file-format", detail: "Encrypted ZIP archives aren't supported — use a dedicated tool (7-Zip, ExifTool standalone) that can decrypt to clean inner content." } }` when given the encrypted-archive fixture. Inner content not addressable through this code path; documented in spec §3 and `docs/PRIVACY_GAPS.md`.
⁴ mat2 refuses the encrypted-archive fixture (its archive parser errors on encrypted entries). Parity with `ZipStrategy`. Result class is "refused with clear error" for both tools; neither leaks the inner sentinel because neither produces a cleaned output.
**Aggregate verdicts:**
- `ZipStrategy`: 11/11 strict surfaces (DROPPED or NORMALIZED — counting 3a/3b/3c/3d separately). 1 documented gap (encrypted entries).
- `exiftool -all= -Time:All=`: 0/11. Tool refuses to write ZIPs entirely.
- `mat2`: 11/11 strict surfaces. 1 gap (encrypted entries — parity).
Runner exit code: **0** (PASS — no UNEXPECTED survivors).
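The aggregate counts and the exit code follow mechanically from the matrix. The sketch below shows one way to derive them for a single strip path; the `Row` shape and `summarize` function are assumptions for illustration and may not match the structure of `report.json` or the runner's internals.

```typescript
// Hypothetical types; the real report.json layout may differ.
type Verdict = "DROPPED" | "NORMALIZED" | "SURVIVED" | "REFUSED" | "SKIP" | "KNOWN_GAP";

interface Row {
  surface: string;
  expected: Verdict;
  actual: Verdict;
}

interface Summary {
  passed: number; // strict surfaces whose actual verdict matches the expectation
  strict: number; // surfaces expected to be DROPPED or NORMALIZED
  exitCode: number; // 0 when every strict surface passed, 1 otherwise
}

function summarize(rows: Row[]): Summary {
  const strictRows = rows.filter(
    (r) => r.expected === "DROPPED" || r.expected === "NORMALIZED",
  );
  const passed = strictRows.filter((r) => r.actual === r.expected).length;
  return {
    passed,
    strict: strictRows.length,
    exitCode: passed === strictRows.length ? 0 : 1,
  };
}
```

Rows whose expectation is `KNOWN_GAP` are excluded from the strict count, which is how the encrypted-entry surface stays out of the 11/11 figure.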
## Interpretation
**`ZipStrategy` and mat2 are equivalent on this fixture; ExifTool is not a viable reference at all.**
- **ZipStrategy** scrubs the four ZIP-level surfaces (archive comment, per-entry comments, per-entry extra fields, per-entry timestamps → epoch) by re-emitting via `JSZip.generateAsync({ comment: "" })` with every entry passed `date: new Date(Date.UTC(1980, 0, 1, 12, 0, 0))` and `comment: ""`. The inner-file surfaces are scrubbed via recursion through `selectStrategy()`: each decompressed entry's bytes are routed back through the strategy registry. A `photo.jpg` entry hits `JpegStrategy`; a `report.pdf` hits `PdfStrategy`; a `memo.docx` hits `OfficeStrategy`; an `inner.zip` hits `ZipStrategy` recursively. The output ZIP carries the cleaned-leaf bytes under the original entry names. Surface 8 (nested archive) confirms recursion works structurally — the inner archive's EOCD comment is dropped just like the outer's.
- **ExifTool** is not a meaningful reference for generic ZIPs. Its documentation says so: "Writing of ZIP files is not yet supported." The matrix records `REFUSED` across every surface so the comparison row is honest. The gap analysis ([`docs/gap-analysis/zip.md`](../gap-analysis/zip.md)) makes the same finding from the read-side — ExifTool special-cases Office/EPUB/APK/JAR for read-only metadata enumeration only, never re-writes the archive. The ~95% of the surface that matters (inner-file metadata + per-entry timestamps + per-entry comments) is untouched.
- **mat2** is the meaningful reference. Its `libmat2/archive.py` `ZipParser` recurses into each archive entry, dispatches to a format-specific backend, and rewrites the archive with epoch DOS timestamps + scrubbed archive comment. On this fixture mat2 achieves the same surface-by-surface result as `ZipStrategy`: 8/8 DROPPED-or-NORMALIZED (counting 3a-3d as one), refuses the encrypted fixture. The outputs differ in size (`ZipStrategy` 2494 bytes vs mat2 4106 bytes — mat2 re-encodes the inner PDF via Cairo, producing a larger rasterised PDF; we keep the original PDF structure via `pdf-lib`'s targeted scrub). For users who care about preserving inner-document fidelity (text remains text, not bitmap), `ZipStrategy`'s approach is strictly preferable. For users who care about maximum sentinel destruction at the cost of inner-file fidelity, the two tools are equivalent on this fixture.
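The recursion model described in the bullets above can be sketched independently of any ZIP library. Everything in this block is hypothetical: the `Container` and `LeafCleaner` interfaces stand in for the strategy registry, `MAX_DEPTH` is an assumed value, and passing the bytes through unchanged at the depth limit is an assumption rather than confirmed production behavior.

```typescript
// Hypothetical shapes standing in for the strategy registry.
interface Entry {
  name: string;
  bytes: Uint8Array;
}

interface LeafCleaner {
  matches(name: string): boolean;
  clean(bytes: Uint8Array): Uint8Array;
}

interface Container {
  matches(name: string): boolean;
  unpack(bytes: Uint8Array): Entry[];
  pack(entries: Entry[]): Uint8Array; // assumed to re-emit without archive metadata
}

const MAX_DEPTH = 3; // assumed limit, not the production value

function cleanEntry(
  entry: Entry,
  leaves: LeafCleaner[],
  container: Container,
  depth: number,
  log: string[],
): Uint8Array {
  if (container.matches(entry.name)) {
    if (depth >= MAX_DEPTH) {
      log.push(`${entry.name}: depth-limit`);
      return entry.bytes; // assumption: pass through unchanged at the limit
    }
    // Clean every inner entry first, then rebuild the archive around them.
    const inner = container.unpack(entry.bytes).map((e) => ({
      name: e.name,
      bytes: cleanEntry(e, leaves, container, depth + 1, log),
    }));
    log.push(`${entry.name}: cleaned`);
    return container.pack(inner);
  }
  const leaf = leaves.find((l) => l.matches(entry.name));
  if (leaf === undefined) {
    log.push(`${entry.name}: unsupported`);
    return entry.bytes; // unsupported entries pass through
  }
  log.push(`${entry.name}: cleaned`);
  return leaf.clean(entry.bytes);
}
```

The key property is that nested archives are rebuilt from already-cleaned children, so a nested archive's own comment and timestamps are handled by the same code path as the outer archive's.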
**Where ZipStrategy beats ExifTool outright:** every surface. ExifTool cannot strip generic ZIPs at all; it refuses the input. Even the surfaces ExifTool *can* read (archive comment, inner-file structured metadata) are surfaced read-only, not stripped.
**Where ZipStrategy matches mat2:** all 8 strict surfaces. Per-entry epoch + comment scrub + inner-file recursion via the strategy registry produces the same result as mat2's per-backend recursion model.
**Where mat2 nominally beats ZipStrategy:** none on this fixture. The previous direction note in the gap analysis described the design as "encrypted entries pass through with warning, mat2 refuses outright" — but the shipping policy was changed to "refuse encrypted archives" (spec §3) because JSZip's `loadAsync` refuses encrypted entries at the library level. As a result, ZipStrategy and mat2 are now at parity on encrypted-archive handling: both refuse cleanly.
## Caveats and limits of this test
- **Encrypted archives are refused, not stripped.** JSZip's `loadAsync` won't load an archive with any encrypted entry (`"Encrypted zip are not supported"`), so v1 surfaces `invalid-file-format` and directs the user to a decryption-capable tool. A byte-level walker bypassing JSZip would unblock partial-passthrough cleaning of zip-level metadata around encrypted content; deferred. Surface 9 is documented as a known gap in [`docs/PRIVACY_GAPS.md`](../PRIVACY_GAPS.md).
- **Self-extracting EXE stubs are preserved.** Bytes before the first local file header (the PE stub) are not touched — modifying them breaks the SFX behavior. Documented gap. Not tested here.
- **Multi-disk / spanned archives** — JSZip rejects them; surface as `parse-failed`. Not tested.
- **ZIP64 (0x0001) extra-field records are the structural exception we preserve, but not sentinel-tested here.** Zip64 only triggers on archives or entries exceeding the 32-bit size fields (~4 GB), which a synthetic fixture can't reach cheaply. The gap-analysis policy table is explicit: "preserved (structural; required for archives > 4 GB)." Multi-GB Zip64 verification is deferred.
- **The fixture is synthetic but exercises a realistic extra-field profile.** Surfaces 3a-3d embed the four record IDs common to Word's "Save as zip", macOS Finder, 7-Zip, and Info-ZIP: custom 0x7878, UT extended timestamp (0x5455), UID/GID Unix v1 (0x7875), and NTFS times (0x000a). Each record carries its own ASCII sentinel inside its `data` payload, so the recovery battery verifies per-record-id stripping. ZipStrategy's policy ("strip every extra-field record except 0x0001") is confirmed against each ID independently, matching mat2's behavior.
- **The PDF inner sentinel is `/Info /Author` only.** The PDF strategy's full sentinel battery (Title / Author / Subject / Producer / Creator / XMP / Annotations / Lang — 10 sentinels) is exercised in [`docs/forensic/pdf.md`](pdf.md). Here we use one sentinel because the point is to verify the recursion path through `ZipStrategy → PdfStrategy`, not to re-test the PDF strategy's depth (which is already covered).
- **The DOCX inner sentinel covers `<dc:creator>` + `<cp:lastModifiedBy>`.** The Office strategy's full gap battery is exercised in [`docs/forensic/office.md`](office.md). Same rationale as the PDF case.
- **The mat2 encrypted-archive refusal is detected via `mat2` exit status, not by reading mat2's stderr.** Surfacing the structured refusal reason from mat2 would require parsing its Python traceback; that's a fragility we deliberately avoid.
- **No `unzip -P` or AES-decryption attempt.** The encrypted fixture uses a bogus ZipCrypto payload (12-byte header + 1 byte of "ciphertext"); the password is unknown and not material to the test. Surface 9 is about refusal behavior, not about whether the encryption is "real."
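The extra-field policy mentioned in the caveats ("strip every extra-field record except 0x0001") can be illustrated at the byte level. An extra field is a sequence of records, each a 2-byte little-endian header ID, a 2-byte little-endian data size, and the data. The sketch below is not how `ZipStrategy` works (it re-emits through JSZip rather than filtering in place); `keepOnlyZip64` is a hypothetical helper that shows what the policy means for a raw extra-field buffer.

```typescript
// Hypothetical helper illustrating the policy on a raw extra-field buffer.
const ZIP64_EXTRA_ID = 0x0001;

function keepOnlyZip64(extra: Uint8Array): Uint8Array {
  const view = new DataView(extra.buffer, extra.byteOffset, extra.byteLength);
  const kept: Uint8Array[] = [];
  let p = 0;
  while (p + 4 <= extra.length) {
    const id = view.getUint16(p, true); // record header ID
    const size = view.getUint16(p + 2, true); // record data size
    const end = p + 4 + size;
    if (end > extra.length) break; // truncated record: drop the remainder
    if (id === ZIP64_EXTRA_ID) kept.push(extra.subarray(p, end));
    p = end;
  }
  // Concatenate the surviving records into a fresh buffer.
  const out = new Uint8Array(kept.reduce((n, r) => n + r.length, 0));
  let offset = 0;
  for (const record of kept) {
    out.set(record, offset);
    offset += record.length;
  }
  return out;
}
```

Applied to a buffer holding a UT (0x5455) record followed by a Zip64 (0x0001) record, only the Zip64 record remains.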
## Reproducing
```bash
# From the project root
npx tsx tools/forensic/zip.ts
```
Outputs go to `/tmp/zip-forensic/`:
- `input-primary.zip` — the 5-entry fixture (surfaces 1-8, counting 3a-3d as one)
- `input-encrypted.zip` — the 1-entry encrypted fixture (surface 9)
- `output-ours.zip` — `ZipStrategy` output
- `output-exiftool.zip` — exiftool-cleaned copy (empty on refusal — exiftool didn't write anything new)
- `output-mat2.zip` — mat2-cleaned copy
- `output-*-encrypted.zip` — encrypted-fixture outputs (mostly refusal copies)
- `output-ours.zip.extracted/`, etc. — per-output extraction tree used by the inner-exiftool + inner-strings channels
- `report.json` — structured per-surface verdict per strip path
Required tools: `exiftool` (`libimage-exiftool-perl`), `mat2`, `unzip`, `zipinfo`, `strings` (`binutils`). All available on Debian/Ubuntu via apt.
Debian/Ubuntu one-liner: `sudo apt install libimage-exiftool-perl mat2 unzip binutils`.
## What this directory is for
`docs/forensic/` documents adversarial recovery tests run *after* implementation lands, complementing `docs/gap-analysis/` (which runs *before* implementation to scope what should be removed). The pattern: implement → unit-test correctness → forensic-test unrecoverability → document the result.
Each format gets its own writeup as we go: `zip.md` here, `pdf.md` / `jpeg.md` / `office.md` / `png.md` / `video.md` for the formats shipped earlier. The runner scripts at `tools/forensic/<format>.ts` stay in the repo so the tests can be re-run any time the strategy changes.
## Captured runner output (2026-05-22)
```text
Sentinels embedded in fixture:
ARCHIVE_CMNT SENTINEL-ARCHIVE-CMNT-A1B2C3
ENTRY_CMNT SENTINEL-ENTRY-CMNT-D4E5F6
EXTRA_FIELD SENTINEL-EXTRA-7G8H9I
JPEG_EXIF SENTINEL-JPEG-EXIF-J1K2L3
PDF_INFO SENTINEL-PDF-INFO-M4N5O6
DOCX_CREATOR SENTINEL-DOCX-P7Q8R9
NESTED_ARCHIVE SENTINEL-NESTED-S1T2U3
ENCRYPTED_INNER SENTINEL-ENCRYPTED-V4W5X6
TIMESTAMP_LITERAL 2023-04-15 (non-epoch)
Primary fixture: /tmp/zip-forensic/input-primary.zip (3793 bytes)
Encrypted fixture: /tmp/zip-forensic/input-encrypted.zip (144 bytes)
=== Stripping primary fixture ===
ZipStrategy: ok (2494 bytes)
exiftool: refused-by-design — ExifTool: 'Writing of ZIP files is not yet supported' — documented limitation per https://exiftool.org/#limitations
mat2: ok (4106 bytes)
=== Results matrix (9 surfaces × 3 strip paths) ===
| Surface | Expected | input | ZipStrategy | exiftool | mat2 |
|--------------------------------------------------|-------------------------|----------|--------------|----------|------------|
| 1. Archive comment | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 2. Per-entry comment | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 3a. Extra field — custom (0x7878) | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 3b. Extra field — UT extended timestamp (0x5455) | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 3c. Extra field — UID/GID (0x7875) | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 3d. Extra field — NTFS times (0x000a) | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 4. Per-entry timestamp | NORMALIZED → 1980-01-01 | SURVIVED | NORMALIZED | REFUSED | NORMALIZED |
| 5. Inner JPEG EXIF Artist | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 6. Inner PDF /Author | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 7. Inner DOCX <dc:creator> | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 8. Nested-zip archive comment | DROPPED | SURVIVED | DROPPED | REFUSED | DROPPED |
| 9. Encrypted entry inner content | KNOWN_GAP | DROPPED | KNOWN_GAP | REFUSED | KNOWN_GAP |
PASS — all 11 (DROPPED/NORMALIZED) strict surfaces verified for ZipStrategy.
```

docs/gap-analysis/zip.md Normal file

@ -0,0 +1,92 @@
# Generic ZIP metadata-stripping gap analysis
**Date:** 2026-05-22
**Goal:** Document the gap between (a) no ZIP support in v5 today, (b) ExifTool's generic-ZIP write (documented as "very limited"), (c) a theoretical thorough rewrite, and (d) the policy that ships in this PR — recursive cleaning of inner files via the strategy registry, plus epoch-normalized per-entry timestamps and scrubbed comments/extra fields. Closes issue #184.
## Methodology
Read:
- PKWARE APPNOTE 6.3.10 (ZIP file format specification) — sections 4.1 (overall format), 4.3 (local file header + central directory record + EOCD), 4.4 (per-field semantics), 4.5 (extra-field records), 4.6 (extensible data fields).
- ExifTool documentation at <https://exiftool.org/TagNames/ZIP.html> and the perl source `lib/Image/ExifTool/ZIP.pm` for the read path; ExifTool docs at <https://exiftool.org/index.html#limitations> for the "Writing of ZIP files is not yet supported" caveat.
- JSZip source (the `generateAsync` writer and the `loadAsync` parser) — confirmed it accepts the `date` option and writes the corresponding DOS-time fields in both the LFH and the central directory record. Confirmed default behavior is `new Date()` (a privacy bug if relied on — see invariants §6).
- mat2 source (`libmat2/archive.py` `ZipParser`) — confirmed mat2 recurses into archive entries, calling format-specific parsers per entry, and rewrites the archive with epoch timestamps. The single archive-level field, the comment, is set to empty on re-emit. This is much closer to our shipping policy than ExifTool's.
Verified empirically (POC in `/tmp/zip-poc/`, not committed):
- Built a 3-entry archive: `photo.jpg` with sentinel EXIF `Artist=SENTINEL-A`, `doc.pdf` with sentinel `/Author=SENTINEL-B`, `note.txt` with no metadata. Archive comment set to `SENTINEL-ARCHIVE`. Per-entry comment on `photo.jpg` set to `SENTINEL-ENTRY`.
- Ran `exiftool -all= -Time:All= -overwrite_original archive.zip`. Recovery battery (`unzip -z` / `unzip -lv` / `exiftool` on extracted entries):
- Archive comment: **dropped** (sole writable field).
- Per-entry comment: **survived** (ExifTool doesn't touch).
- Per-entry timestamps: **survived** (ExifTool doesn't touch).
- Inner JPEG EXIF: **survived** (ExifTool reads through but does not re-write inner entries).
- Inner PDF Info: **survived** (same).
- Ran `mat2 archive.zip` (output: `archive.cleaned.zip`). Same battery:
- Archive comment: **dropped**.
- Per-entry comment: **dropped**.
- Per-entry timestamps: **normalized** to 1980-01-01 00:00:00 (DOS epoch).
- Inner JPEG EXIF: **dropped** (mat2 recurses to its JPEG parser).
- Inner PDF Info: **partial** — Info dict cleared but mat2's PDF backend leaves the same residue that ExifTool's PDF backend does (see `docs/forensic/pdf.md` for the analogous gap in our PDF strategy's reference comparison).
The mat2 result is the closer reference: ExifTool's generic-ZIP write is too thin to constitute a baseline.
## Per-source policy table
| Source | Surface | Current v5 | ExifTool `-all= -Time:All=` | mat2 | Theoretical | Ships in this PR |
|---|---|---|---|---|---|---|
| Archive comment | EOCD `.ZIP file comment` | (no strategy) | dropped | dropped | dropped | **dropped** (empty string on re-emit) |
| Zip64 EOCD comment | Zip64 EOCD `.ZIP file comment` | (no strategy) | not touched | dropped (rewrite) | dropped | **dropped** (re-emit rebuilds EOCD without comment) |
| Per-entry comment | CD entry `file comment` | (no strategy) | not touched | dropped | dropped | **dropped** |
| Per-entry timestamp (LFH) | LFH `last mod file time + date` | (no strategy) | not touched | epoch | epoch (1980-01-01) | **epoch** |
| Per-entry timestamp (CD) | CD entry `last mod file time + date` | (no strategy) | not touched | epoch | epoch | **epoch** |
| Per-entry extra field — UT (0x5455, extended timestamp) | LFH/CD extra field | (no strategy) | not touched | dropped (mat2 strips all non-Zip64 extras) | dropped (UT records leak mtime/atime/ctime) | **dropped** |
| Per-entry extra field — UID/GID (0x7875, Info-ZIP Unix v1) | LFH/CD extra field | (no strategy) | not touched | dropped | dropped (uid/gid identify the creator's user account) | **dropped** |
| Per-entry extra field — NTFS (0x000a, NTFS times) | LFH/CD extra field | (no strategy) | not touched | dropped | dropped (100-ns NTFS times are higher-fidelity than DOS) | **dropped** |
| Per-entry extra field — Zip64 (0x0001) | LFH/CD extra field | n/a | preserved | preserved (structural) | preserved (required for archives > 4 GB) | **preserved** (structural; required for round-trip) |
| Inner JPEG metadata | nested EXIF / XMP / IPTC / Photoshop / Comment | (no strategy) | not touched (no recursion) | dropped (mat2 recurses) | dropped | **dropped** (recursion via selectStrategy → JpegStrategy) |
| Inner PNG metadata | nested tEXt/zTXt/iTXt/eXIf | (no strategy) | not touched | dropped (mat2 recurses) | dropped | **dropped** (recursion via selectStrategy → PngStrategy) |
| Inner PDF metadata | nested Info dict + XMP | (no strategy) | not touched | partial (same gap as ours) | dropped (theoretical) | **dropped** to the same bar as `docs/forensic/pdf.md` for standalone PDFs |
| Inner Office docProps | nested docProps/core.xml etc. | (no strategy) | not touched | dropped (mat2 recurses) | dropped | **dropped** (recursion via selectStrategy → OfficeStrategy) |
| Inner MP4 metadata | nested `moov`/`udta`/`meta` | (no strategy) | not touched | partial (mat2's video coverage) | dropped | **dropped** (recursion via selectStrategy → VideoStrategy) |
| Inner HEIC/AVIF/WebP/GIF metadata | nested boxes/chunks | (no strategy) | not touched | depends on mat2 backend | dropped | **dropped** for formats with a registered strategy (currently HEIC unsupported; AVIF/WebP/GIF via ExifToolFallbackStrategy) |
| Inner nested .zip | recursive archive | (no strategy) | not touched | dropped (recursive) | dropped | **dropped** (recursion: selectStrategy → ZipStrategy → walks again) |
| Encrypted entry content | LFH GP-bit 0 set | n/a | n/a | refused (mat2 fails on encrypted entries) | not strippable without password | **refused** with `invalid-file-format` directing user to a decryption-capable tool — original direction was "pass-through with per-file warning" but JSZip's `loadAsync` refuses any archive containing encrypted entries, blocking the partial-passthrough path at the library level. Implementation note: a byte-level walker bypassing JSZip would unblock passthrough; deferred to a follow-up. |
| Self-extracting EXE stub | bytes before first LFH | n/a | preserved | refused (mat2 won't process SFX) | preserved (modifying breaks SFX) | **preserved** (gap; documented in `PRIVACY_GAPS.md`) |
| Per-entry filename | CD entry `file name` | n/a | preserved | preserved (content, not metadata) | preserved | **preserved** (content, not metadata) |
| Per-entry CRC32 | LFH/CD `crc-32` | n/a | preserved | preserved | preserved (structural) | **preserved** (structural; JSZip recomputes) |
| Per-entry compression method + level | LFH/CD `compression method` | n/a | preserved | normalized to DEFLATE | preserved (don't surprise users with size profile changes) | **preserved** (match input method per-entry) |
| Per-entry internal/external file attributes | CD entry `internal/external file attributes` | n/a | preserved | preserved (filesystem permissions) | preserved (Unix mode bits + DOS attributes are filesystem-level, not user identity) | **preserved** |
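
The extra-field rows above all share one record layout: a 2-byte little-endian header ID, a 2-byte little-endian data size, then the payload. The "drop everything except Zip64" policy can therefore be pictured as a filter over those records. This is a minimal sketch under that assumption; `keepStructuralExtras` is a hypothetical helper for illustration only, not a JSZip API and not necessarily how this PR does it (the recommendation below re-emits entries through JSZip instead of editing extra fields in place).

```typescript
// Hypothetical helper: keep only Zip64 (0x0001) records from a raw extra field.
const ZIP64_HEADER_ID = 0x0001;

function keepStructuralExtras(extra: Uint8Array): Uint8Array {
  const view = new DataView(extra.buffer, extra.byteOffset, extra.byteLength);
  const kept: Uint8Array[] = [];
  let offset = 0;
  // Each record: header ID (u16 LE), data size (u16 LE), then `size` payload bytes.
  while (offset + 4 <= extra.length) {
    const headerId = view.getUint16(offset, true);
    const size = view.getUint16(offset + 2, true);
    const end = offset + 4 + size;
    if (end > extra.length) break; // truncated record: stop instead of guessing
    if (headerId === ZIP64_HEADER_ID) kept.push(extra.subarray(offset, end));
    offset = end;
  }
  // Concatenate the surviving records into one buffer.
  const out = new Uint8Array(kept.reduce((total, record) => total + record.length, 0));
  let position = 0;
  for (const record of kept) {
    out.set(record, position);
    position += record.length;
  }
  return out;
}
```

UT (0x5455), Info-ZIP Unix (0x7875) and NTFS (0x000a) records all fall out of the same branch, which is why the table treats them uniformly.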
## Honest gap summary
**Current v5 (no strategy) vs reference (mat2):** total gap. Generic ZIPs route to "unsupported" today, bypassing every privacy guarantee MetaScrub makes for the inner files. Recursive cleaning is the only architecturally coherent fix.
**ExifTool `-all=` vs mat2:** ExifTool is **not** a viable reference. It writes only the archive-level comment for generic ZIPs and refuses to recurse into entries (it special-cases Office/EPUB/APK/JAR for read-only metadata enumeration only). The ~95% of the surface that matters (inner-file metadata + per-entry timestamps) is untouched. The ExifTool comparison row exists in the forensic battery to make this visible, not as a target to match.
**mat2 vs theoretical:** mat2 is genuinely close to the theoretical maximum on the recursive case. Its weaknesses are inherited from per-format backends (its PDF clean is partial in the same way ExifTool's is; same applies to MP4). On the ZIP-level work (per-entry epoch timestamps, scrubbed comments/extras), mat2 is essentially equivalent to a thorough rewrite. **The shipping policy in this PR matches mat2 at the ZIP level and meets-or-exceeds it on inner formats where MetaScrub has dedicated hand-rolled walkers** (JPEG, PNG, Office, MP4, PDF beat ExifTool at the per-format level — see the per-format forensic docs).
**This PR vs mat2:** identical at the ZIP layer; identical or better at the inner-format layer (we re-use our existing strategies). Encrypted-entry handling ends up the same as mat2's: both refuse archives that contain encrypted entries. The maintainer originally chose pass-through-with-warning (see spec §3), on the rationale that refusing the whole archive is a worse user outcome when most entries are unencrypted, but JSZip's `loadAsync` rejects such archives outright, so this PR returns `invalid-file-format` and defers pass-through to a follow-up (see the "Encrypted entry content" row above).
## Recommendation
Hand-rolled walker over JSZip:
- JSZip is already a production dep (`OfficeStrategy` uses it); no new dependency.
- The library handles the structural concerns we don't want to re-implement (DEFLATE round-trip, central-directory rebuild, Zip64 promotion when needed).
- The metadata we care about (timestamps, comments, extra fields) is reachable via JSZip's per-entry options (`date`, `comment`) or by re-emitting the entry without the metadata payload.
- Encrypted-entry detection: JSZip's `loadAsync` throws `"Encrypted zip are not supported"` on any archive containing encrypted entries. We catch that error and surface a structured `invalid-file-format` result. An earlier implementation used a hand-rolled LFH GP-flag scanner (~30 lines), but it had two blind spots — ZIP64 entries (compressed size = 0xFFFFFFFF in the LFH, real size in the Zip64 extra field) desynchronised the stride math, and data-descriptor entries (GP-flag bit 3) broke the scan on any streaming entry preceding an encrypted one. JSZip's detection runs on the same bytes it parses and has neither blind spot.
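
The catch-and-map step in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `classifyLoadError` and `ZipLoadFailure` are hypothetical names, the user-facing detail string is invented, and matching on the message text assumes JSZip keeps that wording stable across versions.

```typescript
// Hypothetical failure shape; the two `kind` values are the error variants named in the spec.
type ZipLoadFailure =
  | { kind: "invalid-file-format"; detail: string }
  | { kind: "parse-failed"; detail: string };

// Message quoted in the text above. Treating it as a stable marker is an assumption.
const ENCRYPTED_MESSAGE = "Encrypted zip are not supported";

function classifyLoadError(error: unknown): ZipLoadFailure {
  const detail = error instanceof Error ? error.message : String(error);
  if (detail.includes(ENCRYPTED_MESSAGE)) {
    return {
      kind: "invalid-file-format",
      // Invented wording, for illustration only.
      detail: "Archive contains encrypted entries; decrypt it with another tool first.",
    };
  }
  // Anything else (truncated central directory, malformed entry, ...) stays a parse failure.
  return { kind: "parse-failed", detail };
}
```

In use, this would sit in the `catch` around the `loadAsync` call, so the encrypted case never reaches the per-entry walk.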
Library evaluations explicitly ruled out:
- **Rewriting ZIP from scratch.** ~3000 lines of bytewise PKWARE APPNOTE compliance, including DEFLATE, Zip64 promotion thresholds, and central-directory rebuild. Not worth it when JSZip handles the structural surface for us.
- **fflate** (alternative JS zip lib, ~12 KB gzip). Smaller than JSZip but doesn't expose the per-entry comment or extra-field options we need; we'd be writing the same byte-walking code we'd write anyway, just on top of a less-featured library. Adding a second zip library is also a fresh prod dep against the 4-dep ceiling.
## Phase plan
This PR implements the full shipping policy in §"Per-source policy table" plus the per-leaf diff UI tree. Deferred items:
- **Streaming MP4/Office strip for large archives** — out of scope; tracked in #34 (which would also benefit standalone large-file processing).
- **Self-extracting EXE stub scrubbing** — documented gap, requires distinguishing stub-PE bytes from arbitrary leading garbage. Not worth the engineering cost for the audience.
- **Decryption of encrypted entries** — out of scope (no password prompts; see invariants).
- **Multi-disk / spanned archives** — JSZip rejects them; surfaces as `parse-failed`.
- **ZIP64 archives > 4 GB** — Zip64 is supported in pass-through; not sentinel-tested at that scale (would need a multi-GB fixture).


@ -0,0 +1,416 @@
# Issue #184 — Generic ZIP support (recursive cleaning)
- **Status**: Draft — awaiting maintainer review
- **Date**: 2026-05-22
- **Authors**: Randa (with Claude assistance)
- **Implementation plan**: TBD (will live at `docs/superpowers/plans/2026-05-22-issue-184-zip-rollout.md` after this spec is approved)
- **Parent spec**: none — net-new format strategy; obeys [`.claude/rules/format-strategy-workflow.md`](../../../.claude/rules/format-strategy-workflow.md) Phases 1–3
- **Forgejo issue**: #184 ("Support \*.zip and \*.md files")
## 1. Problem
Issue #184 asks for `.zip` and `.md` support. Neither extension is in `SUPPORTED_EXTENSIONS` or any `FormatStrategy` today. Files of either type drop into the UI as "unsupported" and pass through with no processing.
The two extensions are not symmetric:
- **`.md`** is plain text. Markdown has no embedded metadata in the format-structure sense — no EXIF, no docProps, no PDF Info dictionary. The only conceivable target is YAML/TOML frontmatter, which is content, not metadata; stripping it would change the file's meaning. There is nothing for a `FormatStrategy` to do.
- **`.zip`** is a container with real metadata to scrub: per-entry timestamps in both local file headers and the central directory, per-entry comments, per-entry extra fields, and an archive-level comment. Privacy invariant §6 already mandates epoch (1980-01-01) for ZIP central-directory timestamps. Additionally, ZIPs commonly carry user content whose metadata MetaScrub already knows how to strip (JPEGs, PDFs, DOCX, etc.) — without recursive cleaning, dropping an archive bypasses every privacy guarantee the app makes for its inner files.
This spec covers ZIP support only. `.md` is closed as out-of-scope (see §3).
## 2. Scope
Single-PR delivery of generic-ZIP support:
- **New strategy** `ZipStrategy` in `src/infrastructure/wasm/strategies/zip_strategy.ts`, registered in `strategy_registry.ts`.
- **Magic-byte verification** — `PK\x03\x04` (local file header) or `PK\x05\x06` (empty archive EOCD only).
- **Per-entry recursive cleaning** — each file entry's bytes are re-dispatched through `selectStrategy()`. Nested `.zip` entries naturally recurse through `ZipStrategy` again.
- **Zip-level metadata scrub** — per-entry timestamps → epoch, per-entry comments → empty, per-entry extra fields → stripped, archive comment → empty. Applied to every entry kind (directory, encrypted, supported, unsupported).
- **Encrypted-entry pass-through with warning** — content unmodified, zip-level metadata still normalized, per-entry warning surfaced in the UI.
- **Warning surface** — extend `StripResult` with an optional `warnings: readonly string[]`; propagate through `WasmProcessor` → `use_process_files` → an inline disclosure in the file row.
- **Per-inner-file diff tree** — new `ZipExpansion` component that renders the archive's entries as a tree, each leaf independently expandable to its own metadata diff. Reuses an extracted `MetadataDiffTable` sub-component (split out from `MetadataDiffExpansion`). Nested ZIPs recurse: a `nested.zip` row expands to its own `ZipExpansion`. Extend `StripResult` with optional `archiveEntries: readonly ArchiveEntryResult[]`.
- **Forensic verification** — `tools/forensic/zip.ts` runner with sentinel battery across nine surfaces; results in `docs/forensic/zip.md`.
- **Gap analysis** — `docs/gap-analysis/zip.md` documents per-source policy and comparison to ExifTool's generic-ZIP write (which is documented as "very limited" — recursive cleaning is the differentiator).
- **Privacy gaps documented** — encrypted-entry content pass-through and self-extracting EXE stub bytes added to `docs/PRIVACY_GAPS.md`.
- **Issue cleanup** — close #184 with a comment explaining `.md` is out-of-scope.
## 3. Non-goals
- **`.md` support** — closed as out-of-scope. No embedded metadata exists. (Discussed and accepted by maintainer 2026-05-22.)
- **ZIP-bomb / decompression cap** — explicitly declined by maintainer. Trust JSZip defaults. Documented limitation: a malicious archive can OOM the tab.
- **Encrypted archives** — out of scope. **Deviation from the original brainstorming direction:** the user-approved policy was "pass through encrypted entries, normalize zip-level metadata, surface a warning per encrypted entry." JSZip's `loadAsync` refuses any archive containing encrypted entries at the library level (`"Encrypted zip are not supported"`), blocking the partial-passthrough path. To unblock it would require a byte-level ZIP walker bypassing JSZip — significant additional code and a parallel maintenance surface. Shipping policy is therefore: refuse encrypted archives with a clear `invalid-file-format` error directing the user to a decryption-capable tool. Tracked as a follow-up for a future PR if demand surfaces.
- **Multi-disk / spanned archives** — JSZip doesn't support them; `loadAsync` will fail and surface `parse-failed`.
- **Self-extracting EXE stub scrubbing** — bytes before the first local file header are preserved (modifying them breaks the SFX behavior). Documented as a gap.
- **Top-level walker entries / `MetadataDiffExpansion` for the ZIP row itself** — `walkerEntries: []` and `diffDocument: null` at the ZIP-row level. Per-tag detail lives inside each inner file's leaf row in the new `ZipExpansion` tree, not as a flat aggregate. Inner files dropped directly still get the existing single-file diff treatment.
- **Office retrofit to the tree model** — Office files (`.docx`/`.xlsx`/`.pptx`/`.odt`) keep their current flat `walkerEntries`-driven view inside `MetadataDiffExpansion`. Their internal structure is fixed (always `docProps/core.xml`, `customXml/`, etc.), so the source-labelled flat list works. The tree model exists for ZIP's open-ended user-created structure.
- **Translated warning content** — strategy-emitted strings are English. Surrounding UI chrome (`"N warnings"`, chevron label) localizes via `i18nLookup()`. Structured warning codes are a refactor not worth blocking this PR.
- **Virtualized rendering of inner-entry rows** — archives with > 100 entries collapse extras behind a "Show N more entries" button. Full virtualization (react-virtual / windowing) is out of scope; pagination is sufficient.
## 4. Architecture
### 4.1 New module
`src/infrastructure/wasm/strategies/zip_strategy.ts` — implements `FormatStrategy`:
```ts
export class ZipStrategy implements FormatStrategy {
  readonly extensions: ReadonlySet<string> = new Set([".zip"]);

  verifyMagicBytes({ bytes }: { bytes: Uint8Array }): boolean {
    // PK\x03\x04 (local file header) or PK\x05\x06 (empty archive EOCD)
    return (
      bytes.length >= 4 &&
      bytes[0] === 0x50 && bytes[1] === 0x4b &&
      ((bytes[2] === 0x03 && bytes[3] === 0x04) ||
        (bytes[2] === 0x05 && bytes[3] === 0x06))
    );
  }

  async strip({ bytes, options }: {
    bytes: Uint8Array;
    options: StripOptions;
  }): Promise<Result<StripResult, ExifError>> {
    // 1. JSZip.loadAsync(bytes)
    // 2. For each entry, apply per-entry policy (§5)
    // 3. Build output via generateAsync with epoch dates and empty comment
    // 4. Return { bytes, walkerEntries: [], diffDocument: null, warnings }
  }
}
```
### 4.2 Registry placement
Order in `STRATEGIES`:
```ts
const STRATEGIES: readonly FormatStrategy[] = [
  new OfficeStrategy(), // claims .docx/.xlsx/.pptx/.odt
  new VideoStrategy(),
  new JpegStrategy(),
  new PngStrategy(),
  new PdfStrategy(),
  new ZipStrategy(), // NEW — claims .zip
  ...(ENABLE_EXIFTOOL_FALLBACK ? [new ExifToolFallbackStrategy()] : []),
];
```
`OfficeStrategy` is listed first, so a file with extension `.docx` (even if magic-byte verified as a ZIP) routes to Office, not ZIP. A file with extension `.zip` always routes to `ZipStrategy`. A renamed `.docx → .zip` therefore gets the recursive Office-aware treatment — strictly more aggressive than `OfficeStrategy`'s targeted scrub, never less clean.
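
The routing described above can be sketched as a first-match scan over the registry. This is a simplified model, not the real `selectStrategy()`: it assumes a strategy matches when the file extension is in its `extensions` set and its magic-byte check passes, and `SketchStrategy`, `SKETCH_STRATEGIES`, and `selectStrategySketch` are hypothetical names used only here.

```typescript
// Reduced stand-in for FormatStrategy, for illustration only.
interface SketchStrategy {
  readonly name: string;
  readonly extensions: ReadonlySet<string>;
  verifyMagicBytes(args: { bytes: Uint8Array }): boolean;
}

// Both Office files and generic ZIPs start with "PK", so the magic check alone cannot tell them apart.
const startsWithPk = ({ bytes }: { bytes: Uint8Array }): boolean =>
  bytes.length >= 2 && bytes[0] === 0x50 && bytes[1] === 0x4b;

const SKETCH_STRATEGIES: readonly SketchStrategy[] = [
  { name: "office", extensions: new Set([".docx", ".xlsx", ".pptx", ".odt"]), verifyMagicBytes: startsWithPk },
  { name: "zip", extensions: new Set([".zip"]), verifyMagicBytes: startsWithPk },
];

function selectStrategySketch({ filename, bytes }: {
  filename: string;
  bytes: Uint8Array;
}): SketchStrategy | null {
  const dot = filename.lastIndexOf(".");
  const extension = dot === -1 ? "" : filename.slice(dot).toLowerCase();
  // First registered strategy that claims the extension and accepts the bytes wins.
  for (const strategy of SKETCH_STRATEGIES) {
    if (strategy.extensions.has(extension) && strategy.verifyMagicBytes({ bytes })) {
      return strategy;
    }
  }
  return null;
}
```

Under this model the extension decides between Office and ZIP, and the magic bytes only confirm that the container is what the extension claims.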
### 4.3 Recursion model
ZIPs are flat: the entry list contains all paths inline (e.g. `folder/sub/photo.jpg`). There is no descend-into-directories step. `ZipStrategy` walks the flat list once. Recursion happens at two points:
1. **File entry whose bytes match another strategy.** `selectStrategy({ filename, bytes })` is called on the decompressed entry. If a strategy matches, its `.strip()` is awaited and the cleaned bytes go back into the output ZIP under the same name.
2. **Nested `.zip`.** A file entry named `inner.zip` whose bytes are a valid ZIP routes to `ZipStrategy` again via the same `selectStrategy()` call. Real recursion at the archive level.
There is no explicit recursion depth limit. The JS call stack is the implicit bound; in practice we'd run out of memory before exhausting the stack.
### 4.4 Warning propagation + archive-entry tree
**Important architectural fact:** today's `StripResult` has no `metadataRemoved` field, and `diffDocument` is always `null` when returned from a strategy — `WasmProcessor` builds it out-of-band by stashing source + stripped bytes and running `ExifToolDiffStrategy` (the ~7 MB WebPerl-ExifTool WASM) on both, then merging the strategy's `walkerEntries` into the `before` set. The "Cleaned" status pill is binary today; there is no count of removed tags at the row level. The diff itself is what shows what changed.
For ZIP we follow this same pattern, scaled to inner files: per-leaf diffs are built lazily by extending `WasmProcessor` to expose a new method that runs `ExifToolDiffStrategy` on a single archive leaf's source + stripped bytes when the UI asks for it.
Extend `StripResult` and add the archive-entry types in `src/infrastructure/wasm/format_strategy.ts`:
```ts
export type ArchiveEntryStatus =
  | "cleaned"                    // supported file entry, recursive strip ran
  | "passed-through-unsupported" // selectStrategy() returned null
  | "passed-through-encrypted"   // encrypted bit set
  | "directory";                 // zero-byte directory placeholder

export interface ArchiveEntryResult {
  readonly path: string; // "folder/photo.jpg"
  readonly status: ArchiveEntryStatus;
  // For "cleaned" leaves: pre-strip and post-strip bytes for the
  // deferred per-leaf ExifTool diff. Consumed by WasmProcessor and
  // surfaced via the lazy on-expand diff build (§4.5). Null for
  // non-cleaned statuses (no diff to build).
  readonly sourceBytes: Uint8Array | null;
  readonly strippedBytes: Uint8Array | null;
  // Walker entries from the inner strategy (Office docProps, PDF
  // annotations, etc.). Surfaced in the leaf's diff as "removed"
  // source-grouped sections, same as the top-level pattern.
  readonly walkerEntries: readonly MetadataEntry[];
  // RECURSIVE — non-null when the entry is itself a ZIP. Null for
  // all other statuses.
  readonly entries: readonly ArchiveEntryResult[] | null;
  readonly warnings: readonly string[];
}

export interface StripResult {
  readonly bytes: Uint8Array;
  readonly walkerEntries: readonly MetadataEntry[];
  readonly diffDocument: MetadataDocument | null; // unchanged — always null from strategies
  readonly warnings?: readonly string[]; // NEW
  readonly archiveEntries?: readonly ArchiveEntryResult[]; // NEW
}
```
Both new fields are optional: existing strategies (JPEG/PNG/PDF/Office/Video/ExifToolFallback) do not need touching. Only `ZipStrategy` populates them for now.
**Leaf vs. nested-ZIP entries:** a leaf has `entries === null` and (for `cleaned`) carries its own `sourceBytes` + `strippedBytes` so the deferred diff can be built. A nested-ZIP entry has `entries` populated and `sourceBytes === null` (the diffs for that subtree live inside the children's own pairs, one level deeper).
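
The leaf versus nested-ZIP distinction can be made concrete with a small tree walk. This is a sketch over a reduced copy of the `ArchiveEntryResult` shape (`EntryNode` is a hypothetical name, and the byte and walker fields are omitted). The `!/` separator used to label paths inside nested archives is an invented convention; the source does not say how nested paths are keyed.

```typescript
// Reduced copy of the ArchiveEntryResult shape, for illustration only.
interface EntryNode {
  readonly path: string;
  readonly status:
    | "cleaned"
    | "passed-through-unsupported"
    | "passed-through-encrypted"
    | "directory";
  // Non-null only when the entry is itself a ZIP.
  readonly entries: readonly EntryNode[] | null;
}

// Collects the paths of cleaned leaves, descending into nested archives.
function collectCleanedLeafPaths(entries: readonly EntryNode[], prefix = ""): string[] {
  const paths: string[] = [];
  for (const entry of entries) {
    const fullPath = prefix + entry.path;
    if (entry.entries !== null) {
      // Nested ZIP: no diff of its own, its leaves carry the diffs.
      paths.push(...collectCleanedLeafPaths(entry.entries, `${fullPath}!/`));
    } else if (entry.status === "cleaned") {
      paths.push(fullPath);
    }
  }
  return paths;
}
```

A walk of this shape is roughly what the stash step in `WasmProcessor` needs: one visit per cleaned leaf, none for directories or passed-through entries.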
**`WasmProcessor` extension:** add `buildArchiveLeafDiff({ entryId, path }): Promise<MetadataDocument | null>`. After `process()` resolves with `archiveEntries`, WasmProcessor walks the tree and stashes each leaf's `(sourceBytes, strippedBytes, walkerEntries, extension)` into a new `pendingLeafDiffs: Map<string, PendingLeafDiff>` keyed by `${entryId}:${path}`. On `buildArchiveLeafDiff` call:
1. Look up the stash; if absent, return `null` (defensive — caller raced, or the path was already drained).
2. Drain the stash so retries can't double-spend.
3. Run `ExifToolDiffStrategy.readDocument` on the source + stripped bytes (same fire-and-forget pattern as the top-level diff).
4. Merge `walkerEntries` into `before`; return `{ before, after }`.
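
The four steps above can be sketched as a small stash class. This is an illustration under assumptions, not the `WasmProcessor` code: `LeafDiffStash`, `LeafDiffDocument`, and the injected `readTags` function are hypothetical stand-ins (the latter for `ExifToolDiffStrategy.readDocument`), and metadata entries are modelled as plain strings.

```typescript
interface LeafDiffDocument {
  readonly before: readonly string[];
  readonly after: readonly string[];
}

interface PendingLeafDiff {
  readonly sourceBytes: Uint8Array;
  readonly strippedBytes: Uint8Array;
  readonly walkerEntries: readonly string[];
}

// Stand-in for the ExifTool-backed reader.
type ReadTags = (bytes: Uint8Array) => Promise<readonly string[]>;

class LeafDiffStash {
  private readonly pending = new Map<string, PendingLeafDiff>();
  private readonly readTags: ReadTags;

  constructor(readTags: ReadTags) {
    this.readTags = readTags;
  }

  stash(entryId: string, path: string, leaf: PendingLeafDiff): void {
    this.pending.set(`${entryId}:${path}`, leaf);
  }

  async build(entryId: string, path: string): Promise<LeafDiffDocument | null> {
    const key = `${entryId}:${path}`;
    const leaf = this.pending.get(key);
    if (leaf === undefined) return null; // step 1: absent or already drained
    this.pending.delete(key); // step 2: drain before the first await
    const [before, after] = await Promise.all([
      this.readTags(leaf.sourceBytes), // step 3: read both sides
      this.readTags(leaf.strippedBytes),
    ]);
    return { before: [...before, ...leaf.walkerEntries], after }; // step 4: merge
  }
}
```

Draining before the first `await` is the detail that makes a second concurrent call return `null` instead of starting a duplicate read.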
The existing top-level `pendingDiffs` for the ZIP as a whole gets `diffDocument: null` permanently — ZIP rows don't surface a flat diff. The flag `ENABLE_EXIFTOOL_DIFF` continues to gate the entire diff feature; when off, archive-leaf rows show no expansion content (same fallback as the existing top-level path).
**Warning + archive-entry propagation path:**
- `ZipStrategy.strip()` builds the `archiveEntries` tree while walking entries. For each supported file entry it calls `selectStrategy(…).strip(…)` and awaits the result (`strip()` is async; see §4.3), stashes the inner's `walkerEntries` and the source + stripped bytes on the leaf, and forwards inner `archiveEntries` (for nested ZIPs) onto the corresponding `ArchiveEntryResult.entries`. Accumulates flat top-level `warnings` (own warnings + forwarded ones prefixed with `<entry-name>: `).
- `WasmProcessor.process()` surfaces `warnings` + `archiveEntries` in its `ProcessOutcome`. Stashes per-leaf diff inputs into `pendingLeafDiffs`.
- `use_process_files.ts` stores `warnings` + `archiveEntries` on the `FileEntry` state. Per-leaf `diffDocument` and `diffPending` flags live in `ZipExpansion`'s component state (set on first expand of each leaf), not in the global `FileEntry` reducer — this keeps the existing reducer's shape stable.
- **UI**: see §4.5.
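
The warning accumulation in the first bullet can be sketched as below. `InnerOutcome` and `accumulateWarnings` are hypothetical names; the only behaviour taken from the text is that forwarded warnings are prefixed with `<entry-name>: `.

```typescript
interface InnerOutcome {
  readonly entryName: string;
  readonly warnings: readonly string[];
}

// Flattens the archive's own warnings and the forwarded inner ones into one list.
function accumulateWarnings(
  own: readonly string[],
  inner: readonly InnerOutcome[],
): string[] {
  const all = [...own];
  for (const outcome of inner) {
    for (const warning of outcome.warnings) {
      all.push(`${outcome.entryName}: ${warning}`);
    }
  }
  return all;
}
```

Because a nested ZIP applies the same rule before returning, a warning two levels down arrives with both prefixes, for example `inner.zip: photo.jpg: ...`.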
### 4.5 Diff tree UI
**New component** `src/web/components/file-list/ZipExpansion.tsx`. Props: `{ entryId: string; entries: readonly ArchiveEntryResult[] }`. Renders the tree; lazy-loads each leaf's diff on first expand via `window.api.wasm.buildArchiveLeafDiff`.
**Visual layout** (BEM classes prefixed `zip-expansion__`):
```
photo.zip ✓ Cleaned ⓘ 2 warnings ▾
└─ folder/photo.jpg Cleaned ▾
│ [MetadataDiffTable: EXIF/Make removed, …]
└─ folder/sub/document.pdf Cleaned ▸ (collapsed)
└─ folder/ Directory
└─ secret.txt Encrypted — passed through
└─ archive.zip Cleaned ▾
└─ inner/photo.jpg Cleaned ▾
│ [MetadataDiffTable: …]
└─ inner/file.docx Cleaned ▸
```
(Status is binary "Cleaned" — same as the top-level row. A per-leaf tag count would require strategy-side counting that doesn't exist today; deferred.)
**Per-row expansion behavior:**
- **Cleaned leaf** (`status: "cleaned"` + `entries === null`) — has a chevron. First click:
1. Dispatches `await window.api.wasm.buildArchiveLeafDiff({ entryId, path })`.
2. Renders `DiffSkeleton` (reused from `MetadataDiffExpansion.tsx`) while the promise is in flight. Subsequent leaves in the same session don't pay the WASM warm-up (cached by the `ExifToolDiffStrategy` instance).
3. On resolve, if the document is non-null and non-empty, swaps the skeleton for `<MetadataDiffTable document={doc} />`. If null/empty, shows "No metadata detected" inline.
4. The resolved document is cached in `ZipExpansion`'s local state; subsequent collapse/expand is instant.
- **Cleaned nested-ZIP entry** (`status: "cleaned"` + `entries != null`) — has a chevron. Expanding opens `<ZipExpansion entryId={entryId} entries={entry.entries} />` recursively. No diff build for the nested-ZIP row itself; its leaves trigger their own builds when expanded.
- **Encrypted / unsupported / directory rows** — no chevron. Status message rendered inline (`"Encrypted — passed through"`, `"Unsupported — passed through"`, `"Directory"`).
**Indent:** no depth cap on recursion (matches strategy-side recursion). Visual indent caps at level 5; deeper levels inherit level-5 indent (avoids horizontal squeeze on the mobile target). Adversarial-archive safety: render-time depth counter bails out with a `"[depth limit reached — drop the inner file directly]"` row at depth 20.
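
The two limits in the paragraph above reduce to a small placement function. This sketch assumes depth is counted from 0 at the top-level archive and that the bail-out applies once depth reaches 20; `placeRow` and `RowPlacement` are hypothetical names, not components from the PR.

```typescript
const MAX_VISUAL_INDENT = 5; // from the text: indent caps at level 5
const RENDER_DEPTH_LIMIT = 20; // from the text: bail-out row at depth 20

type RowPlacement =
  | { kind: "render"; indentLevel: number }
  | { kind: "depth-limit" };

function placeRow(depth: number): RowPlacement {
  // Adversarial-archive guard: stop rendering children and show the limit row.
  if (depth >= RENDER_DEPTH_LIMIT) return { kind: "depth-limit" };
  // Deeper levels reuse the level-5 indent to avoid horizontal squeeze.
  return { kind: "render", indentLevel: Math.min(depth, MAX_VISUAL_INDENT) };
}
```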
**State management:** per-row expand state + per-leaf cached diff doc both live in `ZipExpansion`'s local state as a `Map<path, { open: boolean; doc: MetadataDocument | null | "pending" | "failed" }>`. Closed by default. Not persisted across reloads.
**Scale guard:** first 100 entries render eagerly. Archives with > 100 entries get a `Show {count} more entries` button at the bottom; clicked repeatedly to walk through the rest.
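
A minimal sketch of the scale guard, assuming each click on the button reveals another 100 entries (the text fixes only the first page size; the step for later pages is an assumption). `visibleWindow` is a hypothetical helper; `remaining` is the value that would feed the `{count}` placeholder.

```typescript
const PAGE_SIZE = 100;

function visibleWindow<T>(
  entries: readonly T[],
  pagesShown: number,
): { visible: readonly T[]; remaining: number } {
  // Always show at least one page, never more than the archive holds.
  const limit = Math.min(entries.length, PAGE_SIZE * Math.max(1, pagesShown));
  return { visible: entries.slice(0, limit), remaining: entries.length - limit };
}
```

When `remaining` is 0 the button is simply not rendered.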
**Status icons (BEM):**
- `.zip-expansion__row--cleaned` — green check (existing `cleaned` color)
- `.zip-expansion__row--encrypted` — neutral muted + small lock icon
- `.zip-expansion__row--unsupported` — neutral muted + existing unsupported icon
- `.zip-expansion__row--directory` — neutral muted + folder icon
**`MetadataDiffTable` extraction:** today `MetadataDiffExpansion.tsx` wraps a `TwoPaneView` component with skeleton-vs-table logic plus an outer `file-table__expansion file-table__diff` chrome div. The refactor:
- Rename `TwoPaneView` → `MetadataDiffTable` and export it from `MetadataDiffTable.tsx` (split from `MetadataDiffExpansion.tsx`).
- `MetadataDiffExpansion` stays as a thin wrapper: skeleton-vs-table decision + outer expansion chrome. Behavior at the top-level FileRow is unchanged.
- `ZipExpansion` leaf rows render `MetadataDiffTable` inside a slimmer leaf wrapper (`zip-expansion__leaf-diff`), and reuse `DiffSkeleton` for the pending state.
**FileRow integration:** in `FileRow.tsx`, when `entry.archiveEntries != null && entry.archiveEntries.length > 0`, the expansion area renders `<ZipExpansion entryId={entry.id} entries={entry.archiveEntries} />` instead of `<MetadataDiffExpansion … />`. The two are mutually exclusive (a ZIP doesn't get a top-level diff view of itself; an Office/JPEG/PDF doesn't get an archive-entry tree).
**`window.api.wasm.buildArchiveLeafDiff`:** new method on the WASM API surface (`src/infrastructure/web/web_api.ts`), wraps `WasmProcessor.buildArchiveLeafDiff`. Signature: `({ entryId: string; path: string }): Promise<MetadataDocument | null>`. Returns null when `ENABLE_EXIFTOOL_DIFF` is off, when the stash was drained, or when the ExifTool read fails.
## 5. Per-entry policy
| Entry kind | Detection | Bytes action | Zip-level metadata action | Warning emitted |
|---|---|---|---|---|
| **Directory entry** | Name ends with `/`; zero data | Pass through (no bytes) | Timestamp → epoch; comment → empty; extra field → strip | No |
| **Encrypted file entry** | General-purpose bit 0 set in local file header | Pass through unchanged | Timestamp → epoch; comment → empty; extra field → strip | Yes: `"Encrypted entry '<name>' — content not cleaned, only zip-level metadata normalized."` |
| **Supported file entry** | `selectStrategy({ filename: entry.name, bytes })` returns non-null | Decompress → recursive strip → use cleaned bytes | Timestamp → epoch; comment → empty; extra field → strip | Forward warnings from recursive call, prefixed with `<entry-name>: ` |
| **Unsupported file entry** | `selectStrategy()` returns null | Pass through unchanged | Timestamp → epoch; comment → empty; extra field → strip | No |
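
The detection column of the table can be read as a decision order. This sketch mirrors the table as written; `EntryFacts` and `classifyEntry` are hypothetical names, and checking directory before encrypted before supported is an assumed ordering. Note that per §3 the shipped build refuses encrypted archives at load time, so the encrypted branch would not be reached when loading through JSZip.

```typescript
type EntryKind = "directory" | "encrypted" | "supported" | "unsupported";

interface EntryFacts {
  readonly name: string;
  readonly generalPurposeFlags: number; // general-purpose bit flag from the local file header
  readonly hasStrategy: boolean; // stand-in for selectStrategy(...) !== null
}

function classifyEntry(entry: EntryFacts): EntryKind {
  if (entry.name.endsWith("/")) return "directory";
  // Bit 0 of the general-purpose flags marks an encrypted entry.
  if ((entry.generalPurposeFlags & 0x0001) !== 0) return "encrypted";
  return entry.hasStrategy ? "supported" : "unsupported";
}
```

Whatever the kind, the zip-level action in the table is the same (epoch timestamp, empty comment, stripped extra field); the kind only decides what happens to the bytes.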
**Archive-level scrub (applied once):**
- Archive comment → empty
- Zip64 EOCD comment → empty
- Bytes prepended before the first local file header (self-extracting stubs) → preserved (modifying breaks SFX; gap)
**Preserved (not metadata):**
- Filenames (content)
- Internal/external file attributes (Unix mode bits, DOS attribute byte — filesystem permissions)
- Compression method + level (structural; we match input mode per-entry)
- CRC32 (structural; recomputed by JSZip on emit)
**Epoch literal:** `new Date(1980, 0, 1)`, passed explicitly to JSZip's `date` option on every entry write. Default JSZip behavior is `new Date()` — a privacy bug — per [`.claude/rules/privacy-invariants.md`](../../../.claude/rules/privacy-invariants.md) §6.
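
For reference, this is how a date maps onto the 16-bit MS-DOS date and time fields that ZIP headers store. The sketch reads the UTC components of the `Date`, which is an assumption about how a writer derives the fields; `toDosDateTime` and `EPOCH_UTC` are hypothetical names, not JSZip APIs.

```typescript
// MS-DOS packing: date = (year - 1980) << 9 | month << 5 | day,
// time = hours << 11 | minutes << 5 | seconds / 2 (2-second resolution).
function toDosDateTime(date: Date): { dosDate: number; dosTime: number } {
  const dosDate =
    ((date.getUTCFullYear() - 1980) << 9) |
    ((date.getUTCMonth() + 1) << 5) |
    date.getUTCDate();
  const dosTime =
    (date.getUTCHours() << 11) |
    (date.getUTCMinutes() << 5) |
    (date.getUTCSeconds() >> 1);
  return { dosDate, dosTime };
}

// Midnight UTC on 1980-01-01: the smallest date the format can represent.
const EPOCH_UTC = new Date(Date.UTC(1980, 0, 1));
```

Under this packing the epoch encodes as date `0x0021` and time `0x0000`. If the ZIP writer does read UTC components (worth checking against the JSZip version in use), the local-time literal `new Date(1980, 0, 1)` would fall on 1979-12-31 in time zones east of UTC, so the surface-4 timestamp check is worth running under a non-UTC `TZ` as well.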
## 6. Output
**Re-emit** via `JSZip.generateAsync({ type: "uint8array" })`. JSZip rebuilds the central directory; we don't preserve input byte layout. Per-entry compression method matches input (`DEFLATE` stays `DEFLATE`, `STORE` stays `STORE`) so cleaned archives don't surprise users by changing size profile.
**No `metadataRemoved` count.** Today's `StripResult` has no such field and the FileRow doesn't render a per-file count — it shows binary `"Cleaned"` / `"Already clean"` pills. ZIPs follow the same shape: a successful strip produces `"Cleaned"`. The actual "what changed" surface is the `ZipExpansion` tree (§4.5), where each leaf's diff is the user-visible record of removed metadata. Adding a top-level "Cleaned · N entries" count is a follow-up not blocking this PR.
**Result shape:**
```ts
{
ok: true,
value: {
bytes: outputZipBytes,
walkerEntries: [], // ZipStrategy doesn't contribute to a flat walker view
diffDocument: null, // No top-level diff for the ZIP itself (per-leaf diffs lazy-build via ZipExpansion)
warnings: [...],
archiveEntries: [...], // Tree of inner entries — see §4.4 / §4.5
}
}
```
**Error variants returned:**
- `invalid-file-format` — magic-byte mismatch.
- `parse-failed` — JSZip's `loadAsync` throws (truncated central directory, malformed entry, multi-disk archive). Detail carries JSZip's message.
- `file-io-error` — `generateAsync` throws (e.g. OOM on huge archives, since no expansion cap by maintainer direction).
## 7. Forensic verification
Per Phase 3 of [`format-strategy-workflow.md`](../../../.claude/rules/format-strategy-workflow.md), this is the shipping gate.
### 7.1 Sentinel surfaces
| # | Surface | Sentinel | Recovery commands | Expected |
|---|---|---|---|---|
| 1 | Archive comment | `SENTINEL-ARCHIVE-CMNT-A1B2C3` | `unzip -z file.zip`, `zipinfo -z`, `strings \| grep SENTINEL` | DROPPED |
| 2 | Per-entry comment | `SENTINEL-ENTRY-CMNT-D4E5F6` | `unzip -lv`, `zipinfo -v` | DROPPED |
| 3 | Per-entry extra field (custom 0x7878 record) | `SENTINEL-EXTRA-7G8H9I` | `strings`, raw byte scan of central dir | DROPPED |
| 4 | Per-entry timestamp (LFH + central) | `2023-04-15 14:32:11` (non-epoch) | `unzip -l`, `zipinfo`, `bsdtar -tvf` | NORMALIZED → 1980-01-01 |
| 5 | Inner JPEG EXIF `Artist` | `SENTINEL-JPEG-EXIF-J1K2L3` | Extract `inner.jpg` → `exiftool -a -G1 -s \| grep SENTINEL` | DROPPED |
| 6 | Inner PDF `/Author` | `SENTINEL-PDF-INFO-M4N5O6` | Extract `inner.pdf` → `exiftool -a`, `qpdf --qdf`, `strings` | DROPPED |
| 7 | Inner DOCX `<dc:creator>` | `SENTINEL-DOCX-P7Q8R9` | Extract `inner.docx` → unzip → grep `docProps/core.xml` | DROPPED |
| 8 | Nested-zip archive comment (recursion test) | `SENTINEL-NESTED-S1T2U3` | Extract `inner.zip` from cleaned outer → `unzip -z inner.zip` | DROPPED |
| 9 | **Encrypted entry's inner EXIF (KNOWN GAP)** | `SENTINEL-ENCRYPTED-V4W5X6` | Extract encrypted entry (pass through) → sentinel survives by design | SURVIVES (documented gap) |
**Bar:** zero sentinel survivors across every recovery command for surfaces 1–8. Surface 9 goes into `KNOWN_GAPS` in `tools/forensic/zip.ts` and `docs/PRIVACY_GAPS.md`.
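
The `strings | grep SENTINEL` style of check amounts to a raw byte search over the cleaned archive. A minimal sketch follows; `containsSentinel` is a hypothetical helper, not the runner's actual code. It only finds sentinels stored uncompressed (comments, extra fields, STORE entries); sentinels inside DEFLATE-compressed entries need the extract-then-inspect commands from the table.

```typescript
// Naive byte-level search for an ASCII sentinel in a buffer.
function containsSentinel(haystack: Uint8Array, sentinel: string): boolean {
  const needle = new TextEncoder().encode(sentinel);
  if (needle.length === 0 || needle.length > haystack.length) return false;
  outer: for (let start = 0; start <= haystack.length - needle.length; start++) {
    for (let offset = 0; offset < needle.length; offset++) {
      if (haystack[start + offset] !== needle[offset]) continue outer;
    }
    return true;
  }
  return false;
}
```

A per-surface verdict is then "DROPPED" when this returns `false` for the cleaned bytes and `true` for the fixture it was built from.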
### 7.2 Cross-tool reference run
For each fixture, also pipe input through `exiftool -all= -Time:All= -overwrite_original` and run the same recovery battery. Expected failures: surfaces 2, 3, 4, 5, 6, 7, 8 (ExifTool's generic-ZIP write is documented as limited and does not recurse into entries). The result table cites this verbatim — same pattern as `docs/forensic/pdf.md`.
### 7.3 Files produced
- `docs/gap-analysis/zip.md` — Phase 1: ZIP/APPNOTE walkthrough (LFH, CD, EOCD, Zip64 EOCD, extra-field records 0x000a/0x5455/0x7875/etc.), per-source policy table, ExifTool comparison, recommendation = hand-rolled.
- `docs/forensic/zip.md` — Phase 3: results table, interpretation paragraph, caveats (encrypted pass-through, no expansion cap, SFX stubs).
- `tools/forensic/zip.ts` — reproducible runner. Builds fixtures via a **self-contained ZIP builder** independent of JSZip (adversarial-independence per format-strategy-workflow.md — same rationale as `tools/forensic/video.ts`'s `walkAtoms`). Runs strip, runs recovery battery, prints per-surface verdict, exits non-zero on UNEXPECTED survivors.
### 7.4 Fixture sources
- **Synthetic** — built by the runner with controlled sentinel placement. Sufficient for v1.
- **Real-world (optional)** — `tools/forensic/fetch-zip-fixtures.sh` (parallels the video script): one CC0 archive from archive.org. Skip if it adds friction.
### 7.5 Explicitly not verified
- Multi-disk / spanned archives (JSZip rejects them).
- Self-extracting EXEs with prepended stubs (stub preserved; gap).
- ZIP64 archives > 4 GB (theoretically supported; not sentinel-tested at that scale).
## 8. Files touched
### New
- `src/infrastructure/wasm/strategies/zip_strategy.ts`
- `tests/infrastructure/wasm/zip_strategy.test.ts`
- `src/web/components/file-list/ZipExpansion.tsx` — recursive tree component.
- `src/web/components/file-list/MetadataDiffTable.tsx` — extracted from `MetadataDiffExpansion.tsx` so `ZipExpansion` leaves can reuse the two-pane table without the outer expansion chrome.
- `tests/web/components/file-list/ZipExpansion.test.tsx` — render + expansion-state tests.
- `src/web/styles/zip-expansion.css` — BEM classes for the tree.
- `tools/forensic/zip.ts`
- `docs/gap-analysis/zip.md`
- `docs/forensic/zip.md`
### Modified
- `src/infrastructure/wasm/strategy_registry.ts` — register `ZipStrategy`.
- `src/domain/files/file_types.ts` — add `.zip` to `SUPPORTED_EXTENSIONS`.
- `src/infrastructure/wasm/format_strategy.ts` — add optional `warnings: readonly string[]` and `archiveEntries: readonly ArchiveEntryResult[]` to `StripResult`; define the `ArchiveEntryResult` + `ArchiveEntryStatus` types.
- `src/infrastructure/wasm/wasm_processor.ts` — propagate `warnings` + `archiveEntries` through `ProcessOutcome`; add `pendingLeafDiffs: Map<string, PendingLeafDiff>` keyed by `${entryId}:${path}`; add `buildArchiveLeafDiff({ entryId, path })` method.
- `src/application/ports/metadata_processor_port.ts` — extend `MetadataProcessorPort` and `ProcessOutcome` to include the new fields and `buildArchiveLeafDiff` method.
- `src/infrastructure/web/web_api.ts` — add `wasm.buildArchiveLeafDiff` to the `WebApi` surface.
- `src/web/env.d.ts` — type the new `window.api.wasm.buildArchiveLeafDiff`.
- `src/web/hooks/use_process_files.ts` — store `warnings` + `archiveEntries` on `FileEntry`.
- `src/web/components/file-list/FileRow.tsx` — inline `ⓘ N warnings ▾` disclosure; when `archiveEntries != null && archiveEntries.length > 0`, render `<ZipExpansion>` in the expansion area instead of `<MetadataDiffExpansion>`.
- `src/web/components/file-list/MetadataDiffExpansion.tsx` — extract its `TwoPaneView` into the new `MetadataDiffTable.tsx`; this file becomes a thin wrapper (skeleton handling + outer chrome). Also export `DiffSkeleton` so `ZipExpansion` can reuse it.
- `src/web/styles/file-list.css` — BEM classes `.file-list__warnings`, `.file-list__warnings-toggle`, `.file-list__warnings-list`, `.file-list__warning-item`.
- `src/web/main.tsx` — import the new `zip-expansion.css`.
- `.resources/strings.json` — keys `warnings.label` ("warning" / "warnings"), `warnings.toggleAria` ("Show warnings for {name}" / "Hide warnings for {name}"), `zipExpansion.statusCleaned` ("Cleaned"), `zipExpansion.statusEncrypted` ("Encrypted — passed through"), `zipExpansion.statusUnsupported` ("Unsupported — passed through"), `zipExpansion.statusDirectory` ("Directory"), `zipExpansion.showMore` ("Show {count} more entries"), `zipExpansion.depthLimit` ("Depth limit reached — drop the inner file directly"), `zipExpansion.noMetadata` ("No metadata detected"), `zipExpansion.diffFailed` ("Couldn't load diff — internal error").
- `docs/PRIVACY_GAPS.md` — encrypted-entry content pass-through; SFX stub bytes.
- `README.md` (Format Support Matrix) — add `.zip` row.
## 9. Test plan
### Vitest unit tests (`tests/infrastructure/wasm/zip_strategy.test.ts`)
- Magic-byte verification: accepts `PK\x03\x04`, accepts empty-archive `PK\x05\x06`, rejects junk.
- Empty archive: round-trips with archive comment scrubbed.
- Single-entry archive with non-epoch timestamp: emitted timestamp is 1980-01-01.
- Archive with non-empty archive comment: emitted comment is empty.
- Archive with entry that has non-empty entry comment / extra field: scrubbed.
- Archive containing a JPEG with EXIF sentinel: inner JPEG re-emitted without sentinel.
- Archive containing an encrypted entry: passes through; warning emitted with entry name.
- Archive containing a nested `archive.zip` whose inner archive has a comment: outer cleaning recurses; inner archive comment also scrubbed.
- Truncated central directory: returns `parse-failed`.
- Renamed `.docx → .zip` routes to `ZipStrategy` (sanity, not a regression).
- ZIP containing a JPEG: result's `archiveEntries[0]` has `status: "cleaned"`, populated `sourceBytes` + `strippedBytes`, and matching `path`.
- ZIP containing a nested ZIP: outer `archiveEntries[0].entries` is non-null and recursively contains the inner archive's entries; nested entry's `sourceBytes` is null (nested-ZIP node, not a leaf).
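
The timestamp case above depends on how a `Date` maps onto the 16-bit DOS date and time words stored in ZIP headers. The sketch below assumes the UTC-accessor behaviour described in the `ZIP_EPOCH` comment in `zip_strategy.ts`. `toDosStamp` and `fromDosStamp` are hypothetical helpers for illustration, not JSZip APIs.

```ts
// DOS date word: bits 15-9 = year - 1980, bits 8-5 = month, bits 4-0 = day.
// DOS time word: bits 15-11 = hours, bits 10-5 = minutes, bits 4-0 = seconds / 2.
interface DosStamp {
  readonly date: number;
  readonly time: number;
}

// Hypothetical encoder that reads the UTC fields of the Date (assumption:
// this mirrors what the ZIP writer does, per the ZIP_EPOCH comment).
function toDosStamp(d: Date): DosStamp {
  // The year field is 7 bits wide, so any year before 1980 wraps around.
  const year = (d.getUTCFullYear() - 1980) & 0x7f;
  const date = (year << 9) | ((d.getUTCMonth() + 1) << 5) | d.getUTCDate();
  const time =
    (d.getUTCHours() << 11) | (d.getUTCMinutes() << 5) | (d.getUTCSeconds() >> 1);
  return { date, time };
}

// Hypothetical decoder, useful for asserting on emitted header fields.
function fromDosStamp(s: DosStamp): {
  year: number;
  month: number;
  day: number;
  hours: number;
  minutes: number;
  seconds: number;
} {
  return {
    year: 1980 + ((s.date >> 9) & 0x7f),
    month: (s.date >> 5) & 0x0f,
    day: s.date & 0x1f,
    hours: (s.time >> 11) & 0x1f,
    minutes: (s.time >> 5) & 0x3f,
    seconds: (s.time & 0x1f) * 2,
  };
}

// Noon UTC on 1980-01-01 lands on the DOS minimum date in every timezone.
const epochStamp = toDosStamp(new Date(Date.UTC(1980, 0, 1, 12, 0, 0)));
```

An instant whose UTC year is 1979 wraps the 7-bit year field, which is why a unit test on the emitted timestamp is better written against decoded header fields than against a `Date` built in local time.
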
### Vitest unit tests for `StripResult` shape + `WasmProcessor.buildArchiveLeafDiff`
- Existing strategy `strip()` returns can omit `warnings` and `archiveEntries`; defaults at consumption sites are `[]` / `null`.
- `WasmProcessor.process()` propagates `warnings` + `archiveEntries` through `ProcessOutcome`.
- After processing a ZIP, `WasmProcessor.pendingLeafDiffs` contains one entry per cleaned leaf, keyed by `${entryId}:${path}`.
- `buildArchiveLeafDiff({ entryId, path })` returns a `MetadataDocument` with walker entries merged into `before`; second call for the same path returns null (stash drained).
- When `ENABLE_EXIFTOOL_DIFF` is off, `buildArchiveLeafDiff` returns null regardless.
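
The stash behaviour these tests describe (one entry per leaf, keyed by `${entryId}:${path}`, drained on first read) can be sketched as a small map wrapper. `LeafStash` and its method names are hypothetical. The real `WasmProcessor` internals are not shown here, and the prefix-based eviction assumes entry ids never contain a colon.

```ts
// Stand-in for the PendingLeafDiff type named in the spec (shape assumed).
interface PendingLeafDiff {
  readonly sourceBytes: Uint8Array;
  readonly strippedBytes: Uint8Array;
}

// Hypothetical sketch of a drain-on-read stash; not the repo's implementation.
class LeafStash {
  private readonly pending = new Map<string, PendingLeafDiff>();

  private static keyOf(entryId: string, path: string): string {
    return `${entryId}:${path}`;
  }

  put(args: { entryId: string; path: string; leaf: PendingLeafDiff }): void {
    this.pending.set(LeafStash.keyOf(args.entryId, args.path), args.leaf);
  }

  // Returns the stashed bytes once, then forgets them.
  take(args: { entryId: string; path: string }): PendingLeafDiff | null {
    const key = LeafStash.keyOf(args.entryId, args.path);
    const leaf = this.pending.get(key) ?? null;
    this.pending.delete(key);
    return leaf;
  }

  // Evicts everything belonging to one top-level file entry.
  // Assumption: entry ids contain no ":" so the prefix match is unambiguous.
  clearEntry(entryId: string): void {
    const prefix = `${entryId}:`;
    for (const key of Array.from(this.pending.keys())) {
      if (key.startsWith(prefix)) this.pending.delete(key);
    }
  }

  get size(): number {
    return this.pending.size;
  }
}
```

Draining on first read keeps peak memory closer to what the user actually expands, at the cost of needing a separate cache if a second expand should return the same answer.
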
### Vitest unit tests (`tests/web/components/file-list/ZipExpansion.test.tsx`)
- Renders one row per `ArchiveEntryResult` with correct status icon/label.
- Cleaned-leaf row: clicking the chevron first dispatches `buildArchiveLeafDiff`, renders `DiffSkeleton` during the await, then swaps in `MetadataDiffTable` when the promise resolves with a non-empty doc. Subsequent expand/collapse re-renders the cached table instantly.
- Cleaned-leaf row with null diff result: renders "No metadata detected" inline instead of the skeleton-then-table.
- Nested-ZIP row: clicking the chevron expands and renders a recursive `<ZipExpansion>` showing the inner entries (no diff build triggered for the nested node itself).
- Encrypted / unsupported / directory rows: no chevron rendered.
- > 100 entries: first 100 render eagerly; `Show {count} more entries` button appears; click reveals the next 100.
- Indent depth at level 6+ matches level 5 (no further squeeze). At depth 20, render the depth-limit row in place of further children.
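
The pagination and depth rules in these cases reduce to a few pure functions, sketched below. The constant and function names are hypothetical, and the meaning of `{count}` in the button label (taken here as the size of the next page) is an assumption. The values 100, 5, and 20 come from the test cases above.

```ts
// Values from the test plan; names are illustrative only.
const PAGE_SIZE = 100;
const MAX_INDENT_LEVEL = 5;
const RENDER_DEPTH_LIMIT = 20;

// Number of rows rendered after `pagesShown` pages have been revealed.
function visibleCount(total: number, pagesShown: number): number {
  return Math.min(total, PAGE_SIZE * Math.max(1, pagesShown));
}

// Assumed value for the "Show {count} more entries" label: the next page
// size, or 0 when nothing is hidden (in which case no button is needed).
function nextPageCount(total: number, pagesShown: number): number {
  return Math.min(PAGE_SIZE, total - visibleCount(total, pagesShown));
}

// Indentation stops growing past level 5.
function indentLevel(depth: number): number {
  return Math.min(depth, MAX_INDENT_LEVEL);
}

// At the render depth limit, the depth-limit row replaces further children.
function atDepthLimit(depth: number): boolean {
  return depth >= RENDER_DEPTH_LIMIT;
}
```

Keeping these as pure functions would let the component tests assert on the arithmetic separately from the rendering.
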
### Playwright web e2e (`tests/e2e/web/file-processing.spec.ts`)
- Drop a fixture ZIP with one EXIF-tagged JPEG inside; assert download triggered with a cleaned `.zip`; download the result, re-open via `JSZip` from the test runner, assert inner JPEG no longer has the sentinel EXIF tag.
- Drop a fixture ZIP with an encrypted entry; assert UI shows `ⓘ 1 warning` disclosure with the expected English text.
- Drop a fixture ZIP with two cleaned JPEGs + one nested ZIP containing a third cleaned JPEG; expand the row, assert three top-level entry rows + one expandable nested row; expand the nested row, assert the inner JPEG row is visible; expand the inner JPEG row, wait for diff skeleton → table transition, assert a `MetadataDiffTable` renders with at least one `removed`-status pair containing the sentinel value.
### Forensic runner (gated, not in CI)
- `npx tsx tools/forensic/zip.ts`: exit code 0 with no UNEXPECTED survivors for surfaces 1–8. Attached output goes into the PR description.
## 10. Risks + open questions
- **JSZip's central-directory rebuild is opaque.** We assume passing `date: new Date(1980, 0, 1)` produces a 1980-01-01 entry in both LFH and CD; the forensic battery (surface 4) will confirm. If JSZip writes "now" in LFH and only honors `date` in CD, we'll add a post-emit byte-patch pass before merging.
- **Encrypted-entry detection edge cases.** General-purpose bit 0 covers ZipCrypto + AES; some archivers use the `0x9901` AE-x extra-field record without setting bit 0. The unit test will include both; if the second class isn't caught we'll add explicit AE-x detection.
- **Routing collision: renamed Office docs.** Documented in §4.2 — strictly more aggressive than Office routing, never less clean. Adding a regression test in the Playwright suite to catch any future cleaning-regression.
- **Bundle size impact.** JSZip is already a production dep (OfficeStrategy uses it). No new dep; bundle change should be < 5 KB minified (just the new strategy + warning UI). Confirm in PR description against `dist/web-standalone/index.html` size.
- **Performance on big archives.** A 500-MB archive with 1000 JPEGs stripped sequentially through `processFileEntries` will block the main thread for a while. We don't have worker-thread offloading yet (#34). Not a blocker for v1; documented in the unsupported-format / size-cap track.
- **`pendingLeafDiffs` memory cost.** Holding source + stripped bytes for every supported entry in every ZIP in a batch ≈ 2× the supported-content size of the batch in RAM. Existing top-level `pendingDiffs` has the same shape (peak ≈ batch_size); this just multiplies by inner-file count. Mitigation: drain the leaf stash on first `buildArchiveLeafDiff` call (same as top-level). For users who never expand a leaf, the stash lives until the batch unmounts. Document this in `PRIVACY_GAPS.md` alongside the existing batch-size note.
- **Lazy-load UX surprise.** First click on a cleaned leaf in a session pays the WASM warm-up (~100–300 ms warm, 3–5 s cold, per `docs/poc/webperl-exiftool.md`). The existing `dispatchExifToolDiffLoading` toast handles this for top-level diffs; reuse the same mechanism so the user gets the same passive cue when expanding a leaf for the first time.
- **`MetadataDiffTable` extraction touches a recently-shipped surface.** The two-pane diff (#177 / chunk B.1) is from 2026-05-21. Risk: extraction breaks a still-stabilising component. Mitigation: keep `MetadataDiffExpansion`'s public API identical (same props, same render output at the top-level FileRow), only move the inner `TwoPaneView` to a new file. Add a snapshot/render regression test on `MetadataDiffExpansion` to catch any visual delta.
- **Deeply nested archives render-loop risk.** `ZipExpansion` recursively rendering itself for nested `.zip` entries could pathologically stack if a malicious archive declares itself as containing itself (zip quine). The recursion depth is bounded by JS stack, but render churn could hang the tab. Mitigation: cap the indent at depth 5 visually and add a render-time depth counter that surfaces a `[depth limit reached — drop the inner file directly]` row at depth 20. Strategy-side recursion is unchanged.
- **Mobile / APK touch UX on the tree.** Inline chevrons at multiple indent levels can be hard to hit. The existing `FileRow` chevron is already there; new `zip-expansion__row__chevron` follows the same hit-target size (44×44 minimum per the existing touch UX track #49). Verify on the APK target before merging.
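
For the encrypted-entry detection risk listed above, a per-record check on a central directory header could look like the sketch below. `looksEncrypted` is a hypothetical name, not the repo's detection function, whose body is not shown here. Field offsets follow the standard 46-byte central directory header layout, and the sketch checks one record at a given offset rather than walking the whole directory.

```ts
const CDH_SIGNATURE = 0x02014b50; // "PK\x01\x02" read as little-endian
const CDH_FIXED_SIZE = 46;
const AEX_EXTRA_ID = 0x9901; // WinZip AE-x extra-field header ID

// Hypothetical check for one central directory header starting at `offset`.
function looksEncrypted(bytes: Uint8Array, offset: number): boolean {
  if (offset < 0 || offset + CDH_FIXED_SIZE > bytes.byteLength) return false;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  if (view.getUint32(offset, true) !== CDH_SIGNATURE) return false;

  const flags = view.getUint16(offset + 8, true); // general-purpose bit flag
  const method = view.getUint16(offset + 10, true); // compression method
  if ((flags & 0x0001) !== 0) return true; // bit 0: entry is encrypted
  if ((flags & 0x0040) !== 0) return true; // bit 6: strong encryption
  if (method === 99) return true; // method 99: WinZip AES

  // Walk the extra field (ID, size, data triples) looking for AE-x, which
  // covers archivers that add the record without setting bit 0.
  const nameLen = view.getUint16(offset + 28, true);
  const extraLen = view.getUint16(offset + 30, true);
  let p = offset + CDH_FIXED_SIZE + nameLen;
  const end = Math.min(p + extraLen, bytes.byteLength);
  while (p + 4 <= end) {
    const id = view.getUint16(p, true);
    const size = view.getUint16(p + 2, true);
    if (id === AEX_EXTRA_ID) return true;
    p += 4 + size;
  }
  return false;
}
```

A full scanner would locate the end-of-central-directory record first and call a check like this once per entry; that walk is omitted to keep the sketch short.
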
## 11. Out of scope (deferred or declined)
- Streaming / chunked ZIP processing for archives larger than RAM (would intersect with #34).
- A structured warning type with i18n codes + params (refactor; not blocking).
- Office retrofit to the new `archiveEntries` tree model (Office files keep their flat source-labelled view inside `MetadataDiffExpansion`; the tree is for ZIP's open-ended structure).
- Virtualized rendering of inner-entry rows (pagination handles the 99% case).
- ZIP-bomb decompression cap (explicitly declined by maintainer).
- Decryption of encrypted entries (out of scope — no password prompts).
- Repairs to self-extracting EXE stub bytes (out of scope — documented gap).
- `.md` support (closed as out-of-scope; markdown has no embedded metadata).

View file

@ -20,6 +20,7 @@ const STANDALONE_HTML = resolve(
);
const SAMPLE_JPG = resolve(__dirname, "../tests/e2e/fixtures/sample.jpg");
const SAMPLE_DOCX = resolve(__dirname, "../tests/e2e/fixtures/sample.docx");
const SAMPLE_ZIP = resolve(__dirname, "../tests/e2e/fixtures/sample-zip.zip");
async function captureBreakpoint({
label,
@ -27,6 +28,7 @@ async function captureBreakpoint({
fixture,
emulate,
screenshotHeight,
expandZipLeaf,
}: {
label: string;
viewport: { width: number; height: number };
@ -35,6 +37,11 @@ async function captureBreakpoint({
? never
: typeof devices.iPhone14;
screenshotHeight?: number;
// When set, treat the dropped fixture as a ZIP archive and ALSO
// expand the named inner-leaf row after the top-level expansion,
// so the screenshot shows the per-leaf MetadataDiffTable inside
// the ZipExpansion tree (per the spec's lazy-on-expand UI).
expandZipLeaf?: string;
}): Promise<void> {
const browser = await chromium.launch();
const context = await browser.newContext({
@ -73,9 +80,31 @@ async function captureBreakpoint({
await row.click();
}
await page
.locator(".file-table__diff--two-pane")
.waitFor({ state: "visible", timeout: 10_000 });
if (expandZipLeaf !== undefined) {
// ZIP archives render a tree instead of the two-pane diff. Wait
// for the tree, then click the named inner row to lazy-load its
// per-leaf MetadataDiffTable.
await page
.locator(".zip-expansion")
.waitFor({ state: "visible", timeout: 10_000 });
const leafRow = page
.locator(".zip-expansion__row--cleaned", { hasText: expandZipLeaf })
.first();
await leafRow.waitFor({ state: "visible", timeout: 10_000 });
if (emulate !== undefined) {
await leafRow.tap();
} else {
await leafRow.click();
}
await page
.locator(".zip-expansion__leaf-diff")
.first()
.waitFor({ state: "visible", timeout: 30_000 });
} else {
await page
.locator(".file-table__diff--two-pane")
.waitFor({ state: "visible", timeout: 10_000 });
}
// Small pause for animations + toast to settle.
await page.waitForTimeout(3500);
@ -125,6 +154,15 @@ async function main(): Promise<void> {
emulate: devices["iPhone 14"],
});
// ZIP: ZipExpansion tree + per-leaf diff for the inner JPEG. Two
// captures — wider tree for the desktop view; narrower for mobile.
await captureBreakpoint({
label: "desktop-zip",
viewport: { width: 1280, height: 1100 },
fixture: SAMPLE_ZIP,
expandZipLeaf: "photo.jpg",
});
console.log("done.");
}

View file

@ -1,6 +1,7 @@
// Application layer barrel file — re-exports commands, queries, ports, and use cases.
export type {
LeafDiffResult,
MetadataProcessorPort,
ProcessOutcome,
} from "./ports/metadata_processor_port";

View file

@ -1,10 +1,29 @@
import type { Result } from "../../common";
import type { ExifError, MetadataDocument, StripOptions } from "../../domain";
import type {
ArchiveEntryResult,
ExifError,
MetadataDocument,
StripOptions,
} from "../../domain";
// Discriminated result of a leaf-diff build. The "failed" variant lets the
// UI render a "Diff failed" message instead of conflating an internal
// error with a successfully-empty diff (which should render "Already
// clean").
export type LeafDiffResult =
| { readonly kind: "ok"; readonly doc: MetadataDocument }
| { readonly kind: "failed" };
export interface ProcessOutcome {
readonly outputPath: string;
readonly outputBytes: number;
readonly diffDocument: MetadataDocument | null;
// Strategy-emitted non-fatal warnings (currently only ZipStrategy).
readonly warnings?: readonly string[];
// Recursive tree of inner archive entries (currently only ZipStrategy
// populates). The per-leaf diffs are built lazily on-demand via
// `buildArchiveLeafDiff`.
readonly archiveEntries?: readonly ArchiveEntryResult[];
}
export interface MetadataProcessorPort {
@ -26,4 +45,23 @@ export interface MetadataProcessorPort {
buildDiffDocumentForEntry(args: {
entryId: string;
}): Promise<MetadataDocument | null>;
// Out-of-band ExifTool read for a single leaf inside a ZIP archive.
// Source + stripped bytes per leaf are stashed during `process()` when
// the strategy returns `archiveEntries`. UI calls this lazily on first
// expand of a leaf row in `ZipExpansion`. Caches the result so re-opens
// (including after the parent ZIP collapses and unmounts ZipExpansion)
// return the same answer instead of falling off the drained stash.
// Returns `{kind: "failed"}` for "couldn't build diff" (read error or
// cache miss); the UI renders "Diff failed" — distinct from the
// successful-but-empty case rendered as "Already clean".
buildArchiveLeafDiff(args: {
entryId: string;
path: string;
}): Promise<LeafDiffResult>;
// Evicts cached leaf diffs (and pending bytes) for a given entryId so
// parsed metadata doesn't linger after the user removes the file from
// the app state. See privacy invariant §3.
clearLeafCacheForEntry(args: { entryId: string }): void;
}
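
A minimal sketch of how a caller might branch on the discriminated `LeafDiffResult` follows. The types are simplified stand-ins, since the real `MetadataDocument` shape is not shown here, and treating a document with empty `before` and `after` lists as "already clean" is an assumption. The string keys are the ones added to the strings file in this commit.

```ts
// Stand-in for MetadataDocument (real shape not shown in this diff).
interface DocSketch {
  readonly before: readonly string[];
  readonly after: readonly string[];
}

type LeafDiffResultSketch =
  | { readonly kind: "ok"; readonly doc: DocSketch }
  | { readonly kind: "failed" };

// What the leaf row would render: a diff table or a one-line message.
type LeafView =
  | { readonly view: "table"; readonly doc: DocSketch }
  | {
      readonly view: "message";
      readonly stringKey: "zipExpansion.alreadyClean" | "zipExpansion.diffFailed";
    };

// Hypothetical mapping from port result to UI state.
function leafView(result: LeafDiffResultSketch): LeafView {
  if (result.kind === "failed") {
    return { view: "message", stringKey: "zipExpansion.diffFailed" };
  }
  // Assumption: an "ok" result with nothing on either side means the leaf
  // had no metadata, which is distinct from a failed build.
  const isEmpty =
    result.doc.before.length === 0 && result.doc.after.length === 0;
  return isEmpty
    ? { view: "message", stringKey: "zipExpansion.alreadyClean" }
    : { view: "table", doc: result.doc };
}
```

Keeping the failure case as its own variant means the UI never has to guess whether a missing document was an error or a clean file.
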

View file

@ -0,0 +1,53 @@
// Pure value types describing one entry inside an archive container
// (currently only produced by ZipStrategy). The recursive tree is
// rendered by ZipExpansion.tsx; per-leaf diffs are built lazily by
// WasmProcessor.buildArchiveLeafDiff using the stashed source +
// stripped bytes.
//
// Types live in the domain layer so the application port can reference
// them without importing from infrastructure.
import type { MetadataEntry } from "../exif/metadata_document";
// Per spec §3, v1 refuses encrypted archives outright; no entry ever
// reaches the per-entry walk with encrypted bytes. The
// "passed-through-encrypted" status that earlier drafts of the spec
// referenced is therefore not part of this union — adding it back is
// a follow-up that should accompany a byte-level walker capable of
// processing the archive without going through JSZip.
// "cleaned" = strategy ran and, for a leaf, either the output bytes
// differ from the input bytes in any way (length OR content) or the
// strategy emitted walker entries. Covers explicit metadata removal
// (shrinks) AND in-place normalisation (ZIP CDH timestamps, PDF
// re-encode, MP4 re-mux: same length, different content).
// "already-clean" = strategy ran and the output bytes are byte-identical to
// the input bytes. The file had nothing to remove and the
// strategy emitted its input verbatim (typical for JPEG/
// PNG with no removable metadata). Still expandable so the
// user can see the ExifTool diff confirming the file is
// clean. Mirrors the top-level Complete vs NoMetadataFound
// pattern.
export type ArchiveEntryStatus =
| "cleaned"
| "already-clean"
| "passed-through-unsupported"
| "directory";
export interface ArchiveEntryResult {
readonly path: string;
readonly status: ArchiveEntryStatus;
// For expandable leaves ("cleaned" or "already-clean"): pre-strip and
// post-strip bytes for the deferred per-leaf ExifTool diff. Consumed by
// WasmProcessor.buildArchiveLeafDiff. Null for every other status and
// for nested-ZIP nodes (no diff to build).
readonly sourceBytes: Uint8Array | null;
readonly strippedBytes: Uint8Array | null;
// Walker entries from the inner strategy (Office docProps, PDF
// annotations, etc.). Merged into the leaf's diff `before` set,
// same as the top-level pattern in
// WasmProcessor.buildDiffDocumentForEntry.
readonly walkerEntries: readonly MetadataEntry[];
// RECURSIVE — non-null when this entry is itself a ZIP; null otherwise.
readonly entries: readonly ArchiveEntryResult[] | null;
readonly warnings: readonly string[];
}
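
Because `entries` makes the type recursive, most consumers end up as short tree walks. The sketch below uses a trimmed stand-in for `ArchiveEntryResult`; `anyCleanedLeaf` and `collectLeafPaths` are hypothetical helpers, and the `!/` separator for nested paths is an arbitrary choice for illustration.

```ts
type StatusSketch =
  | "cleaned"
  | "already-clean"
  | "passed-through-unsupported"
  | "directory";

// Trimmed stand-in for ArchiveEntryResult (only the fields the walk needs).
interface EntrySketch {
  readonly path: string;
  readonly status: StatusSketch;
  readonly entries: readonly EntrySketch[] | null;
}

// True when any leaf anywhere in the tree was actually cleaned.
function anyCleanedLeaf(nodes: readonly EntrySketch[]): boolean {
  return nodes.some((n) =>
    n.entries !== null ? anyCleanedLeaf(n.entries) : n.status === "cleaned",
  );
}

// Flat list of file paths, with nested archives shown as "outer.zip!/inner".
function collectLeafPaths(
  nodes: readonly EntrySketch[],
  parent = "",
): string[] {
  const out: string[] = [];
  for (const n of nodes) {
    if (n.status === "directory") continue; // directories carry no content
    if (n.entries !== null) {
      out.push(...collectLeafPaths(n.entries, `${parent}${n.path}!/`));
    } else {
      out.push(`${parent}${n.path}`);
    }
  }
  return out;
}
```

A nested archive contributes its children rather than itself, which matches the convention that only leaves carry stashed bytes.
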

View file

@ -40,6 +40,8 @@ export const SUPPORTED_EXTENSIONS: ReadonlySet<string> = new Set([
".xlsx",
".pptx",
".odt",
// Archives
".zip",
]);
interface IsSupportedFileParams {

View file

@ -27,6 +27,10 @@ export { middleTruncatePath } from "./path_truncation";
export type { ExifError } from "./exif/exif_errors";
export { formatExifError } from "./exif/exif_errors";
export type { MetadataEntry, MetadataDocument } from "./exif/metadata_document";
export type {
ArchiveEntryResult,
ArchiveEntryStatus,
} from "./files/archive_entry";
export type { SettingsError } from "./settings_errors";
export { formatSettingsError } from "./settings_errors";
export type { FolderError } from "./files/folder_errors";

View file

@ -1,5 +1,6 @@
import type { Result } from "../../common";
import type {
ArchiveEntryResult,
ExifError,
MetadataDocument,
MetadataEntry,
@ -7,6 +8,7 @@ import type {
} from "../../domain";
export type { StripOptions };
export type { ArchiveEntryResult, ArchiveEntryStatus } from "../../domain";
export interface StripResult {
readonly bytes: Uint8Array;
@ -24,6 +26,21 @@ export interface StripResult {
// WasmProcessor (not the strategy) after a successful strip, via
// ExifToolDiffStrategy. Strategies themselves return null here.
readonly diffDocument: MetadataDocument | null;
// Non-fatal per-file warnings emitted by the strategy. Surfaced as
// an inline disclosure on the FileRow. Reserved for future use:
// the originally-planned encrypted-entry passthrough messages
// would have populated this, but v1 refuses encrypted archives
// outright (spec §3) so no current strategy emits warnings. Kept
// optional + threaded through so adding the byte-level walker
// follow-up doesn't require touching every reducer + the WebApi
// surface again.
readonly warnings?: readonly string[];
// Recursive tree of inner entries for archive formats. Currently only
// ZipStrategy populates this. UI: src/web/components/file-list/ZipExpansion.tsx
// renders the tree; per-leaf diffs are lazy-loaded via
// WasmProcessor.buildArchiveLeafDiff. See
// docs/superpowers/specs/2026-05-22-issue-184-zip-support-design.md.
readonly archiveEntries?: readonly ArchiveEntryResult[];
}
export interface FormatStrategy {

View file

@ -0,0 +1,605 @@
import JSZip from "jszip";
import type { Result } from "../../../common";
import type { ExifError, MetadataEntry } from "../../../domain";
import type {
ArchiveEntryResult,
ArchiveEntryStatus,
FormatStrategy,
StripOptions,
StripResult,
} from "../format_strategy";
// The inner-entry router is injected by strategy_registry.ts after both
// modules have evaluated, via `setZipStrategyRouter(selectStrategy)`.
// Avoids the static circular import (registry → ZipStrategy + ZipStrategy
// → registry) that would otherwise observe an uninitialized export
// binding. The slot is set during module-init side effects of the
// registry, so any consumer that imports strategy_registry (the
// production path; test files do this too) has it ready by call time.
type InnerRouter = (args: {
filename: string;
bytes: Uint8Array;
}) => FormatStrategy | null;
let injectedRouter: InnerRouter | null = null;
// Module-level flag so the dev/test "router not injected" warning fires
// at most once per session even if many strip() calls happen — avoids
// drowning test output when a fixture battery runs through a directly-
// constructed ZipStrategy. The first call surfaces the issue; subsequent
// calls stay silent.
let routerMissingWarned = false;
export function setZipStrategyRouter(router: InnerRouter): void {
injectedRouter = router;
}
// Privacy-invariant §6 canonical output values, applied uniformly to
// every entry regardless of what the source archive carried:
//
// ZIP_EPOCH — DOS-time minimum for all timestamps.
// Constructed in UTC at noon so JSZip's getUTC*
// accessors land on 1980-01-01 in every timezone.
// Local-time construction (e.g. new Date(1980, 0, 1))
// wraps in timezones ahead of UTC (UTC+N), where local
// midnight still falls in UTC year 1979 → DOS year
// underflow → year 2108 on read-back (confirmed via
// JSZip round-trip POC, May 2026).
//
// unixPermissions — Canonical 0o644 (files) / 0o755 (dirs). Unix mode
// bits in the ZIP central directory leak the
// producer's umask; we normalise them like timestamps.
//
// dosPermissions — 0 (normal file). DOS attribute flags can carry
// archive/hidden/system bits that also identify the
// producing tool/OS; zeroed in the output.
//
// compression — DEFLATE via generateAsync. Pinned explicitly so the
// output codec is a stated choice rather than an
// accident of the JSZip default (STORE), which would
// expand any entry that was DEFLATE-compressed in the
// source. Uniform codec also avoids fingerprinting by
// per-entry compression-method variance.
const ZIP_EPOCH = new Date(Date.UTC(1980, 0, 1, 12, 0, 0));
const UNIX_PERMS_FILE = 0o644;
const UNIX_PERMS_DIR = 0o755;
const DOS_PERMS_NORMAL = 0;
// Hard cap on ZIP nesting depth. Without it, a chain of ZIPs-in-ZIPs
// would recurse without bound: each level fully decompresses + re-zips
// into memory, and classic ZIP-bomb ratios make this a real DoS surface.
// 10 levels covers every legitimate real-world nesting pattern we've seen.
const MAX_NESTING_DEPTH = 10;
// Hard cap on total decompressed bytes across ALL recursion levels in a
// single top-level strip. The counter is threaded through every nested
// stripAtDepth call via the shared `byteBudget` object so an adversarial
// 10-level-nested archive can't allocate 10 × cap by resetting the
// counter at each level (a real bug in the previous local-counter
// implementation).
const MAX_DECOMPRESSED_BYTES = 2 * 1024 * 1024 * 1024;
export class ZipStrategy implements FormatStrategy {
readonly extensions: ReadonlySet<string> = new Set([".zip"]);
verifyMagicBytes({ bytes }: { bytes: Uint8Array }): boolean {
if (bytes.length < 4) return false;
if (bytes[0] !== 0x50 || bytes[1] !== 0x4b) return false;
// PK\x03\x04 = local file header (any archive with entries);
// PK\x05\x06 = end of central directory (empty archive only).
return (
(bytes[2] === 0x03 && bytes[3] === 0x04) ||
(bytes[2] === 0x05 && bytes[3] === 0x06)
);
}
async strip({
bytes,
options,
}: {
bytes: Uint8Array;
options: StripOptions;
}): Promise<Result<StripResult, ExifError>> {
// Fresh budget per top-level strip; threaded through recursive calls
// so cumulative decompression across nested levels is bounded.
return this.stripAtDepth({
bytes,
options,
depth: 0,
byteBudget: { used: 0 },
});
}
private async stripAtDepth({
bytes,
options,
depth,
byteBudget,
}: {
bytes: Uint8Array;
options: StripOptions;
depth: number;
// Shared mutable counter across the entire recursive call graph for
// one top-level strip. See MAX_DECOMPRESSED_BYTES comment.
byteBudget: { used: number };
}): Promise<Result<StripResult, ExifError>> {
if (!this.verifyMagicBytes({ bytes })) {
return {
ok: false,
error: {
code: "invalid-file-format",
detail: "Not a ZIP archive (magic bytes don't match)",
},
};
}
if (depth >= MAX_NESTING_DEPTH) {
return {
ok: false,
error: {
code: "invalid-file-format",
detail: `ZIP nesting limit (${MAX_NESTING_DEPTH}) exceeded — possible ZIP bomb or adversarially-nested archive`,
},
};
}
if (injectedRouter === null && !routerMissingWarned) {
// Dev/test invariant — production always imports strategy_registry.ts
// which calls setZipStrategyRouter(). Warn once per session so test
// mis-setup surfaces faster than "inner entries silently pass through
// uncleaned" without spamming N copies of the message.
routerMissingWarned = true;
console.warn(
"[ZipStrategy] inner router not injected — inner entries will not be " +
"cleaned. Import strategy_registry.ts or call setZipStrategyRouter() " +
"before use.",
);
}
const selectStrategy: InnerRouter = injectedRouter ?? (() => null);
// Detect encrypted entries by scanning the central directory ourselves
// (bit 0 = ZipCrypto, bit 6 = strong/AES, method 99 = WinZip AES).
// JSZip's loadAsync throws on bit 0 only; bit 6 and method 99 entries
// would silently pass through and emit garbled output. CDH scanning
// avoids the LFH-scanner blind spots (ZIP64 size-overflow markers,
// streaming-data-descriptor entries with size=0) because CDH records
// always carry the real values for these fields. ZIP64 archives with
// EOCD-level overflow markers are an explicit gap — we defer to
// JSZip in that branch (bit 0 still caught; bit 6 documented as a
// rare gap for ZIP64).
if (archiveHasEncryptedEntries(bytes)) {
return {
ok: false,
error: {
code: "invalid-file-format",
detail:
"Encrypted ZIP archives aren't supported — use a dedicated tool (7-Zip, ExifTool standalone) that can decrypt to clean inner content.",
},
};
}
let zip: JSZip;
try {
zip = await JSZip.loadAsync(bytes);
} catch (err: unknown) {
const msg = err instanceof Error ? err.message : String(err);
// Pinned to JSZip's exact throw string — fragile to a JSZip wording
// change, but bounded to the bit-0 fallback path that the CDH
// scanner above would have caught anyway. If JSZip's wording ever
// changes we'll surface parse-failed, not lose privacy.
if (msg === "Encrypted zip are not supported") {
return {
ok: false,
error: {
code: "invalid-file-format",
detail:
"Encrypted ZIP archives aren't supported — use a dedicated tool (7-Zip, ExifTool standalone) that can decrypt to clean inner content.",
},
};
}
return {
ok: false,
error: {
code: "parse-failed",
raw: msg,
},
};
}
const archiveEntries: ArchiveEntryResult[] = [];
const warnings: string[] = [];
// Collect first — we rely on JSZip's loadAsync inserting entries in
// central-directory order (a consequence of CD-parse order inside
// loadAsync; not an explicitly documented guarantee, but a stable
// implementation detail). The output's entry order mirrors the input's
// as a result.
const entries: Array<[string, JSZip.JSZipObject]> = [];
zip.forEach((path, entry) => entries.push([path, entry]));
const outputZip = new JSZip();
// Pre-emit all parent-folder entries with canonical ZIP_EPOCH BEFORE
// adding any file. Without this, JSZip's fileAdd auto-creates missing
// parent directories via folderAdd, which falls back to
// `o.date || new Date()` — leaking the current processing time into
// the output central directory in violation of privacy invariant §6.
// Info-ZIP and many archive tools omit explicit directory entries, so
// this path fires on common real-world inputs, not just adversarial
// ones.
const allPaths = entries.map(([p]) => p);
for (const folderPath of collectParentFolders(allPaths)) {
outputZip.file(folderPath, "", {
date: ZIP_EPOCH,
comment: "",
dir: true,
unixPermissions: UNIX_PERMS_DIR,
dosPermissions: DOS_PERMS_NORMAL,
});
}
for (const [path, entry] of entries) {
if (entry.dir) {
outputZip.file(path, "", {
date: ZIP_EPOCH,
comment: "",
dir: true,
unixPermissions: UNIX_PERMS_DIR,
dosPermissions: DOS_PERMS_NORMAL,
});
archiveEntries.push({
path,
status: "directory",
sourceBytes: null,
strippedBytes: null,
walkerEntries: [],
entries: null,
warnings: [],
});
continue;
}
// Pre-check declared uncompressed size from JSZip's parsed CDH
// BEFORE allocating the decompressed Uint8Array. Without this, a
// single small-compressed huge-decompressed entry (classic ZIP-bomb
// shape) would OOM the tab during entry.async() before the budget
// check below ever runs. The post-check below remains as defence
// in case the CDH lied about the size.
const reportedSize = getReportedUncompressedSize(entry);
if (
reportedSize !== null &&
byteBudget.used + reportedSize > MAX_DECOMPRESSED_BYTES
) {
return budgetExceededError();
}
const inputEntryBytes = await entry.async("uint8array");
byteBudget.used += inputEntryBytes.byteLength;
if (byteBudget.used > MAX_DECOMPRESSED_BYTES) {
return budgetExceededError();
}
const innerStrategy = selectStrategy({
filename: path,
bytes: inputEntryBytes,
});
let outputEntryBytes = inputEntryBytes;
let status: ArchiveEntryStatus = "passed-through-unsupported";
let innerWalkerEntries: readonly MetadataEntry[] = [];
let innerArchiveEntries: readonly ArchiveEntryResult[] | null = null;
const entryWarnings: string[] = [];
if (innerStrategy !== null) {
if (innerStrategy instanceof ZipStrategy) {
const inner = await innerStrategy.stripAtDepth({
bytes: inputEntryBytes,
options,
depth: depth + 1,
byteBudget,
});
if (inner.ok) {
outputEntryBytes = inner.value.bytes;
innerArchiveEntries = inner.value.archiveEntries ?? null;
// Nested-ZIP status: only "cleaned" if a deep entry actually
// had metadata removed. We deliberately DON'T use byte-shrink
// here because the inner ZIP gets re-encoded with our
// canonical perms/timestamps/compression, which can shrink
// bytes even when no inner-inner entry had any metadata
// removed (false-positive 'cleaned' on already-clean nested
// archives with non-canonical permissions).
status =
innerArchiveEntries !== null &&
hasAnyCleanedInTree(innerArchiveEntries)
? "cleaned"
: "already-clean";
for (const w of inner.value.warnings ?? []) {
const prefixed = `${path}: ${w}`;
warnings.push(prefixed);
entryWarnings.push(prefixed);
}
} else {
// Surface the refusal reason (encryption, depth limit, etc.)
// as a warning so the user knows this nested ZIP was left
// untouched rather than silently passing through. Asymmetric
// with the outer level where the caller sees the error
// directly; for inner ZIPs, warning-and-continue lets the
// rest of the outer archive still be processed.
const detail = innerErrorDetail(inner.error);
const w = `${path}: nested ZIP left untouched — ${detail}`;
warnings.push(w);
entryWarnings.push(w);
}
} else {
const inner = await innerStrategy.strip({
bytes: inputEntryBytes,
options,
});
if (inner.ok) {
outputEntryBytes = inner.value.bytes;
innerWalkerEntries = inner.value.walkerEntries;
// "cleaned" iff the strategy's output bytes differ from the
// input bytes in any way (length OR content), OR walker
// entries were emitted. Length-only comparison was too
// strict: Office, PDF, and ffmpeg-MP4 strategies routinely
// produce same-length output with different content (ZIP
// CDH timestamps normalized, PDF re-encoded, MP4 re-muxed
// with stripped mvhd boxes) — those changes ARE the
// cleaning and the diff confirms them, but the byte-length
// check rendered them as "Already clean" misleadingly.
// Byte-by-byte comparison correctly identifies these as
// "cleaned" while still showing the no-op pass-through
// case (JPEG/PNG with no removable metadata, where the
// strategy emits its input verbatim) as "already-clean".
status =
innerWalkerEntries.length > 0 ||
!bytesAreIdentical(outputEntryBytes, inputEntryBytes)
? "cleaned"
: "already-clean";
for (const w of inner.value.warnings ?? []) {
const prefixed = `${path}: ${w}`;
warnings.push(prefixed);
entryWarnings.push(prefixed);
}
}
// inner.ok === false → magic-byte mismatch on a misnamed file;
// left as passed-through-unsupported with no warning because a
// misnamed file is an input-data fact, not a privacy event.
}
}
outputZip.file(path, outputEntryBytes, {
date: ZIP_EPOCH,
comment: "",
unixPermissions: UNIX_PERMS_FILE,
dosPermissions: DOS_PERMS_NORMAL,
});
// Only stash bytes for actual LEAVES that the user can expand to
// see a diff. Nested-ZIP entries (innerArchiveEntries !== null)
// have their own leaves stashed during the recursive call, and the
// UI never builds a top-level diff for the nested ZIP itself —
// holding the full re-encoded bytes on the parent entry would be
// pure memory bloat that lives in AppContext for the FileEntry's
// lifetime.
const isExpandableLeaf =
(status === "cleaned" || status === "already-clean") &&
innerArchiveEntries === null;
archiveEntries.push({
path,
status,
sourceBytes: isExpandableLeaf ? inputEntryBytes : null,
strippedBytes: isExpandableLeaf ? outputEntryBytes : null,
walkerEntries: innerWalkerEntries,
entries: innerArchiveEntries,
warnings: entryWarnings,
});
}
let outputBytes: Uint8Array;
try {
outputBytes = await outputZip.generateAsync({
type: "uint8array",
comment: "",
// DEFLATE: see ZIP_EPOCH comment block above.
compression: "DEFLATE",
});
} catch (err: unknown) {
return {
ok: false,
error: {
code: "file-io-error",
detail: err instanceof Error ? err.message : String(err),
},
};
}
return {
ok: true,
value: {
bytes: outputBytes,
walkerEntries: [],
diffDocument: null,
warnings,
archiveEntries,
},
};
}
}
// Byte-by-byte equality check on two Uint8Arrays. Used to decide whether a
// strategy actually changed the entry's content (timestamps normalized, EXIF
// stripped, re-muxed) vs. truly passed through verbatim. Compares 4 bytes at
// a time via DataView for speed on larger entries.
function bytesAreIdentical(a: Uint8Array, b: Uint8Array): boolean {
if (a.byteLength !== b.byteLength) return false;
const len = a.byteLength;
const va = new DataView(a.buffer, a.byteOffset, len);
const vb = new DataView(b.buffer, b.byteOffset, len);
const chunks = len >>> 2;
for (let i = 0; i < chunks; i++) {
const offset = i << 2;
if (va.getUint32(offset) !== vb.getUint32(offset)) return false;
}
for (let i = chunks << 2; i < len; i++) {
if (a[i] !== b[i]) return false;
}
return true;
}
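To illustrate the status rule that `bytesAreIdentical` feeds, here is a minimal sketch of why a length-only check misreports a same-length rewrite. The names `sameBytes` and `classifyLeaf` are hypothetical and not part of this commit; `sameBytes` is a plain loop standing in for the faster 4-byte comparison above.

```typescript
type LeafStatus = "cleaned" | "already-clean";

// Plain byte-wise comparison (hypothetical stand-in for bytesAreIdentical).
function sameBytes(a: Uint8Array, b: Uint8Array): boolean {
  if (a.byteLength !== b.byteLength) return false;
  for (let i = 0; i < a.byteLength; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

// Hypothetical helper mirroring the status expression in the strip loop:
// walker entries OR any byte difference means the leaf was cleaned.
function classifyLeaf(
  input: Uint8Array,
  output: Uint8Array,
  walkerEntryCount: number,
): LeafStatus {
  return walkerEntryCount > 0 || !sameBytes(input, output)
    ? "cleaned"
    : "already-clean";
}

const sourceBytesSample = Uint8Array.from([0x50, 0x4b, 0x03, 0x04, 0x2a]);
// Same length as the source, but the last byte was rewritten.
const sameLengthRewrite = Uint8Array.from([0x50, 0x4b, 0x03, 0x04, 0x00]);
// A verbatim pass-through copy.
const verbatimCopy = Uint8Array.from(sourceBytesSample);

const rewriteStatus = classifyLeaf(sourceBytesSample, sameLengthRewrite, 0);
const passThroughStatus = classifyLeaf(sourceBytesSample, verbatimCopy, 0);
```

A length-only check would put both samples in the same bucket; the content comparison separates them.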
// Recursively walks an archiveEntries tree and returns true if any entry has
// status "cleaned". Used to promote a nested ZIP's own status from
// "already-clean" to "cleaned" when at least one deep entry actually had
// metadata removed — the outer ZIP shouldn't show "Already clean" just
// because its byte size didn't shrink overall.
function hasAnyCleanedInTree(entries: readonly ArchiveEntryResult[]): boolean {
for (const entry of entries) {
if (entry.status === "cleaned") return true;
if (entry.entries !== null && hasAnyCleanedInTree(entry.entries))
return true;
}
return false;
}
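The promotion rule described above can be tried on a small tree. This is a sketch under assumptions: `EntryNode` is a trimmed, hypothetical stand-in for `ArchiveEntryResult` that keeps only the two fields the walk reads, and `anyCleaned` repeats the same recursion.

```typescript
// Hypothetical, trimmed stand-in for ArchiveEntryResult.
interface EntryNode {
  readonly status: string;
  readonly entries: readonly EntryNode[] | null;
}

// Depth-first search for any "cleaned" node, at any nesting level.
function anyCleaned(nodes: readonly EntryNode[]): boolean {
  for (const node of nodes) {
    if (node.status === "cleaned") return true;
    if (node.entries !== null && anyCleaned(node.entries)) return true;
  }
  return false;
}

// A nested ZIP whose only cleaned leaf sits two levels down.
const deepTree: readonly EntryNode[] = [
  { status: "already-clean", entries: null },
  {
    status: "already-clean",
    entries: [
      {
        status: "already-clean",
        entries: [{ status: "cleaned", entries: null }],
      },
    ],
  },
];

// A tree where nothing was cleaned.
const untouchedTree: readonly EntryNode[] = [
  { status: "already-clean", entries: null },
];

const deepTreeHasCleaned = anyCleaned(deepTree);
const untouchedTreeHasCleaned = anyCleaned(untouchedTree);
```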
// Extracts a human-readable detail string from an ExifError for use in
// warning messages when an inner ZIP strip is refused.
function innerErrorDetail(error: ExifError): string {
if (
error.code === "invalid-file-format" ||
error.code === "file-io-error" ||
error.code === "exiftool-error"
) {
return error.detail;
}
if (error.code === "parse-failed") {
return error.raw;
}
return error.code;
}
// Walks every entry path and returns the set of unique parent-folder paths
// (with trailing slash). Used to pre-emit those folders into outputZip with
// canonical ZIP_EPOCH timestamps, so JSZip's fileAdd never auto-creates a
// parent with `new Date()` — privacy invariant §6 violation otherwise.
function collectParentFolders(
entryPaths: readonly string[],
): readonly string[] {
const parents = new Set<string>();
for (const path of entryPaths) {
const segments = path.split("/");
for (let i = 1; i < segments.length; i++) {
// Skip empty prefixes (paths starting with "/")
if (segments[i - 1] === "") continue;
const prefix = segments.slice(0, i).join("/") + "/";
parents.add(prefix);
}
}
return [...parents].sort();
}
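A worked example may make the folder pre-emission easier to follow. The sketch below is a standalone copy of the helper under the hypothetical name `parentFolders` (using `Array.from` instead of spread so it runs on older compile targets), applied to a few sample paths.

```typescript
// Standalone copy of the parent-folder walk, for experimentation only.
function parentFolders(entryPaths: readonly string[]): readonly string[] {
  const parents = new Set<string>();
  for (const path of entryPaths) {
    const segments = path.split("/");
    for (let i = 1; i < segments.length; i++) {
      // An empty segment just before the cut would produce a prefix
      // ending in "//" (or a bare "/"), so it is skipped.
      if (segments[i - 1] === "") continue;
      parents.add(segments.slice(0, i).join("/") + "/");
    }
  }
  return Array.from(parents).sort();
}

const sampleFolders = parentFolders([
  "docs/a/b.txt", // contributes "docs/" and "docs/a/"
  "docs/c.txt", // contributes "docs/" again (deduplicated by the Set)
  "top.txt", // no parent folder
  "/x/y.txt", // leading slash: the bare "/" is skipped, "/x/" is kept
]);
// sampleFolders is ["/x/", "docs/", "docs/a/"]
```

Each returned folder would then be written into the output archive with the canonical timestamp before any file that lives under it.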
// JSZipObject's parsed-CDH uncompressed size lives on the internal _data
// CompressedObject. Reading it via a typed accessor lets us pre-check
// against the byte budget BEFORE allocating the decompressed buffer.
// Returns null when the field isn't present (older JSZip versions,
// mocked instances) so the caller falls back to the post-allocation
// check.
function getReportedUncompressedSize(entry: JSZip.JSZipObject): number | null {
const data = (entry as unknown as { _data?: unknown })._data;
// `typeof data === "object"` already rules out undefined, so only the
// null case needs a separate guard.
if (data !== null && typeof data === "object" && "uncompressedSize" in data) {
const size = (data as { uncompressedSize: unknown }).uncompressedSize;
if (typeof size === "number" && Number.isFinite(size) && size >= 0) {
return size;
}
}
return null;
}
function budgetExceededError(): Result<StripResult, ExifError> {
return {
ok: false,
error: {
code: "file-io-error",
detail: `ZIP archive expands to more than ${MAX_DECOMPRESSED_BYTES / 1024 ** 3} GB when decompressed`,
},
};
}
// Scans the central directory for entries flagged as encrypted (general-
// purpose-flag bit 0 = ZipCrypto, bit 6 = strong/AES) or using compression
// method 99 (WinZip AE-1/AE-2 AES). Walks CDH records, NOT LFH records —
// CDH carries the real values for these fields even when the entry uses
// streaming data descriptors or ZIP64 size overflow at the LFH level,
// avoiding the two blind spots that made the earlier LFH-based scanner
// unreliable.
//
// ZIP64 caveat: if the standard EOCD record uses overflow markers
// (0xFFFFFFFF / 0xFFFF) for CD location, we fall back to false and rely
// on JSZip's loadAsync to throw on bit 0. Documented gap: bit-6 / method-99
// encryption inside ZIP64 archives is not detected here.
function archiveHasEncryptedEntries(bytes: Uint8Array): boolean {
if (bytes.length < 22) return false;
// Find EOCD signature 0x06054b50 (PK\x05\x06) by scanning backwards.
// ZIP archive comment max length is 65535, so EOCD is within the last
// 65557 bytes (22-byte EOCD + comment).
let eocdOffset = -1;
const minStart = Math.max(0, bytes.length - 65557);
for (let i = bytes.length - 22; i >= minStart; i--) {
if (
bytes[i] === 0x50 &&
bytes[i + 1] === 0x4b &&
bytes[i + 2] === 0x05 &&
bytes[i + 3] === 0x06
) {
eocdOffset = i;
break;
}
}
if (eocdOffset < 0) return false;
const dv = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
const totalEntries = dv.getUint16(eocdOffset + 10, true);
const cdSize = dv.getUint32(eocdOffset + 12, true);
const cdOffset = dv.getUint32(eocdOffset + 16, true);
// ZIP64 overflow markers — fall back to JSZip's bit-0 check.
if (
cdSize === 0xffffffff ||
cdOffset === 0xffffffff ||
totalEntries === 0xffff
) {
return false;
}
let pos = cdOffset;
let count = 0;
while (count < totalEntries && pos + 46 <= bytes.length) {
// CDH signature 0x02014b50 (PK\x01\x02)
if (
bytes[pos] !== 0x50 ||
bytes[pos + 1] !== 0x4b ||
bytes[pos + 2] !== 0x01 ||
bytes[pos + 3] !== 0x02
) {
return false;
}
const gpFlag = dv.getUint16(pos + 8, true);
const method = dv.getUint16(pos + 10, true);
if ((gpFlag & 0x0001) !== 0) return true; // ZipCrypto
if ((gpFlag & 0x0040) !== 0) return true; // strong encryption
if (method === 99) return true; // WinZip AE-1/AE-2 AES
const nameLen = dv.getUint16(pos + 28, true);
const extraLen = dv.getUint16(pos + 30, true);
const commentLen = dv.getUint16(pos + 32, true);
pos += 46 + nameLen + extraLen + commentLen;
count++;
}
return false;
}
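The central-directory scan can be exercised without a real archive. The following is a sketch under assumptions: `buildCentralDirectoryFixture` and `fixtureHasEncryptedEntry` are hypothetical names, the fixture is one 46-byte CDH plus a 22-byte EOCD (no local headers or data, so it is not a valid ZIP), and the scanner is simplified to assume no archive comment.

```typescript
// Builds the smallest layout the scan needs: one CDH followed by the EOCD.
function buildCentralDirectoryFixture(
  gpFlag: number,
  method: number,
): Uint8Array {
  const bytes = new Uint8Array(46 + 22);
  const dv = new DataView(bytes.buffer);
  dv.setUint32(0, 0x02014b50, true); // CDH signature "PK\x01\x02"
  dv.setUint16(8, gpFlag, true); // general-purpose bit flag
  dv.setUint16(10, method, true); // compression method
  // Name, extra and comment lengths (offsets 28, 30, 32) stay zero.
  dv.setUint32(46, 0x06054b50, true); // EOCD signature "PK\x05\x06"
  dv.setUint16(46 + 8, 1, true); // entries on this disk
  dv.setUint16(46 + 10, 1, true); // total entries
  dv.setUint32(46 + 12, 46, true); // central directory size
  dv.setUint32(46 + 16, 0, true); // central directory offset
  return bytes;
}

// Simplified scan: EOCD is assumed to be the last 22 bytes.
function fixtureHasEncryptedEntry(bytes: Uint8Array): boolean {
  const dv = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const eocd = bytes.byteLength - 22;
  if (eocd < 0 || dv.getUint32(eocd, true) !== 0x06054b50) return false;
  const total = dv.getUint16(eocd + 10, true);
  let pos = dv.getUint32(eocd + 16, true);
  for (let n = 0; n < total && pos + 46 <= bytes.byteLength; n++) {
    if (dv.getUint32(pos, true) !== 0x02014b50) return false;
    const gpFlag = dv.getUint16(pos + 8, true);
    const method = dv.getUint16(pos + 10, true);
    if ((gpFlag & 0x0001) !== 0) return true; // bit 0: encrypted
    if ((gpFlag & 0x0040) !== 0) return true; // bit 6: strong encryption
    if (method === 99) return true; // WinZip AES marker method
    pos +=
      46 +
      dv.getUint16(pos + 28, true) +
      dv.getUint16(pos + 30, true) +
      dv.getUint16(pos + 32, true);
  }
  return false;
}

const plainFixture = buildCentralDirectoryFixture(0x0000, 8);
const zipCryptoFixture = buildCentralDirectoryFixture(0x0001, 8);
const strongFixture = buildCentralDirectoryFixture(0x0040, 8);
const aesFixture = buildCentralDirectoryFixture(0x0000, 99);
```

Fixtures like these could back unit tests for each of the three detection branches independently.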

View file

@ -3,6 +3,7 @@ import { VideoStrategy } from "./strategies/video_strategy";
import { JpegStrategy } from "./strategies/jpeg_strategy";
import { PngStrategy } from "./strategies/png_strategy";
import { PdfStrategy } from "./strategies/pdf_strategy";
import { ZipStrategy, setZipStrategyRouter } from "./strategies/zip_strategy";
import { ExifToolFallbackStrategy } from "./strategies/exiftool_fallback_strategy";
import { FfmpegFallbackStrategy } from "./strategies/ffmpeg_fallback_strategy";
import type { FormatStrategy } from "./format_strategy";
@ -33,6 +34,7 @@ const STRATEGIES: readonly FormatStrategy[] = [
new JpegStrategy(),
new PngStrategy(),
new PdfStrategy(),
new ZipStrategy(),
// Registered last so selectStrategy() only routes to it for extensions no
// hand-rolled walker claims. Claims nothing in chunk A; format coverage
// expands in follow-up PRs against #174.
@ -72,3 +74,8 @@ export function allHandledExtensions(): ReadonlySet<string> {
}
return all;
}
// Inject the router into ZipStrategy so it can dispatch inner entries
// through the same registry. See the note at the top of
// strategies/zip_strategy.ts for why this side effect exists.
setZipStrategyRouter(selectStrategy);
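As a rough illustration of the setter-injection pattern used here: the sketch below uses hypothetical names throughout (`Strategy`, `setRouter`, `routeInner`, `selectByExtension`) and does not reproduce the real `selectStrategy` signature or the rationale in the note at the top of `zip_strategy.ts`, which this excerpt does not include. One common reason for the pattern is to let a strategy call back into the registry without importing it.

```typescript
// Hypothetical minimal strategy shape.
interface Strategy {
  readonly name: string;
  readonly extensions: readonly string[];
}
type Router = (extension: string) => Strategy | null;

// --- "strategy module" side: holds a router slot, filled in later ---
let injectedRouter: Router | null = null;

function setRouter(router: Router): void {
  injectedRouter = router;
}

// Assumption for this sketch: with no router injected, nothing is routed.
function routeInner(extension: string): Strategy | null {
  return injectedRouter === null ? null : injectedRouter(extension);
}

// --- "registry module" side: owns the list and performs the injection ---
const registry: readonly Strategy[] = [
  { name: "jpeg", extensions: [".jpg", ".jpeg"] },
  { name: "zip", extensions: [".zip"] },
];

function selectByExtension(extension: string): Strategy | null {
  for (const strategy of registry) {
    if (strategy.extensions.indexOf(extension) !== -1) return strategy;
  }
  return null;
}

const beforeInjection = routeInner(".jpg");
setRouter(selectByExtension);
const afterInjection = routeInner(".jpg");
```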

View file

@ -1,11 +1,13 @@
import type {
FileBytesPort,
LeafDiffResult,
MetadataProcessorPort,
ProcessOutcome,
} from "../../application";
import type { Result } from "../../common";
import { dispatchExifToolDiffLoading } from "../../common";
import type {
ArchiveEntryResult,
ExifError,
MetadataDocument,
MetadataEntry,
@ -51,10 +53,55 @@ interface PendingDiffInputs {
readonly walkerEntries: readonly MetadataEntry[];
}
// Pending per-leaf diff inputs for ZIP archives. Same shape as
// PendingDiffInputs but keyed by `${entryId}\0${fullPath}` instead of
// entryId alone: one entry per expandable leaf ("cleaned" or
// "already-clean") in the archive tree. Populated by
// stashArchiveLeaves() recursively after a ZipStrategy strip resolves;
// drained by `buildArchiveLeafDiff` on first expand of that leaf in
// ZipExpansion.
//
// `fullPath` is the leaf's path with each outer archive's path
// prepended (separated by \0), so leaves inside nested ZIPs with
// identical local names ("a.zip/photo.jpg" + "b.zip/photo.jpg")
// don't collide. The NUL separator matches the pattern in
// MetadataDiffTable.makeKey() — NUL is forbidden in every metadata
// grammar we route, so the composite key can't collide with a
// legitimate path that contains the separator.
//
// Unbounded for the batch lifetime — see same memory-bound note on
// `pendingDiffs` above. The leak ceiling is "user's batch size *
// number of cleaned leaves per archive"; documented as a risk in
// docs/superpowers/specs/2026-05-22-issue-184-zip-support-design.md §10.
type PendingLeafDiff = PendingDiffInputs;
function leafKey({
entryId,
fullPath,
}: {
entryId: string;
fullPath: string;
}): string {
return `${entryId}\0${fullPath}`;
}
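To make the key composition concrete, here is a small sketch. `makeLeafKey` is a hypothetical positional variant of `leafKey`, and the entry ids and paths are invented for illustration. It shows that same-named leaves in sibling nested ZIPs get distinct keys, and that a prefix of `entryId` plus NUL does not match an id that merely starts with the same characters.

```typescript
// Positional variant of leafKey, for illustration only.
function makeLeafKey(entryId: string, fullPath: string): string {
  return `${entryId}\0${fullPath}`;
}

// Outer archive paths are joined with NUL ahead of the leaf's local name.
const keyInA = makeLeafKey("e1", ["a.zip", "photo.jpg"].join("\0"));
const keyInB = makeLeafKey("e1", ["b.zip", "photo.jpg"].join("\0"));

// A different entry whose id shares the prefix "e1".
const keyOtherEntry = makeLeafKey("e10", "photo.jpg");

// Prefix used when evicting everything that belongs to entry "e1".
const evictionPrefix = "e1\0";

const sameEntryMatches = keyInA.indexOf(evictionPrefix) === 0;
const otherEntryMatches = keyOtherEntry.indexOf(evictionPrefix) === 0;
```

Because the separator follows the full id, evicting `e1` leaves the keys of `e10` alone.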
export class WasmProcessor implements MetadataProcessorPort {
private readonly fileBytes: FileBytesPort;
private readonly diffStrategy: ExifToolDiffStrategy;
private readonly pendingDiffs = new Map<string, PendingDiffInputs>();
// Per-leaf diff inputs for ZIP archives, keyed by
// `${entryId}\0${fullPath}` (see leafKey() above). Populated recursively
// by stashArchiveLeaves(); drained on first expand of each leaf in
// ZipExpansion via buildArchiveLeafDiff.
private readonly pendingLeafDiffs = new Map<string, PendingLeafDiff>();
// Cache of in-flight or resolved leaf diff Promises, keyed identically to
// pendingLeafDiffs. The value is a Promise wrapping a discriminated
// `LeafDiffResult` so we can distinguish "diff returned nothing" (clean
// file) from "diff failed" (ExifTool threw / parse error) — the prior
// design cached a bare null for both, rendering a failure as "No
// metadata" instead of "Diff failed". Caching the Promise (not the
// resolved value) also makes concurrent buildArchiveLeafDiff calls for
// the same key join the same await, eliminating the
// drain-pending-before-cache-write race window.
private readonly cachedLeafDocs = new Map<string, Promise<LeafDiffResult>>();
// Instance-level "have we fired the loading event yet?" flag. The
// ExifToolDiffStrategy caches its ZeroPerl instance across calls, so
@ -135,23 +182,43 @@ export class WasmProcessor implements MetadataProcessorPort {
// the diff hasn't landed yet — the reducer flips a `diffPending`
// flag to render the skeleton in MetadataDiffExpansion until the
// async build dispatches `UPDATE_FILE_DIFF`.
//
// Archive containers (currently only ZipStrategy) take a different
// path: the FileRow renders <ZipExpansion> instead of
// <MetadataDiffExpansion>, so building a top-level diff for the
// archive bytes themselves would be wasted ExifTool work. Skip
// the per-entry stash and only walk archiveEntries for per-leaf
// stashes.
if (ENABLE_EXIFTOOL_DIFF) {
this.pendingDiffs.set(entryId, {
sourceBytes,
strippedBytes: stripResult.value.bytes,
extension: extname(filename),
walkerEntries: stripResult.value.walkerEntries,
});
if (stripResult.value.archiveEntries !== undefined) {
this.stashArchiveLeaves({
entryId,
entries: stripResult.value.archiveEntries,
});
} else {
this.pendingDiffs.set(entryId, {
sourceBytes,
strippedBytes: stripResult.value.bytes,
extension: extname(filename),
walkerEntries: stripResult.value.walkerEntries,
});
}
}
return {
ok: true,
value: {
outputPath,
outputBytes: stripResult.value.bytes.byteLength,
diffDocument: null,
},
// exactOptionalPropertyTypes: omit optional fields entirely when
// the source field is undefined rather than assigning undefined.
const outcome: ProcessOutcome = {
outputPath,
outputBytes: stripResult.value.bytes.byteLength,
diffDocument: null,
...(stripResult.value.warnings !== undefined && {
warnings: stripResult.value.warnings,
}),
...(stripResult.value.archiveEntries !== undefined && {
archiveEntries: stripResult.value.archiveEntries,
}),
};
return { ok: true, value: outcome };
}
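The conditional-spread idiom used for `warnings` and `archiveEntries` can be seen in isolation below. This is a minimal sketch with a hypothetical two-field `Outcome` type and `toOutcome` helper; it shows that the optional key is absent, not present with value `undefined`, when the source field is missing.

```typescript
// Hypothetical trimmed outcome shape with one optional field.
interface Outcome {
  readonly outputPath: string;
  readonly warnings?: readonly string[];
}

function toOutcome(
  outputPath: string,
  warnings: readonly string[] | undefined,
): Outcome {
  return {
    outputPath,
    // Spreading `false` adds nothing; spreading the object adds the key.
    ...(warnings !== undefined && { warnings }),
  };
}

const outcomeWithoutWarnings = toOutcome("out.zip", undefined);
const outcomeWithWarnings = toOutcome("out.zip", [
  "a.zip: nested ZIP left untouched",
]);

const hasWarningsKey = "warnings" in outcomeWithoutWarnings;
```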
// Out-of-band ExifTool read on source + stripped bytes for the given
@ -208,6 +275,10 @@ export class WasmProcessor implements MetadataProcessorPort {
// `T()` boot blocked one of the two reads long enough for the other
// to finish, but once perl was warm the race fired on every
// subsequent file. Serial avoids the race entirely.
// Top-level (non-archive) diff path: returns null on failure so the
// existing top-level UI's "no diff" rendering stays unchanged. Inner
// archive leaves use runLeafDiff below, which threads ok/failed
// through to the cache so the UI can distinguish those two states.
private async runDiff(
pending: PendingDiffInputs,
): Promise<MetadataDocument | null> {
@ -233,6 +304,158 @@ export class WasmProcessor implements MetadataProcessorPort {
return null;
}
}
// Per-leaf diff with explicit ok/failed discrimination. Caller (the
// cache) wraps this in a Promise; the Promise's resolved value tells
// the UI which empty-state copy to render.
private async runLeafDiff(
pending: PendingDiffInputs,
): Promise<LeafDiffResult> {
try {
const beforeResult = await this.diffStrategy.readDocument({
bytes: pending.sourceBytes,
extension: pending.extension,
});
const afterResult = await this.diffStrategy.readDocument({
bytes: pending.strippedBytes,
extension: pending.extension,
});
if (!beforeResult.ok || !afterResult.ok) {
return { kind: "failed" };
}
const before = [...pending.walkerEntries, ...beforeResult.value];
return { kind: "ok", doc: { before, after: afterResult.value } };
} catch {
return { kind: "failed" };
}
}
// Lazy per-leaf diff build for ZIP archives. Same shape +
// drain-on-call semantics as buildDiffDocumentForEntry, but keyed
// by NUL-separated `entryId\0fullPath` because one ZIP yields many
// leaves and nested-ZIP siblings can share local entry names
// (e.g. `a.zip/photo.jpg` + `b.zip/photo.jpg`). Called by
// ZipExpansion on first expand of a cleaned-leaf row, with
// `fullPath` composed from the outer-archive path prefix.
//
// Queues onto the same `diffChain` used by buildDiffDocumentForEntry
// — `@uswriting/exiftool`'s parseMetadata uses module-level
// singletons so any two readDocument calls (across entries OR
// leaves) racing on the shared Perl/StringBuilder state corrupt the
// readback. Per-leaf calls join the same chain so they serialize
// against top-level diffs and against each other.
async buildArchiveLeafDiff({
entryId,
path,
}: {
entryId: string;
path: string;
}): Promise<LeafDiffResult> {
if (!ENABLE_EXIFTOOL_DIFF) {
return { kind: "failed" };
}
const key = leafKey({ entryId, fullPath: path });
// Cache the in-flight Promise (not just the resolved value) so:
// (1) Re-opens after the bytes stash is drained return the same
// result — fixes the collapse-parent-ZIP + reopen-and-re-expand
// case where the component-level cachedDoc is lost on unmount.
// (2) Concurrent buildArchiveLeafDiff calls for the same key during
// a rapid close-while-loading + reopen cycle JOIN the same
// await. Otherwise one call drains pendingLeafDiffs and the
// other, finding neither a stash nor a cached result, returns
// a failure that overrides the legitimate result.
const cached = this.cachedLeafDocs.get(key);
if (cached !== undefined) return cached;
const pending = this.pendingLeafDiffs.get(key);
if (pending === undefined) return { kind: "failed" };
this.pendingLeafDiffs.delete(key);
if (!this.diffWarmupSignalled) {
this.diffWarmupSignalled = true;
dispatchExifToolDiffLoading();
}
// Queue onto the singleton diff chain for serialisation, then cache
// the resulting Promise immediately so any concurrent caller sees it
// and awaits the same instance.
const computePromise: Promise<LeafDiffResult> = this.diffChain.then(() =>
this.runLeafDiff(pending),
);
this.diffChain = computePromise.catch(() => null);
this.cachedLeafDocs.set(key, computePromise);
return computePromise;
}
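The promise-caching behaviour described in the comments above can be sketched on its own. `SerialPromiseCache` is a hypothetical class, not the commit's code: it caches the in-flight promise per key and queues each computation on a shared chain, so repeated calls for one key receive the same promise object.

```typescript
class SerialPromiseCache<T> {
  private readonly inFlight = new Map<string, Promise<T>>();
  private chain: Promise<unknown> = Promise.resolve();

  get(key: string, compute: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    // A second caller joins the first caller's promise.
    if (existing !== undefined) return existing;
    // Queue behind whatever is already running so computations serialize.
    const started = this.chain.then(() => compute());
    // Keep the chain usable even if this computation rejects.
    this.chain = started.catch(() => undefined);
    // Cache immediately, before the computation has actually started.
    this.inFlight.set(key, started);
    return started;
  }
}

const leafDiffCache = new SerialPromiseCache<string>();
const firstCall = leafDiffCache.get("e1\0photo.jpg", () =>
  Promise.resolve("diff for photo.jpg"),
);
// The compute callback of a repeated call is ignored: the cached promise wins.
const secondCall = leafDiffCache.get("e1\0photo.jpg", () =>
  Promise.resolve("never computed"),
);
const otherKeyCall = leafDiffCache.get("e1\0notes.pdf", () =>
  Promise.resolve("diff for notes.pdf"),
);
```

Caching the promise instead of the resolved value is what closes the window between draining the stash and writing the result.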
// Evicts all cached leaf state for a given entryId. Called by the web
// API when a FileEntry is removed from the app state, so the parsed
// metadata (potentially GPS, names, camera serials) doesn't linger on
// the processor singleton for the rest of the tab session — per
// privacy invariant §3 ("what we cannot claim to clean").
clearLeafCacheForEntry({ entryId }: { entryId: string }): void {
const prefix = `${entryId}\0`;
for (const k of this.pendingLeafDiffs.keys()) {
if (k.startsWith(prefix)) this.pendingLeafDiffs.delete(k);
}
for (const k of this.cachedLeafDocs.keys()) {
if (k.startsWith(prefix)) this.cachedLeafDocs.delete(k);
}
this.pendingDiffs.delete(entryId);
}
// Walk the archiveEntries tree and stash source + stripped bytes
// for each expandable leaf. Only ACTUAL LEAVES (entry.entries === null)
// get stashed. ZipStrategy already nulls sourceBytes/strippedBytes on
// nested-ZIP parent entries, and the UI's `isNestedZip` branch never
// calls buildArchiveLeafDiff on them, so stashing a parent would only
// leak bytes that nothing ever drains. The nested ZIP's actual leaves
// reach this function via the recursive call below.
//
// `outerPath` is the parent archive's full path (with trailing \0
// separator) used to compose the leaf key. Empty at the top level;
// "a.zip\0" inside the first nested-ZIP, "a.zip\0b.zip\0" two
// levels deep, etc. Without the prefix, sibling nested zips with
// identical local entry names would collide on the same key.
private stashArchiveLeaves({
entryId,
entries,
outerPath = "",
}: {
entryId: string;
entries: readonly ArchiveEntryResult[];
outerPath?: string;
}): void {
for (const entry of entries) {
const fullPath = outerPath + entry.path;
const isLeaf = entry.entries === null;
// Stash both "cleaned" and "already-clean" leaves — the user can
// still expand an "Already clean" row to see ExifTool's diff
// confirming what's in the file (or that nothing changed).
const isExpandable =
entry.status === "cleaned" || entry.status === "already-clean";
if (
isLeaf &&
isExpandable &&
entry.sourceBytes !== null &&
entry.strippedBytes !== null
) {
this.pendingLeafDiffs.set(leafKey({ entryId, fullPath }), {
sourceBytes: entry.sourceBytes,
strippedBytes: entry.strippedBytes,
extension: extname(entry.path),
walkerEntries: entry.walkerEntries,
});
}
if (entry.entries !== null) {
this.stashArchiveLeaves({
entryId,
entries: entry.entries,
outerPath: `${fullPath}\0`,
});
}
}
}
}
function extname(filename: string): string {

View file

@ -1,8 +1,10 @@
import type {
ArchiveEntryResult,
MetadataDocument,
Settings,
I18nStringsDictionary,
} from "../../domain";
import type { LeafDiffResult } from "../../application";
import {
DEFAULT_SETTINGS,
validateSettings,
@ -87,6 +89,8 @@ export interface WasmApi {
outputPath: string | null;
outputBytes: number | null;
diffDocument: MetadataDocument | null;
warnings: readonly string[];
archiveEntries: readonly ArchiveEntryResult[] | null;
error: string | null;
}>;
@ -97,6 +101,17 @@ export interface WasmApi {
// the diff couldn't be built — caller falls back to the legacy delta
// view (already MetadataDiffExpansion's default).
buildDiffDocument(entryId: string): Promise<MetadataDocument | null>;
// Lazy per-leaf diff build for ZIP archives. Called by ZipExpansion
// on first expand of an expandable leaf row ("cleaned" or
// "already-clean"). Returns a discriminated LeafDiffResult so the UI
// can distinguish empty-but-successful (render "Already clean") from
// failed (render "Diff failed").
buildArchiveLeafDiff(entryId: string, path: string): Promise<LeafDiffResult>;
// Evicts cached leaf diffs + pending bytes for an entryId. Called from
// AppContext when a FileEntry is removed so parsed metadata doesn't
// linger on the processor singleton.
clearLeafCacheForEntry(entryId: string): void;
}
export interface WebApi {
@ -287,6 +302,8 @@ export function makeWebApi(): WebApi {
outputPath: null,
outputBytes: null,
diffDocument: null,
warnings: [],
archiveEntries: null,
error: formatExifError(result.error),
};
}
@ -295,11 +312,17 @@ export function makeWebApi(): WebApi {
outputPath: result.value.outputPath,
outputBytes: result.value.outputBytes,
diffDocument: result.value.diffDocument,
warnings: result.value.warnings ?? [],
archiveEntries: result.value.archiveEntries ?? null,
error: null,
};
},
buildDiffDocument: async (entryId) =>
processor.buildDiffDocumentForEntry({ entryId }),
buildArchiveLeafDiff: async (entryId, path) =>
processor.buildArchiveLeafDiff({ entryId, path }),
clearLeafCacheForEntry: (entryId) =>
processor.clearLeafCacheForEntry({ entryId }),
},
folder: {

View file

@ -35,6 +35,17 @@ function AppContent(): React.JSX.Element {
const hasFiles = state.files.length > 0;
// "Clean more" clears the file list. Also evict each FileEntry's cached
// leaf diffs from the WasmProcessor so parsed metadata (GPS, names,
// camera serials etc.) doesn't linger on the processor singleton for
// the rest of the tab session. Privacy invariant §3.
const handleClearFiles = useCallback((): void => {
for (const file of state.files) {
window.api.wasm.clearLeafCacheForEntry(file.id);
}
dispatch({ type: "CLEAR_FILES" });
}, [state.files, dispatch]);
const { cleanedCount, errorCount, totalCount, totalTagsRemoved, allDone } =
useFileStats(state.files);
@ -75,9 +86,7 @@ function AppContent(): React.JSX.Element {
totalTagsRemoved={hasFiles ? totalTagsRemoved : undefined}
elapsedSeconds={hasFiles ? elapsedSeconds : undefined}
errorCount={hasFiles ? errorCount : undefined}
onCleanMore={
hasFiles ? () => dispatch({ type: "CLEAR_FILES" }) : undefined
}
onCleanMore={hasFiles ? handleClearFiles : undefined}
/>
<OfflineIndicator />
<SettingsDrawer isOpen={isSettingsOpen} onClose={handleClose} />

View file

@ -1,8 +1,14 @@
// Single file row with 6 columns: STATUS, NAME, TYPE, BEFORE, AFTER, RESULT.
// BEFORE = file.size, AFTER = file.afterBytes (post-strip size, null until
// the strategy returns), RESULT renders the textual pill ('Cleaned' /
// 'Already clean') via ResultPill. Supports row expansion for error
// details and the "no metadata found" notice.
// 'Already clean') via ResultPill. Supports four expansion modes:
// - Error details (ErrorExpansion)
// - "No metadata found" notice
// - Archive entry tree (ZipExpansion) — when archiveEntries is populated
// - Metadata diff (MetadataDiffExpansion) — otherwise, when a diff
// exists or is pending
// ZipExpansion and MetadataDiffExpansion are mutually exclusive — see the
// hasArchiveEntries gate below.
import { useRef } from "react";
import type { FileEntry } from "../../contexts/AppContext";
@ -14,6 +20,7 @@ import { ChevronIcon } from "../icons/ChevronIcon";
import { ErrorExpansion } from "./ErrorExpansion";
import { MetadataDiffExpansion } from "./MetadataDiffExpansion";
import { ResultPill } from "./ResultPill";
import { ZipExpansion } from "./ZipExpansion";
import { formatFileSize } from "../../utils/format_file_size";
import { useI18n } from "../../hooks/use_i18n";
@ -44,11 +51,13 @@ export function FileRow({
const hasDiffDocument =
file.diffDocument !== null && file.diffDocument.before.length > 0;
const diffPending = file.diffPending === true;
const hasArchiveEntries =
file.archiveEntries !== undefined && file.archiveEntries.length > 0;
const isExpandable =
isError ||
file.status === FileProcessingStatus.NoMetadataFound ||
(file.status === FileProcessingStatus.Complete &&
(hasDiffDocument || diffPending));
(hasDiffDocument || diffPending || hasArchiveEntries));
const rowClasses = [
"file-table__row",
@ -171,15 +180,31 @@ export function FileRow({
)}
{isExpanded &&
isComplete &&
file.status === FileProcessingStatus.NoMetadataFound && (
file.status === FileProcessingStatus.NoMetadataFound &&
!hasArchiveEntries &&
(file.diffDocument !== null || diffPending ? (
// Diff available or in-flight: show the diff table / skeleton so
// the user can see what was intentionally preserved (orientation,
// color profile). Without this, "Already clean" files with
// preserve-flags on would show a blank expansion even though
// ExifTool found metadata that we chose to keep.
<MetadataDiffExpansion
diffDocument={file.diffDocument}
diffPending={diffPending}
/>
) : (
<div className="file-table__expansion">
<span className="file-table__expansion-empty">
{t("noMetadataFound")}
</span>
</div>
)}
))}
{isExpanded && isComplete && hasArchiveEntries && (
<ZipExpansion entryId={file.id} entries={file.archiveEntries!} />
)}
{isExpanded &&
file.status === FileProcessingStatus.Complete &&
!hasArchiveEntries &&
(hasDiffDocument || diffPending) && (
<MetadataDiffExpansion
diffDocument={file.diffDocument}

View file

@ -1,4 +1,8 @@
// Expandable per-file metadata diff.
// Expandable per-file metadata diff. Top-level wrapper that chooses
// between the two-pane table and the loading skeleton based on the
// diff's async state. Table + skeleton bodies live in MetadataDiffTable
// so ZipExpansion can reuse them per-leaf without inheriting the outer
// file-table expansion chrome.
//
// Two-pane render: ExifTool's full metadata dump from the source on the
// left, dump from the stripped file on the right. Rows are aligned across
@ -16,13 +20,10 @@
// Skeleton mode: while the async ExifTool read is in flight (diffPending)
// and no diffDocument is on the entry yet, render a wayfinding cue so the
// expansion area isn't blank when the user opens the row early.
//
// `t()` is the live i18n hook. The diff keys carry a `{count}` placeholder
// interpolated locally (mirrors the ErrorExpansion.tsx pattern), since the
// live `t` signature is `(key: string) => string` and does not interpolate.
import { useI18n } from "../../hooks/use_i18n";
import type { MetadataDocument, MetadataEntry } from "../../../domain";
import type { MetadataDocument } from "../../../domain";
import { MetadataDiffTable, DiffSkeleton } from "./MetadataDiffTable";
export function MetadataDiffExpansion({
diffDocument,
@ -38,8 +39,11 @@ export function MetadataDiffExpansion({
}): React.JSX.Element | null {
const { t } = useI18n();
if (diffDocument != null && diffDocument.before.length > 0) {
return <TwoPaneView document={diffDocument} t={t} />;
if (
diffDocument != null &&
(diffDocument.before.length > 0 || diffDocument.after.length > 0)
) {
return <MetadataDiffTable document={diffDocument} t={t} />;
}
// Diff still in flight — skeleton wayfinding cue.
@ -47,284 +51,18 @@ export function MetadataDiffExpansion({
return <DiffSkeleton t={t} />;
}
// diff resolved but both sides are empty — file was already clean.
if (diffDocument != null) {
return (
<div className="file-table__expansion">
<span className="file-table__expansion-empty">
{t("zipExpansion.alreadyClean")}
</span>
</div>
);
}
// Nothing to show. The `isExpandable` gate in FileRow normally prevents
// reaching this branch.
return null;
}
// =========================================================================
// Two-pane view
// =========================================================================
type DiffRowStatus = "removed" | "added" | "modified" | "kept";
interface DiffRow {
readonly status: DiffRowStatus;
readonly source: string;
readonly name: string;
readonly before: string | null;
readonly after: string | null;
}
function TwoPaneView({
document,
t,
}: {
document: MetadataDocument;
t: (key: string) => string;
}): React.JSX.Element {
const rows = computeDiffRows(document);
const grouped = groupRowsBySource(rows);
return (
<div className="file-table__expansion file-table__diff file-table__diff--two-pane">
<div className="file-table__diff-pane-header">
<span className="file-table__diff-pane-label file-table__diff-pane-label--before">
{t("diffPaneBefore")}
</span>
<span className="file-table__diff-pane-label file-table__diff-pane-label--after">
{t("diffPaneAfter")}
</span>
</div>
{grouped.map(({ source, rows: groupRows }) => (
<section key={source} className="file-table__diff-group">
<h4 className="file-table__diff-group-header">
{source} {makePaneGroupSummary(groupRows, t)}
</h4>
<div className="file-table__diff-pane-list">
{groupRows.map((row, idx) => (
<PaneRow
key={`${row.source}-${row.name}-${idx}`}
row={row}
t={t}
/>
))}
</div>
</section>
))}
</div>
);
}
// Wayfinding skeleton shown while the out-of-band ExifTool diff build is
// in flight. Reuses existing diff-group classes so the geometry matches the
// loaded view (avoids layout shift when the skeleton swaps for the real
// two-pane on `diffPending: false`). Uses its own i18n key
// (`diffSkeletonLoading`) rather than reusing `diffLoadingToast` — the toast
// and skeleton are different render contexts and translators may want to
// diverge (e.g., shorter form for the toast where space is constrained).
function DiffSkeleton({
t,
}: {
t: (key: string) => string;
}): React.JSX.Element {
return (
<div
className="file-table__expansion file-table__diff file-table__diff--skeleton"
role="status"
aria-live="polite"
aria-busy="true"
>
<span className="file-table__diff-value file-table__diff-value--placeholder">
{t("diffSkeletonLoading")}
</span>
</div>
);
}
// Count-shaped walker values like "3 files" / "1 attribute" / "5 items"
// don't represent a single field value — they're aggregate summaries
// from the Office walker's structural deletions (comments, embeddings,
// rsids, etc.). The legacy single-pane diff (pre chunk B.1) rendered
// these as a pill badge instead of strikethrough text; the two-pane
// view preserves that affordance.
//
// Pattern: leading digit(s) + a single space + one word (singular or
// plural noun). Matches "3 files", "1 attribute", "12 items" but not
// "Apple iPhone 14" or strings containing spaces past the noun.
const COUNT_VALUE_RE = /^\d+ \w+$/;
function isCountValue(s: string): boolean {
return COUNT_VALUE_RE.test(s);
}
function PaneRow({
row,
t,
}: {
row: DiffRow;
t: (key: string) => string;
}): React.JSX.Element {
const empty = t("diffEmptyValue");
// Count-shaped removed values render as a badge (pill) instead of
// strikethrough — visually distinct because "3 files" is an aggregate
// removed count, not a value that was scrubbed.
const beforeIsCount = row.before !== null && isCountValue(row.before);
const beforeClass =
row.status === "removed" && beforeIsCount
? "file-table__diff-value--count-badge"
: row.status === "removed" || row.status === "modified"
? "file-table__diff-value file-table__diff-value--strike"
: "file-table__diff-value";
const afterIsCount = row.after !== null && isCountValue(row.after);
const afterClass =
row.status === "added" && afterIsCount
? "file-table__diff-value--count-badge file-table__diff-value--added"
: row.status === "added" || row.status === "modified"
? "file-table__diff-value file-table__diff-value--added"
: "file-table__diff-value";
return (
<div
className={`file-table__diff-pair file-table__diff-pair--${row.status}`}
>
<div className="file-table__diff-name">{row.name}</div>
<div className="file-table__diff-pane-cell file-table__diff-pane-cell--before">
{row.before !== null ? (
<span className={beforeClass} title={row.before}>
{row.before}
</span>
) : (
<span className="file-table__diff-value file-table__diff-value--placeholder">
{empty}
</span>
)}
</div>
<div className="file-table__diff-pane-cell file-table__diff-pane-cell--after">
{row.after !== null ? (
<span className={afterClass} title={row.after}>
{row.after}
</span>
) : (
<span className="file-table__diff-value file-table__diff-value--placeholder">
{empty}
</span>
)}
</div>
</div>
);
}
function computeDiffRows(document: MetadataDocument): readonly DiffRow[] {
const afterByKey = new Map<string, MetadataEntry>();
for (const entry of document.after) {
afterByKey.set(makeKey(entry.source, entry.name), entry);
}
const beforeKeys = new Set<string>();
const rows: DiffRow[] = [];
for (const entry of document.before) {
const key = makeKey(entry.source, entry.name);
beforeKeys.add(key);
const after = afterByKey.get(key);
if (after === undefined) {
rows.push({
status: "removed",
source: entry.source,
name: entry.name,
before: entry.value,
after: null,
});
} else if (after.value === entry.value) {
rows.push({
status: "kept",
source: entry.source,
name: entry.name,
before: entry.value,
after: after.value,
});
} else {
rows.push({
status: "modified",
source: entry.source,
name: entry.name,
before: entry.value,
after: after.value,
});
}
}
for (const entry of document.after) {
const key = makeKey(entry.source, entry.name);
if (!beforeKeys.has(key)) {
rows.push({
status: "added",
source: entry.source,
name: entry.name,
before: null,
after: entry.value,
});
}
}
return rows;
}
// NUL separator (not a space, not a colon) so the composed key can't
// collide with a tag name that legitimately contains the separator.
// Tag names from ExifTool -G1 are mostly ASCII identifiers, but spaces
// have shown up in extended XMP namespaces. NUL is forbidden in every
// metadata grammar we route, so it's a safe sentinel.
function makeKey(source: string, name: string): string {
return `${source}\0${name}`;
}
function makePaneGroupSummary(
rows: readonly DiffRow[],
t: (key: string) => string,
): string {
let removed = 0;
let modified = 0;
let added = 0;
let kept = 0;
for (const r of rows) {
switch (r.status) {
case "removed":
removed += 1;
break;
case "modified":
modified += 1;
break;
case "added":
added += 1;
break;
case "kept":
kept += 1;
break;
}
}
const parts: string[] = [];
if (removed > 0)
parts.push(t("diffGroupRemoved").replace("{count}", String(removed)));
if (modified > 0)
parts.push(t("diffGroupModified").replace("{count}", String(modified)));
if (added > 0)
parts.push(t("diffGroupAdded").replace("{count}", String(added)));
if (kept > 0) parts.push(t("diffGroupKept").replace("{count}", String(kept)));
if (parts.length === 0) return "";
return `· ${parts.join(t("diffGroupSeparator"))}`;
}
interface SourceRowGroup {
readonly source: string;
readonly rows: readonly DiffRow[];
}
function groupRowsBySource(
rows: readonly DiffRow[],
): readonly SourceRowGroup[] {
const order: string[] = [];
const byKey = new Map<string, DiffRow[]>();
for (const row of rows) {
const existing = byKey.get(row.source);
if (existing === undefined) {
order.push(row.source);
byKey.set(row.source, [row]);
} else {
existing.push(row);
}
}
return order.map((source) => ({
source,
rows: byKey.get(source) as DiffRow[],
}));
}

@ -0,0 +1,284 @@
// Two-pane metadata diff table — extracted from MetadataDiffExpansion
// so ZipExpansion can reuse the table body inside per-leaf rows without
// inheriting the outer expansion chrome (see
// docs/superpowers/specs/2026-05-22-issue-184-zip-support-design.md §4.5).
//
// `wrapperClassName` lets the consumer swap the outer wrapper class:
// - MetadataDiffExpansion uses the default
// "file-table__expansion file-table__diff file-table__diff--two-pane".
// - ZipExpansion leaf renders pass "zip-expansion__leaf-diff" to get a
// slimmer wrapper without the file-table expansion padding.
import type { MetadataDocument, MetadataEntry } from "../../../domain";
type DiffRowStatus = "removed" | "added" | "modified" | "kept";
interface DiffRow {
readonly status: DiffRowStatus;
readonly source: string;
readonly name: string;
readonly before: string | null;
readonly after: string | null;
}
export function MetadataDiffTable({
document,
t,
wrapperClassName = "file-table__expansion file-table__diff file-table__diff--two-pane",
}: {
document: MetadataDocument;
t: (key: string) => string;
wrapperClassName?: string;
}): React.JSX.Element {
const rows = computeDiffRows(document);
const grouped = groupRowsBySource(rows);
return (
<div className={wrapperClassName}>
<div className="file-table__diff-pane-header">
<span className="file-table__diff-pane-label file-table__diff-pane-label--before">
{t("diffPaneBefore")}
</span>
<span className="file-table__diff-pane-label file-table__diff-pane-label--after">
{t("diffPaneAfter")}
</span>
</div>
{grouped.map(({ source, rows: groupRows }) => (
<section key={source} className="file-table__diff-group">
<h4 className="file-table__diff-group-header">
{source} {makePaneGroupSummary(groupRows, t)}
</h4>
<div className="file-table__diff-pane-list">
{groupRows.map((row, idx) => (
<PaneRow
key={`${row.source}-${row.name}-${idx}`}
row={row}
t={t}
/>
))}
</div>
</section>
))}
</div>
);
}
// Wayfinding skeleton shown while the out-of-band ExifTool diff build is
// in flight. Reuses existing diff-group classes so the geometry matches
// the loaded view (avoids layout shift when the skeleton swaps for the
// real two-pane). Exported so ZipExpansion can render the same skeleton
// while a leaf's diff is loading.
export function DiffSkeleton({
t,
wrapperClassName = "file-table__expansion file-table__diff file-table__diff--skeleton",
}: {
t: (key: string) => string;
wrapperClassName?: string;
}): React.JSX.Element {
return (
<div
className={wrapperClassName}
role="status"
aria-live="polite"
aria-busy="true"
>
<span className="file-table__diff-value file-table__diff-value--placeholder">
{t("diffSkeletonLoading")}
</span>
</div>
);
}
// Count-shaped walker values like "3 files" / "1 attribute" / "5 items"
// don't represent a single field value — they're aggregate summaries
// from the Office walker's structural deletions (comments, embeddings,
// rsids, etc.). The legacy single-pane diff (pre chunk B.1) rendered
// these as a pill badge instead of strikethrough text; the two-pane
// view preserves that affordance.
//
// Pattern: leading digit(s) + a single space + one word (singular or
// plural noun). Matches "3 files", "1 attribute", "12 items" but not
// "Apple iPhone 14" or strings containing spaces past the noun.
const COUNT_VALUE_RE = /^\d+ \w+$/;
function isCountValue(s: string): boolean {
return COUNT_VALUE_RE.test(s);
}
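A quick check of what the pattern accepts may help when reviewing the badge logic. The sketch below re-declares the same regular expression under hypothetical names (`countValueSketch`, `looksLikeCount`) and runs it over made-up sample strings. One thing it surfaces: `\w` also matches digits and underscores, so a value such as "10 2024" would be treated as a count too.

```typescript
// Same pattern as COUNT_VALUE_RE: digits, exactly one space, one \w+ token.
// Names here are illustrative only, not part of the codebase.
const countValueSketch = /^\d+ \w+$/;

function looksLikeCount(value: string): boolean {
  return countValueSketch.test(value);
}

// Made-up sample values for illustration.
const countSamples: readonly string[] = [
  "3 files", // count-shaped
  "1 attribute", // count-shaped
  "12 items", // count-shaped
  "Apple iPhone 14", // does not start with a digit
  "3 files removed", // a second space follows the noun
  "10 2024", // \w also matches digits, so this is accepted
];

const countResults: boolean[] = countSamples.map((s) => looksLikeCount(s));
console.log(countResults);
```

If numeric second tokens turn out to be a problem in practice, tightening `\w+` to a letters-only class would be one option; whether that is needed depends on the values the walkers actually emit.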
function PaneRow({
row,
t,
}: {
row: DiffRow;
t: (key: string) => string;
}): React.JSX.Element {
const empty = t("diffEmptyValue");
const beforeIsCount = row.before !== null && isCountValue(row.before);
const beforeClass =
row.status === "removed" && beforeIsCount
? "file-table__diff-value--count-badge"
: row.status === "removed" || row.status === "modified"
? "file-table__diff-value file-table__diff-value--strike"
: "file-table__diff-value";
const afterIsCount = row.after !== null && isCountValue(row.after);
const afterClass =
row.status === "added" && afterIsCount
? "file-table__diff-value--count-badge file-table__diff-value--added"
: row.status === "added" || row.status === "modified"
? "file-table__diff-value file-table__diff-value--added"
: "file-table__diff-value";
return (
<div
className={`file-table__diff-pair file-table__diff-pair--${row.status}`}
>
<div className="file-table__diff-name">{row.name}</div>
<div className="file-table__diff-pane-cell file-table__diff-pane-cell--before">
{row.before !== null ? (
<span className={beforeClass} title={row.before}>
{row.before}
</span>
) : (
<span className="file-table__diff-value file-table__diff-value--placeholder">
{empty}
</span>
)}
</div>
<div className="file-table__diff-pane-cell file-table__diff-pane-cell--after">
{row.after !== null ? (
<span className={afterClass} title={row.after}>
{row.after}
</span>
) : (
<span className="file-table__diff-value file-table__diff-value--placeholder">
{empty}
</span>
)}
</div>
</div>
);
}
function computeDiffRows(document: MetadataDocument): readonly DiffRow[] {
const afterByKey = new Map<string, MetadataEntry>();
for (const entry of document.after) {
afterByKey.set(makeKey(entry.source, entry.name), entry);
}
const beforeKeys = new Set<string>();
const rows: DiffRow[] = [];
for (const entry of document.before) {
const key = makeKey(entry.source, entry.name);
beforeKeys.add(key);
const after = afterByKey.get(key);
if (after === undefined) {
rows.push({
status: "removed",
source: entry.source,
name: entry.name,
before: entry.value,
after: null,
});
} else if (after.value === entry.value) {
rows.push({
status: "kept",
source: entry.source,
name: entry.name,
before: entry.value,
after: after.value,
});
} else {
rows.push({
status: "modified",
source: entry.source,
name: entry.name,
before: entry.value,
after: after.value,
});
}
}
for (const entry of document.after) {
const key = makeKey(entry.source, entry.name);
if (!beforeKeys.has(key)) {
rows.push({
status: "added",
source: entry.source,
name: entry.name,
before: null,
after: entry.value,
});
}
}
return rows;
}
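To make the four row statuses concrete, here is a minimal, self-contained sketch of the same classification idea applied to made-up entries. The type and function names (`SketchEntry`, `classifyEntries`) and the sample tag values are assumptions for illustration; the real `MetadataEntry` and `DiffRow` types may carry more fields.

```typescript
type SketchStatus = "removed" | "added" | "modified" | "kept";

interface SketchEntry {
  source: string;
  name: string;
  value: string;
}

interface SketchRow {
  status: SketchStatus;
  name: string;
  before: string | null;
  after: string | null;
}

// Composite key with a NUL separator, mirroring makeKey.
function sketchKey(source: string, tag: string): string {
  return `${source}\0${tag}`;
}

function classifyEntries(
  before: readonly SketchEntry[],
  after: readonly SketchEntry[],
): SketchRow[] {
  const afterByKey = new Map<string, SketchEntry>();
  for (const e of after) afterByKey.set(sketchKey(e.source, e.name), e);
  const beforeKeys = new Set<string>();
  const rows: SketchRow[] = [];
  // Pass 1: every "before" entry is removed, kept, or modified.
  for (const e of before) {
    const k = sketchKey(e.source, e.name);
    beforeKeys.add(k);
    const match = afterByKey.get(k);
    if (match === undefined) {
      rows.push({ status: "removed", name: e.name, before: e.value, after: null });
    } else {
      const status: SketchStatus = match.value === e.value ? "kept" : "modified";
      rows.push({ status, name: e.name, before: e.value, after: match.value });
    }
  }
  // Pass 2: anything only present in "after" is an addition.
  for (const e of after) {
    if (!beforeKeys.has(sketchKey(e.source, e.name))) {
      rows.push({ status: "added", name: e.name, before: null, after: e.value });
    }
  }
  return rows;
}

// Made-up sample data.
const sampleRows = classifyEntries(
  [
    { source: "EXIF", name: "Make", value: "Apple" },
    { source: "EXIF", name: "Orientation", value: "Rotate 90 CW" },
    { source: "XMP", name: "Creator", value: "alice" },
  ],
  [
    { source: "EXIF", name: "Orientation", value: "Horizontal (normal)" },
    { source: "XMP", name: "Creator", value: "alice" },
    { source: "File", name: "Comment", value: "stripped" },
  ],
);
const sampleStatuses: string[] = sampleRows.map((r) => `${r.name}:${r.status}`);
console.log(sampleStatuses);
```

Rows come out in `before` order first, with additions appended at the end, which is why "added" rows always sit below the rest within the overall list.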
// NUL separator (not a space, not a colon) so the composed key can't
// collide with a tag name that legitimately contains the separator.
// Tag names from ExifTool -G1 are mostly ASCII identifiers, but spaces
// have shown up in extended XMP namespaces. NUL is forbidden in every
// metadata grammar we route, so it's a safe sentinel.
function makeKey(source: string, name: string): string {
return `${source}\0${name}`;
}
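The collision the comment guards against can be shown with two hypothetical source/name pairs. In the sketch below, `joinWithSpace` and `joinWithNul` are illustrative helpers, and the pair values are invented to force the ambiguity.

```typescript
// Illustrative helpers, not part of the codebase.
function joinWithSpace(source: string, tag: string): string {
  return `${source} ${tag}`;
}

function joinWithNul(source: string, tag: string): string {
  return `${source}\0${tag}`;
}

// Two different (source, name) pairs that flatten to the same text when
// the separator can also occur inside either part.
const spaceKeysCollide: boolean =
  joinWithSpace("XMP ext", "Title") === joinWithSpace("XMP", "ext Title");

// With NUL as the separator the boundary stays unambiguous, as long as
// neither part can itself contain NUL.
const nulKeysCollide: boolean =
  joinWithNul("XMP ext", "Title") === joinWithNul("XMP", "ext Title");

console.log({ spaceKeysCollide, nulKeysCollide });
```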
function makePaneGroupSummary(
rows: readonly DiffRow[],
t: (key: string) => string,
): string {
let removed = 0;
let modified = 0;
let added = 0;
let kept = 0;
for (const r of rows) {
switch (r.status) {
case "removed":
removed += 1;
break;
case "modified":
modified += 1;
break;
case "added":
added += 1;
break;
case "kept":
kept += 1;
break;
}
}
const parts: string[] = [];
if (removed > 0)
parts.push(t("diffGroupRemoved").replace("{count}", String(removed)));
if (modified > 0)
parts.push(t("diffGroupModified").replace("{count}", String(modified)));
if (added > 0)
parts.push(t("diffGroupAdded").replace("{count}", String(added)));
if (kept > 0) parts.push(t("diffGroupKept").replace("{count}", String(kept)));
if (parts.length === 0) return "";
return `· ${parts.join(t("diffGroupSeparator"))}`;
}
interface SourceRowGroup {
readonly source: string;
readonly rows: readonly DiffRow[];
}
function groupRowsBySource(
rows: readonly DiffRow[],
): readonly SourceRowGroup[] {
const order: string[] = [];
const byKey = new Map<string, DiffRow[]>();
for (const row of rows) {
const existing = byKey.get(row.source);
if (existing === undefined) {
order.push(row.source);
byKey.set(row.source, [row]);
} else {
existing.push(row);
}
}
return order.map((source) => ({
source,
rows: byKey.get(source) as DiffRow[],
}));
}
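The grouping step keeps sources in first-seen order. Since a JavaScript `Map` already iterates in insertion order, the same behaviour can be sketched without a separate `order` array. The generic helper below (`groupInFirstSeenOrder`, a hypothetical name) is one possible shape, shown with made-up rows.

```typescript
interface GroupSketch<T> {
  key: string;
  items: T[];
}

// Groups items by key while preserving the order in which keys first appear.
function groupInFirstSeenOrder<T>(
  items: readonly T[],
  keyOf: (item: T) => string,
): GroupSketch<T>[] {
  const byKey = new Map<string, T[]>();
  for (const item of items) {
    const groupKey = keyOf(item);
    const bucket = byKey.get(groupKey);
    if (bucket === undefined) {
      byKey.set(groupKey, [item]);
    } else {
      bucket.push(item);
    }
  }
  const out: GroupSketch<T>[] = [];
  // Map.forEach visits entries in insertion order.
  byKey.forEach((bucket, groupKey) => {
    out.push({ key: groupKey, items: bucket });
  });
  return out;
}

// Made-up rows: XMP is seen first, so it leads the output.
const groupedDemo = groupInFirstSeenOrder(
  [
    { source: "XMP", tag: "Creator" },
    { source: "EXIF", tag: "Make" },
    { source: "XMP", tag: "Title" },
  ],
  (row) => row.source,
);
const groupedKeys: string[] = groupedDemo.map((g) => `${g.key}:${g.items.length}`);
console.log(groupedKeys);
```

This variant also avoids the `as DiffRow[]` cast, because the bucket is read straight from the iteration rather than looked up again.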

@ -0,0 +1,345 @@
// Recursive tree view of inner ZIP entries. Renders the
// `archiveEntries` from a ZipStrategy strip; each cleaned-leaf row
// lazy-loads its per-leaf metadata diff on first expand via
// window.api.wasm.buildArchiveLeafDiff.
//
// See docs/superpowers/specs/2026-05-22-issue-184-zip-support-design.md
// §4.5 for the UI contract.
import { useCallback, useState } from "react";
import type { LeafDiffResult } from "../../../application";
import { assertNever } from "../../../common";
import type { ArchiveEntryResult } from "../../../domain";
import { useI18n } from "../../hooks/use_i18n";
import { ChevronIcon } from "../icons/ChevronIcon";
import { MetadataDiffTable, DiffSkeleton } from "./MetadataDiffTable";
// Pagination — render the first N entries eagerly; surface a button
// to reveal the next N for archives larger than this.
const VISIBLE_PAGE_SIZE = 100;
// Visual indent caps at level 5 to avoid horizontal squeeze on mobile.
// This only affects padding-left; recursion depth is bounded separately
// by MAX_RENDER_DEPTH below.
// Render-time depth limit — beyond this we stop rendering further
// recursion to prevent adversarial archives (zip quine, deeply nested
// archive of archives) from hanging the tab. Strategy-side recursion
// caps at MAX_NESTING_DEPTH (10) so a well-formed result never reaches
// this UI cap; we keep it as a defense-in-depth ceiling in case the
// strategy ever evolves to ship deeper trees.
const MAX_RENDER_DEPTH = 20;
// Per-leaf expansion state. `cachedResult` on "closed" lets a re-opened
// leaf within the same ZipExpansion mount skip the API call. Across
// outer-ZIP collapse + reopen (which unmounts ZipExpansion entirely) the
// WasmProcessor's cachedLeafDocs is the authoritative cache; both layers
// coexist deliberately.
type LeafState =
| { kind: "closed"; cachedResult?: LeafDiffResult }
| { kind: "loading" }
| { kind: "loaded"; result: LeafDiffResult };
export function ZipExpansion({
entryId,
entries,
depth = 0,
pathPrefix = "",
}: {
entryId: string;
entries: readonly ArchiveEntryResult[];
depth?: number;
// Composite full-path prefix from the outer archive(s), with
// trailing NUL separator. Empty at the top level; for a nested
// ZIP at path `a.zip` it's `a.zip\0`. The leaf row uses
// `pathPrefix + entry.path` as the full path when calling
// `buildArchiveLeafDiff` so the lookup matches the key composed
// in WasmProcessor.stashArchiveLeaves. Without this prefix,
// nested zips with same-named leaves would collide.
pathPrefix?: string;
}): React.JSX.Element {
const { t } = useI18n();
const [visible, setVisible] = useState(VISIBLE_PAGE_SIZE);
const [leafStates, setLeafStates] = useState<Map<string, LeafState>>(
new Map(),
);
// Stable callbacks: depend only on setLeafStates' setter identity
// (constant across renders), so the rows can rely on referential
// identity and we avoid recreating fresh closures per parent render.
// Crucial for ZIPs with many entries — without this, a state update
// re-renders every row (perf finding #14).
const setLeafState = useCallback(
(stateKey: string, next: LeafState): void => {
setLeafStates((prev) => {
const out = new Map(prev);
out.set(stateKey, next);
return out;
});
},
[],
);
const setLeafStateIfStill = useCallback(
(args: {
stateKey: string;
fromKind: LeafState["kind"];
next: LeafState;
}): void => {
setLeafStates((prev) => {
const current = prev.get(args.stateKey);
if ((current ?? { kind: "closed" }).kind !== args.fromKind) {
return prev;
}
const out = new Map(prev);
out.set(args.stateKey, args.next);
return out;
});
},
[],
);
if (depth >= MAX_RENDER_DEPTH) {
return (
<div className="zip-expansion__depth-limit">
{t("zipExpansion.depthLimit")}
</div>
);
}
const shown = entries.slice(0, visible);
const remaining = Math.max(0, entries.length - visible);
const indentLevel = Math.min(depth, INDENT_CAP);
return (
<div className="zip-expansion" data-depth={indentLevel}>
{shown.map((entry, idx) => {
// Index prefix keeps the state key unique when the same archive
// contains duplicate filenames (legal per the ZIP spec).
const stateKey = `${idx}\0${entry.path}`;
const fullPath = pathPrefix + entry.path;
return (
<ZipExpansionRow
key={stateKey}
stateKey={stateKey}
entryId={entryId}
entry={entry}
fullPath={fullPath}
depth={depth}
state={leafStates.get(stateKey) ?? { kind: "closed" }}
setLeafState={setLeafState}
setLeafStateIfStill={setLeafStateIfStill}
t={t}
/>
);
})}
{remaining > 0 && (
<button
type="button"
className="zip-expansion__show-more"
onClick={() => setVisible((v) => v + VISIBLE_PAGE_SIZE)}
>
{t("zipExpansion.showMore").replace("{count}", String(remaining))}
</button>
)}
</div>
);
}
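The guarded setter is effectively a compare-and-set on the per-leaf state map: an async result is applied only if the leaf is still in the state the request started from. The sketch below isolates that updater as a pure function so the stale case is easy to see. `SketchLeafState` and `applyIfStill` are hypothetical names, and the `payload` field stands in for the real `LeafDiffResult`.

```typescript
type SketchLeafState =
  | { kind: "closed" }
  | { kind: "loading" }
  | { kind: "loaded"; payload: string };

// Returns a new Map when the transition applies, or the SAME Map instance
// when the leaf has moved on (so a state setter can treat it as a no-op).
function applyIfStill(
  prev: Map<string, SketchLeafState>,
  stateKey: string,
  fromKind: SketchLeafState["kind"],
  next: SketchLeafState,
): Map<string, SketchLeafState> {
  const current: SketchLeafState = prev.get(stateKey) ?? { kind: "closed" };
  if (current.kind !== fromKind) return prev;
  const out = new Map(prev);
  out.set(stateKey, next);
  return out;
}

const leafKey = "0\0photo.jpg";

// Case 1: the leaf is still loading when the diff resolves.
const whileLoading = new Map<string, SketchLeafState>([
  [leafKey, { kind: "loading" }],
]);
const afterResolve = applyIfStill(whileLoading, leafKey, "loading", {
  kind: "loaded",
  payload: "diff",
});

// Case 2: the user closed the leaf before the diff resolved.
const closedMeanwhile = new Map<string, SketchLeafState>([
  [leafKey, { kind: "closed" }],
]);
const afterStale = applyIfStill(closedMeanwhile, leafKey, "loading", {
  kind: "loaded",
  payload: "diff",
});

console.log(afterResolve.get(leafKey), afterStale === closedMeanwhile);
```

Returning the previous map unchanged in the stale case matters in React: a `useState` updater that returns the identical reference lets React bail out of the update instead of re-rendering.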
function ZipExpansionRow({
stateKey,
entryId,
entry,
fullPath,
depth,
state,
setLeafState,
setLeafStateIfStill,
t,
}: {
stateKey: string;
entryId: string;
entry: ArchiveEntryResult;
fullPath: string;
depth: number;
state: LeafState;
setLeafState: (stateKey: string, next: LeafState) => void;
setLeafStateIfStill: (args: {
stateKey: string;
fromKind: LeafState["kind"];
next: LeafState;
}) => void;
t: (key: string) => string;
}): React.JSX.Element {
// Both "cleaned" and "already-clean" are processable (the bytes stash
// was populated; the diff is meaningful). The distinction is only for
// the displayed status label.
const wasProcessed =
entry.status === "cleaned" || entry.status === "already-clean";
const isCleanedLeaf = wasProcessed && entry.entries === null;
const isNestedZip = wasProcessed && entry.entries !== null;
const isExpandable = isCleanedLeaf || isNestedZip;
const isExpanded =
(isCleanedLeaf && state.kind !== "closed") ||
(isNestedZip && state.kind === "loaded");
async function handleToggle(): Promise<void> {
if (!isExpandable) return;
if (isCleanedLeaf) {
if (state.kind === "closed") {
// Re-open: if we already fetched the result earlier in this
// mount, use the cached value. The processor-level cache also
// covers this case across unmounts; both are correct, this
// just saves an IPC round-trip.
if (state.cachedResult !== undefined) {
setLeafState(stateKey, {
kind: "loaded",
result: state.cachedResult,
});
return;
}
setLeafState(stateKey, { kind: "loading" });
try {
const result = await window.api.wasm.buildArchiveLeafDiff(
entryId,
fullPath,
);
setLeafStateIfStill({
stateKey,
fromKind: "loading",
next: { kind: "loaded", result },
});
} catch {
setLeafStateIfStill({
stateKey,
fromKind: "loading",
next: { kind: "loaded", result: { kind: "failed" } },
});
}
} else {
// Close: carry the result so a subsequent open can skip the
// API call.
setLeafState(
stateKey,
state.kind === "loaded"
? { kind: "closed", cachedResult: state.result }
: { kind: "closed" },
);
}
} else if (isNestedZip) {
// Nested-zip rows toggle between closed and loaded; the "loaded"
// payload is unused — the real content comes from the recursive
// <ZipExpansion>. Using "failed" as the sentinel keeps the loaded
// payload uniform (LeafDiffResult shape).
setLeafState(
stateKey,
state.kind === "loaded"
? { kind: "closed" }
: { kind: "loaded", result: { kind: "failed" } },
);
}
}
function handleKeyDown(e: React.KeyboardEvent): void {
if (!isExpandable) return;
if (e.key === "Enter" || e.key === " ") {
e.preventDefault();
void handleToggle();
}
}
const statusLabel = renderStatus(entry, t);
const rowClass = [
"zip-expansion__row",
`zip-expansion__row--${entry.status}`,
isExpandable ? "zip-expansion__row--expandable" : "",
]
.filter(Boolean)
.join(" ");
return (
<div className="zip-expansion__entry">
<div
className={rowClass}
role={isExpandable ? "button" : undefined}
tabIndex={isExpandable ? 0 : -1}
onClick={isExpandable ? () => void handleToggle() : undefined}
onKeyDown={handleKeyDown}
aria-expanded={isExpandable ? isExpanded : undefined}
>
<div className="zip-expansion__chevron">
{isExpandable && <ChevronIcon expanded={isExpanded} />}
</div>
<div className="zip-expansion__path">{entry.path}</div>
<div className="zip-expansion__status">{statusLabel}</div>
</div>
{isExpanded && isCleanedLeaf && state.kind === "loading" && (
<DiffSkeleton
t={t}
wrapperClassName="zip-expansion__leaf-diff zip-expansion__leaf-diff--skeleton"
/>
)}
{isExpanded &&
isCleanedLeaf &&
state.kind === "loaded" &&
renderLeafBody(state.result, t)}
{isExpanded && isNestedZip && entry.entries !== null && (
<ZipExpansion
entryId={entryId}
entries={entry.entries}
depth={depth + 1}
pathPrefix={`${fullPath}\0`}
/>
)}
</div>
);
}
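The `pathPrefix` threading can be illustrated outside React by walking a small tree and collecting the composite path each leaf would use for its lookup. This is a sketch under assumptions: `SketchArchiveNode` is a pared-down stand-in for `ArchiveEntryResult`, every node with `entries === null` is treated as a leaf, and the sample tree is invented.

```typescript
interface SketchArchiveNode {
  path: string;
  entries: readonly SketchArchiveNode[] | null;
}

// Builds the composite full path for every leaf. Each nesting level
// contributes "<archive path>\0" to the prefix, as described for pathPrefix.
function collectFullPaths(
  nodes: readonly SketchArchiveNode[],
  pathPrefix: string = "",
): string[] {
  const out: string[] = [];
  for (const node of nodes) {
    const fullPath = pathPrefix + node.path;
    if (node.entries === null) {
      out.push(fullPath);
    } else {
      out.push(...collectFullPaths(node.entries, `${fullPath}\0`));
    }
  }
  return out;
}

// Two leaves share the name photo.jpg, one at the top level and one
// inside a nested archive.
const nestedTree: readonly SketchArchiveNode[] = [
  { path: "photo.jpg", entries: null },
  { path: "a.zip", entries: [{ path: "photo.jpg", entries: null }] },
];

const fullPaths: string[] = collectFullPaths(nestedTree);
// Swap NUL for a visible marker when printing.
console.log(fullPaths.map((p) => p.split("\0").join("<NUL>")));
```

Without the prefix both leaves would resolve to `photo.jpg`; with it they stay distinct, which is the collision the prop comment describes.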
// Maps a LeafDiffResult to the rendered body — either the two-pane diff,
// an "already clean" message, or a "diff failed" error. The discriminated
// shape makes the failed-vs-empty distinction explicit so an internal
// error is never rendered as "Already clean".
function renderLeafBody(
result: LeafDiffResult,
t: (key: string) => string,
): React.JSX.Element {
if (result.kind === "failed") {
return (
<div className="zip-expansion__leaf-empty">
{t("zipExpansion.diffFailed")}
</div>
);
}
const { doc } = result;
if (doc.before.length > 0 || doc.after.length > 0) {
return (
<MetadataDiffTable
document={doc}
t={t}
wrapperClassName="zip-expansion__leaf-diff file-table__diff--two-pane"
/>
);
}
return (
<div className="zip-expansion__leaf-empty">
{t("zipExpansion.alreadyClean")}
</div>
);
}
function renderStatus(
entry: ArchiveEntryResult,
t: (key: string) => string,
): string {
switch (entry.status) {
case "cleaned":
return t("zipExpansion.statusCleaned");
case "already-clean":
return t("zipExpansion.statusAlreadyClean");
case "passed-through-unsupported":
return t("zipExpansion.statusUnsupported");
case "directory":
return t("zipExpansion.statusDirectory");
default:
return assertNever({ value: entry.status });
}
}

@ -1,7 +1,7 @@
import { createContext, useContext, useReducer } from "react";
import type { Dispatch, ReactNode } from "react";
import { FileProcessingStatus } from "../../domain";
-import type { MetadataDocument } from "../../domain";
+import type { ArchiveEntryResult, MetadataDocument } from "../../domain";
import { assertNever } from "../../common/types";
export type FolderDiscoveryStatus =
@ -34,6 +34,13 @@ export interface FileEntry {
// a loading skeleton when this is true. Optional so existing entry
// initializers don't need to set it; treated as `false` when absent.
diffPending?: boolean;
// Strategy-emitted non-fatal warnings (currently only ZipStrategy).
// Surfaced as an inline disclosure on the FileRow.
warnings?: readonly string[];
// Recursive tree of inner archive entries (currently populated only by
// ZipStrategy). When present and non-empty, FileRow's expansion area
// renders <ZipExpansion> instead of <MetadataDiffExpansion>.
archiveEntries?: readonly ArchiveEntryResult[];
}
export interface AppState {
@ -53,6 +60,8 @@ export type AppAction =
afterBytes: number;
diffDocument: MetadataDocument | null;
diffPending: boolean;
warnings: readonly string[];
archiveEntries: readonly ArchiveEntryResult[] | null;
}
| {
type: "UPDATE_FILE_DIFF";
@ -100,6 +109,11 @@ export function appReducer(state: AppState, action: AppAction): AppState {
afterBytes: action.afterBytes,
diffDocument: action.diffDocument,
diffPending: action.diffPending,
warnings: action.warnings,
// exactOptionalPropertyTypes: omit when null.
...(action.archiveEntries !== null && {
archiveEntries: action.archiveEntries,
}),
}
: file,
),

@ -4,6 +4,7 @@ import type { FileEntry, AppAction } from "../contexts/AppContext";
import { useAppContext } from "../contexts/AppContext";
import { useI18n } from "./use_i18n";
import { FileProcessingStatus, exceedsCap } from "../../domain";
import type { ArchiveEntryResult } from "../../domain";
import { getCurrentSizeCap } from "../utils/get_size_cap";
import { formatFileSize } from "../utils/format_file_size";
@ -179,14 +180,17 @@ async function processViaWasm({
}
const outputBytes = result.outputBytes ?? 0;
-// A file is "cleaned" if the output is smaller than the input. Post-B.1
-// the per-strategy MetadataItem[] enumeration is gone (ExifTool's read
-// is the diff source); bytesReduced is the only synchronous signal we
-// have at this point. If the file had no removable metadata, output ≥
-// input and we tag NoMetadataFound. The async diff (when it arrives via
-// UPDATE_FILE_DIFF) provides the per-row detail, but doesn't drive the
-// status pill.
+// A file is "cleaned" if the output is smaller than the input, OR if
+// it's an archive and at least one inner entry was cleaned. The byte-
+// size comparison alone is wrong for ZIPs: even with DEFLATE re-
+// encoding, a ZIP containing only binary-compressed files (JPEG, PNG)
+// won't shrink because those entries are already incompressible. Using
+// cleaned-entry count as the authoritative signal for archives gives
+// correct Complete/NoMetadataFound status independent of compression.
const bytesReduced = outputBytes > 0 && outputBytes < entry.size;
const archiveEntries = result.archiveEntries ?? null;
const hasCleanedArchiveEntry =
archiveEntries !== null && hasAnyCleanedEntry(archiveEntries);
// `result.diffDocument` is null when the flag is on (the build will fire
// async, see below) or undefined when the API surface doesn't include it
// (defensive — older shape, older mocks). Treat both as "no diff yet".
@ -201,13 +205,19 @@ async function processViaWasm({
// fired below and dispatches UPDATE_FILE_DIFF when it lands. While
// pending, MetadataDiffExpansion shows a skeleton.
diffPending: ENABLE_EXIFTOOL_DIFF && !hasDiffNow,
// Strategy-emitted fields (currently ZipStrategy only). `result`
// is the WasmApi return shape which defaults warnings to [] and
// archiveEntries to null when the underlying outcome omitted them.
warnings: result.warnings ?? [],
archiveEntries,
});
dispatch({
type: "UPDATE_FILE_STATUS",
id: entry.id,
-status: bytesReduced
-? FileProcessingStatus.Complete
-: FileProcessingStatus.NoMetadataFound,
+status:
+bytesReduced || hasCleanedArchiveEntry
+? FileProcessingStatus.Complete
+: FileProcessingStatus.NoMetadataFound,
});
window.api.files.notifyFileProcessed();
@ -215,7 +225,15 @@ async function processViaWasm({
// inline — zeroperl.wasm runs on the main thread; a "background"
// diff fired here would steal CPU from the next strip iteration.
// See the comment on processFileEntries for the full rationale.
-if (ENABLE_EXIFTOOL_DIFF && !hasDiffNow) {
+//
+// Archive rows (currently only ZIP) don't have a top-level pending
+// diff — WasmProcessor stashes per-leaf inputs instead and the UI
+// builds those lazily via buildArchiveLeafDiff. Enqueuing them here
+// would cause buildDiffDocumentForEntry to return null for every ZIP
+// and dispatch a wasted UPDATE_FILE_DIFF(null) — 100 ZIPs would issue
+// 100 useless IPC round-trips. Skip them.
+const isArchive = archiveEntries !== null;
+if (ENABLE_EXIFTOOL_DIFF && !hasDiffNow && !isArchive) {
diffEntries.push(entry);
}
}
@ -262,19 +280,49 @@ async function buildDiffInBackground({
}
}
// Recursively checks whether any entry in an archive tree has status
// "cleaned" (NOT "already-clean"). Used to determine Complete vs
// NoMetadataFound for ZIPs whose inner files are already compressed
// (JPEG, PNG): re-packing may leave the outer ZIP no smaller even when
// their EXIF is stripped, so byte-size comparison alone can misclassify
// them as "Already clean". "already-clean" entries are explicitly excluded:
// the whole point of that status is that nothing was actually removed.
function hasAnyCleanedEntry(entries: readonly ArchiveEntryResult[]): boolean {
for (const entry of entries) {
if (entry.status === "cleaned") return true;
if (entry.entries !== null && hasAnyCleanedEntry(entry.entries))
return true;
}
return false;
}
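A small worked case may help show how the recursive check combines with the byte-size signal. The sketch below is self-contained and uses assumed names (`SketchArchiveEntry`, `anyEntryCleaned`, `isCompleteSketch`). The sample tree is invented, and the nested archive is deliberately given a non-"cleaned" status purely to exercise the recursion, which may not match what the real strategy emits.

```typescript
type SketchEntryStatus =
  | "cleaned"
  | "already-clean"
  | "passed-through-unsupported"
  | "directory";

interface SketchArchiveEntry {
  path: string;
  status: SketchEntryStatus;
  entries: readonly SketchArchiveEntry[] | null;
}

// True when any entry at any depth has status "cleaned".
function anyEntryCleaned(entries: readonly SketchArchiveEntry[]): boolean {
  return entries.some(
    (e) =>
      e.status === "cleaned" ||
      (e.entries !== null && anyEntryCleaned(e.entries)),
  );
}

// Combines the two signals the same way the status dispatch does:
// a smaller output OR at least one cleaned inner entry.
function isCompleteSketch(
  inputSize: number,
  outputBytes: number,
  entries: readonly SketchArchiveEntry[] | null,
): boolean {
  const bytesReduced = outputBytes > 0 && outputBytes < inputSize;
  return bytesReduced || (entries !== null && anyEntryCleaned(entries));
}

const zipWithCleanedPhoto: readonly SketchArchiveEntry[] = [
  { path: "docs/", status: "directory", entries: null },
  {
    path: "inner.zip",
    status: "already-clean", // hypothetical, chosen to force recursion
    entries: [{ path: "photo.jpg", status: "cleaned", entries: null }],
  },
];

const zipAllClean: readonly SketchArchiveEntry[] = [
  { path: "notes.txt", status: "passed-through-unsupported", entries: null },
  { path: "scan.png", status: "already-clean", entries: null },
];

// Output is the same size as the input in both archive cases.
const completeDespiteSameSize = isCompleteSketch(1000, 1000, zipWithCleanedPhoto);
const nothingRemoved = isCompleteSketch(1000, 1000, zipAllClean);
const plainFileShrunk = isCompleteSketch(1000, 900, null);
console.log({ completeDespiteSameSize, nothingRemoved, plainFileShrunk });
```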
// A diff "has changes" when any entry is removed (present in `before`, absent
// from `after`), modified (same source+name, different value), or added (only
// in `after`). Matches the classification done at render time in
// MetadataDiffExpansion's computeDiffRows.
//
// Uses a multiset comparison (count of each source+name+value tuple) so
// duplicate entries are handled correctly. A naive Map-based check
// collapsed duplicates and produced false negatives — e.g. before=[(A,X),
// (A,X)] vs after=[(A,X),(B,Y)] would report "no changes" because the
// duplicate match consumed both before entries while the (B,Y) addition
// was never inspected.
function diffHasChanges(doc: {
before: readonly { source: string; name: string; value: string }[];
after: readonly { source: string; name: string; value: string }[];
}): boolean {
if (doc.before.length !== doc.after.length) return true;
-const afterByKey = new Map<string, string>();
-for (const e of doc.after) afterByKey.set(`${e.source}\0${e.name}`, e.value);
+const key = (e: { source: string; name: string; value: string }): string =>
+`${e.source}\0${e.name}\0${e.value}`;
+const counts = new Map<string, number>();
 for (const e of doc.before) {
-if (afterByKey.get(`${e.source}\0${e.name}`) !== e.value) return true;
+counts.set(key(e), (counts.get(key(e)) ?? 0) + 1);
 }
+for (const e of doc.after) {
+const k = key(e);
+const c = counts.get(k);
+if (c === undefined || c === 0) return true;
+counts.set(k, c - 1);
+}
return false;
}
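The multiset comparison described in the comment can be sketched on its own, using the duplicate-entry case from that comment as input. Names here (`SketchTuple`, `multisetsDiffer`) and the placeholder source "S" are assumptions for illustration; the sketch keeps only the counting logic.

```typescript
interface SketchTuple {
  source: string;
  name: string;
  value: string;
}

// True when the two lists differ as multisets of (source, name, value).
function multisetsDiffer(
  before: readonly SketchTuple[],
  after: readonly SketchTuple[],
): boolean {
  if (before.length !== after.length) return true;
  const tupleKey = (e: SketchTuple): string =>
    `${e.source}\0${e.name}\0${e.value}`;
  // Count each tuple in "before"...
  const counts = new Map<string, number>();
  for (const e of before) {
    counts.set(tupleKey(e), (counts.get(tupleKey(e)) ?? 0) + 1);
  }
  // ...then consume one count per "after" tuple. Running out means "after"
  // holds a tuple that "before" did not have often enough.
  for (const e of after) {
    const remaining = counts.get(tupleKey(e)) ?? 0;
    if (remaining === 0) return true;
    counts.set(tupleKey(e), remaining - 1);
  }
  // Lengths are equal and every "after" tuple was matched, so the
  // multisets are identical.
  return false;
}

const tupleAX: SketchTuple = { source: "S", name: "A", value: "X" };
const tupleBY: SketchTuple = { source: "S", name: "B", value: "Y" };

// The case from the comment: before=[(A,X),(A,X)], after=[(A,X),(B,Y)].
const duplicateCaseDiffers = multisetsDiffer([tupleAX, tupleAX], [tupleAX, tupleBY]);
// Same tuples in a different order are not a change.
const reorderedDiffers = multisetsDiffer([tupleAX, tupleBY], [tupleBY, tupleAX]);
console.log({ duplicateCaseDiffers, reorderedDiffers });
```

Because the lengths are checked first, a single pass over `after` is enough: any surplus on one side forces a shortfall on the other.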

@ -11,6 +11,7 @@ import "./styles/file_browse_button.css";
import "./styles/error_boundary.css";
import "./styles/file_list.css";
import "./styles/file_table.css";
import "./styles/zip-expansion.css";
import "./styles/folder_row.css";
import "./styles/status_bar.css";
import "./styles/status_icon.css";

@ -0,0 +1,119 @@
/* Tree view for ZIP inner-entry diffs. See ZipExpansion.tsx. */
.zip-expansion {
display: flex;
flex-direction: column;
gap: 4px;
padding: 8px 0;
}
.zip-expansion[data-depth="0"] {
padding-left: 0;
}
.zip-expansion[data-depth="1"] {
padding-left: 16px;
}
.zip-expansion[data-depth="2"] {
padding-left: 32px;
}
.zip-expansion[data-depth="3"] {
padding-left: 48px;
}
.zip-expansion[data-depth="4"] {
padding-left: 64px;
}
.zip-expansion[data-depth="5"] {
padding-left: 80px;
}
.zip-expansion__entry {
display: flex;
flex-direction: column;
}
.zip-expansion__row {
display: grid;
grid-template-columns: 32px 1fr auto;
align-items: center;
gap: 8px;
padding: 6px 12px;
border-radius: 4px;
background: var(--surface-2, rgba(127, 127, 127, 0.05));
}
.zip-expansion__row--expandable {
cursor: pointer;
}
.zip-expansion__row--expandable:hover,
.zip-expansion__row--expandable:focus {
background: var(--surface-3, rgba(127, 127, 127, 0.1));
outline: none;
}
.zip-expansion__chevron {
min-width: 24px;
min-height: 24px;
display: flex;
align-items: center;
justify-content: center;
}
.zip-expansion__path {
font-family: var(--font-mono, ui-monospace, monospace);
font-size: 0.9em;
overflow: hidden;
text-overflow: ellipsis;
white-space: nowrap;
}
.zip-expansion__status {
font-size: 0.85em;
opacity: 0.75;
}
.zip-expansion__row--passed-through-unsupported,
.zip-expansion__row--directory {
opacity: 0.7;
}
.zip-expansion__leaf-diff {
padding: 8px 12px 8px 44px;
}
.zip-expansion__leaf-diff--skeleton {
padding: 12px 12px 12px 44px;
font-style: italic;
opacity: 0.7;
}
.zip-expansion__leaf-empty {
padding: 6px 12px 6px 44px;
font-style: italic;
opacity: 0.65;
font-size: 0.9em;
}
.zip-expansion__show-more {
align-self: flex-start;
margin-top: 6px;
margin-left: 12px;
background: transparent;
border: 1px solid var(--border-1, rgba(127, 127, 127, 0.3));
border-radius: 4px;
padding: 4px 10px;
font-size: 0.85em;
cursor: pointer;
color: inherit;
}
.zip-expansion__show-more:hover {
background: var(--surface-3, rgba(127, 127, 127, 0.1));
}
.zip-expansion__depth-limit {
padding: 6px 12px;
font-style: italic;
opacity: 0.6;
font-size: 0.85em;
}

Binary file not shown.

@ -0,0 +1,122 @@
// ZIP archive support — Web (issue #184)
//
// End-to-end coverage for the ZipStrategy + ZipExpansion tree UI.
// Drops a fixture .zip containing an EXIF-tagged JPEG and a directory
// entry. Asserts:
// - The row becomes expandable on completion.
// - The expansion area renders the ZipExpansion tree (NOT the
// MetadataDiffExpansion two-pane — those are mutually exclusive
// per the FileRow gate).
// - Inner JPEG row is rendered with a chevron; directory row is not.
// - Clicking the inner JPEG row triggers the lazy diff load — a
// DiffSkeleton appears, then swaps for the MetadataDiffTable.
//
// Encrypted-archive refusal is covered by the unit suite in
// tests/infrastructure/wasm/zip_strategy.test.ts; building an encrypted
// fixture for e2e adds complexity without surfacing new UI behavior.
import { test, expect } from "@playwright/test";
import { launchPage } from "./helpers/page_launcher";
import { fixturePath } from "./helpers/fixture_loader";
test.describe("ZIP archive — tree expansion + lazy per-leaf diff", () => {
test.beforeEach(async ({ page }) => {
await launchPage(page);
});
test("expandable row routes to ZipExpansion (not MetadataDiffExpansion)", async ({
page,
isMobile,
browserName,
}) => {
// Same WebKit caveat as metadata_diff.spec.ts: ExifTool diff WASM
// load is unreliable on Playwright's WebKit driver; the static
// tree render is exercised by ZipExpansion unit tests on every
// project.
test.skip(
browserName === "webkit",
"WebKit driver — zeroperl WASM load is unreliable on this driver.",
);
test.setTimeout(45_000);
const input = page.locator(".file-browse-button__input").first();
await input.setInputFiles([fixturePath("sample-zip.zip")], { force: true });
const row = page.locator(".file-table__row--complete").first();
await expect(row).toBeVisible({ timeout: 30_000 });
await expect(row).toHaveClass(/file-table__row--expandable/);
if (isMobile) {
await row.tap();
} else {
await row.click();
}
// ZipExpansion tree is rendered; MetadataDiffExpansion two-pane is NOT.
const tree = page.locator(".zip-expansion");
await expect(tree).toBeVisible();
await expect(page.locator(".file-table__diff--two-pane")).toHaveCount(0);
// Inner JPEG row visible (cleaned status, with chevron).
const jpegRow = tree.locator(".zip-expansion__row--cleaned", {
hasText: "photo.jpg",
});
await expect(jpegRow).toBeVisible();
await expect(jpegRow).toHaveClass(/zip-expansion__row--expandable/);
// Directory row visible, NOT expandable.
const dirRow = tree.locator(".zip-expansion__row--directory");
await expect(dirRow).toBeVisible();
await expect(dirRow).not.toHaveClass(/zip-expansion__row--expandable/);
});
test("clicking an inner-JPEG row loads its diff lazily (skeleton → table)", async ({
page,
isMobile,
browserName,
}) => {
test.skip(
browserName === "webkit",
"WebKit driver — zeroperl WASM load is unreliable on this driver.",
);
test.setTimeout(60_000);
const input = page.locator(".file-browse-button__input").first();
await input.setInputFiles([fixturePath("sample-zip.zip")], { force: true });
const row = page.locator(".file-table__row--complete").first();
await expect(row).toBeVisible({ timeout: 30_000 });
if (isMobile) {
await row.tap();
} else {
await row.click();
}
const jpegRow = page
.locator(".zip-expansion__row--cleaned", { hasText: "photo.jpg" })
.first();
await expect(jpegRow).toBeVisible();
if (isMobile) {
await jpegRow.tap();
} else {
await jpegRow.click();
}
// On a cold session the skeleton appears first; on a warm session
// (subsequent leaf in the same test) the table appears directly.
// Either way, the table eventually shows. Allow up to 30s for the
// first-of-session WASM warm-up.
const table = page.locator(".zip-expansion__leaf-diff").first();
await expect(table).toBeVisible({ timeout: 30_000 });
// The two-pane diff content uses the same .file-table__diff-pair
// classes as the top-level diff. At least one row should classify
// as "removed" because the inner JPEG in the sample-zip.zip fixture
// carries an EXIF Artist sentinel that the JPEG strategy drops.
await expect(
page.locator(".file-table__diff-pair--removed").first(),
).toBeVisible({ timeout: 30_000 });
});
});

Binary file not shown.

@ -0,0 +1,126 @@
import { describe, it, expect } from "vitest";
import { renderToStaticMarkup } from "react-dom/server";
import { I18nContext } from "../../../src/web/contexts/I18nContext";
import { FileRow } from "../../../src/web/components/file-list/FileRow";
import { FileProcessingStatus } from "../../../src/domain";
import type { FileEntry } from "../../../src/web/contexts/AppContext";
import type { MetadataDocument } from "../../../src/domain";
// Minimal i18n stub; FileRow only looks up a small set of keys for the
// expansion area, none of which are interpolated.
function wrap(children: React.ReactNode): React.JSX.Element {
return (
<I18nContext.Provider
value={{
t: (key: string) => key,
locale: "en",
isLoading: false,
}}
>
{children}
</I18nContext.Provider>
);
}
function makeEntry(overrides: Partial<FileEntry>): FileEntry {
return {
id: "test-id",
path: "/test.jpg",
name: "test.jpg",
extension: ".jpg",
size: 100,
folder: null,
relativePath: null,
status: FileProcessingStatus.NoMetadataFound,
afterBytes: 100,
error: null,
diffDocument: null,
...overrides,
};
}
function render(file: FileEntry): string {
// FileRow reads `.current` synchronously during render to check
// shouldAnimateCheck — must be initialised to a Set, not null.
const animatedCheckRef: React.RefObject<Set<string>> = {
current: new Set<string>(),
};
return renderToStaticMarkup(
wrap(
<FileRow
file={file}
isExpanded={true}
onToggleExpand={() => {}}
staggerIndex={0}
animatedCheckRef={animatedCheckRef}
onCopyToast={() => {}}
/>,
),
);
}
describe("FileRow — NoMetadataFound expansion routing", () => {
it("renders generic 'noMetadataFound' text when no diff is available or pending", () => {
const html = render(
makeEntry({
status: FileProcessingStatus.NoMetadataFound,
diffDocument: null,
diffPending: false,
}),
);
// The static-text branch uses the file-table__expansion-empty class.
expect(html).toContain("file-table__expansion-empty");
expect(html).toContain("noMetadataFound");
});
it("renders MetadataDiffExpansion (skeleton) when diffPending is true", () => {
// Skeleton path: diffPending true and diffDocument null.
const html = render(
makeEntry({
status: FileProcessingStatus.NoMetadataFound,
diffDocument: null,
diffPending: true,
}),
);
// Skeleton renders a known marker class.
expect(html).toContain("file-table__diff--skeleton");
});
it("renders MetadataDiffExpansion (diff table) when diffDocument has content", () => {
// File preserved orientation: the diff keeps the EXIF Orientation row
// ("Rotate 90 CW", raw value 6) unchanged on both sides.
const diffDocument: MetadataDocument = {
before: [{ source: "EXIF", name: "Orientation", value: "Rotate 90 CW" }],
after: [{ source: "EXIF", name: "Orientation", value: "Rotate 90 CW" }],
};
const html = render(
makeEntry({
status: FileProcessingStatus.NoMetadataFound,
diffDocument,
diffPending: false,
}),
);
// Two-pane table marker class — confirms MetadataDiffExpansion rendered.
expect(html).toContain("file-table__diff--two-pane");
// Must NOT fall through to the generic text branch.
expect(html).not.toContain("file-table__expansion-empty");
});
it("renders MetadataDiffExpansion (already-clean message) when diff has empty before+after", () => {
// File was truly clean: ExifTool returned empty arrays.
const diffDocument: MetadataDocument = { before: [], after: [] };
const html = render(
makeEntry({
status: FileProcessingStatus.NoMetadataFound,
diffDocument,
diffPending: false,
}),
);
// MetadataDiffExpansion renders the already-clean message in its own
// expansion-empty span (different from FileRow's generic noMetadataFound).
expect(html).toContain("file-table__expansion-empty");
expect(html).toContain("zipExpansion.alreadyClean");
// The generic noMetadataFound key from FileRow's fallback branch
// must NOT appear — routing should send us to MetadataDiffExpansion.
expect(html).not.toContain("noMetadataFound");
});
});
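
The four cases above pin down a routing decision for `NoMetadataFound` rows. A minimal sketch of that decision as a pure function, assuming it depends only on `diffDocument` and `diffPending`; `routeNoMetadataExpansion` and the branch labels are hypothetical names for illustration, not taken from `FileRow`:

```typescript
// Local stand-ins for the domain types, trimmed to what the tests use.
interface MetadataRow {
  source: string;
  name: string;
  value: string;
}
interface MetadataDocument {
  before: MetadataRow[];
  after: MetadataRow[];
}

// Hypothetical labels for the four branches the tests distinguish.
type ExpansionBranch =
  | "generic-empty" // FileRow's static noMetadataFound text
  | "skeleton" // diff still loading
  | "already-clean" // diff loaded, both sides empty
  | "two-pane"; // diff loaded, rows to show

function routeNoMetadataExpansion(
  diffDocument: MetadataDocument | null,
  diffPending: boolean,
): ExpansionBranch {
  if (diffDocument === null) {
    // No document yet: either it is on its way, or it never will be.
    return diffPending ? "skeleton" : "generic-empty";
  }
  const bothSidesEmpty =
    diffDocument.before.length === 0 && diffDocument.after.length === 0;
  return bothSidesEmpty ? "already-clean" : "two-pane";
}
```

Keeping the decision in one function like this would make the mutual exclusivity of the branches explicit, which is the property the `not.toContain` assertions check.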

@ -151,9 +151,11 @@ describe("MetadataDiffExpansion — two-pane diff", () => {
expect(html).toBe("");
});
-it("renders nothing when diffDocument has empty before", () => {
+it("renders the alreadyClean message when diffDocument has empty before and after", () => {
const html = renderTwoPane({ before: [], after: [] });
-expect(html).toBe("");
+// Both sides empty means the file was already clean — show a message
+// rather than a blank expansion area (avoids ambiguity with diff failure).
+expect(html).toContain("file-table__expansion-empty");
});
it("renders a skeleton while diffPending is true and diffDocument is null", () => {

@ -0,0 +1,179 @@
import { describe, it, expect } from "vitest";
import { renderToStaticMarkup } from "react-dom/server";
import { I18nContext } from "../../../src/web/contexts/I18nContext";
import { ZipExpansion } from "../../../src/web/components/file-list/ZipExpansion";
import type { ArchiveEntryResult } from "../../../src/domain";
const DICT: Record<string, string> = {
"zipExpansion.statusCleaned": "Cleaned",
"zipExpansion.statusUnsupported": "Unsupported — passed through",
"zipExpansion.statusDirectory": "Directory",
"zipExpansion.showMore": "Show {count} more entries",
"zipExpansion.depthLimit": "Depth limit reached — drop the inner file directly",
"zipExpansion.noMetadata": "No metadata detected",
"zipExpansion.diffFailed": "Couldn't load diff — internal error",
diffPaneBefore: "Before",
diffPaneAfter: "After",
diffSkeletonLoading: "Loading metadata reader…",
};
function wrap(children: React.ReactNode): React.JSX.Element {
return (
<I18nContext.Provider
value={{
t: (key: string) => DICT[key] ?? key,
locale: "en",
isLoading: false,
}}
>
{children}
</I18nContext.Provider>
);
}
function makeLeaf(
path: string,
overrides: Partial<ArchiveEntryResult> = {},
): ArchiveEntryResult {
return {
path,
status: "cleaned",
sourceBytes: new Uint8Array([1, 2, 3]),
strippedBytes: new Uint8Array([1, 2]),
walkerEntries: [],
entries: null,
warnings: [],
...overrides,
};
}
function render(entries: readonly ArchiveEntryResult[], depth = 0): string {
return renderToStaticMarkup(
wrap(<ZipExpansion entryId="entry-1" entries={entries} depth={depth} />),
);
}
describe("ZipExpansion — entry rendering", () => {
it("renders one row per entry with the correct path", () => {
const html = render([
makeLeaf("photo.jpg"),
makeLeaf("folder/doc.pdf"),
]);
expect(html).toContain("photo.jpg");
expect(html).toContain("folder/doc.pdf");
});
it("renders cleaned-leaf rows with a chevron and 'Cleaned' status", () => {
const html = render([makeLeaf("photo.jpg")]);
expect(html).toContain("zip-expansion__row--cleaned");
expect(html).toContain("Cleaned");
// Chevron icon should be present (SVG path inside the chevron slot)
expect(html).toContain("zip-expansion__chevron");
});
// Encrypted archives are refused upfront in v1 (see spec §3 +
// gap-analysis encrypted-entry row), so no ArchiveEntryResult is
// ever produced with that status. The "passed-through-encrypted"
// variant is intentionally absent from ArchiveEntryStatus — a
// future byte-level walker would re-add both the variant and the
// corresponding render branch.
it("renders unsupported-entry rows with no chevron", () => {
const html = render([
makeLeaf("data.bin", {
status: "passed-through-unsupported",
sourceBytes: null,
strippedBytes: null,
}),
]);
expect(html).toContain("zip-expansion__row--passed-through-unsupported");
expect(html).toContain("Unsupported — passed through");
expect(html).not.toContain("zip-expansion__row--expandable");
});
it("renders directory-entry rows with no chevron", () => {
const html = render([
makeLeaf("folder/", {
status: "directory",
sourceBytes: null,
strippedBytes: null,
}),
]);
expect(html).toContain("zip-expansion__row--directory");
expect(html).toContain("Directory");
expect(html).not.toContain("zip-expansion__row--expandable");
});
});
describe("ZipExpansion — pagination", () => {
it("renders the first 100 entries eagerly with no show-more button when ≤100", () => {
const entries = Array.from({ length: 50 }, (_, i) =>
makeLeaf(`file-${i}.txt`),
);
const html = render(entries);
expect(html).not.toContain("zip-expansion__show-more");
});
it("emits a 'Show N more entries' button when the archive has > 100 entries", () => {
const entries = Array.from({ length: 150 }, (_, i) =>
makeLeaf(`file-${i}.txt`),
);
const html = render(entries);
expect(html).toContain("zip-expansion__show-more");
// 150 - 100 = 50 remaining; interpolated into the label.
expect(html).toContain("Show 50 more entries");
// First 100 entries are rendered eagerly; entry index 99 is the
// last visible, index 100 is hidden.
expect(html).toContain("file-99.txt");
expect(html).not.toContain(">file-100.txt<");
});
});
describe("ZipExpansion — depth limit", () => {
it("renders the depth-limit message instead of rows at depth >= 20", () => {
const html = render([makeLeaf("photo.jpg")], 20);
expect(html).toContain("zip-expansion__depth-limit");
expect(html).toContain(
"Depth limit reached — drop the inner file directly",
);
expect(html).not.toContain("photo.jpg");
});
it("renders entries normally at depth 19", () => {
const html = render([makeLeaf("photo.jpg")], 19);
expect(html).not.toContain("zip-expansion__depth-limit");
expect(html).toContain("photo.jpg");
});
});
describe("ZipExpansion — nested ZIP entries", () => {
it("renders nested-zip rows with chevron (no diff load on render)", () => {
const html = render([
makeLeaf("inner.zip", {
status: "cleaned",
entries: [makeLeaf("inner/photo.jpg")],
}),
]);
expect(html).toContain("inner.zip");
// Nested zip rows are expandable.
expect(html).toContain("zip-expansion__row--expandable");
// The inner row is NOT rendered until the parent is expanded
// (state.kind starts as "closed"). In a static render, "inner/photo.jpg"
// stays hidden.
expect(html).not.toContain("inner/photo.jpg");
});
});
describe("ZipExpansion — indent depth", () => {
it("caps the indent at level 5 even for deeper trees", () => {
const html6 = render([makeLeaf("photo.jpg")], 6);
// data-depth attribute caps at INDENT_CAP=5.
expect(html6).toContain('data-depth="5"');
});
it("uses depth as data-depth for levels 0..5", () => {
expect(render([makeLeaf("a")], 0)).toContain('data-depth="0"');
expect(render([makeLeaf("a")], 3)).toContain('data-depth="3"');
expect(render([makeLeaf("a")], 5)).toContain('data-depth="5"');
});
});
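
The pagination, depth-limit, and indent tests above together describe three numeric rules. A minimal sketch of those rules as plain functions, assuming they can be computed from the entry count and depth alone; `INDENT_CAP` is named in the test comments, while `PAGE_SIZE`, `DEPTH_LIMIT`, `planExpansion`, and `showMoreLabel` are hypothetical names chosen to match the asserted values:

```typescript
// Hypothetical constants. INDENT_CAP appears in the test comments; the
// other two names are assumptions matching the numbers the tests assert.
const PAGE_SIZE = 100;
const DEPTH_LIMIT = 20;
const INDENT_CAP = 5;

type ExpansionPlan =
  | { kind: "depth-limit" }
  | {
      kind: "rows";
      visibleCount: number;
      remainingCount: number;
      dataDepth: number;
    };

function planExpansion(entryCount: number, depth: number): ExpansionPlan {
  // At or past the limit, no rows are rendered at all.
  if (depth >= DEPTH_LIMIT) {
    return { kind: "depth-limit" };
  }
  // Only the first page is rendered eagerly; the rest sits behind "show more".
  const visibleCount = Math.min(entryCount, PAGE_SIZE);
  return {
    kind: "rows",
    visibleCount,
    remainingCount: entryCount - visibleCount,
    // Indentation stops growing after INDENT_CAP levels.
    dataDepth: Math.min(depth, INDENT_CAP),
  };
}

// The test i18n stub returns the raw template, so the "{count}"
// placeholder has to be substituted by the caller.
function showMoreLabel(template: string, remainingCount: number): string {
  return template.replace("{count}", String(remainingCount));
}
```

For a 150-entry archive at depth 6, this plan shows 100 rows, leaves 50 behind the button, and reports a `data-depth` of 5, which lines up with the assertions in the tests.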

@ -162,6 +162,10 @@ describe("processFileEntries", () => {
// hook marks pending=true so the row will render a skeleton until
// UPDATE_FILE_DIFF lands.
diffPending: true,
+// New strategy-emitted fields default to [] / null when the
+// mocked WasmApi return omits them.
+warnings: [],
+archiveEntries: null,
});
});
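
The expectation above implies that omitted strategy fields are normalized before they reach the file entry. A minimal sketch of one way to do that, assuming nullish coalescing is enough; `StrategyResult`, `NormalizedFields`, and `normalizeStrategyFields` are hypothetical names, and the element types are placeholders since the diff does not show them:

```typescript
// Hypothetical shapes: only the field names `warnings` and
// `archiveEntries` come from the test; the rest is illustrative.
interface StrategyResult {
  warnings?: string[];
  archiveEntries?: unknown[] | null;
}

interface NormalizedFields {
  warnings: string[];
  archiveEntries: unknown[] | null;
}

function normalizeStrategyFields(result: StrategyResult): NormalizedFields {
  return {
    // A missing warnings list becomes an empty list.
    warnings: result.warnings ?? [],
    // A missing archive listing becomes null (not an archive).
    archiveEntries: result.archiveEntries ?? null,
  };
}
```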

1340
tools/forensic/zip.ts Normal file

File diff suppressed because it is too large