Phase B ships a static-asset web build that processes files entirely in the browser via WASM and pure-TS strategies — no server, no upload, fully offline. Same renderer code path as Electron; WASM strategies plug in via the existing `FormatStrategy` interface, and `platform.isWeb` drives routing.
## Web build infrastructure
- `vite.config.web.ts` + dedicated entry `src/web/main.tsx`
- Web adapters (`FileRegistry`, `BrowserFileBytes`, `makeWebApi`) mirror the Electron preload contract so the renderer is unchanged
- File picker button (`FileBrowseButton`) gated on `platform.isWeb`; rendered both in the empty state and in the StatusBar so users can add files mid-batch without clearing
## PWA + deployment
- `manifest.webmanifest`, service worker via `vite-plugin-pwa` (Workbox)
- `Dockerfile` (multi-stage Node 22 → nginx:alpine) with `nginx.conf` carrying COOP/COEP/CSP/cache headers
- Cloudflare Pages: `public/_headers` mirrors the nginx config; deploy workflow shipped at `.github/workflows/deploy-web.yml` (gated to `workflow_dispatch` until secrets are wired)
- Unified deploy guide at `docs/deploying.md` covering Cloudflare Pages, Docker + Caddy, Docker + nginx + certbot, Cloudflare Tunnel, and Tailscale Funnel
## JPEG strategy — replaces piexifjs entirely
Hand-rolled marker walker at `src/infrastructure/wasm/strategies/jpeg_strategy.ts`. Mirrors ExifTool's `-all=` policy with two deliberate exceptions: APP14 (Adobe DCT) always kept, APP2 (ICC) kept on opt-in via `preserveColorProfile`. Drops APP0/APP1/APP3–APP13/APP15 + COM. Entropy-stream byte-stuffing and RST markers preserved verbatim.
- Fill-byte tolerance per T.81 §B.1.1.2
- Truncation surfaces as `Result<_, ExifError>` rather than silently emitting a no-EOI file
- Removes `piexifjs` and `@types/piexifjs` as production deps
- Closes the prior `TextDecoder("latin1")` corruption bug (WHATWG aliases `latin1` to windows-1252, so 0x80–0x9F bytes were silently rewritten)
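
A minimal sketch of the marker-walk policy described above. This is not the code in `jpeg_strategy.ts`: `stripJpegSegments`, `JpegStripResult`, and the error strings are hypothetical stand-ins (the real strategy returns `Result<_, ExifError>`), and only the `preserveColorProfile` flag name comes from this description.

```typescript
// Illustrative only: names and error strings are assumptions, not the project's API.
interface JpegStripOptions {
  preserveColorProfile: boolean; // keep APP2 (ICC) when true
}

type JpegStripResult =
  | { ok: true; bytes: Uint8Array }
  | { ok: false; error: string };

// Policy from the description: keep APP14, keep APP2 on opt-in, drop other APPn and COM.
function shouldKeepSegment(marker: number, opts: JpegStripOptions): boolean {
  if (marker === 0xfe) return false; // COM
  if (marker === 0xee) return true; // APP14 (Adobe)
  if (marker === 0xe2) return opts.preserveColorProfile; // APP2 (ICC)
  if (marker >= 0xe0 && marker <= 0xef) return false; // remaining APPn
  return true; // DQT, DHT, SOFn, DRI, SOS, ...
}

function concatChunks(chunks: Uint8Array[]): Uint8Array {
  let total = 0;
  for (const chunk of chunks) total += chunk.length;
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

function stripJpegSegments(input: Uint8Array, opts: JpegStripOptions): JpegStripResult {
  const view = new DataView(input.buffer, input.byteOffset, input.byteLength);
  const n = input.length;
  if (n < 2 || view.getUint16(0) !== 0xffd8) return { ok: false, error: "missing SOI" };
  const kept: Uint8Array[] = [input.subarray(0, 2)];
  let pos = 2;
  while (pos < n) {
    if (view.getUint8(pos) !== 0xff) return { ok: false, error: `expected marker at offset ${pos}` };
    while (pos < n && view.getUint8(pos) === 0xff) pos++; // tolerate 0xFF fill bytes
    if (pos >= n) break;
    const marker = view.getUint8(pos);
    pos++;
    if (marker === 0xd9) {
      kept.push(new Uint8Array([0xff, 0xd9])); // EOI: the stream is complete
      return { ok: true, bytes: concatChunks(kept) };
    }
    if (marker === 0x01 || (marker >= 0xd0 && marker <= 0xd7)) {
      kept.push(new Uint8Array([0xff, marker])); // standalone marker, no length field
      continue;
    }
    if (pos + 2 > n) break;
    const len = view.getUint16(pos); // big-endian, counts its own two bytes
    if (len < 2 || pos + len > n) break;
    let end = pos + len;
    if (marker === 0xda) {
      // SOS: extend over the entropy-coded data, leaving 0xFF00 stuffing and RSTn untouched.
      while (end < n) {
        if (view.getUint8(end) === 0xff && end + 1 < n) {
          const next = view.getUint8(end + 1);
          const inScan = next === 0x00 || next === 0xff || (next >= 0xd0 && next <= 0xd7);
          if (!inScan) break; // a real marker ends this scan
        }
        end++;
      }
    }
    if (shouldKeepSegment(marker, opts)) {
      kept.push(new Uint8Array([0xff, marker]), input.subarray(pos, end));
    }
    pos = end;
  }
  return { ok: false, error: "truncated input: no EOI" };
}
```

Running out of bytes before EOI falls through to the error branch, which is the behaviour the truncation bullet describes: the caller gets a failure instead of a file with no EOI.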
## PDF strategy — rewrites the previous "setX('')" pass
`pdf-lib` with `updateMetadata: false` to defeat auto-stamp of Producer / ModDate / Creator / CreationDate. Direct Info-dict key deletion. Drops the catalog `/Metadata` reference *and* the indirect XMP stream object (orphan-free). Scrubs annotation `/T`, `/Contents`, `/M`, `/CreationDate`, `/RC`, `/Subj`. Drops `/Lang`, `/PageLabels`, `/OutputIntents`, per-page `/Metadata`, per-page `/Thumb`. AcroForm and `/EmbeddedFiles` deferred behind opt-ins.
Also corrects the previous "pdf-lib re-injects Producer" claim in source comments + README — it does not (verified empirically; the string is an in-memory `getProducer()` fallback, not file content).
## Documentation pattern
Three companion folders for each format:
- `docs/gap-analysis/<format>.md` — current vs reference vs theoretical, *before* implementation. Contains `pdf.md` and `jpeg.md`.
- `docs/poc/<approach>.md` — library evaluations. Contains `little-exif-wasm.md` and `exiv2-wasm.md` (both ruled out — full size + coverage data).
- `docs/forensic/<format>.md` — adversarial recovery tests *after* implementation, with reproducible runner under `tools/forensic/`. Contains `pdf.md` (zero sentinel survival across `strings`, `exiftool -PDF-update:all=`, `qpdf --qdf`, and in-process pdf-lib indirect-object walk; ExifTool's own output leaks 8/10 sentinels via the same battery).
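
The raw-byte part of a sentinel-survival check can be sketched as below. This is a hedged illustration, not the runner under `tools/forensic/`: the function names are hypothetical, and it assumes sentinels are plain ASCII strings that may appear on disk either as ASCII or as UTF-16BE. Like `strings`, it does not inflate compressed streams, so it only covers one step of the battery.

```typescript
// Hypothetical helpers: search raw file bytes for planted sentinel values.
function asciiBytes(text: string): Uint8Array {
  const out = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) out[i] = text.charCodeAt(i) & 0xff;
  return out;
}

function utf16beBytes(text: string): Uint8Array {
  const out = new Uint8Array(text.length * 2);
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    out[i * 2] = unit >> 8; // high byte first (big-endian)
    out[i * 2 + 1] = unit & 0xff;
  }
  return out;
}

// Naive byte search; fine for small fixtures.
function indexOfBytes(haystack: Uint8Array, needle: Uint8Array): number {
  if (needle.length === 0) return 0;
  outer: for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    return i;
  }
  return -1;
}

// Returns the sentinels that are still present in either encoding.
function survivingSentinels(fileBytes: Uint8Array, sentinels: string[]): string[] {
  return sentinels.filter(
    (sentinel) =>
      indexOfBytes(fileBytes, asciiBytes(sentinel)) >= 0 ||
      indexOfBytes(fileBytes, utf16beBytes(sentinel)) >= 0,
  );
}
```

An empty result from `survivingSentinels` on the cleaned output corresponds to "zero sentinel survival" for the uncompressed portion of the file only.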
## CI
New workflow runs lint + typecheck + tests + electron compile + Playwright e2e on PRs and master. Platform builds compile on PRs but skip artifact upload to avoid the GitHub Actions storage quota; uploads only happen on master pushes.
## Deferred (tracked as issues)
- HEIC strategy for mobile (issue #48 — most-hit format on iPhone)
- Mobile touch UX pass (#49)
- iOS Photos picker UX note (#50)
- Unsupported-format messaging (#51)
- PWA install prompt UX (#52)
- `preserveOrientation` honoring inside JPEG (Phase 2 of `docs/gap-analysis/jpeg.md`)
- Forensic comparison-corpus runs for JPEG and other formats
## Stats
- 369/369 unit tests, lint clean, typecheck clean
- Electron build: ~640 KB renderer bundle
- Web build: 1.1 MB precache (PWA shell)
- 30 commits on the branch (squashed into this one)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
PDF metadata-stripping gap analysis
Date: 2026-05-06
Goal: Compare what pdf-lib actually clears today (in src/infrastructure/wasm/strategies/pdf_strategy.ts) against what ExifTool clears, and against what is theoretically possible with a hand-rolled rewrite. Drives the decision on whether to replace pdf-lib, hand-roll, or keep it for an explicitly-scoped first PR.
Methodology
Read:
- `src/infrastructure/wasm/strategies/pdf_strategy.ts`: the current implementation.
- `tests/infrastructure/wasm/pdf_strategy.test.ts`: what is currently asserted.
- `tests/fixtures/wasm/pdf/sample.pdf`: the test fixture (901 bytes; produced by pdf-lib; Title/Author/Subject/Creator/Producer set).
- `README.md`: "File writer limitations" + Format Support Matrix footnote 3.
- ExifTool docs at https://exiftool.org/#limitations and https://exiftool.org/TagNames/PDF.html.
Ran (all in /tmp/pdf-poc/, nothing added to package.json):
- Generated rich PDFs with pdf-lib 1.17.1 + supplemented with `exiftool -XMP-*=` to inject an XMP stream.
- Generated a PDF with annotations (`/T` author, `/Contents`, `/CreationDate`) via low-level pdf-lib `ctx.obj()`.
- Stripped each fixture three ways: pdf-lib (replicating `pdf_strategy.ts` exactly), `exiftool -all= -overwrite_original`, and `gs -sDEVICE=pdfwrite` (as a "structurally clean" baseline).
- Diffed each output with `exiftool -a -G1 -s`, raw `strings`, `xxd`, and a custom inflater that walks `/Length`-declared streams and decompresses the FlateDecode object streams to reveal the actual on-disk dictionary contents.
- Bypass test: tried `doc.catalog.delete(PDFName.of('Metadata'))` to see if pdf-lib can drop the XMP catalog reference, and what that does to the orphaned stream object.
Verified facts about the current pdf-lib behaviour
The existing `pdf_strategy.ts` calls `setTitle`/`setAuthor`/`setSubject`/`setKeywords`/`setProducer`/`setCreator` with empty values and saves. Empirically, on pdf-lib 1.17.1:

- All six Info-dict fields end up as `<FEFF>` (UTF-16 BOM with no data) on disk. Verified by inflating the object stream: `/Producer <FEFF> /ModDate (D:20260505214006Z) /Creator <FEFF> /CreationDate (D:20240115103000Z) /Title <FEFF> /Author <FEFF> /Subject <FEFF> /Keywords <FEFF>`
- The `pdf_strategy.ts` source comment ("pdf-lib re-injects 'pdf-lib (...)' as Producer on every save") and the `pdf_strategy.test.ts` assertion `expect(cleaned.getProducer()).toContain("pdf-lib")` are misleading. `setProducer("")` in pdf-lib 1.17.1 does write an empty Producer to the on-disk Info dict: `exiftool -Producer fixture-stripped.pdf` returns an empty string. What the test is observing is `doc.getProducer()` returning the string `"pdf-lib (...)"`. That is pdf-lib's in-memory default fallback when Producer is read back from a previously-saved-and-reloaded doc, not a value present in the file. README footnote 3 inherits the same misconception. Neither blocks the strip in practice, but the comment + test + footnote should be corrected.
- `/CreationDate` is not clearable through pdf-lib's API: there is no `setCreationDate(undefined)` or `clearCreationDate`. It survives every strip from the current code.
- `/ModDate` is rewritten to the current time on every save. This is a privacy regression: the cleaned file leaks "this file was metadata-stripped at YYYY-MM-DD hh:mm:ss", a fingerprint that wasn't in the input. The old ModDate is gone, but a new one is silently added.
- pdf-lib does not use incremental updates on save; it rewrites the file structure cleanly with a single `xref` and one `%%EOF`. This is a structural advantage over ExifTool. (No `/Prev` key in the trailer, no `%BeginExifToolUpdate` marker.)
- pdf-lib does not write a `/ID` array in the trailer at all. Neither input fingerprint propagates, and no new ID is added.
- pdf-lib preserves any `/Metadata` (XMP) stream referenced from the catalog as-is: every Dublin Core, XMP, and pdf:* property survives untouched.
- `doc.catalog.delete(PDFName.of('Metadata'))` removes the catalog reference but leaves the XMP stream as an orphaned (unreferenced) object that still exists in the file body. `exiftool` no longer surfaces it, but `strings` and any forensic walker reading the raw object table will. Without a true "garbage-collect orphans + rewrite" pass, dereferencing alone is no better than ExifTool's incremental-update trick.
- pdf-lib does not touch annotation `/T` (author), `/Contents`, `/M`, or `/CreationDate`. It does not touch page-level `/Metadata`, `/Thumb`, `/Names`, `/EmbeddedFiles`, `/AcroForm`, or `/PageLabels`.
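
To make the `<FEFF>` observation concrete: a PDF hex string that holds only the UTF-16BE byte order mark decodes to empty text, so the key is still present but carries no value. The decoder below is a minimal sketch with a hypothetical name (`decodePdfHexString`). It assumes hex strings without a BOM can be read one byte per character, which only approximates PDFDocEncoding.

```typescript
// Hypothetical helper: decode a PDF hex string token such as "<FEFF00480069>".
function decodePdfHexString(token: string): string {
  const match = /^<([0-9a-fA-F\s]*)>$/.exec(token);
  if (match === null) throw new Error("not a PDF hex string");
  let hex = (match[1] ?? "").replace(/\s+/g, ""); // whitespace inside <...> is ignored
  if (hex.length % 2 === 1) hex += "0"; // an odd final digit is treated as if followed by 0
  // A leading FEFF marks UTF-16BE text; otherwise read one byte per character (assumption).
  const isUtf16 = hex.slice(0, 4).toUpperCase() === "FEFF";
  const unitWidth = isUtf16 ? 4 : 2;
  let text = "";
  for (let i = isUtf16 ? 4 : 0; i + unitWidth <= hex.length; i += unitWidth) {
    text += String.fromCharCode(parseInt(hex.slice(i, i + unitWidth), 16));
  }
  return text;
}
```

`decodePdfHexString("<FEFF>")` yields an empty string, which is consistent with `exiftool -Producer` reporting an empty value for the stripped fixture.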
Verified facts about ExifTool
`exiftool -all= -overwrite_original` on the same fixture:

- Emits the warning: `Warning: [minor] ExifTool PDF edits are reversible. Deleted tags may be recovered!`
- Adds an incremental update: original objects (containing every Info-dict and XMP value) stay in the file, plus a new Catalog object that hides them. The file ends with two `%%EOF` markers, two `startxref`, and the literal markers `%BeginExifToolUpdate ... %EndExifToolUpdate`.
- `strings cleaned.pdf` returns clean output because the original Info dict is still in a FlateDecoded stream (and the XMP stream is also still present), but inflating reveals everything intact: `/Producer <FEFF...SecretInternalToolProducerStamp>`, `/Title <FEFF...OriginalTitle>`, etc.
- `-PDF-update:all=` (an ExifTool-specific pseudo-tag) only removes ExifTool's own update layers, i.e. it reverts a previous strip rather than performing a stronger one.
- Annotation `/T`, `/Contents`, etc., are unaffected.
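
The structural fingerprints listed above (multiple `%%EOF` and `startxref` markers, a `/Prev` chain, the ExifTool update comments) can be checked with a plain byte scan. The sketch below is a heuristic with hypothetical names, not part of the project. It can miss markers inside compressed streams and can false-positive on marker-like text in page content.

```typescript
// Hypothetical heuristic: look for signs that a PDF carries incremental-update history.
interface UpdateFingerprint {
  eofCount: number;
  startxrefCount: number;
  hasPrev: boolean;
  hasExifToolMarker: boolean;
}

// One character per byte. This avoids TextDecoder("latin1"), which the WHATWG
// Encoding standard aliases to windows-1252.
function bytesToBinaryString(bytes: Uint8Array): string {
  let text = "";
  bytes.forEach((byte) => {
    text += String.fromCharCode(byte);
  });
  return text;
}

function countOccurrences(text: string, needle: string): number {
  let count = 0;
  let from = text.indexOf(needle);
  while (from !== -1) {
    count++;
    from = text.indexOf(needle, from + needle.length);
  }
  return count;
}

function inspectUpdateHistory(pdf: Uint8Array): UpdateFingerprint {
  const raw = bytesToBinaryString(pdf);
  return {
    eofCount: countOccurrences(raw, "%%EOF"),
    startxrefCount: countOccurrences(raw, "startxref"),
    hasPrev: /\/Prev\s+\d+/.test(raw),
    hasExifToolMarker: raw.indexOf("%BeginExifToolUpdate") !== -1,
  };
}
```

A single-revision rewrite should report one `%%EOF`, one `startxref`, no `/Prev`, and no ExifTool marker.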
Quoting ExifTool's docs verbatim:
"PDF - The original metadata is never actually deleted." (https://exiftool.org/#limitations)
"All metadata edits are reversible. While this would normally be considered an advantage, it is a potential security problem because old information is never actually deleted from the file." (https://exiftool.org/TagNames/PDF.html)
"[To permanently remove old information,] use the 'qpdf' utility with linearization."
What's theoretically possible
A hand-rolled approach can do everything ExifTool refuses to do, plus catch sources neither tool today addresses. The reference is qpdf --linearize input output (rewrite the cross-reference, drop unreferenced objects, no incremental updates) plus targeted dictionary scrubbing. In a JS implementation that means: parse the xref, walk every indirect object, drop or scrub the offending ones, regenerate the xref + trailer, and emit a new file with no /Prev chain. ~600–1000 lines of TypeScript for a hand-rolled minimal parser, considerably less for a pdf-lib-assisted hybrid that uses pdf-lib's parser and emits via its serializer.
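
The "drop unreferenced objects" step reduces to graph reachability once every indirect object's outgoing references are known. A minimal sketch, assuming a parser has already produced a map from object number to the object numbers it references. `ObjectGraph`, `reachableObjects`, and `findOrphans` are hypothetical names. The roots would be whatever the new trailer points at (`/Root`, and `/Info` only if it is kept).

```typescript
// Hypothetical model: object number -> object numbers it references.
type ObjectGraph = Map<number, number[]>;

// Depth-first walk from the roots; tolerates cycles (e.g. page -> parent -> page).
function reachableObjects(graph: ObjectGraph, roots: number[]): Set<number> {
  const live = new Set<number>();
  const stack = roots.slice();
  while (stack.length > 0) {
    const id = stack.pop();
    if (id === undefined || live.has(id)) continue;
    const refs = graph.get(id);
    if (refs === undefined) continue; // dangling reference: nothing to keep
    live.add(id);
    for (const ref of refs) stack.push(ref);
  }
  return live;
}

// Objects present in the file but not reachable from the roots.
function findOrphans(graph: ObjectGraph, roots: number[]): number[] {
  const live = reachableObjects(graph, roots);
  const orphans: number[] = [];
  graph.forEach((_refs, id) => {
    if (!live.has(id)) orphans.push(id);
  });
  return orphans.sort((a, b) => a - b);
}
```

A writer that emits only the live set cannot carry a dereferenced XMP stream or an old Info object into the output.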
Per-source comparison
| Source | What pdf-lib does today | What ExifTool does | What's theoretically possible |
|---|---|---|---|
| `/Info` Title | Sets to UTF-16 empty (`<FEFF>`); key remains, value gone | Adds new empty Info object via incremental update; original Title still in old object | Drop `/Info` reference from trailer entirely + rewrite without the old Info object |
| `/Info` Author | Same as Title | Same as Title | Same as Title |
| `/Info` Subject | Same | Same | Same |
| `/Info` Keywords | Same (empty UTF-16 string instead of empty array) | Same | Same |
| `/Info` Creator | Same | Same | Same |
| `/Info` Producer | Same: Info dict has `<FEFF>` on disk; in-memory `getProducer()` falls back to `"pdf-lib (...)"` | Same | Same |
| `/Info` CreationDate | Survives (no API to clear) | Hidden via incremental update; original survives in old object | Drop entire Info object + rewrite |
| `/Info` ModDate | Rewritten to NOW on every save; adds a new fingerprint | Same as CreationDate | Drop entire Info object + rewrite (no new ModDate written) |
| `/Metadata` XMP stream | Preserved untouched | Hidden via incremental update; original stream still in file | Drop both catalog reference and the stream object via rewrite |
| `/Catalog` `/Lang` | Preserved | Preserved | Drop key from catalog dict |
| `/Catalog` `/PageLabels` | Preserved | Preserved | Drop key |
| `/Catalog` `/Names` (incl. `/EmbeddedFiles`) | Preserved | Preserved (and ExifTool will not delete file attachments) | Drop the `/Names` tree, drop attached file streams |
| `/Catalog` `/OutputIntents` (color profile metadata) | Preserved | Preserved | Drop key (caveat: may affect color reproduction) |
| Page-level `/Metadata` (per-page XMP) | Preserved | Preserved | Walk pages, drop key + stream |
| Page `/Thumb` (page thumbnails) | Preserved | Preserved | Walk pages, drop key + stream |
| Annotations `/Annots` `/T` author | Preserved | Preserved | Walk every page's `/Annots`, scrub `/T`, `/Contents`, `/M`, `/CreationDate`, `/RC`, `/AP` |
| Annotations `/Contents` | Preserved | Preserved | Same |
| AcroForm field defaults / `/DA` / `/DR` | Preserved | Preserved (limited write support) | Walk `/AcroForm`, scrub field metadata |
| Trailer `/ID` array | Not written (pdf-lib emits no `/ID`) | Updated as part of incremental update; old `/ID` retrievable from old trailer | Generate fresh random `/ID` pair on rewrite (or drop entirely) |
| Encryption dictionary | Decrypted on `ignoreEncryption: true` then output is unencrypted | Same | Same: output is unencrypted by definition once we strip |
| Linearization hint stream | Stripped (pdf-lib doesn't preserve linearization) | Stripped when ExifTool rewrites; preserved when only updating | Either drop or regenerate via qpdf-style pass |
| Cross-reference comments | None to begin with (pdf-lib emits clean xref) | Adds `%BeginExifToolUpdate` / `%EndExifToolUpdate` literal comments, a clear "this file was processed by ExifTool" fingerprint | Emit a single clean xref with no commentary |
| Extra trailer dictionaries (incremental update history `/Prev`) | None (single trailer) | Adds one every time; if input already had history, it survives | Walk and merge to a single trailer; drop `/Prev` chain |
| Hidden / replaced objects (orphans from prior incremental updates) | pdf-lib's parser only retains objects referenced from the new catalog → orphans dropped on save (in our tests, an explicit `catalog.delete('Metadata')` left the XMP object orphaned but still in the file, so this needs verification per object class) | Preserved by design; that's what makes ExifTool's strip reversible | Garbage-collect: keep only objects reachable from the new catalog, write only those |
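
The "single clean xref, single trailer" target in the last column can be illustrated with a toy writer. This is a sketch under stated assumptions, not a proposal for the strategy: `writeSingleRevisionPdf` and `IndirectObject` are hypothetical, object bodies are already-serialized ASCII text (so string length equals byte length), and object numbers are assumed to have been renumbered to 1..N beforehand.

```typescript
// Hypothetical toy writer: one body section, one xref table, one trailer, no /Prev.
interface IndirectObject {
  id: number; // must be 1..N with no gaps
  body: string; // already-serialized object text, ASCII only
}

function zeroPad(value: number, width: number): string {
  let text = String(value);
  while (text.length < width) text = "0" + text;
  return text;
}

function writeSingleRevisionPdf(objects: IndirectObject[], rootId: number): string {
  const sorted = objects.slice().sort((a, b) => a.id - b.id);
  sorted.forEach((obj, index) => {
    if (obj.id !== index + 1) throw new Error("object ids must be contiguous from 1");
  });
  let out = "%PDF-1.7\n";
  const offsets: number[] = [];
  for (const obj of sorted) {
    offsets.push(out.length); // byte offset of "N 0 obj" (valid because the text is ASCII)
    out += `${obj.id} 0 obj\n${obj.body}\nendobj\n`;
  }
  const size = sorted.length + 1; // includes the reserved object 0
  const xrefOffset = out.length;
  out += `xref\n0 ${size}\n`;
  out += "0000000000 65535 f \n"; // head of the free list; every entry is exactly 20 bytes
  for (const offset of offsets) out += `${zeroPad(offset, 10)} 00000 n \n`;
  out += `trailer\n<< /Size ${size} /Root ${rootId} 0 R >>\n`;
  out += `startxref\n${xrefOffset}\n%%EOF\n`;
  return out;
}
```

Because the trailer is built from scratch, there is no `/Prev`, `/Info`, or `/ID` unless the caller deliberately adds one.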
Honest gap summary
pdf-lib vs ExifTool: roughly even on the Info dictionary; pdf-lib is better in two underrated ways and worse in one:
- pdf-lib: better. It emits a clean single-trailer rewrite, no `%BeginExifToolUpdate` fingerprint, and original Info-dict bytes are not retained in the output (because pdf-lib's serializer rebuilds the object table from the parsed in-memory model, not from the original byte stream; hidden objects from prior incremental updates do get dropped during this rebuild).
- pdf-lib: better. It does not write a `/ID` to the trailer (no document fingerprint propagates).
- pdf-lib: worse. It adds a fresh `/ModDate` of "now" every save (ExifTool also updates ModDate, but the gap relative to a hand-rolled strip is identical).
- pdf-lib: comparable. It leaves XMP stream, annotations, `/Lang`, `/PageLabels`, `/Names`, page Thumbs, page-level Metadata, AcroForm exactly as ExifTool does.
- pdf-lib: comparable. It has no API surface for `/CreationDate` removal.
ExifTool vs theoretical: ExifTool is fundamentally limited by its design choice to use incremental updates. Per its own docs the recommended workaround is qpdf --linearize. This is not a fixable limitation in ExifTool — it's a documented "this format is not supported for true deletion."
Theoretical vs both: a rewrite-based hand-rolled strategy can additionally close the annotation, AcroForm, page-level Metadata, /Lang, /Names, /Thumb, embedded-files, and ModDate-leak gaps. None of these are addressed by either tool today.
Top three sources that matter most for actual privacy use cases:
1. `/Metadata` XMP stream: this is where authoring-tool fingerprints and the original Title/Author/Subject from Word/InDesign/Acrobat live. Most "leaked" PDF metadata in the real world (lawyers' redaction failures, author names on government documents, internal-build identifiers) ships in XMP, not the Info dict. Both tools fail here. Producer reveals authoring software ("Microsoft Word for Microsoft 365", "Adobe Acrobat Pro DC 22.x", "skia/PDF m118 Google Docs Renderer") which fingerprints the user's environment.
2. Annotations `/T` and `/Contents`: review comments and authorial markup carry reviewer names, internal review timestamps, and the actual review text. Neither tool clears these. This is the source most likely to embarrass an end-user (e.g. "John Reviewer: this number is wrong, change to X" surviving a strip).
3. `/Info` `/CreationDate` + new `/ModDate`: pdf-lib leaves the original CreationDate and adds a "now" ModDate, which together fingerprint both the document's age and the strip event. Defensible to clear both.
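
To show how precise the date fingerprint is, the sketch below parses a PDF date string such as the `D:20260505214006Z` value observed on disk. `parsePdfDate` is a hypothetical helper. It assumes the `D:YYYYMMDDHHmmSS` layout with an optional `Z` or `+HH'mm'` suffix, and treats a missing time zone as UTC for simplicity.

```typescript
// Hypothetical helper: parse a PDF date string into a Date (null if it does not match).
function parsePdfDate(value: string): Date | null {
  const m = /^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$/.exec(value);
  if (m === null) return null;
  const num = (part: string | undefined, fallback: number): number =>
    part === undefined ? fallback : parseInt(part, 10);
  // Missing month/day default to 1, missing time fields to 0.
  const localAsUtcMs = Date.UTC(
    num(m[1], 0), num(m[2], 1) - 1, num(m[3], 1),
    num(m[4], 0), num(m[5], 0), num(m[6], 0),
  );
  let offsetMinutes = 0;
  if (m[8] !== undefined) {
    const sign = m[8] === "-" ? -1 : 1;
    offsetMinutes = sign * (num(m[9], 0) * 60 + num(m[10], 0));
  }
  // Local time = UTC + offset, so subtract the offset to get UTC.
  return new Date(localAsUtcMs - offsetMinutes * 60000);
}
```

A ModDate written at strip time therefore records the strip event to the second, which is why clearing it matters as much as clearing the original CreationDate.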
Secondary but worth covering in the same pass:
- `/Catalog` `/Lang`: leaks user locale.
- `/Names` `/EmbeddedFiles`: embedded attachments can carry whole spreadsheets or originals.
- `/Catalog` `/Metadata` orphans from prior tool chains: anything that has been ExifTool'd or repeatedly updated has shadow metadata.
Recommendation
Keep pdf-lib as the parser/serializer. Do not replace it with a from-scratch PDF parser; that is a 600+ line project with high crash-on-malformed-input surface area. Instead, extend the strip with three targeted catalog/dict mutations using pdf-lib's low-level API.
The hand-rolled-everything path is unattractive specifically for PDF (unlike JPEG/PNG, where a marker walker is ~80 lines): PDFs require xref parsing, cross-reference streams, FlateDecode/Predictor filters, encryption negotiation, object stream parsing. pdf-lib already does all of that correctly. We don't need a different library; we need to use the one we have more aggressively.
The interesting question is whether to replace pdf-lib with qpdf-wasm (a port of the canonical PDF rewrite tool ExifTool itself recommends). It would provide true garbage-collection of orphan objects and a rewrite-based strip. Bundle size cost is significant (qpdf is large; published wasm builds are ~1.5–3 MB). Worth evaluating in a separate POC if Phase 1 below proves insufficient. Given the gzip-size patterns observed in docs/poc/exiv2-wasm.md (925 KB exiv2-wasm rejected), qpdf-wasm is unlikely to clear the size bar unless it ships a much smaller pdf-strip-only build.
For Electron (where ExifTool is bundled), pdf-lib is already the right answer for PDF — ExifTool's PDF strip is provably worse (incremental updates retain everything) and we should consider routing PDFs to the WASM strategy on Electron too for any case where speed allows. That's a larger architectural call; out of scope here.
Phase 1 plan: tightly-scoped first PR
Goal: close the three highest-impact gaps (XMP, annotations, dates) within pdf_strategy.ts, without bringing in a new library.
In scope:
1. Drop the catalog `/Metadata` reference + verify the orphan is gone after save. If pdf-lib's serializer doesn't garbage-collect it (our scratch testing suggests it does NOT; the orphaned XMP stream remained), implement a manual "build a fresh PDFDocument and copy only the page content + non-metadata catalog entries" rebuild, which forces a clean object table.
2. Walk every page's `/Annots` array, mutate each annotation dict to drop `/T`, `/Contents`, `/M`, `/CreationDate`, `/RC` (rich content), `/Subj` (annotation subject). Leave `/Type`, `/Subtype`, `/Rect`, `/AP` (appearance, visual-only), and structural geometry. This preserves annotation visibility but scrubs authorship.
3. Drop catalog-level metadata fingerprints: `/Lang`, `/PageLabels`, `/Names` (with caveat below), `/OutputIntents`, page-level `/Metadata`, page-level `/Thumb`. Behind individual `StripOptions` flags so users can opt back in if they need (e.g. `/OutputIntents` may matter for color-managed workflows).
4. Drop `/Info` `/CreationDate` and `/ModDate` entries by reaching into `doc.context.lookup(infoRef, PDFDict).delete(PDFName.of('CreationDate'))` rather than relying on pdf-lib's high-level setters which always emit a new ModDate.
5. Fix the Producer comment + test + README footnote 3 to match observed behaviour. Replace the `expect(cleaned.getProducer()).toContain("pdf-lib")` assertion with a check that the on-disk Info dict has no Producer key at all (or an empty one), and clarify in `pdf_strategy.ts` that the in-memory `getProducer()` returns a default fallback that is not present in the file.
6. Test fixture upgrade: replace the 901-byte minimal fixture with a richer fixture (or add a second fixture) that exercises XMP, annotations, embedded files, and `/Lang`. Generated via a small `tests/fixtures/wasm/pdf/` build script, not committed in binary form unless small enough.
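
The annotation scrub in the plan above amounts to deleting a fixed set of keys from each annotation dictionary while leaving geometry and appearance alone. A minimal sketch: a plain `Map` stands in for pdf-lib's dictionary objects so it runs without the library, and `PdfDictModel`, `scrubAnnotation`, and `scrubAllAnnotations` are hypothetical names. The key list is the one given in the plan.

```typescript
// A plain Map models an annotation dict (keys without the leading slash); assumption for illustration.
type PdfDictModel = Map<string, unknown>;

// Authorship-bearing keys from the plan: /T, /Contents, /M, /CreationDate, /RC, /Subj.
const ANNOTATION_KEYS_TO_DROP = ["T", "Contents", "M", "CreationDate", "RC", "Subj"];

// Deletes the listed keys in place and reports which ones were actually present.
function scrubAnnotation(annotation: PdfDictModel): string[] {
  const removed: string[] = [];
  for (const key of ANNOTATION_KEYS_TO_DROP) {
    if (annotation.delete(key)) removed.push(key);
  }
  return removed;
}

// pages[i] is the list of annotation dicts on page i; returns the total number of keys removed.
function scrubAllAnnotations(pages: PdfDictModel[][]): number {
  let removedCount = 0;
  for (const annotations of pages) {
    for (const annotation of annotations) {
      removedCount += scrubAnnotation(annotation).length;
    }
  }
  return removedCount;
}
```

`/Type`, `/Subtype`, `/Rect`, and `/AP` are never in the drop list, so the annotation stays visible. Returning the removed keys gives a test something to assert on.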
Deferred:
- `/Names` `/EmbeddedFiles` removal: risky because some PDFs use embedded files for legitimate functionality (PDF/A archive sidecars, attached spreadsheets). Add behind an explicit `dropEmbeddedFiles` option, default off.
- AcroForm scrubbing: form data may legitimately need to round-trip (e.g. invoices with form fields). Default off; opt-in.
- True orphan-object garbage collection: only pursue if step 1 above shows orphans surviving the save. If that happens, evaluate `qpdf-wasm` as a separate POC.
- `/ID` regeneration: pdf-lib already omits `/ID`, so no action, but document this in the strategy file.
- Linearized PDF input: pdf-lib already de-linearizes on save, so cleaned files are non-linearized. Acceptable.
- Encrypted-input handling: `ignoreEncryption: true` already covers the common case (we strip, output is unencrypted). Re-encryption is out of scope.
Effort estimate: ~150 lines of TypeScript additions in pdf_strategy.ts + ~50 lines of test scaffolding + 1 new fixture. One PR, mostly sequential edits to a single strategy file.