exifcleaner-web/docs/gap-analysis/pdf.md
obuvuyoviz26-lab dfbd680737
B: Deployable webapp — Vite build, web adapters, JPEG/PDF strategies, PWA, Docker, CI
Phase B ships a static-asset web build that processes files entirely in the browser via WASM and pure-TS strategies — no server, no upload, fully offline. Same renderer code path as Electron; WASM strategies plug in via the existing `FormatStrategy` interface, and `platform.isWeb` drives routing.

## Web build infrastructure

- `vite.config.web.ts` + dedicated entry `src/web/main.tsx`
- Web adapters (`FileRegistry`, `BrowserFileBytes`, `makeWebApi`) mirror the Electron preload contract so the renderer is unchanged
- File picker button (`FileBrowseButton`) gated on `platform.isWeb`; rendered both in the empty state and in the StatusBar so users can add files to a batch in progress without clearing it

## PWA + deployment

- `manifest.webmanifest`, service worker via `vite-plugin-pwa` (Workbox)
- `Dockerfile` (multi-stage Node 22 → nginx:alpine) with `nginx.conf` carrying COOP/COEP/CSP/cache headers
- Cloudflare Pages: `public/_headers` mirrors the nginx config; deploy workflow shipped at `.github/workflows/deploy-web.yml` (gated to `workflow_dispatch` until secrets are wired)
- Unified deploy guide at `docs/deploying.md` covering Cloudflare Pages, Docker + Caddy, Docker + nginx + certbot, Cloudflare Tunnel, and Tailscale Funnel

## JPEG strategy — replaces piexifjs entirely

Hand-rolled marker walker at `src/infrastructure/wasm/strategies/jpeg_strategy.ts`. Mirrors ExifTool's `-all=` policy with two deliberate exceptions: APP14 (Adobe DCT) always kept, APP2 (ICC) kept on opt-in via `preserveColorProfile`. Drops APP0/APP1/APP3–APP13/APP15 + COM. Entropy-stream byte-stuffing and RST markers preserved verbatim.

- Fill-byte tolerance per T.81 §B.1.1.2
- Truncation surfaces as `Result<_, ExifError>` rather than silently emitting a no-EOI file
- Removes `piexifjs` and `@types/piexifjs` as production deps
- Closes the prior `TextDecoder("latin1")` corruption bug (WHATWG aliases `latin1` to windows-1252, so 0x80–0x9F bytes were silently rewritten)
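
The marker-walk policy above can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of `jpeg_strategy.ts`: the names `stripJpeg`, `keepSegment`, `JpegStripOptions` and the shape of `ExifError` are hypothetical, every APP2 segment is treated as ICC, and standalone markers outside scan data (TEM, stray RSTn) are not handled.

```typescript
// Illustrative sketch only; names and the error shape are hypothetical.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };
interface ExifError { kind: "not-jpeg" | "truncated"; offset: number }
interface JpegStripOptions { preserveColorProfile?: boolean }

function keepSegment(marker: number, opts: JpegStripOptions): boolean {
  if (marker === 0xee) return true; // APP14 (Adobe) is always kept
  // Simplification: every APP2 counts as ICC; a real walker would check the signature.
  if (marker === 0xe2) return opts.preserveColorProfile === true;
  if (marker >= 0xe0 && marker <= 0xef) return false; // all other APPn
  if (marker === 0xfe) return false; // COM
  return true; // DQT, DHT, SOFn, DRI, SOS, ...
}

// Returns the offset of the first real marker after entropy-coded data.
function endOfEntropyData(input: Uint8Array, start: number): number {
  let pos = start;
  while (pos < input.length) {
    if (input[pos] !== 0xff || pos + 1 >= input.length) { pos++; continue; }
    const next = input[pos + 1];
    if (next === 0x00 || (next >= 0xd0 && next <= 0xd7)) { pos += 2; continue; } // stuffing or RSTn
    if (next === 0xff) { pos++; continue; } // fill byte
    return pos;
  }
  return pos;
}

function concat(chunks: Uint8Array[]): Uint8Array {
  const out = new Uint8Array(chunks.reduce((n, c) => n + c.length, 0));
  let offset = 0;
  for (const c of chunks) { out.set(c, offset); offset += c.length; }
  return out;
}

function stripJpeg(input: Uint8Array, opts: JpegStripOptions = {}): Result<Uint8Array, ExifError> {
  if (input.length < 2 || input[0] !== 0xff || input[1] !== 0xd8) {
    return { ok: false, error: { kind: "not-jpeg", offset: 0 } };
  }
  const chunks: Uint8Array[] = [input.subarray(0, 2)]; // SOI
  let pos = 2;
  while (pos + 1 < input.length) {
    if (input[pos] !== 0xff) return { ok: false, error: { kind: "not-jpeg", offset: pos } };
    if (input[pos + 1] === 0xff) { pos++; continue; } // tolerate fill bytes before a marker
    const marker = input[pos + 1];
    if (marker === 0xd9) { // EOI: the only successful exit
      chunks.push(input.subarray(pos, pos + 2));
      return { ok: true, value: concat(chunks) };
    }
    if (pos + 3 >= input.length) break;
    const length = (input[pos + 2] << 8) | input[pos + 3];
    const end = pos + 2 + length;
    if (length < 2 || end > input.length) break;
    if (keepSegment(marker, opts)) chunks.push(input.subarray(pos, end));
    pos = end;
    if (marker === 0xda) { // copy scan data verbatim, stuffing and RSTn included
      const scanStart = pos;
      pos = endOfEntropyData(input, pos);
      chunks.push(input.subarray(scanStart, pos));
    }
  }
  return { ok: false, error: { kind: "truncated", offset: pos } }; // never emit a no-EOI file
}
```

Returning to the main loop after each scan, rather than copying the rest of the file, is what lets the same walk cover progressive files with several SOS segments.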

## PDF strategy — rewrites the previous "setX('')" pass

`pdf-lib` with `updateMetadata: false` to defeat auto-stamp of Producer / ModDate / Creator / CreationDate. Direct Info-dict key deletion. Drops the catalog `/Metadata` reference *and* the indirect XMP stream object (orphan-free). Scrubs annotation `/T`, `/Contents`, `/M`, `/CreationDate`, `/RC`, `/Subj`. Drops `/Lang`, `/PageLabels`, `/OutputIntents`, per-page `/Metadata`, per-page `/Thumb`. AcroForm and `/EmbeddedFiles` deferred behind opt-ins.

Also corrects the previous "pdf-lib re-injects Producer" claim in source comments + README — it does not (verified empirically; the string is an in-memory `getProducer()` fallback, not file content).

## Documentation pattern

Three companion folders for each format:

- `docs/gap-analysis/<format>.md` — current vs reference vs theoretical, *before* implementation. Contains `pdf.md` and `jpeg.md`.
- `docs/poc/<approach>.md` — library evaluations. Contains `little-exif-wasm.md` and `exiv2-wasm.md` (both ruled out — full size + coverage data).
- `docs/forensic/<format>.md` — adversarial recovery tests *after* implementation, with reproducible runner under `tools/forensic/`. Contains `pdf.md` (zero sentinel survival across `strings`, `exiftool -PDF-update:all=`, `qpdf --qdf`, and in-process pdf-lib indirect-object walk; ExifTool's own output leaks 8/10 sentinels via the same battery).
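
A sentinel check of the kind described for `docs/forensic/` might look like the sketch below. It is an assumption about the general approach, not the runner under `tools/forensic/`: all function names are hypothetical, and it only inspects raw bytes (ASCII, raw UTF-16BE, and hex-encoded UTF-16BE), so compressed streams would need to be inflated before being passed in.

```typescript
// Hypothetical helper names; the real runner under tools/forensic/ may work differently.
function asciiBytes(text: string): number[] {
  return Array.from(text, (ch) => ch.charCodeAt(0) & 0xff);
}

function utf16beBytes(text: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    out.push(unit >> 8, unit & 0xff);
  }
  return out;
}

function toHex(bytes: number[]): string {
  return bytes.map((b) => (b < 16 ? "0" : "") + b.toString(16)).join("");
}

function includesBytes(haystack: Uint8Array, needle: number[]): boolean {
  outer: for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    return true;
  }
  return false;
}

// Returns the sentinels still findable in the raw file bytes.
function survivingSentinels(file: Uint8Array, sentinels: string[]): string[] {
  return sentinels.filter((s) => {
    const hex = toHex(utf16beBytes(s));
    const encodings = [
      asciiBytes(s),                 // literal string, e.g. (Secret)
      utf16beBytes(s),               // raw UTF-16BE text string
      asciiBytes(hex),               // hex string body, lowercase digits
      asciiBytes(hex.toUpperCase()), // hex string body, uppercase digits
    ];
    return encodings.some((needle) => includesBytes(file, needle));
  });
}
```

Checking several encodings matters because a PDF text string can carry the same value as a literal, as raw UTF-16BE, or as a `<FEFF...>` hex string.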

## CI

New workflow runs lint + typecheck + tests + electron compile + Playwright e2e on PRs and master. Platform builds compile on PRs but skip artifact upload to avoid the GitHub Actions storage quota; uploads only happen on master pushes.

## Deferred (tracked as issues)

- HEIC strategy for mobile (issue #48 — most-hit format on iPhone)
- Mobile touch UX pass (#49)
- iOS Photos picker UX note (#50)
- Unsupported-format messaging (#51)
- PWA install prompt UX (#52)
- `preserveOrientation` honoring inside JPEG (Phase 2 of `docs/gap-analysis/jpeg.md`)
- Forensic comparison-corpus runs for JPEG and other formats

## Stats

- 369/369 unit tests, lint clean, typecheck clean
- Electron build: ~640 KB renderer bundle
- Web build: 1.1 MB precache (PWA shell)
- 30 commits on the branch (squashed into this one)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-05-07 17:56:50 +04:00


PDF metadata-stripping gap analysis

Date: 2026-05-06

Goal: Compare what pdf-lib actually clears today (in src/infrastructure/wasm/strategies/pdf_strategy.ts) against what ExifTool clears, and against what is theoretically possible with a hand-rolled rewrite. Drives the decision on whether to replace pdf-lib, hand-roll, or keep it for an explicitly-scoped first PR.

Methodology

Read:

  • src/infrastructure/wasm/strategies/pdf_strategy.ts — the current implementation.
  • tests/infrastructure/wasm/pdf_strategy.test.ts — what is currently asserted.
  • tests/fixtures/wasm/pdf/sample.pdf — the test fixture (901 bytes; produced by pdf-lib; Title/Author/Subject/Creator/Producer set).
  • README.md "File writer limitations" + Format Support Matrix footnote 3.
  • ExifTool docs at https://exiftool.org/#limitations and https://exiftool.org/TagNames/PDF.html.

Ran (all in /tmp/pdf-poc/, nothing added to package.json):

  • Generated rich PDFs with pdf-lib 1.17.1 + supplemented with exiftool -XMP-*= to inject an XMP stream.
  • Generated a PDF with annotations (/T author, /Contents, /CreationDate) via low-level pdf-lib ctx.obj().
  • Stripped each fixture three ways: pdf-lib (replicating pdf_strategy.ts exactly), exiftool -all= -overwrite_original, and gs -sDEVICE=pdfwrite (as a "structurally clean" baseline).
  • Diffed each output with exiftool -a -G1 -s, raw strings, xxd, and a custom inflater that walks /Length-declared streams and decompresses the FlateDecode object streams to reveal the actual on-disk dictionary contents.
  • Bypass test: tried doc.catalog.delete(PDFName.of('Metadata')) to see if pdf-lib can drop the XMP catalog reference, and what that does to the orphaned stream object.

Verified facts about the current pdf-lib behaviour

The existing pdf_strategy.ts calls setTitle/setAuthor/setSubject/setKeywords/setProducer/setCreator with empty values and saves. Empirically, on pdf-lib 1.17.1:

  • All six Info-dict fields end up as <FEFF> (UTF-16 BOM with no data) on disk. Verified by inflating the object stream:

    /Producer <FEFF>
    /ModDate  (D:20260505214006Z)
    /Creator  <FEFF>
    /CreationDate (D:20240115103000Z)
    /Title    <FEFF>
    /Author   <FEFF>
    /Subject  <FEFF>
    /Keywords <FEFF>
    
  • The pdf_strategy.ts source comment ("pdf-lib re-injects 'pdf-lib (...)' as Producer on every save") and the pdf_strategy.test.ts assertion expect(cleaned.getProducer()).toContain("pdf-lib") are misleading. setProducer("") in pdf-lib 1.17.1 does write an empty Producer to the on-disk Info dict — exiftool -Producer fixture-stripped.pdf returns an empty string. What the test is observing is doc.getProducer() returning the string "pdf-lib (...)" — that's pdf-lib's in-memory default fallback when Producer is read back from a previously-saved-and-reloaded doc, not a value present in the file. README footnote 3 inherits the same misconception. Neither blocks the strip in practice, but the comment + test + footnote should be corrected.

  • /CreationDate is not clearable through pdf-lib's API — there is no setCreationDate(undefined) or clearCreationDate. It survives every strip from the current code.

  • /ModDate is rewritten to the current time on every save. This is a privacy regression: the cleaned file leaks "this file was metadata-stripped at YYYY-MM-DD hh:mm:ss" — a fingerprint that wasn't in the input. The old ModDate is gone, but a new one is silently added.

  • pdf-lib does not use incremental updates on save — it rewrites the file structure cleanly with a single xref and one %%EOF. This is a structural advantage over ExifTool. (No /Prev key in the trailer, no %BeginExifToolUpdate marker.)

  • pdf-lib does not write a /ID array in the trailer at all. Neither identifier from the input's /ID pair propagates, and no new ID is added.

  • pdf-lib preserves any /Metadata (XMP) stream referenced from the catalog as-is — every Dublin Core, XMP, and pdf:* property survives untouched.

  • doc.catalog.delete(PDFName.of('Metadata')) removes the catalog reference but leaves the XMP stream as an orphaned (unreferenced) object that still exists in the file body. exiftool no longer surfaces it, but strings and any forensic walker reading the raw object table will. Without a true "garbage-collect orphans + rewrite" pass, dereferencing alone is no better than ExifTool's incremental-update trick.

  • pdf-lib does not touch annotation /T (author), /Contents, /M, or /CreationDate. It does not touch page-level /Metadata, /Thumb, /Names, /EmbeddedFiles, /AcroForm, or /PageLabels.
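
To make the `<FEFF>` observation above concrete, here is a small sketch of how a PDF hex string decodes. The function name `decodePdfHexString` is hypothetical, and the non-BOM branch is a simplification that is only accurate for ASCII-range bytes (real PDFDocEncoding differs above that).

```typescript
// Sketch: decode a PDF hex string such as <FEFF00480069>.
function decodePdfHexString(literal: string): string {
  const hex = literal.replace(/^<|>$/g, "").replace(/\s+/g, "");
  // Per the PDF syntax rules, a missing final hex digit is read as 0.
  const padded = hex.length % 2 === 0 ? hex : hex + "0";
  const bytes: number[] = [];
  for (let i = 0; i < padded.length; i += 2) {
    bytes.push(parseInt(padded.slice(i, i + 2), 16));
  }
  if (bytes[0] === 0xfe && bytes[1] === 0xff) {
    // UTF-16BE byte order mark: the remaining bytes are 16-bit code units.
    let text = "";
    for (let i = 2; i + 1 < bytes.length; i += 2) {
      text += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
    }
    return text;
  }
  // Simplification: treat other strings as one byte per character.
  return String.fromCharCode(...bytes);
}
```

A bare `<FEFF>` therefore decodes to the empty string: the key is still present in the Info dict, but it carries a byte order mark and no text.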

Verified facts about ExifTool

exiftool -all= -overwrite_original on the same fixture:

  • Emits the warning: Warning: [minor] ExifTool PDF edits are reversible. Deleted tags may be recovered!
  • Adds an incremental update: original objects (containing every Info-dict and XMP value) stay in the file, plus a new Catalog object that hides them. The file ends with two %%EOF markers, two startxref, and the literal markers %BeginExifToolUpdate ... %EndExifToolUpdate.
  • strings cleaned.pdf returns clean output, but only because the original Info dict sits inside a Flate-compressed (FlateDecode) stream; the XMP stream is also still present. Inflating the streams reveals everything intact: /Producer <FEFF...SecretInternalToolProducerStamp>, /Title <FEFF...OriginalTitle>, etc.
  • -PDF-update:all= (an ExifTool-specific pseudo-tag) only removes ExifTool's own update layers — i.e. it reverts a previous strip rather than performing a stronger one.
  • Annotation /T, /Contents, etc., are unaffected.
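
The structural fingerprints listed above (multiple %%EOF markers, multiple startxref keywords, the ExifTool update comment) can be checked with a byte scan. This is a rough heuristic sketch with hypothetical names (`inspectUpdateHistory`, `UpdateReport`); note that linearized PDFs can legitimately contain two %%EOF markers, so a positive result is a hint rather than proof.

```typescript
// Heuristic sketch; names are hypothetical.
function countOccurrences(haystack: Uint8Array, needle: string): number {
  const n = Array.from(needle, (ch) => ch.charCodeAt(0));
  let count = 0;
  outer: for (let i = 0; i + n.length <= haystack.length; i++) {
    for (let j = 0; j < n.length; j++) {
      if (haystack[i + j] !== n[j]) continue outer;
    }
    count++;
  }
  return count;
}

interface UpdateReport {
  eofMarkers: number;
  startxrefKeywords: number;
  exifToolMarker: boolean;
  likelyIncremental: boolean;
}

function inspectUpdateHistory(pdf: Uint8Array): UpdateReport {
  const eofMarkers = countOccurrences(pdf, "%%EOF");
  const startxrefKeywords = countOccurrences(pdf, "startxref");
  const exifToolMarker = countOccurrences(pdf, "%BeginExifToolUpdate") > 0;
  return {
    eofMarkers,
    startxrefKeywords,
    exifToolMarker,
    // More than one trailer section suggests earlier revisions are still in the file.
    likelyIncremental: exifToolMarker || eofMarkers > 1 || startxrefKeywords > 1,
  };
}
```

Run against a cleaned file, a report with `likelyIncremental: true` is a cue to inflate the older revisions and look for the supposedly deleted values.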

Quoting ExifTool's docs verbatim:

"PDF - The original metadata is never actually deleted." (https://exiftool.org/#limitations)

"All metadata edits are reversible. While this would normally be considered an advantage, it is a potential security problem because old information is never actually deleted from the file." (https://exiftool.org/TagNames/PDF.html)

"[To permanently remove old information,] use the 'qpdf' utility with linearization."

What's theoretically possible

A hand-rolled approach can do everything ExifTool refuses to do, plus catch sources neither tool today addresses. The reference is qpdf --linearize input output (rewrite the cross-reference, drop unreferenced objects, no incremental updates) plus targeted dictionary scrubbing. In a JS implementation that means: parse the xref, walk every indirect object, drop or scrub the offending ones, regenerate the xref + trailer, and emit a new file with no /Prev chain. That is roughly 600 to 1000 lines of TypeScript for a hand-rolled minimal parser, and considerably less for a pdf-lib-assisted hybrid that uses pdf-lib's parser and emits via its serializer.
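
The "drop unreferenced objects" step is essentially a reachability pass. The sketch below models the object table as a plain map from object number to the object numbers it references; that model and the names `reachableFrom` and `findOrphans` are assumptions for illustration, not pdf-lib or qpdf APIs.

```typescript
// Simplified model: object number -> object numbers it references.
type ObjectGraph = Map<number, number[]>;

// Depth-first walk from the roots (for a PDF: the trailer's /Root, and /Info if kept).
function reachableFrom(graph: ObjectGraph, roots: number[]): Set<number> {
  const seen = new Set<number>();
  const stack = roots.slice();
  while (stack.length > 0) {
    const id = stack.pop() as number;
    if (seen.has(id) || !graph.has(id)) continue; // skip visited and dangling refs
    seen.add(id);
    for (const ref of graph.get(id) ?? []) stack.push(ref);
  }
  return seen;
}

// Objects present in the table but unreachable from the roots.
function findOrphans(graph: ObjectGraph, roots: number[]): number[] {
  const live = reachableFrom(graph, roots);
  return Array.from(graph.keys())
    .filter((id) => !live.has(id))
    .sort((a, b) => a - b);
}
```

Writing only the objects in the reachable set, with a freshly generated xref, is what removes both dereferenced XMP streams and objects shadowed by earlier incremental updates.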

Per-source comparison

| Source | What pdf-lib does today | What ExifTool does | What's theoretically possible |
| --- | --- | --- | --- |
| /Info Title | Sets to UTF-16 empty (`<FEFF>`): key remains, value gone | Adds new empty Info object via incremental update; original Title still in old object | Drop /Info reference from trailer entirely + rewrite without the old Info object |
| /Info Author | Same as Title | Same as Title | Same as Title |
| /Info Subject | Same | Same | Same |
| /Info Keywords | Same (empty UTF-16 string instead of empty array) | Same | Same |
| /Info Creator | Same | Same | Same |
| /Info Producer | Same: Info dict has `<FEFF>` on disk; in-memory getProducer() falls back to "pdf-lib (...)" | Same | Same |
| /Info CreationDate | Survives: no API to clear | Hidden via incremental update; original survives in old object | Drop entire Info object + rewrite |
| /Info ModDate | Rewritten to NOW on every save, which adds a new fingerprint | Same as CreationDate | Drop entire Info object + rewrite (no new ModDate written) |
| /Metadata XMP stream | Preserved untouched | Hidden via incremental update; original stream still in file | Drop both catalog reference and the stream object via rewrite |
| /Catalog /Lang | Preserved | Preserved | Drop key from catalog dict |
| /Catalog /PageLabels | Preserved | Preserved | Drop key |
| /Catalog /Names (incl. /EmbeddedFiles) | Preserved | Preserved (and ExifTool will not delete file attachments) | Drop the /Names tree, drop attached file streams |
| /Catalog /OutputIntents (color profile metadata) | Preserved | Preserved | Drop key (caveat: may affect color reproduction) |
| Page-level /Metadata (per-page XMP) | Preserved | Preserved | Walk pages, drop key + stream |
| Page /Thumb (page thumbnails) | Preserved | Preserved | Walk pages, drop key + stream |
| Annotations /Annots /T author | Preserved | Preserved | Walk every page's /Annots, scrub /T, /Contents, /M, /CreationDate, /RC, /AP |
| Annotations /Contents | Preserved | Preserved | Same |
| AcroForm field defaults / /DA / /DR | Preserved | Preserved (limited write support) | Walk /AcroForm, scrub field metadata |
| Trailer /ID array | Not written (pdf-lib emits no /ID) | Updated as part of incremental update; old /ID retrievable from old trailer | Generate fresh random /ID pair on rewrite (or drop entirely) |
| Encryption dictionary | Decrypted on ignoreEncryption: true then output is unencrypted | Same | Same: output is unencrypted by definition once we strip |
| Linearization hint stream | Stripped (pdf-lib doesn't preserve linearization) | Stripped when ExifTool rewrites; preserved when only updating | Either drop or regenerate via qpdf-style pass |
| Cross-reference comments | None to begin with (pdf-lib emits clean xref) | Adds %BeginExifToolUpdate / %EndExifToolUpdate literal comments, a clear "this file was processed by ExifTool" fingerprint | Emit a single clean xref with no commentary |
| Extra trailer dictionaries (incremental update history /Prev) | None: single trailer | Adds one every time; if input already had history, it survives | Walk and merge to a single trailer; drop /Prev chain |
| Hidden / replaced objects (orphans from prior incremental updates) | pdf-lib's parser only retains objects referenced from the new catalog → orphans dropped on save (in our tests, an explicit catalog.delete('Metadata') left the XMP object orphaned but still in the file, so this needs verification per object class) | Preserved by design; that's what makes ExifTool's strip reversible | Garbage-collect: keep only objects reachable from the new catalog, write only those |

Honest gap summary

pdf-lib vs ExifTool: roughly even on the Info dictionary; pdf-lib is better in two underrated ways and worse in one:

  • pdf-lib: better — emits a clean single-trailer rewrite, no %BeginExifToolUpdate fingerprint, original Info-dict bytes are not retained in the output (because pdf-lib's serializer rebuilds the object table from the parsed in-memory model, not from the original byte stream — hidden objects from prior incremental updates do get dropped during this rebuild).
  • pdf-lib: better — does not write a /ID to the trailer (no document fingerprint propagates).
  • pdf-lib: worse — adds a fresh /ModDate of "now" every save (ExifTool also updates ModDate, but the gap relative to a hand-rolled strip is identical).
  • pdf-lib: comparable — leaves XMP stream, annotations, /Lang, /PageLabels, /Names, page Thumbs, page-level Metadata, AcroForm exactly as ExifTool does.
  • pdf-lib: comparable — has no API surface for /CreationDate removal.

ExifTool vs theoretical: ExifTool is fundamentally limited by its design choice to use incremental updates. Per its own docs the recommended workaround is qpdf --linearize. This is not a fixable limitation in ExifTool — it's a documented "this format is not supported for true deletion."

Theoretical vs both: a rewrite-based hand-rolled strategy can additionally close the annotation, AcroForm, page-level Metadata, /Lang, /Names, /Thumb, embedded-files, and ModDate-leak gaps. None of these are addressed by either tool today.

Top three sources that matter most for actual privacy use cases:

  1. /Metadata XMP stream: this is where authoring-tool fingerprints and the original Title/Author/Subject from Word/InDesign/Acrobat live. Most "leaked" PDF metadata in the real world (lawyers' redaction failures, author names on government documents, internal-build identifiers) ships in XMP, not the Info dict. Both tools fail here. Producer reveals authoring software ("Microsoft Word for Microsoft 365", "Adobe Acrobat Pro DC 22.x", "skia/PDF m118 Google Docs Renderer") which fingerprints the user's environment.
  2. Annotations /T and /Contents — review comments and authorial markup carry reviewer names, internal review timestamps, and the actual review text. Neither tool clears these. This is the source most likely to embarrass an end-user (e.g. "John Reviewer: this number is wrong, change to X" surviving a strip).
  3. /Info /CreationDate + new /ModDate — pdf-lib leaves the original CreationDate and adds a "now" ModDate, which together fingerprint both the document's age and the strip event. Defensible to clear both.

Secondary but worth covering in the same pass:

  • /Catalog /Lang — leaks user locale.
  • /Names /EmbeddedFiles — embedded attachments can carry whole spreadsheets or originals.
  • /Catalog /Metadata orphans from prior tool chains — anything that has been ExifTool'd or repeatedly updated has shadow metadata.

Recommendation

Keep pdf-lib as the parser/serializer. Do not replace it with a from-scratch PDF parser; that is a 600+ line project with high crash-on-malformed-input surface area. Instead, extend the strip with three targeted catalog/dict mutations using pdf-lib's low-level API.

The hand-rolled-everything path is unattractive specifically for PDF (unlike JPEG/PNG, where a marker walker is ~80 lines): PDFs require xref parsing, cross-reference streams, FlateDecode/Predictor filters, encryption negotiation, object stream parsing. pdf-lib already does all of that correctly. We don't need a different library; we need to use the one we have more aggressively.

The interesting question is whether to replace pdf-lib with qpdf-wasm (a port of the canonical PDF rewrite tool ExifTool itself recommends). It would provide true garbage-collection of orphan objects and a rewrite-based strip. Bundle size cost is significant (qpdf is large; published wasm builds are ~1.5 to 3 MB). Worth evaluating in a separate POC if Phase 1 below proves insufficient. Given the gzip-size patterns observed in docs/poc/exiv2-wasm.md (925 KB exiv2-wasm rejected), qpdf-wasm is unlikely to clear the size bar unless it ships a much smaller pdf-strip-only build.

For Electron (where ExifTool is bundled), pdf-lib is already the right answer for PDF — ExifTool's PDF strip is provably worse (incremental updates retain everything) and we should consider routing PDFs to the WASM strategy on Electron too for any case where speed allows. That's a larger architectural call; out of scope here.

Phase 1 plan: tightly-scoped first PR

Goal: close the three highest-impact gaps (XMP, annotations, dates) within pdf_strategy.ts, without bringing in a new library.

In scope:

  1. Drop the catalog /Metadata reference + verify the orphan is gone after save. If pdf-lib's serializer doesn't garbage-collect it (our scratch testing suggests it does NOT — the orphaned XMP stream remained), implement a manual "build a fresh PDFDocument and copy only the page content + non-metadata catalog entries" rebuild, which forces a clean object table.
  2. Walk every page's /Annots array, mutate each annotation dict to drop /T, /Contents, /M, /CreationDate, /RC (rich content), /Subj (annotation subject). Leave /Type, /Subtype, /Rect, /AP (appearance — visual-only), and structural geometry. This preserves annotation visibility but scrubs authorship.
  3. Drop catalog-level metadata fingerprints: /Lang, /PageLabels, /Names (with caveat below), /OutputIntents, page-level /Metadata, page-level /Thumb. Behind individual StripOptions flags so users can opt back in if they need (e.g. /OutputIntents may matter for color-managed workflows).
  4. Drop /Info /CreationDate and /ModDate entries by reaching into doc.context.lookup(infoRef, PDFDict).delete(PDFName.of('CreationDate')) rather than relying on pdf-lib's high-level setters which always emit a new ModDate.
  5. Fix the Producer comment + test + README footnote 3 to match observed behaviour. Replace the expect(cleaned.getProducer()).toContain("pdf-lib") assertion with a check that the on-disk Info dict has no Producer key at all (or an empty one), and clarify in pdf_strategy.ts that the in-memory getProducer() returns a default fallback that is not present in the file.
  6. Test fixture upgrade: replace the 901-byte minimal fixture with a richer fixture (or add a second fixture) that exercises XMP, annotations, embedded files, and /Lang. Generated via a small tests/fixtures/wasm/pdf/ build script, not committed in binary form unless small enough.
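
The key-deletion policy in steps 2 and 4 can be sketched independently of pdf-lib. In the sketch below a dictionary is modeled as a plain `Map` keyed by name strings, which is an assumption made so the example runs on its own; in pdf_strategy.ts the equivalent call would be the `dict.delete(PDFName.of(...))` form quoted in step 4. The helper names are hypothetical.

```typescript
// Simplified model: a PDF dictionary as a Map keyed by name (without the leading slash).
type DictModel = Map<string, unknown>;

// Step 2: authorship-bearing annotation keys. /Type, /Subtype, /Rect and /AP are left alone.
const ANNOTATION_SCRUB_KEYS = ["T", "Contents", "M", "CreationDate", "RC", "Subj"];
// Step 4: date keys removed directly from the Info dict.
const INFO_DATE_KEYS = ["CreationDate", "ModDate"];

// Deletes the given keys and returns the ones that were actually present.
function deleteKeys(dict: DictModel, keys: string[]): string[] {
  return keys.filter((key) => dict.delete(key));
}

function scrubAnnotation(annotation: DictModel): string[] {
  return deleteKeys(annotation, ANNOTATION_SCRUB_KEYS);
}

function dropInfoDates(info: DictModel): string[] {
  return deleteKeys(info, INFO_DATE_KEYS);
}
```

Returning the removed keys keeps the functions easy to assert on in tests and gives the caller something to log without exposing the removed values.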

Deferred:

  • /Names /EmbeddedFiles removal: risky because some PDFs use embedded files for legitimate functionality (PDF/A archive sidecars, attached spreadsheets). Add behind an explicit dropEmbeddedFiles option, default off.
  • AcroForm scrubbing: form data may legitimately need to round-trip (e.g. invoices with form fields). Default off; opt-in.
  • True orphan-object garbage collection: only pursue if step 1 above shows orphans surviving the save. If that happens, evaluate qpdf-wasm as a separate POC.
  • /ID regeneration: pdf-lib already omits /ID, so no action — but document this in the strategy file.
  • Linearized PDF input: pdf-lib already de-linearizes on save, so cleaned files are non-linearized. Acceptable.
  • Encrypted-input handling: ignoreEncryption: true already covers the common case (we strip, output is unencrypted). Re-encryption is out of scope.
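
One way to express the opt-in and opt-out defaults described above is a small options resolver. Only `StripOptions` and `dropEmbeddedFiles` are named in this document; the other field names and `resolveStripOptions` are hypothetical placeholders, and the defaults shown simply follow the text (catalog fingerprints dropped unless the user opts back in, embedded files and AcroForm untouched unless the user opts in).

```typescript
// Field names other than dropEmbeddedFiles are hypothetical.
interface StripOptions {
  dropLang: boolean;
  dropPageLabels: boolean;
  dropOutputIntents: boolean; // users with color-managed workflows may turn this off
  dropEmbeddedFiles: boolean; // deferred: default off
  scrubAcroForm: boolean;     // deferred: default off
}

const DEFAULT_STRIP_OPTIONS: StripOptions = {
  dropLang: true,
  dropPageLabels: true,
  dropOutputIntents: true,
  dropEmbeddedFiles: false,
  scrubAcroForm: false,
};

// Merge caller overrides over the defaults without mutating either object.
function resolveStripOptions(overrides: Partial<StripOptions> = {}): StripOptions {
  return { ...DEFAULT_STRIP_OPTIONS, ...overrides };
}
```

Keeping the risky removals off by default means a plain strip never breaks attachments or form data, while a caller who wants the stronger strip states so explicitly.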

Effort estimate: ~150 lines of TypeScript additions in pdf_strategy.ts + ~50 lines of test scaffolding + 1 new fixture. One PR, mostly sequential edits to a single strategy file.