exifcleaner-web/docs/forensic/pdf.md
forgejo_admin 398728924b
test(pdf-forensic): add mat2 comparative column (#140)
Closes #134. Mirrors PR #129 for `tools/forensic/office.ts`.

## Summary

- `tools/forensic/pdf.ts` now invokes mat2 (Poppler/Cairo) as a side-by-side reference alongside ExifTool and Ghostscript. Skip-if-missing locally; mat2 stays out of CI per the convention established in #129.
- New report fields: strip-tool `Producer` fingerprints, `/Info CreationDate` / `ModDate` stamps, and a rasterised-page-content flag (any `/Subtype /Image` XObject in the output).
- The fingerprint sweep runs over `qpdf --qdf` decompressed dump **and** a raw zlib brute-force pass over the input bytes — mat2 hides `/Info` inside a stream qpdf flags with an `unexpected xref entry type` warning and silently omits from the qdf output, so the brute-force pass is what surfaces its `cairo X.Y.Z` Producer + current `CreationDate`.
- `docs/forensic/pdf.md` gains the "Comparison reference: mat2" methodology subsection, mat2 columns in the results + per-sentinel tables, footnoted explanations of why mat2's strip is total (rasterisation), and an interpretation paragraph naming the structural differences.

## Findings vs the issue's open questions

- **XMP / annotations / etc.** — mat2 drops every original indirect object as a side effect of the Poppler/Cairo rewrite. 0 sentinels survive any of the 5 recovery channels.
- **Own ModDate / Producer fingerprint** — mat2 stamps `Producer = cairo 1.18.0 (https://cairographics.org)` and a fresh `/Info CreationDate` set to the current time. `ModDate` is absent. `PdfStrategy` writes none of these.
- **Rasterisation** — confirmed. mat2's output contains `/Subtype /Image` XObjects per page (category difference); the other three tools preserve vector/text content. Documented as the headline tradeoff in the writeup.

On this fixture, `PdfStrategy` and mat2 are equivalent for sentinel survival (0 each). No `docs/gap-analysis/pdf.md` update needed — there's no divergence to note.

## Test plan

- [x] `npx tsx tools/forensic/pdf.ts` → `PdfStrategy` column reports 0 leaked across all 5 recovery channels; mat2 column populated; comparison table prints clean
- [x] `yarn typecheck`
- [x] `yarn test tests/infrastructure/wasm/pdf_strategy.test.ts` (15/15)
- [x] `yarn test tests/infrastructure/wasm/` (244/244)
- [x] `yarn lint`
- [x] `yarn check:deps`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Randa <obuvuyoviz26@gmail.com>
Reviewed-on: http://localhost:3000/forgejo_admin/exifcleaner-web/pulls/140
2026-05-16 15:18:57 +04:00


PDF forensic recovery test

Date: 2026-05-06 (initial); 2026-05-16 (mat2 comparison column added in #134, mirroring PR #129 for Office)

Goal: Verify that metadata stripped by PdfStrategy cannot be recovered by an attacker with standard PDF forensic tooling. Compare against ExifTool -all=, Ghostscript pdfwrite, and mat2 as reference points.

Reproducible at: tools/forensic/pdf.ts. Run npx tsx tools/forensic/pdf.ts from the project root.

Methodology

The runner generates a synthetic PDF fixture with 10 unique sentinel strings embedded across every metadata source the gap analysis identified. Each sentinel is a 24-character ASCII string with a unique tail (e.g. FORENSIC-AUTHOR-BBBB2222) so any survivor can be unambiguously attributed to its source.

Sources covered:

| Sentinel | Where it lives | How it was injected |
|---|---|---|
| TITLE | /Info /Title | doc.setTitle() |
| AUTHOR | /Info /Author | doc.setAuthor() |
| SUBJECT | /Info /Subject | doc.setSubject() |
| PRODUCER | /Info /Producer | doc.setProducer() |
| CREATOR | /Info /Creator | doc.setCreator() |
| XMP_CREATOR | XMP /Metadata stream dc:creator | raw stream object |
| XMP_TITLE | XMP /Metadata stream dc:title | raw stream object |
| ANNOT_AUTHOR | Page annotation /T | low-level annotation object |
| ANNOT_COMMENT | Page annotation /Contents | low-level annotation object |
| LANG | /Catalog /Lang | catalog dict set |
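
The attribution idea behind the sentinels can be sketched as follows. This is a minimal illustration, not the runner's code: the SENTINELS map and findSurvivors are hypothetical names, and only the AUTHOR value is taken from this writeup (the other values are placeholders in the same pattern).

```typescript
// Hypothetical sentinel map. Only the AUTHOR value appears in this writeup;
// the other values are illustrative placeholders, not the runner's real ones.
const SENTINELS: Record<string, string> = {
  TITLE: "FORENSIC-TITLE-AAAA1111",
  AUTHOR: "FORENSIC-AUTHOR-BBBB2222",
  SUBJECT: "FORENSIC-SUBJECT-CCCC3333",
};

// Return the names of every sentinel whose value occurs in the given text.
// Because each value has a unique tail, a hit maps back to exactly one source.
function findSurvivors(
  text: string,
  sentinels: Record<string, string> = SENTINELS,
): string[] {
  return Object.entries(sentinels)
    .filter(([, value]) => text.includes(value))
    .map(([name]) => name);
}

console.log(findSurvivors("<< /Author (FORENSIC-AUTHOR-BBBB2222) >>"));
```

Any recovery channel that can produce text (a strings dump, a tag listing, a decompressed stream) can be fed through the same lookup.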

The fixture is then stripped four ways:

  1. PdfStrategy — our pdf-lib-based implementation
  2. exiftool -all= -overwrite_original — the canonical reference
  3. gs -sDEVICE=pdfwrite — Ghostscript clean-rewrite as a third comparison
  4. mat2 — the FOSS reference used by Tails OS (see "Comparison reference: mat2" below)

For each output, the runner applies five recovery techniques:

  1. Raw strings — finds sentinels left in unencoded form anywhere in the file
  2. exiftool -a -G1 -s — every visible metadata tag including hidden namespaces
  3. exiftool -PDF-update:all= — ExifTool's "revert my last update" pseudo-tag, which restores metadata that was hidden via incremental updates
  4. qpdf --qdf --object-streams=disable — decompresses every FlateDecode stream and disables object streams, exposing all dictionary contents in plain text
  5. Walk every indirect object via pdf-lib — decompress streams in-process and search for sentinels
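
As an illustration of the first technique, the raw-strings pass can be approximated in-process. The sketch below assumes a printable-ASCII run extractor roughly equivalent to strings with its default minimum length of 4; printableRuns and rawStringHits are hypothetical helper names, and the runner itself may simply shell out to strings.

```typescript
// Extract runs of printable ASCII (0x20..0x7e) that are at least minLen long,
// which is roughly what the `strings` utility reports.
function printableRuns(bytes: Uint8Array, minLen = 4): string[] {
  const runs: string[] = [];
  let current = "";
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    if (b >= 0x20 && b <= 0x7e) {
      current += String.fromCharCode(b);
    } else {
      if (current.length >= minLen) runs.push(current);
      current = "";
    }
  }
  if (current.length >= minLen) runs.push(current);
  return runs;
}

// Report which sentinel values appear inside any printable run.
function rawStringHits(bytes: Uint8Array, sentinels: string[]): string[] {
  const runs = printableRuns(bytes);
  return sentinels.filter((s) => runs.some((run) => run.includes(s)));
}
```

This only sees unencoded bytes; sentinels inside FlateDecode streams need the decompressing techniques that follow it in the list.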

Plus structural checks: presence of /Prev (incremental-update chain), presence of the literal BeginExifToolUpdate marker, qpdf --check validity, raster-page detection (any /Subtype /Image XObject in the decompressed dump — flags mat2's rasterising rewrite), and a sweep for strip-tool fingerprints (Producer stamps + /Info CreationDate / ModDate). The fingerprint sweep runs both against the qdf-decompressed output and a raw zlib brute-force walk of the input bytes — mat2 hides /Info inside a stream qpdf flags with an unexpected xref entry type warning and silently omits from the qdf output, so the brute-force pass is what surfaces its cairo X.Y.Z Producer + fresh CreationDate.
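
A raw zlib brute-force walk of the kind described above might look like the following. This is a sketch under assumptions, not the runner's implementation: bruteForceInflate is a hypothetical name, and the header heuristic (0x78 followed by one of the four common level bytes) is one reasonable way to pick candidate offsets.

```typescript
import { deflateSync, inflateSync } from "node:zlib";

// Second bytes that commonly follow 0x78 in a zlib header.
const ZLIB_SECOND_BYTES = new Set([0x01, 0x5e, 0x9c, 0xda]);

// Try to inflate at every plausible zlib header offset and keep whatever
// decompresses cleanly. Offsets that are not real streams throw and are skipped.
function bruteForceInflate(bytes: Buffer): string[] {
  const recovered: string[] = [];
  for (let i = 0; i + 1 < bytes.length; i++) {
    if (bytes[i] !== 0x78 || !ZLIB_SECOND_BYTES.has(bytes[i + 1])) continue;
    try {
      recovered.push(inflateSync(bytes.subarray(i)).toString("latin1"));
    } catch {
      // Not a valid zlib stream at this offset; keep scanning.
    }
  }
  return recovered;
}

// Demo: a fake object with one compressed stream surrounded by plain bytes.
const demo = Buffer.concat([
  Buffer.from("5 0 obj\nstream\n", "latin1"),
  deflateSync(Buffer.from("/Producer (cairo 1.18.0)", "latin1")),
  Buffer.from("\nendstream\nendobj\n", "latin1"),
]);
const demoHits = bruteForceInflate(demo).filter((t) => t.includes("/Producer"));
console.log(demoHits);
```

The walk ignores the xref table entirely, which is why it can still reach a stream that qpdf omits from its qdf output.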

Comparison reference: mat2

Since 2026-05-16 (#134) the runner additionally invokes mat2 on the fixture as a reference. mat2 is the FOSS metadata-removal tool used by Tails OS — its PDF backend hands the file to Poppler, renders each page to a Cairo surface, and writes a new single-trailer PDF. This is a category difference from the other three tools. The privacy upside is strong: every original indirect object, including the XMP stream and annotations, is dropped. The cost is that page content is rasterised — text becomes bitmap, copy/paste and search stop working. PdfStrategy, ExifTool, and Ghostscript all preserve the original page contents intact.

mat2 also stamps its own fingerprints into the rewrite: Producer becomes cairo X.Y.Z (https://cairographics.org) and a fresh CreationDate set to the current time. We report both in the strip-tool-fingerprint sweep so the comparison is honest. Note that on the fixture used here, mat2's output triggers a qpdf --check warning about an unexpected xref entry type — the file still opens in modern viewers but has structural irregularities qpdf flags.

The comparison is informational: the strict bar for PdfStrategy is still zero sentinel survival across every recovery channel. mat2 is invoked locally (skip-if-missing); install with sudo apt install mat2 on Debian/Ubuntu. If mat2 is absent the column reports (skipped) and the runner still passes. Same convention as the Office runner (PR #129).
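
The skip-if-missing convention could be implemented along these lines. The helper names are hypothetical and the probe arguments are an assumption (any cheap invocation would do); the sketch only relies on spawnSync reporting an error when the executable cannot be launched.

```typescript
import { spawnSync } from "node:child_process";

// True if the binary can be launched at all. spawnSync sets `error`
// (typically ENOENT) when the executable is not found on PATH.
function isToolAvailable(binary: string, probeArgs: string[] = ["--version"]): boolean {
  const result = spawnSync(binary, probeArgs, { stdio: "ignore" });
  return result.error === undefined;
}

// Hypothetical column helper: run the comparison only when the tool exists,
// otherwise report "(skipped)" without failing the run.
function columnValue(binary: string, countLeaks: () => number): string {
  return isToolAvailable(binary) ? String(countLeaks()) : "(skipped)";
}
```

Keeping the probe separate from the comparison means a missing mat2 never changes the pass/fail outcome for PdfStrategy.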

Results

| Check | Input fixture | PdfStrategy | ExifTool -all= | Ghostscript pdfwrite | mat2 (Poppler/Cairo) |
|---|---|---|---|---|---|
| Output size | 1 958 bytes | 492 bytes | 2 249 bytes | 3 502 bytes | 2 708 bytes |
| Has /Prev (incremental update chain) | no | no | yes | no | yes¹ |
| BeginExifToolUpdate marker | no | no | yes | no | no |
| qpdf --check valid | yes | yes | yes | yes | warn² |
| Rasterised page content | no | no | no | no | yes³ |
| Raw strings sentinels | 5 | 0 | 5 | 6 | 0 |
| ExifTool visible tags | 8 | 0 | 1 | 4 | 0 |
| After -PDF-update:all= revert | 8 | 0 | 8 | 4 | 0 |
| qpdf --qdf decompressed | 5 | 0 | 3 | 6 | 0 |
| Walk all streams (pdf-lib) | 2 | 0 | 2 | 4 | 0⁴ |
| Strip-tool Producer fingerprint | n/a | none | none | GPL Ghostscript 10.02.1 | cairo 1.18.0 (https://cairographics.org) |
| /Info CreationDate after strip | original date | absent | preserved | current time | current time |
| /Info ModDate after strip | original date | absent | absent | current time | absent |

¹ mat2's output carries an incremental-update layer because it uses an xref table shape pdf-lib refuses to load (and qpdf flags). The original /Info object remains in the file but is not reachable via the new xref entries — exiftool reports nothing, qpdf --qdf silently drops it from the qdf output. A byte-level zlib walk over the stream contents is required to surface mat2's cairo Producer + current CreationDate (see "Comparison reference: mat2" above for why the runner does both passes).

² qpdf --check emits object 14/0 has unexpected xref entry type on mat2's output and exits with the "operation succeeded with warnings" status. Modern viewers still open the file; older or stricter consumers may not.

³ "Rasterised" means at least one /Subtype /Image XObject is present in the output's resource dict. mat2's Poppler/Cairo pipeline emits two such image XObjects per page; the other three tools emit none. This is the category difference — mat2 destroys vector/text content in the rewrite.

⁴ pdf-lib refuses to load mat2's output (object 14/0 has unexpected xref entry type), so this row is vacuously zero — the walker can't enumerate any objects in the first place. The byte-level zlib walk and exiftool confirm independently that mat2's output has no surviving sentinels.
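
The structural rows of the table (the /Prev chain, the ExifTool marker, and the rasterisation flag from footnote ³) reduce to simple pattern checks over a text view of the file. The sketch below is an assumption about how such checks could be written; StructuralFlags and structuralFlags are hypothetical names, and the regexes are illustrative rather than the runner's own.

```typescript
interface StructuralFlags {
  hasPrev: boolean; // trailer points at an earlier xref section
  hasExifToolMarker: boolean; // literal marker left by ExifTool's update layer
  rasterised: boolean; // at least one image XObject is present
}

// Derive the three flags from a decompressed (or raw latin1) dump of the PDF.
function structuralFlags(dump: string): StructuralFlags {
  return {
    hasPrev: /\/Prev\s+\d+/.test(dump),
    hasExifToolMarker: dump.includes("BeginExifToolUpdate"),
    rasterised: /\/Subtype\s*\/Image\b/.test(dump),
  };
}

console.log(structuralFlags("<< /Type /XObject /Subtype /Image /Width 1 >>"));
```

Note that the rasterised flag only says an image XObject exists; on a real document with embedded photos it would fire for the non-rasterising tools as well, so it is meaningful mainly on a fixture like this one.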

Side-by-side: which sentinels survive after each tool?

| Sentinel | Source | PdfStrategy | ExifTool | Ghostscript | mat2 |
|---|---|---|---|---|---|
| TITLE | /Info /Title | removed | LEAKED⁵ | LEAKED⁶ | removed⁷ |
| AUTHOR | /Info /Author | removed | LEAKED⁵ | LEAKED⁶ | removed⁷ |
| SUBJECT | /Info /Subject | removed | LEAKED⁵ | LEAKED⁶ | removed⁷ |
| PRODUCER | /Info /Producer | removed | LEAKED⁵ | removed⁸ | removed⁷ |
| CREATOR | /Info /Creator | removed | LEAKED⁵ | LEAKED⁶ | removed⁷ |
| XMP_CREATOR | XMP dc:creator stream | removed | LEAKED⁵ | removed⁹ | removed⁷ |
| XMP_TITLE | XMP dc:title stream | removed | LEAKED⁵ | removed⁹ | removed⁷ |
| ANNOT_AUTHOR | Page annotation /T | removed | LEAKED⁵ | LEAKED¹⁰ | removed⁷ |
| ANNOT_COMMENT | Page annotation /Contents | removed | LEAKED⁵ | LEAKED¹⁰ | removed⁷ |
| LANG | /Catalog /Lang | removed | LEAKED⁵ | removed¹¹ | removed⁷ |

⁵ Recovered via exiftool -PDF-update:all= revert. ExifTool's incremental-update strip is reversible by design — see Interpretation below.

⁶ Ghostscript pdfwrite copies the Info dict through unchanged; rewrites Producer only.

⁷ mat2 rasterises the page content via Cairo and writes a new single-trailer PDF that drops every original indirect object, including the XMP stream and the annotation list. The category difference (raster vs vector) is what makes the strip total.

⁸ Ghostscript replaces /Producer with its own stamp (GPL Ghostscript X.YZ).

⁹ Ghostscript drops the XMP /Metadata stream during the rewrite.

¹⁰ Ghostscript carries annotations through the rewrite unchanged; /T and /Contents survive intact.

¹¹ Ghostscript drops the catalog-level /Lang during the rewrite.

Comparison summary: 0 sentinels leaked by both PdfStrategy and mat2; 0 leaked by PdfStrategy only; 0 leaked by mat2 only. On this fixture the two tools are equivalent for sentinel survival — but reach the same outcome by different means (PdfStrategy strips structured metadata while preserving page content; mat2 rasterises the page content as a side effect of how it strips). ExifTool leaks all 10 sentinels under -PDF-update:all=; Ghostscript leaks 6.
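
The three counts in the comparison summary are a plain set comparison over the leaked-sentinel lists of two tools. A minimal sketch, with compareLeaks as a hypothetical name:

```typescript
interface LeakComparison {
  both: string[]; // leaked by both tools
  onlyA: string[]; // leaked by the first tool only
  onlyB: string[]; // leaked by the second tool only
}

// Partition two lists of leaked sentinel names into shared and exclusive sets.
function compareLeaks(a: string[], b: string[]): LeakComparison {
  const setA = new Set(a);
  const setB = new Set(b);
  return {
    both: a.filter((name) => setB.has(name)),
    onlyA: a.filter((name) => !setB.has(name)),
    onlyB: b.filter((name) => !setA.has(name)),
  };
}

// On this fixture both PdfStrategy and mat2 leak nothing, so all three are empty.
console.log(compareLeaks([], []));
```

A non-empty onlyA or onlyB is the divergence that would warrant a gap-analysis update.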

Interpretation

PdfStrategy and mat2 both achieve zero sentinel survival on this fixture, but they get there differently:

  • PdfStrategy strips structured metadata while keeping page content intact. Direct deletion of Info-dict keys (not "set to empty" — actually removed); updateMetadata: false on load (defeats pdf-lib's auto-stamp of Producer/ModDate); both the catalog /Metadata reference and the indirect XMP stream object are dropped; pages are walked to scrub annotation /T / /Contents / /M / /CreationDate; pdf-lib's single-trailer rewrite means no /Prev chain and no incremental update. The original metadata is genuinely gone — not hidden, not pending revert, not buried in an orphan stream — and the output is also the smallest of the four (492 bytes vs ExifTool's 2 249, Ghostscript's 3 502, mat2's 2 708). No Producer fingerprint, no Info CreationDate, no ModDate — absence is the natural state per the privacy invariant in §6 of .claude/rules/privacy-invariants.md.

  • mat2 rasterises the page via Poppler/Cairo and emits a new file that drops every original indirect object. This achieves the same sentinel-removal outcome via a structural rewrite rather than targeted deletion. The cost is real: page text becomes a bitmap (copy/paste and search stop working), output is ~5.5× larger than PdfStrategy's, and mat2 stamps the file with its own Producer = cairo X.Y.Z and a fresh /Info CreationDate set to the current time — both are tool-of-strip fingerprints we deliberately avoid in PdfStrategy's output. mat2's output also triggers a qpdf --check warning ("object 14/0 has unexpected xref entry type") and pdf-lib refuses to load it altogether.

ExifTool's PDF strip is reversible by design. Per its own docs: "PDF — The original metadata is never actually deleted." The test demonstrates this concretely: a one-line exiftool -PDF-update:all= recovers all 10 original sentinels including the entire Info dictionary and the XMP dc:creator and dc:title. ExifTool also leaves the literal string BeginExifToolUpdate in the file as a fingerprint that the file was processed by ExifTool.

Versus mat2 (the FOSS reference): 0 sentinels leaked by both, 0 by PdfStrategy only, 0 by mat2 only — equivalent on raw sentinel survival for the surfaces this fixture exercises. The differences are structural, not in what PII survives: (1) mat2 rasterises page content and PdfStrategy does not; (2) mat2 stamps cairo/current-time fingerprints into the rewrite and PdfStrategy writes neither; (3) mat2's output is ~5.5× larger; (4) mat2's output trips qpdf --check warnings and pdf-lib refuses it. For a privacy-focused tool whose purpose is "strip metadata while preserving the document," PdfStrategy is the right shape; mat2's rasterising approach is a valid privacy stance with different tradeoffs.

Caveats and limits of this test

  • mat2 comparison is local-only. The runner invokes mat2 if installed; otherwise the mat2 column reports (skipped). CI doesn't have mat2; the comparison runs as part of the local pre-merge forensic pass per .claude/rules/format-strategy-workflow.md. Install: sudo apt install mat2 on Debian/Ubuntu. Same convention as the Office runner.
  • mat2's strip is rasterising. The runner reports this in the Rasterised page content row and in footnote ³ of the results table. On a PDF with real text and graphics, mat2's output will not be reusable as a vector document — copy/paste and search stop working. PdfStrategy preserves the original page content; if the user's intent is to ship a clean version of the document, PdfStrategy is the right choice.
  • AcroForm fields, JavaScript actions, embedded files. Neither mat2 nor PdfStrategy is currently exercised against these surfaces by this fixture. PdfStrategy defers AcroForm and /Names /EmbeddedFiles behind opt-in (both can carry legitimate document content). mat2's strip would drop them as a side effect of the Cairo rewrite, at the cost of also dropping any form data or attachments. Documented as a known gap in docs/gap-analysis/pdf.md; not yet a sentinel-tested surface here.
  • The fixture is synthetic — generated with pdf-lib + low-level dict manipulation. Real-world PDFs from Word, Acrobat, InDesign, etc. have richer XMP profiles (XMP MM history, prismeta, Adobe-specific extensions) that this fixture doesn't exercise. Our strip drops the entire /Metadata stream regardless of contents, so the result should still be zero survival, but extending the fixture with a captured real-world XMP is a worthwhile follow-up.
  • Page content is empty in the fixture. pdf-lib's addPage([612, 792]) creates a page with no /Contents stream. mat2 still rasterises the (empty) page and emits two image XObjects, which the rasterisation detector picks up. On a PDF with real content, mat2's rasterisation is more obviously destructive (text → bitmap); the structural detection is unchanged.
  • The 10 sentinels cover the categories from the pdf.md gap analysis. We did not test page-level /Metadata streams (covered by the strategy code but not by this fixture) or /Thumb thumbnails — both are dropped by the same code paths but not exercised by sentinel here.
  • The -PDF-update:all= revert was tried against all five files (input, our-stripped, exiftool-stripped, gs-stripped, mat2-cleaned). It only succeeded on exiftool-stripped, which is expected — the others have no ExifTool update layer to revert, and ExifTool reports File contains no previous ExifTool update.
  • The runner is informational, not strict (unlike the Office runner, which exits non-zero on any sentinel survival). Wiring tools/forensic/pdf.ts into a CI test or release-gate that fails if any sentinel survives is a natural follow-up.
  • The producer-fingerprint sweep is best-effort. PDF allows balanced unescaped parens inside string literals; our regex captures up to the first ), so a Producer like cairo 1.18.0 (https://cairographics.org) is reported as cairo 1.18.0 (https://cairographics.org (without the closing paren). The prefix is unambiguous; the cosmetic loss doesn't change the fingerprint identification.
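
The truncation described in the last bullet could be avoided with a small depth-tracking reader in place of the regex. The following is a sketch of that alternative, not what the runner currently does; readLiteralString and producerOf are hypothetical names, and escape sequences are kept verbatim rather than decoded.

```typescript
// Read a PDF literal string whose opening "(" is at index `open`.
// Nesting depth is tracked so balanced unescaped parens stay inside the value,
// and the character after a backslash is skipped so "\)" does not close it.
function readLiteralString(src: string, open: number): string | null {
  if (src[open] !== "(") return null;
  let depth = 0;
  for (let i = open; i < src.length; i++) {
    const ch = src[i];
    if (ch === "\\") {
      i++; // skip the escaped character
      continue;
    }
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth === 0) return src.slice(open + 1, i);
    }
  }
  return null; // unterminated string
}

// Locate the first /Producer key and read its literal-string value in full.
function producerOf(src: string): string | null {
  const m = /\/Producer\s*\(/.exec(src);
  return m ? readLiteralString(src, m.index + m[0].length - 1) : null;
}

console.log(producerOf("<< /Producer (cairo 1.18.0 (https://cairographics.org)) >>"));
```

This handles only literal strings; a Producer written as a hex string (angle brackets) would need a separate branch.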

Reproducing

# From the project root
npx tsx tools/forensic/pdf.ts

Outputs go to /tmp/pdf-forensic/:

  • input.pdf: the rich fixture
  • our-stripped.pdf: PdfStrategy output
  • exiftool-stripped.pdf: exiftool -all= output
  • gs-stripped.pdf: Ghostscript pdfwrite output
  • _mat2-pdf.cleaned.pdf: mat2 output (only when mat2 is installed)
  • report.json: structured per-output sentinel-survival data, including the new rasterised / producerFingerprints / creationDate / modDate fields
  • *-revert.pdf, *-qdf.pdf: intermediate files from the recovery battery
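
For anyone consuming report.json, or wiring the runner into a strict gate, a typed view of one per-output entry might look like this. Only rasterised, producerFingerprints, creationDate, and modDate are field names taken from this writeup; the tool and survivors fields, and both function names, are assumptions made for the sketch.

```typescript
// Hypothetical shape of one per-output entry in report.json.
interface OutputReport {
  tool: string; // assumed field: which stripper produced this output
  survivors: Record<string, string[]>; // assumed field: recovery channel -> leaked sentinel names
  rasterised: boolean;
  producerFingerprints: string[];
  creationDate: string | null;
  modDate: string | null;
}

// Total leaked sentinels across every recovery channel for one output.
function totalLeaks(report: OutputReport): number {
  return Object.values(report.survivors).reduce((n, names) => n + names.length, 0);
}

// Exit code for a strict gate: 0 when the gated tool leaks nothing,
// 1 when anything survives, 2 when the tool's entry is missing.
function strictExitCode(reports: OutputReport[], gatedTool: string): number {
  const gated = reports.find((r) => r.tool === gatedTool);
  if (!gated) return 2;
  return totalLeaks(gated) === 0 ? 0 : 1;
}
```

Gating on a single tool keeps the reference columns informational: ExifTool and Ghostscript are expected to leak and should not fail the run.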

Required tools: exiftool (libimage-exiftool-perl), qpdf, gs (Ghostscript), strings (binutils). All available on Debian/Ubuntu via apt.

Optional: mat2 — comparative reference column. Skipped cleanly if absent.

Debian/Ubuntu one-liner: sudo apt install libimage-exiftool-perl qpdf ghostscript binutils mat2.

What this directory is for

docs/forensic/ documents adversarial recovery tests run after implementation lands, complementing docs/gap-analysis/ (which runs before implementation to scope what should be removed). The pattern: implement → unit-test correctness → forensic-test unrecoverability → document the result.

Each format gets its own writeup as we go: pdf.md here, jpeg.md next time we run the same battery on JPEG fixtures with embedded EXIF/XMP/IPTC, etc. The runner scripts at tools/forensic/<format>.ts stay in the repo so the tests can be re-run any time the strategy changes.