Closes #134. Mirrors PR #129 for `tools/forensic/office.ts`.

## Summary

- `tools/forensic/pdf.ts` now invokes mat2 (Poppler/Cairo) as a side-by-side reference alongside ExifTool and Ghostscript. Skip-if-missing locally; mat2 stays out of CI per the convention established in #129.
- New report fields: strip-tool `Producer` fingerprints, `/Info CreationDate` / `ModDate` stamps, and a rasterised-page-content flag (any `/Subtype /Image` XObject in the output).
- The fingerprint sweep runs over the `qpdf --qdf` decompressed dump **and** a raw zlib brute-force pass over the input bytes. mat2 hides `/Info` inside a stream that qpdf flags with an `unexpected xref entry type` warning and silently omits from the qdf output, so the brute-force pass is what surfaces its `cairo X.Y.Z` Producer + current `CreationDate`.
- `docs/forensic/pdf.md` gains the "Comparison reference: mat2" methodology subsection, mat2 columns in the results + per-sentinel tables, footnoted explanations of why mat2's strip is total (rasterisation), and an interpretation paragraph naming the structural differences.

## Findings vs the issue's open questions

- **XMP / annotations / etc.**: mat2 drops every original indirect object as a side effect of the Poppler/Cairo rewrite. 0 sentinels survive any of the 5 recovery channels.
- **Own ModDate / Producer fingerprint**: mat2 stamps `Producer = cairo 1.18.0 (https://cairographics.org)` and a fresh `/Info CreationDate` set to the current time. `ModDate` is absent. `PdfStrategy` writes none of these.
- **Rasterisation**: confirmed. mat2's output contains `/Subtype /Image` XObjects per page (category difference); the other three tools preserve vector/text content. Documented as the headline tradeoff in the writeup.

On this fixture, `PdfStrategy` and mat2 are equivalent for sentinel survival (0 each). No `docs/gap-analysis/pdf.md` update needed, since there's no divergence to note.

## Test plan

- [x] `npx tsx tools/forensic/pdf.ts` → `PdfStrategy` column reports 0 leaked across all 5 recovery channels; mat2 column populated; comparison table prints clean
- [x] `yarn typecheck`
- [x] `yarn test tests/infrastructure/wasm/pdf_strategy.test.ts` (15/15)
- [x] `yarn test tests/infrastructure/wasm/` (244/244)
- [x] `yarn lint`
- [x] `yarn check:deps`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Randa <obuvuyoviz26@gmail.com>
Reviewed-on: http://localhost:3000/forgejo_admin/exifcleaner-web/pulls/140
PDF forensic recovery test
Date: 2026-05-06 (initial); 2026-05-16 (mat2 comparison column added — #134, mirroring PR #129 for Office)
Goal: Verify that metadata stripped by PdfStrategy cannot be recovered by an attacker with standard PDF forensic tooling. Compare against ExifTool -all=, Ghostscript pdfwrite, and mat2 as reference points.
Reproducible at: tools/forensic/pdf.ts — npx tsx tools/forensic/pdf.ts from the project root.
Methodology
The runner generates a synthetic PDF fixture with 10 unique sentinel strings embedded across every metadata source the gap analysis identified. Each sentinel is a 24-character ASCII string with a unique tail (e.g. FORENSIC-AUTHOR-BBBB2222) so any survivor can be unambiguously attributed to its source.
Sources covered:
| Sentinel | Where it lives | How it was injected |
|---|---|---|
| `TITLE` | `/Info /Title` | `doc.setTitle()` |
| `AUTHOR` | `/Info /Author` | `doc.setAuthor()` |
| `SUBJECT` | `/Info /Subject` | `doc.setSubject()` |
| `PRODUCER` | `/Info /Producer` | `doc.setProducer()` |
| `CREATOR` | `/Info /Creator` | `doc.setCreator()` |
| `XMP_CREATOR` | XMP `/Metadata` stream `dc:creator` | raw stream object |
| `XMP_TITLE` | XMP `/Metadata` stream `dc:title` | raw stream object |
| `ANNOT_AUTHOR` | Page annotation `/T` | low-level annotation object |
| `ANNOT_COMMENT` | Page annotation `/Contents` | low-level annotation object |
| `LANG` | `/Catalog /Lang` | catalog dict set |
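The attribution idea can be sketched as a plain byte search: because every sentinel has a unique tail, a hit names its source directly. This is a minimal sketch, not the runner's actual code. `findSurvivors` is a hypothetical helper, the `TITLE` value is made up (only the `AUTHOR` example appears in this writeup), and the UTF-16BE check is an added assumption about how a PDF text string might be stored.

```typescript
import { Buffer } from "node:buffer";

/** Returns the names of the sentinels whose value occurs anywhere in `bytes`. */
function findSurvivors(bytes: Uint8Array, sentinels: Record<string, string>): string[] {
  const haystack = Buffer.from(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return Object.entries(sentinels)
    .filter(([, value]) => {
      const ascii = Buffer.from(value, "latin1");
      // Assumption: a sentinel may also be stored as UTF-16BE. Node only offers
      // utf16le, so swap each byte pair to get the big-endian form.
      const utf16be = Buffer.from(value, "utf16le").swap16();
      return haystack.includes(ascii) || haystack.includes(utf16be);
    })
    .map(([name]) => name);
}

// Illustrative sentinel map: AUTHOR is the example from the text, TITLE is made up.
const demoSentinels = {
  AUTHOR: "FORENSIC-AUTHOR-BBBB2222",
  TITLE: "FORENSIC-TITLE-AAAA1111",
};
const demoBytes = Buffer.from("<< /Author (FORENSIC-AUTHOR-BBBB2222) >>", "latin1");
const demoSurvivors = findSurvivors(demoBytes, demoSentinels);
console.log(demoSurvivors);
```

A search like this only sees unencoded bytes; sentinels inside compressed streams need the decompression passes described in the recovery techniques.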
The fixture is then stripped four ways:
- `PdfStrategy`: our pdf-lib-based implementation
- `exiftool -all= -overwrite_original`: the canonical reference
- `gs -sDEVICE=pdfwrite`: Ghostscript clean-rewrite as a third comparison
- mat2: the FOSS reference used by Tails OS (see "Comparison reference: mat2" below)
For each output, the runner applies five recovery techniques:
- Raw `strings`: finds sentinels left in unencoded form anywhere in the file
- `exiftool -a -G1 -s`: every visible metadata tag including hidden namespaces
- `exiftool -PDF-update:all=`: ExifTool's "revert my last update" pseudo-tag, which restores metadata that was hidden via incremental updates
- `qpdf --qdf --object-streams=disable`: decompresses every FlateDecode stream and disables object streams, exposing all dictionary contents in plain text
- Walk every indirect object via pdf-lib: decompress streams in-process and search for sentinels
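The four external techniques in this list can be expressed as a small command table. The sketch below is an assumption about how such a battery might be organised, not a copy of the runner: `recoveryCommands`, `runCommand`, and the output naming are hypothetical (the names follow the `*-revert.pdf` / `*-qdf.pdf` pattern mentioned under "Reproducing"), and writing the revert to a separate file with ExifTool's `-o` option is a choice made here. The pdf-lib walk is in-process and so is not part of the table.

```typescript
import { spawnSync } from "node:child_process";
import { posix as path } from "node:path";

interface RecoveryCommand {
  label: string;
  cmd: string;
  args: string[];
}

/** Builds the external recovery commands for one stripped PDF (hypothetical layout). */
function recoveryCommands(pdf: string, workDir: string): RecoveryCommand[] {
  const stem = path.basename(pdf, ".pdf");
  return [
    { label: "strings", cmd: "strings", args: [pdf] },
    { label: "exiftool-tags", cmd: "exiftool", args: ["-a", "-G1", "-s", pdf] },
    {
      label: "exiftool-revert",
      cmd: "exiftool",
      args: ["-PDF-update:all=", "-o", path.join(workDir, `${stem}-revert.pdf`), pdf],
    },
    {
      label: "qpdf-qdf",
      cmd: "qpdf",
      args: ["--qdf", "--object-streams=disable", pdf, path.join(workDir, `${stem}-qdf.pdf`)],
    },
  ];
}

/** Runs one command and returns its stdout as latin1 text, ready for a sentinel search. */
function runCommand(c: RecoveryCommand): string {
  const res = spawnSync(c.cmd, c.args, { encoding: "latin1" });
  return res.stdout ?? "";
}

const demoBattery = recoveryCommands("/tmp/pdf-forensic/our-stripped.pdf", "/tmp/pdf-forensic");
console.log(demoBattery.map((c) => `${c.cmd} ${c.args.join(" ")}`).join("\n"));
```

Keeping the commands as data makes it easy to run the same battery against every stripped output and to log exactly what was executed.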
Plus structural checks: presence of /Prev (incremental-update chain), presence of the literal BeginExifToolUpdate marker, qpdf --check validity, raster-page detection (any /Subtype /Image XObject in the decompressed dump, which flags mat2's rasterising rewrite), and a sweep for strip-tool fingerprints (Producer stamps plus /Info CreationDate / ModDate). The fingerprint sweep runs against both the qdf-decompressed output and a raw zlib brute-force walk of the file's bytes. The second pass is needed because mat2 places /Info inside a stream that qpdf flags with an "unexpected xref entry type" warning and silently omits from the qdf output, so only the brute-force pass surfaces its cairo X.Y.Z Producer and fresh CreationDate.
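A raw zlib brute-force pass of the kind described here could look roughly like the sketch below: try to inflate at every offset that looks like a zlib header, then run the fingerprint regexes over everything that decoded. The function names, the header heuristic, and the synthetic test file are assumptions for illustration; the regexes deliberately stop at the first `)`, matching the best-effort behaviour noted in the caveats.

```typescript
import { Buffer } from "node:buffer";
import { constants, deflateSync, inflateSync } from "node:zlib";

/** Tries to inflate at every offset that looks like a zlib header (CMF 0x78, valid check bits). */
function bruteForceInflate(bytes: Buffer): Buffer[] {
  const found: Buffer[] = [];
  for (let i = 0; i + 1 < bytes.length; i++) {
    if (bytes[i] !== 0x78) continue;
    if (((bytes[i] << 8) | bytes[i + 1]) % 31 !== 0) continue;
    try {
      // Z_SYNC_FLUSH lets a truncated stream return whatever decoded so far.
      const out = inflateSync(bytes.subarray(i), { finishFlush: constants.Z_SYNC_FLUSH });
      if (out.length > 0) found.push(out);
    } catch {
      // Not a real zlib stream at this offset; keep scanning.
    }
  }
  return found;
}

interface Fingerprints {
  producers: string[];
  creationDates: string[];
  modDates: string[];
}

/** Sweeps the raw bytes and every inflated candidate for Producer and date stamps. */
function sweepFingerprints(bytes: Buffer): Fingerprints {
  const texts = [bytes, ...bruteForceInflate(bytes)].map((b) => b.toString("latin1"));
  const collect = (re: RegExp): string[] => [
    ...new Set(texts.flatMap((t) => [...t.matchAll(re)].map((m) => m[1]))),
  ];
  return {
    producers: collect(/\/Producer\s*\(([^)]*)/g),
    creationDates: collect(/\/CreationDate\s*\(([^)]*)/g),
    modDates: collect(/\/ModDate\s*\(([^)]*)/g),
  };
}

// Synthetic stand-in (not real mat2 output): an Info-like dictionary hidden in a deflated stream.
const hiddenInfo = Buffer.from(
  "<< /Producer (cairo 1.18.0 (https://cairographics.org)) /CreationDate (D:20260516120000Z) >>",
  "latin1",
);
const syntheticFile = Buffer.concat([
  Buffer.from("%PDF-1.5\n14 0 obj\n<< /Filter /FlateDecode >>\nstream\n", "latin1"),
  deflateSync(hiddenInfo),
  Buffer.from("\nendstream\nendobj\n%%EOF\n", "latin1"),
]);
const demoSweep = sweepFingerprints(syntheticFile);
console.log(demoSweep);
```

The walk does not depend on the xref table at all, which is why it can still see a stream that a structure-aware tool skips.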
Comparison reference: mat2
Since 2026-05-16 (#134) the runner additionally invokes mat2 on the fixture as a reference. mat2 is the FOSS metadata-removal tool used by Tails OS — its PDF backend hands the file to Poppler, renders each page to a Cairo surface, and writes a new single-trailer PDF. This is a category difference from the other three tools. The privacy upside is strong: every original indirect object, including the XMP stream and annotations, is dropped. The cost is that page content is rasterised — text becomes bitmap, copy/paste and search stop working. PdfStrategy, ExifTool, and Ghostscript all preserve the original page contents intact.
mat2 also stamps its own fingerprints into the rewrite: Producer becomes cairo X.Y.Z (https://cairographics.org), and CreationDate is set afresh to the current time. We report both in the strip-tool-fingerprint sweep so the comparison is honest. Note that on the fixture used here, mat2's output triggers a qpdf --check warning about an unexpected xref entry type. The file still opens in modern viewers, but it has structural irregularities that qpdf flags.
The comparison is informational: the strict bar for PdfStrategy is still zero sentinel survival across every recovery channel. mat2 is invoked locally (skip-if-missing); install with sudo apt install mat2 on Debian/Ubuntu. If mat2 is absent the column reports (skipped) and the runner still passes. Same convention as the Office runner (PR #129).
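The skip-if-missing convention can be sketched as a probe followed by a tagged result. This is an illustrative shape only: `isToolAvailable`, `runMat2`, and the `Mat2Column` type are hypothetical names, and the `.cleaned.pdf` output path assumes mat2's default of writing a `.cleaned` sibling next to the input.

```typescript
import { spawnSync } from "node:child_process";

/** True when `cmd` can be launched at all; spawnSync sets `error` (e.g. ENOENT) when it cannot. */
function isToolAvailable(cmd: string, probeArgs: string[] = ["--version"]): boolean {
  const res = spawnSync(cmd, probeArgs, { stdio: "ignore" });
  return res.error === undefined;
}

type Mat2Column = { status: "ran"; outputPath: string } | { status: "skipped" };

/** Runs mat2 on `inputPdf` if it is installed; otherwise reports the column as skipped. */
function runMat2(inputPdf: string): Mat2Column {
  if (!isToolAvailable("mat2")) return { status: "skipped" };
  const res = spawnSync("mat2", [inputPdf], { stdio: "ignore" });
  if (res.status !== 0) throw new Error(`mat2 exited with status ${res.status}`);
  // Assumption: mat2 writes <name>.cleaned.pdf next to the input file.
  return { status: "ran", outputPath: inputPdf.replace(/\.pdf$/i, ".cleaned.pdf") };
}

// Probe demo using binaries whose presence is known: the running Node binary and a made-up name.
const nodeFound = isToolAvailable(process.execPath);
const bogusFound = isToolAvailable("definitely-not-an-installed-tool-4f2a");
console.log({ nodeFound, bogusFound });
```

Returning a tagged value rather than throwing when the tool is absent is what lets the runner still pass while printing `(skipped)` in the mat2 column.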
Results
| Check | Input fixture | PdfStrategy | ExifTool -all= | Ghostscript pdfwrite | mat2 (Poppler/Cairo) |
|---|---|---|---|---|---|
| Output size | 1 958 bytes | 492 bytes | 2 249 bytes | 3 502 bytes | 2 708 bytes |
| Has `/Prev` (incremental update chain) | no | no | yes | no | yes¹ |
| `BeginExifToolUpdate` marker | no | no | yes | no | no |
| `qpdf --check` valid | yes | yes | yes | yes | warn² |
| Rasterised page content | no | no | no | no | yes³ |
| Raw `strings` sentinels | 5 | 0 | 5 | 6 | 0 |
| ExifTool visible tags | 8 | 0 | 1 | 4 | 0 |
| After `-PDF-update:all=` revert | 8 | 0 | 8 | 4 | 0 |
| `qpdf --qdf` decompressed | 5 | 0 | 3 | 6 | 0 |
| Walk all streams (pdf-lib) | 2 | 0 | 2 | 4 | 0⁴ |
| Strip-tool Producer fingerprint | n/a | none | none | GPL Ghostscript 10.02.1 | cairo 1.18.0 (https://cairographics.org) |
| `/Info CreationDate` after strip | original date | absent | preserved | current time | current time |
| `/Info ModDate` after strip | original date | absent | absent | current time | absent |
¹ mat2's output carries an incremental-update layer because it uses an xref table shape pdf-lib refuses to load (and qpdf flags). The original /Info object remains in the file but is not reachable via the new xref entries — exiftool reports nothing, qpdf --qdf silently drops it from the qdf output. A byte-level zlib walk over the stream contents is required to surface mat2's cairo Producer + current CreationDate (see "Comparison reference: mat2" above for why the runner does both passes).
² qpdf --check emits object 14/0 has unexpected xref entry type on mat2's output and exits with the "operation succeeded with warnings" status. Modern viewers still open the file; older or stricter consumers may not.
³ "Rasterised" means at least one /Subtype /Image XObject is present in the output's resource dict. mat2's Poppler/Cairo pipeline emits two such image XObjects per page; the other three tools emit none. This is the category difference — mat2 destroys vector/text content in the rewrite.
⁴ pdf-lib refuses to load mat2's output (object 14/0 has unexpected xref entry type), so this row is vacuously zero — the walker can't enumerate any objects in the first place. The byte-level zlib walk and exiftool confirm independently that mat2's output has no surviving sentinels.
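The three text-based structural rows (`/Prev`, the ExifTool marker, and rasterisation) reduce to pattern checks over a decompressed dump. The sketch below is one possible formulation under stated assumptions: `structuralFlags` is a hypothetical name, and the `/Prev` pattern adds a heuristic of its own, ignoring `/Prev N G R` indirect references so that outline sibling links are not mistaken for a trailer offset.

```typescript
interface StructuralFlags {
  hasPrev: boolean;
  hasExifToolMarker: boolean;
  rasterised: boolean;
}

/** Derives structural flags from a decompressed (qdf-style) text dump of a PDF. */
function structuralFlags(dump: string): StructuralFlags {
  return {
    // A trailer /Prev is a bare byte offset; "/Prev 5 0 R" is an outline link, so skip it.
    hasPrev: /\/Prev\s+\d+\b(?!\s+\d+\s+R\b)/.test(dump),
    hasExifToolMarker: dump.includes("BeginExifToolUpdate"),
    // Any image XObject counts as rasterised page content for this check.
    rasterised: /\/Subtype\s*\/Image\b/.test(dump),
  };
}

// Hand-written snippets standing in for real dumps.
const exiftoolLike = "trailer\n<< /Size 12 /Prev 1958 /Root 1 0 R >>\n% BeginExifToolUpdate\n";
const mat2Like = "<< /Type /XObject /Subtype /Image /Width 612 /Height 792 >>";
const outlineOnly = "<< /Title (Chapter) /Prev 5 0 R /Next 7 0 R >>";

const flagsExiftool = structuralFlags(exiftoolLike);
const flagsMat2 = structuralFlags(mat2Like);
const flagsOutline = structuralFlags(outlineOnly);
console.log(flagsExiftool, flagsMat2, flagsOutline);
```

Because footnote ¹ notes that qpdf can omit parts of mat2's output from the qdf dump, checks like these are only as complete as the dump they are given.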
Side-by-side: which sentinels survive after each tool?
| Sentinel | Source | PdfStrategy | ExifTool | Ghostscript | mat2 |
|---|---|---|---|---|---|
| `TITLE` | `/Info /Title` | removed | LEAKED⁵ | LEAKED⁶ | removed⁷ |
| `AUTHOR` | `/Info /Author` | removed | LEAKED⁵ | LEAKED⁶ | removed⁷ |
| `SUBJECT` | `/Info /Subject` | removed | LEAKED⁵ | LEAKED⁶ | removed⁷ |
| `PRODUCER` | `/Info /Producer` | removed | LEAKED⁵ | removed⁸ | removed⁷ |
| `CREATOR` | `/Info /Creator` | removed | LEAKED⁵ | LEAKED⁶ | removed⁷ |
| `XMP_CREATOR` | XMP `dc:creator` stream | removed | LEAKED⁵ | removed⁹ | removed⁷ |
| `XMP_TITLE` | XMP `dc:title` stream | removed | LEAKED⁵ | removed⁹ | removed⁷ |
| `ANNOT_AUTHOR` | Page annotation `/T` | removed | LEAKED⁵ | LEAKED¹⁰ | removed⁷ |
| `ANNOT_COMMENT` | Page annotation `/Contents` | removed | LEAKED⁵ | LEAKED¹⁰ | removed⁷ |
| `LANG` | `/Catalog /Lang` | removed | LEAKED⁵ | removed¹¹ | removed⁷ |
⁵ Recovered via exiftool -PDF-update:all= revert. ExifTool's incremental-update strip is reversible by design — see Interpretation below.
⁶ Ghostscript pdfwrite copies the Info dict through unchanged; rewrites Producer only.
⁷ mat2 rasterises the page content via Cairo and writes a new single-trailer PDF that drops every original indirect object, including the XMP stream and the annotation list. The category difference (raster vs vector) is what makes the strip total.
⁸ Ghostscript replaces /Producer with its own stamp (GPL Ghostscript X.YZ).
⁹ Ghostscript drops the XMP /Metadata stream during the rewrite.
¹⁰ Ghostscript carries annotations through the rewrite unchanged; /T and /Contents survive intact.
¹¹ Ghostscript drops the catalog-level /Lang during the rewrite.
Comparison summary: 0 sentinels leaked by both PdfStrategy and mat2; 0 leaked by PdfStrategy only; 0 leaked by mat2 only. On this fixture the two tools are equivalent for sentinel survival — but reach the same outcome by different means (PdfStrategy strips structured metadata while preserving page content; mat2 rasterises the page content as a side effect of how it strips). ExifTool leaks all 10 sentinels under -PDF-update:all=; Ghostscript leaks 6.
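The "both / only A / only B" figures in the summary are plain set arithmetic over each tool's leaked sentinels. A minimal sketch, with `compareLeaks` as a hypothetical helper and the leak sets copied from the per-sentinel table above:

```typescript
interface LeakComparison {
  both: string[];
  onlyA: string[];
  onlyB: string[];
}

/** Splits two leaked-sentinel sets into shared and tool-specific leaks. */
function compareLeaks(a: Set<string>, b: Set<string>): LeakComparison {
  return {
    both: [...a].filter((s) => b.has(s)),
    onlyA: [...a].filter((s) => !b.has(s)),
    onlyB: [...b].filter((s) => !a.has(s)),
  };
}

// Leak sets as reported in the per-sentinel table.
const pdfStrategyLeaks = new Set<string>();
const mat2Leaks = new Set<string>();
const ghostscriptLeaks = new Set([
  "TITLE", "AUTHOR", "SUBJECT", "CREATOR", "ANNOT_AUTHOR", "ANNOT_COMMENT",
]);

const vsMat2 = compareLeaks(pdfStrategyLeaks, mat2Leaks);
const vsGhostscript = compareLeaks(pdfStrategyLeaks, ghostscriptLeaks);
console.log(vsMat2, vsGhostscript);
```

An all-empty result for `vsMat2` is what "equivalent for sentinel survival" means here; it says nothing about the structural differences discussed under Interpretation.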
Interpretation
PdfStrategy and mat2 both achieve zero sentinel survival on this fixture, but they get there differently:
- `PdfStrategy` strips structured metadata while keeping page content intact. Direct deletion of Info-dict keys (actually removed, not "set to empty"); `updateMetadata: false` on load (defeats pdf-lib's auto-stamp of Producer/ModDate); both the catalog `/Metadata` reference and the indirect XMP stream object are dropped; pages are walked to scrub annotation `/T` / `/Contents` / `/M` / `/CreationDate`; pdf-lib's single-trailer rewrite means no `/Prev` chain and no incremental update. The original metadata is genuinely gone (not hidden, not pending revert, not buried in an orphan stream), and the output is also the smallest of the four (492 bytes vs ExifTool's 2 249, Ghostscript's 3 502, mat2's 2 708). No Producer fingerprint, no Info `CreationDate`, no `ModDate`: absence is the natural state per the privacy invariant in §6 of `.claude/rules/privacy-invariants.md`.
- mat2 rasterises the page via Poppler/Cairo and emits a new file that drops every original indirect object. This achieves the same sentinel-removal outcome via a structural rewrite rather than targeted deletion. The cost is real: page text becomes a bitmap (copy/paste and search stop working), output is ~5.5× larger than `PdfStrategy`'s, and mat2 stamps the file with its own `Producer = cairo X.Y.Z` and a fresh `/Info CreationDate` set to the current time. Both are tool-of-strip fingerprints we deliberately avoid in `PdfStrategy`'s output. mat2's output also triggers a `qpdf --check` warning ("object 14/0 has unexpected xref entry type") and pdf-lib refuses to load it altogether.
ExifTool's PDF strip is reversible by design. Per its own docs: "PDF — The original metadata is never actually deleted." The test demonstrates this concretely: a one-line exiftool -PDF-update:all= recovers all 10 original sentinels including the entire Info dictionary and the XMP dc:creator and dc:title. ExifTool also leaves the literal string BeginExifToolUpdate in the file as a fingerprint that the file was processed by ExifTool.
Versus mat2 (the FOSS reference): 0 sentinels leaked by both, 0 by PdfStrategy only, 0 by mat2 only — equivalent on raw sentinel survival for the surfaces this fixture exercises. The differences are structural, not in what PII survives: (1) mat2 rasterises page content and PdfStrategy does not; (2) mat2 stamps cairo/current-time fingerprints into the rewrite and PdfStrategy writes neither; (3) mat2's output is ~5.5× larger; (4) mat2's output trips qpdf --check warnings and pdf-lib refuses it. For a privacy-focused tool whose purpose is "strip metadata while preserving the document," PdfStrategy is the right shape; mat2's rasterising approach is a valid privacy stance with different tradeoffs.
Caveats and limits of this test
- mat2 comparison is local-only. The runner invokes mat2 if installed; otherwise the mat2 column reports `(skipped)`. CI doesn't have mat2; the comparison runs as part of the local pre-merge forensic pass per `.claude/rules/format-strategy-workflow.md`. Install: `sudo apt install mat2` on Debian/Ubuntu. Same convention as the Office runner.
- mat2's strip is rasterising. The runner reports this in the `Rasterised page content` row and in footnote ³ of the results table. On a PDF with real text and graphics, mat2's output will not be reusable as a vector document: copy/paste and search stop working. `PdfStrategy` preserves the original page content; if the user's intent is to ship a clean version of the document, `PdfStrategy` is the right choice.
- AcroForm fields, JavaScript actions, embedded files. Neither mat2 nor `PdfStrategy` is currently exercised against these surfaces by this fixture. `PdfStrategy` defers AcroForm and `/Names/EmbeddedFiles` behind opt-in (both can carry legitimate document content). mat2's strip would drop them as a side effect of the Cairo rewrite, at the cost of also dropping any form data or attachments. Documented as a known gap in `docs/gap-analysis/pdf.md`; not yet a sentinel-tested surface here.
- The fixture is synthetic, generated with pdf-lib + low-level dict manipulation. Real-world PDFs from Word, Acrobat, InDesign, etc. have richer XMP profiles (XMP MM history, prismeta, Adobe-specific extensions) that this fixture doesn't exercise. Our strip drops the entire `/Metadata` stream regardless of contents, so the result should still be zero survival, but extending the fixture with a captured real-world XMP is a worthwhile follow-up.
- Page content is empty in the fixture. pdf-lib's `addPage([612, 792])` creates a page with no `/Contents` stream. mat2 still rasterises the (empty) page and emits two image XObjects, which the rasterisation detector picks up. On a PDF with real content, mat2's rasterisation is more obviously destructive (text → bitmap); the structural detection is unchanged.
- The 10 sentinels cover the categories from the `pdf.md` gap analysis. We did not test page-level `/Metadata` streams (covered by the strategy code but not by this fixture) or `/Thumb` thumbnails. Both are dropped by the same code paths but not exercised by sentinel here.
- The `-PDF-update:all=` revert was tried against all five files (input, our-stripped, exiftool-stripped, gs-stripped, mat2-cleaned). It only succeeded on exiftool-stripped, which is expected: the others have no ExifTool update layer to revert, and ExifTool reports `File contains no previous ExifTool update`.
- The runner is informational, not strict (unlike the Office runner, which exits non-zero on any sentinel survival). Wiring `tools/forensic/pdf.ts` into a CI test or release-gate that fails if any sentinel survives is a natural follow-up.
- The producer-fingerprint sweep is best-effort. PDF allows balanced unescaped parens inside string literals; our regex captures up to the first `)`, so a Producer like `cairo 1.18.0 (https://cairographics.org)` is reported as `cairo 1.18.0 (https://cairographics.org` (without the closing paren). The prefix is unambiguous; the cosmetic loss doesn't change the fingerprint identification.
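If the truncated closing paren in the last caveat ever matters, one possible refinement is to read the Producer as a PDF literal string, tracking paren depth and skipping backslash escapes. This is a sketch of that alternative, not what the runner currently does; `readLiteralString` and `extractProducer` are hypothetical names, and the returned body is left raw (escape sequences are not decoded).

```typescript
/** Reads a PDF literal string whose "(" is at `start`; returns the raw body, or null if unbalanced. */
function readLiteralString(text: string, start: number): string | null {
  if (text[start] !== "(") return null;
  let depth = 0;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (ch === "\\") {
      i++; // skip the escaped character, e.g. "\(" or "\)"
      continue;
    }
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth === 0) return text.slice(start + 1, i);
    }
  }
  return null; // ran off the end: truncated or unbalanced string
}

/** Returns the first /Producer literal string in `text`, with balanced parens intact. */
function extractProducer(text: string): string | null {
  const m = /\/Producer\s*\(/.exec(text);
  if (!m) return null;
  return readLiteralString(text, m.index + m[0].length - 1);
}

const demoInfo = "<< /Producer (cairo 1.18.0 (https://cairographics.org)) /CreationDate (D:20260516120000Z) >>";
const demoProducer = extractProducer(demoInfo);
console.log(demoProducer);
```

The regex-only approach stays simpler and is enough for identification; the depth counter only buys back the cosmetic closing paren.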
Reproducing
# From the project root
npx tsx tools/forensic/pdf.ts
Outputs go to /tmp/pdf-forensic/:
- `input.pdf`: the rich fixture
- `our-stripped.pdf`: `PdfStrategy` output
- `exiftool-stripped.pdf`: `exiftool -all=` output
- `gs-stripped.pdf`: Ghostscript pdfwrite output
- `_mat2-pdf.cleaned.pdf`: mat2 output (only when mat2 is installed)
- `report.json`: structured per-output sentinel-survival data, including the new `rasterised` / `producerFingerprints` / `creationDate` / `modDate` fields
- `*-revert.pdf`, `*-qdf.pdf`: intermediate files from the recovery battery
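As a rough illustration of how `report.json` could feed a strict gate, the sketch below checks one output's entry and lists anything that should fail a build. The report shape is an assumption: only the four field names `rasterised`, `producerFingerprints`, `creationDate`, and `modDate` come from this writeup, while their types, the `survivorsByChannel` field, the per-output keying, and `gateFailures` itself are hypothetical.

```typescript
// Assumed report shape; see the hedges in the paragraph above.
interface OutputReport {
  rasterised: boolean;
  producerFingerprints: string[];
  creationDate: string | null;
  modDate: string | null;
  survivorsByChannel: Record<string, string[]>; // hypothetical field
}
type ForensicReport = Record<string, OutputReport>;

/** Lists the reasons the named output would fail a strict zero-survival gate. */
function gateFailures(report: ForensicReport, outputName: string): string[] {
  const out = report[outputName];
  if (!out) return [`no report entry for ${outputName}`];
  const failures: string[] = [];
  for (const [channel, names] of Object.entries(out.survivorsByChannel)) {
    if (names.length > 0) failures.push(`${channel}: ${names.join(", ")}`);
  }
  if (out.producerFingerprints.length > 0) {
    failures.push(`producer fingerprint: ${out.producerFingerprints.join(", ")}`);
  }
  return failures;
}

// Hand-written example data, not real runner output.
const demoReport: ForensicReport = {
  "our-stripped.pdf": {
    rasterised: false,
    producerFingerprints: [],
    creationDate: null,
    modDate: null,
    survivorsByChannel: { strings: [], qdf: [] },
  },
  "gs-stripped.pdf": {
    rasterised: false,
    producerFingerprints: ["GPL Ghostscript 10.02.1"],
    creationDate: "current time",
    modDate: "current time",
    survivorsByChannel: { strings: ["TITLE", "AUTHOR"], qdf: [] },
  },
};
const ourFailures = gateFailures(demoReport, "our-stripped.pdf");
const gsFailures = gateFailures(demoReport, "gs-stripped.pdf");
console.log(ourFailures, gsFailures);
```

A CI wrapper could then set `process.exitCode = 1` whenever the list for `our-stripped.pdf` is non-empty, leaving the reference tools informational.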
Required tools: exiftool (libimage-exiftool-perl), qpdf, gs (Ghostscript), strings (binutils). All available on Debian/Ubuntu via apt.
Optional: mat2 — comparative reference column. Skipped cleanly if absent.
Debian/Ubuntu one-liner: sudo apt install libimage-exiftool-perl qpdf ghostscript binutils mat2.
What this directory is for
docs/forensic/ documents adversarial recovery tests run after implementation lands, complementing docs/gap-analysis/ (which runs before implementation to scope what should be removed). The pattern: implement → unit-test correctness → forensic-test unrecoverability → document the result.
Each format gets its own writeup as we go: pdf.md here, jpeg.md next time we run the same battery on JPEG fixtures with embedded EXIF/XMP/IPTC, etc. The runner scripts at tools/forensic/<format>.ts stay in the repo so the tests can be re-run any time the strategy changes.