* feat(jpeg,pdf): route through WasmProcessor in Electron, honor preserveOrientation
Closes the JPEG orientation gap that previously blocked migrating .jpg/.jpeg
off the ExifTool path. The walker now parses IFD0 inside APP1 EXIF, finds
tag 0x0112 (Orientation, SHORT, count 1), and, when preserveOrientation
is set, emits a synthesized minimal APP1 carrying ONLY that tag. Make,
Model, GPS, SubIFDs, MakerNotes, and XMP do not survive. Both byte
orders (II / MM) are supported. Forensic check on a JPEG with
Make+Orientation+GPS confirms only Orientation remains in output.
With orientation handled, .jpg, .jpeg, and .pdf are added to
WASM_HANDLED_EXTENSIONS so Electron uses the hand-rolled walkers
instead of ExifTool. The web build was already routing them through
WASM.
ExifTool remains the path for the long tail (TIFF, HEIC/HEIF, RAW,
AVIF, BMP, GIF, SVG, WebP, MKV, AVI, WMV) until each gets its own
strategy.
Tests: 8 new JPEG orientation tests (BE/LE, drop/keep cases, XMP,
no-Orientation, garbage values), updated renderer routing tests,
sync-guard intact. Full suite: 378/378.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(processViaWasm): populate before/after metadata via ExifTool in Electron
The WASM strip path was unconditionally setting beforeMetadata and
afterMetadata to null, which gates the inspection panel
(MetadataExpansion only renders when both are non-null per FileRow.tsx).
Once .jpg/.pdf started routing through WasmProcessor in Electron, the
inspect-metadata UI silently regressed for those formats — caught by
the metadata-inspection.spec.ts e2e suite in CI.
Fix: when ExifTool is available (Electron build), read the original
file's metadata before the strip and the cleaned file's metadata after,
both via ExifTool. The strip itself still happens in the hand-rolled
WASM strategy. The web build skips this since ExifTool is unavailable.
This is a clean split of read/write responsibility — ExifTool's read
coverage is solid; only its write path had the gaps that motivated
the WASM walkers.
Updated the .docx routing test to reflect the new contract: removeMetadata
is not called (strip is WASM), but readMetadata is called twice (before
and after the strip, for the inspection UI).
* feat(forensic): JPEG forensic battery confirms zero sentinel survival
Phase 3 verification per .claude/rules/format-strategy-workflow.md.
Adds tools/forensic/jpeg.ts (the runner) and docs/forensic/jpeg.md
(the writeup) for the JpegStrategy changes in this PR.
The runner injects 10 sentinels via exiftool across every metadata-bearing
JPEG segment (APP1 EXIF: Make/Model/Software/Artist/Copyright/UserComment;
APP1 XMP: dc:creator/dc:title; APP13 IPTC: By-line; COM marker), plus
binary Orientation=6, then strips the fixture three ways:
1. JpegStrategy default
2. JpegStrategy with preserveOrientation: true
3. exiftool -all=
Recovery battery on each output: raw `strings`, `exiftool -a -G1 -s`,
and an in-process marker walker that scans every APP*/COM payload as
latin-1.
Findings:
- JpegStrategy default and exiftool -all= are byte-equivalent on
this fixture (both 15 bytes, zero segments remaining, zero sentinel
survivors via every channel).
- JpegStrategy with preserveOrientation=true: 51 bytes, single 34-byte
APP1 carrying ONLY the Orientation tag. Cross-check confirms
Orientation=6 round-trips exactly; zero sentinels recoverable; no
GPS/Make/MakerNote/XMP leak through the synthesized APP1.
- Confirms the preserveOrientation gap-analysis prediction (no leakage
from the synthesized minimal APP1) empirically.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JPEG metadata-stripping gap analysis
Date: 2026-05-06 (retrofitted from the architecture decisions that drove the Phase 1 implementation)
Goal: Document the gap between the original piexifjs-based JPEG strategy and ExifTool's -all= JPEG strip, the scope of WASM library alternatives that were ruled out, and the rationale for the hand-rolled segment walker that ships in Phase 1.
Methodology
Read:
- `piexifjs` source (the `remove()` function specifically): confirmed it operates on APP1 (EXIF) only.
- ExifTool documentation at https://exiftool.org/#limitations for the JPEG segments removed by `-all=`.
- ITU-T T.81 (JPEG specification) §B.1 for marker assignments.
- The previous `image_strategy.ts` (the piexifjs wrapper) and the output it produced on real fixtures.
Verified empirically (in docs/poc/little-exif-wasm.md and docs/poc/exiv2-wasm.md):
- `piexifjs` leaves the JPEG Comment marker (0xFFFE) intact even when the user-set Comment is the most user-visible PII source.
- `piexifjs` leaves JFIF/APP0 intact (resolution + units; usually inert but still metadata).
- The previous implementation also had a critical correctness bug: the `TextDecoder("latin1")` round-trip silently corrupted bytes `0x80`–`0x9F` because WHATWG aliases `latin1` to `windows-1252` (where those values are not 1:1).
- `little_exif` (Rust → WASM): ~330 KB raw / 111 KB gzip. Left Comment + JFIF + PNG text chunks untouched, errored on TIFF.
- `exiv2-wasm`: ~2.3 MB raw / 925 KB gzip. The published API has no `erase` primitive; `writeString(buf, key, "")` sets values empty but leaves tag IDs in place.
Both library options ruled out for JPEG (and other image formats); see the POC writeups.
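The `latin1` corruption noted above comes from pushing raw bytes through a text codec. As a minimal sketch, assuming the old wrapper needed a "binary string" (one code unit per byte) to hand to piexifjs, a byte-exact conversion avoids the codec entirely. The helper names below are hypothetical and do not come from the codebase.

```typescript
// Hypothetical helpers (not from the codebase): byte-exact conversion between
// a Uint8Array and a "binary string" holding one UTF-16 code unit per byte.
// No text codec is involved, so 0x80-0x9F are never remapped the way the
// windows-1252 decoding behind the "latin1" label remaps them.
function bytesToBinaryString(bytes: Uint8Array): string {
  const parts: string[] = [];
  for (let i = 0; i < bytes.length; i++) {
    parts.push(String.fromCharCode(bytes[i]));
  }
  return parts.join("");
}

function binaryStringToBytes(s: string): Uint8Array {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    // Anything above 0xFF cannot have come from a single byte.
    if (code > 0xff) throw new RangeError(`non-byte code unit at index ${i}`);
    out[i] = code;
  }
  return out;
}
```

Round-tripping every value from `0x00` to `0xFF` through these two functions returns the input unchanged, which is the property the `TextDecoder("latin1")` path lacked.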
Per-segment policy
JPEG marker structure: each segment is 0xFF <code> <length-2-bytes-big-endian> <payload>, except for standalone markers without a length field (SOI, EOI, RST0–RST7, TEM). After SOS, an entropy-coded scan stream extends until the next non-stuffed, non-restart marker. T.81 §B.1.1.2 also permits any number of 0xFF "fill bytes" before a marker code.
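The segment layout described above can be sketched as a single read step. This is an illustration under stated assumptions, not the code in `jpeg_strategy.ts`: the `Segment` shape and the function names are hypothetical, and errors are reported as `null` rather than through the project's `Result` type. Per T.81, the two-byte length counts itself but not the marker.

```typescript
// Hypothetical shape for one parsed segment (not the project's actual type).
interface Segment {
  code: number;         // marker code, e.g. 0xE1 for APP1
  start: number;        // offset of the 0xFF immediately before the code
  payloadStart: number; // offset of the first payload byte
  end: number;          // offset one past the last byte of the segment
}

// Standalone markers have no length field: TEM (0x01), RST0-RST7, SOI, EOI.
function isStandalone(code: number): boolean {
  return code === 0x01 || (code >= 0xd0 && code <= 0xd9);
}

// Reads the segment that begins at `pos`, tolerating 0xFF fill bytes.
// Returns null on any structural problem.
function readSegment(buf: Uint8Array, pos: number): Segment | null {
  if (pos >= buf.length || buf[pos] !== 0xff) return null;
  let p = pos;
  while (p < buf.length && buf[p] === 0xff) p++; // skip the 0xFF run
  if (p >= buf.length) return null;
  const code = buf[p];
  if (code === 0x00) return null; // 0xFF 0x00 is a stuffed byte, not a marker
  const start = p - 1;
  if (isStandalone(code)) {
    return { code, start, payloadStart: p + 1, end: p + 1 };
  }
  if (p + 2 >= buf.length) return null; // length field is truncated
  const length = (buf[p + 1] << 8) | buf[p + 2]; // big-endian, counts itself
  const end = p + 1 + length;
  if (length < 2 || end > buf.length) return null;
  return { code, start, payloadStart: p + 3, end };
}
```

A caller would loop `pos = segment.end` until it reaches SOS, at which point the entropy-coded data needs separate handling.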
| Marker | Code | Source of leak | piexifjs (before) | ExifTool -all= | Phase 1 walker |
|---|---|---|---|---|---|
| SOI | FFD8 | n/a | keep | keep | keep |
| JFIF / APP0 | FFE0 | density, JFIF version, optional thumbnail | leaves intact | drops | drops |
| EXIF / APP1 | FFE1 | EXIF IFD, GPS, MakerNotes, XMP | drops | drops | drops |
| ICC / APP2 | FFE2 | colour profile (cmmId, creator, dateTime, desc strings) | leaves | drops | drops by default; kept when preserveColorProfile: true |
| APP3..APP12 | FFE3..FFEC | various app-specific (Photoshop, Flashpix, MakerNotes, …) | leaves | drops | drops |
| Photoshop / IPTC / APP13 | FFED | 8BIM, IPTC, Photoshop image resources | leaves | drops | drops |
| Adobe / APP14 | FFEE | Adobe DCT encoding signal | leaves | leaves | keeps; required for correct decoding of some Adobe-encoded JPEGs |
| APP15 | FFEF | rare | leaves | drops | drops |
| Comment | FFFE | arbitrary string (filenames, review notes, user comments) | leaves intact | drops | drops |
| DQT, DHT, SOF, SOS, RST, EOI, DRI, DAC | various | image data | keep | keep | keep |
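The "Phase 1 walker" column of the table above reduces to a small decision function. The sketch below keys on the marker code alone, as the table does; the names `StripOptions` and `segmentAction` are hypothetical, and the `preserveOrientation` replacement of APP1 is deliberately left out because it rewrites a segment rather than keeping or dropping it.

```typescript
// Hypothetical option bag; only the flag that affects keep/drop is modelled.
interface StripOptions {
  preserveColorProfile?: boolean;
}

type Action = "keep" | "drop";

// Maps a marker code (the byte after 0xFF) to the walker policy in the table.
function segmentAction(code: number, opts: StripOptions = {}): Action {
  if (code === 0xee) return "keep"; // APP14 Adobe: affects decoding
  if (code === 0xe2) {
    // APP2 ICC: dropped unless the caller opts in.
    return opts.preserveColorProfile ? "keep" : "drop";
  }
  if (code >= 0xe0 && code <= 0xef) return "drop"; // every other APPn
  if (code === 0xfe) return "drop"; // COM
  // SOI, DQT, DHT, SOF, SOS, RST, EOI, DRI, DAC and anything else structural.
  return "keep";
}
```

Keeping unknown non-APP markers by default is a conservative choice in this sketch: dropping a structural segment would corrupt the image, whereas every metadata-bearing class in the table is matched explicitly.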
Entropy-coded data after SOS is copied byte-for-byte. 0xFF 0x00 byte-stuffing and 0xFF D0..D7 restart markers within the stream are preserved (they're part of the entropy data, not metadata).
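Finding where the entropy-coded data ends is the one place the walker cannot rely on a length field. A minimal sketch of that scan, assuming a hypothetical `findScanEnd` helper that returns the offset at which marker parsing should resume:

```typescript
// Scans entropy-coded data starting at `pos` and returns the offset of the
// 0xFF that begins the next real marker, or buf.length if none is found.
// Bytes in [pos, result) are meant to be copied to the output unchanged.
function findScanEnd(buf: Uint8Array, pos: number): number {
  let i = pos;
  while (i + 1 < buf.length) {
    if (buf[i] !== 0xff) {
      i++;
      continue;
    }
    const next = buf[i + 1];
    if (next === 0x00 || (next >= 0xd0 && next <= 0xd7)) {
      i += 2; // stuffed byte or RST0-RST7: still part of the scan
      continue;
    }
    if (next === 0xff) {
      i++; // fill byte: look again from the following 0xFF
      continue;
    }
    return i; // 0xFF followed by a genuine marker code
  }
  return buf.length; // ran off the end without finding a marker
}
```

In this sketch a return value of `buf.length` means no terminating marker was found, which a caller would treat as the truncation case rather than as a clean end of file.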
Honest gap summary
piexifjs vs ExifTool: piexifjs covered roughly 10–15% of what -all= removes. The Comment marker survival was the most user-visible privacy gap. JFIF/APP0 is rarely meaningful, but it's still data the user expects to be stripped.
ExifTool -all= vs theoretical: essentially equivalent on a single-pass strip. ExifTool has been battle-tested against 20+ years of edge-case fixtures; a hand-rolled walker is exposed to whatever subset we test against.
Phase 1 walker vs ExifTool -all=: the policy table is identical for the marker classes covered. Differences are at the edges:
- Fill bytes between markers (T.81 §B.1.1.2) — Phase 1 handles by skipping fill-byte runs at the top of each iteration.
- Hierarchical / multi-frame JPEGs (rare) — Phase 1 handles single hierarchy via the SOS-then-entropy cycle re-entering for subsequent SOS markers.
- Granular tag-level operations (e.g. `-EXIF:Orientation=keep`): out of scope for the walker; planned Phase 2 with a TIFF parser inside APP1.
Recommendation
Hand-rolled segment walker. Reasoning:
- JPEG marker structure is fully specified and well documented, so the walker fits in ~150 lines of clean TypeScript.
- WASM library options were both ruled out by the POCs.
- The walker has zero production dependencies and ships ~111 KB less than `little_exif` would have cost.
- We control the marker policy directly; there are no library defaults to fight.
Phase 1 implementation
Lives at src/infrastructure/wasm/strategies/jpeg_strategy.ts. Key invariants:
- Marker policy: as the table above. Mirrors ExifTool's `-all=` behaviour with three deliberate exceptions: APP14 always kept (decoder-affecting), APP2 kept on opt-in via `preserveColorProfile`, APP1 EXIF replaced with a minimal Orientation-only APP1 on opt-in via `preserveOrientation`.
- Fill-byte tolerance: any number of consecutive `0xFF` bytes before a marker code is permitted.
- Truncation behaviour: missing EOI is a structural error and surfaces via `Result<_, ExifError>`. The walker does not silently return malformed JPEGs.
- `metadataRemoved`: counts dropped APP/COM segments. A clean input that needed no changes returns `0`, not `1`; callers must not treat `0` as a failure signal.
- `preserveOrientation`: honored. When set, the walker parses IFD0 inside the original APP1 EXIF, extracts the Orientation tag (0x0112, SHORT, count 1), and emits a synthesized minimal APP1 carrying ONLY that tag. No Make/Model, no GPS, no SubIFDs, no MakerNotes survive. Both byte orders (II/MM) are supported. APP1 XMP is still dropped (XMP carries no orientation).
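The `preserveOrientation` behaviour can be sketched in two steps: read tag 0x0112 from IFD0, then build a fresh APP1 that holds nothing else. The function names are hypothetical, and two details are assumptions of this sketch rather than facts about the implementation: the synthesized segment is always written big-endian (MM), and values outside 1–8 are treated as absent. With one IFD entry the segment length field comes to 34 (36 bytes including the marker).

```typescript
// Reads Orientation (tag 0x0112, SHORT, count 1) from IFD0 of an APP1 EXIF
// payload, i.e. the bytes after the segment length field.
// Returns null when the tag is absent or the structure is malformed.
function readOrientation(payload: Uint8Array): number | null {
  const EXIF_ID = [0x45, 0x78, 0x69, 0x66, 0x00, 0x00]; // "Exif\0\0"
  if (payload.length < 14) return null; // identifier + 8-byte TIFF header
  for (let i = 0; i < 6; i++) {
    if (payload[i] !== EXIF_ID[i]) return null;
  }
  const tiff = new DataView(payload.buffer, payload.byteOffset + 6, payload.length - 6);
  const order = tiff.getUint16(0, false);
  if (order !== 0x4949 && order !== 0x4d4d) return null; // "II" or "MM"
  const le = order === 0x4949;
  if (tiff.getUint16(2, le) !== 42) return null; // TIFF magic number
  const ifd0 = tiff.getUint32(4, le);
  if (ifd0 + 2 > tiff.byteLength) return null;
  const entries = tiff.getUint16(ifd0, le);
  for (let n = 0; n < entries; n++) {
    const entry = ifd0 + 2 + n * 12; // each IFD entry is 12 bytes
    if (entry + 12 > tiff.byteLength) return null;
    if (tiff.getUint16(entry, le) !== 0x0112) continue;
    const type = tiff.getUint16(entry + 2, le);
    const count = tiff.getUint32(entry + 4, le);
    if (type !== 3 || count !== 1) return null; // must be SHORT, count 1
    const value = tiff.getUint16(entry + 8, le);
    // Assumption: out-of-range values are treated as "no orientation".
    return value >= 1 && value <= 8 ? value : null;
  }
  return null;
}

// Builds a complete APP1 segment (marker included) carrying only Orientation.
// Assumption: always big-endian, regardless of the source byte order.
function buildOrientationApp1(orientation: number): Uint8Array {
  const seg = new Uint8Array(36); // zero-filled, so padding stays 0
  const view = new DataView(seg.buffer);
  view.setUint16(0, 0xffe1); // APP1 marker
  view.setUint16(2, 34); // segment length, counts itself
  seg.set([0x45, 0x78, 0x69, 0x66, 0x00, 0x00], 4); // "Exif\0\0"
  view.setUint16(10, 0x4d4d); // "MM"
  view.setUint16(12, 42); // TIFF magic number
  view.setUint32(14, 8); // IFD0 offset from the TIFF header
  view.setUint16(18, 1); // one IFD entry
  view.setUint16(20, 0x0112); // tag: Orientation
  view.setUint16(22, 3); // type: SHORT
  view.setUint32(24, 1); // count
  view.setUint16(28, orientation); // value, left-justified in 4 bytes
  // Bytes 32-35 are the next-IFD offset and stay 0 (no IFD1).
  return seg;
}
```

Because the output is rebuilt from a single integer rather than copied from the source APP1, nothing else in the original EXIF block has a path into the cleaned file.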
Compatibility note: APP0/JFIF removal
ExifTool's -all= drops APP0/JFIF and we follow that policy. Modern decoders (browsers, libjpeg/libjpeg-turbo, ImageMagick, Skia) don't require APP0 — they read sample dimensions from SOF and treat absence of APP0 as "no JFIF metadata." Some legacy strict-JFIF pipelines (older scanner pipelines, certain embedded image libraries) do require APP0 and may reject the cleaned output. If that becomes a real-world support issue, the cheap mitigation is to synthesize a minimal 18-byte APP0 (JFIF\0 identifier + version + units + density + zero thumbnail), which carries no PII. Not implemented in Phase 1; tracked under deferred items.
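For reference, the 18-byte mitigation described above would look roughly like the following. This is a sketch only, since the feature is not implemented: the function name is hypothetical, and the specific field values (version 1.01, units 0, density 1:1) are assumptions chosen to be neutral, not values taken from the project.

```typescript
// Builds a minimal JFIF APP0 segment: 2 marker bytes plus a 16-byte body.
// All field values are fixed constants, so nothing from the input leaks.
function buildMinimalApp0(): Uint8Array {
  return new Uint8Array([
    0xff, 0xe0,                   // APP0 marker
    0x00, 0x10,                   // segment length = 16, counts itself
    0x4a, 0x46, 0x49, 0x46, 0x00, // "JFIF\0" identifier
    0x01, 0x01,                   // version 1.01 (assumed)
    0x00,                         // units: 0 = no units, aspect ratio only
    0x00, 0x01,                   // X density = 1
    0x00, 0x01,                   // Y density = 1
    0x00, 0x00,                   // thumbnail width and height = 0
  ]);
}
```

The segment would be written immediately after SOI, which is where JFIF expects APP0 to appear.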
Privacy note: ICC profile preservation
preserveColorProfile: true keeps the APP2 ICC profile segment in the output. ICC profiles include cmmId, profile creator, dateTime, and description strings — a small but real fingerprint surface. Callers who need accurate colour reproduction should accept this trade-off explicitly; the default of false errs toward privacy.
Deferred to Phase 2 (if needed)
- Comparison-corpus test against `exiftool -all=` on a diverse fixture set (Canon, Nikon, iPhone-via-Photos, Photoshop, GIMP) to expose any vendor-specific surprises.
- Granular ICC scrubbing: write back the ICC profile with the identity-revealing fields zeroed instead of all-or-nothing.
- Sub-error-codes so callers can distinguish "not a JPEG" from "truncated JPEG" from "valid JPEG processed cleanly with zero metadata to remove."
- Synthesized minimal APP0 for strict-JFIF decoder compatibility (only if real-world support reports surface).