exifcleaner-web/docs/gap-analysis/jpeg.md
obuvuyoviz26-lab 68b29c9e4b
feat(jpeg,pdf): route through WASM in Electron, honor preserveOrientation (#56)
* feat(jpeg,pdf): route through WasmProcessor in Electron, honor preserveOrientation

Closes the JPEG orientation gap that previously blocked migrating .jpg/.jpeg
off the ExifTool path. The walker now parses IFD0 inside APP1 EXIF, finds
tag 0x0112 (Orientation, SHORT, count 1), and, when preserveOrientation
is set, emits a synthesized minimal APP1 carrying ONLY that tag. None of
Make, Model, GPS, SubIFDs, MakerNotes, or XMP survives. Both byte
orders (II / MM) are supported. Forensic check on a JPEG with
Make+Orientation+GPS confirms only Orientation remains in output.

With orientation handled, .jpg, .jpeg, and .pdf are added to
WASM_HANDLED_EXTENSIONS so Electron uses the hand-rolled walkers
instead of ExifTool. The web build was already routing them through
WASM.

ExifTool remains the path for the long tail (TIFF, HEIC/HEIF, RAW,
AVIF, BMP, GIF, SVG, WebP, MKV, AVI, WMV) until each gets its own
strategy.

Tests: 8 new JPEG orientation tests (BE/LE, drop/keep cases, XMP,
no-Orientation, garbage values), updated renderer routing tests,
sync-guard intact. Full suite: 378/378.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(processViaWasm): populate before/after metadata via ExifTool in Electron

The WASM strip path was unconditionally setting beforeMetadata and
afterMetadata to null, which gates the inspection panel
(MetadataExpansion only renders when both are non-null per FileRow.tsx).
Once .jpg/.pdf started routing through WasmProcessor in Electron, the
inspect-metadata UI silently regressed for those formats — caught by
the metadata-inspection.spec.ts e2e suite in CI.

Fix: when ExifTool is available (Electron build), read the original
file's metadata before the strip and the cleaned file's metadata after,
both via ExifTool. The strip itself still happens in the hand-rolled
WASM strategy. The web build skips this since ExifTool is unavailable.

This is a clean split of read/write responsibility — ExifTool's read
coverage is solid; only its write path had the gaps that motivated
the WASM walkers.

Updated the .docx routing test to reflect the new contract: removeMetadata
is not called (strip is WASM), but readMetadata is called twice (before
and after the strip, for the inspection UI).

* feat(forensic): JPEG forensic battery confirms zero sentinel survival

Phase 3 verification per .claude/rules/format-strategy-workflow.md.
Adds tools/forensic/jpeg.ts (the runner) and docs/forensic/jpeg.md
(the writeup) for the JpegStrategy changes in this PR.

The runner injects 10 sentinels via exiftool across every metadata-bearing
JPEG segment (APP1 EXIF: Make/Model/Software/Artist/Copyright/UserComment;
APP1 XMP: dc:creator/dc:title; APP13 IPTC: By-line; COM marker), plus
binary Orientation=6, then strips the fixture three ways:

  1. JpegStrategy default
  2. JpegStrategy with preserveOrientation: true
  3. exiftool -all=

Recovery battery on each output: raw `strings`, `exiftool -a -G1 -s`,
and an in-process marker walker that scans every APP*/COM payload as
latin-1.

Findings:

  - JpegStrategy default and exiftool -all= are byte-equivalent on
    this fixture (both 15 bytes, zero segments remaining, zero sentinel
    survivors via every channel).
  - JpegStrategy with preserveOrientation=true: 51 bytes, single 34-byte
    APP1 carrying ONLY the Orientation tag. Cross-check confirms
    Orientation=6 round-trips exactly; zero sentinels recoverable; no
    GPS/Make/MakerNote/XMP leak through the synthesized APP1.
  - Confirms the preserveOrientation gap-analysis prediction (no leakage
    from the synthesized minimal APP1) empirically.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 20:05:21 +04:00


JPEG metadata-stripping gap analysis

Date: 2026-05-06 (retrofitted from the architecture decisions that drove the Phase 1 implementation)

Goal: Document the gap between the original piexifjs-based JPEG strategy and ExifTool's -all= JPEG strip, the scope of WASM library alternatives that were ruled out, and the rationale for the hand-rolled segment walker that ships in Phase 1.

Methodology

Read:

  • piexifjs source (the remove() function specifically) — confirmed it operates on APP1 (EXIF) only.
  • ExifTool documentation at https://exiftool.org/#limitations for the JPEG segments removed by -all=.
  • ITU-T T.81 (JPEG specification) §B.1 for marker assignments.
  • The previous image_strategy.ts (the piexifjs wrapper) and the output it produced on real fixtures.

Verified empirically (in docs/poc/little-exif-wasm.md and docs/poc/exiv2-wasm.md):

  • piexifjs leaves the JPEG Comment marker (0xFFFE) intact, even though a user-set Comment is the most user-visible PII source.
  • piexifjs leaves JFIF/APP0 intact (resolution + units; usually inert but still metadata).
  • The previous implementation also had a critical correctness bug: the TextDecoder("latin1") round-trip silently corrupted bytes 0x80–0x9F because WHATWG aliases latin1 to windows-1252 (where those values are not 1:1).
  • little_exif (Rust → WASM): ~330 KB raw / 111 KB gzip. Left Comment + JFIF + PNG text chunks untouched, errored on TIFF.
  • exiv2-wasm: ~2.3 MB raw / 925 KB gzip. The published API has no erase primitive — writeString(buf, key, "") sets values empty but leaves tag IDs in place.

Both library options ruled out for JPEG (and other image formats); see the POC writeups.
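
The latin1 corruption listed above can be avoided by never routing bytes through TextDecoder. A minimal sketch of a byte-preserving round-trip, assuming a binary string is needed at all; the helper names are hypothetical and are not taken from image_strategy.ts:

```typescript
// Hypothetical helpers for illustration; not the project's code.
// Each byte maps to the code unit with the same numeric value (true
// ISO-8859-1), so 0x80..0x9F survive. TextDecoder("latin1") does not
// behave this way: the WHATWG Encoding Standard resolves that label to
// windows-1252, which remaps most of that range.
function bytesToBinaryString(bytes: Uint8Array): string {
  let s = "";
  for (let i = 0; i < bytes.length; i++) {
    s += String.fromCharCode(bytes[i]);
  }
  return s;
}

function binaryStringToBytes(s: string): Uint8Array {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    // Refuse anything that is not a single byte instead of truncating it.
    if (code > 0xff) throw new RangeError(`non-byte code unit at index ${i}`);
    out[i] = code;
  }
  return out;
}
```

Working on Uint8Array directly, as a segment walker can, sidesteps the issue entirely; the helpers only matter where a string-based API is unavoidable.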

Per-segment policy

JPEG marker structure: each segment is 0xFF <code> <length-2-bytes-big-endian> <payload>, except for standalone markers without a length field (SOI, EOI, RST0–RST7, TEM). After SOS, an entropy-coded scan stream extends until the next non-stuffed, non-restart marker. T.81 §B.1.1.2 also permits any number of 0xFF "fill bytes" before a marker code.

| Marker | Code | Source of leak | piexifjs (before) | ExifTool -all= | Phase 1 walker |
|---|---|---|---|---|---|
| SOI | FFD8 | n/a | keep | keep | keep |
| JFIF / APP0 | FFE0 | density, JFIF version, optional thumbnail | leaves intact | drops | drops |
| EXIF / APP1 | FFE1 | EXIF IFD, GPS, MakerNotes, XMP | drops | drops | drops |
| ICC / APP2 | FFE2 | colour profile (cmmId, creator, dateTime, desc strings) | leaves | drops | drops by default; kept when preserveColorProfile: true |
| APP3..APP12 | FFE3..FFEC | various app-specific (Photoshop, Flashpix, MakerNotes, …) | leaves | drops | drops |
| Photoshop / IPTC / APP13 | FFED | 8BIM, IPTC, Photoshop image resources | leaves | drops | drops |
| Adobe / APP14 | FFEE | Adobe DCT encoding signal | leaves | leaves | keeps; required for correct decoding of some Adobe-encoded JPEGs |
| APP15 | FFEF | rare | leaves | drops | drops |
| Comment | FFFE | arbitrary string (filenames, review notes, user comments) | leaves intact | drops | drops |
| DQT, DHT, SOF, SOS, RST, EOI, DRI, DAC | various | image data | keep | keep | keep |

Entropy-coded data after SOS is copied byte-for-byte. 0xFF 0x00 byte-stuffing and 0xFF D0..D7 restart markers within the stream are preserved (they're part of the entropy data, not metadata).
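
The policy table and the entropy-copy rule can be sketched as a single loop. This is an illustration under stated assumptions, not the code in jpeg_strategy.ts: the names stripJpegSegments, StripOptions and StripResult are hypothetical, errors are thrown rather than returned as a Result, and preserveOrientation is omitted.

```typescript
// Hypothetical sketch of a segment walker; names are not the project's.
interface StripOptions { preserveColorProfile?: boolean }
interface StripResult { bytes: Uint8Array; metadataRemoved: number }

function stripJpegSegments(input: Uint8Array, opts: StripOptions = {}): StripResult {
  if (input.length < 4 || input[0] !== 0xff || input[1] !== 0xd8) {
    throw new Error("not a JPEG: missing SOI");
  }
  const out: number[] = [0xff, 0xd8];
  let removed = 0;
  let i = 2;
  while (i < input.length) {
    if (input[i] !== 0xff) throw new Error(`expected marker at offset ${i}`);
    // Skip the 0xFF prefix plus any fill bytes (T.81 B.1.1.2).
    while (i < input.length && input[i] === 0xff) i++;
    if (i >= input.length) break;
    const code = input[i++];
    if (code === 0xd9) {
      // EOI: emit it and finish.
      out.push(0xff, 0xd9);
      return { bytes: Uint8Array.from(out), metadataRemoved: removed };
    }
    if (code === 0x01 || (code >= 0xd0 && code <= 0xd7)) {
      // TEM and RSTn are standalone markers with no length field.
      out.push(0xff, code);
      continue;
    }
    if (i + 2 > input.length) throw new Error("truncated segment length");
    const len = (input[i] << 8) | input[i + 1]; // big-endian, includes itself
    if (len < 2 || i + len > input.length) throw new Error("truncated segment");
    const end = i + len;
    const isApp = code >= 0xe0 && code <= 0xef;
    const keepIcc = code === 0xe2 && opts.preserveColorProfile === true;
    // Drop COM and every APPn except APP14 (and APP2 on opt-in).
    const drop = code === 0xfe || (isApp && code !== 0xee && !keepIcc);
    if (drop) {
      removed++;
    } else {
      out.push(0xff, code);
      for (let k = i; k < end; k++) out.push(input[k]);
    }
    i = end;
    if (code === 0xda) {
      // After SOS, copy entropy-coded bytes until a real marker appears.
      // 0xFF00 (stuffing), 0xFFD0..D7 (restart) and 0xFFFF stay in the stream.
      while (i < input.length) {
        if (input[i] === 0xff && i + 1 < input.length) {
          const next = input[i + 1];
          const inStream = next === 0x00 || next === 0xff || (next >= 0xd0 && next <= 0xd7);
          if (!inStream) break;
        }
        out.push(input[i++]);
      }
    }
  }
  throw new Error("truncated JPEG: missing EOI");
}
```

Because the loop re-enters the marker branch after each scan, a file with several SOS markers is handled by the same code path as a single-scan file.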

Honest gap summary

piexifjs vs ExifTool: piexifjs covered roughly 10–15% of what -all= removes. The Comment marker survival was the most user-visible privacy gap. JFIF/APP0 is rarely meaningful, but it's still data the user expects to be stripped.

ExifTool -all= vs theoretical: essentially equivalent on a single-pass strip. ExifTool has been battle-tested against 20+ years of edge-case fixtures; a hand-rolled walker is exposed to whatever subset we test against.

Phase 1 walker vs ExifTool -all=: the policy table is identical for the marker classes covered. Differences are at the edges:

  • Fill bytes between markers (T.81 §B.1.1.2) — Phase 1 handles by skipping fill-byte runs at the top of each iteration.
  • Hierarchical / multi-frame JPEGs (rare) — Phase 1 handles single hierarchy via the SOS-then-entropy cycle re-entering for subsequent SOS markers.
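  • Granular tag-level operations (e.g. -EXIF:Orientation= keep): out of scope for the walker, apart from the Orientation-only case covered by preserveOrientation (see Phase 1 implementation); anything beyond that is planned for Phase 2 with a TIFF parser inside APP1.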

Recommendation

Hand-rolled segment walker. Reasoning:

  • JPEG marker structure is fully specified and well documented; the walker fits in ~150 lines of clean TypeScript.
  • WASM library options were both ruled out by the POCs.
  • The walker has zero production dependencies and ships ~111 KB (gzip) less than little_exif would have cost.
  • We control the marker policy directly — no library defaults to fight.

Phase 1 implementation

Lives at src/infrastructure/wasm/strategies/jpeg_strategy.ts. Key invariants:

  • Marker policy: as the table above. Mirrors ExifTool's -all= behaviour with three deliberate exceptions: APP14 always kept (decoder-affecting), APP2 kept on opt-in via preserveColorProfile, APP1 EXIF replaced with a minimal Orientation-only APP1 on opt-in via preserveOrientation.
  • Fill-byte tolerance: any number of consecutive 0xFF bytes before a marker code is permitted.
  • Truncation behaviour: missing EOI is a structural error and surfaces via Result<_, ExifError>. The walker does not silently return malformed JPEGs.
  • metadataRemoved: counts dropped APP/COM segments. A clean input that needed no changes returns 0, not 1 — callers must not treat 0 as a failure signal.
  • preserveOrientation: honored. When set, the walker parses IFD0 inside the original APP1 EXIF, extracts the Orientation tag (0x0112, SHORT, count 1), and emits a synthesized minimal APP1 carrying ONLY that tag. No Make/Model, no GPS, no SubIFDs, no MakerNotes survive. Both byte orders (II / MM) are supported. APP1 XMP is still dropped (XMP carries no orientation).
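
The preserveOrientation invariant can be illustrated with a sketch that builds an Orientation-only APP1 and reads it back. The function names are hypothetical, and rejecting values outside 1..8 is an assumption about how garbage values are treated. The layout (36 bytes including the marker, length field 34) is consistent with the 34-byte APP1 reported in the forensic notes.

```typescript
// Hypothetical sketch; not the project's jpeg_strategy.ts.
const TIFF_START = 10; // marker (2) + length (2) + "Exif\0\0" (6)

function buildOrientationApp1(orientation: number, littleEndian = false): Uint8Array {
  // Assumption: only the eight EXIF-defined orientations are accepted.
  if (!Number.isInteger(orientation) || orientation < 1 || orientation > 8) {
    throw new RangeError("orientation must be an integer in 1..8");
  }
  const seg = new Uint8Array(36);
  const view = new DataView(seg.buffer);
  view.setUint16(0, 0xffe1); // APP1 marker
  view.setUint16(2, 34); // segment length, big-endian, excludes the marker
  seg.set([0x45, 0x78, 0x69, 0x66, 0x00, 0x00], 4); // "Exif\0\0"
  seg.set(littleEndian ? [0x49, 0x49] : [0x4d, 0x4d], TIFF_START); // II or MM
  view.setUint16(TIFF_START + 2, 42, littleEndian); // TIFF magic
  view.setUint32(TIFF_START + 4, 8, littleEndian); // IFD0 offset from TIFF start
  view.setUint16(TIFF_START + 8, 1, littleEndian); // one IFD entry
  view.setUint16(TIFF_START + 10, 0x0112, littleEndian); // tag: Orientation
  view.setUint16(TIFF_START + 12, 3, littleEndian); // type: SHORT
  view.setUint32(TIFF_START + 14, 1, littleEndian); // count: 1
  view.setUint16(TIFF_START + 18, orientation, littleEndian); // inline value
  // Remaining bytes stay zero: value padding and next-IFD offset = 0.
  return seg;
}

function readOrientation(app1: Uint8Array): number | null {
  if (app1.length < TIFF_START + 8) return null;
  const sig = [0x45, 0x78, 0x69, 0x66, 0x00, 0x00];
  for (let k = 0; k < sig.length; k++) {
    if (app1[4 + k] !== sig[k]) return null; // not an EXIF APP1 (e.g. XMP)
  }
  const view = new DataView(app1.buffer, app1.byteOffset, app1.byteLength);
  const order = view.getUint16(TIFF_START);
  if (order !== 0x4949 && order !== 0x4d4d) return null;
  const le = order === 0x4949;
  const ifd = TIFF_START + view.getUint32(TIFF_START + 4, le);
  if (ifd + 2 > app1.length) return null;
  const entries = view.getUint16(ifd, le);
  for (let k = 0; k < entries; k++) {
    const e = ifd + 2 + 12 * k;
    if (e + 12 > app1.length) return null;
    const isOrientation =
      view.getUint16(e, le) === 0x0112 &&
      view.getUint16(e + 2, le) === 3 &&
      view.getUint32(e + 4, le) === 1;
    if (isOrientation) return view.getUint16(e + 8, le);
  }
  return null;
}
```

Since the synthesized segment is built from scratch rather than edited in place, nothing from the original IFD0 (Make, Model, GPS pointers, MakerNotes) can carry over into it.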

Compatibility note: APP0/JFIF removal

ExifTool's -all= drops APP0/JFIF and we follow that policy. Modern decoders (browsers, libjpeg/libjpeg-turbo, ImageMagick, Skia) don't require APP0 — they read sample dimensions from SOF and treat absence of APP0 as "no JFIF metadata." Some legacy strict-JFIF pipelines (older scanner pipelines, certain embedded image libraries) do require APP0 and may reject the cleaned output. If that becomes a real-world support issue, the cheap mitigation is to synthesize a minimal 18-byte APP0 (JFIF\0 identifier + version + units + density + zero thumbnail), which carries no PII. Not implemented in Phase 1; tracked under deferred items.
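
A sketch of what the 18-byte mitigation could look like, should it ever be needed. The function name is hypothetical, and the version (1.01) and 1:1 aspect-ratio density are assumptions chosen to be neutral, not values taken from the project:

```typescript
// Hypothetical sketch of a minimal JFIF APP0; not implemented in Phase 1.
function buildMinimalApp0(): Uint8Array {
  return Uint8Array.from([
    0xff, 0xe0, // APP0 marker
    0x00, 0x10, // segment length = 16 (excludes the 2 marker bytes)
    0x4a, 0x46, 0x49, 0x46, 0x00, // "JFIF\0" identifier
    0x01, 0x01, // version 1.01 (assumed)
    0x00, // units = 0: densities express an aspect ratio only
    0x00, 0x01, // Xdensity = 1
    0x00, 0x01, // Ydensity = 1
    0x00, 0x00, // thumbnail width and height = 0 (no thumbnail)
  ]);
}
```

JFIF expects APP0 to follow SOI immediately, so a walker using this would emit it right after the 0xFFD8 bytes.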

Privacy note: ICC profile preservation

preserveColorProfile: true keeps the APP2 ICC profile segment in the output. ICC profiles include cmmId, profile creator, dateTime, and description strings — a small but real fingerprint surface. Callers who need accurate colour reproduction should accept this trade-off explicitly; the default of false errs toward privacy.

Deferred to Phase 2 (if needed)

  • Comparison-corpus test against exiftool -all= on a diverse fixture set (Canon, Nikon, iPhone-via-Photos, Photoshop, GIMP) to expose any vendor-specific surprises.
  • Granular ICC scrubbing — write back the ICC profile with the identity-revealing fields zeroed instead of all-or-nothing.
  • Sub-error-codes so callers can distinguish "not a JPEG" from "truncated JPEG" from "valid JPEG processed cleanly with zero metadata to remove."
  • Synthesized minimal APP0 for strict-JFIF decoder compatibility (only if real-world support reports surface).
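
The sub-error-code item could take the shape of a discriminated union. This is a hypothetical sketch: JpegStripError, StripOutcome and describeOutcome are illustrative names, not the project's ExifError or Result types.

```typescript
// Hypothetical types for illustration; not the project's ExifError/Result.
type JpegStripError =
  | { kind: "not-jpeg" }
  | { kind: "truncated"; offset: number };

type StripOutcome =
  | { ok: true; bytes: Uint8Array; metadataRemoved: number }
  | { ok: false; error: JpegStripError };

function describeOutcome(outcome: StripOutcome): string {
  if (outcome.ok) {
    // metadataRemoved === 0 is a success: the input was already clean.
    return outcome.metadataRemoved === 0
      ? "clean: nothing to remove"
      : `stripped ${outcome.metadataRemoved} segment(s)`;
  }
  const err = outcome.error;
  if (err.kind === "not-jpeg") return "rejected: not a JPEG";
  return `rejected: truncated at byte ${err.offset}`;
}
```

With the three cases separated by type, a caller never has to infer failure from a zero count.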