B: Deployable webapp — Vite build, web adapters, JPEG/PDF strategies, PWA, Docker, CI
Phase B ships a static-asset web build that processes files entirely in the browser via WASM and pure-TS strategies — no server, no upload, fully offline. Same renderer code path as Electron; WASM strategies plug in via the existing `FormatStrategy` interface, and `platform.isWeb` drives routing.
## Web build infrastructure
- `vite.config.web.ts` + dedicated entry `src/web/main.tsx`
- Web adapters (`FileRegistry`, `BrowserFileBytes`, `makeWebApi`) mirror the Electron preload contract so the renderer is unchanged
- File picker button (`FileBrowseButton`) gated on `platform.isWeb`; rendered both in the empty state and in the StatusBar so users can add files mid-batch without clearing the list
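The adapter idea above can be sketched roughly as follows. This is an illustration only, not the code in this commit: `FileBytesApi`, `InMemoryFileRegistry`, `makeInMemoryWebApi`, and the `web-file-N` id scheme are hypothetical stand-ins for the real preload contract, `FileRegistry`, and `makeWebApi`.

```typescript
// Hypothetical renderer-facing contract; the real preload contract may differ.
interface FileBytesApi {
  readBytes(id: string): Promise<Uint8Array>;
  writeCleaned(id: string, bytes: Uint8Array): Promise<void>;
}

interface RegisteredFile {
  name: string;
  bytes: Uint8Array;
}

// Assumed web-side registry: opaque ids stand in for the file paths that
// the Electron build would resolve outside the renderer.
class InMemoryFileRegistry {
  private readonly files = new Map<string, RegisteredFile>();
  private nextId = 0;

  register(name: string, bytes: Uint8Array): string {
    const id = `web-file-${this.nextId}`;
    this.nextId += 1;
    this.files.set(id, { name, bytes });
    return id;
  }

  lookup(id: string): RegisteredFile {
    const file = this.files.get(id);
    if (file === undefined) throw new Error(`unknown file id: ${id}`);
    return file;
  }

  update(id: string, bytes: Uint8Array): void {
    this.files.set(id, { name: this.lookup(id).name, bytes });
  }
}

// The renderer talks to this object exactly as it would talk to the
// Electron-side API, so no renderer code needs a platform branch.
function makeInMemoryWebApi(registry: InMemoryFileRegistry): FileBytesApi {
  return {
    readBytes: async (id) => registry.lookup(id).bytes,
    writeCleaned: async (id, bytes) => registry.update(id, bytes),
  };
}
```

The design point is that only the object behind the contract changes between platforms; everything above it stays identical.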
## PWA + deployment
- `manifest.webmanifest`, service worker via `vite-plugin-pwa` (Workbox)
- `Dockerfile` (multi-stage Node 22 → nginx:alpine) with `nginx.conf` carrying COOP/COEP/CSP/cache headers
- Cloudflare Pages: `public/_headers` mirrors the nginx config; deploy workflow shipped at `.github/workflows/deploy-web.yml` (gated to `workflow_dispatch` until secrets are wired)
- Unified deploy guide at `docs/deploying.md` covering Cloudflare Pages, Docker + Caddy, Docker + nginx + certbot, Cloudflare Tunnel, and Tailscale Funnel
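Because `public/_headers` is meant to mirror `nginx.conf`, a small parity check can catch drift between the two. The sketch below is an assumption-laden illustration, not part of this commit: the function names are hypothetical, and it deliberately ignores path scoping (nginx `location` blocks, `_headers` URL patterns) and only compares the set of values seen per header name.

```typescript
type HeaderMap = Map<string, Set<string>>;

function record(map: HeaderMap, name: string, value: string): void {
  const key = name.trim().toLowerCase();
  const values = map.get(key) ?? new Set<string>();
  values.add(value.trim());
  map.set(key, values);
}

// Cloudflare Pages `_headers`: unindented lines are URL patterns,
// indented `Name: value` lines are headers for that pattern.
function parsePagesHeaders(text: string): HeaderMap {
  const map: HeaderMap = new Map();
  for (const line of text.split("\n")) {
    const isIndented = /^\s+\S/.test(line);
    const colon = line.indexOf(":");
    if (!isIndented || colon === -1 || line.trim().startsWith("#")) continue;
    record(map, line.slice(0, colon), line.slice(colon + 1));
  }
  return map;
}

// nginx: `add_header Name "value" always;` (quotes and `always` optional).
function parseNginxHeaders(text: string): HeaderMap {
  const map: HeaderMap = new Map();
  const directive = /^\s*add_header\s+(\S+)\s+(?:"([^"]*)"|(\S+?))(?:\s+always)?\s*;/gm;
  let match: RegExpExecArray | null;
  while ((match = directive.exec(text)) !== null) {
    record(map, match[1], match[2] ?? match[3] ?? "");
  }
  return map;
}

// Returns the lowercase header names whose value sets differ.
function mismatchedHeaders(a: HeaderMap, b: HeaderMap): string[] {
  const names = new Set(Array.from(a.keys()).concat(Array.from(b.keys())));
  const render = (map: HeaderMap, name: string): string =>
    Array.from(map.get(name) ?? new Set<string>()).sort().join(" | ");
  return Array.from(names)
    .filter((name) => render(a, name) !== render(b, name))
    .sort();
}
```

A check like this could run in CI against the two real files; an empty result would mean the header sets agree at this coarse level.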
## JPEG strategy — replaces piexifjs entirely
Hand-rolled marker walker at `src/infrastructure/wasm/strategies/jpeg_strategy.ts`. Mirrors ExifTool's `-all=` policy with two deliberate exceptions: APP14 (Adobe DCT) always kept, APP2 (ICC) kept on opt-in via `preserveColorProfile`. Drops APP0/APP1/APP3–APP13/APP15 + COM. Entropy-stream byte-stuffing and RST markers preserved verbatim.
- Fill-byte tolerance per T.81 §B.1.1.2
- Truncation surfaces as `Result<_, ExifError>` rather than silently emitting a no-EOI file
- Removes `piexifjs` and `@types/piexifjs` as production deps
- Closes the prior `TextDecoder("latin1")` corruption bug (WHATWG aliases `latin1` to windows-1252, so 0x80–0x9F bytes were silently rewritten)
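The marker-walking policy described above can be illustrated with a simplified sketch. It is not the implementation in `jpeg_strategy.ts`: the names `stripJpegSketch`, `keepSegment`, and `StripOutcome` are hypothetical, standalone markers outside scan data are not handled, and segments that appear between scans are copied verbatim.

```typescript
type StripOutcome =
  | { ok: true; bytes: Uint8Array }
  | { ok: false; error: string };

function concat(parts: Uint8Array[]): Uint8Array {
  const total = parts.reduce((sum, part) => sum + part.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.length;
  }
  return out;
}

// Drop APPn and COM; APP14 always stays, APP2 stays only on opt-in.
function keepSegment(marker: number, preserveColorProfile: boolean): boolean {
  if (marker === 0xee) return true; // APP14 (Adobe)
  if (marker === 0xe2) return preserveColorProfile; // APP2 (ICC)
  const isApp = marker >= 0xe0 && marker <= 0xef;
  return !isApp && marker !== 0xfe; // 0xFE is COM
}

function stripJpegSketch(input: Uint8Array, preserveColorProfile = false): StripOutcome {
  if (input.length < 2 || input[0] !== 0xff || input[1] !== 0xd8) {
    return { ok: false, error: "not a JPEG: missing SOI" };
  }
  const parts: Uint8Array[] = [input.subarray(0, 2)];
  let pos = 2;
  while (pos + 3 < input.length) {
    if (input[pos] !== 0xff) return { ok: false, error: `expected marker at offset ${pos}` };
    if (input[pos + 1] === 0xff) { pos += 1; continue; } // fill byte before a marker
    const marker = input[pos + 1];
    const length = (input[pos + 2] << 8) | input[pos + 3];
    const end = pos + 2 + length;
    if (length < 2 || end > input.length) break; // truncated segment
    if (marker !== 0xda) {
      if (keepSegment(marker, preserveColorProfile)) parts.push(input.subarray(pos, end));
      pos = end;
      continue;
    }
    // SOS: copy scan data verbatim (stuffed bytes and RSTn included) up to EOI.
    let scan = end;
    while (scan + 1 < input.length) {
      if (input[scan] !== 0xff) { scan += 1; continue; }
      const next = input[scan + 1];
      if (next === 0xd9) {
        parts.push(input.subarray(pos, scan + 2));
        return { ok: true, bytes: concat(parts) };
      }
      if (next === 0x00 || (next >= 0xd0 && next <= 0xd7)) scan += 2; // stuffing or RSTn
      else if (next === 0xff) scan += 1; // fill byte
      else if (scan + 3 < input.length) scan += 2 + ((input[scan + 2] << 8) | input[scan + 3]); // segment between scans
      else break;
    }
    break;
  }
  // No EOI reached: report it instead of emitting a file without one.
  return { ok: false, error: "truncated JPEG: no EOI found" };
}
```

Working on bytes end to end also avoids any string round-trip, which is the class of bug the `TextDecoder("latin1")` fix above closes.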
## PDF strategy — rewrites the previous "setX('')" pass
`pdf-lib` with `updateMetadata: false` to defeat auto-stamp of Producer / ModDate / Creator / CreationDate. Direct Info-dict key deletion. Drops the catalog `/Metadata` reference *and* the indirect XMP stream object (orphan-free). Scrubs annotation `/T`, `/Contents`, `/M`, `/CreationDate`, `/RC`, `/Subj`. Drops `/Lang`, `/PageLabels`, `/OutputIntents`, per-page `/Metadata`, per-page `/Thumb`. AcroForm and `/EmbeddedFiles` deferred behind opt-ins.
Also corrects the previous "pdf-lib re-injects Producer" claim in source comments + README — it does not (verified empirically; the string is an in-memory `getProducer()` fallback, not file content).
## Documentation pattern
Three companion folders for each format:
- `docs/gap-analysis/<format>.md` — current vs reference vs theoretical, *before* implementation. Contains `pdf.md` and `jpeg.md`.
- `docs/poc/<approach>.md` — library evaluations. Contains `little-exif-wasm.md` and `exiv2-wasm.md` (both ruled out — full size + coverage data).
- `docs/forensic/<format>.md` — adversarial recovery tests *after* implementation, with reproducible runner under `tools/forensic/`. Contains `pdf.md` (zero sentinel survival across `strings`, `exiftool -PDF-update:all=`, `qpdf --qdf`, and in-process pdf-lib indirect-object walk; ExifTool's own output leaks 8/10 sentinels via the same battery).
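The sentinel idea behind the forensic runner can be sketched as a raw byte search, roughly what a `strings` pass would catch. This is a hedged illustration, not the runner under `tools/forensic/`: `findSurvivors` is a hypothetical name, the `TITLE` sentinel value is invented for the example, and the sketch does not decompress streams or look for UTF-16 encoded strings.

```typescript
// Converts an ASCII sentinel to its byte form.
function asciiBytes(text: string): Uint8Array {
  const out = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i += 1) out[i] = text.charCodeAt(i) & 0xff;
  return out;
}

// Naive substring search over bytes; fine for small fixtures.
function containsBytes(haystack: Uint8Array, needle: Uint8Array): boolean {
  if (needle.length === 0) return true;
  for (let start = 0; start + needle.length <= haystack.length; start += 1) {
    let matched = true;
    for (let i = 0; i < needle.length; i += 1) {
      if (haystack[start + i] !== needle[i]) {
        matched = false;
        break;
      }
    }
    if (matched) return true;
  }
  return false;
}

// Returns the short names of sentinels still present in the output bytes.
function findSurvivors(output: Uint8Array, sentinels: Record<string, string>): string[] {
  return Object.keys(sentinels)
    .filter((name) => containsBytes(output, asciiBytes(sentinels[name])))
    .sort();
}
```

Giving each sentinel a unique tail means a hit maps back to exactly one metadata source, which is what makes the per-tool survival counts attributable.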
## CI
New workflow runs lint + typecheck + tests + electron compile + Playwright e2e on PRs and master. Platform builds compile on PRs but skip artifact upload so PR runs do not exhaust the GitHub Actions storage quota; uploads only happen on master pushes.
## Deferred (tracked as issues)
- HEIC strategy for mobile (issue #48 — most-hit format on iPhone)
- Mobile touch UX pass (#49)
- iOS Photos picker UX note (#50)
- Unsupported-format messaging (#51)
- PWA install prompt UX (#52)
- `preserveOrientation` honoring inside JPEG (Phase 2 of `docs/gap-analysis/jpeg.md`)
- Forensic comparison-corpus runs for JPEG and other formats
## Stats
- 369/369 unit tests, lint clean, typecheck clean
- Electron build: ~640 KB renderer bundle
- Web build: 1.1 MB precache (PWA shell)
- 30 commits on the branch (squashed into this one)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Commit dfbd680737 (parent 17ccdd20dd)
47 changed files with 7263 additions and 69 deletions
.github/workflows/ci.yml (vendored; 5 lines changed)
@@ -81,7 +81,10 @@ jobs:
       - name: Build macOS artifacts
         run: yarn packmac --publish never
 
+      # Only upload on master pushes — PR runs prove the build compiles,
+      # uploading every PR artifact eats the GitHub Actions storage quota.
       - name: Upload macOS artifacts
+        if: github.event_name == 'push'
         uses: actions/upload-artifact@v4
         with:
           name: ExifCleaner-macOS
@@ -108,6 +111,7 @@ jobs:
         run: yarn packwin --publish never
 
       - name: Upload Windows artifacts
+        if: github.event_name == 'push'
         uses: actions/upload-artifact@v4
         with:
           name: ExifCleaner-Windows
@@ -135,6 +139,7 @@ jobs:
         run: yarn packlinux --publish never
 
       - name: Upload Linux artifacts
+        if: github.event_name == 'push'
         uses: actions/upload-artifact@v4
         with:
           name: ExifCleaner-Linux
.github/workflows/deploy-web.yml (vendored; new file, 44 lines)
@@ -0,0 +1,44 @@
+name: Deploy Web App to Cloudflare Pages
+
+# Manual trigger only for now. To re-enable automatic deploys on push and
+# PR to master, replace this `on:` block with:
+#   on:
+#     push:
+#       branches: [master]
+#     pull_request:
+#       branches: [master]
+# (Requires CLOUDFLARE_API_TOKEN and CLOUDFLARE_ACCOUNT_ID secrets to be
+# set in repo settings — see docs/deploying.md.)
+on:
+  workflow_dispatch:
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    name: Build and Deploy
+    permissions:
+      contents: read
+      deployments: write
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: "22"
+          cache: "yarn"
+
+      - name: Install dependencies
+        run: yarn install --frozen-lockfile
+
+      - name: Build web app
+        run: yarn build:web
+
+      - name: Deploy to Cloudflare Pages
+        uses: cloudflare/wrangler-action@v3
+        with:
+          apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
+          accountId: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
+          command: pages deploy dist/web --project-name=exifcleaner-web --commit-dirty=true
@@ -51,6 +51,12 @@
"fa": "تصاویر، ویدئو یا فایلهای پیدیاف را اینجا بکشید تا متادیتا به صورت خودکار حذف شود.",
 "ca": "Arrossegueu i deixeu anar imatges, vídeos o fitxers PDF per a eliminar automàticament les metadades."
 },
+"empty.browseButton": {
+"en": "Choose Files"
+},
+"statusBar.addFiles": {
+"en": "Add files"
+},
 "table.header.filename": {
 "en": "Selected files",
 "da": "Valgte filer",
@@ -1311,4 +1317,4 @@
 "en": "Clear",
 "fr": "Effacer"
 }
 }
-}
+}
Dockerfile (new file, 17 lines)
@@ -0,0 +1,17 @@
+# Stage 1: build the static web bundle
+FROM node:22-alpine AS builder
+WORKDIR /app
+
+# Install dependencies first (better layer caching)
+COPY package.json yarn.lock ./
+RUN yarn install --frozen-lockfile
+
+# Copy source and build
+COPY . .
+RUN yarn build:web
+
+# Stage 2: serve with nginx
+FROM nginx:alpine AS server
+COPY --from=builder /app/dist/web /usr/share/nginx/html
+COPY nginx.conf /etc/nginx/nginx.conf
+EXPOSE 80
README.md (141 lines changed)
@@ -34,6 +34,47 @@ ExifCleaner v4.0 is a complete modernization — the first release since v3.6.0
 
 See the [CHANGELOG](CHANGELOG.md) for the full list of changes.
 
+## Project Direction
+
+ExifCleaner currently runs two metadata-removal engines side by side. The Electron desktop build ships a bundled Perl ExifTool, which gives it broad coverage across 90+ formats (the canonical list lives in [Supported File Types](#supported-file-types)). The web build, shipped in Phase B, processes files entirely in the browser via WASM and pure-JS strategies — no server, no upload, fully offline.
+
+The goal is to converge on a single code path: format strategies that run identically in both Electron and the browser. Same library, same behaviour, no Perl runtime, no platform fork. The plan, after the POCs documented under [`docs/poc/`](docs/poc/), is to **hand-roll pure-TypeScript marker and chunk walkers** for documented containers (JPEG, PNG, WebP, GIF, BMP, TIFF). They turned out to be smaller, more transparent, and more thorough than the WASM libraries we evaluated (`little_exif` and `exiv2-wasm` both leave significant metadata behind on JPEG/PNG; full writeups in [`docs/poc/`](docs/poc/)). For ISOBMFF-based formats (HEIC, AVIF), the existing video-strategy box walker provides a starting point; a targeted Rust→WASM module is the second-line option only if a hand-rolled approach proves insufficient.
+
+Server-side processing is **explicitly out of scope**. Uploading user files to a server, even as a "last resort fallback", would invalidate the privacy guarantee that defines this app — and "last resort" tends to drift to "default". The web build stays fully offline. Users who need formats the web build cannot handle are directed to the desktop app, which keeps bundled ExifTool indefinitely for that purpose.
+
+RAW formats are the honest exception. ExifTool represents roughly two decades of reverse-engineering on proprietary RAW containers (CR2, CR3, NEF, ARW, RAF, ORF, DNG, and dozens of vendor variants), and no production-ready WASM library covers that surface. RAW support stays on ExifTool inside the Electron desktop build and remains unsupported in the web build for the foreseeable future. RAW workflows belong on the desktop app.
+
+## Format Support Matrix
+
+A fast lookup of where each format stands today. The full list of 90+ ExifTool-supported formats is below under [Supported File Types](#supported-file-types).
+
+| Format | Electron | Web |
+| --- | --- | --- |
+| JPG, JPEG | Full (ExifTool) | Best-effort¹ (piexifjs strategy) |
+| PNG | Full (ExifTool) | Unsupported² |
+| GIF | Full (ExifTool) | Unsupported² |
+| WebP | Full (ExifTool) | Unsupported² |
+| BMP | Full (ExifTool) | Unsupported² |
+| TIFF | Full (ExifTool) | Unsupported² |
+| HEIC, HEIF | Full (ExifTool) | Unsupported² |
+| AVIF | Full (ExifTool) | Unsupported² |
+| PDF | Full (ExifTool) | Best-effort³ |
+| DOCX, XLSX, PPTX | Full (WASM strategy) | Full (WASM strategy) |
+| ODT | Full (WASM strategy) | Full (WASM strategy) |
+| MP4, MOV, M4V | Full (WASM strategy) | Full (WASM strategy) |
+| 3GP, 3G2 | Full (WASM strategy) | Full (WASM strategy) |
+| MKV | Unsupported | Unsupported |
+| RAW (CR2/CR3/NEF/ARW/RAF/ORF/DNG/...) | Best-effort⁴ (ExifTool) | Unsupported |
+| SVG, JXL, JPEG 2000 | Best-effort⁵ (ExifTool) | Unsupported |
+
+Footnotes:
+
+1. JPEG (web): the current piexifjs strategy clears the EXIF (APP1) segment but does not strip the JPEG Comment marker (0xFFFE), JFIF (APP0), or other APP segments. A hand-rolled JPEG segment walker is the planned replacement.
+2. Image formats listed as Unsupported in the web build fall through with an explicit "unsupported in web" error today. Hand-rolled marker/chunk walkers are the planned path; see [`docs/poc/`](docs/poc/) for the investigations that ruled out WASM library alternatives.
+3. PDF (web): the strategy clears the Info dictionary (Title, Author, Subject, Keywords, Producer, Creator, CreationDate, ModDate), drops the catalog `/Metadata` XMP stream and its indirect object, scrubs annotation author/comment/timestamp keys, and removes catalog-level fingerprints (`/Lang`, `/PageLabels`, `/OutputIntents`) plus per-page `/Metadata` and `/Thumb`. Embedded files and AcroForm data are not touched (they may carry legitimate document content). The strip is structurally cleaner than ExifTool's PDF behaviour, which uses incremental updates and leaves the original metadata recoverable in the file body — see ExifTool's own [limitations](https://exiftool.org/#limitations) ("the original metadata is never actually deleted"). Full analysis: [`docs/gap-analysis/pdf.md`](docs/gap-analysis/pdf.md).
+4. RAW: ExifTool's own docs warn that fully stripping a RAW will likely break rendering — proprietary tags are required to decode the image. ExifCleaner removes what it safely can. No production-ready WASM library covers proprietary RAW formats today, so RAW will likely stay on ExifTool (Electron) for the foreseeable future.
+5. SVG/JXL/JPEG 2000: ExifTool support varies by tag; see the official [ExifTool documentation](https://exiftool.org/) for per-format detail.
+
 ## Download and Install
 
 macOS 10.15+, Windows 10+, and Linux are supported (64-bit).
@@ -58,6 +99,42 @@ Each release includes a `SHASUMS256.txt` file. Download it from the [release pag
 sha256sum -c SHASUMS256.txt 2>&1 | grep OK
 ```
 
+## Running the web app locally
+
+ExifCleaner runs entirely in your browser — no server-side processing, no file uploads.
+
+### Option 1: Docker (recommended)
+
+```bash
+# Build the image
+docker build -t exifcleaner-web .
+
+# Run on http://localhost:8080
+docker run -p 8080:80 exifcleaner-web
+```
+
+Open http://localhost:8080. Drag and drop files to clean metadata.
+
+### Option 2: Node dev server
+
+Requires Node 22 and yarn:
+
+```bash
+yarn install
+yarn dev:web
+```
+
+Open http://localhost:5173.
+
+### Option 3: Build and preview
+
+```bash
+yarn build:web
+yarn preview:web
+```
+
+Open http://localhost:4173. This serves the same optimised bundle as production.
+
 ## Links
 
 - [Official Website](https://exifcleaner.com)
@@ -211,6 +288,70 @@ yarn lint # Prettier formatting check
 yarn typecheck # TypeScript strict mode check
 ```
 
+### Contributing a Format Strategy
+
+A `FormatStrategy` is a pure function that takes file bytes and returns cleaned bytes for one or more file extensions. Strategies are how ExifCleaner is unifying its desktop and web builds on a single processing pipeline — they run identically in Electron and the browser, with no Perl ExifTool dependency.
+
+The interface lives at [`src/infrastructure/wasm/format_strategy.ts`](src/infrastructure/wasm/format_strategy.ts):
+
+```typescript
+export interface FormatStrategy {
+  /**
+   * Returns the lowercase set of file extensions this strategy handles
+   * (each starting with a dot, e.g. ".docx").
+   */
+  readonly extensions: ReadonlySet<string>;
+
+  /**
+   * Optional magic-byte check to confirm the file content matches the
+   * declared extension. Returns true if confirmed, false to decline.
+   * If absent, extension match alone is sufficient.
+   */
+  readonly verifyMagicBytes?: (args: { bytes: Uint8Array }) => boolean;
+
+  /**
+   * Strips metadata from the file bytes and returns the cleaned bytes.
+   * Pure function — no I/O, no globals.
+   */
+  strip(args: {
+    bytes: Uint8Array;
+    options: StripOptions;
+  }): Promise<Result<StripResult, ExifError>>;
+}
+```
+
+To add a new strategy:
+
+1. Create `src/infrastructure/wasm/strategies/<name>_strategy.ts` and implement `FormatStrategy`. Keep it pure — accept bytes, return bytes.
+2. Register it in [`src/infrastructure/wasm/strategy_registry.ts`](src/infrastructure/wasm/strategy_registry.ts) by adding an instance to the `STRATEGIES` array.
+3. (Optional) Add the extension to [`src/renderer/utils/wasm_handled_extensions.ts`](src/renderer/utils/wasm_handled_extensions.ts) **only** if you want the Electron build to route this format through your strategy instead of ExifTool. The default for new formats is web-only; ExifTool remains the Electron default unless you opt in.
+4. Add tests at `tests/infrastructure/wasm/<name>_strategy.test.ts` — see the existing `image_strategy.test.ts`, `pdf_strategy.test.ts`, `office_strategy.test.ts`, and `video_strategy.test.ts` for the established patterns (round-trip fixtures, magic-byte rejection, malformed input handling).
+
+The smallest existing strategy is the JPEG path in [`src/infrastructure/wasm/strategies/image_strategy.ts`](src/infrastructure/wasm/strategies/image_strategy.ts). Class skeleton:
+
+```typescript
+export class ImageStrategy implements FormatStrategy {
+  readonly extensions: ReadonlySet<string> = new Set([".jpg", ".jpeg"]);
+
+  verifyMagicBytes({ bytes }: { bytes: Uint8Array }): boolean {
+    // JPEG SOI: FF D8
+    return (bytes[0] ?? 0) === 0xff && (bytes[1] ?? 0) === 0xd8;
+  }
+
+  async strip({
+    bytes,
+    options: _options,
+  }: {
+    bytes: Uint8Array;
+    options: StripOptions;
+  }): Promise<Result<StripResult, ExifError>> {
+    // ...delegates to piexifjs, returns { ok, value: { bytes, metadataRemoved } }
+  }
+}
+```
+
+See the file for the full implementation including dynamic `import("piexifjs")` and the latin1 round-trip.
+
 ### Adding a Translation
 
 All translations live in [`.resources/strings.json`](https://github.com/szTheory/exifcleaner/blob/master/.resources/strings.json). Add an entry for the new language code ([list of codes](https://www.electronjs.org/docs/api/locales)) under each string:
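The registration and routing steps described in the README hunk above could look roughly like the sketch below. Everything here is an assumption for illustration: `SketchStrategy`, `findStrategy`, and `shouldUseStrategy` are hypothetical names, and the real `strategy_registry.ts` and `wasm_handled_extensions.ts` may be organised differently.

```typescript
// Reduced version of the FormatStrategy shape: only the lookup-relevant parts.
interface SketchStrategy {
  readonly extensions: ReadonlySet<string>;
  readonly verifyMagicBytes?: (args: { bytes: Uint8Array }) => boolean;
}

// Lowercased extension including the dot, or undefined when there is none.
function extensionOf(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf(".");
  return dot === -1 ? undefined : fileName.slice(dot).toLowerCase();
}

// First strategy whose extension set matches and whose optional
// magic-byte check (if present) confirms the content.
function findStrategy(
  strategies: readonly SketchStrategy[],
  fileName: string,
  bytes: Uint8Array,
): SketchStrategy | undefined {
  const ext = extensionOf(fileName);
  if (ext === undefined) return undefined;
  return strategies.find(
    (s) => s.extensions.has(ext) && (s.verifyMagicBytes?.({ bytes }) ?? true),
  );
}

// Assumed routing rule: the web build always uses strategies; the Electron
// build uses one only for extensions that were explicitly opted in.
function shouldUseStrategy(
  platform: { isWeb: boolean },
  ext: string,
  electronOptIn: ReadonlySet<string>,
): boolean {
  return platform.isWeb || electronOptIn.has(ext.toLowerCase());
}
```

Keeping the opt-in set separate from the registry is what lets a new format ship web-only first while ExifTool stays the Electron default.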
docs/deploying.md (new file, 220 lines)
@@ -0,0 +1,220 @@
+# Deploying the web app
+
+The web build is plain static assets — HTML, JS, CSS, WASM, a manifest, a service worker. Anything that serves static files over HTTPS will host it. Two reference paths are documented below; pick whichever fits.
+
+- [Self-hosted Docker](#self-hosted-docker) — full control, runs anywhere, no third party serves your code. The bundled `nginx.conf` already has all required headers configured.
+- [Cloudflare Pages](#cloudflare-pages) — free hosting on a global CDN, auto-deploy from GitHub. Currently the GitHub Actions workflow is set to manual trigger only.
+
+If you have no preference, start with self-hosted Docker + Cloudflare Tunnel — it's the lowest lock-in path and easy to move off of later.
+
+## Common requirements
+
+- **HTTPS is mandatory.** Browsers refuse to register service workers over plain HTTP, which means PWA install will not work. The only exception is `localhost`, useful for development.
+- **Headers must be set.** The Docker image's `nginx.conf` and the Cloudflare Pages `public/_headers` are intentionally identical (COOP/COEP for SharedArrayBuffer, CSP with `'wasm-unsafe-eval'` for WASM, immutable cache for `/assets/*`, no-cache for `/sw.js`). If you change one, change the other.
+- **The build output is `dist/web/`.** Produced by `yarn build:web`. Self-hosted, you'll either copy this into the Docker image (the included `Dockerfile` does this) or serve it directly from any static host.
+
+## Self-hosted Docker
+
+The included `Dockerfile` is a multi-stage build: stage 1 builds the bundle with Node 22; stage 2 serves it with nginx Alpine. Final image is around 90 MB.
+
+### Build and run locally
+
+```bash
+docker build -t exifcleaner-web .
+docker run -d -p 8080:80 --name exifcleaner-web exifcleaner-web
+# → reachable at http://localhost:8080
+```
+
+This is enough for local testing. For phone testing or sharing, you need HTTPS — three concrete options below.
+
+### Option A: Cloudflare Tunnel (free, no port-forwarding)
+
+Easiest path. Install [`cloudflared`](https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/downloads/), point it at the local container, get a public HTTPS URL. Works from any network — no router or firewall configuration.
+
+Quick test (random URL, throwaway):
+
+```bash
+cloudflared tunnel --url http://localhost:8080
+# → outputs https://<random>.trycloudflare.com
+```
+
+Stable URL on a domain you own (requires a free Cloudflare account with the domain on Cloudflare's nameservers):
+
+```bash
+cloudflared tunnel login
+cloudflared tunnel create exifcleaner
+cloudflared tunnel route dns exifcleaner exifcleaner.example.com
+cloudflared tunnel run --url http://localhost:8080 exifcleaner
+```
+
+Cloudflare provisions and renews TLS certs automatically.
+
+### Option B: VPS with a reverse proxy
+
+The "I own the box" version. Any cheap VPS works (Hetzner, DigitalOcean, Vultr — typically $4–6/month). Run the Docker container on port 8080, put any reverse proxy in front to terminate TLS. Both nginx and Caddy work; pick what you already know.
+
+The Docker container's internal nginx already sets all required response headers (COOP/COEP/CSP). The reverse proxy preserves response headers by default, so no extra header configuration is needed at the proxy layer.
+
+#### Caddy (shortest config, auto-TLS)
+
+Caddy provisions and renews Let's Encrypt certs automatically. Full config:
+
+```caddy
+exifcleaner.example.com {
+    reverse_proxy localhost:8080
+}
+```
+
+`caddy run` (or systemd unit), point your DNS at the VPS, done.
+
+#### nginx + certbot (most operators already know it)
+
+```nginx
+# /etc/nginx/sites-enabled/exifcleaner
+server {
+    listen 80;
+    server_name exifcleaner.example.com;
+
+    location / {
+        proxy_pass http://localhost:8080;
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+    }
+}
+```
+
+Then provision the cert and rewrite the config to enable HTTPS:
+
+```bash
+sudo certbot --nginx -d exifcleaner.example.com
+```
+
+Certbot adds the SSL block and the HTTP→HTTPS redirect. It also installs a renewal cron/timer.
+
+### Option C: Tailscale Funnel (home server, no exposed ports)
+
+If the server is at home and you'd rather not expose ports on your router, [Tailscale Funnel](https://tailscale.com/kb/1223/funnel) exposes a Tailscale-internal service to the public internet over HTTPS. Free for personal use.
+
+```bash
+sudo tailscale funnel --bg --https=443 8080
+```
+
+The service is reachable at `https://<machine>.<tailnet>.ts.net`.
+
+## Cloudflare Pages
+
+Free tier: 500 builds/month, unlimited requests, unlimited bandwidth, 100 custom domains per project. No credit card required.
+
+The GitHub Actions workflow at `.github/workflows/deploy-web.yml` handles the deploy. **It is currently set to manual trigger only** (`workflow_dispatch`) so it doesn't run before secrets are configured. Re-enable automatic deploys by editing the `on:` block per the comment in the file.
+
+### 1. Create a Cloudflare account
+
+Sign up at <https://dash.cloudflare.com/sign-up>.
+
+### 2. Capture your Account ID
+
+After login, the dashboard URL is `https://dash.cloudflare.com/<account-id>`. The same value appears in the **Workers & Pages** tab right sidebar with a copy button.
+
+### 3. Create an API token scoped to Pages
+
+Use a scoped token, not the global API key.
+
+1. Top-right profile icon → **My Profile** → **API Tokens** → **Create Token**
+2. Pick **Custom token**
+3. **Permissions:** `Account` → `Cloudflare Pages` → `Edit`
+4. **Account Resources:** `Include` → your account
+5. **Continue** → **Create Token**
+6. **Copy the token now** — Cloudflare shows it once
+
+### 4. Add the two secrets to GitHub
+
+In the repo:
+
+1. **Settings** → **Secrets and variables** → **Actions** → **New repository secret**
+2. Add `CLOUDFLARE_API_TOKEN` (token from step 3)
+3. Add `CLOUDFLARE_ACCOUNT_ID` (account ID from step 2)
+
+### 5. Trigger the first deploy
+
+While the workflow is on `workflow_dispatch`:
+
+- Go to the **Actions** tab → **Deploy Web App to Cloudflare Pages** → **Run workflow** → pick branch → **Run**
+- The first run auto-creates the Pages project with name `exifcleaner-web`
+
+To re-enable auto-deploys on every push, edit the `on:` block in the workflow file. Once enabled:
+
+- **Push to `master`** → production deployment
+- **Pull request to `master`** → preview deployment per commit (URL posted in the action output)
+
+### 6. Find your URL
+
+- **Production:** `https://exifcleaner-web.pages.dev`
+- **Preview:** `https://<commit-hash>.exifcleaner-web.pages.dev`
+
+Both visible at: Cloudflare dashboard → **Workers & Pages** → `exifcleaner-web` → **Deployments**.
+
+### 7. Custom domain (optional)
+
+Free.
+
+1. **Workers & Pages** → `exifcleaner-web` → **Custom domains** → **Set up a custom domain**
+2. Enter the domain
+3. Add the DNS records Cloudflare shows. If the domain is on Cloudflare's nameservers, this is automatic.
+
+## Installing as a PWA on a phone
+
+Once the URL is reachable over HTTPS, the install flow is the same regardless of how it's hosted.
+
+### Android (Chrome)
+
+1. Open the URL
+2. Wait a few seconds for the service worker to register
+3. ⋮ menu → **Install app** (older Chrome: "Add to Home Screen")
+4. Confirm. Icon lands on the home screen, launches in standalone mode (no browser chrome). Subsequent launches work offline.
+
+### iOS (Safari)
+
+1. Open the URL
+2. Share sheet → **Add to Home Screen**
+3. Confirm.
+
+iOS PWAs have weaker capabilities than Android: no install banner, no Web Share Target API, more file-system limits. Functional, but second-class. See open issue #52 for the planned in-app install prompt UX.
+
+### Offline behaviour
+
+The first visit (online) caches the app shell + WASM modules via the service worker. After that, the PWA loads from cache regardless of connectivity. The actual file processing is in-browser anyway — files never leave the device — so offline use is a first-class case.
+
+## Headers configuration
+
+Two files, one truth. Keep them in sync.
+
+| File | Used by | Format |
+| --- | --- | --- |
+| `nginx.conf` | Docker image | nginx `add_header` directives |
+| `public/_headers` | Cloudflare Pages | [Cloudflare Pages headers syntax](https://developers.cloudflare.com/pages/configuration/headers/) |
+
+Both apply the same set:
+
+- `Cross-Origin-Opener-Policy: same-origin`
+- `Cross-Origin-Embedder-Policy: require-corp`
+- `X-Frame-Options: DENY`
+- `X-Content-Type-Options: nosniff`
+- `Referrer-Policy: no-referrer`
+- `Content-Security-Policy` with `'wasm-unsafe-eval'`
+- Long-cache for `/assets/*`, no-cache for `/sw.js`
+
+## Troubleshooting
+
+**PWA install option doesn't appear in the Chrome menu** — three things to check: (1) the URL is HTTPS, (2) the service worker registered (DevTools → Application → Service Workers should show one as "activated"), (3) the manifest is reachable at `/manifest.webmanifest` and parses cleanly. Without all three, Chrome won't surface the install prompt.
+
+**Service worker doesn't register on Cloudflare Pages** — check response headers in the Network tab. COOP/COEP must be on the HTML response. If they're missing, the `_headers` file likely didn't make it into the build output. Verify `dist/web/_headers` exists after `yarn build:web`.
+
+**Cloudflare workflow fails with "Authentication error"** — the API token scope must be `Account → Cloudflare Pages → Edit`. Account-level, not zone-level.
+
+**Cloudflare workflow fails with "account_id is required"** — `CLOUDFLARE_ACCOUNT_ID` is missing or wrong. It's the hex string from the dashboard URL, not the account email.
+
+**Phone shows "Add to Home Screen" but the icon launches in a normal browser tab** — manifest didn't pass Chrome's PWA criteria. Check that both 192px and 512px icons exist with `purpose: "any maskable"` and `display: "standalone"` is set in `manifest.webmanifest`.
+
+**Self-signed cert on a LAN-only setup gives a security warning** — browsers will not register service workers behind self-signed certs even if you click through the warning. Use Cloudflare Tunnel or Tailscale Funnel for HTTPS without exposing your home network.
125
docs/forensic/pdf.md
Normal file
125
docs/forensic/pdf.md
Normal file
|
|
@ -0,0 +1,125 @@
|
|||
# PDF forensic recovery test
|
||||
|
||||
**Date:** 2026-05-06
|
||||
**Goal:** Verify that metadata stripped by `PdfStrategy` cannot be recovered by an attacker with standard PDF forensic tooling. Compare against ExifTool `-all=` and Ghostscript pdfwrite as reference points.
|
||||
|
||||
**Reproducible at:** [`tools/forensic/pdf.ts`](../../tools/forensic/pdf.ts) — `npx tsx tools/forensic/pdf.ts` from the project root.
|
||||
|
||||
## Methodology
|
||||
|
||||
The runner generates a synthetic PDF fixture with **10 unique sentinel strings** embedded across every metadata source the gap analysis identified. Each sentinel is a 24-character ASCII string with a unique tail (e.g. `FORENSIC-AUTHOR-BBBB2222`) so any survivor can be unambiguously attributed to its source.
|
||||
|
||||
Sources covered:
|
||||
|
||||
| Sentinel | Where it lives | How it was injected |
|
||||
|---|---|---|
|
||||
| `TITLE` | `/Info /Title` | `doc.setTitle()` |
|
||||
| `AUTHOR` | `/Info /Author` | `doc.setAuthor()` |
|
||||
| `SUBJECT` | `/Info /Subject` | `doc.setSubject()` |
|
||||
| `PRODUCER` | `/Info /Producer` | `doc.setProducer()` |
|
||||
| `CREATOR` | `/Info /Creator` | `doc.setCreator()` |
|
||||
| `XMP_CREATOR` | XMP `/Metadata` stream `dc:creator` | raw stream object |
|
||||
| `XMP_TITLE` | XMP `/Metadata` stream `dc:title` | raw stream object |
|
||||
| `ANNOT_AUTHOR` | Page annotation `/T` | low-level annotation object |
|
||||
| `ANNOT_COMMENT` | Page annotation `/Contents` | low-level annotation object |
|
||||
| `LANG` | `/Catalog /Lang` | catalog dict set |
|
||||
|
||||
The fixture is then stripped three ways:

1. **`PdfStrategy`** — our Phase 1 implementation
2. **`exiftool -all= -overwrite_original`** — the canonical reference
3. **`gs -sDEVICE=pdfwrite`** — Ghostscript clean-rewrite as a third comparison

For each output, the runner applies five recovery techniques:

1. **Raw `strings`** — finds sentinels left in unencoded form anywhere in the file
2. **`exiftool -a -G1 -s`** — every visible metadata tag including hidden namespaces
3. **`exiftool -PDF-update:all=`** — ExifTool's "revert my last update" pseudo-tag, which restores metadata that was hidden via incremental updates
4. **`qpdf --qdf --object-streams=disable`** — decompresses every FlateDecode stream and disables object streams, exposing all dictionary contents in plain text
5. **Walk every indirect object via pdf-lib** — decompress streams in-process and search for sentinels

Plus structural checks: presence of `/Prev` (incremental-update chain), presence of the literal `BeginExifToolUpdate` marker, and `qpdf --check` validity.

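
As a rough illustration of the raw-`strings` technique and the byte-level structural checks, the sketch below does a naive byte search for each sentinel and for the `/Prev`, `BeginExifToolUpdate`, and `%%EOF` markers. The names `scanRawPdf`, `RawScan`, `asciiBytes`, and `countOccurrences` are hypothetical and are not taken from `tools/forensic/pdf.ts`. It only sees unencoded bytes, so it is no substitute for the `qpdf --qdf` or stream-walking passes.

```typescript
/** Encode an ASCII string as bytes without going through TextEncoder. */
function asciiBytes(text: string): Uint8Array {
  const out = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) out[i] = text.charCodeAt(i) & 0xff;
  return out;
}

/** Count non-overlapping occurrences of `needle` in `haystack` (naive byte search). */
function countOccurrences(haystack: Uint8Array, needle: Uint8Array): number {
  if (needle.length === 0) return 0;
  let count = 0;
  let i = 0;
  while (i + needle.length <= haystack.length) {
    let match = true;
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) {
        match = false;
        break;
      }
    }
    if (match) {
      count++;
      i += needle.length;
    } else {
      i++;
    }
  }
  return count;
}

interface RawScan {
  survivors: string[]; // short names of sentinels found in unencoded form
  hasPrev: boolean; // "/Prev" seen: suggests an incremental-update chain
  hasExifToolMarker: boolean; // literal "BeginExifToolUpdate" seen
  eofCount: number; // more than one "%%EOF" also suggests appended updates
}

/** Raw scan only: a plain substring match can false-positive (e.g. on "/Preview"). */
function scanRawPdf(bytes: Uint8Array, sentinels: { [name: string]: string }): RawScan {
  const survivors: string[] = [];
  for (const name of Object.keys(sentinels)) {
    if (countOccurrences(bytes, asciiBytes(sentinels[name])) > 0) survivors.push(name);
  }
  return {
    survivors,
    hasPrev: countOccurrences(bytes, asciiBytes("/Prev")) > 0,
    hasExifToolMarker: countOccurrences(bytes, asciiBytes("BeginExifToolUpdate")) > 0,
    eofCount: countOccurrences(bytes, asciiBytes("%%EOF")),
  };
}
```

A byte-level search is used instead of decoding the file to a string, so bytes above `0x7F` in compressed streams cannot be altered by a text decoder before the comparison.
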

## Results

| Check | Our strategy | ExifTool `-all=` | Ghostscript pdfwrite |
|---|---|---|---|
| Output size | 492 bytes | 2 249 bytes | 3 502 bytes |
| Has `/Prev` (incremental update chain) | no | **yes** | no |
| `BeginExifToolUpdate` marker | no | **yes** | no |
| `qpdf --check` valid | yes | yes | yes |
| Raw `strings` sentinels | **0** | 5 | 6 |
| ExifTool visible tags | **0** | 1 | 4 |
| After `-PDF-update:all=` revert | **0** | **8** | 4 |
| `qpdf --qdf` decompressed | **0** | 3 | 6 |
| Walk all streams (pdf-lib) | **0** | 2 | 4 |

Sentinel survival per recovery method (sentinels listed by short name):

**Our strategy — every check returns `[]`.** Output is 492 bytes; the smallest of the three.

**ExifTool `-all=`:**

- Raw strings: `XMP_CREATOR`, `XMP_TITLE`, `ANNOT_AUTHOR`, `ANNOT_COMMENT`, `LANG`
- Visible tags: `LANG` (others are hidden under the incremental update)
- After `-PDF-update:all=`: **`TITLE`, `AUTHOR`, `SUBJECT`, `PRODUCER`, `CREATOR`, `XMP_CREATOR`, `XMP_TITLE`, `LANG`** — eight sentinels recovered with one command
- `qpdf --qdf`: `ANNOT_AUTHOR`, `ANNOT_COMMENT`, `LANG`
- Walk all streams: `XMP_CREATOR`, `XMP_TITLE`

ExifTool itself emits the warning `Warning: [minor] ExifTool PDF edits are reversible. Deleted tags may be recovered!` when stripping. The test confirms that warning is literal: `exiftool -PDF-update:all=` recovers the original Info dict and XMP values from a single command.

**Ghostscript pdfwrite:**

- Raw strings: `TITLE`, `AUTHOR`, `SUBJECT`, `CREATOR`, `ANNOT_AUTHOR`, `ANNOT_COMMENT`
- Visible tags: `TITLE`, `AUTHOR`, `SUBJECT`, `CREATOR`
- After `-PDF-update:all=`: same as visible (no incremental update layer to revert)
- `qpdf --qdf`: all six raw-string sentinels
- Walk all streams: same as visible tags

Ghostscript pdfwrite is not a metadata-stripping pass: it copies the Info dict through (the PDF/A pipeline) and rewrites only Producer, to `GPL Ghostscript X.YZ`, which is why `PRODUCER` is the one Info sentinel that does not survive. Listed for comparison only; it is not a privacy tool.

## Interpretation

**Our strategy is the only output where every recovery technique returns zero survivors.** The combination of:

- Direct deletion of Info-dict keys (not "set to empty" — actually removed)
- `updateMetadata: false` on load (defeats pdf-lib's auto-stamp of Producer/ModDate)
- Removing both the catalog `/Metadata` reference *and* the indirect XMP stream object
- Walking pages to scrub annotation `/T` / `/Contents` / `/M` / `/CreationDate`
- pdf-lib's single-trailer rewrite (no `/Prev` chain, no incremental updates)

produces a file where the original metadata is genuinely gone — not hidden, not pending revert, not buried in an orphan stream. Output is also the smallest of the three (492 bytes vs ExifTool's 2 249 — ExifTool's strip *grows* the file because incremental updates append rather than replace).

**ExifTool's PDF strip is reversible by design.** Per its own [docs](https://exiftool.org/#limitations): "PDF - The original metadata is never actually deleted." The test demonstrates this concretely: a one-line `exiftool -PDF-update:all=` recovers eight of ten original sentinels, including the entire Info dictionary and the XMP `dc:creator` and `dc:title`. ExifTool also leaves the literal string `BeginExifToolUpdate` in the file, a fingerprint showing that the file was processed by ExifTool.

## Caveats and limits of this test

- The fixture is synthetic — generated with pdf-lib + low-level dict manipulation. Real-world PDFs from Word, Acrobat, InDesign, etc. have richer XMP profiles (`xmpMM` history, PRISM metadata, Adobe-specific extensions) that this fixture doesn't exercise. Our strip drops the entire `/Metadata` stream regardless of contents, so the result should still be zero survival, but extending the fixture with a captured real-world XMP is a worthwhile follow-up.
- We do not test `/Names` `/EmbeddedFiles` (deferred behind opt-in) or AcroForm field data (deferred). A document with attached files or filled-in form fields will still leak via those channels — by design, since both can carry legitimate document content.
- The 10 sentinels cover the categories from the [`pdf.md` gap analysis](../gap-analysis/pdf.md). We did not test page-level `/Metadata` streams or `/Thumb` thumbnails: both are dropped by the strategy code, but no sentinel in this fixture exercises them.
- The `-PDF-update:all=` revert was tried against all four files (input, our-stripped, exiftool-stripped, gs-stripped). It only succeeded on exiftool-stripped, which is expected — the others have no ExifTool update layer to revert, and ExifTool reports `File contains no previous ExifTool update`.
- This test is reproducible but not in CI yet. A natural follow-up is wiring `tools/forensic/pdf.ts` into a test or release-gate that fails if any sentinel survives.

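
One way such a release gate could look is sketched below. The shape of `report.json` is not documented in this writeup, so the `ForensicReport` type (output name, then technique name, then the list of surviving sentinel names) is purely an assumption, as are the names `sentinelFailures` and the technique keys used in the usage note.

```typescript
// ASSUMED schema: report[outputName][techniqueName] = surviving sentinel short names.
// The real report.json written by tools/forensic/pdf.ts may be shaped differently.
interface ForensicReport {
  [output: string]: { [technique: string]: string[] };
}

/** Return one human-readable failure line per technique that found survivors. */
function sentinelFailures(report: ForensicReport, gatedOutput: string): string[] {
  const techniques = report[gatedOutput];
  if (techniques === undefined) {
    // A missing entry is treated as a failure so the gate cannot pass vacuously.
    return [`no results recorded for ${gatedOutput}`];
  }
  const failures: string[] = [];
  for (const technique of Object.keys(techniques)) {
    const survivors = techniques[technique];
    if (survivors.length > 0) {
      failures.push(`${technique}: ${survivors.join(", ")}`);
    }
  }
  return failures;
}
```

A CI step would load the report, call `sentinelFailures(report, "our-stripped")`, and exit non-zero when the returned list is non-empty. Only the `PdfStrategy` output is gated; the ExifTool and Ghostscript outputs are expected to contain survivors.
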

## Reproducing

```bash
# From the project root
npx tsx tools/forensic/pdf.ts
```

Outputs go to `/tmp/pdf-forensic/`:

- `input.pdf` — the rich fixture
- `our-stripped.pdf` — `PdfStrategy` output
- `exiftool-stripped.pdf` — `exiftool -all=` output
- `gs-stripped.pdf` — Ghostscript pdfwrite output
- `report.json` — structured per-output sentinel-survival data
- `*-revert.pdf`, `*-qdf.pdf` — intermediate files from the recovery battery

Required tools: `exiftool`, `qpdf`, `gs`, `strings`. All are available on Debian/Ubuntu via apt (packages `libimage-exiftool-perl`, `qpdf`, `ghostscript`, and `binutils` respectively).

## What this directory is for

`docs/forensic/` documents adversarial recovery tests run *after* implementation lands, complementing `docs/gap-analysis/` (which runs *before* implementation to scope what should be removed). The pattern: implement → unit-test correctness → forensic-test unrecoverability → document the result.

Each format gets its own writeup as we go: `pdf.md` here, `jpeg.md` next time we run the same battery on JPEG fixtures with embedded EXIF/XMP/IPTC, etc. The runner scripts at `tools/forensic/<format>.ts` stay in the repo so the tests can be re-run any time the strategy changes.

89
docs/gap-analysis/jpeg.md
Normal file
@@ -0,0 +1,89 @@
# JPEG metadata-stripping gap analysis

**Date:** 2026-05-06 (retrofitted from the architecture decisions that drove the Phase 1 implementation)
**Goal:** Document the gap between the original `piexifjs`-based JPEG strategy and ExifTool's `-all=` JPEG strip, the scope of WASM library alternatives that were ruled out, and the rationale for the hand-rolled segment walker that ships in Phase 1.

## Methodology

Read:

- `piexifjs` source (the `remove()` function specifically) — confirmed it operates on APP1 (EXIF) only.
- ExifTool documentation at <https://exiftool.org/#limitations> for the JPEG segments removed by `-all=`.
- ITU-T T.81 (JPEG specification) §B.1 for marker assignments.
- The previous `image_strategy.ts` (the piexifjs wrapper) and the output it produced on real fixtures.

Verified empirically (in [`docs/poc/little-exif-wasm.md`](../poc/little-exif-wasm.md) and [`docs/poc/exiv2-wasm.md`](../poc/exiv2-wasm.md)):

- `piexifjs` leaves the JPEG Comment marker (`0xFFFE`) intact, even though a user-set Comment is the most user-visible PII source.
- `piexifjs` leaves JFIF/APP0 intact (resolution + units; usually inert but still metadata).
- The previous implementation also had a critical correctness bug: the `TextDecoder("latin1")` round-trip silently corrupted bytes `0x80–0x9F` because WHATWG aliases `latin1` to `windows-1252`, where those byte values do not all map 1:1 to code points U+0080–U+009F.
- `little_exif` (Rust → WASM): ~330 KB raw / 111 KB gzip. Left Comment + JFIF + PNG text chunks untouched, errored on TIFF.
- `exiv2-wasm`: ~2.3 MB raw / 925 KB gzip. The published API has no `erase` primitive — `writeString(buf, key, "")` sets values empty but leaves tag IDs in place.

Both library options ruled out for JPEG (and other image formats); see the POC writeups.

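
To make the `latin1` corruption concrete: under windows-1252, byte `0x80` decodes to U+20AC, and writing that code unit back as a single byte (`0x20AC & 0xFF`) yields `0xAC`, not `0x80`. The sketch below shows a byte-preserving conversion that avoids `TextDecoder` entirely, assuming a one-character-per-byte "binary string" is needed at all. The function names are illustrative; the Phase 1 walker operates on bytes directly and needs neither.

```typescript
/** Byte-preserving "binary string": each byte becomes the code unit of the same value. */
function bytesToBinaryString(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

/** Inverse of bytesToBinaryString; only valid for strings whose code units are <= 0xFF. */
function binaryStringToBytes(text: string): Uint8Array {
  const out = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    out[i] = text.charCodeAt(i) & 0xff;
  }
  return out;
}

/** True when every byte value survives the string round trip unchanged. */
function roundTripsCleanly(bytes: Uint8Array): boolean {
  const back = binaryStringToBytes(bytesToBinaryString(bytes));
  if (back.length !== bytes.length) return false;
  for (let i = 0; i < bytes.length; i++) {
    if (back[i] !== bytes[i]) return false;
  }
  return true;
}
```
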

## Per-segment policy

JPEG marker structure: each segment is `0xFF <code> <length-2-bytes-big-endian> <payload>`, except for standalone markers without a length field (SOI, EOI, RST0–RST7, TEM). After SOS, an entropy-coded scan stream extends until the next non-stuffed, non-restart marker. T.81 §B.1.1.2 also permits any number of `0xFF` "fill bytes" before a marker code.

| Marker | Code | Source of leak | piexifjs (before) | ExifTool `-all=` | Phase 1 walker |
|---|---|---|---|---|---|
| SOI | `FFD8` | n/a | keep | keep | keep |
| JFIF / APP0 | `FFE0` | density, JFIF version, optional thumbnail | leaves intact | drops | drops |
| EXIF / APP1 | `FFE1` | EXIF IFD, GPS, MakerNotes, XMP | drops | drops | drops |
| ICC / APP2 | `FFE2` | colour profile (`cmmId`, creator, `dateTime`, desc strings) | leaves | drops | drops by default; kept when `preserveColorProfile: true` |
| APP3..APP12 | `FFE3..FFEC` | various app-specific (Photoshop, Flashpix, MakerNotes, …) | leaves | drops | drops |
| Photoshop / IPTC / APP13 | `FFED` | 8BIM, IPTC, Photoshop image resources | leaves | drops | drops |
| Adobe / APP14 | `FFEE` | Adobe DCT encoding signal | leaves | leaves | **keeps** — required for correct decoding of some Adobe-encoded JPEGs |
| APP15 | `FFEF` | rare | leaves | drops | drops |
| Comment | `FFFE` | arbitrary string (filenames, review notes, user comments) | leaves intact | drops | drops |
| DQT, DHT, SOF, SOS, RST, EOI, DRI, DAC | various | image data | keep | keep | keep |

Entropy-coded data after SOS is copied byte-for-byte. `0xFF 0x00` byte-stuffing and `0xFF 0xD0..0xD7` restart markers within the stream are preserved (they're part of the entropy data, not metadata).

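
The policy table can be read as a small state machine. The sketch below is a simplified illustration of that idea, not the code in `jpeg_strategy.ts`: `stripJpegSketch` and `StripResult` are hypothetical names, the error is a plain string rather than an `ExifError`, and when `preserveColorProfile` is set it keeps every APP2 segment without checking for an ICC identifier.

```typescript
type StripResult =
  | { ok: true; bytes: Uint8Array; metadataRemoved: number }
  | { ok: false; error: string };

function stripJpegSketch(input: Uint8Array, preserveColorProfile: boolean = false): StripResult {
  if (input.length < 4 || input[0] !== 0xff || input[1] !== 0xd8) {
    return { ok: false, error: "not a JPEG (missing SOI)" };
  }
  const out: number[] = [0xff, 0xd8];
  let removed = 0;
  let pos = 2;
  while (pos < input.length) {
    if (input[pos] !== 0xff) return { ok: false, error: "expected marker at offset " + pos };
    // Fill bytes: any run of 0xFF may precede the marker code (T.81 B.1.1.2).
    while (pos < input.length && input[pos] === 0xff) pos++;
    if (pos >= input.length) break; // ran out inside fill bytes
    const code = input[pos++];
    if (code === 0xd9) {
      out.push(0xff, 0xd9); // EOI: the only successful exit
      return { ok: true, bytes: new Uint8Array(out), metadataRemoved: removed };
    }
    // Standalone markers carry no length field: TEM, SOI, RST0..RST7.
    if (code === 0x01 || code === 0xd8 || (code >= 0xd0 && code <= 0xd7)) {
      out.push(0xff, code);
      continue;
    }
    if (pos + 2 > input.length) break;
    const length = (input[pos] << 8) | input[pos + 1]; // big-endian, includes itself
    const end = pos + length;
    if (length < 2 || end > input.length) break;
    const isApp = code >= 0xe0 && code <= 0xef;
    const keep =
      code === 0xee || // APP14 (Adobe): always kept
      (code === 0xe2 && preserveColorProfile) || // APP2: opt-in
      (!isApp && code !== 0xfe); // everything that is not APPn or COM
    if (keep) {
      out.push(0xff, code);
      for (let i = pos; i < end; i++) out.push(input[i]);
    } else {
      removed++;
    }
    pos = end;
    if (code === 0xda) {
      // Entropy-coded data: copy verbatim until a real marker. 0xFF00 (stuffing),
      // 0xFFD0..0xFFD7 (restart) and 0xFFFF (fill) do not end the scan.
      while (pos < input.length) {
        if (input[pos] === 0xff && pos + 1 < input.length) {
          const next = input[pos + 1];
          const staysInScan = next === 0x00 || next === 0xff || (next >= 0xd0 && next <= 0xd7);
          if (!staysInScan) break;
        }
        out.push(input[pos++]);
      }
    }
  }
  return { ok: false, error: "truncated JPEG (no EOI)" };
}
```

Collecting output in a `number[]` keeps the sketch short; a production walker would more likely record keep/drop byte ranges and copy them with `subarray`.
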

## Honest gap summary

**piexifjs vs ExifTool:** piexifjs covered roughly 10–15% of what `-all=` removes. The Comment marker survival was the most user-visible privacy gap. JFIF/APP0 is rarely meaningful, but it's still data the user expects to be stripped.

**ExifTool `-all=` vs theoretical:** essentially equivalent on a single-pass strip. ExifTool has been battle-tested against 20+ years of edge-case fixtures; a hand-rolled walker is exposed to whatever subset we test against.

**Phase 1 walker vs ExifTool `-all=`:** the policy table is identical for the marker classes covered. Differences are at the edges:

- Fill bytes between markers (T.81 §B.1.1.2) — Phase 1 handles by skipping fill-byte runs at the top of each iteration.
- Hierarchical / multi-frame JPEGs (rare) — Phase 1 handles a single hierarchy: the SOS-then-entropy cycle is re-entered for each subsequent SOS marker.
- Granular tag-level operations (e.g. keeping `EXIF:Orientation` while dropping everything else) — out of scope for the walker; planned for Phase 2 with a TIFF parser inside APP1.

## Recommendation

Hand-rolled segment walker. Reasoning:

- JPEG marker structure is fully specified and well-documented; the walker comes to ~150 lines of clean TypeScript.
- WASM library options were both ruled out by the POCs.
- The walker has zero production dependencies and ships ~111 KB (gzip) less than `little_exif` would have cost.
- We control the marker policy directly — no library defaults to fight.

## Phase 1 implementation

Lives at `src/infrastructure/wasm/strategies/jpeg_strategy.ts`. Key invariants:

- **Marker policy:** as the table above. Mirrors ExifTool's `-all=` behaviour with two deliberate exceptions: APP14 always kept (decoder-affecting), APP2 kept on opt-in via `preserveColorProfile`.
- **Fill-byte tolerance:** any number of consecutive `0xFF` bytes before a marker code is permitted.
- **Truncation behaviour:** missing EOI is a structural error and surfaces via `Result<_, ExifError>`. The walker does not silently return malformed JPEGs.
- **`metadataRemoved`:** counts dropped APP/COM segments. A clean input that needed no changes returns `0`, not `1` — callers must not treat `0` as a failure signal.
- **`preserveOrientation`:** documented as not honored in Phase 1; would require a TIFF parser inside APP1. Tracked for Phase 2.

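
The `metadataRemoved` invariant matters mostly on the caller side. The sketch below shows one way a caller might branch on the outcome; the `Result`, `ExifErrorSketch`, and `StripOutcome` shapes are assumptions made for illustration and may not match the project's actual `Result<_, ExifError>` definition.

```typescript
// ASSUMED shapes, for illustration only.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };
interface ExifErrorSketch {
  message: string;
}
interface StripOutcome {
  bytes: Uint8Array;
  metadataRemoved: number; // count of dropped APP/COM segments
}

/** Failure is signalled only by the Result; a zero count means "already clean". */
function describeOutcome(result: Result<StripOutcome, ExifErrorSketch>): string {
  if (result.ok === false) {
    return "failed: " + result.error.message;
  }
  if (result.value.metadataRemoved === 0) {
    return "already clean";
  }
  return "removed " + result.value.metadataRemoved + " segment(s)";
}
```
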

### Compatibility note: APP0/JFIF removal

ExifTool's `-all=` drops APP0/JFIF and we follow that policy. Modern decoders (browsers, libjpeg/libjpeg-turbo, ImageMagick, Skia) don't require APP0 — they read sample dimensions from SOF and treat absence of APP0 as "no JFIF metadata." Some legacy strict-JFIF pipelines (older scanner pipelines, certain embedded image libraries) do require APP0 and may reject the cleaned output. If that becomes a real-world support issue, the cheap mitigation is to synthesize a minimal 18-byte APP0 (`JFIF\0` identifier + version + units + density + zero thumbnail), which carries no PII. Not implemented in Phase 1; tracked under deferred items.

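
For reference, the 18 bytes break down as a 2-byte marker plus a 16-byte segment. A minimal sketch follows; the function name `minimalJfifApp0` is hypothetical, and the version (1.01) and density (1:1 with no units) values are one reasonable choice rather than anything the project has settled on.

```typescript
/** Build a PII-free JFIF APP0 segment: marker (2 bytes) + 16-byte segment = 18 bytes. */
function minimalJfifApp0(): Uint8Array {
  return new Uint8Array([
    0xff, 0xe0, // APP0 marker
    0x00, 0x10, // segment length = 16 (counts these two bytes, not the marker)
    0x4a, 0x46, 0x49, 0x46, 0x00, // identifier "JFIF\0"
    0x01, 0x01, // version 1.01
    0x00, // units: 0 = none, densities express aspect ratio only
    0x00, 0x01, // X density = 1
    0x00, 0x01, // Y density = 1
    0x00, 0x00, // thumbnail width and height = 0, so no thumbnail data follows
  ]);
}
```

If this mitigation were adopted, the segment would be written immediately after SOI, which is where JFIF expects APP0 to appear.
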

### Privacy note: ICC profile preservation

`preserveColorProfile: true` keeps the APP2 ICC profile segment in the output. ICC profiles include `cmmId`, profile creator, `dateTime`, and description strings — a small but real fingerprint surface. Callers who need accurate colour reproduction should accept this trade-off explicitly; the default of `false` errs toward privacy.

## Deferred to Phase 2 (if needed)

- `preserveOrientation` flag honoring — TIFF IFD parsing inside APP1 to extract just tag `0x0112` and re-inject as a minimal APP1.
- Comparison-corpus test against `exiftool -all=` on a diverse fixture set (Canon, Nikon, iPhone-via-Photos, Photoshop, GIMP) to expose any vendor-specific surprises.
- Granular ICC scrubbing — write back the ICC profile with the identity-revealing fields zeroed instead of all-or-nothing.
- Sub-error-codes so callers can distinguish "not a JPEG" from "truncated JPEG" from "valid JPEG processed cleanly with zero metadata to remove."
- Synthesized minimal APP0 for strict-JFIF decoder compatibility (only if real-world support reports surface).

159
docs/gap-analysis/pdf.md
Normal file
@@ -0,0 +1,159 @@
# PDF metadata-stripping gap analysis

**Date:** 2026-05-06
**Goal:** Compare what `pdf-lib` actually clears today (in `src/infrastructure/wasm/strategies/pdf_strategy.ts`) against what ExifTool clears, and against what is theoretically possible with a hand-rolled rewrite. Drives the decision on whether to replace pdf-lib, hand-roll, or keep it for an explicitly-scoped first PR.

## Methodology

Read:

- `src/infrastructure/wasm/strategies/pdf_strategy.ts` — the current implementation.
- `tests/infrastructure/wasm/pdf_strategy.test.ts` — what is currently asserted.
- `tests/fixtures/wasm/pdf/sample.pdf` — the test fixture (901 bytes; produced by pdf-lib; Title/Author/Subject/Creator/Producer set).
- `README.md` "File writer limitations" + Format Support Matrix footnote 3.
- ExifTool docs at <https://exiftool.org/#limitations> and <https://exiftool.org/TagNames/PDF.html>.

Ran (all in `/tmp/pdf-poc/`, nothing added to `package.json`):

- Generated rich PDFs with pdf-lib 1.17.1, supplemented with `exiftool -XMP-*=` to inject an XMP stream.
- Generated a PDF with annotations (`/T` author, `/Contents`, `/CreationDate`) via low-level pdf-lib `ctx.obj()`.
- Stripped each fixture three ways: pdf-lib (replicating `pdf_strategy.ts` exactly), `exiftool -all= -overwrite_original`, and `gs -sDEVICE=pdfwrite` (as a "structurally clean" baseline).
- Diffed each output with `exiftool -a -G1 -s`, raw `strings`, `xxd`, and a custom inflater that walks `/Length`-declared streams and decompresses the FlateDecode object streams to reveal the actual on-disk dictionary contents.
- Bypass test: tried `doc.catalog.delete(PDFName.of('Metadata'))` to see if pdf-lib can drop the XMP catalog reference, and what that does to the orphaned stream object.

## Verified facts about the current pdf-lib behaviour

The existing `pdf_strategy.ts` calls `setTitle/setAuthor/setSubject/setKeywords/setProducer/setCreator` with empty values and saves. Empirically, on pdf-lib 1.17.1:

- All six Info-dict fields end up as `<FEFF>` (UTF-16 BOM with no data) on disk. Verified by inflating the object stream:

  ```text
  /Producer <FEFF>
  /ModDate (D:20260505214006Z)
  /Creator <FEFF>
  /CreationDate (D:20240115103000Z)
  /Title <FEFF>
  /Author <FEFF>
  /Subject <FEFF>
  /Keywords <FEFF>
  ```

- **The `pdf_strategy.ts` source comment ("pdf-lib re-injects 'pdf-lib (...)' as Producer on every save") and the `pdf_strategy.test.ts` assertion `expect(cleaned.getProducer()).toContain("pdf-lib")` are misleading.** `setProducer("")` in pdf-lib 1.17.1 does write an empty Producer to the on-disk Info dict — `exiftool -Producer fixture-stripped.pdf` returns an empty string. What the test is observing is `doc.getProducer()` returning the string `"pdf-lib (...)"` — that's pdf-lib's *in-memory default fallback* when Producer is read back from a previously-saved-and-reloaded doc, not a value present in the file. README footnote 3 inherits the same misconception. Neither blocks the strip in practice, but the comment + test + footnote should be corrected.
- `/CreationDate` is **not** clearable through pdf-lib's API — there is no `setCreationDate(undefined)` or `clearCreationDate`. It survives every strip from the current code.
- `/ModDate` is rewritten to **the current time** on every save. This is a privacy regression: the cleaned file leaks "this file was metadata-stripped at YYYY-MM-DD hh:mm:ss" — a fingerprint that wasn't in the input. The old ModDate is gone, but a new one is silently added.
- pdf-lib does **not** use incremental updates on save — it rewrites the file structure cleanly with a single `xref` and one `%%EOF`. This is a structural advantage over ExifTool. (No `/Prev` key in the trailer, no `%BeginExifToolUpdate` marker.)
- pdf-lib does not write a `/ID` array in the trailer at all. The input's `/ID` fingerprint does not propagate, and no new ID is added.
- pdf-lib **preserves any `/Metadata` (XMP) stream** referenced from the catalog as-is — every Dublin Core, XMP, and pdf:* property survives untouched.
- `doc.catalog.delete(PDFName.of('Metadata'))` removes the catalog reference but leaves the XMP stream as an orphaned (unreferenced) object that still exists in the file body. `exiftool` no longer surfaces it, but `strings` and any forensic walker reading the raw object table will. Without a true "garbage-collect orphans + rewrite" pass, dereferencing alone is no better than ExifTool's incremental-update trick.
- pdf-lib does not touch annotation `/T` (author), `/Contents`, `/M`, or `/CreationDate`. It does not touch page-level `/Metadata`, `/Thumb`, `/Names`, `/EmbeddedFiles`, `/AcroForm`, or `/PageLabels`.

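
The `<FEFF>` values above are PDF hex strings: a UTF-16BE byte-order mark followed by zero characters. A small decoder makes it easy to confirm what an inflated Info dict actually holds. This is an illustrative sketch (the name `decodePdfHexString` is hypothetical), and the non-BOM branch treats each byte as one code unit, which matches PDFDocEncoding only for the ASCII range.

```typescript
/** Decode a PDF hex string token such as "<FEFF00540069>" into text. */
function decodePdfHexString(token: string): string {
  // Drop the angle brackets and any whitespace inside the token.
  let hex = token.replace(/^<|>$/g, "").replace(/\s+/g, "");
  // An odd number of digits is read as if a trailing 0 were present.
  if (hex.length % 2 === 1) hex += "0";
  const bytes: number[] = [];
  for (let i = 0; i < hex.length; i += 2) {
    bytes.push(parseInt(hex.slice(i, i + 2), 16));
  }
  let out = "";
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    // UTF-16BE text string: skip the BOM, then combine byte pairs into code units.
    for (let i = 2; i + 1 < bytes.length; i += 2) {
      out += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
    }
    return out;
  }
  // No BOM: approximate PDFDocEncoding as one code unit per byte (exact for ASCII).
  for (let i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}
```

Applied to the dump above, `decodePdfHexString("<FEFF>")` yields an empty string: the key is still present in the Info dict, but its value carries no characters.
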

## Verified facts about ExifTool

`exiftool -all= -overwrite_original` on the same fixture:

- Emits the warning: `Warning: [minor] ExifTool PDF edits are reversible. Deleted tags may be recovered!`
- Adds an incremental update: original objects (containing every Info-dict and XMP value) stay in the file, plus a new Catalog object that hides them. The file ends with two `%%EOF` markers, two `startxref`, and the literal markers `%BeginExifToolUpdate ... %EndExifToolUpdate`.
- `strings cleaned.pdf` returns clean output only because the original Info dict sits inside a FlateDecode-compressed stream (the XMP stream is also still present). Inflating reveals everything intact: `/Producer <FEFF...SecretInternalToolProducerStamp>`, `/Title <FEFF...OriginalTitle>`, etc.
- `-PDF-update:all=` (an ExifTool-specific pseudo-tag) only removes ExifTool's own update layers — i.e. it *reverts* a previous strip rather than performing a stronger one.
- Annotation `/T`, `/Contents`, etc., are unaffected.

Quoting ExifTool's docs verbatim:

> "PDF - The original metadata is never actually deleted." (<https://exiftool.org/#limitations>)
>
> "All metadata edits are reversible. While this would normally be considered an advantage, it is a potential security problem because old information is never actually deleted from the file." (<https://exiftool.org/TagNames/PDF.html>)
>
> "[To permanently remove old information,] use the 'qpdf' utility with linearization."

## What's theoretically possible

A hand-rolled approach can do everything ExifTool refuses to do, plus catch sources neither tool today addresses. The reference is `qpdf --linearize input output` (rewrite the cross-reference, drop unreferenced objects, no incremental updates) plus targeted dictionary scrubbing. In a JS implementation that means: parse the xref, walk every indirect object, drop or scrub the offending ones, regenerate the xref + trailer, and emit a new file with no `/Prev` chain. ~600–1000 lines of TypeScript for a hand-rolled minimal parser, considerably less for a pdf-lib-assisted hybrid that uses pdf-lib's parser and emits via its serializer.

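
The "drop unreferenced objects" step is an ordinary reachability walk. The sketch below models the object table as a plain map from object number to outgoing references. That model is a deliberate simplification and an assumption of this example: it is neither pdf-lib's `PDFContext` nor a real xref parser, and the names `ObjectTable`, `reachableFrom`, and `collectOrphans` are hypothetical.

```typescript
// Toy model: object number -> object numbers it references.
interface ObjectTable {
  [objectNumber: number]: number[];
}

/** Object numbers reachable from the roots (e.g. the catalog), in ascending order. */
function reachableFrom(table: ObjectTable, roots: number[]): number[] {
  const seen: { [objectNumber: number]: boolean } = {};
  const stack = roots.slice();
  while (stack.length > 0) {
    const current = stack.pop() as number;
    // Skip already-visited objects and dangling references to missing objects.
    if (seen[current] === true || table[current] === undefined) continue;
    seen[current] = true;
    for (const ref of table[current]) stack.push(ref);
  }
  return Object.keys(seen).map(Number).sort((a, b) => a - b);
}

/** Objects present in the table but unreachable from the roots: candidates to drop. */
function collectOrphans(table: ObjectTable, roots: number[]): number[] {
  const live = reachableFrom(table, roots);
  return Object.keys(table)
    .map(Number)
    .filter((n) => live.indexOf(n) === -1)
    .sort((a, b) => a - b);
}
```

In this model, deleting the catalog's `/Metadata` entry just removes one edge; the XMP stream object stays in the table until a pass like `collectOrphans` identifies it and the writer omits it.
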

## Per-source comparison

| Source | What pdf-lib does today | What ExifTool does | What's theoretically possible |
|---|---|---|---|
| `/Info` Title | Sets to UTF-16 empty (`<FEFF>`) — key remains, value gone | Adds new empty Info object via incremental update; original Title still in old object | Drop `/Info` reference from trailer entirely + rewrite without the old Info object |
| `/Info` Author | Same as Title | Same as Title | Same as Title |
| `/Info` Subject | Same | Same | Same |
| `/Info` Keywords | Same (empty UTF-16 string instead of empty array) | Same | Same |
| `/Info` Creator | Same | Same | Same |
| `/Info` Producer | Same — Info dict has `<FEFF>` on disk; in-memory `getProducer()` falls back to "pdf-lib (...)" | Same | Same |
| `/Info` CreationDate | **Survives** — no API to clear | Hidden via incremental update; original survives in old object | Drop entire Info object + rewrite |
| `/Info` ModDate | **Rewritten to NOW on every save** — adds a new fingerprint | Same as CreationDate | Drop entire Info object + rewrite (no new ModDate written) |
| `/Metadata` XMP stream | **Preserved untouched** | Hidden via incremental update; original stream still in file | Drop both catalog reference and the stream object via rewrite |
| `/Catalog` `/Lang` | **Preserved** | Preserved | Drop key from catalog dict |
| `/Catalog` `/PageLabels` | Preserved | Preserved | Drop key |
| `/Catalog` `/Names` (incl. `/EmbeddedFiles`) | Preserved | Preserved (and ExifTool will not delete file attachments) | Drop the `/Names` tree, drop attached file streams |
| `/Catalog` `/OutputIntents` (color profile metadata) | Preserved | Preserved | Drop key (caveat: may affect color reproduction) |
| Page-level `/Metadata` (per-page XMP) | Preserved | Preserved | Walk pages, drop key + stream |
| Page `/Thumb` (page thumbnails) | Preserved | Preserved | Walk pages, drop key + stream |
| Annotations `/Annots` `/T` author | **Preserved** | **Preserved** | Walk every page's `/Annots`, scrub `/T`, `/Contents`, `/M`, `/CreationDate`, `/RC`, `/AP` |
| Annotations `/Contents` | Preserved | Preserved | Same |
| AcroForm field defaults / `/DA` / `/DR` | Preserved | Preserved (limited write support) | Walk `/AcroForm`, scrub field metadata |
| Trailer `/ID` array | **Not written** (pdf-lib emits no `/ID`) | Updated as part of incremental update; old `/ID` retrievable from old trailer | Generate fresh random `/ID` pair on rewrite (or drop entirely) |
| Encryption dictionary | Decrypted on `ignoreEncryption: true` then output is unencrypted | Same | Same — output is unencrypted by definition once we strip |
| Linearization hint stream | Stripped (pdf-lib doesn't preserve linearization) | **Stripped** when ExifTool rewrites; preserved when only updating | Either drop or regenerate via qpdf-style pass |
| Cross-reference comments | None to begin with (pdf-lib emits clean xref) | Adds `%BeginExifToolUpdate` / `%EndExifToolUpdate` literal comments — a clear "this file was processed by ExifTool" fingerprint | Emit a single clean xref with no commentary |
| Extra trailer dictionaries (incremental update history `/Prev`) | None — single trailer | **Adds one** every time; if input already had history, it survives | Walk and merge to a single trailer; drop `/Prev` chain |
| Hidden / replaced objects (orphans from prior incremental updates) | pdf-lib's parser only retains objects referenced from the new catalog → orphans dropped on save (in our tests, an explicit `catalog.delete('Metadata')` left the XMP object orphaned but **still in the file**, so this needs verification per object class) | **Preserved** by design — that's what makes ExifTool's strip reversible | Garbage-collect: keep only objects reachable from the new catalog, write only those |

## Honest gap summary

**pdf-lib vs ExifTool**: roughly even on the Info dictionary; pdf-lib is *better* in two underrated ways, *worse* in one, and comparable elsewhere:

- pdf-lib: better — emits a clean single-trailer rewrite, no `%BeginExifToolUpdate` fingerprint, original Info-dict bytes are not retained in the output (because pdf-lib's serializer rebuilds the object table from the parsed in-memory model, not from the original byte stream — hidden objects from prior incremental updates do get dropped during this rebuild).
- pdf-lib: better — does not write a `/ID` to the trailer (no document fingerprint propagates).
- pdf-lib: worse — adds a fresh `/ModDate` of "now" on every save. ExifTool also updates ModDate, so relative to a hand-rolled strip the gap is the same for both tools.
- pdf-lib: comparable — leaves XMP stream, annotations, /Lang, /PageLabels, /Names, page Thumbs, page-level Metadata, AcroForm exactly as ExifTool does.
- pdf-lib: comparable — has no API surface for `/CreationDate` removal.

**ExifTool vs theoretical**: ExifTool is fundamentally limited by its design choice to use incremental updates. Per its own docs the recommended workaround is `qpdf --linearize`. This is not a fixable limitation in ExifTool — it's a documented "this format is not supported for true deletion."

**Theoretical vs both**: a rewrite-based hand-rolled strategy can additionally close the annotation, AcroForm, page-level Metadata, /Lang, /Names, /Thumb, embedded-files, and ModDate-leak gaps. None of these are addressed by either tool today.

**Top three sources that matter most for actual privacy use cases:**

1. **`/Metadata` XMP stream** — this is where authoring-tool fingerprints, original Title/Author/Subject from Word/InDesign/Acrobat live. Most "leaked" PDF metadata in the real world (lawyers' redaction failures, author names on government documents, internal-build identifiers) ships in XMP, not the Info dict. Both tools fail here. Producer reveals authoring software ("Microsoft Word for Microsoft 365", "Adobe Acrobat Pro DC 22.x", "skia/PDF m118 Google Docs Renderer") which fingerprints the user's environment.
2. **Annotations `/T` and `/Contents`** — review comments and authorial markup carry reviewer names, internal review timestamps, and the actual review text. Neither tool clears these. This is the source most likely to embarrass an end-user (e.g. "John Reviewer: this number is wrong, change to X" surviving a strip).
3. **`/Info` `/CreationDate` + new `/ModDate`** — pdf-lib leaves the original CreationDate and adds a "now" ModDate, which together fingerprint both the document's age and the strip event. Defensible to clear both.

Secondary but worth covering in the same pass:

- **`/Catalog` `/Lang`** — leaks user locale.
- **`/Names` `/EmbeddedFiles`** — embedded attachments can carry whole spreadsheets or originals.
- **`/Catalog` `/Metadata` orphans from prior tool chains** — anything that has been ExifTool'd or repeatedly updated has shadow metadata.

## Recommendation

**Keep pdf-lib as the parser/serializer. Do not replace it with a from-scratch PDF parser; that is a 600+ line project with high crash-on-malformed-input surface area. Instead, extend the strip with three targeted catalog/dict mutations using pdf-lib's low-level API.**

The hand-rolled-everything path is unattractive specifically for PDF (unlike JPEG/PNG, where a marker walker is ~80 lines): PDFs require xref parsing, cross-reference streams, FlateDecode/Predictor filters, encryption negotiation, object stream parsing. pdf-lib already does all of that correctly. We don't need a different library; we need to use the one we have more aggressively.

The interesting question is whether to replace pdf-lib with **`qpdf-wasm`** (a port of the canonical PDF rewrite tool ExifTool itself recommends). It would provide true garbage-collection of orphan objects and a rewrite-based strip. Bundle size cost is significant (qpdf is large; published wasm builds are ~1.5–3 MB). Worth evaluating in a separate POC if Phase 1 below proves insufficient. Given the gzip-size patterns observed in `docs/poc/exiv2-wasm.md` (925 KB exiv2-wasm rejected), `qpdf-wasm` is unlikely to clear the size bar unless it ships a much smaller pdf-strip-only build.

For Electron (where ExifTool is bundled), pdf-lib is *already* the right answer for PDF — ExifTool's PDF strip is provably worse (incremental updates retain everything) and we should consider routing PDFs to the WASM strategy on Electron too for any case where speed allows. That's a larger architectural call; out of scope here.

## Phase 1 plan: tightly-scoped first PR

**Goal**: close the three highest-impact gaps (XMP, annotations, dates) within `pdf_strategy.ts`, without bringing in a new library.

In scope:

1. **Drop the catalog `/Metadata` reference + verify the orphan is gone after save**. If pdf-lib's serializer doesn't garbage-collect it (our scratch testing suggests it does NOT — the orphaned XMP stream remained), implement a manual "build a fresh PDFDocument and copy only the page content + non-metadata catalog entries" rebuild, which forces a clean object table.
2. **Walk every page's `/Annots` array and mutate each annotation dict to drop `/T`, `/Contents`, `/M`, `/CreationDate`, `/RC` (rich text), and `/Subj` (annotation subject).** Leave `/Type`, `/Subtype`, `/Rect`, `/AP` (appearance — visual-only), and structural geometry. This preserves annotation visibility but scrubs authorship.
3. **Drop catalog-level metadata fingerprints**: `/Lang`, `/PageLabels`, `/Names` (with caveat below), `/OutputIntents`, page-level `/Metadata`, page-level `/Thumb`. Gate each behind an individual `StripOptions` flag so users can opt back in if needed (e.g. `/OutputIntents` may matter for color-managed workflows).
4. **Drop `/Info` `/CreationDate` and `/ModDate` entries** by reaching into `doc.context.lookup(infoRef, PDFDict).delete(PDFName.of('CreationDate'))` rather than relying on pdf-lib's high-level setters, which always emit a new ModDate.
5. **Fix the Producer comment + test + README footnote 3** to match observed behaviour. Replace the `expect(cleaned.getProducer()).toContain("pdf-lib")` assertion with a check that the on-disk Info dict has no Producer key at all (or an empty one), and clarify in `pdf_strategy.ts` that the in-memory `getProducer()` returns a default fallback that is *not* present in the file.
6. **Test fixture upgrade**: replace the 901-byte minimal fixture with a richer fixture (or add a second fixture) that exercises XMP, annotations, embedded files, and `/Lang`. Generated via a small `tests/fixtures/wasm/pdf/` build script, not committed in binary form unless small enough.
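
Step 2 above boils down to deleting a fixed set of keys from each annotation dictionary. The following is a minimal sketch under stated assumptions, not the strategy's actual code: `DictLike`, `scrubAnnotation`, and `scrubPageAnnotations` are hypothetical names, and the dictionary is modelled as anything with a string-keyed `delete` (a plain `Map` works). pdf-lib's `PDFDict.delete` takes a `PDFName`, so a thin adapter would be needed to wire this up.

```typescript
// Keys from step 2 that carry authorship, comment text, or timestamps.
const ANNOTATION_PII_KEYS = ["T", "Contents", "M", "CreationDate", "RC", "Subj"] as const;

// Hypothetical stand-in for an annotation dictionary. A Map<string, unknown>
// satisfies it; an adapter over pdf-lib's PDFDict is assumed in practice.
interface DictLike {
  delete(key: string): boolean;
}

// Removes the authorship keys in place and returns how many were present.
function scrubAnnotation(annotation: DictLike): number {
  let removed = 0;
  for (const key of ANNOTATION_PII_KEYS) {
    if (annotation.delete(key)) removed++;
  }
  return removed;
}

// Scrubs every annotation on a page (the entries of its /Annots array).
function scrubPageAnnotations(annots: readonly DictLike[]): number {
  return annots.reduce((total, annotation) => total + scrubAnnotation(annotation), 0);
}
```

Keys outside the list (`/Type`, `/Subtype`, `/Rect`, `/AP`) are never touched, which is what keeps the annotation visible after the scrub.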

Deferred:

- `/Names` `/EmbeddedFiles` removal: risky because some PDFs use embedded files for legitimate functionality (PDF/A archive sidecars, attached spreadsheets). Add behind an explicit `dropEmbeddedFiles` option, default off.
- AcroForm scrubbing: form data may legitimately need to round-trip (e.g. invoices with form fields). Default off; opt-in.
- True orphan-object garbage collection: only pursue if step 1 above shows orphans surviving the save. If that happens, evaluate `qpdf-wasm` as a separate POC.
- `/ID` regeneration: pdf-lib already omits `/ID`, so no action — but document this in the strategy file.
- Linearized PDF input: pdf-lib already de-linearizes on save, so cleaned files are non-linearized. Acceptable.
- Encrypted-input handling: `ignoreEncryption: true` already covers the common case (we strip, output is unencrypted). Re-encryption is out of scope.

Effort estimate: ~150 lines of TypeScript additions in `pdf_strategy.ts` + ~50 lines of test scaffolding + 1 new fixture. One PR, mostly sequential edits to a single strategy file.

110
docs/poc/exiv2-wasm.md
Normal file
|
|
@ -0,0 +1,110 @@
|
|||
# exiv2 WASM POC

**Date:** 2026-05-06
**Goal:** Evaluate whether `exiv2` (C++) compiled to WebAssembly is a viable alternative to `little_exif` for client-side metadata stripping in ExifCleaner's web build. exiv2 is a mature library covering JPEG (with full APP segment handling), PNG (tEXt/iTXt/zTXt), TIFF, WebP, HEIF/HEIC/AVIF, JP2, and many RAW formats — in theory closing the gaps that hurt little_exif.

## What was tried

Skipped a from-source compile in favor of the published [`exiv2-wasm`](https://www.npmjs.com/package/exiv2-wasm) package (v0.5.13, MIT, August 2025), which is a thin embind wrapper over Exiv2 + expat + brotli + inih + zlib. It is the only maintained exiv2-WASM package on npm. Other npm hits for "exiv2" were all native bindings (`exiv2`, `exiv2-buffers`, `@11ways/exiv2`), useless for the browser.

Installed in `/tmp/exiv2-poc/`. No source-build attempt was run because the published package already exposes exiv2's API and the API limitation found below would equally apply to a custom from-source build unless we wrote our own C++ strip wrapper — at which point the size argument (below) ends the discussion.

## Bundle size

| Artifact | Size |
| --- | --- |
| Raw WASM (`exiv2.wasm`) | 2.3 MB |
| Gzipped WASM | 925 KB |
| JS glue (`exiv2.js`) | 83 KB |

For comparison, `little_exif` was 330 KB raw / 111 KB gzipped. **exiv2 is roughly 8x bigger gzipped** and 7x bigger raw. There's little headroom — the package already disables CLI, video, NLS, curl, and webready features. zlib + expat + brotli + the BMFF/HEIF parser dominate the binary.

## API surface

The package exposes only:

```ts
read(u8: Uint8Array): { exif, iptc, xmp }
readTagText(u8, key): string | null
readTagBytes(u8, key): Uint8Array | null
writeString(u8, key, value): Uint8Array // new buffer
writeBytes(u8, key, bytes): Uint8Array
```

**There is no `erase`, `delete`, `clearAll`, or strip primitive** — neither in the JS wrapper nor in the WASM symbol table (verified via `strings exiv2.wasm`). To "strip" via this package, you must read all keys and write each with an empty value.
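
The only available workaround, emptying every key one write at a time, can be sketched as follows. This is an illustration under assumptions, not code from the POC: `TagEditor` and `stripByEmptying` are hypothetical names, and the shape of the `exif` object (a string-keyed record) is assumed, since the listing above does not spell it out.

```typescript
// Subset of the wrapper functions listed above that the loop needs.
// The Record<string, string> shape of `exif` is an assumption.
interface TagEditor {
  read(u8: Uint8Array): { exif: Record<string, string> };
  writeString(u8: Uint8Array, key: string, value: string): Uint8Array;
}

// "Strips" by overwriting every EXIF key with an empty string.
// Each write returns a new buffer, so the file is rebuilt once per key.
function stripByEmptying(editor: TagEditor, input: Uint8Array): Uint8Array {
  let buf = input;
  for (const key of Object.keys(editor.read(input).exif)) {
    buf = editor.writeString(buf, key, "");
  }
  return buf;
}
```

Nothing in this loop removes a tag: the key count before and after is the same, and the number of full-file rewrites grows with the number of keys.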

## Functional results

Test fixtures: 1920×1080 JPEG/PNG/TIFF/WebP generated via Pillow and tagged with ExifTool 12.76 (Make, Model, Artist, Software, Comment, ImageDescription, XPTitle/Author/Comment, GPS lat/lon, ResolutionUnit). Strip strategy: enumerate keys returned by `read()` and call `writeString(buf, key, "")` on each.

Output diffed against ExifTool ground truth (`-all= -overwrite_original`).

| Format | exiv2-wasm result | ExifTool result | Gap |
| --- | --- | --- | --- |
| JPEG | **Broken** | Clean | Tags **still exist** with empty values. JFIF segment intact. JPEG Comment marker (FFFE) "HiddenComment" survives. ResolutionUnit becomes `Unknown ()`. |
| PNG | **Broken** | Clean | EXIF tags become empty strings (still present). **PNG tEXt chunks are entirely untouched** — Artist=TestArtist, Comment=HiddenComment, Make/Model/Software all still present in [PNG] group. |
| TIFF | **Broken** | Partial | exiv2 reads TIFF (little_exif could not). All 8 string EXIF tags get empty values; numeric/binary tags become `Unknown ()`. ExifTool itself can only partially clean TIFF (refuses to delete IFD0). |
| WebP | **Broken** | Clean | Same as JPEG/PNG: keys persist with empty values. |

The summary table from the run:

```
file      beforeCount  exiv2AfterCount  exiftoolAfterCount  exiv2Ms
big.jpg   21           21               0                   8.87
big.png   22           22               0                   3.12
big.tiff  14           14               8                   78.85
big.webp  17           17               0                   2.47
```

`writeBytes(buf, key, new Uint8Array(0))` was also tested — same outcome, all 18 EXIF keys still readable afterwards. The library treats empty input as "set value to empty," not "delete tag."

Forensic dump of `exiv2_big.jpg` shows the structural problem clearly: every tag remains in IFD0, just with no string content. ExifTool then renders these as `[IFD0] Make :` (empty), but a downstream consumer reading the raw EXIF block still sees all the tag IDs and offsets — i.e. a fingerprint of what was originally there.

For PNG, `writeString` only touches the eXIf chunk that exiv2 manages; tEXt chunks (`Artist`, `Comment`, `Make`, `Model`, `Software`) are completely ignored by exiv2's string-write path because they're outside the EXIF/IPTC/XMP model exiv2 surfaces.

## Performance

Roughly equivalent to little_exif on JPEG/PNG/WebP (2–9 ms). TIFF is much slower (~79 ms for 6 MB) because the round trip rebuilds the file for every key write — 24 round trips for the keyset. With a real strip API, TIFF would presumably be one pass.

## Direct comparison vs little_exif

| Concern | little_exif | exiv2-wasm | Verdict |
| --- | --- | --- | --- |
| Gzipped size | 111 KB | 925 KB | little_exif wins by 8x |
| JPEG metadata stripping | Partial (leaves Comment + JFIF) | **Worse** — same residue plus all original EXIF tag IDs survive with empty values | little_exif at least clears APP1 cleanly |
| PNG tEXt/iTXt | Untouched | Untouched | Tie (both fail) |
| PNG eXIf | Cleared | Set to empty (still present) | little_exif slightly better |
| TIFF | Errors out | Works, leaves empty tags | exiv2 wins (works at all) |
| WebP | Clean removal | Tags remain empty | little_exif wins |
| Format coverage on paper | jpeg/png/tiff/webp/heif/jxl | + JP2 + RAW formats + IPTC/XMP read | exiv2 wins on read; doesn't matter for strip |
| Strip primitive | `clear_metadata()` exists | **Does not exist in published API** | little_exif wins |

The fundamental issue: the published `exiv2-wasm` is a **read/edit** library, not a **strip** library. Its design assumption is editing individual tags. To use it for stripping you would need to either:

1. Build exiv2 from source with a custom C++ wrapper that calls `Exiv2::Image::clearMetadata()` (or the per-container `clearExifData/clearIptcData/clearXmpData`) — extra integration cost on top of an already 7x-larger binary, with no help for PNG tEXt chunks (those are outside the exiv2 metadata model).
2. Walk the file structure ourselves anyway (which is the hand-rolled solution).

The from-source path is documented in the package's README (build.bash + emcmake). It would take an hour or two to set up and produce a wrapper exposing `Exiv2::Image::clearMetadata()`. But the resulting binary would still be ~2 MB+ and would still not handle PNG text chunks (which exiv2 doesn't model). For ~80 lines of hand-rolled TypeScript per format, it's not worth it.

## Conclusion

**Do not adopt exiv2-wasm.** Either as currently published or via a from-source build, it is the wrong tool for ExifCleaner's web strip pipeline:

- **The published package can't strip metadata** — its API only mutates tag values; the EXIF structure (and PNG text chunks) survive intact.
- **A from-source build is feasible but expensive** — 90 min minimum to set up, ~2 MB binary, and still wouldn't handle PNG tEXt without custom chunk-walking on top.
- **Even at best, it's 7x larger than little_exif**, which itself was rejected for being incomplete.
- **Hand-rolled walkers are the answer** — for JPEG marker scanning and PNG chunk filtering, ~60–80 lines of pure TypeScript each is more thorough, smaller, more transparent, and has no native compilation step.

## Recommended path forward (unchanged from little_exif POC)

The web build should use hand-rolled, format-specific strippers:

- **JPEG**: walk markers, drop APP0–APP15 and COM, keep SOF/DHT/DQT/SOS — pure TS.
- **PNG**: walk chunks, drop tEXt/zTXt/iTXt/eXIf/tIME, keep IHDR/IDAT/IEND/PLTE/tRNS/etc. — pure TS.
- **WebP**: walk RIFF chunks, drop EXIF/XMP/ICCP — pure TS.
- **TIFF**: IFD walk dropping non-essential tags — moderate effort.
- **HEIC/AVIF**: extend the existing ISOBMFF box walker.
- **GIF/BMP/ICO**: trivial or no metadata.
- **RAW**: no viable WASM path; remains an Electron/ExifTool concern.
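
To make the PNG bullet above concrete, here is a minimal sketch of a chunk walker. It is an illustration, not the project's implementation: `stripPngChunks` and `DROP_CHUNKS` are hypothetical names, the drop list is copied from the bullet, and CRCs are neither verified nor recomputed because kept chunks are copied whole.

```typescript
// PNG signature: 89 50 4E 47 0D 0A 1A 0A
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] as const;

// Chunk types treated as metadata in this sketch (taken from the PNG bullet).
const DROP_CHUNKS: ReadonlySet<string> = new Set(["tEXt", "zTXt", "iTXt", "eXIf", "tIME"]);

function stripPngChunks(input: Uint8Array): Uint8Array {
  if (input.length < 8 || !PNG_SIGNATURE.every((b, i) => input[i] === b)) {
    throw new Error("not a PNG (bad signature)");
  }
  const view = new DataView(input.buffer, input.byteOffset, input.byteLength);
  // Output cannot exceed input length because chunks are only ever dropped.
  const out = new Uint8Array(input.length);
  out.set(input.subarray(0, 8), 0);
  let outPos = 8;
  let i = 8;
  let sawIend = false;

  while (i < input.length) {
    // Chunk layout: length (4, big-endian) + type (4) + data + CRC (4).
    if (i + 12 > input.length) {
      throw new Error(`truncated chunk header at offset ${i}`);
    }
    const dataLen = view.getUint32(i);
    const type = String.fromCharCode(
      input[i + 4] ?? 0,
      input[i + 5] ?? 0,
      input[i + 6] ?? 0,
      input[i + 7] ?? 0,
    );
    const chunkEnd = i + 12 + dataLen;
    if (chunkEnd > input.length) {
      throw new Error(`chunk ${type} at offset ${i} extends past end of input`);
    }
    if (!DROP_CHUNKS.has(type)) {
      // Copy the whole chunk, stored CRC included.
      out.set(input.subarray(i, chunkEnd), outPos);
      outPos += chunkEnd - i;
    }
    i = chunkEnd;
    if (type === "IEND") {
      sawIend = true;
      break; // anything after IEND is trailing garbage and is discarded
    }
  }

  if (!sawIend) {
    throw new Error("truncated PNG: no IEND chunk found");
  }
  return out.slice(0, outPos);
}
```

Copying chunks byte-for-byte means the sketch needs no CRC-32 code at all. A real strategy would presumably surface failures as a `Result` rather than throwing, in line with the JPEG strategy in this PR.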

WASM via Rust + wasm-pack remains worthwhile for one specific case — HEIC/AVIF if the hand-rolled box walker proves too thin. exiv2-wasm doesn't change that picture.

69
docs/poc/little-exif-wasm.md
Normal file
|
|
@ -0,0 +1,69 @@
|
|||
# little_exif WASM POC

**Date:** 2026-05-06
**Goal:** Evaluate whether `little_exif` (Rust) compiled to WASM is a viable replacement for the Perl ExifTool binary for stripping metadata from common image formats.

## What was built

A minimal Rust crate wrapping `little_exif 0.6` via `wasm-bindgen`, exposing a single function:

```rust
pub fn clear_metadata(bytes: Vec<u8>, format: &str) -> Result<Vec<u8>, JsValue>
```

Formats supported by the crate: `jpeg`, `png`, `tiff`, `webp`, `heif`, `jxl`.

Built with `wasm-pack build --target web --release`. The bundled `wasm-opt` from wasm-pack 0.14 is too old to handle modern Rust's bulk-memory ops — disabling it via `[package.metadata.wasm-pack.profile.release] wasm-opt = false` resolves the error. Installing a modern binaryen would recover the optimization step.

## Bundle size

| Artifact | Size |
|---|---|
| Raw WASM | 330 KB |
| Gzipped | 111 KB |
| JS glue (`wasm-pack --target web`) | 8.6 KB |

111 KB gzipped is acceptable. The raw size, built here without `wasm-opt`, could realistically drop to ~220 KB with a current `wasm-opt`.

## Functional results

Tested against ExifTool 12.76 as ground truth. Test files: 268-byte synthetic JPEG (rich metadata including GPS, Comment, Artist, Make) and 1920×1080 images for JPEG, PNG, TIFF, WebP.

| Format | little_exif result | ExifTool result | Gap |
|---|---|---|---|
| JPEG | Partial | Clean | Strips EXIF (APP1). Leaves JPEG **Comment marker (0xFFFE)** and **JFIF segment (APP0)** intact. A user-set Comment field survived stripping. |
| PNG | Partial | Clean | Strips eXIf chunk. Leaves **tEXt/iTXt/zTXt chunks** intact — Artist, Software, Comment all survived as PNG text chunks. |
| TIFF | **Fails** | Clean | Returns error: `"TIFF requires XResolution (0x011A) tag!"` — refuses to operate on standard TIFFs generated by Pillow. |
| WebP | Full | Clean | Equivalent to ExifTool. All tags removed. |

Performance is not a concern: all formats ran in under 1.5 ms on test files up to 6 MB.

## Root cause of the gaps

`little_exif` is an **EXIF library**, not a metadata-stripping library. Its `clear_metadata` function clears the EXIF block only:

- **JPEG**: removes APP1 (EXIF), but does not touch APP0 (JFIF density/resolution), COM (Comment marker 0xFFFE), APP2–APP15, or IPTC/XMP in APP13.
- **PNG**: removes the eXIf chunk, but does not touch tEXt, zTXt, or iTXt chunks (which are how most tools write metadata into PNG).
- **TIFF**: the library enforces internal invariants (requires XResolution tag) that prevent stripping mandatory-by-spec tags.

## Conclusion

**Do not use `little_exif` alone as an ExifTool replacement.**

For JPEG and PNG — the two most common user formats — it leaves named metadata fields intact that users expect to be stripped (comments, author, software). This is a correctness gap, not a performance tradeoff.

**WebP** is the one format where it works correctly and could be used directly.

## Recommended path forward

For the formats where `little_exif` falls short, **hand-rolled chunk/segment walkers are the right answer**:

- **JPEG**: walk marker segments (FFEx, FFFE), drop APP0–APP15 and COM, keep SOF/DHT/DQT/SOS. ~80 lines of pure TypeScript. More thorough than `little_exif`.
- **PNG**: walk chunks (tEXt, zTXt, iTXt, eXIf, tIME, bKGD), drop metadata chunks, keep image data chunks. ~60 lines of pure TypeScript.
- **GIF, BMP, ICO**: trivial — minimal or no metadata containers.
- **WebP**: `little_exif` works, or hand-roll RIFF chunk walk (~50 lines).
- **TIFF (basic)**: IFD tag walk, drop non-essential tags. Doable but more involved.
- **HEIC/AVIF**: extend the existing VideoStrategy ISOBMFF box walker (MP4 foundation is already in place).
- **RAW (CR2, CR3, NEF, ARW, ORF, etc.)**: no viable WASM path. ExifTool (Electron) or best-effort-unsupported (web). This is the one category where the Perl dependency may remain indefinitely.
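
The hand-rolled RIFF walk mentioned in the WebP bullet could look roughly like the sketch below. It is an assumption-laden illustration rather than the shipped strategy: `stripWebpChunks`, `DROP_FOURCC`, and `fourCC` are hypothetical names, ICCP is dropped unconditionally here (a real strategy might gate it on `preserveColorProfile`), and input missing a final pad byte is rejected rather than tolerated.

```typescript
// Metadata chunk FourCCs in the WebP container. Note the trailing space in "XMP ".
const DROP_FOURCC: ReadonlySet<string> = new Set(["EXIF", "XMP ", "ICCP"]);

// VP8X feature-flag bits that advertise ICC (0x20), EXIF (0x08), and XMP (0x04).
const VP8X_METADATA_FLAGS = 0x20 | 0x08 | 0x04;

function fourCC(bytes: Uint8Array, offset: number): string {
  return String.fromCharCode(
    bytes[offset] ?? 0,
    bytes[offset + 1] ?? 0,
    bytes[offset + 2] ?? 0,
    bytes[offset + 3] ?? 0,
  );
}

function stripWebpChunks(input: Uint8Array): Uint8Array {
  if (input.length < 12 || fourCC(input, 0) !== "RIFF" || fourCC(input, 8) !== "WEBP") {
    throw new Error("not a WebP (missing RIFF/WEBP header)");
  }
  const view = new DataView(input.buffer, input.byteOffset, input.byteLength);
  const out = new Uint8Array(input.length);
  out.set(input.subarray(0, 12), 0);
  let outPos = 12;
  let i = 12;

  while (i < input.length) {
    // Chunk layout: FourCC (4) + size (4, little-endian) + payload + pad byte if size is odd.
    if (i + 8 > input.length) {
      throw new Error(`truncated chunk header at offset ${i}`);
    }
    const type = fourCC(input, i);
    const size = view.getUint32(i + 4, true);
    const chunkEnd = i + 8 + size + (size & 1);
    if (chunkEnd > input.length) {
      throw new Error(`chunk at offset ${i} extends past end of input`);
    }
    if (!DROP_FOURCC.has(type)) {
      out.set(input.subarray(i, chunkEnd), outPos);
      if (type === "VP8X" && size >= 1) {
        // Clear the flags so the header no longer advertises removed chunks.
        out[outPos + 8] = (out[outPos + 8] ?? 0) & ~VP8X_METADATA_FLAGS;
      }
      outPos += chunkEnd - i;
    }
    i = chunkEnd;
  }

  // The RIFF size field counts everything after the first 8 bytes.
  new DataView(out.buffer).setUint32(4, outPos - 8, true);
  return out.slice(0, outPos);
}
```

Unlike the PNG and JPEG cases, dropping chunks here requires two header fix-ups (the RIFF size and the VP8X flags), which is most of what separates this from a plain filter.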

The toolchain itself (Rust → WASM via wasm-pack, wasm-bindgen) works well and is worth using for formats that genuinely need a library — primarily HEIC/AVIF. For JPEG/PNG/WebP, hand-rolled TypeScript strategies are smaller, more transparent, and more thorough.

1544
docs/superpowers/plans/2026-05-05-phase-b-deployable-webapp.md
Normal file
File diff suppressed because it is too large
51
nginx.conf
Normal file
|
|
@ -0,0 +1,51 @@
|
|||
events {
|
||||
worker_connections 1024;
|
||||
}
|
||||
|
||||
http {
|
||||
include /etc/nginx/mime.types;
|
||||
default_type application/octet-stream;
|
||||
gzip on;
|
||||
gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript application/wasm;
|
||||
gzip_min_length 1024;
|
||||
|
||||
server {
|
||||
listen 80;
|
||||
server_name _;
|
||||
root /usr/share/nginx/html;
|
||||
index index.html;
|
||||
|
||||
# Required for SharedArrayBuffer (multi-threaded WASM)
|
||||
add_header Cross-Origin-Opener-Policy "same-origin" always;
|
||||
add_header Cross-Origin-Embedder-Policy "require-corp" always;
|
||||
|
||||
# Security headers
|
||||
add_header X-Frame-Options "DENY" always;
|
||||
add_header X-Content-Type-Options "nosniff" always;
|
||||
add_header Referrer-Policy "no-referrer" always;
|
||||
add_header Content-Security-Policy "default-src 'none'; script-src 'self' 'wasm-unsafe-eval'; style-src 'self' 'unsafe-inline'; img-src 'self' data: blob:; font-src 'self'; connect-src 'self'; worker-src 'self' blob:; base-uri 'none'; frame-ancestors 'none'" always;
|
||||
|
||||
        # Cache static assets (hashed filenames) for 1 year.
        # nginx inherits add_header from the server level only when a location
        # defines none of its own, so the server-level headers are repeated here.
        location /assets/ {
            expires 1y;
            add_header Cache-Control "public, immutable";
            add_header Cross-Origin-Opener-Policy "same-origin" always;
            add_header Cross-Origin-Embedder-Policy "require-corp" always;
            add_header X-Frame-Options "DENY" always;
            add_header X-Content-Type-Options "nosniff" always;
            add_header Referrer-Policy "no-referrer" always;
            add_header Content-Security-Policy "default-src 'none'; script-src 'self' 'wasm-unsafe-eval'; style-src 'self' 'unsafe-inline'; img-src 'self' data: blob:; font-src 'self'; connect-src 'self'; worker-src 'self' blob:; base-uri 'none'; frame-ancestors 'none'" always;
        }
|
||||
|
||||
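        # Service worker — no cache (must always be fresh).
        # A location with its own add_header does not inherit the server-level
        # ones, so the security headers are repeated to match public/_headers.
        location /sw.js {
            expires -1;
            add_header Cache-Control "no-store, no-cache, must-revalidate";
            add_header Cross-Origin-Opener-Policy "same-origin" always;
            add_header Cross-Origin-Embedder-Policy "require-corp" always;
            add_header X-Frame-Options "DENY" always;
            add_header X-Content-Type-Options "nosniff" always;
            add_header Referrer-Policy "no-referrer" always;
            add_header Content-Security-Policy "default-src 'none'; script-src 'self' 'wasm-unsafe-eval'; style-src 'self' 'unsafe-inline'; img-src 'self' data: blob:; font-src 'self'; connect-src 'self'; worker-src 'self' blob:; base-uri 'none'; frame-ancestors 'none'" always;
        }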
|
||||
|
||||
# SPA fallback — all routes serve index.html
|
||||
location / {
|
||||
try_files $uri $uri/ /index.html;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
@ -27,7 +27,10 @@
|
|||
"release": "echo 'Use GitHub Actions release workflow (workflow_dispatch) instead' && exit 1",
|
||||
"dev": "ELECTRON_RUN_AS_NODE= electron-vite dev",
|
||||
"dev:debug": "ELECTRON_RUN_AS_NODE= electron-vite dev --remote-debugging-port=9222",
|
||||
"dev:web": "vite --config vite.config.web.ts",
|
||||
"compile": "electron-vite build",
|
||||
"build:web": "vite build --config vite.config.web.ts",
|
||||
"preview:web": "vite preview --config vite.config.web.ts",
|
||||
"typecheck": "tsc --noEmit",
|
||||
"preview": "ELECTRON_RUN_AS_NODE= electron-vite preview",
|
||||
"test": "vitest run",
|
||||
|
|
@ -39,6 +42,7 @@
|
|||
},
|
||||
"dependencies": {
|
||||
"jszip": "^3.10.1",
|
||||
"pdf-lib": "^1.17.1",
|
||||
"react": "^19.2.0",
|
||||
"react-dom": "^19.2.0",
|
||||
"zod": "^3.25.0"
|
||||
|
|
@ -56,6 +60,7 @@
|
|||
"prettier": "^3.0",
|
||||
"typescript": "~5.7.0",
|
||||
"vite": "^7.3.1",
|
||||
"vite-plugin-pwa": "^1.3.0",
|
||||
"vitest": "3.2.4"
|
||||
},
|
||||
"build": {
|
||||
|
|
|
|||
19
public/_headers
Normal file
|
|
@ -0,0 +1,19 @@
|
|||
# Cloudflare Pages headers — mirrors nginx.conf for the Docker deploy.
|
||||
# Format: https://developers.cloudflare.com/pages/configuration/headers/
|
||||
|
||||
# Default headers applied to every path
|
||||
/*
|
||||
Cross-Origin-Opener-Policy: same-origin
|
||||
Cross-Origin-Embedder-Policy: require-corp
|
||||
X-Frame-Options: DENY
|
||||
X-Content-Type-Options: nosniff
|
||||
Referrer-Policy: no-referrer
|
||||
Content-Security-Policy: default-src 'none'; script-src 'self' 'wasm-unsafe-eval'; style-src 'self' 'unsafe-inline'; img-src 'self' data: blob:; font-src 'self'; connect-src 'self'; worker-src 'self' blob:; base-uri 'none'; frame-ancestors 'none'
|
||||
|
||||
# Hashed assets are content-addressed; cache aggressively
|
||||
/assets/*
|
||||
Cache-Control: public, immutable, max-age=31536000
|
||||
|
||||
# Service worker must always be fresh so updates roll out
|
||||
/sw.js
|
||||
Cache-Control: no-store, no-cache, must-revalidate
|
||||
BIN
public/icon-192.png
Normal file
Binary file not shown.
BIN
public/icon-512.png
Normal file
Binary file not shown.
25
public/manifest.webmanifest
Normal file
|
|
@ -0,0 +1,25 @@
|
|||
{
|
||||
"name": "ExifCleaner",
|
||||
"short_name": "ExifCleaner",
|
||||
"description": "Remove metadata from your files. 100% private — files never leave your device.",
|
||||
"start_url": "/",
|
||||
"display": "standalone",
|
||||
"background_color": "#1a1a1a",
|
||||
"theme_color": "#2a9d8f",
|
||||
"icons": [
|
||||
{
|
||||
"src": "/icon-192.png",
|
||||
"sizes": "192x192",
|
||||
"type": "image/png",
|
||||
"purpose": "any maskable"
|
||||
},
|
||||
{
|
||||
"src": "/icon-512.png",
|
||||
"sizes": "512x512",
|
||||
"type": "image/png",
|
||||
"purpose": "any maskable"
|
||||
}
|
||||
],
|
||||
"categories": ["utilities", "productivity"],
|
||||
"lang": "en"
|
||||
}
|
||||
|
|
@ -3,6 +3,14 @@
|
|||
// both the ExifTool path and the in-process WASM path consume the same options.
|
||||
export interface StripOptions {
|
||||
readonly preserveOrientation: boolean;
|
||||
/**
|
||||
* When true, retain the embedded ICC color profile (JPEG APP2, PNG iCCP,
|
||||
* etc.). Color fidelity comes at a privacy cost: ICC profiles carry
|
||||
* `cmmId`, creator signature, a `dateTime` timestamp, and free-form
|
||||
* description strings, so preserving the profile keeps a small but real
|
||||
* fingerprint surface attached to the file. Default off; opt in only when
|
||||
* accurate color reproduction matters more than maximum anonymity.
|
||||
*/
|
||||
readonly preserveColorProfile: boolean;
|
||||
readonly preserveTimestamps: boolean;
|
||||
}
|
||||
|
|
|
|||
238
src/infrastructure/wasm/strategies/jpeg_strategy.ts
Normal file
|
|
@ -0,0 +1,238 @@
|
|||
import type { Result } from "../../../common";
|
||||
import type { ExifError } from "../../../domain";
|
||||
import type {
|
||||
FormatStrategy,
|
||||
StripOptions,
|
||||
StripResult,
|
||||
} from "../format_strategy";
|
||||
|
||||
// Hand-rolled JPEG segment walker. Replaces the piexifjs-based ImageStrategy
|
||||
// with a focused metadata strip.
|
||||
//
|
||||
// Policy — mirrors ExifTool's `-all=` behaviour with two deliberate exceptions:
|
||||
//
|
||||
// - APP0 (JFIF), APP1 (EXIF/XMP), APP3..13, APP15 → dropped
|
||||
// - APP14 (Adobe DCT) → always kept (decoder-affecting: some Adobe-encoded
|
||||
// JPEGs decode with wrong colors without it; ExifTool keeps it too)
|
||||
// - APP2 (ICC profile) → dropped by default, kept on opt-in via
|
||||
// `preserveColorProfile`
|
||||
// - COM (comment marker) → dropped
|
||||
// - SOI/EOI/SOF/DHT/DQT/SOS/RST/DRI/DAC/etc. → kept (image data)
|
||||
//
|
||||
// Trade-off worth flagging: with APP0/JFIF dropped, the output needs a
// decoder that tolerates a missing JFIF header. Modern browsers,
// libjpeg/libjpeg-turbo, ImageMagick, and every consumer image pipeline
// handle this fine. A handful of legacy scanner pipelines and embedded
// image libraries are strict JFIF-only and will reject the output. We
// accept this trade-off rather than synthesizing a placeholder APP0: a
// synthesized segment is neutral for privacy but adds a failure mode if
// we get the synthesis wrong.
|
||||
//
|
||||
// Entropy-coded data after SOS is copied byte-for-byte; FF 00 byte-stuffing
|
||||
// and RST0..RST7 restart markers within the stream are preserved.
|
||||
//
|
||||
// Reference: ITU-T T.81 (JPEG) §B.1.1; ExifTool limitations doc.
|
||||
|
||||
const SOI = 0xd8;
|
||||
const EOI = 0xd9;
|
||||
const SOS = 0xda;
|
||||
const TEM = 0x01;
|
||||
const RST_FIRST = 0xd0;
|
||||
const RST_LAST = 0xd7;
|
||||
const APP_FIRST = 0xe0;
|
||||
const APP_LAST = 0xef;
|
||||
const APP2_ICC = 0xe2;
|
||||
const APP14_ADOBE = 0xee;
|
||||
const COM = 0xfe;
|
||||
|
||||
function isStandalone(marker: number): boolean {
|
||||
// Markers without a length field. Standalone markers consume only
|
||||
// the FF + code pair.
|
||||
return (
|
||||
marker === SOI ||
|
||||
marker === EOI ||
|
||||
marker === TEM ||
|
||||
(marker >= RST_FIRST && marker <= RST_LAST)
|
||||
);
|
||||
}
|
||||
|
||||
function shouldDropSegment(marker: number, options: StripOptions): boolean {
|
||||
if (marker >= APP_FIRST && marker <= APP_LAST) {
|
||||
// Adobe APP14 carries DCT encoding info that affects how the image
|
||||
// decodes — never drop. Matches ExifTool's documented default.
|
||||
if (marker === APP14_ADOBE) return false;
|
||||
// ICC profiles are only retained when the user opts in, since they
|
||||
// can identify origin software/devices.
|
||||
if (marker === APP2_ICC) return !options.preserveColorProfile;
|
||||
return true;
|
||||
}
|
||||
if (marker === COM) return true;
|
||||
return false;
|
||||
}
|
||||
|
||||
interface WalkResult {
|
||||
bytes: Uint8Array;
|
||||
droppedSegments: number;
|
||||
}
|
||||
|
||||
function walkJpeg(input: Uint8Array, options: StripOptions): WalkResult {
|
||||
if (
|
||||
input.length < 4 ||
|
||||
input[0] !== 0xff ||
|
||||
input[1] !== SOI ||
|
||||
input[2] !== 0xff
|
||||
) {
|
||||
throw new Error("not a JPEG (missing SOI or first marker)");
|
||||
}
|
||||
|
||||
// Output cannot exceed input length — we only ever drop bytes.
|
||||
const out = new Uint8Array(input.length);
|
||||
let outPos = 0;
|
||||
|
||||
// Copy SOI verbatim.
|
||||
out[outPos++] = 0xff;
|
||||
out[outPos++] = SOI;
|
||||
|
||||
let i = 2;
|
||||
let droppedSegments = 0;
|
||||
let sawEOI = false;
|
||||
|
||||
while (i < input.length) {
|
||||
if (input[i] !== 0xff) {
|
||||
throw new Error(
|
||||
`expected marker prefix 0xFF at offset ${i}, got 0x${(input[i] ?? 0).toString(16)}`,
|
||||
);
|
||||
}
|
||||
// T.81 §B.1.1.2: any marker may be preceded by an arbitrary number of
|
||||
// fill bytes (0xFF). Skip the leading run, keeping the final 0xFF as
|
||||
// the marker prefix. Real-world fill-byte output is rare but legal,
|
||||
// and without this loop a sequence FF FF E1 ... would parse `E1` as
|
||||
// a length and bail with "invalid segment length".
|
||||
while (i + 1 < input.length && input[i + 1] === 0xff) {
|
||||
i++;
|
||||
}
|
||||
const marker = input[i + 1];
|
||||
if (marker === undefined) {
|
||||
throw new Error(`truncated marker at offset ${i}`);
|
||||
}
|
||||
|
||||
// Standalone markers: copy FF + code, no length field.
|
||||
if (isStandalone(marker)) {
|
||||
out[outPos++] = 0xff;
|
||||
out[outPos++] = marker;
|
||||
i += 2;
|
||||
if (marker === EOI) {
|
||||
sawEOI = true;
|
||||
break;
|
||||
}
|
||||
continue;
|
||||
}
|
||||
|
||||
// Length-prefixed marker: 2-byte big-endian length (includes the
|
||||
// length field itself, so payload size = len - 2).
|
||||
if (i + 4 > input.length) {
|
||||
throw new Error(`truncated length field at offset ${i}`);
|
||||
}
|
||||
const lenHi = input[i + 2] ?? 0;
|
||||
const lenLo = input[i + 3] ?? 0;
|
||||
const segLen = (lenHi << 8) | lenLo;
|
||||
if (segLen < 2) {
|
||||
throw new Error(`invalid segment length ${segLen} at offset ${i}`);
|
||||
}
|
||||
const segmentEnd = i + 2 + segLen;
|
||||
if (segmentEnd > input.length) {
|
||||
throw new Error(`segment at offset ${i} extends past end of input`);
|
||||
}
|
||||
|
||||
if (shouldDropSegment(marker, options)) {
|
||||
droppedSegments++;
|
||||
} else {
|
||||
// Copy FF + code + length field + payload verbatim.
|
||||
out.set(input.subarray(i, segmentEnd), outPos);
|
||||
outPos += segmentEnd - i;
|
||||
}
|
||||
|
||||
i = segmentEnd;
|
||||
|
||||
// SOS is followed by entropy-coded image data which must be copied
|
||||
// byte-for-byte. The stream ends at the next non-stuffed,
|
||||
// non-restart marker. FF 00 is byte-stuffing (literal FF inside the
|
||||
// stream); FF D0..FF D7 are restart markers that punctuate the
|
||||
// stream; both are part of the entropy data.
|
||||
if (marker === SOS) {
|
||||
const entropyStart = i;
|
||||
while (i < input.length) {
|
||||
if (input[i] !== 0xff) {
|
||||
i++;
|
||||
continue;
|
||||
}
|
||||
const next = input[i + 1];
|
||||
if (next === 0x00) {
|
||||
i += 2;
|
||||
continue;
|
||||
}
|
||||
if (next !== undefined && next >= RST_FIRST && next <= RST_LAST) {
|
||||
i += 2;
|
||||
continue;
|
||||
}
|
||||
// Real marker: entropy stream ends here.
|
||||
break;
|
||||
}
|
||||
out.set(input.subarray(entropyStart, i), outPos);
|
||||
outPos += i - entropyStart;
|
||||
}
|
||||
}
|
||||
|
||||
// Reaching the end of input without seeing EOI means the file is
|
||||
// truncated. This catches both flavours: an entropy stream that runs off
|
||||
// the end of the buffer, and a file that ends after a complete segment
|
||||
// but never reaches EOI. Some decoders accept missing-EOI input, many
|
||||
// don't — fail loudly rather than silently emit a dubious file.
|
||||
if (!sawEOI) {
|
||||
throw new Error("truncated JPEG: no EOI marker found");
|
||||
}
|
||||
|
||||
return {
|
||||
bytes: out.slice(0, outPos),
|
||||
droppedSegments,
|
||||
};
|
||||
}
|
||||
|
||||
export class JpegStrategy implements FormatStrategy {
|
||||
readonly extensions: ReadonlySet<string> = new Set([".jpg", ".jpeg"]);
|
||||
|
||||
verifyMagicBytes({ bytes }: { bytes: Uint8Array }): boolean {
|
||||
// JPEG SOI (FF D8) followed by the start of the next marker (FF).
|
||||
// Real JPEGs always have a marker immediately after SOI; requiring
|
||||
// byte 2 == 0xFF rejects bare FFD8 garbage.
|
||||
return (
|
||||
bytes.length >= 3 &&
|
||||
bytes[0] === 0xff &&
|
||||
bytes[1] === 0xd8 &&
|
||||
bytes[2] === 0xff
|
||||
);
|
||||
}
|
||||
|
||||
async strip({
|
||||
bytes,
|
||||
options,
|
||||
}: {
|
||||
bytes: Uint8Array;
|
||||
options: StripOptions;
|
||||
}): Promise<Result<StripResult, ExifError>> {
|
||||
try {
|
||||
const { bytes: result, droppedSegments } = walkJpeg(bytes, options);
|
||||
return {
|
||||
ok: true,
|
||||
value: { bytes: result, metadataRemoved: droppedSegments },
|
||||
};
|
||||
} catch (err: unknown) {
|
||||
return {
|
||||
ok: false,
|
||||
error: {
|
||||
code: "invalid-file-format",
|
||||
detail: `Failed to strip JPEG metadata: ${err instanceof Error ? err.message : String(err)}`,
|
||||
},
|
||||
};
|
||||
}
|
||||
}
|
||||
}
|
||||
192
src/infrastructure/wasm/strategies/pdf_strategy.ts
Normal file
|
|
@ -0,0 +1,192 @@
|
|||
import type { Result } from "../../../common";
|
||||
import type { ExifError } from "../../../domain";
|
||||
import type {
|
||||
FormatStrategy,
|
||||
StripOptions,
|
||||
StripResult,
|
||||
} from "../format_strategy";
|
||||
import type { PDFDict } from "pdf-lib";
|
||||
|
||||
// PDF magic bytes: %PDF-
|
||||
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d] as const;
|
||||
|
||||
// Info-dictionary keys that carry metadata. /Trapped is a colour-management
|
||||
// hint; the others are user-visible identity fields.
|
||||
const INFO_KEYS = [
|
||||
"Title",
|
||||
"Author",
|
||||
"Subject",
|
||||
"Keywords",
|
||||
"Producer",
|
||||
"Creator",
|
||||
"CreationDate",
|
||||
"ModDate",
|
||||
"Trapped",
|
||||
] as const;
|
||||
|
||||
// Catalog-level keys that fingerprint the document. /Lang leaks user locale;
|
||||
// /PageLabels can carry section names; /OutputIntents holds colour-management
|
||||
// metadata that may identify a device.
|
||||
const CATALOG_FINGERPRINT_KEYS = [
|
||||
"Lang",
|
||||
"PageLabels",
|
||||
"OutputIntents",
|
||||
] as const;
|
||||
|
||||
// Annotation keys that carry author/comment/timestamp information. The
|
||||
// annotation itself stays (Type/Subtype/Rect/AP) so visibility is preserved;
|
||||
// only authorship is removed.
|
||||
const ANNOTATION_PII_KEYS = [
|
||||
"T", // author
|
||||
"Contents", // comment text
|
||||
"M", // modification date
|
||||
"CreationDate",
|
||||
"RC", // rich content (HTML-like author markup)
|
||||
"Subj", // subject
|
||||
] as const;
|
||||
|
||||
export class PdfStrategy implements FormatStrategy {
|
||||
readonly extensions: ReadonlySet<string> = new Set([".pdf"]);
|
||||
|
||||
verifyMagicBytes({ bytes }: { bytes: Uint8Array }): boolean {
|
||||
if (bytes.length < PDF_MAGIC.length) return false;
|
||||
return PDF_MAGIC.every(
|
||||
(byte: number, i: number) => (bytes[i] ?? 0) === byte,
|
||||
);
|
||||
}
|
||||
|
||||
async strip({
|
||||
bytes,
|
||||
options: _options,
|
||||
}: {
|
||||
bytes: Uint8Array;
|
||||
options: StripOptions;
|
||||
}): Promise<Result<StripResult, ExifError>> {
|
||||
try {
|
||||
const pdfLib = await import("pdf-lib");
|
||||
const { PDFDocument, PDFName } = pdfLib;
|
||||
const PDFDictClass = pdfLib.PDFDict;
|
||||
const PDFArrayClass = pdfLib.PDFArray;
|
||||
const PDFRefClass = pdfLib.PDFRef;
|
||||
|
||||
// updateMetadata: false disables pdf-lib's `updateInfoDict()` — by
|
||||
// default it re-stamps Producer to "pdf-lib (...)", sets ModDate
|
||||
// to "now", and back-fills Creator/CreationDate. None of that is
|
||||
// what we want for a privacy strip; the strip-event time is its
|
||||
// own fingerprint.
|
||||
const doc = await PDFDocument.load(bytes, {
|
||||
ignoreEncryption: true,
|
||||
updateMetadata: false,
|
||||
});
|
||||
|
||||
let removed = 0;
|
||||
|
||||
const tryDelete = (dict: PDFDict, key: string): boolean => {
|
||||
const name = PDFName.of(key);
|
||||
if (dict.has(name)) {
|
||||
dict.delete(name);
|
||||
return true;
|
||||
}
|
||||
return false;
|
||||
};
|
||||
|
||||
const dropIndirect = (ref: unknown): void => {
|
||||
if (ref instanceof PDFRefClass) doc.context.delete(ref);
|
||||
};
|
||||
|
||||
// 1. Empty the Info dict by removing every metadata key. Keeps
|
||||
// the dict itself (some serializers expect /Info present)
|
||||
// but removes all values. With updateMetadata: false, none
|
||||
// of this gets re-added on save.
|
||||
const infoRef = doc.context.trailerInfo.Info;
|
||||
if (infoRef !== undefined) {
|
||||
const infoDict = doc.context.lookup(infoRef);
|
||||
if (infoDict instanceof PDFDictClass) {
|
||||
for (const key of INFO_KEYS) {
|
||||
if (tryDelete(infoDict, key)) removed++;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// 2. Drop the catalog /Metadata reference AND the XMP stream
|
||||
// object. This is the highest-impact privacy fix — XMP is
|
||||
// where authoring-tool fingerprints, original Title/Author
|
||||
// from Word/Acrobat, and dc:* properties live. Just removing
|
||||
// the reference would leave the stream as an orphan in the
|
||||
// file; deleting the indirect object too prevents that.
|
||||
const metadataKey = PDFName.of("Metadata");
|
||||
const metadataRef = doc.catalog.get(metadataKey);
|
||||
if (metadataRef !== undefined) {
|
||||
doc.catalog.delete(metadataKey);
|
||||
dropIndirect(metadataRef);
|
||||
removed++;
|
||||
}
|
||||
|
||||
// 3. Drop catalog-level fingerprints.
|
||||
for (const key of CATALOG_FINGERPRINT_KEYS) {
|
||||
if (tryDelete(doc.catalog, key)) removed++;
|
||||
}
|
||||
|
||||
// 4. Per-page cleanup: page-level metadata + thumbnails +
|
||||
// annotation authorship. Annotation visibility (Type, Subtype,
|
||||
// Rect, AP) is preserved; only the PII keys are removed.
|
||||
for (const page of doc.getPages()) {
|
||||
const node = page.node;
|
||||
|
||||
const pageMetaName = PDFName.of("Metadata");
|
||||
const pageMetaRef = node.get(pageMetaName);
|
||||
if (pageMetaRef !== undefined) {
|
||||
node.delete(pageMetaName);
|
||||
dropIndirect(pageMetaRef);
|
||||
removed++;
|
||||
}
|
||||
|
||||
const thumbName = PDFName.of("Thumb");
|
||||
const thumbRef = node.get(thumbName);
|
||||
if (thumbRef !== undefined) {
|
||||
node.delete(thumbName);
|
||||
dropIndirect(thumbRef);
|
||||
removed++;
|
||||
}
|
||||
|
||||
const annotsRef = node.get(PDFName.of("Annots"));
|
||||
if (annotsRef !== undefined) {
|
||||
const annots = doc.context.lookup(annotsRef);
|
||||
if (annots instanceof PDFArrayClass) {
|
||||
for (let i = 0; i < annots.size(); i++) {
|
||||
const annot = doc.context.lookup(annots.get(i));
|
||||
if (annot instanceof PDFDictClass) {
|
||||
for (const key of ANNOTATION_PII_KEYS) {
|
||||
if (tryDelete(annot, key)) removed++;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// updateFieldAppearances: false avoids re-rendering AcroForm
|
||||
// field visuals during save (irrelevant to metadata removal,
|
||||
// just slow).
|
||||
const outputBytes = await doc.save({
|
||||
updateFieldAppearances: false,
|
||||
});
|
||||
|
||||
return {
|
||||
ok: true,
|
||||
value: {
|
||||
bytes: new Uint8Array(outputBytes),
|
||||
metadataRemoved: removed,
|
||||
},
|
||||
};
|
||||
} catch (err: unknown) {
|
||||
return {
|
||||
ok: false,
|
||||
error: {
|
||||
code: "invalid-file-format",
|
||||
detail: `pdf-lib failed: ${err instanceof Error ? err.message : String(err)}`,
|
||||
},
|
||||
};
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
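The `removed` counter in `PdfStrategy.strip()` only increments when a key was actually present. A minimal sketch of that counting pattern follows, with a plain `Map` standing in for pdf-lib's `PDFDict` so it runs without the library. The `stripKeys` helper, the shortened key list, and the sample dictionary are illustrative assumptions.

```typescript
// Shortened, illustrative key list; the strategy's INFO_KEYS is longer.
const SKETCH_INFO_KEYS = ["Title", "Author", "Producer", "ModDate"] as const;

// Map stands in for PDFDict here (assumption). Map.delete returns true only
// when the key existed, which matches the has-then-delete logic of tryDelete.
function tryDelete(dict: Map<string, string>, key: string): boolean {
	return dict.delete(key);
}

// Remove every listed key and report how many were really there.
function stripKeys(
	dict: Map<string, string>,
	keys: readonly string[],
): number {
	let removed = 0;
	for (const key of keys) {
		if (tryDelete(dict, key)) removed++;
	}
	return removed;
}

const info = new Map<string, string>([
	["Title", "Quarterly report"],
	["Author", "A. Writer"],
	["CustomTag", "not in the list"],
]);
const removedCount = stripKeys(info, SKETCH_INFO_KEYS);
```

One consequence worth noting: a list-driven delete leaves any key outside the list in place, as `CustomTag` is here. Whether non-standard Info keys should also be dropped is a policy question the diff does not address.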
@@ -1,10 +1,14 @@
import { OfficeStrategy } from "./strategies/office_strategy";
import { VideoStrategy } from "./strategies/video_strategy";
import { JpegStrategy } from "./strategies/jpeg_strategy";
import { PdfStrategy } from "./strategies/pdf_strategy";
import type { FormatStrategy } from "./format_strategy";

const STRATEGIES: readonly FormatStrategy[] = [
	new OfficeStrategy(),
	new VideoStrategy(),
	new JpegStrategy(),
	new PdfStrategy(),
];

function getExtension({ filename }: { filename: string }): string | null {
103
src/infrastructure/web/browser_file_bytes.ts
Normal file
@@ -0,0 +1,103 @@
import type { FileBytesPort, FileTimestamps } from "../../application";
import type { Result } from "../../common";
import type { ExifError } from "../../domain";
import type { FileRegistry } from "./file_registry";

// FileBytesPort for the browser:
// - read() → reads File bytes via the File API
// - write() → triggers a browser download with the cleaned bytes
// - exists() → always false (no collision detection needed; download is non-destructive)
// - timestamps → mtime from File.lastModified; setTimestamps is a no-op
export class BrowserFileBytes implements FileBytesPort {
	constructor(private readonly registry: FileRegistry) {}

	async read({
		path,
	}: {
		path: string;
	}): Promise<Result<Uint8Array, ExifError>> {
		const file = this.registry.get(path);
		if (file === undefined) {
			return {
				ok: false,
				error: {
					code: "file-io-error",
					detail: `File not found in browser registry: ${path}`,
				},
			};
		}
		try {
			const ab = await file.arrayBuffer();
			return { ok: true, value: new Uint8Array(ab) };
		} catch (err: unknown) {
			return {
				ok: false,
				error: {
					code: "file-io-error",
					detail: `Failed to read file bytes: ${err instanceof Error ? err.message : String(err)}`,
				},
			};
		}
	}

	async write({
		path,
		bytes,
	}: {
		path: string;
		bytes: Uint8Array;
	}): Promise<Result<void, ExifError>> {
		// The path basename is the correct download filename in both cases:
		// - In-place (saveAsCopy=false): path = "/web-file/abc/photo.jpg" → download "photo.jpg"
		// - Save copy (saveAsCopy=true): path = "/web-file/abc/photo_cleaned.jpg" → download "photo_cleaned.jpg"
		const filename = path.split("/").at(-1) ?? "cleaned-file";
		// Copy to a plain ArrayBuffer — Uint8Array.buffer is typed as ArrayBufferLike
		// which can be SharedArrayBuffer; Blob constructor rejects SharedArrayBuffer.
		// Copying via a new Uint8Array guarantees a plain ArrayBuffer.
		const plainBuffer = new Uint8Array(bytes).buffer;
		const blob = new Blob([plainBuffer], { type: "application/octet-stream" });
		const url = URL.createObjectURL(blob);
		const anchor = document.createElement("a");
		anchor.href = url;
		anchor.download = filename;
		document.body.appendChild(anchor);
		anchor.click();
		document.body.removeChild(anchor);
		// Revoke after a short delay so the download has time to start
		setTimeout(() => URL.revokeObjectURL(url), 1000);
		return { ok: true, value: undefined };
	}

	async exists(_args: { path: string }): Promise<boolean> {
		// Always false — no file collision possible in browser downloads
		return false;
	}

	async getTimestamps({
		path,
	}: {
		path: string;
	}): Promise<Result<FileTimestamps, ExifError>> {
		const file = this.registry.get(path);
		if (file === undefined) {
			return {
				ok: false,
				error: {
					code: "file-io-error",
					detail: `File not found in browser registry: ${path}`,
				},
			};
		}
		const mtime = new Date(file.lastModified);
		return { ok: true, value: { atime: mtime, mtime } };
	}

	async setTimestamps(_args: {
		path: string;
		atime: Date;
		mtime: Date;
	}): Promise<Result<void, ExifError>> {
		// No-op: browser downloads cannot have timestamps set
		return { ok: true, value: undefined };
	}
}
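Two small steps in `write()` can be checked outside a browser: deriving the download name from the virtual path, and copying the bytes into a plain `ArrayBuffer`. The sketch below isolates both. `downloadName` and `toPlainBuffer` are hypothetical helper names, and `downloadName` deliberately uses `||` instead of the `??` in the diff so that an empty basename (a path ending in `/`) also falls back to the default.

```typescript
// Hypothetical helper: last path component, with a fallback name.
// Variation from the diff: `||` also covers the empty-string case.
function downloadName(path: string): string {
	const parts = path.split("/");
	return parts[parts.length - 1] || "cleaned-file";
}

// Hypothetical helper: constructing a Uint8Array from another typed array
// allocates a fresh, non-shared ArrayBuffer sized to the view's length.
function toPlainBuffer(bytes: Uint8Array) {
	return new Uint8Array(bytes).buffer;
}

// A view over shared memory, and a sub-view over a larger buffer.
const sharedView = new Uint8Array(new SharedArrayBuffer(4));
sharedView[0] = 7;
const subView = new Uint8Array([1, 2, 3, 4]).subarray(1, 3);

const plainFromShared = toPlainBuffer(sharedView);
const plainFromSub = toPlainBuffer(subView);
```

The copy costs one extra allocation per file, but it also trims the buffer to the view, so a `subarray` never leaks neighbouring bytes into the download.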
16
src/infrastructure/web/file_registry.ts
Normal file
@@ -0,0 +1,16 @@
// Maps virtual path → File object for the duration of the browser session.
// Paths have the form "/web-file/<uuid>/<filename>" and are stable for the
// lifetime of the FileEntry that holds them.
export class FileRegistry {
	private readonly map = new Map<string, File>();

	register(file: File): string {
		const path = `/web-file/${crypto.randomUUID()}/${file.name}`;
		this.map.set(path, file);
		return path;
	}

	get(path: string): File | undefined {
		return this.map.get(path);
	}
}
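The registry's key property is that two files with the same name still get distinct virtual paths. A runnable sketch of that idea follows, assuming a Node environment: `VirtualPathRegistry` is a hypothetical generic stand-in that accepts any object with a `name`, and it imports `randomUUID` from `node:crypto` rather than using the browser's global `crypto`.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical generic version of the registry; T replaces the browser File
// type so the sketch runs outside a browser.
class VirtualPathRegistry<T extends { name: string }> {
	private readonly map = new Map<string, T>();

	// Returns "/web-file/<uuid>/<name>"; the UUID keeps same-named items apart.
	register(item: T): string {
		const path = `/web-file/${randomUUID()}/${item.name}`;
		this.map.set(path, item);
		return path;
	}

	get(path: string): T | undefined {
		return this.map.get(path);
	}
}

const registry = new VirtualPathRegistry<{ name: string; size: number }>();
const first = { name: "photo.jpg", size: 10 };
const second = { name: "photo.jpg", size: 20 };
const firstPath = registry.register(first);
const secondPath = registry.register(second);
```

Like the class in the diff, this sketch never removes entries, so registered objects stay referenced for as long as the registry lives. If long sessions with large files become a concern, an explicit unregister step would be one option to consider.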
148
src/infrastructure/web/web_api.ts
Normal file
@@ -0,0 +1,148 @@
import type { ElectronApi } from "../../preload/api_types";
import type { Settings, I18nStringsDictionary } from "../../domain";
import {
	DEFAULT_SETTINGS,
	validateSettings,
	formatExifError,
} from "../../domain";
import { WasmProcessor } from "../wasm/wasm_processor";
import { FileRegistry } from "./file_registry";
import { BrowserFileBytes } from "./browser_file_bytes";
// Bundled at build time by Vite — no network request at runtime
import stringsJson from "../../../.resources/strings.json";

const SETTINGS_KEY = "exifcleaner-settings-v1";

function loadSettingsFromStorage(): Settings {
	try {
		const raw = localStorage.getItem(SETTINGS_KEY);
		if (raw === null) return { ...DEFAULT_SETTINGS };
		const parsed: unknown = JSON.parse(raw);
		if (typeof parsed !== "object" || parsed === null)
			return { ...DEFAULT_SETTINGS };
		const result = validateSettings({ input: parsed });
		return result.ok ? result.value : { ...DEFAULT_SETTINGS };
	} catch {
		return { ...DEFAULT_SETTINGS };
	}
}

function saveSettingsToStorage(settings: Settings): void {
	try {
		localStorage.setItem(SETTINGS_KEY, JSON.stringify(settings));
	} catch {
		// localStorage may be unavailable (private browsing quota exhausted)
	}
}

export function makeWebApi(): ElectronApi {
	const registry = new FileRegistry();
	const fileBytes = new BrowserFileBytes(registry);
	const processor = new WasmProcessor({ fileBytes });
	const settingsListeners = new Set<(s: Settings) => void>();
	let currentSettings: Settings = loadSettingsFromStorage();

	return {
		exif: {
			readMetadata: async () => ({}),
			removeMetadata: async () => ({
				data: null,
				error: "ExifTool not available in browser",
			}),
		},

		i18n: {
			getLocale: async () => navigator.language,
			getStrings: async () => stringsJson as unknown as I18nStringsDictionary,
			onLanguageChanged: () => () => {},
		},

		files: {
			basename: (p: string) => p.split("/").at(-1) ?? p,
			getPathForFile: (file: File): string => registry.register(file),
			notifyFilesAdded: () => {},
			notifyFileProcessed: () => {},
			notifyAllFilesProcessed: () => {},
			onFileOpenAddFiles: () => () => {},
		},

		theme: {
			get: async () => ({
				shouldUseDarkColors: window.matchMedia("(prefers-color-scheme: dark)")
					.matches,
			}),
			set: async (mode) => {
				currentSettings = { ...currentSettings, themeMode: mode };
				saveSettingsToStorage(currentSettings);
				return { success: true };
			},
			getAccentColor: async () => ({ color: "#007AFF" }),
			onChanged: (callback) => {
				const mq = window.matchMedia("(prefers-color-scheme: dark)");
				const handler = (e: MediaQueryListEvent): void => {
					callback({ shouldUseDarkColors: e.matches });
				};
				mq.addEventListener("change", handler);
				return () => mq.removeEventListener("change", handler);
			},
			onAccentColorChanged: () => () => {},
		},

		settings: {
			get: async () => ({ ...currentSettings }),
			set: async (updates) => {
				currentSettings = { ...currentSettings, ...updates };
				saveSettingsToStorage(currentSettings);
				settingsListeners.forEach((cb) => cb({ ...currentSettings }));
				return { success: true, error: null };
			},
			onChanged: (callback) => {
				settingsListeners.add(callback);
				return () => settingsListeners.delete(callback);
			},
			onToggle: () => () => {},
		},

		wasm: {
			process: async (filePath, options) => {
				const result = await processor.process({ filePath, options });
				if (!result.ok) {
					return {
						ok: false,
						outputPath: null,
						metadataRemoved: null,
						error: formatExifError(result.error),
					};
				}
				return {
					ok: true,
					outputPath: result.value.outputPath,
					metadataRemoved: result.value.metadataRemoved,
					error: null,
				};
			},
		},

		folder: {
			classify: async (paths) => ({ files: paths, folders: [] }),
			expand: async () => ({
				files: [],
				skippedCount: 0,
				error: "Folder expansion not supported in browser",
			}),
		},

		reveal: {
			showInFolder: async () => ({
				success: false,
				error: "Not supported in browser",
			}),
			showContextMenu: async () => ({ success: false }),
		},

		platform: {
			isMac: false,
			isWeb: true,
		},
	};
}
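`loadSettingsFromStorage()` falls back to defaults on a missing key, malformed JSON, or a failed validation. The sketch below reproduces that fallback chain against an injected store so it can run without `localStorage`. `SketchSettings`, its field values, `KeyValueStore`, `isSketchSettings`, and `memoryStore` are all hypothetical; the real `Settings` shape and `validateSettings` live in the project's domain layer and are not shown in this diff.

```typescript
// Hypothetical settings shape; the project's real Settings type is not shown.
type SketchSettings = {
	themeMode: "light" | "dark" | "system";
	saveAsCopy: boolean;
};

// The only part of the Storage API that loading needs.
interface KeyValueStore {
	getItem(key: string): string | null;
}

const SKETCH_DEFAULTS: SketchSettings = { themeMode: "system", saveAsCopy: false };

// Stand-in for validateSettings: a plain type guard.
function isSketchSettings(value: unknown): value is SketchSettings {
	if (typeof value !== "object" || value === null) return false;
	const record = value as Record<string, unknown>;
	const mode = record.themeMode;
	return (
		(mode === "light" || mode === "dark" || mode === "system") &&
		typeof record.saveAsCopy === "boolean"
	);
}

// Every failure path returns a fresh copy of the defaults.
function loadSettings(store: KeyValueStore, key: string): SketchSettings {
	try {
		const raw = store.getItem(key);
		if (raw === null) return { ...SKETCH_DEFAULTS };
		const parsed: unknown = JSON.parse(raw);
		return isSketchSettings(parsed) ? parsed : { ...SKETCH_DEFAULTS };
	} catch {
		return { ...SKETCH_DEFAULTS };
	}
}

// In-memory store for exercising the function outside a browser.
function memoryStore(entries: Map<string, string>): KeyValueStore {
	return { getItem: (key) => entries.get(key) ?? null };
}
```

Returning a copy of the defaults rather than the shared constant means a caller that mutates the loaded object cannot corrupt later fallbacks, which is the same reason the diff spreads `DEFAULT_SETTINGS`.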
@@ -74,6 +74,10 @@ export interface RevealApi {

export interface PlatformApi {
	isMac: boolean;
	// true only in the browser webapp build (web_api.ts); false in Electron (preload/index.ts).
	// Renderer uses this to route ALL supported files through WasmProcessor in web context,
	// since ExifTool is unavailable there.
	isWeb: boolean;
}

export interface WasmApi {
@@ -170,6 +170,7 @@ const api: ElectronApi = {

	platform: {
		isMac: process.platform === "darwin",
		isWeb: false,
	},

	wasm: {
@@ -1,4 +1,5 @@
import { useI18n } from "../../hooks/use_i18n";
import { FileBrowseButton } from "./FileBrowseButton";

export function EmptyState(): React.JSX.Element {
	const { t } = useI18n();
@@ -24,6 +25,7 @@ export function EmptyState(): React.JSX.Element {
				</svg>
				<h1 className="empty-state__title">{t("empty.title")}</h1>
				<p className="empty-state__subtitle">{t("empty.subtitle")}</p>
				<FileBrowseButton />
			</div>
		</section>
	);
128
src/renderer/components/ui/FileBrowseButton.tsx
Normal file
@@ -0,0 +1,128 @@
// Web-only file picker button. In Electron, the File > Open menu and the
// native dialog cover this entry point — adding a redundant in-window button
// would clutter the desktop UX. This component returns null whenever
// window.api.platform.isWeb is false.

import { useRef, useCallback } from "react";
import type { Dispatch } from "react";
import { useAppContext } from "../../contexts/AppContext";
import type { AppAction, FileEntry } from "../../contexts/AppContext";
import { useProcessFiles } from "../../hooks/use_process_files";
import { useI18n } from "../../hooks/use_i18n";
import { FileProcessingStatus, isSupportedFile } from "../../../domain";
import { getFileExtension } from "../../utils/get_file_extension";

// Mirror the platform gate as a pure predicate so tests can verify the
// invariant ("button is web-only") without rendering React.
export function shouldRenderFileBrowseButton({
	isWeb,
}: {
	isWeb: boolean;
}): boolean {
	return isWeb;
}

// Extracted so unit tests can exercise the registration + dispatch path
// without rendering the component or touching the DOM.
export function handleSelectedFiles({
	files,
	dispatch,
	processFiles,
}: {
	files: readonly File[];
	dispatch: Dispatch<AppAction>;
	processFiles: (entries: FileEntry[]) => void;
}): FileEntry[] {
	const entries: FileEntry[] = [];
	for (const file of files) {
		if (!isSupportedFile({ filename: file.name })) {
			continue;
		}
		// Mirror DropZone: register File with the web shim's FileRegistry to
		// receive a stable virtual path, then build a FileEntry around it.
		const path = window.api.files.getPathForFile(file);
		entries.push({
			id: crypto.randomUUID(),
			path,
			name: file.name,
			extension: getFileExtension({ filename: file.name }),
			size: file.size,
			folder: null,
			status: FileProcessingStatus.Pending,
			beforeTags: null,
			afterTags: null,
			beforeMetadata: null,
			afterMetadata: null,
			error: null,
		});
	}

	if (entries.length > 0) {
		dispatch({ type: "ADD_FILES", files: entries });
		processFiles(entries);
	}

	return entries;
}

// "default" is the prominent box button used in the empty state. "compact"
// is the inline link-style button shown in the StatusBar after files exist,
// so users can keep adding to the batch without clearing first.
export type FileBrowseButtonVariant = "default" | "compact";

export function FileBrowseButton({
	variant = "default",
}: {
	variant?: FileBrowseButtonVariant;
} = {}): React.JSX.Element | null {
	const { t } = useI18n();
	const { dispatch } = useAppContext();
	const { processFiles } = useProcessFiles();
	const inputRef = useRef<HTMLInputElement>(null);

	const handleClick = useCallback((): void => {
		inputRef.current?.click();
	}, []);

	const handleChange = useCallback(
		(e: React.ChangeEvent<HTMLInputElement>): void => {
			const fileList = e.target.files;
			if (fileList === null) return;
			const files = Array.from(fileList);
			handleSelectedFiles({ files, dispatch, processFiles });
			// Reset so picking the same file again still fires onChange.
			e.target.value = "";
		},
		[dispatch, processFiles],
	);

	// Platform gate: only the deployed webapp shows this button. The Electron
	// build relies on the existing File > Open menu / drag-and-drop. Hooks
	// above this guard run unconditionally so React's hook order stays stable.
	if (!shouldRenderFileBrowseButton({ isWeb: window.api.platform.isWeb })) {
		return null;
	}

	const isCompact = variant === "compact";
	const buttonClassName = isCompact
		? "status-bar__button"
		: "file-browse-button";
	const label = t(isCompact ? "statusBar.addFiles" : "empty.browseButton");

	return (
		<>
			<button type="button" className={buttonClassName} onClick={handleClick}>
				{label}
			</button>
			<input
				ref={inputRef}
				type="file"
				multiple
				className="file-browse-button__input"
				onChange={handleChange}
				aria-hidden="true"
				tabIndex={-1}
			/>
		</>
	);
}
@@ -3,6 +3,7 @@

import type { ReactNode } from "react";
import { useI18n } from "../../hooks/use_i18n";
import { FileBrowseButton } from "./FileBrowseButton";

function interpolate(
	template: string,
@@ -53,6 +54,7 @@ export function StatusBar({
						seconds: elapsedSeconds ?? 0,
					})}
				</div>
				<FileBrowseButton variant="compact" />
				{onCleanMore !== undefined && (
					<button
						className="status-bar__button"
2
src/renderer/env.d.ts
vendored
@@ -1,3 +1,5 @@
/// <reference types="vite/client" />

import type { ElectronApi } from "../preload/api_types";

declare global {
@@ -20,12 +20,16 @@ export async function processFileEntries(
 ): Promise<void> {
 	window.api.files.notifyFilesAdded(entries.length);
 
+	// In the web build, ExifTool is unavailable — all files go through WasmProcessor.
+	// In Electron, only formats ExifTool can't handle (Office, fragmented MP4) use WASM.
+	const isWebBuild = window.api.platform.isWeb;
+	const usesWasm = (entry: FileEntry): boolean =>
+		isWasmHandled({ extension: entry.extension }) || isWebBuild;
+
 	// Fetch settings once per batch when needed. The exif path doesn't consume
 	// renderer-side settings (main reads its own), so we only pay this IPC if
 	// at least one entry will go through the WASM path.
-	const hasWasmEntries = entries.some((e) =>
-		isWasmHandled({ extension: e.extension }),
-	);
+	const hasWasmEntries = entries.some((e) => usesWasm(e));
 	let wasmOptions: WasmOptions | null = null;
 	if (hasWasmEntries) {
 		const settings = await window.api.settings.get();
@@ -39,7 +43,7 @@ export async function processFileEntries(
 
 	for (const entry of entries) {
 		try {
-			if (isWasmHandled({ extension: entry.extension })) {
+			if (usesWasm(entry)) {
 				// wasmOptions is non-null here because hasWasmEntries was true.
 				await processViaWasm({ entry, dispatch, options: wasmOptions! });
 			} else {
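The routing change in `processFileEntries` reduces to one predicate: an entry goes through WASM if its extension is WASM-handled or the build is the web build. A small sketch of that rule follows, with an illustrative extension set. `.m4v`, `.3gp` and `.3g2` appear in the diff, `.docx` is an assumed example of an Office format, and `routeFor` is a hypothetical helper.

```typescript
// Illustrative subset of WASM-handled extensions (".docx" is an assumption).
const SKETCH_WASM_EXTENSIONS: ReadonlySet<string> = new Set([
	".docx",
	".m4v",
	".3gp",
	".3g2",
]);

// Same rule as the diff's usesWasm, written as a pure function of its inputs.
function usesWasm(extension: string, isWeb: boolean): boolean {
	return isWeb || SKETCH_WASM_EXTENSIONS.has(extension);
}

// Hypothetical helper naming the two processing paths.
function routeFor(extension: string, isWeb: boolean): "wasm" | "exiftool" {
	return usesWasm(extension, isWeb) ? "wasm" : "exiftool";
}

const electronRoutes = [".jpg", ".pdf", ".m4v"].map((ext) => routeFor(ext, false));
const webRoutes = [".jpg", ".pdf", ".m4v"].map((ext) => routeFor(ext, true));
```

Because `isWeb` short-circuits the check, the web build never reaches the ExifTool branch, which matters because the web shim's `exif.removeMetadata` only returns an error.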
@@ -6,6 +6,7 @@ import "./styles/tokens.css";
import "./styles/app.css";
import "./styles/drop_zone.css";
import "./styles/empty_state.css";
import "./styles/file_browse_button.css";
import "./styles/error_boundary.css";
import "./styles/file_list.css";
import "./styles/file_table.css";
43
src/renderer/styles/file_browse_button.css
Normal file
@@ -0,0 +1,43 @@
.file-browse-button {
	appearance: none;
	border: 1px solid var(--ec-color-border);
	background: var(--ec-color-surface);
	color: var(--ec-color-text);
	font-family: var(--ec-font-family);
	font-size: var(--ec-font-size-small);
	font-weight: var(--ec-font-weight-semibold);
	padding: var(--ec-space-2) var(--ec-space-4);
	border-radius: 6px;
	cursor: pointer;
}

.file-browse-button:hover {
	border-color: var(--ec-color-accent);
	color: var(--ec-color-accent);
}

.file-browse-button:focus-visible {
	outline: 2px solid var(--ec-color-accent);
	outline-offset: 2px;
}

@media (prefers-reduced-motion: no-preference) {
	.file-browse-button {
		transition: border-color var(--ec-duration-fast) var(--ec-ease-out),
			color var(--ec-duration-fast) var(--ec-ease-out);
	}
}

/* Hidden file input — kept in the DOM so the visible button can trigger it
   programmatically, but never shown to sighted or keyboard users. */
.file-browse-button__input {
	position: absolute;
	width: 1px;
	height: 1px;
	padding: 0;
	margin: -1px;
	overflow: hidden;
	clip: rect(0, 0, 0, 0);
	white-space: nowrap;
	border: 0;
}
@@ -10,6 +10,9 @@ export const WASM_HANDLED_EXTENSIONS: ReadonlySet<string> = new Set([
	".m4v",
	".3gp",
	".3g2",
	// In Electron, images and PDFs use ExifTool (more thorough than browser-side libs).
	// In the web build, ALL supported files go through WasmProcessor because ExifTool
	// is unavailable — the platform.isWeb flag extends routing in use_process_files.ts.
]);

export function isWasmHandled({ extension }: { extension: string }): boolean {
17
src/web/index.html
Normal file
@@ -0,0 +1,17 @@
<!doctype html>
<html lang="en">
	<head>
		<meta charset="utf-8" />
		<meta name="viewport" content="width=device-width, initial-scale=1" />
		<title>ExifCleaner</title>
		<link rel="icon" type="image/png" href="/icon-192.png" />
		<link rel="manifest" href="/manifest.webmanifest" />
		<meta name="theme-color" content="#2a9d8f" />
		<meta name="apple-mobile-web-app-capable" content="yes" />
		<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
	</head>
	<body>
		<div id="root"></div>
		<script type="module" src="./main.tsx"></script>
	</body>
</html>
38
src/web/main.tsx
Normal file
@@ -0,0 +1,38 @@
import { StrictMode } from "react";
import { createRoot } from "react-dom/client";
import { makeWebApi } from "../infrastructure/web/web_api";
import { App } from "../renderer/App";
import "../renderer/styles/reset.css";
import "../renderer/styles/tokens.css";
import "../renderer/styles/app.css";
import "../renderer/styles/drop_zone.css";
import "../renderer/styles/empty_state.css";
import "../renderer/styles/file_browse_button.css";
import "../renderer/styles/error_boundary.css";
import "../renderer/styles/file_list.css";
import "../renderer/styles/file_table.css";
import "../renderer/styles/folder_row.css";
import "../renderer/styles/status_bar.css";
import "../renderer/styles/status_icon.css";
import "../renderer/styles/toast.css";
import "../renderer/styles/type_pill.css";
// Additional styles added after B1 plan was written
import "../renderer/styles/metadata_expansion.css";
import "../renderer/styles/language_dropdown.css";
import "../renderer/styles/settings_drawer.css";
import "../renderer/styles/segmented_control.css";
import "../renderer/styles/toggle_switch.css";

// Set up browser API before React mounts — no IPC, no Electron
window.api = makeWebApi();

const rootEl = document.getElementById("root");
if (rootEl === null) {
	throw new Error("Root element #root not found");
}

createRoot(rootEl).render(
	<StrictMode>
		<App />
	</StrictMode>,
);
BIN
tests/fixtures/wasm/images/sample.jpg
vendored
Normal file
Binary file not shown. Size: 103 B
BIN
tests/fixtures/wasm/pdf/sample.pdf
vendored
Normal file
Binary file not shown.
510
tests/infrastructure/wasm/jpeg_strategy.test.ts
Normal file
510
tests/infrastructure/wasm/jpeg_strategy.test.ts
Normal file
|
|
@ -0,0 +1,510 @@
|
|||
import { describe, it, expect } from "vitest";
|
||||
import { readFile } from "node:fs/promises";
|
||||
import { resolve, dirname } from "node:path";
|
||||
import { fileURLToPath } from "node:url";
|
||||
import { JpegStrategy } from "../../../src/infrastructure/wasm/strategies/jpeg_strategy";
|
||||
import type { StripOptions } from "../../../src/infrastructure/wasm/format_strategy";
|
||||
|
||||
const HERE = dirname(fileURLToPath(import.meta.url));
|
||||
const DEFAULT_OPTIONS: StripOptions = {
|
||||
preserveOrientation: false,
|
||||
preserveColorProfile: false,
|
||||
preserveTimestamps: false,
|
||||
};
|
||||
|
||||
// Synthetic JPEG builder: SOI, the listed segments in order, a minimal SOS,
// the entropy bytes verbatim, and EOI. Each segment is `[FF, code, lenHi,
// lenLo, ...payload]` where len includes the 2 length bytes. `fillBytes`
// optionally inserts extra 0xFF bytes immediately before the segment's
// marker code (T.81 §B.1.1.2 fill bytes).
type Segment = { code: number; payload: Uint8Array; fillBytes?: number };

function makeJpeg(
  segments: readonly Segment[],
  entropy: Uint8Array,
): Uint8Array {
  const parts: number[] = [0xff, 0xd8]; // SOI
  for (const seg of segments) {
    parts.push(0xff);
    // Optional T.81 fill bytes between marker prefix and code.
    const fill = seg.fillBytes ?? 0;
    for (let f = 0; f < fill; f++) parts.push(0xff);
    parts.push(seg.code);
    const len = seg.payload.length + 2;
    parts.push((len >> 8) & 0xff, len & 0xff);
    for (const b of seg.payload) parts.push(b);
  }
  // Minimal SOS — payload is just two zero bytes; not a decodable scan
  // header, but the walker only cares about marker structure.
  parts.push(0xff, 0xda, 0x00, 0x04, 0x00, 0x00);
  for (const b of entropy) parts.push(b);
  parts.push(0xff, 0xd9); // EOI
  return new Uint8Array(parts);
}

// Naive scan for the first `FF <code>` byte pair. Returns its offset or -1.
// This does not parse segment structure, so it can also match inside a payload.
function findMarker(bytes: Uint8Array, code: number): number {
  for (let i = 0; i + 1 < bytes.length; i++) {
    if (bytes[i] === 0xff && bytes[i + 1] === code) return i;
  }
  return -1;
}

describe("JpegStrategy — extension and magic byte", () => {
|
||||
const strategy = new JpegStrategy();
|
||||
|
||||
it("claims only JPEG extensions", () => {
|
||||
expect(strategy.extensions.has(".jpg")).toBe(true);
|
||||
expect(strategy.extensions.has(".jpeg")).toBe(true);
|
||||
expect(strategy.extensions.has(".png")).toBe(false);
|
||||
expect(strategy.extensions.has(".webp")).toBe(false);
|
||||
});
|
||||
|
||||
it("verifies JPEG magic bytes (FFD8 followed by FF marker prefix)", () => {
|
||||
const jpegMagic = new Uint8Array([0xff, 0xd8, 0xff, 0xe0]);
|
||||
expect(strategy.verifyMagicBytes?.({ bytes: jpegMagic })).toBe(true);
|
||||
});
|
||||
|
||||
it("rejects FFD8 prefix not followed by FF", () => {
|
||||
const garbage = new Uint8Array([0xff, 0xd8, 0x00]);
|
||||
expect(strategy.verifyMagicBytes?.({ bytes: garbage })).toBe(false);
|
||||
});
|
||||
|
||||
it("rejects PNG, BMP, PDF prefixes", () => {
|
||||
const png = new Uint8Array([0x89, 0x50, 0x4e, 0x47]);
|
||||
const bmp = new Uint8Array([0x42, 0x4d, 0x00]);
|
||||
const pdf = new Uint8Array([0x25, 0x50, 0x44, 0x46]);
|
||||
expect(strategy.verifyMagicBytes?.({ bytes: png })).toBe(false);
|
||||
expect(strategy.verifyMagicBytes?.({ bytes: bmp })).toBe(false);
|
||||
expect(strategy.verifyMagicBytes?.({ bytes: pdf })).toBe(false);
|
||||
});
|
||||
|
||||
it("returns false for inputs shorter than 3 bytes", () => {
|
||||
expect(strategy.verifyMagicBytes?.({ bytes: new Uint8Array(0) })).toBe(
|
||||
false,
|
||||
);
|
||||
expect(
|
||||
strategy.verifyMagicBytes?.({ bytes: new Uint8Array([0xff, 0xd8]) }),
|
||||
).toBe(false);
|
||||
});
|
||||
});
|
||||
|
||||
describe("JpegStrategy — drops APP and COM segments by default", () => {
|
||||
const strategy = new JpegStrategy();
|
||||
|
||||
it("drops JFIF/APP0 (FFE0)", async () => {
|
||||
const input = makeJpeg(
|
||||
[{ code: 0xe0, payload: new Uint8Array([0x4a, 0x46, 0x49, 0x46, 0x00]) }],
|
||||
new Uint8Array([0x12, 0x34]),
|
||||
);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
expect(findMarker(result.value.bytes, 0xe0)).toBe(-1);
|
||||
expect(result.value.metadataRemoved).toBe(1);
|
||||
});
|
||||
|
||||
it("drops EXIF/APP1 (FFE1)", async () => {
|
||||
const input = makeJpeg(
|
||||
[
|
||||
{
|
||||
code: 0xe1,
|
||||
payload: new Uint8Array([0x45, 0x78, 0x69, 0x66, 0x00, 0x00]),
|
||||
},
|
||||
],
|
||||
new Uint8Array([0x12, 0x34]),
|
||||
);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
expect(findMarker(result.value.bytes, 0xe1)).toBe(-1);
|
||||
});
|
||||
|
||||
it("drops the COM marker (FFFE)", async () => {
|
||||
const input = makeJpeg(
|
||||
[
|
||||
{
|
||||
code: 0xfe,
|
||||
payload: new Uint8Array([
|
||||
0x53, 0x65, 0x6e, 0x73, 0x69, 0x74, 0x69, 0x76, 0x65,
|
||||
]),
|
||||
},
|
||||
],
|
||||
new Uint8Array([0x12, 0x34]),
|
||||
);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
expect(findMarker(result.value.bytes, 0xfe)).toBe(-1);
|
||||
});
|
||||
|
||||
it("drops APP13 (Photoshop/IPTC, FFED)", async () => {
|
||||
const input = makeJpeg(
|
||||
[{ code: 0xed, payload: new Uint8Array([0x38, 0x42, 0x49, 0x4d]) }],
|
||||
new Uint8Array([0x12, 0x34]),
|
||||
);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
expect(findMarker(result.value.bytes, 0xed)).toBe(-1);
|
||||
});
|
||||
|
||||
it("drops APP15 (FFEF)", async () => {
|
||||
const input = makeJpeg(
|
||||
[{ code: 0xef, payload: new Uint8Array([0x00, 0x00]) }],
|
||||
new Uint8Array([0x12, 0x34]),
|
||||
);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
expect(findMarker(result.value.bytes, 0xef)).toBe(-1);
|
||||
});
|
||||
|
||||
it("drops every APP marker plus COM in a single pass and counts them", async () => {
|
||||
const segments: Segment[] = [
|
||||
{ code: 0xe0, payload: new Uint8Array([0x00]) },
|
||||
{ code: 0xe1, payload: new Uint8Array([0x00]) },
|
||||
{ code: 0xed, payload: new Uint8Array([0x00]) },
|
||||
{ code: 0xfe, payload: new Uint8Array([0x00]) },
|
||||
];
|
||||
const input = makeJpeg(segments, new Uint8Array([0x12]));
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
expect(result.value.metadataRemoved).toBe(4);
|
||||
});
|
||||
});
|
||||
|
||||
describe("JpegStrategy — preserves required markers", () => {
|
||||
const strategy = new JpegStrategy();
|
||||
|
||||
it("preserves Adobe APP14 (FFEE) — required for correct Adobe DCT decoding", async () => {
|
||||
const adobePayload = new Uint8Array([
|
||||
0x41, 0x64, 0x6f, 0x62, 0x65, 0x00, 0x64, 0x00, 0x00, 0x00, 0x00, 0x00,
|
||||
]);
|
||||
const input = makeJpeg(
|
||||
[
|
||||
{ code: 0xe0, payload: new Uint8Array([0x00]) }, // JFIF (drop)
|
||||
{ code: 0xee, payload: adobePayload }, // Adobe APP14 (keep)
|
||||
],
|
||||
new Uint8Array([0x12]),
|
||||
);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
const adobeOffset = findMarker(result.value.bytes, 0xee);
|
||||
expect(adobeOffset).toBeGreaterThan(0);
|
||||
// Verify the payload is intact ("Adobe" signature in first bytes)
|
||||
expect(result.value.bytes[adobeOffset + 4]).toBe(0x41); // 'A'
|
||||
expect(result.value.bytes[adobeOffset + 5]).toBe(0x64); // 'd'
|
||||
});
|
||||
|
||||
it("preserves DQT (FFDB) and DHT (FFC4) image-data segments", async () => {
|
||||
const input = makeJpeg(
|
||||
[
|
||||
{ code: 0xdb, payload: new Uint8Array([0x00, 0x10, 0x20]) }, // DQT
|
||||
{ code: 0xc4, payload: new Uint8Array([0x00, 0x10]) }, // DHT
|
||||
{ code: 0xe1, payload: new Uint8Array([0x00]) }, // EXIF (drop)
|
||||
],
|
||||
new Uint8Array([0x12]),
|
||||
);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
expect(findMarker(result.value.bytes, 0xdb)).toBeGreaterThan(0);
|
||||
expect(findMarker(result.value.bytes, 0xc4)).toBeGreaterThan(0);
|
||||
expect(findMarker(result.value.bytes, 0xe1)).toBe(-1);
|
||||
});
|
||||
|
||||
it("preserves EOI (FFD9) at the end", async () => {
|
||||
const input = makeJpeg(
|
||||
[{ code: 0xe1, payload: new Uint8Array([0x00]) }],
|
||||
new Uint8Array([0x12]),
|
||||
);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
const last = result.value.bytes;
|
||||
expect(last[last.length - 2]).toBe(0xff);
|
||||
expect(last[last.length - 1]).toBe(0xd9);
|
||||
});
|
||||
});
|
||||
|
||||
describe("JpegStrategy — preserveColorProfile", () => {
|
||||
const strategy = new JpegStrategy();
|
||||
|
||||
it("drops APP2 (ICC) by default", async () => {
|
||||
const input = makeJpeg(
|
||||
[{ code: 0xe2, payload: new Uint8Array([0x49, 0x43, 0x43, 0x5f]) }],
|
||||
new Uint8Array([0x12]),
|
||||
);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
expect(findMarker(result.value.bytes, 0xe2)).toBe(-1);
|
||||
});
|
||||
|
||||
it("keeps APP2 (ICC) when preserveColorProfile is true", async () => {
|
||||
const input = makeJpeg(
|
||||
[{ code: 0xe2, payload: new Uint8Array([0x49, 0x43, 0x43, 0x5f]) }],
|
||||
new Uint8Array([0x12]),
|
||||
);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: { ...DEFAULT_OPTIONS, preserveColorProfile: true },
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
expect(findMarker(result.value.bytes, 0xe2)).toBeGreaterThan(0);
|
||||
});
|
||||
});
|
||||
|
||||
describe("JpegStrategy — entropy-stream byte fidelity", () => {
|
||||
const strategy = new JpegStrategy();
|
||||
|
||||
it("preserves bytes 0x80–0x9F (windows-1252 corruption range) verbatim", async () => {
|
||||
// The exact bytes that the previous TextDecoder("latin1") path would
|
||||
// have corrupted. They must survive the walker untouched.
|
||||
const entropy = new Uint8Array([
|
||||
0x80, 0x82, 0x88, 0x91, 0x95, 0x9f, 0xa0, 0x42, 0x55,
|
||||
]);
|
||||
const input = makeJpeg(
|
||||
[{ code: 0xe1, payload: new Uint8Array([0x00]) }],
|
||||
entropy,
|
||||
);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
// Output: SOI + SOS marker + len + 2-byte SOS payload + entropy + EOI.
|
||||
// SOI=2, SOS marker+len=4, SOS payload=2 → entropy starts at offset 8.
|
||||
const entropyStart = 8;
|
||||
const entropyEnd = result.value.bytes.length - 2; // strip EOI
|
||||
const got = Array.from(
|
||||
result.value.bytes.subarray(entropyStart, entropyEnd),
|
||||
);
|
||||
expect(got).toEqual(Array.from(entropy));
|
||||
});
|
||||
|
||||
it("preserves FF 00 byte-stuffing within the entropy stream", async () => {
|
||||
// FF 00 means a literal 0xFF inside the entropy data — must survive.
|
||||
const entropy = new Uint8Array([0x12, 0xff, 0x00, 0x34, 0xff, 0x00, 0x56]);
|
||||
const input = makeJpeg([], entropy);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
const entropyStart = 8;
|
||||
const entropyEnd = result.value.bytes.length - 2;
|
||||
const got = Array.from(
|
||||
result.value.bytes.subarray(entropyStart, entropyEnd),
|
||||
);
|
||||
expect(got).toEqual(Array.from(entropy));
|
||||
});
|
||||
|
||||
it("preserves RST0..RST7 restart markers within the entropy stream", async () => {
|
||||
const entropy = new Uint8Array([0x12, 0xff, 0xd0, 0x34, 0xff, 0xd7, 0x56]);
|
||||
const input = makeJpeg([], entropy);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
const entropyStart = 8;
|
||||
const entropyEnd = result.value.bytes.length - 2;
|
||||
const got = Array.from(
|
||||
result.value.bytes.subarray(entropyStart, entropyEnd),
|
||||
);
|
||||
expect(got).toEqual(Array.from(entropy));
|
||||
});
|
||||
});
|
||||
|
||||
describe("JpegStrategy — error handling", () => {
|
||||
const strategy = new JpegStrategy();
|
||||
|
||||
it("returns an error for input that doesn't start with SOI", async () => {
|
||||
const input = new Uint8Array([0x89, 0x50, 0x4e, 0x47]); // PNG magic
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(false);
|
||||
});
|
||||
|
||||
it("returns an error for input shorter than the SOI + first marker", async () => {
|
||||
const input = new Uint8Array([0xff, 0xd8]);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(false);
|
||||
});
|
||||
|
||||
it("returns an error for a segment whose length runs off the end", async () => {
|
||||
// FF E1 declares length 0xFFFF, but only one byte follows the length field
|
||||
const input = new Uint8Array([0xff, 0xd8, 0xff, 0xe1, 0xff, 0xff, 0x00]);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(false);
|
||||
});
|
||||
});
|
||||
|
||||
describe("JpegStrategy — fill bytes (T.81 §B.1.1.2)", () => {
|
||||
const strategy = new JpegStrategy();
|
||||
|
||||
it("strips a JPEG with fill bytes before APP1 cleanly", async () => {
|
||||
// Multiple 0xFF fill bytes precede the APP1 marker code: FF FF FF E1 ...
|
||||
const input = makeJpeg(
|
||||
[
|
||||
{
|
||||
code: 0xe1,
|
||||
payload: new Uint8Array([0x45, 0x78, 0x69, 0x66, 0x00, 0x00]),
|
||||
fillBytes: 2,
|
||||
},
|
||||
],
|
||||
new Uint8Array([0x12, 0x34]),
|
||||
);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
// APP1 was dropped — no FFE1 should remain in the output.
|
||||
expect(findMarker(result.value.bytes, 0xe1)).toBe(-1);
|
||||
expect(result.value.metadataRemoved).toBe(1);
|
||||
// Output must still be a well-formed JPEG.
|
||||
expect(result.value.bytes[0]).toBe(0xff);
|
||||
expect(result.value.bytes[1]).toBe(0xd8);
|
||||
expect(result.value.bytes[result.value.bytes.length - 2]).toBe(0xff);
|
||||
expect(result.value.bytes[result.value.bytes.length - 1]).toBe(0xd9);
|
||||
});
|
||||
|
||||
it("preserves APP14 even when preceded by fill bytes", async () => {
|
||||
const adobePayload = new Uint8Array([
|
||||
0x41, 0x64, 0x6f, 0x62, 0x65, 0x00, 0x64, 0x00, 0x00, 0x00, 0x00, 0x00,
|
||||
]);
|
||||
const input = makeJpeg(
|
||||
[
|
||||
{ code: 0xe1, payload: new Uint8Array([0x00]) }, // EXIF (drop)
|
||||
{ code: 0xee, payload: adobePayload, fillBytes: 3 }, // APP14 with fill
|
||||
],
|
||||
new Uint8Array([0x12]),
|
||||
);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
const adobeOffset = findMarker(result.value.bytes, 0xee);
|
||||
expect(adobeOffset).toBeGreaterThan(0);
|
||||
// "Adobe" signature still present in the kept payload.
|
||||
expect(result.value.bytes[adobeOffset + 4]).toBe(0x41); // 'A'
|
||||
expect(result.value.bytes[adobeOffset + 5]).toBe(0x64); // 'd'
|
||||
// EXIF was dropped.
|
||||
expect(findMarker(result.value.bytes, 0xe1)).toBe(-1);
|
||||
});
|
||||
});
|
||||
|
||||
describe("JpegStrategy — truncated input", () => {
|
||||
const strategy = new JpegStrategy();
|
||||
|
||||
it("returns an error when the entropy stream runs to EOF without an EOI marker", async () => {
|
||||
// SOI, minimal SOS, entropy bytes, then nothing. No EOI is reached.
|
||||
const input = new Uint8Array([
|
||||
0xff, 0xd8, // SOI
|
||||
0xff, 0xda, 0x00, 0x04, 0x00, 0x00, // SOS with 2-byte payload
|
||||
0x12, 0x34, 0x56, 0x78, // entropy data, then EOF
|
||||
]);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(false);
|
||||
if (result.ok) return;
|
||||
const detail = result.error.detail.toLowerCase();
|
||||
expect(detail).toMatch(/truncated|eoi/);
|
||||
});
|
||||
|
||||
it("returns an error when the file ends after a complete segment but never reaches EOI", async () => {
|
||||
// SOI, valid APP0 (len=4, 2-byte payload), valid DQT (len=4, 2-byte payload), no SOS, no EOI.
|
||||
const input = new Uint8Array([
|
||||
0xff, 0xd8, // SOI
|
||||
0xff, 0xe0, 0x00, 0x04, 0x00, 0x00, // APP0, len=4, 2 payload bytes
|
||||
0xff, 0xdb, 0x00, 0x04, 0x00, 0x00, // DQT, len=4, 2 payload bytes
|
||||
]);
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(false);
|
||||
if (result.ok) return;
|
||||
const detail = result.error.detail.toLowerCase();
|
||||
expect(detail).toMatch(/truncated|eoi/);
|
||||
});
|
||||
});
|
||||
|
||||
describe("JpegStrategy — real fixture", () => {
|
||||
const strategy = new JpegStrategy();
|
||||
|
||||
it("strips the bundled sample.jpg fixture and produces a valid JPEG", async () => {
|
||||
const fixturePath = resolve(HERE, "../../fixtures/wasm/images/sample.jpg");
|
||||
const buf = await readFile(fixturePath);
|
||||
const bytes = new Uint8Array(buf);
|
||||
|
||||
const result = await strategy.strip({ bytes, options: DEFAULT_OPTIONS });
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
|
||||
const stripped = result.value.bytes;
|
||||
// Must start with SOI and end with EOI
|
||||
expect(stripped[0]).toBe(0xff);
|
||||
expect(stripped[1]).toBe(0xd8);
|
||||
expect(stripped[stripped.length - 2]).toBe(0xff);
|
||||
expect(stripped[stripped.length - 1]).toBe(0xd9);
|
||||
// At least one segment was dropped (the fixture has metadata)
|
||||
expect(result.value.metadataRemoved).toBeGreaterThan(0);
|
||||
// No EXIF, JFIF, or COM markers remain
|
||||
expect(findMarker(stripped, 0xe0)).toBe(-1);
|
||||
expect(findMarker(stripped, 0xe1)).toBe(-1);
|
||||
expect(findMarker(stripped, 0xfe)).toBe(-1);
|
||||
});
|
||||
});
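
The tests above rely on a raw `FF <code>` byte scan, which is enough for synthetic inputs but can match inside a payload. As a rough illustration of the policy these tests exercise (drop APPn and COM, always keep APP14, keep APP2 only on opt-in) and of a structure-aware header walk that skips fill bytes, here is a minimal sketch. The names `MarkerPolicy`, `shouldDropMarker`, and `listSegmentCodes` are hypothetical and are not taken from `jpeg_strategy.ts`; the sketch also assumes every marker before SOS carries a length field.

```typescript
// Hypothetical names throughout; this is not the code in jpeg_strategy.ts.
type MarkerPolicy = { preserveColorProfile: boolean };

// Keep/drop decision for one marker code (the byte that follows 0xFF).
function shouldDropMarker(code: number, policy: MarkerPolicy): boolean {
  if (code === 0xfe) return true; // COM
  if (code < 0xe0 || code > 0xef) return false; // not APPn: keep
  if (code === 0xee) return false; // APP14 (Adobe) is always kept
  if (code === 0xe2) return !policy.preserveColorProfile; // APP2 (ICC) is opt-in
  return true; // every other APPn
}

// Walks header segments from SOI up to and including SOS and returns their
// marker codes. Payload bytes are skipped by length, never scanned.
// Returns null when the input is malformed or ends before SOS.
// Assumption: every marker before SOS has a 2-byte length field.
function listSegmentCodes(bytes: Uint8Array): number[] | null {
  if (bytes.length < 4 || bytes[0] !== 0xff || bytes[1] !== 0xd8) return null;
  const codes: number[] = [];
  let i = 2;
  while (i < bytes.length) {
    if (bytes[i] !== 0xff) return null; // expected a marker prefix
    // Skip the prefix and any fill bytes: a run of 0xFF before the code.
    while (i < bytes.length && bytes[i] === 0xff) i++;
    if (i + 2 >= bytes.length) return null; // need code + 2 length bytes
    const code = bytes[i];
    const len = (bytes[i + 1] << 8) | bytes[i + 2]; // includes the length bytes
    if (len < 2 || i + 1 + len > bytes.length) return null; // runs off the end
    codes.push(code);
    if (code === 0xda) return codes; // SOS: entropy-coded data follows
    i += 1 + len; // code byte + length field + payload
  }
  return null; // ran out of bytes before SOS
}
```

A test helper built on `listSegmentCodes` could assert on `codes.includes(0xe1)` instead of a byte-pair scan, which avoids false positives when a kept payload happens to contain `FF E1`.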
|
||||
346
tests/infrastructure/wasm/pdf_strategy.test.ts
Normal file
@@ -0,0 +1,346 @@
import { describe, it, expect } from "vitest";
import { readFile } from "node:fs/promises";
import { resolve, dirname } from "node:path";
import { fileURLToPath } from "node:url";
import { PdfStrategy } from "../../../src/infrastructure/wasm/strategies/pdf_strategy";
import type { StripOptions } from "../../../src/infrastructure/wasm/format_strategy";

const HERE = dirname(fileURLToPath(import.meta.url));
const DEFAULT_OPTIONS: StripOptions = {
  preserveOrientation: false,
  preserveColorProfile: false,
  preserveTimestamps: false,
};

// Build a synthetic PDF with rich metadata for the strip to chew on. Uses
// pdf-lib's low-level API so each test exercises a specific source: Info
// dict fields, dates, /Lang, XMP stream, annotations.
async function makeRichPdf({
  xmpContent,
  withAnnotation = false,
  withLang = false,
  withDates = true,
}: {
  xmpContent?: string;
  withAnnotation?: boolean;
  withLang?: boolean;
  withDates?: boolean;
} = {}): Promise<Uint8Array> {
  const {
    PDFDocument,
    PDFName,
    PDFArray,
    PDFString,
    PDFRawStream,
  } = await import("pdf-lib");

  // updateMetadata: true (default) is fine for fixture generation — we
  // want pdf-lib to populate Producer/Creator/etc. so our strip has
  // something to remove.
  const doc = await PDFDocument.create();
  const page = doc.addPage([612, 792]);

  doc.setTitle("Sensitive Title");
  doc.setAuthor("Sensitive Author");
  doc.setSubject("Sensitive Subject");
  doc.setKeywords(["secret", "internal"]);
  doc.setProducer("Internal Tool 1.0");
  doc.setCreator("Author Software");
  if (withDates) {
    doc.setCreationDate(new Date("2024-01-15T10:30:00Z"));
    doc.setModificationDate(new Date("2024-06-20T14:45:00Z"));
  }

  if (withLang) {
    doc.catalog.set(PDFName.of("Lang"), PDFString.of("en-US-secret-locale"));
  }

  if (xmpContent !== undefined) {
    const xmpBytes = new TextEncoder().encode(xmpContent);
    const xmpDict = doc.context.obj({
      Type: "Metadata",
      Subtype: "XML",
      Length: xmpBytes.length,
    });
    const xmpStream = PDFRawStream.of(xmpDict, xmpBytes);
    const xmpRef = doc.context.register(xmpStream);
    doc.catalog.set(PDFName.of("Metadata"), xmpRef);
  }

  if (withAnnotation) {
    const annotation = doc.context.obj({
      Type: "Annot",
      Subtype: "Text",
      Rect: [100, 100, 200, 200],
      Contents: PDFString.of("Sensitive review comment"),
      T: PDFString.of("Reviewer Name"),
      M: PDFString.of("D:20250101000000Z"),
      CreationDate: PDFString.of("D:20250101000000Z"),
      Subj: PDFString.of("Sensitive subject line"),
    });
    const annotRef = doc.context.register(annotation);
    const existing = page.node.get(PDFName.of("Annots"));
    if (existing instanceof PDFArray) {
      existing.push(annotRef);
    } else {
      const annots = doc.context.obj([annotRef]);
      page.node.set(PDFName.of("Annots"), annots);
    }
  }

  return await doc.save();
}

describe("PdfStrategy — extension and magic byte", () => {
|
||||
const strategy = new PdfStrategy();
|
||||
|
||||
it("claims .pdf extension only", () => {
|
||||
expect(strategy.extensions.has(".pdf")).toBe(true);
|
||||
expect(strategy.extensions.has(".jpg")).toBe(false);
|
||||
expect(strategy.extensions.has(".docx")).toBe(false);
|
||||
});
|
||||
|
||||
it("verifies PDF magic bytes (%PDF-)", () => {
|
||||
const pdfMagic = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d]);
|
||||
expect(strategy.verifyMagicBytes?.({ bytes: pdfMagic })).toBe(true);
|
||||
const notPdf = new Uint8Array([0xff, 0xd8, 0xff, 0xe0, 0x00]);
|
||||
expect(strategy.verifyMagicBytes?.({ bytes: notPdf })).toBe(false);
|
||||
});
|
||||
});
|
||||
|
||||
describe("PdfStrategy — Info dictionary", () => {
|
||||
const strategy = new PdfStrategy();
|
||||
|
||||
it("clears Title, Author, Subject, Keywords, Producer, Creator", async () => {
|
||||
const input = await makeRichPdf();
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
|
||||
const { PDFDocument } = await import("pdf-lib");
|
||||
// Re-load with updateMetadata: false so reading the cleaned doc
|
||||
// doesn't re-introduce Producer/ModDate in memory.
|
||||
const cleaned = await PDFDocument.load(result.value.bytes, {
|
||||
updateMetadata: false,
|
||||
});
|
||||
expect(cleaned.getTitle()).toBeUndefined();
|
||||
expect(cleaned.getAuthor()).toBeUndefined();
|
||||
expect(cleaned.getSubject()).toBeUndefined();
|
||||
expect(cleaned.getKeywords()).toBeUndefined();
|
||||
expect(cleaned.getCreator()).toBeUndefined();
|
||||
// Producer is the field that sat behind the false "re-injection"
|
||||
// claim. Verify it's actually absent on disk.
|
||||
expect(cleaned.getProducer()).toBeUndefined();
|
||||
});
|
||||
|
||||
it("removes both CreationDate and ModDate (no fresh strip-event timestamp)", async () => {
|
||||
const input = await makeRichPdf({ withDates: true });
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
|
||||
const { PDFDocument } = await import("pdf-lib");
|
||||
const cleaned = await PDFDocument.load(result.value.bytes, {
|
||||
updateMetadata: false,
|
||||
});
|
||||
expect(cleaned.getCreationDate()).toBeUndefined();
|
||||
// Critical: with updateMetadata: true (default for our strip path),
|
||||
// pdf-lib would otherwise stamp ModDate to "now" — leaking the
|
||||
// strip event time. The strategy must defeat that.
|
||||
expect(cleaned.getModificationDate()).toBeUndefined();
|
||||
});
|
||||
|
||||
it("counts every removed metadata key in metadataRemoved", async () => {
|
||||
const input = await makeRichPdf({ withDates: true });
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
// 6 Info string fields + CreationDate + ModDate = 8 minimum from
|
||||
// makeRichPdf. We allow >= because pdf-lib may add extras.
|
||||
expect(result.value.metadataRemoved).toBeGreaterThanOrEqual(8);
|
||||
});
|
||||
});
|
||||
|
||||
describe("PdfStrategy — catalog-level fingerprints", () => {
|
||||
const strategy = new PdfStrategy();
|
||||
|
||||
it("drops /Lang from the catalog", async () => {
|
||||
const input = await makeRichPdf({ withLang: true });
|
||||
|
||||
const { PDFDocument, PDFName } = await import("pdf-lib");
|
||||
// Sanity: the input has /Lang on the catalog
|
||||
const inputDoc = await PDFDocument.load(input, { updateMetadata: false });
|
||||
expect(inputDoc.catalog.has(PDFName.of("Lang"))).toBe(true);
|
||||
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
|
||||
const cleaned = await PDFDocument.load(result.value.bytes, {
|
||||
updateMetadata: false,
|
||||
});
|
||||
expect(cleaned.catalog.has(PDFName.of("Lang"))).toBe(false);
|
||||
});
|
||||
});
|
||||
|
||||
describe("PdfStrategy — XMP /Metadata stream", () => {
|
||||
const strategy = new PdfStrategy();
|
||||
|
||||
it("drops the catalog /Metadata reference AND the XMP stream content", async () => {
|
||||
// A unique sentinel string we can grep for in the output bytes.
|
||||
// If pdf-lib's serializer emits the orphaned XMP stream, this
|
||||
// string survives and the test fails — exposing the orphan bug.
|
||||
const sentinel = "XMP-SECRET-AUTHOR-FINGERPRINT-9F2C8";
|
||||
const xmp = `<?xml version="1.0"?><x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:creator>${sentinel}</dc:creator></rdf:Description></rdf:RDF></x:xmpmeta>`;
|
||||
const input = await makeRichPdf({ xmpContent: xmp });
|
||||
|
||||
// Sanity: confirm the input actually carries an XMP stream. The saved
// bytes may be compressed by pdf-lib, so look the stream up through the
// re-parsed catalog rather than searching the raw bytes for the sentinel.
|
||||
const { PDFDocument, PDFName, PDFRawStream } = await import("pdf-lib");
|
||||
const inputDoc = await PDFDocument.load(input, {
|
||||
updateMetadata: false,
|
||||
});
|
||||
const inputRef = inputDoc.catalog.get(PDFName.of("Metadata"));
|
||||
expect(inputRef).toBeDefined();
|
||||
const inputStream = inputDoc.context.lookup(inputRef);
|
||||
expect(inputStream).toBeInstanceOf(PDFRawStream);
|
||||
|
||||
// Strip and verify
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
|
||||
const cleaned = await PDFDocument.load(result.value.bytes, {
|
||||
updateMetadata: false,
|
||||
});
|
||||
// The catalog reference is gone
|
||||
expect(cleaned.catalog.get(PDFName.of("Metadata"))).toBeUndefined();
|
||||
|
||||
// And no orphan stream survives in the indirect object table.
|
||||
// Walk every indirect object; assert no PDFRawStream still carries
|
||||
// the sentinel. This catches the "reference dropped, but stream
|
||||
// object still in the file body" failure mode the gap analysis
|
||||
// flagged.
|
||||
let orphanFound = false;
|
||||
for (const [, obj] of cleaned.context.enumerateIndirectObjects()) {
|
||||
if (obj instanceof PDFRawStream) {
|
||||
const content = new TextDecoder("utf-8", { fatal: false }).decode(
|
||||
obj.contents,
|
||||
);
|
||||
if (content.includes(sentinel)) {
|
||||
orphanFound = true;
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
expect(orphanFound).toBe(false);
|
||||
});
|
||||
});
|
||||
|
||||
describe("PdfStrategy — annotations", () => {
|
||||
const strategy = new PdfStrategy();
|
||||
|
||||
it("scrubs annotation /T, /Contents, /M, /CreationDate, /Subj", async () => {
|
||||
const input = await makeRichPdf({ withAnnotation: true });
|
||||
const result = await strategy.strip({
|
||||
bytes: input,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
|
||||
const { PDFDocument, PDFName, PDFArray, PDFDict } = await import(
|
||||
"pdf-lib"
|
||||
);
|
||||
const cleaned = await PDFDocument.load(result.value.bytes, {
|
||||
updateMetadata: false,
|
||||
});
|
||||
const page = cleaned.getPage(0);
|
||||
const annotsRef = page.node.get(PDFName.of("Annots"));
|
||||
expect(annotsRef).toBeDefined();
|
||||
const annots = cleaned.context.lookup(annotsRef);
|
||||
expect(annots).toBeInstanceOf(PDFArray);
|
||||
if (!(annots instanceof PDFArray)) return;
|
||||
expect(annots.size()).toBe(1);
|
||||
|
||||
const annot = cleaned.context.lookup(annots.get(0));
|
||||
expect(annot).toBeInstanceOf(PDFDict);
|
||||
if (!(annot instanceof PDFDict)) return;
|
||||
|
||||
expect(annot.has(PDFName.of("T"))).toBe(false);
|
||||
expect(annot.has(PDFName.of("Contents"))).toBe(false);
|
||||
expect(annot.has(PDFName.of("M"))).toBe(false);
|
||||
expect(annot.has(PDFName.of("CreationDate"))).toBe(false);
|
||||
expect(annot.has(PDFName.of("Subj"))).toBe(false);
|
||||
});
|
||||
|
||||
  it("preserves annotation visibility (Type, Subtype, Rect)", async () => {
    const input = await makeRichPdf({ withAnnotation: true });
    const result = await strategy.strip({
      bytes: input,
      options: DEFAULT_OPTIONS,
    });
    expect(result.ok).toBe(true);
    if (!result.ok) return;

    const { PDFDocument, PDFName, PDFArray, PDFDict } = await import(
      "pdf-lib"
    );
    const cleaned = await PDFDocument.load(result.value.bytes, {
      updateMetadata: false,
    });
    const page = cleaned.getPage(0);
    const annotsRef = page.node.get(PDFName.of("Annots"));
    const annots = cleaned.context.lookup(annotsRef);
    // Assert before narrowing so a missing /Annots array fails the test
    // instead of returning early with no assertions run.
    expect(annots).toBeInstanceOf(PDFArray);
    if (!(annots instanceof PDFArray)) return;
    const annot = cleaned.context.lookup(annots.get(0));
    expect(annot).toBeInstanceOf(PDFDict);
    if (!(annot instanceof PDFDict)) return;

    expect(annot.has(PDFName.of("Type"))).toBe(true);
    expect(annot.has(PDFName.of("Subtype"))).toBe(true);
    expect(annot.has(PDFName.of("Rect"))).toBe(true);
  });
});
|
||||
|
||||
describe("PdfStrategy — bundled fixture", () => {
|
||||
const strategy = new PdfStrategy();
|
||||
|
||||
it("produces a valid PDF and removes metadata from the bundled sample", async () => {
|
||||
const fixturePath = resolve(HERE, "../../fixtures/wasm/pdf/sample.pdf");
|
||||
const buf = await readFile(fixturePath);
|
||||
const bytes = new Uint8Array(buf);
|
||||
|
||||
const result = await strategy.strip({
|
||||
bytes,
|
||||
options: DEFAULT_OPTIONS,
|
||||
});
|
||||
expect(result.ok).toBe(true);
|
||||
if (!result.ok) return;
|
||||
|
||||
// Must start with %PDF-
|
||||
expect(result.value.bytes[0]).toBe(0x25);
|
||||
expect(result.value.bytes[1]).toBe(0x50);
|
||||
expect(result.value.bytes[2]).toBe(0x44);
|
||||
expect(result.value.bytes[3]).toBe(0x46);
|
||||
expect(result.value.bytes[4]).toBe(0x2d);
|
||||
// At least some metadata removed
|
||||
expect(result.value.metadataRemoved).toBeGreaterThan(0);
|
||||
});
|
||||
});
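
The XMP test above inspects parsed stream objects for the sentinel. A complementary check, sketched here under the assumption that the sentinel would appear uncompressed if it leaked, is to search the raw output bytes directly. `indexOfBytes` and `containsAscii` are hypothetical helpers and are not part of the test file; a raw search would not see a sentinel inside a compressed stream, so it could only supplement the object walk.

```typescript
// Hypothetical helper: offset of the first occurrence of `needle` in
// `haystack`, or -1 when it does not occur.
function indexOfBytes(haystack: Uint8Array, needle: Uint8Array): number {
  if (needle.length === 0) return 0;
  outer: for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    return i;
  }
  return -1;
}

// Hypothetical helper: true when the UTF-8 encoding of `text` occurs
// anywhere in `bytes`.
function containsAscii(bytes: Uint8Array, text: string): boolean {
  return indexOfBytes(bytes, new TextEncoder().encode(text)) !== -1;
}
```

In a test this might read `expect(containsAscii(result.value.bytes, sentinel)).toBe(false)`, placed next to the existing indirect-object walk.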
|
||||
|
|
@@ -43,38 +43,46 @@ describe("selectStrategy", () => {
   });

   it("returns null for unsupported extensions", () => {
+    // .orf (Olympus RAW) has no registered strategy
     const strategy = selectStrategy({
-      filename: "photo.jpg",
-      bytes: new Uint8Array([0xff, 0xd8, 0xff]),
+      filename: "photo.orf",
+      bytes: new Uint8Array([0x49, 0x49, 0x52, 0x4f]),
     });
     expect(strategy).toBeNull();
   });

+  it("routes .jpg with JPEG magic bytes to JpegStrategy", () => {
+    const jpegMagic = new Uint8Array([0xff, 0xd8, 0xff, 0xe0, 0x00, 0x00]);
+    const result = selectStrategy({ filename: "photo.jpg", bytes: jpegMagic });
+    expect(result).not.toBeNull();
+    expect(result?.extensions.has(".jpg")).toBe(true);
+  });
+
+  it("routes .pdf with PDF magic bytes to PdfStrategy", () => {
+    const pdfMagic = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d]);
+    const result = selectStrategy({ filename: "doc.pdf", bytes: pdfMagic });
+    expect(result).not.toBeNull();
+    expect(result?.extensions.has(".pdf")).toBe(true);
+  });
+
   it("returns null for files without an extension", () => {
     const strategy = selectStrategy({ filename: "README", bytes: ZIP_MAGIC });
     expect(strategy).toBeNull();
   });
 });

-// Sync guard: ensures the renderer's hardcoded extension list stays in step with
-// the strategy registry. If a new strategy is added to the registry but this list
-// is not updated, this test fails loudly.
+// Sync guard: WASM_HANDLED_EXTENSIONS is the Electron-specific routing set —
+// formats ExifTool can't handle (Office, fragmented video) that must go through WASM.
+// ImageStrategy (.jpg/.jpeg) and PdfStrategy (.pdf) are in the registry so the web
+// build can use them, but Electron uses ExifTool for those formats instead.
+// Invariant: every extension in WASM_HANDLED_EXTENSIONS must have a registered strategy.
 describe("renderer extension list sync", () => {
-  it("WASM_HANDLED_EXTENSIONS matches allHandledExtensions() from the registry", () => {
+  it("every WASM_HANDLED_EXTENSIONS entry has a registered strategy", () => {
     const registryExtensions = allHandledExtensions();
-    const rendererExtensions = WASM_HANDLED_EXTENSIONS;
-
-    for (const ext of registryExtensions) {
-      expect(
-        rendererExtensions.has(ext),
-        `Registry has "${ext}" but WASM_HANDLED_EXTENSIONS in renderer does not`,
-      ).toBe(true);
-    }
-
-    for (const ext of rendererExtensions) {
+    for (const ext of WASM_HANDLED_EXTENSIONS) {
       expect(
         registryExtensions.has(ext),
-        `WASM_HANDLED_EXTENSIONS has "${ext}" but registry does not`,
+        `WASM_HANDLED_EXTENSIONS has "${ext}" but no strategy handles it`,
       ).toBe(true);
     }
   });

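The routing behaviour these tests pin down (look up by extension, confirm with magic bytes, return `null` for unknown or missing extensions) can be sketched roughly as follows. This is an illustration only, not the registry code from this PR: `StrategySketch`, `extensionOf`, `registrySketch`, and `selectStrategySketch` are hypothetical names, and the assumption that the real `selectStrategy` requires both an extension match and a magic-byte match is inferred from the test titles.

```typescript
// Hypothetical shape; the real FormatStrategy interface is not shown in this diff.
interface StrategySketch {
  readonly name: string;
  readonly extensions: ReadonlySet<string>;
  matchesMagic(bytes: Uint8Array): boolean;
}

// True when `bytes` begins with every byte of `prefix`.
function startsWith(bytes: Uint8Array, prefix: readonly number[]): boolean {
  if (bytes.length < prefix.length) return false;
  return prefix.every((b, i) => bytes[i] === b);
}

const jpegSketch: StrategySketch = {
  name: "jpeg",
  extensions: new Set([".jpg", ".jpeg"]),
  matchesMagic: (bytes) => startsWith(bytes, [0xff, 0xd8, 0xff]),
};

const pdfSketch: StrategySketch = {
  name: "pdf",
  extensions: new Set([".pdf"]),
  matchesMagic: (bytes) => startsWith(bytes, [0x25, 0x50, 0x44, 0x46, 0x2d]), // "%PDF-"
};

const registrySketch: readonly StrategySketch[] = [jpegSketch, pdfSketch];

// Lower-cased extension including the dot, or null when there is none.
function extensionOf(filename: string): string | null {
  const dot = filename.lastIndexOf(".");
  if (dot <= 0 || dot === filename.length - 1) return null;
  return filename.slice(dot).toLowerCase();
}

function selectStrategySketch(
  input: { filename: string; bytes: Uint8Array },
  registry: readonly StrategySketch[] = registrySketch,
): StrategySketch | null {
  const ext = extensionOf(input.filename);
  if (ext === null) return null;
  for (const strategy of registry) {
    if (strategy.extensions.has(ext) && strategy.matchesMagic(input.bytes)) {
      return strategy;
    }
  }
  return null;
}
```

Requiring both checks means a renamed file (say, a ZIP called `photo.jpg`) falls through to `null` instead of reaching a parser that would reject it later.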
161
tests/renderer/file_browse_button.test.ts
Normal file
@@ -0,0 +1,161 @@
import { describe, it, expect, vi, beforeEach, afterEach } from "vitest";
import { FileProcessingStatus } from "../../src/domain/files/file_status";
import type {
  AppAction,
  FileEntry,
} from "../../src/renderer/contexts/AppContext";
import {
  handleSelectedFiles,
  shouldRenderFileBrowseButton,
} from "../../src/renderer/components/ui/FileBrowseButton";

// Hand-rolled fake matching the subset of window.api that
// FileBrowseButton's helpers actually consume (D-34: no vi.mock for our APIs).
function setupWindowApi(): {
  getPathForFile: ReturnType<typeof vi.fn>;
} {
  const getPathForFile = vi
    .fn<(file: File) => string>()
    .mockImplementation(
      (file: File) =>
        `/web-file/00000000-0000-0000-0000-000000000000/${file.name}`,
    );

  (globalThis as Record<string, unknown>).window = {
    api: {
      files: { getPathForFile },
      platform: { isMac: false, isWeb: true },
    },
  };

  return { getPathForFile };
}

describe("shouldRenderFileBrowseButton", () => {
  it("returns true when running in the web build", () => {
    expect(shouldRenderFileBrowseButton({ isWeb: true })).toBe(true);
  });

  it("returns false when running in Electron (isWeb=false)", () => {
    expect(shouldRenderFileBrowseButton({ isWeb: false })).toBe(false);
  });
});

describe("handleSelectedFiles", () => {
  let dispatched: AppAction[];
  let processFilesArg: FileEntry[] | null;
  let dispatch: (action: AppAction) => void;
  let processFiles: (entries: FileEntry[]) => void;
  let api: ReturnType<typeof setupWindowApi>;

  beforeEach(() => {
    dispatched = [];
    processFilesArg = null;
    dispatch = (action) => {
      dispatched.push(action);
    };
    processFiles = (entries) => {
      processFilesArg = entries;
    };
    api = setupWindowApi();
  });

  afterEach(() => {
    delete (globalThis as Record<string, unknown>).window;
  });

  it("dispatches an ADD_FILES action with one FileEntry per supported file", () => {
    const files = [
      new File(["a"], "photo.jpg", { type: "image/jpeg" }),
      new File(["b"], "doc.pdf", { type: "application/pdf" }),
    ];

    const entries = handleSelectedFiles({ files, dispatch, processFiles });

    expect(entries).toHaveLength(2);
    const addAction = dispatched.find((a) => a.type === "ADD_FILES");
    expect(addAction).toBeDefined();
    if (addAction?.type === "ADD_FILES") {
      expect(addAction.files).toHaveLength(2);
      expect(addAction.files[0]?.name).toBe("photo.jpg");
      expect(addAction.files[0]?.extension).toBe("JPG");
      expect(addAction.files[0]?.status).toBe(FileProcessingStatus.Pending);
      expect(addAction.files[1]?.name).toBe("doc.pdf");
    }
  });

  it("registers each File via window.api.files.getPathForFile", () => {
    const files = [new File(["a"], "photo.jpg"), new File(["b"], "video.mp4")];

    handleSelectedFiles({ files, dispatch, processFiles });

    expect(api.getPathForFile).toHaveBeenCalledTimes(2);
    expect(api.getPathForFile).toHaveBeenNthCalledWith(1, files[0]);
    expect(api.getPathForFile).toHaveBeenNthCalledWith(2, files[1]);
  });

  it("uses the registered virtual path as the FileEntry path", () => {
    const file = new File(["a"], "photo.jpg");
    api.getPathForFile.mockReturnValueOnce("/web-file/abc-123/photo.jpg");

    const entries = handleSelectedFiles({
      files: [file],
      dispatch,
      processFiles,
    });

    expect(entries[0]?.path).toBe("/web-file/abc-123/photo.jpg");
  });

  it("forwards the constructed entries to processFiles for the pipeline", () => {
    const file = new File(["a"], "photo.jpg");

    const entries = handleSelectedFiles({
      files: [file],
      dispatch,
      processFiles,
    });

    expect(processFilesArg).not.toBeNull();
    expect(processFilesArg).toEqual(entries);
    expect(processFilesArg?.[0]?.name).toBe("photo.jpg");
  });

  it("filters out unsupported file types without dispatching them", () => {
    const files = [
      new File(["a"], "photo.jpg"),
      new File(["b"], "script.exe"),
      new File(["c"], "notes.txt"),
    ];

    const entries = handleSelectedFiles({ files, dispatch, processFiles });

    expect(entries).toHaveLength(1);
    expect(entries[0]?.name).toBe("photo.jpg");
    // Unsupported files should not even be registered with the FileRegistry.
    expect(api.getPathForFile).toHaveBeenCalledTimes(1);
  });

  it("does not dispatch or call processFiles when every file is unsupported", () => {
    const files = [new File(["a"], "script.exe"), new File(["b"], "notes.txt")];

    const entries = handleSelectedFiles({ files, dispatch, processFiles });

    expect(entries).toHaveLength(0);
    expect(dispatched).toHaveLength(0);
    expect(processFilesArg).toBeNull();
  });

  it("does nothing on an empty selection", () => {
    const entries = handleSelectedFiles({
      files: [],
      dispatch,
      processFiles,
    });

    expect(entries).toHaveLength(0);
    expect(dispatched).toHaveLength(0);
    expect(processFilesArg).toBeNull();
    expect(api.getPathForFile).not.toHaveBeenCalled();
  });
});

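Taken together, the `handleSelectedFiles` tests describe a small contract: filter unsupported files first, register only the survivors, dispatch a single `ADD_FILES` action, then hand the same entries to the pipeline. A minimal sketch of a helper satisfying that contract is shown below. It is not the implementation in `FileBrowseButton`: every name ending in `Sketch` is hypothetical, the supported-extension set is invented for illustration, and `getPathForFile` is injected as a parameter here (the real helper appears to read it from `window.api`).

```typescript
// Hypothetical shapes; the real FileEntry and AppAction types live in AppContext.
interface PickedFileSketch {
  readonly name: string;
  readonly size: number;
}

interface EntrySketch {
  readonly id: string;
  readonly path: string;
  readonly name: string;
  readonly extension: string; // upper-case, no dot (the tests expect "JPG")
  readonly size: number;
  readonly status: "pending";
}

type ActionSketch = {
  readonly type: "ADD_FILES";
  readonly files: readonly EntrySketch[];
};

// Assumption: an illustrative allow-list, not the project's real one.
const SUPPORTED_SKETCH: ReadonlySet<string> = new Set(["JPG", "JPEG", "PDF", "MP4"]);

function upperExtension(name: string): string | null {
  const dot = name.lastIndexOf(".");
  if (dot <= 0 || dot === name.length - 1) return null;
  return name.slice(dot + 1).toUpperCase();
}

function handleSelectedFilesSketch(args: {
  files: readonly PickedFileSketch[];
  getPathForFile: (file: PickedFileSketch) => string;
  dispatch: (action: ActionSketch) => void;
  processFiles: (entries: readonly EntrySketch[]) => void;
}): EntrySketch[] {
  const entries: EntrySketch[] = [];
  for (const file of args.files) {
    const extension = upperExtension(file.name);
    // Skip unsupported files before they are registered anywhere.
    if (extension === null || !SUPPORTED_SKETCH.has(extension)) continue;
    entries.push({
      id: `entry-${entries.length}`,
      path: args.getPathForFile(file),
      name: file.name,
      extension,
      size: file.size,
      status: "pending",
    });
  }
  // Nothing supported: no action, no pipeline call.
  if (entries.length === 0) return entries;
  args.dispatch({ type: "ADD_FILES", files: entries });
  args.processFiles(entries);
  return entries;
}
```

Filtering before registration matters in the web build because each registered file presumably keeps a `File` handle alive in the registry; rejecting early avoids holding references to files that will never be processed.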
@@ -37,9 +37,10 @@ function createMockApi(): {
 function makeFileEntry(overrides: Partial<FileEntry> = {}): FileEntry {
   return {
     id: overrides.id ?? "test-id-1",
-    path: overrides.path ?? "/path/to/test.jpg",
-    name: overrides.name ?? "test.jpg",
-    extension: overrides.extension ?? "JPG",
+    // Default to .orf (RAW format) — handled by ExifTool, not WASM
+    path: overrides.path ?? "/path/to/test.orf",
+    name: overrides.name ?? "test.orf",
+    extension: overrides.extension ?? "ORF",
     size: overrides.size ?? 1024,
     folder: overrides.folder ?? null,
     status: overrides.status ?? FileProcessingStatus.Pending,

@@ -93,6 +94,7 @@ describe("processFileEntries", () => {
           onToggle: vi.fn(),
         },
         wasm: mockApi.wasm,
+        platform: { isMac: false, isWeb: false },
       },
     };
   });

@@ -223,8 +225,8 @@ describe("processFileEntries", () => {
   });

   it("processes files sequentially (second file starts after first completes)", async () => {
-    const entry1 = makeFileEntry({ id: "id-1", path: "/a.jpg" });
-    const entry2 = makeFileEntry({ id: "id-2", path: "/b.jpg" });
+    const entry1 = makeFileEntry({ id: "id-1", path: "/a.orf" });
+    const entry2 = makeFileEntry({ id: "id-2", path: "/b.orf" });

     let callOrder: string[] = [];
     mockApi.exif.readMetadata.mockImplementation(async (path: string) => {

@@ -238,14 +240,14 @@ describe("processFileEntries", () => {

     await processFileEntries([entry1, entry2], mockDispatch);

-    // Expect: read /a.jpg, remove /a.jpg, read /a.jpg (after), then read /b.jpg, remove /b.jpg, read /b.jpg (after)
+    // Expect: read /a.orf, remove /a.orf, read /a.orf (after), then read /b.orf, remove /b.orf, read /b.orf (after)
     expect(callOrder).toEqual([
-      "read:/a.jpg",
-      "remove:/a.jpg",
-      "read:/a.jpg",
-      "read:/b.jpg",
-      "remove:/b.jpg",
-      "read:/b.jpg",
+      "read:/a.orf",
+      "remove:/a.orf",
+      "read:/a.orf",
+      "read:/b.orf",
+      "remove:/b.orf",
+      "read:/b.orf",
     ]);
   });


@@ -314,12 +316,11 @@ describe("processFileEntries", () => {
     expect(mockApi.exif.removeMetadata).not.toHaveBeenCalled();
   });

-  it("dispatches .jpg through the existing exif IPC path", async () => {
-    mockApi.exif.readMetadata.mockResolvedValue({});
-    mockApi.exif.removeMetadata.mockResolvedValue({
-      data: null,
-      error: null,
-    });
+  it("dispatches .jpg through ExifTool path in Electron (isWeb=false)", async () => {
+    // In Electron, ExifTool handles images — JpegStrategy is for the web build only.
+    // The mock's platform.isWeb is false, so .jpg goes to processViaExif.
+    mockApi.exif.readMetadata.mockResolvedValue({ Make: "TestCamera" });
+    mockApi.exif.removeMetadata.mockResolvedValue({ data: null, error: null });

     const entry = makeFileEntry({
       id: "e1",

@@ -329,9 +330,8 @@ describe("processFileEntries", () => {
|
||||
await processFileEntries([entry], mockDispatch);
|
||||
|
||||
expect(mockApi.wasm.process).not.toHaveBeenCalled();
|
||||
expect(mockApi.exif.readMetadata).toHaveBeenCalled();
|
||||
expect(mockApi.exif.removeMetadata).toHaveBeenCalled();
|
||||
expect(mockApi.wasm.process).not.toHaveBeenCalled();
|
||||
});
|
||||
|
||||
it("fetches settings only once for a multi-WASM-file batch", async () => {
|
||||
|
|
@@ -366,9 +366,10 @@ describe("processFileEntries", () => {
       error: null,
     });

+    // Use RAW formats — handled by ExifTool, not WASM
     const entries = [
-      makeFileEntry({ id: "a", path: "/tmp/a.jpg", extension: "JPG" }),
-      makeFileEntry({ id: "b", path: "/tmp/b.png", extension: "PNG" }),
+      makeFileEntry({ id: "a", path: "/tmp/a.orf", extension: "ORF" }),
+      makeFileEntry({ id: "b", path: "/tmp/b.cr2", extension: "CR2" }),
     ];

     await processFileEntries(entries, mockDispatch);

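The comments in these hunks describe a two-way routing rule: in Electron, only the formats in `WASM_HANDLED_EXTENSIONS` go through WASM and everything else (including `.jpg`) goes to ExifTool, while the web build has no ExifTool and sends everything through a strategy. A small sketch of that decision is given below, under stated assumptions: `routeForSketch` and `WASM_HANDLED_SKETCH` are hypothetical, and the extensions listed are placeholders, since the real contents of `WASM_HANDLED_EXTENSIONS` are not shown in this diff.

```typescript
type RouteSketch = "wasm" | "exif";

interface PlatformSketch {
  readonly isWeb: boolean;
}

// Assumption: placeholder entries standing in for the Office and
// fragmented-video formats that the comments say ExifTool cannot handle.
const WASM_HANDLED_SKETCH: ReadonlySet<string> = new Set([".docx", ".xlsx", ".pptx"]);

function routeForSketch(extension: string, platform: PlatformSketch): RouteSketch {
  // The web build has no ExifTool process, so every file takes the WASM path.
  if (platform.isWeb) return "wasm";
  // In Electron, only formats ExifTool cannot handle are diverted to WASM.
  return WASM_HANDLED_SKETCH.has(extension.toLowerCase()) ? "wasm" : "exif";
}
```

This is why the updated tests default to `.orf` and mock `platform.isWeb` as `false`: with that platform flag, a RAW or JPEG entry should only ever reach the `exif` mocks.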
348
tools/forensic/pdf.ts
Normal file
@@ -0,0 +1,348 @@
// Forensic recovery battery for the PDF strategy.
// Generates a rich PDF with unique sentinels, strips it three ways, then
// runs every recovery technique we can think of and reports which sentinels
// survive in which output.

import { writeFileSync, readFileSync, copyFileSync, existsSync } from "node:fs";
import { execFileSync } from "node:child_process";
import { inflateSync } from "node:zlib";
import { PdfStrategy } from "/home/luffy/space/exifcleaner/.worktrees/phase-b/src/infrastructure/wasm/strategies/pdf_strategy.ts";

const SCRATCH = "/tmp/pdf-forensic";

const SENTINELS = {
  TITLE: "FORENSIC-TITLE-AAAA1111",
  AUTHOR: "FORENSIC-AUTHOR-BBBB2222",
  SUBJECT: "FORENSIC-SUBJECT-CCCC3333",
  PRODUCER: "FORENSIC-PRODUCER-DDDD4444",
  CREATOR: "FORENSIC-CREATOR-EEEE5555",
  XMP_CREATOR: "FORENSIC-XMP-CREATOR-FFFF6666",
  XMP_TITLE: "FORENSIC-XMP-TITLE-GGGG7777",
  ANNOT_AUTHOR: "FORENSIC-ANNOT-AUTHOR-HHHH8888",
  ANNOT_COMMENT: "FORENSIC-ANNOT-COMMENT-IIII9999",
  LANG: "en-FORENSIC-LANG-JJJJ0000",
} as const;

type SentinelKey = keyof typeof SENTINELS;

async function generateRichPdf(): Promise<Uint8Array> {
  const pdfLib = await import("pdf-lib");
  const { PDFDocument, PDFName, PDFString, PDFRawStream } = pdfLib;

  const doc = await PDFDocument.create();
  const page = doc.addPage([612, 792]);

  // Info dict — every field gets a unique sentinel.
  doc.setTitle(SENTINELS.TITLE);
  doc.setAuthor(SENTINELS.AUTHOR);
  doc.setSubject(SENTINELS.SUBJECT);
  doc.setKeywords(["FORENSIC-KEY-secret", "FORENSIC-KEY-internal"]);
  doc.setProducer(SENTINELS.PRODUCER);
  doc.setCreator(SENTINELS.CREATOR);
  doc.setCreationDate(new Date("2024-01-15T10:30:00Z"));
  doc.setModificationDate(new Date("2024-06-20T14:45:00Z"));

  // /Lang on the catalog.
  doc.catalog.set(PDFName.of("Lang"), PDFString.of(SENTINELS.LANG));

  // XMP metadata stream.
  const xmp = `<?xml version="1.0"?><x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:creator>${SENTINELS.XMP_CREATOR}</dc:creator><dc:title>${SENTINELS.XMP_TITLE}</dc:title></rdf:Description></rdf:RDF></x:xmpmeta>`;
  const xmpBytes = new TextEncoder().encode(xmp);
  const xmpDict = doc.context.obj({
    Type: "Metadata",
    Subtype: "XML",
    Length: xmpBytes.length,
  });
  const xmpStream = PDFRawStream.of(xmpDict, xmpBytes);
  doc.catalog.set(PDFName.of("Metadata"), doc.context.register(xmpStream));

  // Annotation with author + comment.
  const annot = doc.context.obj({
    Type: "Annot",
    Subtype: "Text",
    Rect: [100, 100, 200, 200],
    Contents: PDFString.of(SENTINELS.ANNOT_COMMENT),
    T: PDFString.of(SENTINELS.ANNOT_AUTHOR),
    M: PDFString.of("D:20250101000000Z"),
    CreationDate: PDFString.of("D:20250101000000Z"),
  });
  page.node.set(
    PDFName.of("Annots"),
    doc.context.obj([doc.context.register(annot)]),
  );

  // useObjectStreams: false so the raw fixture is human-readable for
  // initial inspection. Each strip method gets to choose its own output
  // encoding.
  return await doc.save({ useObjectStreams: false });
}

interface ForensicReport {
  file: string;
  sizeBytes: number;
  survivors: {
    rawStrings: SentinelKey[];
    exiftoolTagsVisible: SentinelKey[];
    exiftoolPdfUpdateRevert: SentinelKey[];
    qpdfQdfDecompressed: SentinelKey[];
    walkAllStreams: SentinelKey[];
  };
  hasIncrementalUpdate: boolean;
  hasExifToolMarker: boolean;
  qpdfCheckResult: string;
}

function findSentinels(haystack: string): SentinelKey[] {
  return (Object.keys(SENTINELS) as SentinelKey[]).filter((k) =>
    haystack.includes(SENTINELS[k]),
  );
}

async function walkAllStreams(path: string): Promise<SentinelKey[]> {
  // Re-parse via pdf-lib, walk every indirect object, decompress
  // FlateDecode streams, search for sentinels.
  const pdfLib = await import("pdf-lib");
  const { PDFDocument, PDFRawStream, PDFName } = pdfLib;
  const bytes = readFileSync(path);
  let doc;
  try {
    doc = await PDFDocument.load(bytes, {
      updateMetadata: false,
      throwOnInvalidObject: false,
    });
  } catch (e) {
    // pdf-lib bailed on the structure (likely the ExifTool incremental
    // update). Fall back to raw walking via qpdf --qdf below.
    return [];
  }

  const found = new Set<SentinelKey>();
  for (const [, obj] of doc.context.enumerateIndirectObjects()) {
    if (!(obj instanceof PDFRawStream)) continue;
    // Try plain text first, then FlateDecode if /Filter says so.
    let content = "";
    try {
      content = new TextDecoder("utf-8", { fatal: false }).decode(obj.contents);
    } catch {
      content = "";
    }
    findSentinels(content).forEach((k) => found.add(k));

    // If this stream has a Filter dict, try inflating.
    const dict = obj.dict;
    const filter = dict.get(PDFName.of("Filter"));
    const filterStr = filter ? String(filter) : "";
    if (filterStr.includes("FlateDecode")) {
      try {
        const inflated = inflateSync(Buffer.from(obj.contents));
        const text = new TextDecoder("utf-8", { fatal: false }).decode(
          inflated,
        );
        findSentinels(text).forEach((k) => found.add(k));
      } catch {
        // Ignore decompression failures.
      }
    }
  }
  return [...found];
}

function runForensics(label: string, path: string): ForensicReport {
  const bytes = readFileSync(path);
  const sizeBytes = bytes.length;

  // 1. Raw strings — catches sentinels in unencoded streams.
  const stringsOutput = execFileSync("strings", [path], {
    encoding: "utf8",
    maxBuffer: 50 * 1024 * 1024,
  });
  const rawStrings = findSentinels(stringsOutput);

  // 2. exiftool -a -G1 -s — every visible tag.
  const exifAll = execFileSync(
    "exiftool",
    ["-a", "-G1", "-s", "-charset", "UTF8", path],
    { encoding: "utf8" },
  );
  const exiftoolTagsVisible = findSentinels(exifAll);

  // 3. exiftool -PDF-update:all= — try to revert ExifTool's incremental
  // updates. Only does anything on ExifTool-stripped files but we run
  // it on each to confirm.
  const revertCopy = path.replace(/\.pdf$/, "-revert.pdf");
  copyFileSync(path, revertCopy);
  let revertOutput = "";
  try {
    execFileSync(
      "exiftool",
      ["-PDF-update:all=", "-overwrite_original", revertCopy],
      { encoding: "utf8" },
    );
  } catch (e: unknown) {
    revertOutput += `revert command stderr: ${(e as Error).message}\n`;
  }
  try {
    const afterRevert = execFileSync(
      "exiftool",
      ["-a", "-G1", "-s", "-charset", "UTF8", revertCopy],
      { encoding: "utf8" },
    );
    revertOutput += afterRevert;
  } catch (e: unknown) {
    revertOutput += `read after revert: ${(e as Error).message}\n`;
  }
  const exiftoolPdfUpdateRevert = findSentinels(revertOutput);

  // 4. qpdf --qdf — decompresses every stream, normalizes the file.
  const qdfPath = path.replace(/\.pdf$/, "-qdf.pdf");
  let qdfText = "";
  try {
    execFileSync("qpdf", ["--qdf", "--object-streams=disable", path, qdfPath], {
      encoding: "utf8",
    });
    qdfText = readFileSync(qdfPath, "utf8");
  } catch (e: unknown) {
    qdfText = `qpdf --qdf failed: ${(e as Error).message}`;
  }
  const qpdfQdfDecompressed = findSentinels(qdfText);

  // 5. qpdf --check — structural verification.
  let qpdfCheckResult = "";
  try {
    qpdfCheckResult = execFileSync("qpdf", ["--check", path], {
      encoding: "utf8",
    });
  } catch (e: unknown) {
    qpdfCheckResult = `qpdf --check failed: ${(e as Error).message}\n${(e as { stdout?: string }).stdout ?? ""}`;
  }

  // 6. Trailer chain check — is there a /Prev (incremental updates)?
  const rawText = bytes.toString("latin1");
  const hasIncrementalUpdate = /\/Prev\s+\d+/.test(rawText);

  // 7. ExifTool marker — explicit signature ExifTool emits.
  const hasExifToolMarker = rawText.includes("BeginExifToolUpdate");

  return {
    file: label,
    sizeBytes,
    survivors: {
      rawStrings,
      exiftoolTagsVisible,
      exiftoolPdfUpdateRevert,
      qpdfQdfDecompressed,
      walkAllStreams: [], // filled in below (async)
    },
    hasIncrementalUpdate,
    hasExifToolMarker,
    qpdfCheckResult: qpdfCheckResult.split("\n").slice(0, 3).join(" | "),
  };
}

async function main() {
  console.log("Sentinels embedded in fixture:");
  for (const [k, v] of Object.entries(SENTINELS)) {
    console.log(` ${k}: ${v}`);
  }
  console.log();

  // 1. Generate the rich fixture.
  const fixture = await generateRichPdf();
  const inputPath = `${SCRATCH}/input.pdf`;
  writeFileSync(inputPath, fixture);
  console.log(`Generated input fixture: ${inputPath} (${fixture.length} bytes)`);

  // 2. Strip via our strategy.
  const ourStrategy = new PdfStrategy();
  const ourResult = await ourStrategy.strip({
    bytes: new Uint8Array(fixture),
    options: {
      preserveOrientation: false,
      preserveColorProfile: false,
      preserveTimestamps: false,
    },
  });
  if (!ourResult.ok) {
    console.error("our strategy failed:", ourResult.error);
    process.exit(1);
  }
  const ourPath = `${SCRATCH}/our-stripped.pdf`;
  writeFileSync(ourPath, Buffer.from(ourResult.value.bytes));
  console.log(`Our strategy: ${ourPath} (${ourResult.value.bytes.length} bytes, removed ${ourResult.value.metadataRemoved})`);

  // 3. Strip via ExifTool.
  const exiftoolPath = `${SCRATCH}/exiftool-stripped.pdf`;
  copyFileSync(inputPath, exiftoolPath);
  const etOutput = execFileSync(
    "exiftool",
    ["-all=", "-overwrite_original", exiftoolPath],
    { encoding: "utf8" },
  );
  console.log(`ExifTool -all=: ${exiftoolPath}`);
  console.log(` exiftool output: ${etOutput.trim()}`);

  // 4. Strip via Ghostscript pdfwrite (clean rewrite baseline).
  const gsPath = `${SCRATCH}/gs-stripped.pdf`;
  try {
    execFileSync(
      "gs",
      [
        "-dQUIET",
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/default",
        `-sOutputFile=${gsPath}`,
        inputPath,
      ],
      { encoding: "utf8" },
    );
    console.log(`Ghostscript pdfwrite: ${gsPath}`);
  } catch (e: unknown) {
    console.log(`Ghostscript failed: ${(e as Error).message}`);
  }

  console.log();

  // 5. Recovery battery on each output (and on the input as a sanity
  // check that sentinels are present pre-strip).
  const targets = [
    { label: "input (raw fixture)", path: inputPath },
    { label: "our strategy", path: ourPath },
    { label: "exiftool -all=", path: exiftoolPath },
  ];
  if (existsSync(gsPath)) {
    targets.push({ label: "ghostscript pdfwrite", path: gsPath });
  }

  const reports: ForensicReport[] = [];
  for (const t of targets) {
    console.log(`=== ${t.label} ===`);
    const report = runForensics(t.label, t.path);
    report.survivors.walkAllStreams = await walkAllStreams(t.path);
    reports.push(report);

    console.log(` size: ${report.sizeBytes} bytes`);
    console.log(` has /Prev: ${report.hasIncrementalUpdate}`);
    console.log(` has BeginExifToolUpdate marker: ${report.hasExifToolMarker}`);
    console.log(` qpdf --check: ${report.qpdfCheckResult}`);
    console.log(` raw strings sentinels: ${JSON.stringify(report.survivors.rawStrings)}`);
    console.log(` exiftool visible tags: ${JSON.stringify(report.survivors.exiftoolTagsVisible)}`);
    console.log(` after -PDF-update:all= revert: ${JSON.stringify(report.survivors.exiftoolPdfUpdateRevert)}`);
    console.log(` qpdf --qdf decompressed: ${JSON.stringify(report.survivors.qpdfQdfDecompressed)}`);
    console.log(` walk all streams (pdf-lib): ${JSON.stringify(report.survivors.walkAllStreams)}`);
    console.log();
  }

  // Write report as JSON for the markdown writeup.
  writeFileSync(
    `${SCRATCH}/report.json`,
    JSON.stringify({ sentinels: SENTINELS, reports }, null, 2),
  );
  console.log(`Wrote report: ${SCRATCH}/report.json`);
}

main().catch((e) => {
  console.error(e);
  process.exit(1);
});

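One gap worth noting in the battery above: `findSentinels` matches the literal ASCII text only. PDF text strings may also be written as hex strings holding UTF-16BE (for example `<FEFF0041...>`), and a sentinel stored that way would be invisible to a plain substring search of `strings` or `qpdf --qdf` output. The sketch below shows one way the search could be widened. It is an illustration, not part of the PR: `utf16beHex` and `findSentinelsAnyEncoding` are hypothetical helpers, and whether any of the writers exercised here actually emit hex strings for these fields is an assumption that would need checking against the generated files.

```typescript
// Spell `text` the way the body of a UTF-16BE hex string would:
// four upper-case hex digits per UTF-16 code unit, without the FEFF BOM.
function utf16beHex(text: string): string {
  let hex = "";
  for (let i = 0; i < text.length; i++) {
    hex += text.charCodeAt(i).toString(16).padStart(4, "0");
  }
  return hex.toUpperCase();
}

// Return the keys whose sentinel appears either as literal text or as
// hex-encoded UTF-16BE. Hex digits are matched case-insensitively.
function findSentinelsAnyEncoding<K extends string>(
  haystack: string,
  sentinels: Readonly<Record<K, string>>,
): K[] {
  const upperHaystack = haystack.toUpperCase();
  return (Object.keys(sentinels) as K[]).filter((key) => {
    const value = sentinels[key];
    return (
      haystack.includes(value) || upperHaystack.includes(utf16beHex(value))
    );
  });
}
```

This still misses hex strings that a writer breaks across lines with whitespace, and strings inside encrypted or otherwise filtered streams; it only closes the single-encoding blind spot.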
54
vite.config.web.ts
Normal file
@@ -0,0 +1,54 @@
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react";
import { VitePWA } from "vite-plugin-pwa";
import { resolve } from "node:path";
import type { Plugin } from "vite";

function webCspPlugin(): Plugin {
  return {
    name: "web-csp",
    transformIndexHtml(_html, ctx) {
      const isDev = ctx.server !== undefined;
      const scriptSrc = isDev
        ? "'self' 'unsafe-inline' 'wasm-unsafe-eval'"
        : "'self' 'wasm-unsafe-eval'";
      const styleSrc = "'self' 'unsafe-inline'";
      const connectSrc = isDev ? "'self' ws://localhost:*" : "'self'";
      return [
        {
          tag: "meta",
          attrs: {
            "http-equiv": "Content-Security-Policy",
            content: `default-src 'none'; script-src ${scriptSrc}; style-src ${styleSrc}; img-src 'self' data: blob:; font-src 'self'; connect-src ${connectSrc}; worker-src 'self' blob:; base-uri 'none'; frame-ancestors 'none'`,
          },
          injectTo: "head-prepend",
        },
      ];
    },
  };
}

export default defineConfig({
  root: resolve(__dirname, "src/web"),
  publicDir: resolve(__dirname, "public"),
  build: {
    outDir: resolve(__dirname, "dist/web"),
    emptyOutDir: true,
  },
  plugins: [
    react(),
    webCspPlugin(),
    VitePWA({
      registerType: "autoUpdate",
      manifest: false,
      workbox: {
        globPatterns: ["**/*.{js,css,html,ico,png,svg,webmanifest}"],
        maximumFileSizeToCacheInBytes: 10 * 1024 * 1024,
      },
      includeAssets: ["icon-192.png", "icon-512.png"],
      devOptions: {
        enabled: true,
      },
    }),
  ],
});
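The policy string in `webCspPlugin` is assembled by interpolating three variables into one long template literal, which makes the dev/production differences hard to see at a glance. The sketch below expresses the same policy as an ordered list of directives and joins it at the end. It is an alternative formulation for illustration only (`buildCspSketch` is a hypothetical name, not code from this PR), written so that its production output is character-for-character the same as the string in the plugin.

```typescript
type DirectiveSketch = readonly [name: string, values: readonly string[]];

function buildCspSketch(isDev: boolean): string {
  const directives: readonly DirectiveSketch[] = [
    ["default-src", ["'none'"]],
    // Dev needs inline scripts for the Vite client; both modes need WASM compilation.
    [
      "script-src",
      isDev
        ? ["'self'", "'unsafe-inline'", "'wasm-unsafe-eval'"]
        : ["'self'", "'wasm-unsafe-eval'"],
    ],
    ["style-src", ["'self'", "'unsafe-inline'"]],
    ["img-src", ["'self'", "data:", "blob:"]],
    ["font-src", ["'self'"]],
    // Dev additionally allows the HMR websocket.
    ["connect-src", isDev ? ["'self'", "ws://localhost:*"] : ["'self'"]],
    ["worker-src", ["'self'", "blob:"]],
    ["base-uri", ["'none'"]],
    ["frame-ancestors", ["'none'"]],
  ];
  return directives
    .map(([name, values]) => `${name} ${values.join(" ")}`)
    .join("; ");
}
```

One detail to keep in mind when reading the policy: the CSP specification states that `frame-ancestors` is ignored when a policy is delivered through a `<meta>` element, so in this setup that directive presumably takes effect only through the HTTP header versions in `nginx.conf` and `public/_headers`.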