digger/docs/decisions/0002-intermediate-representation.md
Randa 96d4409845 docs(adr): path-stable parent_id for clean chunk-family replace/delete
Decouple the two identifiers so a changed file's whole chunk family can be
replaced/deleted by one stable key, with no orphans and no transient double-hit:

- id (primary key) = sha256(path|content_hash|chunk_index) — content-addressed
  per the brief; chunk_index=0 for a whole doc; changes when content changes.
- parent_id = sha256(path) — path-stable, identical across chunks and edits.

- ADR 0003 + meilisearch-settings.json: add parent_id to filterableAttributes
  (delete-by-filter needs it; it was previously distinctAttribute-only, so the
  documented delete-by-parent_id could not have executed).
- ADR 0003/0008: replace-on-change = delete family by filter(parent_id) then
  PUT new chunks; sink tasks confirmed before StateStore commit (crash-convergent).
- ADR 0002/0004 + ir-schema.json + ir-examples.jsonl: updated formulas, dropped
  the old 'parent_id == id for whole docs' framing (kept as a rejected alternative).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 15:42:39 +04:00


ADR 0002 — The Canonical Document (IR) is the contract

Status: Accepted (v1) · Schema: ir-schema.json · Examples: ir-examples.jsonl

Context

The IR is the most important artifact in the repo: every extractor produces it, every sink consumes it, and the standalone path serializes it to disk. It must be rich enough to drive keyword search now and vector/hybrid search later, while staying independent of any search engine. The brief enumerates required content: identity, source metadata, provenance, content (flat text + structured segments), format-specific metadata, derived fields, and processing status.

Two engine-imposed limits and three confirmed decisions constrain the design:

  • Meilisearch primary keys allow only [a-zA-Z0-9_-], ≤511 bytes → IDs must be hashes, not paths.
  • Meilisearch filterable values cap at 468 bytes → raw paths cannot be filter values.
  • Confirmed: one document per path (dedup), chunk long documents in v1, vectors designed-for but off.
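As a rough illustration of the first two limits, the sketch below checks a raw path and its SHA-256 hex digest against the primary-key character set and both byte caps. The helper names and the sample path are hypothetical; only the limits themselves come from the list above.

```python
import hashlib
import re

# Limits quoted in the list above.
PRIMARY_KEY_PATTERN = re.compile(r"[a-zA-Z0-9_-]+")
PRIMARY_KEY_MAX_BYTES = 511
FILTER_VALUE_MAX_BYTES = 468


def is_valid_primary_key(value: str) -> bool:
    """True if value uses only the allowed characters and fits the byte cap."""
    fits = len(value.encode("utf-8")) <= PRIMARY_KEY_MAX_BYTES
    return fits and PRIMARY_KEY_PATTERN.fullmatch(value) is not None


def fits_filter_value(value: str) -> bool:
    """True if value is short enough to be used as a filter value."""
    return len(value.encode("utf-8")) <= FILTER_VALUE_MAX_BYTES


raw_path = "/home/user/Reports 2024/Q3 summary.pdf"  # hypothetical sample path
hashed = hashlib.sha256(raw_path.encode("utf-8")).hexdigest()

print(is_valid_primary_key(raw_path))  # slashes, spaces and dots are not allowed
print(is_valid_primary_key(hashed))    # 64 hex characters are always legal
```

A hex digest has a fixed length of 64 ASCII characters, so it satisfies both caps however long the underlying path is.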

Decision

Define a single, versioned, JSON-serializable Canonical Document schema (schema_version: "1.0"). One IR record is one indexable unit:

  • A whole document = one record with chunk_index=0, chunk_count=1.
  • A chunked document = N records sharing parent_id and content_hash, each with its own content.content_text span and locator.

id is content-addressed (it changes when the file changes); parent_id is path-stable (sha256(canonical_path)), identical for every chunk of a file and across every re-extraction. That split is what lets the sink replace or delete a file's whole chunk family by one path-derived key regardless of how the chunk count changed (ADR 0008).
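The whole-document and chunked cases can be sketched as one function that emits a family of records. This is a minimal sketch, not the project's Transformer: the fixed-size character split, the function name, and the sample paths are assumptions, and id, locator, and the other field groups are left out for brevity.

```python
import hashlib


def build_records(canonical_path: str, text: str, max_chars: int) -> list[dict]:
    """Emit one record per span; a short text yields a single whole-doc record.

    The fixed-size split is a hypothetical chunking rule, and the encoded text
    stands in for the file bytes that content_hash is computed from.
    """
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    parent_id = hashlib.sha256(canonical_path.encode("utf-8")).hexdigest()
    spans = [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [""]
    return [
        {
            "parent_id": parent_id,        # same for every chunk of the file
            "content_hash": content_hash,  # same for every chunk of this version
            "chunk_index": index,
            "chunk_count": len(spans),
            "content": {"content_text": span},
        }
        for index, span in enumerate(spans)
    ]


whole = build_records("/docs/note.txt", "short note", max_chars=100)
family = build_records("/docs/report.txt", "x" * 250, max_chars=100)
print(len(whole), len(family))
```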

Field groups (full detail in the schema)

  • Identity & chunking: id, parent_id, chunk_index, chunk_count, content_hash, schema_version.
  • source: path (display/open only), filename, extension, source_folder (derived ≤468-byte facet token), mime_type, file_type (facet enum), size_bytes, created_at/modified_at (+ _epoch ints for filter/sort), host, drive.
  • provenance: extractor_name/extractor_version, processed_at, optional model{ocr_*, asr_*}.
  • content: content_text (searchable), content_truncated, title, language (facet), language_probability, tags.
  • locator (per chunk, drives deep-links): page_number/page_end, slide_number, sheet_name, timestamp_start/timestamp_end, speaker (null in v1).
  • segments: structured native units (page|slide|sheet|row|table|paragraph|transcript|note) preserved for display and deep-linking; A/V segments carry start/end and optional word timestamps.
  • metadata: open, format-specific (author, keywords, sheet_names, duration_seconds, codec, exif, …); a few sub-keys are searchable.
  • embedding: reserved, null in v1; when populated in v2, the Meilisearch sink maps vector → _vectors.digger_semantic.
  • status/warnings/errors: success|partial|failed|skipped plus structured diagnostics.

Identity formulas

content_hash = sha256_hex(file_bytes)
id           = sha256_hex(canonical_path + "|" + content_hash + "|" + chunk_index)   # chunk_index=0 for a whole doc
parent_id    = sha256_hex(canonical_path)                                            # path-stable, version-independent

canonical_path is the absolute, normalized path (case/separator-normalized per OS).
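The formulas above translate almost directly into Python. The sketch below assumes UTF-8 encoding of the joined string and uses os.path helpers as one possible reading of "case/separator-normalized per OS"; the function names are illustrative and are not taken from the codebase.

```python
import hashlib
import os


def canonicalize(path: str) -> str:
    """Hypothetical normalization: absolute, then case/separator-normalized per OS."""
    return os.path.normcase(os.path.normpath(os.path.abspath(path)))


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def compute_content_hash(file_bytes: bytes) -> str:
    return sha256_hex(file_bytes)


def compute_id(canonical_path: str, content_hash: str, chunk_index: int = 0) -> str:
    # chunk_index defaults to 0, the value used for a whole document.
    joined = f"{canonical_path}|{content_hash}|{chunk_index}"
    return sha256_hex(joined.encode("utf-8"))  # UTF-8 is an assumption


def compute_parent_id(canonical_path: str) -> str:
    # Depends on the path only, so it survives content edits.
    return sha256_hex(canonical_path.encode("utf-8"))


path = "/data/report.pdf"  # hypothetical, already canonical
v1 = compute_content_hash(b"version 1")
v2 = compute_content_hash(b"version 2")
print(compute_id(path, v1) == compute_id(path, v1))  # same version, same id
print(compute_id(path, v1) == compute_id(path, v2))  # edited file, new id
```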

  • id is idempotent per version: re-extracting an unchanged file yields identical ids, so a PUT upsert is a no-op; a changed file yields new ids for every chunk, so old chunks are not silently overwritten.
  • parent_id is version-independent: it depends only on the path, so it is the same before and after an edit. The sink uses it to delete/replace a file's entire chunk family in one filtered operation (no need to remember the previous content_hash), and because old and new chunks of the same path share it, a query mid-replace still collapses them to a single hit (ADR 0003, ADR 0008).
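To make the replace-on-change behaviour concrete, the sketch below simulates a sink with a plain dictionary keyed by id. It is a toy model under stated assumptions: the in-memory store, the helper names, and the joined chunk text standing in for the file bytes are all hypothetical, and a real sink would issue a filtered delete followed by an upsert instead.

```python
import hashlib


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_family(path: str, chunks: list[str]) -> list[dict]:
    """Build the records for one version of a file."""
    content_hash = sha256_hex("".join(chunks))  # stand-in for sha256 of the file bytes
    return [
        {
            "id": sha256_hex(f"{path}|{content_hash}|{i}"),
            "parent_id": sha256_hex(path),
            "chunk_index": i,
            "content_text": chunk,
        }
        for i, chunk in enumerate(chunks)
    ]


def replace_family(store: dict[str, dict], records: list[dict]) -> None:
    """Delete every record sharing the family's parent_id, then upsert the new ones."""
    family_key = records[0]["parent_id"]
    stale = [rid for rid, rec in store.items() if rec["parent_id"] == family_key]
    for rid in stale:
        del store[rid]
    for rec in records:
        store[rec["id"]] = rec


store: dict[str, dict] = {}
replace_family(store, make_family("/docs/a.txt", ["one", "two", "three"]))
replace_family(store, make_family("/docs/b.txt", ["other"]))
replace_family(store, make_family("/docs/a.txt", ["one", "2"]))  # edit: 3 chunks become 2

a_family = [r for r in store.values() if r["parent_id"] == sha256_hex("/docs/a.txt")]
print(len(store), len(a_family))
```

The delete never needs the previous content_hash or chunk count: the path alone identifies everything that has to go, which is why the third call leaves no orphaned chunk behind.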

Consequences

  • Engine-agnostic: no Meilisearch concept appears in the IR. A different sink maps the same records to its own model.
  • Standalone-ready: the FileSink writes these records to JSONL unchanged; the IR is the on-disk format.
  • Forward-compatible: chunk fields, locator, embedding, and tags are present from v1 (mostly null/default in the whole-doc, keyword-only case), so changing the chunking strategy or switching on vectors later touches only the Transformer and the sink mapping, never the schema shape.
  • Deep-linkable: because page/slide/sheet/timestamp live on each record, the UI can jump to the exact location of a hit without re-parsing the source.
  • Versioned: schema_version plus provenance.extractor_version/model let the reindex command detect records produced by an older schema or model and reprocess them (ADR 0008).
  • Validation: ir-schema.json is enforced in unit tests (round-trip serialize + schema-validate) so drift is caught in CI.
  • Docling interop: when a record originates from Docling (ADR 0006), content_hash is computed independently as SHA-256 of the file bytes — not taken from DoclingDocument.origin.binary_hash (a uint64, unsuitable for our stable IDs). DocumentOrigin.filename/mimetype map to source.filename/source.mime_type; core Office document properties are filled by the OfficeMetadataAugmenter, since Docling's backends do not expose them.
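The standalone and validation points above can be illustrated with a JSONL round trip. This is a sketch only: the helper names and sample values are hypothetical, and the required-key check is a deliberately small stand-in for validating against ir-schema.json, which remains the authoritative definition.

```python
import io
import json

# Identity and chunking fields named in this ADR; treating them all as
# required is an assumption made for the sake of the example.
REQUIRED_KEYS = {"id", "parent_id", "chunk_index", "chunk_count", "content_hash", "schema_version"}


def write_jsonl(records: list[dict], stream) -> None:
    """Write one JSON object per line."""
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n")


def read_jsonl(stream) -> list[dict]:
    return [json.loads(line) for line in stream if line.strip()]


def missing_keys(record: dict) -> set[str]:
    return REQUIRED_KEYS - set(record)


records = [
    {
        "id": "a" * 64,            # placeholder digests, not real hashes
        "parent_id": "b" * 64,
        "content_hash": "c" * 64,
        "chunk_index": 0,
        "chunk_count": 1,
        "schema_version": "1.0",
        "content": {"content_text": "café menu", "title": None},
    }
]

buffer = io.StringIO()
write_jsonl(records, buffer)
buffer.seek(0)
restored = read_jsonl(buffer)
print(restored == records, missing_keys(restored[0]))
```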

Alternatives considered

  • Path-derived IDs. Rejected — illegal characters and the 511-byte limit; also brittle across OSes. Hashing fixes both: id = sha256(path|content_hash|chunk_index) is valid and version-specific; parent_id = sha256(path) is valid and version-stable.
  • Content-dependent parent_id (sha256(path|content_hash), so parent_id == id for whole docs). Rejected — a content edit would change parent_id, forcing the pipeline to remember the previous parent_id to clean up old chunks and briefly exposing old+new families as two separate hits (they would no longer share a parent_id to collapse). Path-stable parent_id avoids both.
  • Separate schemas per file type. Rejected — multiplies the contract and complicates the sink/UI. One schema with optional, type-appropriate fields (locator, segments, metadata) is simpler and keeps a single index viable.
  • Embeddings inline as a top-level array. Deferred — kept under embedding{} with model id/version so vectors are self-describing and the sink controls the _vectors mapping; avoids committing the top-level shape to one engine's convention.