Decouple the two identifiers so a changed file's whole chunk family can be replaced/deleted by one stable key, with no orphans and no transient double-hit:
- id (primary key) = sha256(path|content_hash|chunk_index): content-addressed per the brief; chunk_index=0 for a whole doc; changes when content changes.
- parent_id = sha256(path): path-stable, identical across chunks and edits.
- ADR 0003 + meilisearch-settings.json: add parent_id to filterableAttributes (delete-by-filter needs it; it was previously distinctAttribute-only, so the documented delete-by-parent_id could not have executed).
- ADR 0003/0008: replace-on-change = delete family by filter(parent_id) then PUT new chunks; sink tasks confirmed before StateStore commit (crash-convergent).
- ADR 0002/0004 + ir-schema.json + ir-examples.jsonl: updated formulas, dropped the old 'parent_id == id for whole docs' framing (kept as a rejected alternative).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ADR 0002 — The Canonical Document (IR) is the contract
Status: Accepted (v1)
Schema: ir-schema.json · Examples: ir-examples.jsonl
Context
The IR is the most important artifact in the repo: every extractor produces it, every sink consumes it, and the standalone path serializes it to disk. It must be rich enough to drive keyword search now and vector/hybrid search later, while staying independent of any search engine. The brief enumerates required content: identity, source metadata, provenance, content (flat text + structured segments), format-specific metadata, derived fields, and processing status.
Two engine-imposed limits and three confirmed decisions constrain the design:
- Meilisearch primary keys allow only [a-zA-Z0-9_-], ≤511 bytes → IDs must be hashes, not paths.
- Meilisearch filterable values cap at 468 bytes → raw paths cannot be filter values.
- Confirmed: one document per path (dedup), chunk long documents in v1, vectors designed-for but off.
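The two engine limits can be checked mechanically. The sketch below is illustrative only: the helper names and the sample path are hypothetical, and the limits are the ones quoted in the list above.

```python
import hashlib
import re

# Limits quoted in the Context list above.
PRIMARY_KEY_PATTERN = re.compile(r"[a-zA-Z0-9_-]+")
PRIMARY_KEY_MAX_BYTES = 511
FILTER_VALUE_MAX_BYTES = 468


def is_valid_primary_key(value: str) -> bool:
    """True if value uses only the allowed characters and fits in 511 bytes."""
    return (
        PRIMARY_KEY_PATTERN.fullmatch(value) is not None
        and len(value.encode("utf-8")) <= PRIMARY_KEY_MAX_BYTES
    )


def fits_filter_value(value: str) -> bool:
    """True if value is within the 468-byte filterable-value cap."""
    return len(value.encode("utf-8")) <= FILTER_VALUE_MAX_BYTES


# Hypothetical path, used only for illustration.
raw_path = "/srv/share/Reports 2024/Q3 summary.pdf"
hashed = hashlib.sha256(raw_path.encode("utf-8")).hexdigest()

print(is_valid_primary_key(raw_path))  # False: contains "/", " " and "."
print(is_valid_primary_key(hashed))    # True: 64 hex characters
print(fits_filter_value(hashed))       # True: 64 bytes, well under 468
```

A SHA-256 hex digest is always 64 characters from [0-9a-f], so it satisfies both limits regardless of how long or exotic the original path is.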
Decision
Define a single, versioned, JSON-serializable Canonical Document schema (schema_version: "1.0"). One IR record is one indexable unit:
- A whole document = one record with chunk_index=0, chunk_count=1.
- A chunked document = N records sharing parent_id and content_hash, each with its own content.content_text span and locator.
id is content-addressed (it changes when the file changes); parent_id is path-stable (sha256(canonical_path)), identical for every chunk of a file and across every re-extraction. That split is what lets the sink replace or delete a file's whole chunk family by one path-derived key regardless of how the chunk count changed (ADR 0008).
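To make the family-replacement idea concrete, here is a minimal in-memory sketch. It is not the sink's implementation: the dict stands in for the search index, `replace_family` is a hypothetical helper, and the ids are short placeholders rather than real hashes.

```python
from typing import Dict, List

Record = Dict[str, object]


def replace_family(index: Dict[str, Record], parent_id: str, new_chunks: List[Record]) -> None:
    """Replace every record of one file: delete by parent_id, then upsert."""
    # Step 1: drop the whole old family, however many chunks it had.
    stale = [doc_id for doc_id, rec in index.items() if rec["parent_id"] == parent_id]
    for doc_id in stale:
        del index[doc_id]
    # Step 2: write the new chunks under their new content-addressed ids.
    for rec in new_chunks:
        index[str(rec["id"])] = rec


index: Dict[str, Record] = {}

# Placeholder ids: "v1-*" is the old version (3 chunks), "v2-*" the edited one (1 chunk).
v1 = [{"id": f"v1-{i}", "parent_id": "P", "chunk_index": i} for i in range(3)]
v2 = [{"id": "v2-0", "parent_id": "P", "chunk_index": 0}]
other = {"id": "x-0", "parent_id": "Q", "chunk_index": 0}  # a different file

replace_family(index, "P", v1)
replace_family(index, "Q", [other])
replace_family(index, "P", v2)  # the file shrank from 3 chunks to 1

print(sorted(index))  # ['v2-0', 'x-0']
```

Because the delete step keys on parent_id alone, the caller never needs to know the previous content_hash or the previous chunk count, and records belonging to other paths are untouched.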
Field groups (full detail in the schema)
- Identity & chunking: id, parent_id, chunk_index, chunk_count, content_hash, schema_version.
- source: path (display/open only), filename, extension, source_folder (derived ≤468-byte facet token), mime_type, file_type (facet enum), size_bytes, created_at/modified_at (+ _epoch ints for filter/sort), host, drive.
- provenance: extractor_name/extractor_version, processed_at, optional model {ocr_*, asr_*}.
- content: content_text (searchable), content_truncated, title, language (facet), language_probability, tags.
- locator (per chunk, drives deep-links): page_number/page_end, slide_number, sheet_name, timestamp_start/timestamp_end, speaker (null in v1).
- segments: structured native units (page|slide|sheet|row|table|paragraph|transcript|note) preserved for display and deep-linking; A/V segments carry start/end and optional word timestamps.
- metadata: open, format-specific (author, keywords, sheet_names, duration_seconds, codec, exif, …); a few sub-keys are searchable.
- embedding: reserved, null in v1; when populated in v2, the Meilisearch sink maps vector → _vectors.digger_semantic.
- status / warnings / errors: success|partial|failed|skipped plus structured diagnostics.
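As a rough illustration of how these groups might fit together, the sketch below builds a hypothetical whole-document record and serializes it as one JSONL line. Every value is invented, only a subset of fields is shown, and the nesting (for example, segments at the top level) is an assumption; ir-schema.json and ir-examples.jsonl remain the authoritative definition.

```python
import json

# Hypothetical record: all values are invented and the nesting is an assumption.
record = {
    "schema_version": "1.0",
    "id": "<sha256 hex of path|content_hash|0>",       # placeholder, not a real hash
    "parent_id": "<sha256 hex of canonical_path>",      # placeholder, not a real hash
    "chunk_index": 0,
    "chunk_count": 1,
    "content_hash": "<sha256 hex of file bytes>",       # placeholder, not a real hash
    "source": {
        "path": "/srv/share/reports/q3.pdf",            # display/open only
        "filename": "q3.pdf",
        "extension": "pdf",
        "mime_type": "application/pdf",
        "file_type": "pdf",                              # assumed enum value
        "size_bytes": 48213,
    },
    "provenance": {"extractor_name": "example-extractor", "extractor_version": "0.0.0"},
    "content": {
        "content_text": "Quarterly summary ...",
        "content_truncated": False,
        "title": "Q3 summary",
        "language": "en",
        "tags": [],
    },
    "locator": {"page_number": 1, "page_end": 4, "speaker": None},
    "segments": [],
    "metadata": {},
    "embedding": None,   # reserved, null in v1
    "status": "success",
    "warnings": [],
    "errors": [],
}

# One record per line, as a JSONL file would store it.
line = json.dumps(record, ensure_ascii=False)
assert "\n" not in line
assert json.loads(line) == record  # round-trips without loss
```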
Identity formulas
content_hash = sha256_hex(file_bytes)
id = sha256_hex(canonical_path + "|" + content_hash + "|" + chunk_index) # chunk_index=0 for a whole doc
parent_id = sha256_hex(canonical_path) # path-stable, version-independent
canonical_path is the absolute, normalized path (case/separator-normalized per OS).
- id is idempotent per version: re-extracting an unchanged file yields identical ids, so a PUT upsert is a no-op; a changed file yields new ids for every chunk, so old chunks are not silently overwritten.
- parent_id is version-independent: it depends only on the path, so it is the same before and after an edit. The sink uses it to delete/replace a file's entire chunk family in one filtered operation (no need to remember the previous content_hash), and because old and new chunks of the same path share it, a query mid-replace still collapses them to a single hit (ADR 0003, ADR 0008).
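The formulas translate almost directly into code. The following is a minimal sketch, assuming the path and the joined key are UTF-8 encoded and chunk_index is rendered in decimal (neither detail is stated above); the function names and the sample path are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def content_hash(file_bytes: bytes) -> str:
    """Hash of the raw file bytes."""
    return sha256_hex(file_bytes)


def record_id(canonical_path: str, content_hash_hex: str, chunk_index: int = 0) -> str:
    """Content-addressed id; chunk_index=0 for a whole document."""
    key = f"{canonical_path}|{content_hash_hex}|{chunk_index}"
    return sha256_hex(key.encode("utf-8"))  # UTF-8 is an assumption


def parent_id(canonical_path: str) -> str:
    """Path-stable key shared by every chunk and every version of a file."""
    return sha256_hex(canonical_path.encode("utf-8"))  # UTF-8 is an assumption


path = "/srv/share/reports/q3.pdf"  # hypothetical canonical path
before = content_hash(b"original bytes")
after = content_hash(b"edited bytes")

# Unchanged file: identical id on re-extraction, so the upsert is a no-op.
assert record_id(path, before, 0) == record_id(path, before, 0)
# Edited file: every chunk gets a new id.
assert record_id(path, before, 0) != record_id(path, after, 0)
# Chunks of one version differ from each other.
assert record_id(path, after, 0) != record_id(path, after, 1)
# parent_id ignores content entirely, so it survives the edit.
assert parent_id(path) == parent_id(path)
```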
Consequences
- Engine-agnostic: no Meilisearch concept appears in the IR. A different sink maps the same records to its own model.
- Standalone-ready: the FileSink writes these records to JSONL unchanged; the IR is the on-disk format.
- Forward-compatible: chunk fields, locator, embedding, and tags are present from v1 (mostly null/default in the whole-doc, keyword-only case), so adding chunking and vectors later touches the Transformer and the sink mapping, never the schema shape.
- Deep-linkable: because page/slide/sheet/timestamp live on each record, the UI can jump to the exact location of a hit without re-parsing the source.
- Versioned: schema_version plus provenance.extractor_version/model let the reindex command detect records produced by an older schema or model and reprocess them (ADR 0008).
- Validation: ir-schema.json is enforced in unit tests (round-trip serialize + schema-validate) so drift is caught in CI.
- Docling interop: when a record originates from Docling (ADR 0006), content_hash is computed independently as SHA-256 of the file bytes, not taken from DoclingDocument.origin.binary_hash (a uint64, unsuitable for our stable IDs). DocumentOrigin.filename/mimetype map to source.filename/source.mime_type; core Office document properties are filled by the OfficeMetadataAugmenter, since Docling's backends do not expose them.
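The "Versioned" consequence could be exercised by a check along these lines. This is only a sketch under assumptions: `needs_reprocess` is a hypothetical helper, it treats any version mismatch as stale rather than comparing version order, and the actual reindex policy is defined in ADR 0008, not here.

```python
from typing import Any, Dict


def needs_reprocess(
    record: Dict[str, Any],
    current_schema_version: str,
    current_extractor_versions: Dict[str, str],
) -> bool:
    """Return True if the record was produced by a different schema or extractor version.

    Assumption: any mismatch counts as stale; no ordering of versions is attempted.
    """
    if record.get("schema_version") != current_schema_version:
        return True
    provenance = record.get("provenance") or {}
    name = provenance.get("extractor_name")
    expected = current_extractor_versions.get(name)
    # Unknown extractors are left alone in this sketch.
    return expected is not None and provenance.get("extractor_version") != expected


# Hypothetical extractor names and versions, for illustration only.
current = {"example-extractor": "2.0.0"}
fresh = {
    "schema_version": "1.0",
    "provenance": {"extractor_name": "example-extractor", "extractor_version": "2.0.0"},
}
stale = {
    "schema_version": "1.0",
    "provenance": {"extractor_name": "example-extractor", "extractor_version": "1.4.0"},
}

print(needs_reprocess(fresh, "1.0", current))  # False
print(needs_reprocess(stale, "1.0", current))  # True
```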
Alternatives considered
- Path-derived IDs. Rejected: illegal characters and the 511-byte limit; also brittle across OSes. Hashing fixes both: id = sha256(path|content_hash|chunk_index) is valid and version-specific; parent_id = sha256(path) is valid and version-stable.
- Content-dependent parent_id (sha256(path|content_hash), so parent_id == id for whole docs). Rejected: a content edit would change parent_id, forcing the pipeline to remember the previous parent_id to clean up old chunks and briefly exposing old+new families as two separate hits (they would no longer share a parent_id to collapse). Path-stable parent_id avoids both.
- Separate schemas per file type. Rejected: multiplies the contract and complicates the sink/UI. One schema with optional, type-appropriate fields (locator, segments, metadata) is simpler and keeps a single index viable.
- Embeddings inline as a top-level array. Deferred: kept under embedding{} with model id/version so vectors are self-describing and the sink controls the _vectors mapping; avoids committing the top-level shape to one engine's convention.
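To show why keeping vectors under embedding{} leaves the engine-specific shape to the sink, here is a hedged sketch of a sink-side mapping. The function name is hypothetical, the sub-keys other than vector are assumed, and the embedder name digger_semantic is the one given in the field list above.

```python
from typing import Any, Dict

EMBEDDER_NAME = "digger_semantic"  # embedder name from the field list above


def to_meilisearch_doc(record: Dict[str, Any]) -> Dict[str, Any]:
    """Copy an IR record and translate embedding.vector into _vectors.<embedder>.

    The IR itself never carries _vectors; only this sink-side copy does.
    """
    doc = {key: value for key, value in record.items() if key != "embedding"}
    embedding = record.get("embedding")
    if embedding and embedding.get("vector") is not None:
        doc["_vectors"] = {EMBEDDER_NAME: embedding["vector"]}
    return doc


# v1 record: embedding is null, so no _vectors key is emitted.
v1_record = {"id": "abc", "embedding": None}
# Hypothetical v2 record; "model" and "version" key names are assumptions.
v2_record = {
    "id": "abc",
    "embedding": {"model": "example-model", "version": "1", "vector": [0.1, 0.2, 0.3]},
}

print(to_meilisearch_doc(v1_record))  # {'id': 'abc'}
print(to_meilisearch_doc(v2_record))  # {'id': 'abc', '_vectors': {'digger_semantic': [0.1, 0.2, 0.3]}}
```

A different sink would replace only this function; the IR record and its embedding{} block stay the same.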