Randa 068dad5340 docs(ir): add v2 example record with a populated bge-m3 1024-dim embedding
ir-examples.jsonl gains a fifth record — a Word chunk whose embedding{} is
populated (model_id bge-m3, dimensions 1024, a real 1024-length vector) —
so the embedding shape and the sink's vector->_vectors.digger_semantic
mapping have a concrete, schema-valid fixture. The four v1 records stay
embedding:null. id/parent_id are computed per the path-stable formula.
decisions/README labels it as the v2 illustration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 16:17:00 +04:00

Architecture Decision Records

This directory holds the durable design decisions for digger, plus three machine-readable contract artifacts. Read these before changing anything in the corresponding layer.

Contract artifacts

| File | What it is |
| --- | --- |
| ir-schema.json | JSON Schema (draft 2020-12) for the Canonical Document (IR) v1.0: the contract between the pipeline and any sink. The most important file in the repo. |
| ir-examples.jsonl | Worked IR records: an Arabic scanned-PDF page, an English chunked report, a mixed-language A/V transcript chunk, and a skipped Access file, all v1 (embedding: null), plus a v2 illustration: a Word chunk carrying a populated bge-m3 1024-dim embedding (the sink maps embedding.vector → _vectors.digger_semantic). |
| meilisearch-settings.json | The concrete digger_documents index settings (keyword v1; vector embedder declared but dormant). |
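
The v2 mapping from embedding.vector to _vectors.digger_semantic can be sketched roughly as below. This is an illustration under stated assumptions, not the project's sink code: the function name `to_meili_document` is hypothetical, dropping the `embedding` key from the outgoing document is an assumption, and the length check against 1024 simply mirrors the bge-m3 dimensionality mentioned above.

```python
from typing import Any

EMBEDDER_NAME = "digger_semantic"  # embedder key named in this README
EXPECTED_DIMS = 1024               # bge-m3 vector length


def to_meili_document(record: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: turn one IR record into a sink document.

    v1 records (embedding is None) pass through without a _vectors field;
    v2 records get their vector placed under _vectors.digger_semantic.
    """
    # Assumption: the sink does not forward the raw embedding object.
    doc = {key: value for key, value in record.items() if key != "embedding"}

    embedding = record.get("embedding")
    if embedding is None:
        return doc  # keyword-only v1 record

    vector = embedding["vector"]
    if len(vector) != EXPECTED_DIMS:
        raise ValueError(
            f"expected {EXPECTED_DIMS} dimensions, got {len(vector)}"
        )

    doc["_vectors"] = {EMBEDDER_NAME: vector}
    return doc
```

A v1 record therefore produces a document with no `_vectors` key, which is consistent with the embedder being declared but dormant in the index settings.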

ADRs

| ADR | Decision |
| --- | --- |
| 0001 | Strict layered architecture and the seven core interfaces |
| 0002 | The Canonical Document (IR) as the versioned contract |
| 0003 | Single Meilisearch index; chunk granularity collapsed by parent_id; limits |
| 0004 | Chunking in v1 as a Transformer concern (and the vector seam) |
| 0005 | ModelBackend interface; OCR/ASR/embed defaults; Ollama as host service |
| 0006 | Tiered, content-routed extraction (Docling / Qwen-OCR / Office libs / unoserver) |
| 0007 | Read-side SearchProvider interface and the FastAPI + HTMX UI |
| 0008 | Deduplication (one-per-path), the StateStore, incremental & delete semantics, reindex |
| 0009 | Docker Compose as primary distribution; zero-install; overridability |
| 0010 | Layered CI on Forgejo and the Windows runner |
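
To make "chunk granularity collapsed by parent_id" (ADR 0003) concrete, here is a minimal client-side sketch of the idea. It is only an illustration: the ADR may well rely on Meilisearch's own distinct-attribute setting instead, the function name `collapse_by_parent` is hypothetical, and falling back to a hit's own `id` when `parent_id` is empty is an assumption.

```python
from typing import Any


def collapse_by_parent(hits: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first (best-ranked) hit per parent_id, preserving order."""
    seen: set[str] = set()
    collapsed: list[dict[str, Any]] = []
    for hit in hits:
        # Assumption: unchunked documents have no parent_id and stand alone.
        key = hit.get("parent_id") or hit["id"]
        if key in seen:
            continue  # a better-ranked chunk of this document was already kept
        seen.add(key)
        collapsed.append(hit)
    return collapsed
```

Given hits already sorted by relevance, each source document then appears once in the result list, represented by its best-matching chunk.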

Status legend

  • Accepted — agreed and in force for v1.
  • Proposed — drafted, awaiting confirmation.
  • Superseded — replaced by a later ADR (linked).

All ADRs listed above are Accepted for v1 unless noted. They reflect the research in ../research/ (synthesized in ../research/SYNTHESIS.md) and the project brief ../digger-brief.md.