digger/docs/decisions/README.md
Randa 068dad5340 docs(ir): add v2 example record with a populated bge-m3 1024-dim embedding
ir-examples.jsonl gains a fifth record — a Word chunk whose embedding{} is
populated (model_id bge-m3, dimensions 1024, a real 1024-length vector) —
so the embedding shape and the sink's vector->_vectors.digger_semantic
mapping have a concrete, schema-valid fixture. The four v1 records stay
embedding:null. id/parent_id are computed per the path-stable formula.
decisions/README labels it as the v2 illustration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 16:17:00 +04:00


Architecture Decision Records

This directory holds the durable design decisions for digger, plus the machine-readable contract artifacts listed below. Read these before changing anything in the corresponding layer.

Contract artifacts

| File | What it is |
| --- | --- |
| ir-schema.json | JSON Schema (draft 2020-12) for the Canonical Document (IR) v1.0: the contract between the pipeline and any sink. The most important file in the repo. |
| ir-examples.jsonl | Worked IR records: an Arabic scanned-PDF page, an English chunked report, a mixed-language A/V transcript chunk, and a skipped Access file, all v1 (embedding: null), plus a v2 illustration, a Word chunk carrying a populated bge-m3 1024-dim embedding (the sink maps embedding.vector → _vectors.digger_semantic). |
| meilisearch-settings.json | The concrete digger_documents index settings (keyword v1; vector embedder declared but dormant). |
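The embedding mapping mentioned for ir-examples.jsonl can be sketched roughly as follows. This is an illustrative sketch only, not the sink's actual code: the function name to_sink_document is hypothetical, and dropping the embedding key from the outgoing document and omitting _vectors for v1 records are assumptions. Only the field names embedding.vector, embedding.dimensions, embedding.model_id and the target _vectors.digger_semantic come from the text above.

```python
from typing import Any


def to_sink_document(record: dict[str, Any],
                     embedder: str = "digger_semantic") -> dict[str, Any]:
    """Hypothetical helper: copy an IR record into a sink document.

    A populated embedding (v2) is moved to _vectors.<embedder>;
    a null embedding (v1) produces no _vectors key at all.
    """
    # Assumption: the sink document carries every IR field except `embedding`.
    doc = {key: value for key, value in record.items() if key != "embedding"}

    embedding = record.get("embedding")
    if embedding is not None:
        vector = embedding["vector"]
        # Guard against a vector whose length disagrees with its declared size.
        if len(vector) != embedding["dimensions"]:
            raise ValueError(
                f"expected {embedding['dimensions']} dimensions, got {len(vector)}"
            )
        doc["_vectors"] = {embedder: vector}
    return doc


# Toy records: a 3-dim vector stands in for the real 1024-dim one.
v2_record = {
    "id": "chunk-1",
    "parent_id": "doc-1",
    "embedding": {"model_id": "bge-m3", "dimensions": 3, "vector": [0.1, 0.2, 0.3]},
}
v1_record = {"id": "chunk-2", "parent_id": "doc-1", "embedding": None}

v2_doc = to_sink_document(v2_record)
v1_doc = to_sink_document(v1_record)
```

Under these assumptions, v2_doc gains a _vectors entry keyed by digger_semantic, while v1_doc is passed through without one, which matches the "declared but dormant" embedder described for v1.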

ADRs

| ADR | Decision |
| --- | --- |
| 0001 | Strict layered architecture and the seven core interfaces |
| 0002 | The Canonical Document (IR) as the versioned contract |
| 0003 | Single Meilisearch index; chunk granularity collapsed by parent_id; limits |
| 0004 | Chunking in v1 as a Transformer concern (and the vector seam) |
| 0005 | ModelBackend interface; OCR/ASR/embed defaults; Ollama as host service |
| 0006 | Tiered, content-routed extraction (Docling / Qwen-OCR / Office libs / unoserver) |
| 0007 | Read-side SearchProvider interface and the FastAPI + HTMX UI |
| 0008 | Deduplication (one-per-path), the StateStore, incremental & delete semantics, reindex |
| 0009 | Docker Compose as primary distribution; zero-install; overridability |
| 0010 | Layered CI on Forgejo and the Windows runner |

Status legend

  • Accepted — agreed and in force for v1.
  • Proposed — drafted, awaiting confirmation.
  • Superseded — replaced by a later ADR (linked).

All ADRs listed above are Accepted for v1 unless noted. They reflect the research in ../research/ (synthesized in ../research/SYNTHESIS.md) and the project brief ../digger-brief.md.