ir-examples.jsonl gains a fifth record — a Word chunk whose embedding{} is
populated (model_id bge-m3, dimensions 1024, a real 1024-length vector) —
so the embedding shape and the sink's vector->_vectors.digger_semantic
mapping have a concrete, schema-valid fixture. The four v1 records stay
embedding:null. id/parent_id are computed per the path-stable formula.
decisions/README labels it as the v2 illustration.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| File |
|---|
| 0001-architecture-and-layering.md |
| 0002-intermediate-representation.md |
| 0003-meilisearch-index-design.md |
| 0004-chunking-strategy.md |
| 0005-model-backends-and-ollama.md |
| 0006-document-conversion-routing.md |
| 0007-search-provider-and-ui.md |
| 0008-incremental-dedup-statestore.md |
| 0009-packaging-and-deployment.md |
| 0010-ci-and-windows-runner.md |
| ir-examples.jsonl |
| ir-schema.json |
| meilisearch-settings.json |
| README.md |
Architecture Decision Records
This directory holds the durable design decisions for digger, plus the machine-readable contract artifacts. Read these before changing anything in the corresponding layer.
Contract artifacts
| File | What it is |
|---|---|
| ir-schema.json | JSON Schema (draft 2020-12) for the Canonical Document (IR) v1.0: the contract between the pipeline and any sink. The most important file in the repo. |
| ir-examples.jsonl | Worked IR records. Four are v1 (embedding: null): an Arabic scanned-PDF page, an English chunked report, a mixed-language A/V transcript chunk, and a skipped Access file. The fifth is a v2 illustration: a Word chunk carrying a populated bge-m3 1024-dim embedding (the sink maps embedding.vector → _vectors.digger_semantic). |
| meilisearch-settings.json | The concrete digger_documents index settings (keyword v1; vector embedder declared but dormant). |
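
The embedding shape and the sink mapping described for ir-examples.jsonl can be sketched roughly as follows. This is a minimal illustration, not digger's actual sink code: the helper names `embedding_is_consistent` and `to_sink_document` are hypothetical, the toy records carry only the fields named in this README (the real schema in ir-schema.json defines many more), and the `id`/`parent_id` values are placeholders rather than output of the path-stable formula.

```python
def embedding_is_consistent(record: dict) -> bool:
    """Return True for v1 records (embedding is null) or for records
    whose declared dimensions match the actual vector length."""
    emb = record.get("embedding")
    if emb is None:
        return True
    return emb["dimensions"] == len(emb["vector"])


def to_sink_document(record: dict, embedder: str = "digger_semantic") -> dict:
    """Copy an IR record into a sink document, moving embedding.vector
    to _vectors.<embedder>. Hypothetical helper for illustration only."""
    doc = {k: v for k, v in record.items() if k != "embedding"}
    emb = record.get("embedding")
    if emb is not None:
        doc["_vectors"] = {embedder: emb["vector"]}
    return doc


# Toy records: placeholder ids, only the fields mentioned in this README.
v1_record = {"id": "doc-1", "parent_id": None, "embedding": None}
v2_record = {
    "id": "doc-2-chunk-0",
    "parent_id": "doc-2",
    "embedding": {
        "model_id": "bge-m3",
        "dimensions": 1024,
        "vector": [0.0] * 1024,  # stand-in for a real 1024-length vector
    },
}

v1_doc = to_sink_document(v1_record)
v2_doc = to_sink_document(v2_record)
```

In this sketch a v1 record passes through without a `_vectors` key, while the v2 record ends up with its 1024 floats under `_vectors.digger_semantic`. How the real sink treats v1 records while the embedder is dormant is not stated here.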
ADRs
| ADR | Decision |
|---|---|
| 0001 | Strict layered architecture and the seven core interfaces |
| 0002 | The Canonical Document (IR) as the versioned contract |
| 0003 | Single Meilisearch index; chunk granularity collapsed by parent_id; limits |
| 0004 | Chunking in v1 as a Transformer concern (and the vector seam) |
| 0005 | ModelBackend interface; OCR/ASR/embed defaults; Ollama as host service |
| 0006 | Tiered, content-routed extraction (Docling / Qwen-OCR / Office libs / unoserver) |
| 0007 | Read-side SearchProvider interface and the FastAPI + HTMX UI |
| 0008 | Deduplication (one-per-path), the StateStore, incremental & delete semantics, reindex |
| 0009 | Docker Compose as primary distribution; zero-install; overridability |
| 0010 | Layered CI on Forgejo and the Windows runner |
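
To make the "chunk granularity collapsed by parent_id" idea in ADR 0003 concrete, here is a minimal client-side sketch of collapsing ranked hits so each document appears once. It is an assumption-laden illustration: `collapse_by_parent` is a hypothetical helper, the hit dictionaries are toy data, and it assumes a whole-document record has a null `parent_id` and so groups under its own `id`. Meilisearch offers a `distinctAttribute` index setting that performs this kind of collapsing server-side; whether digger relies on it is a detail for ADR 0003 and meilisearch-settings.json, not something this sketch asserts.

```python
def collapse_by_parent(hits: list[dict]) -> list[dict]:
    """Keep only the first (best-ranked) hit per document.

    Chunks are grouped by parent_id; a record without a parent_id is
    grouped under its own id (an assumption made for this sketch).
    """
    seen: set[str] = set()
    collapsed: list[dict] = []
    for hit in hits:
        key = hit.get("parent_id") or hit["id"]
        if key not in seen:
            seen.add(key)
            collapsed.append(hit)
    return collapsed


# Toy hits in ranking order; ids are placeholders.
ranked_hits = [
    {"id": "a-chunk-0", "parent_id": "a"},
    {"id": "b", "parent_id": None},
    {"id": "a-chunk-1", "parent_id": "a"},
]
collapsed_ids = [h["id"] for h in collapse_by_parent(ranked_hits)]
```

With the toy data above, the second chunk of document "a" is dropped because a better-ranked chunk of the same parent was already kept.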
Status legend
- Accepted — agreed and in force for v1.
- Proposed — drafted, awaiting confirmation.
- Superseded — replaced by a later ADR (linked).
All ADRs listed above are Accepted for v1 unless noted. They reflect the research in ../research/ (synthesized in ../research/SYNTHESIS.md) and the project brief ../digger-brief.md.