digger/docs/decisions/0001-architecture-and-layering.md
2026-07-01 15:28:43 +04:00

ADR 0001 — Architecture and layering

Status: Accepted (v1)

Context

digger walks files, extracts their content (including scanned documents, Office files, and audio/video via local models), normalizes everything into one intermediate representation (IR), and feeds that into a swappable search backend. Two hard requirements from the brief shape every structural choice:

  1. The pipeline must be usable standalone, without Meilisearch — it can emit IR to disk on its own.
  2. The search backend must be swappable behind an interface.

The brief also requires strict layering (each stage talks to the next only through a documented interface), local-first inference, idempotent and incremental runs, and fail-isolated error handling.

Decision

Adopt a strict, one-directional pipeline of stages connected only through explicit Python Protocol interfaces. No stage imports a concrete implementation of another stage.

filesystem
  │
  ▼
[ Source / Walker ]      discovers files, yields FileRef + filesystem metadata
  │
  ▼
[ Router ]               picks an Extractor by type / MIME / content sniffing
  │
  ▼
[ Extractor (per format) ]  uses ModelBackend(s) as needed (OCR / ASR / embed)
  │
  ▼
[ Canonical Document (IR) ]  THE CONTRACT — serializable to JSONL on disk
  │
  ▼
[ Transformer / Enricher ]   normalize, detect language, CHUNK, (future) embed
  │
  ▼
[ Sink / Indexer (interface) ]   FileSink (IR→disk) | MeilisearchSink | …
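
The Router step in the diagram can be illustrated with a small sketch. It is only an assumption about how "type / MIME / content sniffing" might be ordered: the file name is tried first through the standard-library mimetypes module, and magic bytes are the fallback. The names detect_mime, pick_extractor, MAGIC_PREFIXES, and EXTRACTOR_BY_MIME, along with the extractor names, are hypothetical and do not come from digger's codebase.

```python
import mimetypes
from typing import Dict, Optional

# Hypothetical magic-byte prefixes used when the file name gives no hint.
MAGIC_PREFIXES: Dict[bytes, str] = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
}

# Hypothetical registry: MIME type -> name of the extractor that handles it.
EXTRACTOR_BY_MIME: Dict[str, str] = {
    "application/pdf": "pdf",
    "image/png": "image_ocr",
    "text/plain": "plain_text",
}


def detect_mime(path: str, head: bytes = b"") -> Optional[str]:
    """Guess a MIME type from the file name, then fall back to magic bytes."""
    guessed, _encoding = mimetypes.guess_type(path)
    if guessed is not None:
        return guessed
    for prefix, mime in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return mime
    return None


def pick_extractor(path: str, head: bytes = b"") -> Optional[str]:
    """Return the registered extractor name, or None if nothing matches."""
    mime = detect_mime(path, head)
    if mime is None:
        return None
    return EXTRACTOR_BY_MIME.get(mime)
```

Returning None for an unrecognized file, rather than raising, is one way to keep routing failures isolated to a single file; the caller can log and skip it.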

Cross-cutting components:

  • ModelBackend — abstracts local OCR / ASR / (future) embedding models. Extractors depend on the interface, never a concrete model. Default backends talk to Ollama (OCR) and, in v1, Docling's built-in ASR (a dedicated faster-whisper extractor is the v2 ASR upgrade). See ADR 0005.
  • SearchProvider — the read-side mirror of Sink. The UI and any search API talk only to this. See ADR 0007.
  • StateStore — local SQLite recording content-hash + mtime per path to drive incremental runs and deletion handling. See ADR 0008.
  • Config — a single TOML file with env-var overrides, selecting active sink, model backends, source roots, concurrency, and per-format options.
  • CLI: scan, extract (files → IR on disk), index (IR → sink), run (end-to-end), status, and reindex. The extract/index split is what makes the pipeline usable without a search engine.
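
As an illustration of the StateStore bullet above, here is a minimal sketch of a SQLite-backed store that records a content hash and mtime per path. The class name SqliteStateStore, the table layout, and the exact method signatures are assumptions made for this example; the actual design is specified in ADR 0008.

```python
import sqlite3
from typing import Iterable, List


class SqliteStateStore:
    """Hypothetical StateStore: one row per path with its content hash and mtime."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS files ("
            "path TEXT PRIMARY KEY, hash TEXT NOT NULL, mtime REAL NOT NULL)"
        )

    def seen(self, path: str, content_hash: str) -> bool:
        """True if this path was already processed with the same content hash."""
        row = self._db.execute(
            "SELECT hash FROM files WHERE path = ?", (path,)
        ).fetchone()
        return row is not None and row[0] == content_hash

    def mark(self, path: str, content_hash: str, mtime: float) -> None:
        """Record (or overwrite) the processed state for a path."""
        self._db.execute(
            "INSERT OR REPLACE INTO files (path, hash, mtime) VALUES (?, ?, ?)",
            (path, content_hash, mtime),
        )
        self._db.commit()

    def deletions(self, current_paths: Iterable[str]) -> List[str]:
        """Paths recorded earlier that are missing from the current walk."""
        known = {row[0] for row in self._db.execute("SELECT path FROM files")}
        return sorted(known - set(current_paths))
```

An incremental run would skip any file for which seen() returns True, call mark() after a successful extraction, and pass the result of deletions() to the sink's delete().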

The seven interfaces (defined in Sprint 0, before any extractor)

| Interface | Responsibility | Key method(s) |
| --- | --- | --- |
| Source | Discover files under roots, yield references + fs metadata | walk() -> Iterable[FileRef] |
| Extractor | Turn one file into one or more IR records | extract(FileRef, ModelBackend) -> Iterable[CanonicalDocument] |
| ModelBackend | Local inference (OCR/ASR/embed) | ocr_image(...), transcribe(...), embed(...) |
| Transformer | Post-extraction enrichment incl. chunking | transform(CanonicalDocument) -> Iterable[CanonicalDocument] |
| Sink | Write IR to a destination | upsert(Iterable[CanonicalDocument]), delete(ids) |
| SearchProvider | Query-time read side | search(...), suggest(...), health() |
| StateStore | Track processed state | seen(path, hash), mark(...), deletions(...) |
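
To make the table concrete, three of the interfaces might be written as typing.Protocol classes along the following lines. This is a sketch under assumptions: the fields of FileRef and CanonicalDocument are placeholders (the real IR schema is ADR 0002), and the parameter lists of ocr_image and transcribe are guesses, since the table elides them.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol, runtime_checkable


@dataclass(frozen=True)
class FileRef:
    path: str  # placeholder field; the real FileRef also carries fs metadata


@dataclass(frozen=True)
class CanonicalDocument:
    id: str  # placeholder fields; see ADR 0002 for the real schema
    text: str


@runtime_checkable
class ModelBackend(Protocol):
    # Parameter lists are assumptions; the table only shows "(...)".
    def ocr_image(self, image: bytes) -> str: ...
    def transcribe(self, audio: bytes) -> str: ...


@runtime_checkable
class Extractor(Protocol):
    def extract(
        self, ref: FileRef, models: ModelBackend
    ) -> Iterable[CanonicalDocument]: ...


@runtime_checkable
class Sink(Protocol):
    def upsert(self, docs: Iterable[CanonicalDocument]) -> None: ...
    def delete(self, ids: Iterable[str]) -> None: ...
```

Because Protocol uses structural typing, an implementation satisfies Sink simply by providing upsert and delete; it never has to import or subclass the interface module. Note that isinstance checks against a runtime_checkable protocol only verify that the methods exist, not their signatures.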

A trivial FileSink (writes IR JSONL to disk) and a fake/mocked ModelBackend ship in Sprint 0 so the standalone path and the unit-test path both work on day one.
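
A FileSink and a fake ModelBackend of the kind described above could look roughly like this. It is a sketch, not the shipped implementation: CanonicalDocument is reduced to two placeholder fields, and the sink rewrites the whole JSONL file on every call, which is simple and idempotent but would not scale to large corpora.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Dict, Iterable


@dataclass(frozen=True)
class CanonicalDocument:
    id: str  # placeholder fields; see ADR 0002 for the real schema
    text: str


class FileSink:
    """Hypothetical sink that keeps one JSON object per line, keyed by id."""

    def __init__(self, out_path: Path) -> None:
        self._out_path = out_path

    def _load(self) -> Dict[str, dict]:
        if not self._out_path.exists():
            return {}
        lines = self._out_path.read_text(encoding="utf-8").splitlines()
        return {rec["id"]: rec for rec in map(json.loads, filter(None, lines))}

    def _store(self, records: Dict[str, dict]) -> None:
        body = "".join(
            json.dumps(rec, ensure_ascii=False) + "\n" for rec in records.values()
        )
        self._out_path.write_text(body, encoding="utf-8")

    def upsert(self, docs: Iterable[CanonicalDocument]) -> None:
        records = self._load()
        for doc in docs:
            records[doc.id] = asdict(doc)  # same id replaces the old record
        self._store(records)

    def delete(self, ids: Iterable[str]) -> None:
        records = self._load()
        for doc_id in ids:
            records.pop(doc_id, None)
        self._store(records)


class FakeModelBackend:
    """Test double: returns canned text instead of running OCR or ASR."""

    def __init__(self, canned_text: str = "fake text") -> None:
        self.canned_text = canned_text

    def ocr_image(self, image: bytes) -> str:
        return self.canned_text

    def transcribe(self, audio: bytes) -> str:
        return self.canned_text
```

Keying records by id means re-running the same extraction leaves the output file unchanged, which matches the idempotent-run requirement from the Context section.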

Swappability invariant

Three dependencies are pluggable behind interfaces and none is part of the contract: the search engine (behind Sink/SearchProvider), the model runtime (behind ModelBackend), and the extraction engine — including Docling (behind Extractor). Swapping any is a one-adapter change: a new search engine = one Sink + one SearchProvider; a different extraction engine = one Extractor. Docling's rich DoclingDocument is an internal transport used only inside an extractor (ADR 0006) — it never escapes past the Transformer boundary. The CanonicalDocument (IR) is the sole external contract (ADR 0002).
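
One way to keep a backend swap down to a single adapter is a small registry that maps a configured name to a sink factory. The sketch below is an assumption about how that could be wired, not a description of digger's code: register_sink, build_sink, and MemorySink are hypothetical names, and documents are plain dicts here to keep the example short.

```python
from typing import Callable, Dict, Iterable, Protocol


class Sink(Protocol):
    def upsert(self, docs: Iterable[dict]) -> None: ...
    def delete(self, ids: Iterable[str]) -> None: ...


# Hypothetical registry: configured sink name -> zero-argument factory.
_SINK_FACTORIES: Dict[str, Callable[[], Sink]] = {}


def register_sink(name: str) -> Callable[[Callable[[], Sink]], Callable[[], Sink]]:
    """Decorator that makes a sink available under a config name."""

    def decorator(factory: Callable[[], Sink]) -> Callable[[], Sink]:
        _SINK_FACTORIES[name] = factory
        return factory

    return decorator


def build_sink(name: str) -> Sink:
    """Instantiate the sink selected in configuration."""
    try:
        factory = _SINK_FACTORIES[name]
    except KeyError:
        known = sorted(_SINK_FACTORIES)
        raise ValueError(f"unknown sink {name!r}; known sinks: {known}") from None
    return factory()


@register_sink("memory")
class MemorySink:
    """Adapter for an imaginary backend; the only code a new engine would add."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}

    def upsert(self, docs: Iterable[dict]) -> None:
        for doc in docs:
            self.docs[doc["id"]] = doc

    def delete(self, ids: Iterable[str]) -> None:
        for doc_id in ids:
            self.docs.pop(doc_id, None)
```

Pipeline code would call build_sink with the name from the config file and depend only on the Sink protocol, so adding an engine touches the adapter module and nothing upstream.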

Consequences

  • Adding a new format = adding one Extractor (and possibly a ModelBackend), with no change to other layers. This is the core extensibility property the brief demands.
  • Swapping the search engine = writing one Sink + one SearchProvider. Nothing upstream or in the UI changes.
  • Testability is structural: interfaces let unit tests inject fakes for models and sinks, keeping the default CI tier fast and offline (see ADR 0010).
  • The IR sits exactly at the architectural waist; its stability is paramount (see ADR 0002).
  • Chunking lives in the Transformer, not in extractors, so it can be turned off for the standalone case (ADR 0004).
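
The last consequence, chunking as a Transformer that can be switched off, might be sketched as below. Fixed-size character chunking is used purely as a stand-in; the actual chunking strategy is the subject of ADR 0004. The class name ChunkingTransformer, the enabled flag, the "#index" id suffix, and the two-field CanonicalDocument are all assumptions for this example.

```python
from dataclasses import dataclass, replace
from typing import Iterator


@dataclass(frozen=True)
class CanonicalDocument:
    id: str  # placeholder fields; see ADR 0002 for the real schema
    text: str


class ChunkingTransformer:
    """Hypothetical Transformer: splits long documents into fixed-size chunks."""

    def __init__(self, max_chars: int = 1000, enabled: bool = True) -> None:
        if max_chars <= 0:
            raise ValueError("max_chars must be positive")
        self.max_chars = max_chars
        self.enabled = enabled

    def transform(self, doc: CanonicalDocument) -> Iterator[CanonicalDocument]:
        # Pass the document through untouched when chunking is off
        # or when it already fits in a single chunk.
        if not self.enabled or len(doc.text) <= self.max_chars:
            yield doc
            return
        for index, start in enumerate(range(0, len(doc.text), self.max_chars)):
            yield replace(
                doc,
                id=f"{doc.id}#{index}",
                text=doc.text[start : start + self.max_chars],
            )
```

With enabled=False the standalone path emits each extracted document unchanged, so the IR written by a file sink stays lossless while an indexing run can still opt in to chunks.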

Alternatives considered

  • An all-in-one toolkit (e.g. drive everything through Unstructured/Tika). Rejected as the architecture: it couples extraction, OCR, and output shape together and fights the local-model and swappable-sink requirements. We still use such toolkits inside specific extractors (ADR 0006), but they sit behind the Extractor interface, not in place of it.
  • Adopting a RAG ingestion/orchestration framework as the backbone (LlamaIndex IngestionPipeline, Haystack, unstructured-ingest, RAGFlow). Surveyed in ../research/G-rag-ingestion-frameworks.md and re-examined with forking explicitly allowed in ../research/H-framework-fork-reevaluation.md. Our local Ollama models are fully supported by these frameworks — Ollama (LLM + embeddings) is a first-class provider everywhere, and Docling's VLM pipeline can even drive Qwen2.5-VL via Ollama with our custom Arabic prompt — so model support is not a reason to reject them. The backbone is nonetheless rejected on three durable grounds: (a) no framework ships a Meilisearch keyword sink (we build MeilisearchSink + SearchProvider ourselves under every option), (b) adopting one demotes the IR from THE contract to a passenger inside their Node/Document type (which has no schema versioning), and (c) forking one to make our IR native is a whole-framework fork, not a schema swap — in llama-index-core, BaseNode threads through ~74 core files (ingestion, docstore hashing, transformations, node parsers, retrievers, serialization) on a fast release cadence, carrying a permanent rebase tax to buy only the doc-hash dedup we want to own anyway (~12 pw SQLite StateStore, ADR 0008). The ecosystem is vector-store-centric (chunk → embed → vector-store → LLM); digger's persisted-IR + swappable-keyword-sink shape sits in the gap between "file converter" and "RAG framework," so the custom pipeline is necessary, not merely preferred. We still reuse framework components where they fit (see ADR 0004, ADR 0006).
  • Letting the sink own normalization/chunking. Rejected: it would make the standalone (FileSink) path lossy and engine-specific, violating requirement 1.