digger
A modular, local-first file-ingestion search pipeline. It walks files on a machine, extracts their content — including scanned documents (OCR), Office files, and audio/video (transcription) — normalizes everything into one well-defined intermediate representation (IR), and feeds that into a swappable search backend (Meilisearch first) for full-text search.
Two hard requirements shape every decision:
- Runs standalone without a search engine — it can emit the IR to disk on its own; indexing is a separate, swappable stage.
- The search backend is swappable behind an interface.
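The two requirements above can be sketched as a single sink interface with interchangeable implementations. This is a minimal illustration, not the project's design: `CanonicalDocument` here is a hypothetical three-field stand-in for the real IR, and `Sink`, `FileSink`, `RecordingSink`, and `run_pipeline` are assumed names and signatures chosen for the example.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Iterable, List, Protocol


@dataclass
class CanonicalDocument:
    """Hypothetical, minimal stand-in for the IR; the real schema lives in docs/decisions/."""
    doc_id: str
    source_path: str
    text: str


class Sink(Protocol):
    """Anything that accepts IR documents; the pipeline depends only on this."""

    def write(self, docs: Iterable[CanonicalDocument]) -> int:
        ...


class FileSink:
    """Emits the IR to disk as JSON Lines, so no search engine is needed."""

    def __init__(self, out_path: Path) -> None:
        self.out_path = out_path

    def write(self, docs: Iterable[CanonicalDocument]) -> int:
        count = 0
        with self.out_path.open("w", encoding="utf-8") as fh:
            for doc in docs:
                # ensure_ascii=False keeps Arabic text readable in the output file.
                fh.write(json.dumps(asdict(doc), ensure_ascii=False) + "\n")
                count += 1
        return count


class RecordingSink:
    """Stand-in for an engine-backed sink; it only keeps documents in memory."""

    def __init__(self) -> None:
        self.indexed: List[CanonicalDocument] = []

    def write(self, docs: Iterable[CanonicalDocument]) -> int:
        batch = list(docs)
        self.indexed.extend(batch)
        return len(batch)


def run_pipeline(docs: Iterable[CanonicalDocument], sink: Sink) -> int:
    """The caller picks the sink; this function never imports a search client."""
    return sink.write(docs)


if __name__ == "__main__":
    sample = [
        CanonicalDocument("1", "C:/docs/a.pdf", "hello"),
        CanonicalDocument("2", "C:/docs/b.docx", "مرحبا"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "ir.jsonl"
        print(run_pipeline(sample, FileSink(out)), "documents written to", out.name)
    print(run_pipeline(sample, RecordingSink()), "documents sent to the stand-in sink")
```

Because `run_pipeline` only sees the `Sink` protocol, swapping the backend means passing a different object; nothing upstream changes.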
All model inference (OCR / ASR / embeddings) runs against local models — no file content leaves the machine. Target platform is primarily Windows, but the code is cross-platform (Windows + Linux + macOS). v1 ships keyword search; vector/hybrid is designed-for but switched off.
Status: design phase. This branch contains the research and the design (no implementation yet). Implementation starts after the plan and Forgejo milestones are approved.
Start here
| Document | What it is |
|---|---|
| `docs/digger-brief.md` | The authoritative project brief / spec. |
| `docs/research/SYNTHESIS.md` | Per-layer technology decisions reconciled from all research. |
| `docs/decisions/` | The ADRs and the two contract artifacts (IR schema + Meilisearch settings). Read `docs/decisions/README.md` first. |
| `docs/research/` | Detailed findings (Meilisearch, model tooling, Office, A/V, frontend, CI, and the RAG-ingestion-framework build-vs-adopt-vs-fork survey). |
| `docs/PLANNING_KICKOFF.md` | The prompt to start the planning session (the next step after this design is approved). |
Architecture at a glance
filesystem → [Source] → [Router] → [Extractor] → Canonical Document (IR)
→ [Transformer: normalize · language · chunk · (future) embed]
→ [Sink] → { FileSink (IR→disk) | MeilisearchSink }
read side: [SearchProvider] ← FastAPI + HTMX UI
crosscut: ModelBackend (OCR/ASR/embed) · StateStore (SQLite) · Config (TOML) · CLI
Each stage talks to the next only through a documented interface; the IR is the contract at the waist. See ADR 0001 and ADR 0002.
Everything heavy is swappable behind an interface — the search engine (Sink/SearchProvider), the model runtime (ModelBackend), and the extraction engine including Docling (Extractor). DoclingDocument never escapes the extractor; the CanonicalDocument (IR) is the only contract.
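One way to picture the extractor boundary described above: the engine-specific document type stays private to the extractor, and only the IR crosses the interface. The sketch below is an assumption-laden illustration; `_EngineNativeDoc` is a hypothetical stand-in for an engine type such as DoclingDocument, and `Router`, `PlainTextExtractor`, `process`, and the IR fields are invented for the example rather than taken from the ADRs.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Optional, Protocol, Tuple


@dataclass
class CanonicalDocument:
    """Hypothetical minimal IR; the real contract is the schema in docs/decisions/."""
    source_path: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


class Extractor(Protocol):
    def extract(self, path: Path) -> CanonicalDocument:
        ...


class _EngineNativeDoc:
    """Hypothetical engine-specific type; it must never leave the extractor."""

    def __init__(self, raw: str) -> None:
        self.raw = raw


class PlainTextExtractor:
    def _parse(self, path: Path) -> _EngineNativeDoc:
        # Engine-specific work happens here and stays private.
        return _EngineNativeDoc(path.read_text(encoding="utf-8"))

    def extract(self, path: Path) -> CanonicalDocument:
        native = self._parse(path)
        # Only the IR is returned; callers never see _EngineNativeDoc.
        return CanonicalDocument(str(path), native.raw.strip(), {"extractor": "plain-text"})


class Router:
    """Maps a file suffix to an extractor; unknown suffixes get no extractor."""

    def __init__(self, table: Dict[str, Extractor]) -> None:
        self._table = {suffix.lower(): ex for suffix, ex in table.items()}

    def route(self, path: Path) -> Optional[Extractor]:
        return self._table.get(path.suffix.lower())


def process(path: Path, router: Router) -> Tuple[str, Optional[CanonicalDocument]]:
    extractor = router.route(path)
    if extractor is None:
        return "skipped", None
    return "extracted", extractor.extract(path)


if __name__ == "__main__":
    router = Router({".txt": PlainTextExtractor()})
    with tempfile.TemporaryDirectory() as tmp:
        note = Path(tmp) / "note.txt"
        note.write_text("hello\n", encoding="utf-8")
        print(process(note, router)[0])
    print(process(Path("legacy.doc"), router)[0])
```

Replacing the extraction engine then means registering a different `Extractor` in the router table; downstream stages keep consuming the same IR type.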
Key technology choices (v1)
- Search: Meilisearch `v1.48.3`, single index, chunk-granularity collapsed by `parent_id`, Arabic+English via Charabia, vector embedder declared-but-dormant (bge-m3, 1024-dim, populated in v2).
- OCR: fine-tuned Qwen2.5-VL via local Ollama for scanned Arabic/handwriting/IDs/forms (base-model Arabic handwriting is only moderate, which is a Sprint-1 corpus-validation risk); Docling for native-digital. The OCR harness (Docling VLM vs thin wrapper) is a Sprint-1 bake-off.
- Office (v1): Docling extracts `.docx`/`.xlsx`/`.pptx` content + a thin `OfficeMetadataAugmenter` (python-docx/python-pptx/openpyxl, core properties only). Legacy binary `.doc`/`.xls`/`.ppt` and Access `.mdb`/`.accdb` are deferred to v2 (v1 routes them to `skipped`).
- Audio/video (v1): Docling's built-in ASR (WhisperS2T, `large-v3`), segment-level timestamps; a dedicated ffmpeg + faster-whisper extractor (word-level timestamps, VAD, diarization) is the v2 upgrade.
- UI: FastAPI + Jinja2 + HTMX, engine-agnostic via the `SearchProvider` interface.
- Packaging: one-command Docker Compose (CPU defaults, zero-config first run); every piece overridable.
- CI: Forgejo Actions — layered Linux tiers + a native Windows runner.
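As a rough illustration of the Search bullet in this list, the following sketch builds a settings payload with a declared-but-unused `userProvided` embedder and a distinct attribute. It is not the project's actual `meilisearch-settings.json`, which is not reproduced here. `distinctAttribute` and `embedders` (with `source` and `dimensions`) are real Meilisearch settings keys, but using `distinctAttribute` for the `parent_id` collapse, the embedder name `default`, and the `text` field are assumptions made for the example.

```python
from typing import Dict, List

EMBEDDING_DIM = 1024  # bge-m3 output size, as stated in the Search bullet


def build_index_settings() -> Dict[str, object]:
    """Return a hypothetical settings payload for a single chunk-level index."""
    return {
        # Assumption: chunks sharing a parent_id collapse to one hit via distinctAttribute.
        "distinctAttribute": "parent_id",
        # Assumption: "text" is the searchable chunk field.
        "searchableAttributes": ["text"],
        "embedders": {
            # Assumption: the embedder is named "default". With source "userProvided",
            # vectors are supplied by the client rather than computed by Meilisearch.
            "default": {"source": "userProvided", "dimensions": EMBEDDING_DIM},
        },
    }


def check_vector(vector: List[float], settings: Dict[str, object], name: str = "default") -> None:
    """Raise ValueError if a vector does not match the declared embedder dimension."""
    expected = settings["embedders"][name]["dimensions"]  # type: ignore[index]
    if len(vector) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(vector)}")


if __name__ == "__main__":
    settings = build_index_settings()
    check_vector([0.0] * 1024, settings)  # passes silently
    try:
        check_vector([0.0] * 768, settings)
    except ValueError as exc:
        print(exc)
```

A dimension check like `check_vector` is one way a future embedding stage could catch a model/settings mismatch before any document is sent to the engine.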
Languages
Arabic + English, including right-to-left UI and mixed-language documents.