digger/README.md
Randa 2e1f8f46bd docs: set bge-m3 (1024-dim) as the planned default embedder
Switch the committed userProvided embedder dimension 768 -> 1024 to match
bge-m3, the planned v2 default (research B already recommended it as the
Arabic+English choice; ADR 0005 previously carried e5-base/768 as the
example default, an inconsistency this resolves). Free to change now since
v1 writes no vectors.

Updated: meilisearch-settings.json, ADR 0003/0005, ir-schema.json,
SYNTHESIS, README, and A-meilisearch config/prose. Remaining 768 mentions
are legitimate (alternative 768-dim models; nomic-embed-text's real dim).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 16:10:43 +04:00

digger

A modular, local-first file-ingestion search pipeline. It walks files on a machine, extracts their content — including scanned documents (OCR), Office files, and audio/video (transcription) — normalizes everything into one well-defined intermediate representation (IR), and feeds that into a swappable search backend (Meilisearch first) for full-text search.

Two hard requirements shape every decision:

  1. Runs standalone without a search engine — it can emit the IR to disk on its own; indexing is a separate, swappable stage.
  2. The search backend is swappable behind an interface.
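The two requirements above can be illustrated with a small sketch. It is not the project's actual interface: `CanonicalDocument` is reduced to a few hypothetical fields, and the `Sink` protocol, `write` method, `InMemorySink`, and `emit` helper are names assumed here for illustration only. The point is that a file-backed sink and a search-backed sink satisfy the same contract, so the pipeline can run with no search engine present.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Dict, Iterable, Protocol


@dataclass
class CanonicalDocument:
    """Hypothetical, heavily reduced IR; the real contract is the project's IR schema."""
    id: str
    source_path: str
    text: str
    language: str = "und"
    metadata: Dict[str, str] = field(default_factory=dict)


class Sink(Protocol):
    """Assumed shape of the swappable output stage."""

    def write(self, docs: Iterable[CanonicalDocument]) -> int:
        """Persist documents and return how many were written."""
        ...


class FileSink:
    """Appends the IR to a JSON Lines file; needs no search engine."""

    def __init__(self, out_path: Path) -> None:
        self.out_path = Path(out_path)

    def write(self, docs: Iterable[CanonicalDocument]) -> int:
        count = 0
        with self.out_path.open("a", encoding="utf-8") as fh:
            for doc in docs:
                # ensure_ascii=False keeps Arabic text readable on disk.
                fh.write(json.dumps(asdict(doc), ensure_ascii=False) + "\n")
                count += 1
        return count


class InMemorySink:
    """Stand-in for a search-backend sink; a real one would call the engine."""

    def __init__(self) -> None:
        self.indexed: Dict[str, CanonicalDocument] = {}

    def write(self, docs: Iterable[CanonicalDocument]) -> int:
        count = 0
        for doc in docs:
            self.indexed[doc.id] = doc
            count += 1
        return count


def emit(docs: Iterable[CanonicalDocument], sink: Sink) -> int:
    """The caller depends only on the Sink protocol, never on a concrete backend."""
    return sink.write(docs)
```

Swapping `FileSink` for another sink changes no calling code, which is the property both requirements rely on.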

All model inference (OCR / ASR / embeddings) runs against local models, so no file content leaves the machine. The primary target platform is Windows, but the code is cross-platform (Windows, Linux, macOS). v1 ships keyword search only; vector and hybrid search are designed for but switched off.

Status: design phase. This branch contains the research and the design (no implementation yet). Implementation starts after the plan and Forgejo milestones are approved.

Start here

Document                     What it is
docs/digger-brief.md         The authoritative project brief / spec.
docs/research/SYNTHESIS.md   Per-layer technology decisions reconciled from all research.
docs/decisions/              The ADRs and the two contract artifacts (IR schema + Meilisearch settings). Read docs/decisions/README.md first.
docs/research/               Detailed findings (Meilisearch, model tooling, Office, A/V, frontend, CI, and the RAG-ingestion-framework build-vs-adopt-vs-fork survey).
docs/PLANNING_KICKOFF.md     The prompt to start the planning session (the next step after this design is approved).

Architecture at a glance

filesystem → [Source] → [Router] → [Extractor] → Canonical Document (IR)
           → [Transformer: normalize · language · chunk · (future) embed]
           → [Sink] → { FileSink (IR→disk) | MeilisearchSink }
  read side:  [SearchProvider] ← FastAPI + HTMX UI
  crosscut:   ModelBackend (OCR/ASR/embed) · StateStore (SQLite) · Config (TOML) · CLI

Each stage talks to the next only through a documented interface; the IR is the contract at the waist. See ADR 0001 and ADR 0002.
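As a rough illustration of stages that meet only at the IR, the sketch below wires a router, an extractor, a transformer, and a sink together as plain callables. Every name in it (`route`, `transform`, `run_pipeline`, the `chunks` field, the fixed-size chunking) is an assumption made for this example; the actual interfaces are defined in the ADRs, not here.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class CanonicalDocument:
    """Hypothetical, reduced IR used only for this sketch."""
    id: str
    source_path: str
    text: str
    chunks: List[str] = field(default_factory=list)


# Assumed stage signatures: each stage sees only paths or CanonicalDocuments.
Extractor = Callable[[Path], CanonicalDocument]
Sink = Callable[[CanonicalDocument], None]

# Formats this README defers to v2; v1 routes them to "skipped".
DEFERRED_SUFFIXES = {".doc", ".xls", ".ppt", ".mdb", ".accdb"}


def extract_plain_text(path: Path) -> CanonicalDocument:
    """Trivial extractor standing in for the real extraction engines."""
    return CanonicalDocument(
        id=path.name, source_path=str(path), text=path.read_text(encoding="utf-8")
    )


def route(path: Path, extractors: Dict[str, Extractor]) -> Optional[Extractor]:
    """Pick an extractor by file suffix; None means the file is skipped."""
    suffix = path.suffix.lower()
    if suffix in DEFERRED_SUFFIXES:
        return None
    return extractors.get(suffix)  # unknown suffixes are skipped as well


def transform(doc: CanonicalDocument, chunk_size: int = 200) -> CanonicalDocument:
    """Normalize whitespace, then cut the text into fixed-size chunks."""
    text = " ".join(doc.text.split())
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return CanonicalDocument(doc.id, doc.source_path, text, chunks)


def run_pipeline(
    paths: Iterable[Path],
    extractors: Dict[str, Extractor],
    sink: Sink,
    chunk_size: int = 200,
) -> Dict[str, int]:
    """Source -> Router -> Extractor -> Transformer -> Sink, with simple counters."""
    stats = {"processed": 0, "skipped": 0}
    for path in paths:
        extractor = route(path, extractors)
        if extractor is None:
            stats["skipped"] += 1
            continue
        sink(transform(extractor(path), chunk_size))
        stats["processed"] += 1
    return stats
```

Because the sink receives only `CanonicalDocument` values, passing `collected.append` (for a list named `collected`) is enough to inspect the pipeline output in a test.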

Everything heavy is swappable behind an interface — the search engine (Sink/SearchProvider), the model runtime (ModelBackend), and the extraction engine including Docling (Extractor). DoclingDocument never escapes the extractor; the CanonicalDocument (IR) is the only contract.
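One way to read "DoclingDocument never escapes the extractor" is as an adapter boundary: the engine's native type is created and consumed inside one method, and only the IR is returned. The sketch below uses a made-up `FakeEngine` and `EngineNativeDocument` instead of the real Docling API, so none of these names or methods should be taken as Docling's; they are placeholders for whatever engine sits behind the `Extractor` interface.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Protocol


@dataclass(frozen=True)
class CanonicalDocument:
    """Hypothetical, reduced IR used only for this sketch."""
    id: str
    source_path: str
    text: str


class Extractor(Protocol):
    """Assumed extractor contract: a path goes in, the IR comes out."""

    def extract(self, path: Path) -> CanonicalDocument: ...


@dataclass
class EngineNativeDocument:
    """Hypothetical stand-in for an engine-specific type such as DoclingDocument."""
    blocks: List[str]


class FakeEngine:
    """Hypothetical engine; it splits a text file into blank-line-separated blocks."""

    def convert(self, path: Path) -> EngineNativeDocument:
        raw = path.read_text(encoding="utf-8")
        return EngineNativeDocument(blocks=raw.split("\n\n"))


class EngineBackedExtractor:
    """Adapter: the engine-native object lives and dies inside extract()."""

    def __init__(self, engine: FakeEngine) -> None:
        self._engine = engine

    def extract(self, path: Path) -> CanonicalDocument:
        native = self._engine.convert(path)  # engine type stays local to this method
        text = "\n".join(b.strip() for b in native.blocks if b.strip())
        return CanonicalDocument(id=path.stem, source_path=str(path), text=text)
```

Replacing the engine then means writing another adapter with the same `extract` signature; downstream stages are unaffected because they never import the engine's types.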

Key technology choices (v1)

  • Search: Meilisearch v1.48.3, with a single index of chunk-level documents collapsed per parent by parent_id, Arabic+English tokenization via Charabia, and a vector embedder that is declared but dormant (bge-m3, 1024-dim, populated in v2).
  • OCR: fine-tuned Qwen2.5-VL via local Ollama for scanned Arabic/handwriting/IDs/forms (base-model accuracy on Arabic handwriting is only moderate, which is a Sprint-1 corpus-validation risk); Docling for native-digital. The choice of OCR harness (Docling VLM vs thin wrapper) will be settled by a Sprint-1 bake-off.
  • Office (v1): Docling extracts .docx/.xlsx/.pptx content + a thin OfficeMetadataAugmenter (python-docx/python-pptx/openpyxl, core properties only). Legacy binary .doc/.xls/.ppt and Access .mdb/.accdb are deferred to v2 (v1 routes them to skipped).
  • Audio/video (v1): Docling's built-in ASR (WhisperS2T, large-v3), segment-level timestamps; a dedicated ffmpeg + faster-whisper extractor (word-level timestamps, VAD, diarization) is the v2 upgrade.
  • UI: FastAPI + Jinja2 + HTMX, engine-agnostic via the SearchProvider interface.
  • Packaging: one-command Docker Compose (CPU defaults, zero-config first run); every piece overridable.
  • CI: Forgejo Actions — layered Linux tiers + a native Windows runner.
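To make the search choice above more concrete, here is a hedged sketch of what a settings payload along those lines might look like, expressed as a Python dict. The authoritative values live in the repository's meilisearch-settings.json, which is not reproduced here. The keys `distinctAttribute`, `searchableAttributes`, and `embedders` (with `source: userProvided` and `dimensions`) are Meilisearch settings; the embedder name `default`, the `text` field, and the helper functions are assumptions for illustration.

```python
import json
from typing import Dict, List

EMBEDDER_NAME = "default"    # hypothetical embedder name
EMBEDDING_DIMENSIONS = 1024  # bge-m3, per the list above


def build_index_settings() -> dict:
    """Assumed settings shape: one chunk-level index, collapsed by parent_id."""
    return {
        "searchableAttributes": ["text"],   # hypothetical field name
        "distinctAttribute": "parent_id",   # one hit per parent document
        "embedders": {
            # Declared but dormant: vectors would be supplied by the pipeline in v2.
            EMBEDDER_NAME: {
                "source": "userProvided",
                "dimensions": EMBEDDING_DIMENSIONS,
            },
        },
    }


def chunk_document(parent_id: str, index: int, text: str) -> Dict[str, str]:
    """Build one chunk-level record; the id scheme is an assumption."""
    return {"id": f"{parent_id}-{index}", "parent_id": parent_id, "text": text}


def validate_vector(vector: List[float]) -> None:
    """Reject vectors whose length does not match the declared dimension."""
    if len(vector) != EMBEDDING_DIMENSIONS:
        raise ValueError(
            f"expected {EMBEDDING_DIMENSIONS} dimensions, got {len(vector)}"
        )


if __name__ == "__main__":
    print(json.dumps(build_index_settings(), indent=2))
```

A dimension check like `validate_vector` is one cheap guard against mixing 768-dim and 1024-dim models once vectors are actually written.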

Languages

Arabic + English, including right-to-left UI and mixed-language documents.