digger/docs/research/SYNTHESIS.md
Randa 2e1f8f46bd docs: set bge-m3 (1024-dim) as the planned default embedder
Switch the committed userProvided embedder dimension 768 -> 1024 to match
bge-m3, the planned v2 default (research B already recommended it as the
Arabic+English choice; ADR 0005 previously carried e5-base/768 as the
example default, an inconsistency this resolves). Free to change now since
v1 writes no vectors.

Updated: meilisearch-settings.json, ADR 0003/0005, ir-schema.json,
SYNTHESIS, README, and A-meilisearch config/prose. Remaining 768 mentions
are legitimate (alternative 768-dim models; nomic-embed-text's real dim).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 16:10:43 +04:00

13 KiB
Raw Permalink Blame History

Research Synthesis — digger

Date: 2026-07-01 Inputs: Research docs AI in this directory, plus the existing arabic-ocr repo and the confirmed Section-14 answers. Purpose: Reconcile the findings into one set of per-layer technology choices and record the resolved decisions. Written at synthesis time and since reconciled to match the final decisions; the §2 subsections (G/H/I) and the ADRs are authoritative where any detail differs.


1. Confirmed project parameters (from Section 14)

Parameter Decision
Runtime Python
Search (v1) Keyword only; vector/hybrid designed-for but switched off
Access control Single-user, no ACLs
Hardware CPU-capable, GPU-optional with auto-fallback. Dev box: no GPU, 128 GB unified RAM
Languages Arabic + English (RTL matters)
Scale Medium: 10k500k files, ~0.55 TB
Meilisearch Bundled in our Compose, same machine, zero-config first run
File access Local disks only (still cross-platform path care)
Granularity Chunk long docs in v1 (a whole document is the degenerate single-chunk case); chunking is a Transformer concern
Read path Option B — thin Python SearchProvider API; UI never knows the engine
UI tech FastAPI + Jinja2 + HTMX
Run mode CLI (scan/extract/index/run/status) now; seams for watched-folder/service later
CI Linux/Docker runner (docker, ubuntu-latest) exists; add Windows runner

2. Per-layer technology decisions

Search engine — Meilisearch (Agent A)

  • Pin getmeili/meilisearch:v1.48.3. Vector/hybrid is production-stable in v1.x; no experimental flags needed.
  • Single index digger_documents + localizedAttributes (Arabic + English tokenization; object form — see the settings file) for mixed-language content.
  • Primary key id = SHA-256 hex of canonical_path + "|" + content_hash — Meili IDs allow only [a-zA-Z0-9_-], so raw paths can't be used; the hash is 64 hex chars, safe.
  • Reserve vectors now, at zero cost: declare a dormant userProvided embedder (dimensions: 1024, matching the planned default embedder bge-m3) at index creation. Vectors are populated only in v2; because v1 writes none, the dimension is still changeable if the embedder choice changes before the first vector exists (ADR 0003, ADR 0005).
  • Arabic: native via Charabia (article segmentation, diacritic stripping). Configure Arabic+English stop words manually (no built-ins). Keep typo oneTypo minimum at 5 (correct for short Arabic roots).
  • Two hard limits that shape the IR: (1) 65,535 word-positions per field — long OCR'd PDFs / long transcripts silently truncate; (2) 468-byte filterable-value cap — raw paths can't be filter values; derive a short source_folder token instead.

Extraction — Docling front-half + narrow custom extractors (→ ADR 0006)

Docling is the content extractor for every native-digital format — PDF-with-text, DOCX, PPTX, XLSX, HTML, EPUB, ODF, LaTeX, Markdown, EML, CSV, WebVTT — emitting DoclingDocument as an internal transport mapped to the IR. Everything Docling can't do is a narrow custom extractor behind the same Extractor interface:

  • Scanned Arabic / handwritten OCRQwen2.5-VL via Ollama + the arabic-ocr prompt (best evaluated stack for Arabic handwriting/IDs/forms — fine-tuned; the base model is only moderate, so Sprint-1 validates on the real corpus). The model + prompt are fixed; the harness — Docling's VLM pipeline (ApiVlmOptions) vs a thin Ollama wrapper — is a Sprint-1 bake-off. Printed-Latin scans → Docling + EasyOCR (no VLM cost).
  • Office core properties → a thin python-docx/python-pptx/openpyxl augmenter inside the Docling extractor (docx2python dropped).
  • A/V (v1) → Docling's built-in ASR (large-v3); the dedicated faster-whisper extractor is v2.
  • Deferred to v2: legacy binary .doc/.xls/.ppt + Access .mdb/.accdb (v1 → status: skipped); Outlook .msg/ZIP/edge-XML via Unstructured. MarkItDown is a safety-net fallback.

Why build our own (not a RAG framework or one universal toolkit): surveyed LlamaIndex/Haystack/unstructured-ingest/RAGFlow and rejected — none provides a Meilisearch keyword sink, a pluggable Arabic-VLM OCR step, or a persisted engine-agnostic IR; forking one is a whole-framework fork. We reuse components (Docling + its docling-core chunkers). Apache Tika is not used (JVM, Tesseract-only, no model hooks).

Office & legacy — specifics (Agent C)

  • Office content comes from Docling (see Extraction above); the dedicated libs are a thin metadata augmenter only. python-calamine deferred; xlrd used only for legacy .xls (v2).
  • Legacy binary Office (.doc/.xls/.ppt) deferred to v2 — unoserver/LibreOffice conversion → OOXML → Docling (+ xlrd for .xls). In v1 these route to status: skipped; the ~800 MB LibreOffice container is a v2 addition, keeping the v1 Compose lighter. Never antiword/catdoc (dead).
  • Access (.mdb/.accdb) deferred to v2 — Windows-only via pyodbc + ACE even then (mdbtools .accdb unreliable). In v1 → skipped with a structured reason.
  • No COM/Office automation ever. Windows long-path (\\?\, registry key), file-lock (catch PermissionError), and encoding care throughout.

Audio/video (Agent D)

  • faster-whisper v1.2.x, model large-v3, compute_type=int8, CPU-first. ~1.5 GB weights (23 GB peak RAM), ~56× real-time on a decent multi-core CPU (turbo reaches ~1015×). GPU upgrade = change two flags. Avoid large-v3-turbo for Arabic — turbo degrades on lower-resource languages (OpenAI documents Thai/Cantonese; Arabic regression is reported in third-party benchmarks); allow as opt-in override. Document Byne/whisper-large-v3-arabic as a configurable model override for better Arabic WER.
  • ffmpeg via plain subprocess (16 kHz mono WAV). Bundle in Docker; static-ffmpeg for non-Docker dev. Resolver: explicit config → shutil.whichstatic_ffmpeg.
  • Diarization deferred to v2 (gated HF model, CPU-prohibitive). IR reserves speaker (null in v1) and optional words so v2 layers on without schema change.
  • Long media: vad_filter=True; transcribe in a killable ProcessPoolExecutor with timeout; enforce max_media_size_bytes.
  • v1/v2 split (revised). v1 handles A/V via Docling's built-in ASR (WhisperS2T, overridden to large-v3) as just-another-Docling-format — segment-level timestamps, the simplest v1 slice. The dedicated faster-whisper extractor above (word-level timestamps, VAD, fine-tuned Arabic, diarization) becomes the v2 upgrade. The IR is unchanged either way (segment times carried; words/speaker reserved null).

Frontend (Agent E)

  • FastAPI 0.136.x + HTMX 2.0.x + Jinja2, meilisearch-python-sdk (async) for the adapter.
  • SearchProvider Protocol with search()/suggest()/health() returning engine-agnostic Hit/SearchResult/FacetBucket dataclasses. Hit.snippet is pre-rendered HTML with <mark> only (sanitized) — no Meilisearch field name (_formatted, facetDistribution) ever reaches a template. mode: keyword|semantic|hybrid is in the signature now (always keyword in v1).
  • Search-as-you-type: hx-trigger="input changed delay:300ms", hx-sync="this:replace", hx-push-url="true"; one /search route returns full page or #results partial based on the HX-Request header (bookmarkable + degrades gracefully).
  • Arabic RTL: root dir="rtl", CSS logical properties only, dir="auto" on input + snippets, <bdi> around paths/numbers, self-hosted Cairo/Noto Sans Arabic, no letter-spacing.
  • Open-file / deep-link: server-side /api/open (os-native open) instead of browser-blocked file://; PDF #page=N; audio/video embedded HTML5 player pre-seeked #t=N. Always show provenance (path, page/slide/sheet, timestamp).
  • suggest() returns empty in v1 (Meili has no native suggest API).

CI / Windows runner (Agent F)

  • Two workflows under .forgejo/workflows/: ci.yml (Linux) — unit on every push (ruff, ruff format, mypy, pytest+coverage), Meilisearch integration on PRs to main (service container, pinned image); ci-windows.yml — Windows path/encoding/file-lock unit tests on the windows label. Heavy OCR/ASR gated behind a heavy marker + workflow_dispatch, never in the default pipeline.
  • All uses: fully-qualified to https://code.forgejo.org/actions/... (bare refs expand against DEFAULT_ACTIONS_URL = https://data.forgejo.org, not GitHub — qualify to avoid ambiguity). Use code.forgejo.org/forgejo/upload-artifact for artifacts.
  • Windows runner = Crown0815/Forgejo-runner-windows-builder (unofficial, pin v12.12.0), runs host-native (no containers) → no service containers on Windows (integration stays Linux-only). User must provide a Windows host or KVM Windows 11 VM; pre-install Python 3.12, Git, poppler, ffmpeg; add a Defender exclusion; register with --labels "windows:host,self-hosted:host" and override ACTIONS_RUNTIME_URL to the Linux host LAN IP. forgejo-stack gets a register-windows-runner.sh helper + docs/windows-runner.md; no docker-compose.yml change.

Tooling decision — build our own, reuse Docling (research G/H/I)

The outcome is folded into Extraction above. Full evidence: the framework build-vs-adopt survey G-rag-ingestion-frameworks.md, the fork/extend re-evaluation H-framework-fork-reevaluation.md, and the Docling front-half spike I-docling-front-half.md. Components adopted: docling-core chunkers (HybridChunker) in the Transformer, DoclingDocument as an internal Extractor→Transformer transport, and LlamaIndex's doc_id + doc_hash pattern as reference for our StateStore.


3. Cross-cutting architecture implications

  1. The IR carries chunking fields from day one and v1 chunks long documents: chunk_index, chunk_count, parent_id (content-addressed id vs. path-stable parent_id = sha256(path), so a file's whole chunk family is replaced/deleted by one stable key); plus a content_truncated safety flag for the 65,535-word ceiling.
  2. The IR carries the vector seam as an embedding{ model_id, model_version, dimensions, vector } object (null in v1); the sink maps embedding.vector → Meilisearch _vectors.digger_semantic.
  3. source_folder is a derived, length-bounded token (≤468 bytes) so folder faceting works within Meili's filterable-value cap; raw path is displayed-only.
  4. Provenance is structural: page (PDF), slide (PPTX), sheet (XLSX), and transcript-segment timestamps (A/V) live in the IR's structured segments so the UI can deep-link.
  5. Model access is uniform: OCR, ASR, and (future) embeddings all sit behind one ModelBackend interface; the v1 default backends are Ollama (host service) for OCR and Docling's built-in ASR for A/V (a dedicated faster-whisper extractor is the v2 ASR upgrade), all endpoint/flag-configurable.
  6. Ollama deployment default = host service (resolves the user's open question): native install, OLLAMA_HOST=0.0.0.0:11434, pipeline reaches it via host.docker.internal with extra_hosts: host-gateway so Windows/macOS/Linux converge on one default; the dev VM keeps 192.168.122.1 as an override. Fully overridable; Dockerized-Ollama is documented as the alternative for GPU-reproducible setups.

4. Trade-offs — resolved

The three trade-offs posed at synthesis time are now decided (authoritative in the ADRs):

  1. Deduplicationone document per path (ADR 0008, ADR 0003).
  2. Chunkingchunk long docs in v1 as a Transformer concern (ADR 0004); distinctAttribute = parent_id collapses a document's chunks to one hit per file.
  3. Windows CI host → a KVM Windows 11 VM on the Linux box (ADR 0010).

5. Key version pins (verified 2026-07-01)

Component Pin
Meilisearch getmeili/meilisearch:v1.48.3
Docling 2.107.x
faster-whisper 1.2.x (large-v3, int8)
FastAPI / HTMX 0.136.x / 2.0.x
python-docx / openpyxl / python-pptx 1.2.x / 3.1.5 / 1.0.2 — metadata augmentation only in v1 (docx2python dropped; calamine + xlrd deferred to v2 with legacy formats)
unoserver 3.7
static-ffmpeg 2.13
Crown0815 Windows runner v12.12.0
Python 3.12