Switch the committed userProvided embedder dimension 768 -> 1024 to match bge-m3, the planned v2 default (research B already recommended it as the Arabic+English choice; ADR 0005 previously carried e5-base/768 as the example default, an inconsistency this resolves). Free to change now since v1 writes no vectors. Updated: meilisearch-settings.json, ADR 0003/0005, ir-schema.json, SYNTHESIS, README, and A-meilisearch config/prose. Remaining 768 mentions are legitimate (alternative 768-dim models; nomic-embed-text's real dim). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
13 KiB
Research Synthesis — digger
Date: 2026-07-01
Inputs: Research docs A–I in this directory, plus the existing arabic-ocr repo and the confirmed Section-14 answers.
Purpose: Reconcile the findings into one set of per-layer technology choices and record the resolved decisions. Written at synthesis time and since reconciled to match the final decisions; the §2 subsections (G/H/I) and the ADRs are authoritative where any detail differs.
1. Confirmed project parameters (from Section 14)
| Parameter | Decision |
|---|---|
| Runtime | Python |
| Search (v1) | Keyword only; vector/hybrid designed-for but switched off |
| Access control | Single-user, no ACLs |
| Hardware | CPU-capable, GPU-optional with auto-fallback. Dev box: no GPU, 128 GB unified RAM |
| Languages | Arabic + English (RTL matters) |
| Scale | Medium: 10k–500k files, ~0.5–5 TB |
| Meilisearch | Bundled in our Compose, same machine, zero-config first run |
| File access | Local disks only (still cross-platform path care) |
| Granularity | Chunk long docs in v1 (a whole document is the degenerate single-chunk case); chunking is a Transformer concern |
| Read path | Option B — thin Python SearchProvider API; UI never knows the engine |
| UI tech | FastAPI + Jinja2 + HTMX |
| Run mode | CLI (scan/extract/index/run/status) now; seams for watched-folder/service later |
| CI | Linux/Docker runner (docker, ubuntu-latest) exists; add Windows runner |
2. Per-layer technology decisions
Search engine — Meilisearch (Agent A)
- Pin
getmeili/meilisearch:v1.48.3. Vector/hybrid is production-stable in v1.x; no experimental flags needed. - Single index
digger_documents+localizedAttributes(Arabic + English tokenization; object form — see the settings file) for mixed-language content. - Primary key
id= SHA-256 hex ofcanonical_path + "|" + content_hash— Meili IDs allow only[a-zA-Z0-9_-], so raw paths can't be used; the hash is 64 hex chars, safe. - Reserve vectors now, at zero cost: declare a dormant
userProvidedembedder (dimensions: 1024, matching the planned default embedderbge-m3) at index creation. Vectors are populated only in v2; because v1 writes none, the dimension is still changeable if the embedder choice changes before the first vector exists (ADR 0003, ADR 0005). - Arabic: native via Charabia (article segmentation, diacritic stripping). Configure Arabic+English stop words manually (no built-ins). Keep typo
oneTypominimum at 5 (correct for short Arabic roots). - Two hard limits that shape the IR: (1) 65,535 word-positions per field — long OCR'd PDFs / long transcripts silently truncate; (2) 468-byte filterable-value cap — raw paths can't be filter values; derive a short
source_foldertoken instead.
Extraction — Docling front-half + narrow custom extractors (→ ADR 0006)
Docling is the content extractor for every native-digital format — PDF-with-text, DOCX, PPTX, XLSX, HTML, EPUB, ODF, LaTeX, Markdown, EML, CSV, WebVTT — emitting DoclingDocument as an internal transport mapped to the IR. Everything Docling can't do is a narrow custom extractor behind the same Extractor interface:
- Scanned Arabic / handwritten OCR → Qwen2.5-VL via Ollama + the
arabic-ocrprompt (best evaluated stack for Arabic handwriting/IDs/forms — fine-tuned; the base model is only moderate, so Sprint-1 validates on the real corpus). The model + prompt are fixed; the harness — Docling's VLM pipeline (ApiVlmOptions) vs a thin Ollama wrapper — is a Sprint-1 bake-off. Printed-Latin scans → Docling + EasyOCR (no VLM cost). - Office core properties → a thin
python-docx/python-pptx/openpyxlaugmenter inside the Docling extractor (docx2pythondropped). - A/V (v1) → Docling's built-in ASR (
large-v3); the dedicated faster-whisper extractor is v2. - Deferred to v2: legacy binary
.doc/.xls/.ppt+ Access.mdb/.accdb(v1 →status: skipped); Outlook.msg/ZIP/edge-XML via Unstructured. MarkItDown is a safety-net fallback.
Why build our own (not a RAG framework or one universal toolkit): surveyed LlamaIndex/Haystack/unstructured-ingest/RAGFlow and rejected — none provides a Meilisearch keyword sink, a pluggable Arabic-VLM OCR step, or a persisted engine-agnostic IR; forking one is a whole-framework fork. We reuse components (Docling + its docling-core chunkers). Apache Tika is not used (JVM, Tesseract-only, no model hooks).
Office & legacy — specifics (Agent C)
- Office content comes from Docling (see Extraction above); the dedicated libs are a thin metadata augmenter only.
python-calaminedeferred;xlrdused only for legacy.xls(v2). - Legacy binary Office (
.doc/.xls/.ppt) deferred to v2 — unoserver/LibreOffice conversion → OOXML → Docling (+ xlrd for.xls). In v1 these route tostatus: skipped; the ~800 MB LibreOffice container is a v2 addition, keeping the v1 Compose lighter. Never antiword/catdoc (dead). - Access (
.mdb/.accdb) deferred to v2 — Windows-only via pyodbc + ACE even then (mdbtools.accdbunreliable). In v1 →skippedwith a structured reason. - No COM/Office automation ever. Windows long-path (
\\?\, registry key), file-lock (catch PermissionError), and encoding care throughout.
Audio/video (Agent D)
- faster-whisper v1.2.x, model
large-v3,compute_type=int8, CPU-first. ~1.5 GB weights (2–3 GB peak RAM), ~5–6× real-time on a decent multi-core CPU (turbo reaches ~10–15×). GPU upgrade = change two flags. Avoidlarge-v3-turbofor Arabic — turbo degrades on lower-resource languages (OpenAI documents Thai/Cantonese; Arabic regression is reported in third-party benchmarks); allow as opt-in override. DocumentByne/whisper-large-v3-arabicas a configurable model override for better Arabic WER. - ffmpeg via plain
subprocess(16 kHz mono WAV). Bundle in Docker;static-ffmpegfor non-Docker dev. Resolver: explicit config →shutil.which→static_ffmpeg. - Diarization deferred to v2 (gated HF model, CPU-prohibitive). IR reserves
speaker(null in v1) and optionalwordsso v2 layers on without schema change. - Long media:
vad_filter=True; transcribe in a killableProcessPoolExecutorwith timeout; enforcemax_media_size_bytes. - v1/v2 split (revised). v1 handles A/V via Docling's built-in ASR (WhisperS2T, overridden to
large-v3) as just-another-Docling-format — segment-level timestamps, the simplest v1 slice. The dedicated faster-whisper extractor above (word-level timestamps, VAD, fine-tuned Arabic, diarization) becomes the v2 upgrade. The IR is unchanged either way (segment times carried;words/speakerreserved null).
Frontend (Agent E)
- FastAPI 0.136.x + HTMX 2.0.x + Jinja2,
meilisearch-python-sdk(async) for the adapter. SearchProviderProtocol withsearch()/suggest()/health()returning engine-agnosticHit/SearchResult/FacetBucketdataclasses.Hit.snippetis pre-rendered HTML with<mark>only (sanitized) — no Meilisearch field name (_formatted,facetDistribution) ever reaches a template.mode: keyword|semantic|hybridis in the signature now (alwayskeywordin v1).- Search-as-you-type:
hx-trigger="input changed delay:300ms",hx-sync="this:replace",hx-push-url="true"; one/searchroute returns full page or#resultspartial based on theHX-Requestheader (bookmarkable + degrades gracefully). - Arabic RTL: root
dir="rtl", CSS logical properties only,dir="auto"on input + snippets,<bdi>around paths/numbers, self-hosted Cairo/Noto Sans Arabic, no letter-spacing. - Open-file / deep-link: server-side
/api/open(os-native open) instead of browser-blockedfile://; PDF#page=N; audio/video embedded HTML5 player pre-seeked#t=N. Always show provenance (path, page/slide/sheet, timestamp). suggest()returns empty in v1 (Meili has no native suggest API).
CI / Windows runner (Agent F)
- Two workflows under
.forgejo/workflows/:ci.yml(Linux) — unit on every push (ruff, ruff format, mypy, pytest+coverage), Meilisearch integration on PRs to main (service container, pinned image);ci-windows.yml— Windows path/encoding/file-lock unit tests on thewindowslabel. Heavy OCR/ASR gated behind aheavymarker +workflow_dispatch, never in the default pipeline. - All
uses:fully-qualified tohttps://code.forgejo.org/actions/...(bare refs expand againstDEFAULT_ACTIONS_URL=https://data.forgejo.org, not GitHub — qualify to avoid ambiguity). Usecode.forgejo.org/forgejo/upload-artifactfor artifacts. - Windows runner = Crown0815/Forgejo-runner-windows-builder (unofficial, pin
v12.12.0), runs host-native (no containers) → no service containers on Windows (integration stays Linux-only). User must provide a Windows host or KVM Windows 11 VM; pre-install Python 3.12, Git, poppler, ffmpeg; add a Defender exclusion; register with--labels "windows:host,self-hosted:host"and overrideACTIONS_RUNTIME_URLto the Linux host LAN IP. forgejo-stack gets aregister-windows-runner.shhelper +docs/windows-runner.md; no docker-compose.yml change.
Tooling decision — build our own, reuse Docling (research G/H/I)
The outcome is folded into Extraction above. Full evidence: the framework build-vs-adopt survey G-rag-ingestion-frameworks.md, the fork/extend re-evaluation H-framework-fork-reevaluation.md, and the Docling front-half spike I-docling-front-half.md. Components adopted: docling-core chunkers (HybridChunker) in the Transformer, DoclingDocument as an internal Extractor→Transformer transport, and LlamaIndex's doc_id + doc_hash pattern as reference for our StateStore.
3. Cross-cutting architecture implications
- The IR carries chunking fields from day one and v1 chunks long documents:
chunk_index,chunk_count,parent_id(content-addressedidvs. path-stableparent_id = sha256(path), so a file's whole chunk family is replaced/deleted by one stable key); plus acontent_truncatedsafety flag for the 65,535-word ceiling. - The IR carries the vector seam as an
embedding{ model_id, model_version, dimensions, vector }object (null in v1); the sink mapsembedding.vector→ Meilisearch_vectors.digger_semantic. source_folderis a derived, length-bounded token (≤468 bytes) so folder faceting works within Meili's filterable-value cap; rawpathis displayed-only.- Provenance is structural: page (PDF), slide (PPTX), sheet (XLSX), and transcript-segment timestamps (A/V) live in the IR's structured segments so the UI can deep-link.
- Model access is uniform: OCR, ASR, and (future) embeddings all sit behind one
ModelBackendinterface; the v1 default backends are Ollama (host service) for OCR and Docling's built-in ASR for A/V (a dedicated faster-whisper extractor is the v2 ASR upgrade), all endpoint/flag-configurable. - Ollama deployment default = host service (resolves the user's open question): native install,
OLLAMA_HOST=0.0.0.0:11434, pipeline reaches it viahost.docker.internalwithextra_hosts: host-gatewayso Windows/macOS/Linux converge on one default; the dev VM keeps192.168.122.1as an override. Fully overridable; Dockerized-Ollama is documented as the alternative for GPU-reproducible setups.
4. Trade-offs — resolved
The three trade-offs posed at synthesis time are now decided (authoritative in the ADRs):
- Deduplication → one document per path (ADR 0008, ADR 0003).
- Chunking → chunk long docs in v1 as a Transformer concern (ADR 0004);
distinctAttribute = parent_idcollapses a document's chunks to one hit per file. - Windows CI host → a KVM Windows 11 VM on the Linux box (ADR 0010).
5. Key version pins (verified 2026-07-01)
| Component | Pin |
|---|---|
| Meilisearch | getmeili/meilisearch:v1.48.3 |
| Docling | 2.107.x |
| faster-whisper | 1.2.x (large-v3, int8) |
| FastAPI / HTMX | 0.136.x / 2.0.x |
| python-docx / openpyxl / python-pptx | 1.2.x / 3.1.5 / 1.0.2 — metadata augmentation only in v1 (docx2python dropped; calamine + xlrd deferred to v2 with legacy formats) |
| unoserver | 3.7 |
| static-ffmpeg | 2.13 |
| Crown0815 Windows runner | v12.12.0 |
| Python | 3.12 |