Randa 2e1f8f46bd docs: set bge-m3 (1024-dim) as the planned default embedder

Switch the committed userProvided embedder dimension 768 -> 1024 to match
bge-m3, the planned v2 default (research B already recommended it as the
Arabic+English choice; ADR 0005 previously carried e5-base/768 as the
example default, an inconsistency this resolves). Free to change now since
v1 writes no vectors.

Updated: meilisearch-settings.json, ADR 0003/0005, ir-schema.json,
SYNTHESIS, README, and A-meilisearch config/prose. Remaining 768 mentions
are legitimate (alternative 768-dim models; nomic-embed-text's real dim).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-07-01 16:10:43 +04:00

13 KiB

Raw Permalink Blame History

Research Synthesis — digger

Date: 2026-07-01 Inputs: Research docs A–I in this directory, plus the existing arabic-ocr repo and the confirmed Section-14 answers. Purpose: Reconcile the findings into one set of per-layer technology choices and record the resolved decisions. Written at synthesis time and since reconciled to match the final decisions; the §2 subsections (G/H/I) and the ADRs are authoritative where any detail differs.

1. Confirmed project parameters (from Section 14)

Parameter	Decision
Runtime	Python
Search (v1)	Keyword only; vector/hybrid designed-for but switched off
Access control	Single-user, no ACLs
Hardware	CPU-capable, GPU-optional with auto-fallback. Dev box: no GPU, 128 GB unified RAM
Languages	Arabic + English (RTL matters)
Scale	Medium: 10k–500k files, ~0.5–5 TB
Meilisearch	Bundled in our Compose, same machine, zero-config first run
File access	Local disks only (still cross-platform path care)
Granularity	Chunk long docs in v1 (a whole document is the degenerate single-chunk case); chunking is a Transformer concern
Read path	Option B — thin Python `SearchProvider` API; UI never knows the engine
UI tech	FastAPI + Jinja2 + HTMX
Run mode	CLI (`scan/extract/index/run/status`) now; seams for watched-folder/service later
CI	Linux/Docker runner (`docker`, `ubuntu-latest`) exists; add Windows runner

2. Per-layer technology decisions

Search engine — Meilisearch (Agent A)

Pin getmeili/meilisearch:v1.48.3. Vector/hybrid is production-stable in v1.x; no experimental flags needed.
Single index digger_documents + localizedAttributes (Arabic + English tokenization; object form — see the settings file) for mixed-language content.
Primary key id = SHA-256 hex of canonical_path + "|" + content_hash — Meili IDs allow only [a-zA-Z0-9_-], so raw paths can't be used; the hash is 64 hex chars, safe.
Reserve vectors now, at zero cost: declare a dormant userProvided embedder (dimensions: 1024, matching the planned default embedder bge-m3) at index creation. Vectors are populated only in v2; because v1 writes none, the dimension is still changeable if the embedder choice changes before the first vector exists (ADR 0003, ADR 0005).
Arabic: native via Charabia (article segmentation, diacritic stripping). Configure Arabic+English stop words manually (no built-ins). Keep typo oneTypo minimum at 5 (correct for short Arabic roots).
Two hard limits that shape the IR: (1) 65,535 word-positions per field — long OCR'd PDFs / long transcripts silently truncate; (2) 468-byte filterable-value cap — raw paths can't be filter values; derive a short source_folder token instead.

Extraction — Docling front-half + narrow custom extractors (→ ADR 0006)

Docling is the content extractor for every native-digital format — PDF-with-text, DOCX, PPTX, XLSX, HTML, EPUB, ODF, LaTeX, Markdown, EML, CSV, WebVTT — emitting DoclingDocument as an internal transport mapped to the IR. Everything Docling can't do is a narrow custom extractor behind the same Extractor interface:

Scanned Arabic / handwritten OCR → Qwen2.5-VL via Ollama + the arabic-ocr prompt (best evaluated stack for Arabic handwriting/IDs/forms — fine-tuned; the base model is only moderate, so Sprint-1 validates on the real corpus). The model + prompt are fixed; the harness — Docling's VLM pipeline (ApiVlmOptions) vs a thin Ollama wrapper — is a Sprint-1 bake-off. Printed-Latin scans → Docling + EasyOCR (no VLM cost).
Office core properties → a thin python-docx/python-pptx/openpyxl augmenter inside the Docling extractor (docx2python dropped).
A/V (v1) → Docling's built-in ASR (large-v3); the dedicated faster-whisper extractor is v2.
Deferred to v2: legacy binary .doc/.xls/.ppt + Access .mdb/.accdb (v1 → status: skipped); Outlook .msg/ZIP/edge-XML via Unstructured. MarkItDown is a safety-net fallback.

Why build our own (not a RAG framework or one universal toolkit): surveyed LlamaIndex/Haystack/unstructured-ingest/RAGFlow and rejected — none provides a Meilisearch keyword sink, a pluggable Arabic-VLM OCR step, or a persisted engine-agnostic IR; forking one is a whole-framework fork. We reuse components (Docling + its docling-core chunkers). Apache Tika is not used (JVM, Tesseract-only, no model hooks).

Office & legacy — specifics (Agent C)

Office content comes from Docling (see Extraction above); the dedicated libs are a thin metadata augmenter only. python-calamine deferred; xlrd used only for legacy .xls (v2).
Legacy binary Office (.doc/.xls/.ppt) deferred to v2 — unoserver/LibreOffice conversion → OOXML → Docling (+ xlrd for .xls). In v1 these route to status: skipped; the ~800 MB LibreOffice container is a v2 addition, keeping the v1 Compose lighter. Never antiword/catdoc (dead).
Access (.mdb/.accdb) deferred to v2 — Windows-only via pyodbc + ACE even then (mdbtools .accdb unreliable). In v1 → skipped with a structured reason.
No COM/Office automation ever. Windows long-path (\\?\, registry key), file-lock (catch PermissionError), and encoding care throughout.

Audio/video (Agent D)

faster-whisper v1.2.x, model large-v3, compute_type=int8, CPU-first. ~1.5 GB weights (2–3 GB peak RAM), ~5–6× real-time on a decent multi-core CPU (turbo reaches ~10–15×). GPU upgrade = change two flags. Avoid large-v3-turbo for Arabic — turbo degrades on lower-resource languages (OpenAI documents Thai/Cantonese; Arabic regression is reported in third-party benchmarks); allow as opt-in override. Document Byne/whisper-large-v3-arabic as a configurable model override for better Arabic WER.
ffmpeg via plain subprocess (16 kHz mono WAV). Bundle in Docker; static-ffmpeg for non-Docker dev. Resolver: explicit config → shutil.which → static_ffmpeg.
Diarization deferred to v2 (gated HF model, CPU-prohibitive). IR reserves speaker (null in v1) and optional words so v2 layers on without schema change.
Long media: vad_filter=True; transcribe in a killable ProcessPoolExecutor with timeout; enforce max_media_size_bytes.
v1/v2 split (revised). v1 handles A/V via Docling's built-in ASR (WhisperS2T, overridden to large-v3) as just-another-Docling-format — segment-level timestamps, the simplest v1 slice. The dedicated faster-whisper extractor above (word-level timestamps, VAD, fine-tuned Arabic, diarization) becomes the v2 upgrade. The IR is unchanged either way (segment times carried; words/speaker reserved null).

Frontend (Agent E)

FastAPI 0.136.x + HTMX 2.0.x + Jinja2, meilisearch-python-sdk (async) for the adapter.
SearchProvider Protocol with search()/suggest()/health() returning engine-agnostic Hit/SearchResult/FacetBucket dataclasses. Hit.snippet is pre-rendered HTML with <mark> only (sanitized) — no Meilisearch field name (_formatted, facetDistribution) ever reaches a template. mode: keyword|semantic|hybrid is in the signature now (always keyword in v1).
Search-as-you-type: hx-trigger="input changed delay:300ms", hx-sync="this:replace", hx-push-url="true"; one /search route returns full page or #results partial based on the HX-Request header (bookmarkable + degrades gracefully).
Arabic RTL: root dir="rtl", CSS logical properties only, dir="auto" on input + snippets, <bdi> around paths/numbers, self-hosted Cairo/Noto Sans Arabic, no letter-spacing.
Open-file / deep-link: server-side /api/open (os-native open) instead of browser-blocked file://; PDF #page=N; audio/video embedded HTML5 player pre-seeked #t=N. Always show provenance (path, page/slide/sheet, timestamp).
suggest() returns empty in v1 (Meili has no native suggest API).

CI / Windows runner (Agent F)

Two workflows under .forgejo/workflows/: ci.yml (Linux) — unit on every push (ruff, ruff format, mypy, pytest+coverage), Meilisearch integration on PRs to main (service container, pinned image); ci-windows.yml — Windows path/encoding/file-lock unit tests on the windows label. Heavy OCR/ASR gated behind a heavy marker + workflow_dispatch, never in the default pipeline.
All uses: fully-qualified to https://code.forgejo.org/actions/... (bare refs expand against DEFAULT_ACTIONS_URL = https://data.forgejo.org, not GitHub — qualify to avoid ambiguity). Use code.forgejo.org/forgejo/upload-artifact for artifacts.
Windows runner = Crown0815/Forgejo-runner-windows-builder (unofficial, pin v12.12.0), runs host-native (no containers) → no service containers on Windows (integration stays Linux-only). User must provide a Windows host or KVM Windows 11 VM; pre-install Python 3.12, Git, poppler, ffmpeg; add a Defender exclusion; register with --labels "windows:host,self-hosted:host" and override ACTIONS_RUNTIME_URL to the Linux host LAN IP. forgejo-stack gets a register-windows-runner.sh helper + docs/windows-runner.md; no docker-compose.yml change.

Tooling decision — build our own, reuse Docling (research G/H/I)

The outcome is folded into Extraction above. Full evidence: the framework build-vs-adopt survey G-rag-ingestion-frameworks.md, the fork/extend re-evaluation H-framework-fork-reevaluation.md, and the Docling front-half spike I-docling-front-half.md. Components adopted: docling-core chunkers (HybridChunker) in the Transformer, DoclingDocument as an internal Extractor→Transformer transport, and LlamaIndex's doc_id + doc_hash pattern as reference for our StateStore.

3. Cross-cutting architecture implications

The IR carries chunking fields from day one and v1 chunks long documents: chunk_index, chunk_count, parent_id (content-addressed id vs. path-stable parent_id = sha256(path), so a file's whole chunk family is replaced/deleted by one stable key); plus a content_truncated safety flag for the 65,535-word ceiling.
The IR carries the vector seam as an embedding{ model_id, model_version, dimensions, vector } object (null in v1); the sink maps embedding.vector → Meilisearch _vectors.digger_semantic.
source_folder is a derived, length-bounded token (≤468 bytes) so folder faceting works within Meili's filterable-value cap; raw path is displayed-only.
Provenance is structural: page (PDF), slide (PPTX), sheet (XLSX), and transcript-segment timestamps (A/V) live in the IR's structured segments so the UI can deep-link.
Model access is uniform: OCR, ASR, and (future) embeddings all sit behind one ModelBackend interface; the v1 default backends are Ollama (host service) for OCR and Docling's built-in ASR for A/V (a dedicated faster-whisper extractor is the v2 ASR upgrade), all endpoint/flag-configurable.
Ollama deployment default = host service (resolves the user's open question): native install, OLLAMA_HOST=0.0.0.0:11434, pipeline reaches it via host.docker.internal with extra_hosts: host-gateway so Windows/macOS/Linux converge on one default; the dev VM keeps 192.168.122.1 as an override. Fully overridable; Dockerized-Ollama is documented as the alternative for GPU-reproducible setups.

4. Trade-offs — resolved

The three trade-offs posed at synthesis time are now decided (authoritative in the ADRs):

Deduplication → one document per path (ADR 0008, ADR 0003).
Chunking → chunk long docs in v1 as a Transformer concern (ADR 0004); distinctAttribute = parent_id collapses a document's chunks to one hit per file.
Windows CI host → a KVM Windows 11 VM on the Linux box (ADR 0010).

5. Key version pins (verified 2026-07-01)

Component	Pin
Meilisearch	`getmeili/meilisearch:v1.48.3`
Docling	2.107.x
faster-whisper	1.2.x (`large-v3`, int8)
FastAPI / HTMX	0.136.x / 2.0.x
python-docx / openpyxl / python-pptx	1.2.x / 3.1.5 / 1.0.2 — metadata augmentation only in v1 (docx2python dropped; calamine + xlrd deferred to v2 with legacy formats)
unoserver	3.7
static-ffmpeg	2.13
Crown0815 Windows runner	v12.12.0
Python	3.12

13 KiB Raw Permalink Blame History Unescape Escape