docs: set bge-m3 (1024-dim) as the planned default embedder

Switch the committed userProvided embedder dimension 768 -> 1024 to match bge-m3, the planned v2 default (research B already recommended it as the Arabic+English choice; ADR 0005 previously carried e5-base/768 as the example default, an inconsistency this resolves). Free to change now since v1 writes no vectors.

Updated: meilisearch-settings.json, ADR 0003/0005, ir-schema.json, SYNTHESIS, README, and A-meilisearch config/prose. Remaining 768 mentions are legitimate (alternative 768-dim models; nomic-embed-text's real dim).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 16:10:43 +04:00
commit 2e1f8f46bd
parent d65751e12d
7 changed files with 15 additions and 15 deletions
--- a/README.md
+++ b/README.md
@@ -37,7 +37,7 @@ Each stage talks to the next only through a documented interface; the IR is the

 ## Key technology choices (v1)

-- **Search:** Meilisearch `v1.48.3` — single index, chunk-granularity collapsed by `parent_id`, Arabic+English via Charabia, vector embedder declared-but-dormant.
+- **Search:** Meilisearch `v1.48.3` — single index, chunk-granularity collapsed by `parent_id`, Arabic+English via Charabia, vector embedder declared-but-dormant (`bge-m3`, 1024-dim, populated in v2).
 - **OCR:** fine-tuned Qwen2.5-VL via local Ollama for scanned Arabic/handwriting/IDs/forms (base-model Arabic handwriting is only moderate — a Sprint-1 corpus-validation risk); Docling for native-digital. The OCR harness (Docling VLM vs thin wrapper) is a Sprint-1 bake-off.
 - **Office (v1):** Docling extracts `.docx/.xlsx/.pptx` content + a thin `OfficeMetadataAugmenter` (python-docx/python-pptx/openpyxl, core properties only). Legacy binary `.doc/.xls/.ppt` and Access `.mdb/.accdb` are **deferred to v2** (v1 routes them to `skipped`).
 - **Audio/video (v1):** Docling's built-in ASR (WhisperS2T, `large-v3`), segment-level timestamps; a dedicated ffmpeg + faster-whisper extractor (word-level timestamps, VAD, diarization) is the **v2** upgrade.
--- a/docs/decisions/0003-meilisearch-index-design.md
+++ b/docs/decisions/0003-meilisearch-index-design.md
@@ -47,10 +47,10 @@ So a query collapses to **one hit per logical file-at-path** while the **best-ma
 Declare one embedder at index creation, in Sprint 0, at **zero cost**:

 ```json
-"embedders": { "digger_semantic": { "source": "userProvided", "dimensions": 768 } }
+"embedders": { "digger_semantic": { "source": "userProvided", "dimensions": 1024 } }
 ```

-`userProvided` means Meilisearch never calls out — the pipeline supplies vectors when ready (v2). Dimensions are **768** (the planned default). Because v1 writes **no** `_vectors`, the dimension can still be finalized in v2 when the embedding model is chosen — a change only forces re-generation once vectors exist, and none do in v1. The v2 model must emit this dimension (e.g. a 768-dim multilingual embedder; note `bge-m3` is 1024-dim — [ADR 0005](0005-model-backends-and-ollama.md)). Documents are keyword-only until v2. Hybrid is a per-query parameter (`semanticRatio`) requiring no settings change to enable later.
+`userProvided` means Meilisearch never calls out — the pipeline supplies vectors when ready (v2). Dimensions are **1024**, matching the planned default embedder **`bge-m3`** (multilingual, Arabic-capable, served by the same Ollama instance — [ADR 0005](0005-model-backends-and-ollama.md)). Because v1 writes **no** `_vectors`, the dimension can still be changed in v2 if the embedder choice changes before any vector exists — a change only forces re-generation once vectors exist, and none do in v1. Documents are keyword-only until v2. Hybrid is a per-query parameter (`semanticRatio`) requiring no settings change to enable later.

 ### Operations

--- a/docs/decisions/0005-model-backends-and-ollama.md
+++ b/docs/decisions/0005-model-backends-and-ollama.md
@@ -37,7 +37,7 @@ Extractors receive a `ModelBackend` and call capabilities; they never import a m
 | OCR | **Qwen2.5-VL via Ollama** (`qwen2.5vl:7b`) + the `arabic-ocr` document-aware prompt | The **model and prompt are fixed**; the *invocation harness* is an implementation detail behind the `Extractor` interface, chosen by a **Sprint-1 bake-off** on the real Arabic corpus. **(1) Docling's VLM pipeline** — `ApiVlmOptions` → Ollama, `prompt=` our prompt, `response_format=MARKDOWN`, `images_scale` tuned for ~300 DPI — unifies OCR under the Docling path (uniform DoclingDocument→IR + chunking, least code): **leading candidate**. **(2) A thin direct Ollama wrapper** (~50 lines; full control of DPI, streaming, per-page timeout, `num_ctx`): **fallback** if (1) distorts the Arabic output, can't render high enough DPI, or handles slow-CPU streaming poorly. The IR is identical either way. 3B variant selectable for speed. See [ADR 0006](0006-document-conversion-routing.md). |
 | ASR — **v1** | **Docling ASR pipeline (WhisperS2T backend), model `large-v3`** | For v1, audio/video is handled by `DoclingExtractor` as just-another-format ([ADR 0006](0006-document-conversion-routing.md)) — one extraction engine, fewest moving parts. **Override the preset to `large-v3`** (Docling's ASR default is `WHISPER_TINY`; avoid the turbo variant, which degrades on lower-resource languages). Local, CPU-capable (CTranslate2 int8). Docling surfaces **segment-level** timestamps (`start`/`end`) only — sufficient for the "jump to the moment" deep-link (Whisper computes word-level timestamps, but Docling does not expose them in `DoclingDocument`). |
 | ASR — **v2** | **Dedicated faster-whisper extractor** (behind `ASRBackend`) | Upgrade for **word-level** timestamps, **Silero VAD** (filters hallucination on silence), a fine-tuned Arabic override (e.g. `Byne/whisper-large-v3-arabic`), and diarization (WhisperX/pyannote). ~5–6× real-time on a decent multi-core CPU for large-v3 int8 (the turbo variant reaches ~10–15×); ~1.5 GB weights, 2–3 GB peak RAM; GPU = `device=cuda, compute_type=float16`. Lands with the v2 A/V work. |
-| Embed — **v2 only** | **local multilingual embedder (TBD)** | Must emit **768 dims** to match the committed `userProvided` embedder ([ADR 0003](0003-meilisearch-index-design.md)) — e.g. `multilingual-e5-base` (768). Note `bge-m3` is **1024-dim**, so choosing it means committing 1024 instead (revisable before the first vector, since v1 stores none). Interface present in v1; not invoked. |
+| Embed — **v2 only** | **`bge-m3` via Ollama (1024-dim)** | Planned default (research B: best Arabic+English multilingual retrieval; dense/sparse/multi-vector; 8192-token context; served by the same Ollama instance — no new infra). Must emit **1024 dims** to match the committed `userProvided` embedder ([ADR 0003](0003-meilisearch-index-design.md)). Still revisable before the first vector exists (v1 stores none); switching to a 768-dim model (e.g. `multilingual-e5-base`) would mean re-committing the embedder dimension first. Interface present in v1; not invoked. |

 **Ollama is a first-class provider across the ecosystem** (LlamaIndex, Haystack, RAGFlow, txtai, PrivateGPT, AnythingLLM) — using our local models is standard, not exotic. Notably, **Docling's own VLM pipeline can drive Qwen2.5-VL via Ollama with our custom Arabic prompt** (`ApiVlmOptions`, default `http://localhost:11434/v1/chat/completions`). digger therefore exposes Docling-VLM as an **alternate, opt-in OCR backend** behind this same interface — most useful for models that natively emit DocTags (e.g. GraniteDocling). The OCR **harness is decided by a Sprint-1 bake-off** (see the OCR row above): **Docling-VLM is the leading candidate** (unifies OCR under the Docling path); the thin direct Ollama wrapper is the **fallback** if the bake-off shows Docling's Markdown re-parsing distorts the Arabic output, page DPI is too low, or slow-CPU streaming is mishandled. (Accuracy note: `ApiVlmOptions` is *not* formally deprecated in docling 2.107.0 — only `HuggingFaceVlmOptions` is — but it carries a runtime migration hint, the successor preset system does not support arbitrary custom prompts, and its `MARKDOWN` response path zeroes bounding boxes.) See [`../research/H-framework-fork-reevaluation.md`](../research/H-framework-fork-reevaluation.md) §4 and [`../research/I-docling-front-half.md`](../research/I-docling-front-half.md) §3.2.

--- a/docs/decisions/ir-schema.json
+++ b/docs/decisions/ir-schema.json
@@ -198,7 +198,7 @@
      "properties": {
        "model_id": { "type": "string" },
        "model_version": { "type": "string" },
-        "dimensions": { "type": "integer", "description": "Embedding dimensionality; finalized with the V2 model choice and equal to the Meilisearch embedder's `dimensions`. Because v1 writes no vectors, it can still be set before the first vector exists." },
+        "dimensions": { "type": "integer", "description": "Embedding dimensionality; equal to the Meilisearch embedder's `dimensions` (1024 for the planned bge-m3 default). Because v1 writes no vectors, it can still be changed before the first vector exists." },
        "vector": { "type": "array", "items": { "type": "number" }, "description": "Length equals `dimensions`." }
      }
    },
--- a/docs/decisions/meilisearch-settings.json
+++ b/docs/decisions/meilisearch-settings.json
@@ -80,7 +80,7 @@
  "embedders": {
    "digger_semantic": {
      "source": "userProvided",
-      "dimensions": 768
+      "dimensions": 1024
    }
  }
 }
--- a/docs/research/A-meilisearch.md
+++ b/docs/research/A-meilisearch.md
@@ -103,7 +103,7 @@ Comments inline (strip before sending — JSON does not support comments).
  "embedders": {
    "digger_semantic": {
      "source": "userProvided",
-      "dimensions": 768
+      "dimensions": 1024
    }
  }
 }
@@ -114,7 +114,7 @@ Comments inline (strip before sending — JSON does not support comments).
 - **`searchableAttributes` order matters** for the `attribute` ranking rule: `content_text` is first (dominant for relevance), `filename` after it (boosts exact filename hits). `path` is **not** searchable — it is display/open-only, so path text never influences content relevance.
 - **`filterableAttributes`** — `source.parent_dir` is omitted because filterable attribute values are hard-limited to **468 bytes** (LMDB constraint). Full paths can exceed this; store a normalized parent-dir token (≤ 468 bytes) if you want faceted folder filtering.
 - **`modified_at_epoch` and `created_at_epoch`** are Unix epoch integers, enabling range filters (`modified_at_epoch > 1700000000`) without date parsing in the filter language.
-- **`embedders.digger_semantic`** with `userProvided` + `dimensions: 768` reserves the vector slot. Meilisearch does not call any external service for userProvided embedders; it simply expects documents to optionally include a `_vectors.digger_semantic` array of 768 floats. Documents without the field are indexed and searched normally in keyword mode.
+- **`embedders.digger_semantic`** with `userProvided` + `dimensions: 1024` reserves the vector slot. Meilisearch does not call any external service for userProvided embedders; it simply expects documents to optionally include a `_vectors.digger_semantic` array of 1024 floats (matching the planned `bge-m3` default). Documents without the field are indexed and searched normally in keyword mode.
 - **`stopWords: []`** — Start empty; add Arabic and English stop words once you have a representative corpus. See Section 4 for Arabic-specific guidance.
 - **`maxTotalHits: 10000`** — raised from the default 1000 to support pagination over large result sets without hitting the cap; keep bounded to protect performance.
 - **`distinctAttribute: "parent_id"`** — collapses a document's chunks to one hit per logical file-at-path while surfacing the best-matching chunk. `parent_id = sha256(path)` is path-stable, so `distinct` is a no-op for a single-chunk file and unchanged across edits ([ADR 0002](../decisions/0002-intermediate-representation.md)). This is fully compatible with "one document per path" (each path has a unique `parent_id`); it is *not* used for content-hash deduplication (that remains one document per path, per the confirmed decision).
@@ -283,14 +283,14 @@ Since our IR `id` is derived from `path + content_hash`:
 - All inference is local (requirement). Embeddings are generated by our pipeline, not by Meilisearch.
 - Decouples embedding generation from indexing: the pipeline can include/omit `_vectors` per document independently.
 - Meilisearch does not call any external service; it simply stores and searches the vectors.
-- `dimensions: 768` covers most local embedding models (e.g. BAAI/bge-base, nomic-embed-text, multilingual-e5-base).
+- `dimensions` is set to **1024** to match digger's planned default embedder `bge-m3` (best Arabic+English retrieval — research B). Common 768-dim models (BAAI/bge-base, nomic-embed-text, multilingual-e5-base) would need the embedder re-declared at 768 before use.

 **`userProvided` embedder config:**
 ```json
 {
  "digger_semantic": {
    "source": "userProvided",
-    "dimensions": 768
+    "dimensions": 1024
  }
 }
 ```
@@ -318,14 +318,14 @@ Documents without `_vectors.digger_semantic` are indexed normally and participat
    "response": {
      "data": ["{{embedding}}", "{{..}}"]
    },
-    "dimensions": 768,
+    "dimensions": 1024,
    "documentTemplate": "{{doc.content_text | truncatewords: 200}}"
  }
 }
 ```
 This makes Meilisearch call the embedding service on every document add/update. More convenient but adds coupling: indexing fails if the embedding service is down. For digger's standalone-pipeline requirement, `userProvided` is preferable.

-**Warning:** Changing `source`, `model`, `dimensions`, or `documentTemplate` on an existing embedder triggers **complete re-generation** of all embeddings. Pick dimensions and stick with them. 768 is a safe, widely-supported value.
+**Warning:** Changing `source`, `model`, `dimensions`, or `documentTemplate` on an existing embedder triggers **complete re-generation** of all embeddings. Pick dimensions and stick with them — digger commits to **1024** (the `bge-m3` default). Because v1 stores no vectors, this is free to finalize now and only becomes locked once vectors exist.

 **Hybrid search at query time:**
 ```json
@@ -339,7 +339,7 @@ This makes Meilisearch call the embedding service on every document add/update.
 ```
 `semanticRatio` ranges from 0.0 (pure keyword) to 1.0 (pure semantic); default 0.5 balances both. This is a per-query parameter, not a settings parameter — no re-indexing needed to tune it.

-**Key design implication:** Because the `userProvided` embedder is declared at index creation with zero cost, do this in v1 Sprint 0 scaffolding. Changing dimensions later requires re-indexing everything. Pick 768 now.
+**Key design implication:** Because the `userProvided` embedder is declared at index creation with zero cost, do this in v1 Sprint 0 scaffolding. Changing dimensions later requires re-indexing everything. Commit **1024** now (the `bge-m3` default).

 ---

@@ -488,7 +488,7 @@ POST /keys

 3. **Re-indexing cost on setting changes**: Adding a new `filterableAttribute` or changing `searchableAttributes` order triggers a full re-index of the entire corpus. With 500K files this can take hours. Plan the full attribute list before first production import.

-4. **Embedder config lock-in**: Changing embedder `dimensions` requires re-indexing all vectors. Commit to 768 dimensions now. If a better model uses different dimensions, you'll need a new embedder name (supported — multiple embedders are allowed on one index) rather than changing the existing one.
+4. **Embedder config lock-in**: Changing embedder `dimensions` requires re-indexing all vectors. Commit to **1024** dimensions now (the `bge-m3` default). If a future model uses different dimensions, you'll need a new embedder name (supported — multiple embedders are allowed on one index) rather than changing the existing one.

 5. **Arabic query quality**: Diacritic normalization is good, but very short Arabic queries (1–2 words) may produce overly broad results due to root-sharing words. Test with representative queries; may need to add domain-specific synonyms.

--- a/docs/research/SYNTHESIS.md
+++ b/docs/research/SYNTHESIS.md
@@ -32,7 +32,7 @@
 - **Pin `getmeili/meilisearch:v1.48.3`.** Vector/hybrid is production-stable in v1.x; no experimental flags needed.
 - **Single index `digger_documents`** + `localizedAttributes` (Arabic + English tokenization; object form — see the settings file) for mixed-language content.
 - **Primary key `id` = SHA-256 hex of `canonical_path + "|" + content_hash`** — Meili IDs allow only `[a-zA-Z0-9_-]`, so raw paths can't be used; the hash is 64 hex chars, safe.
-- **Reserve vectors now, at zero cost:** declare a dormant `userProvided` embedder (planned `dimensions: 768`) at index creation. Vectors are populated only in v2; because v1 writes none, the dimension is still finalizable with the chosen v2 embedding model before the first vector exists ([ADR 0003](../decisions/0003-meilisearch-index-design.md), [ADR 0005](../decisions/0005-model-backends-and-ollama.md)).
+- **Reserve vectors now, at zero cost:** declare a dormant `userProvided` embedder (`dimensions: 1024`, matching the planned default embedder **`bge-m3`**) at index creation. Vectors are populated only in v2; because v1 writes none, the dimension is still changeable if the embedder choice changes before the first vector exists ([ADR 0003](../decisions/0003-meilisearch-index-design.md), [ADR 0005](../decisions/0005-model-backends-and-ollama.md)).
 - Arabic: native via Charabia (article segmentation, diacritic stripping). Configure Arabic+English stop words manually (no built-ins). Keep typo `oneTypo` minimum at 5 (correct for short Arabic roots).
 - **Two hard limits that shape the IR:** (1) **65,535 word-positions per field** — long OCR'd PDFs / long transcripts silently truncate; (2) **468-byte filterable-value cap** — raw paths can't be filter values; derive a short `source_folder` token instead.
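
Review note: the invariant this commit relies on is that any `_vectors.digger_semantic` array a document carries has exactly the declared `dimensions`, and that documents without the field stay keyword-only. A minimal sketch of a pre-import check follows. The helper names (`declared_dimensions`, `mismatched_ids`) and the sample documents are hypothetical, not part of the repository, and the sketch assumes the plain array form of `_vectors` described in A-meilisearch.md.

```python
from typing import Any

EMBEDDER = "digger_semantic"  # embedder name from meilisearch-settings.json


def declared_dimensions(settings: dict[str, Any], embedder: str = EMBEDDER) -> int:
    """Read the committed dimension for one embedder from a settings dict."""
    return int(settings["embedders"][embedder]["dimensions"])


def mismatched_ids(
    settings: dict[str, Any],
    documents: list[dict[str, Any]],
    embedder: str = EMBEDDER,
) -> list[str]:
    """Return ids of documents whose vector length differs from the declared dimension.

    Documents with no vector for this embedder are skipped: they are
    keyword-only, which is the expected state for every v1 document.
    """
    expected = declared_dimensions(settings, embedder)
    bad: list[str] = []
    for doc in documents:
        vector = (doc.get("_vectors") or {}).get(embedder)
        if vector is None:
            continue  # keyword-only document, nothing to validate
        if len(vector) != expected:
            bad.append(doc["id"])
    return bad


# Hypothetical sample data: the settings fragment mirrors the committed embedder.
settings = {"embedders": {EMBEDDER: {"source": "userProvided", "dimensions": 1024}}}
documents = [
    {"id": "keyword-only"},
    {"id": "ok", "_vectors": {EMBEDDER: [0.0] * 1024}},
    {"id": "stale-768", "_vectors": {EMBEDDER: [0.0] * 768}},
]

print(mismatched_ids(settings, documents))
```

Running a check like this in the v2 pipeline before the first vector write would catch a model swap (for example to a 768-dim embedder) that was not accompanied by re-declaring the embedder dimension.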