docs: research findings and v1 design (IR contract, index, ADRs)
Design phase output for digger (no implementation yet).

Research (docs/research/): six findings docs (Meilisearch; local-model tooling, including the existing arabic-ocr setup; Office/legacy; audio/video; frontend/UX; Forgejo CI + Windows runner) plus SYNTHESIS.md.

Design (docs/decisions/): the Canonical Document IR JSON Schema v1.0 (the contract) with worked examples, the concrete Meilisearch settings, and ADRs 0001–0010 covering architecture/layering, the IR, index design (single index, chunk-granularity collapsed by parent_id), chunking, model backends + Ollama deployment, conversion routing, the read-side SearchProvider + HTMX UI, dedup/StateStore/incremental/reindex, Docker-Compose packaging, and layered CI with a native Windows runner.

Confirmed decisions baked in: Arabic+English; one document per path; chunk long docs in v1; vectors designed-for but switched off; Ollama as a host service; Windows CI on a KVM VM.

Also adds project README, CLAUDE.md, the brief, and .gitignore.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 57b51329f7
commit 5cc8c99109
25 changed files with 4113 additions and 0 deletions
39
.gitignore
vendored
Normal file
@@ -0,0 +1,39 @@
# --- Editors / IDEs ---
.idea/
.vscode/
*.iml
.obsidian/
**/.obsidian/

# --- Python ---
__pycache__/
*.py[cod]
*.egg-info/
.eggs/
build/
dist/
.venv/
venv/
*_env/
.mypy_cache/
.ruff_cache/
.pytest_cache/
.coverage
coverage.xml
htmlcov/

# --- Secrets / local config ---
.env
.env.*
!.env.example

# --- Runtime / data ---
*.log
/data/
ir-output/
*.ms/
meili_data/

# --- OS ---
.DS_Store
Thumbs.db
26
CLAUDE.md
Normal file
@@ -0,0 +1,26 @@
# File-Ingestion Search Pipeline
|
||||
|
||||
Authoritative spec — read before planning or any architectural decision:
|
||||
@docs/digger-brief.md
|
||||
|
||||
## Always
|
||||
- Use the superpowers brainstorm → plan → execute workflow and TDD (red → green → refactor).
- Do NOT build features until I approve the plan. Create the Forgejo V1 milestone + issues
  only AFTER the design is approved.
- Every branch gets its own git worktree at `../<repo>.worktrees/<issue>-<slug>` (sibling to
  the repo), branched as `<type>/<issue>-<slug>` (feat|fix|chore|docs|refactor|test) off the
  latest main. Never work on main. Remove the worktree and delete the branch after merge.
- Dev loop: pick a Forgejo issue (via the Forgejo MCP) → worktree + branch → tests-first →
  PR (Closes #N) → review + green CI → **I approve and merge. Never self-merge.**
- Every sprint ships a working end-to-end slice.
|
||||
|
||||
## Invariants (never compromise)
|
||||
- Pipeline runs standalone without Meilisearch; the search backend is swappable behind an interface.
|
||||
- The intermediate representation (IR) is the contract — keep it stable and engine-agnostic.
|
||||
- All model inference (OCR / ASR / embeddings) is local; no file content leaves the machine.
|
||||
- v1 = keyword search only; design for vector/hybrid but keep it switched off.
|
||||
|
||||
## Layout
|
||||
- docs/digger-brief.md — full spec
|
||||
- docs/research/ — subagent findings
|
||||
- docs/decisions/ — ADRs, IR schema, Meilisearch settings
|
||||
47
README.md
@@ -0,0 +1,47 @@
# digger
|
||||
|
||||
A modular, **local-first file-ingestion search pipeline**. It walks files on a machine, extracts their content — including scanned documents (OCR), Office files, and audio/video (transcription) — normalizes everything into one well-defined **intermediate representation (IR)**, and feeds that into a **swappable search backend** (Meilisearch first) for full-text search.
|
||||
|
||||
Two hard requirements shape every decision:
|
||||
|
||||
1. **Runs standalone without a search engine** — it can emit the IR to disk on its own; indexing is a separate, swappable stage.
|
||||
2. **The search backend is swappable** behind an interface.
|
||||
|
||||
All model inference (OCR / ASR / embeddings) runs against **local models** — no file content leaves the machine. Target platform is primarily **Windows**, but the code is cross-platform (Windows + Linux + macOS). v1 ships **keyword search**; vector/hybrid is designed-for but switched off.
|
||||
|
||||
> **Status: design phase.** This branch contains the research and the design (no implementation yet). Implementation starts after the plan and Forgejo milestones are approved.
|
||||
|
||||
## Start here
|
||||
|
||||
| Document | What it is |
|---|---|
| [`docs/digger-brief.md`](docs/digger-brief.md) | The authoritative project brief / spec. |
| [`docs/research/SYNTHESIS.md`](docs/research/SYNTHESIS.md) | Per-layer technology decisions reconciled from all research. |
| [`docs/decisions/`](docs/decisions/) | The ADRs and the two contract artifacts (IR schema + Meilisearch settings). Read [`docs/decisions/README.md`](docs/decisions/README.md) first. |
| [`docs/research/`](docs/research/) | Detailed findings (Meilisearch, model tooling, Office, A/V, frontend, CI). |
|
||||
|
||||
## Architecture at a glance
|
||||
|
||||
```
filesystem → [Source] → [Router] → [Extractor] → Canonical Document (IR)
           → [Transformer: normalize · language · chunk · (future) embed]
           → [Sink] → { FileSink (IR→disk) | MeilisearchSink }
read side: [SearchProvider] ← FastAPI + HTMX UI
crosscut:  ModelBackend (OCR/ASR/embed) · StateStore (SQLite) · Config (TOML) · CLI
```
|
||||
|
||||
Each stage talks to the next only through a documented interface; the IR is the contract at the waist. See [ADR 0001](docs/decisions/0001-architecture-and-layering.md) and [ADR 0002](docs/decisions/0002-intermediate-representation.md).
|
||||
|
||||
## Key technology choices (v1)
|
||||
|
||||
- **Search:** Meilisearch `v1.48.3` — single index, chunk-granularity collapsed by `parent_id`, Arabic+English via Charabia, vector embedder declared-but-dormant.
|
||||
- **OCR:** Qwen2.5-VL via local Ollama (best for Arabic handwriting/IDs/forms); Docling for native-digital documents.
|
||||
- **Office/legacy:** docx2python / openpyxl / python-pptx; unoserver (LibreOffice) for legacy binaries; Access Windows-only in v1.
|
||||
- **Audio/video:** ffmpeg + faster-whisper `large-v3` (CPU-first); diarization deferred to V2.
|
||||
- **UI:** FastAPI + Jinja2 + HTMX, engine-agnostic via the `SearchProvider` interface.
|
||||
- **Packaging:** one-command Docker Compose (CPU defaults, zero-config first run); every piece overridable.
|
||||
- **CI:** Forgejo Actions — layered Linux tiers + a native Windows runner.
|
||||
|
||||
## Languages
|
||||
|
||||
Arabic + English, including right-to-left UI and mixed-language documents.
|
||||
73
docs/decisions/0001-architecture-and-layering.md
Normal file
@@ -0,0 +1,73 @@
# ADR 0001 — Architecture and layering
|
||||
|
||||
**Status:** Accepted (v1)
|
||||
|
||||
## Context
|
||||
|
||||
digger walks files, extracts their content (including scanned documents, Office files, and audio/video via local models), normalizes everything into one intermediate representation (IR), and feeds that into a swappable search backend. Two hard requirements from the brief shape every structural choice:
|
||||
|
||||
1. The pipeline must be **usable standalone, without Meilisearch** — it can emit IR to disk on its own.
|
||||
2. The search backend must be **swappable** behind an interface.
|
||||
|
||||
Plus: strict layering (each stage talks to the next only through a documented interface), local-first inference, idempotent/incremental runs, and fail-isolated error handling.
|
||||
|
||||
## Decision
|
||||
|
||||
Adopt a strict, one-directional pipeline of stages connected only through explicit Python `Protocol` interfaces. No stage imports a concrete implementation of another stage.
|
||||
|
||||
```
filesystem
    │
    ▼
[ Source / Walker ]             discovers files, yields FileRef + filesystem metadata
    │
    ▼
[ Router ]                      picks an Extractor by type / MIME / content sniffing
    │
    ▼
[ Extractor (per format) ]      uses ModelBackend(s) as needed (OCR / ASR / embed)
    │
    ▼
[ Canonical Document (IR) ]     THE CONTRACT, serializable to JSONL on disk
    │
    ▼
[ Transformer / Enricher ]      normalize, detect language, CHUNK, (future) embed
    │
    ▼
[ Sink / Indexer (interface) ]  FileSink (IR→disk) | MeilisearchSink | …
```
|
||||
|
||||
Cross-cutting components:
|
||||
|
||||
- **ModelBackend** — abstracts local OCR / ASR / (future) embedding models. Extractors depend on the interface, never a concrete model. Default backends talk to Ollama (OCR) and faster-whisper (ASR). See [ADR 0005](0005-model-backends-and-ollama.md).
|
||||
- **SearchProvider** — the read-side mirror of `Sink`. The UI and any search API talk only to this. See [ADR 0007](0007-search-provider-and-ui.md).
|
||||
- **StateStore** — local SQLite recording content-hash + mtime per path to drive incremental runs and deletion handling. See [ADR 0008](0008-incremental-dedup-statestore.md).
|
||||
- **Config** — a single TOML file with env-var overrides, selecting active sink, model backends, source roots, concurrency, and per-format options.
|
||||
- **CLI** — `scan`, `extract` (files → IR on disk), `index` (IR → sink), `run` (end-to-end), `status`, and `reindex`. The `extract`/`index` split is what makes the pipeline usable without a search engine.
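
As an illustration of the StateStore bullet, here is a minimal sketch of a SQLite-backed store. The method names follow the interface table in this ADR; the table layout, the `mtime` column type, and the `present_paths` argument to `deletions` are assumptions made for the example, not the design fixed by ADR 0008.

```python
import sqlite3


class SqliteStateStore:
    """Hypothetical StateStore: one row per path with its content hash and mtime."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS files ("
            "path TEXT PRIMARY KEY, content_hash TEXT NOT NULL, mtime REAL NOT NULL)"
        )

    def seen(self, path: str, content_hash: str) -> bool:
        # True only if this exact path was already processed with this exact content.
        row = self.conn.execute(
            "SELECT 1 FROM files WHERE path = ? AND content_hash = ?",
            (path, content_hash),
        ).fetchone()
        return row is not None

    def mark(self, path: str, content_hash: str, mtime: float) -> None:
        # Insert or overwrite the row for this path.
        self.conn.execute(
            "INSERT OR REPLACE INTO files (path, content_hash, mtime) VALUES (?, ?, ?)",
            (path, content_hash, mtime),
        )
        self.conn.commit()

    def deletions(self, present_paths: set[str]) -> set[str]:
        # Paths recorded earlier that the current walk no longer found.
        known = {row[0] for row in self.conn.execute("SELECT path FROM files")}
        return known - present_paths
```

A run would call `seen` before extracting a file, `mark` once the sink confirms the write, and `deletions` after the walk to find records to purge.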
|
||||
|
||||
### The seven interfaces (defined in Sprint 0, before any extractor)
|
||||
|
||||
| Interface | Responsibility | Key method(s) |
|---|---|---|
| `Source` | Discover files under roots, yield references + fs metadata | `walk() -> Iterable[FileRef]` |
| `Extractor` | Turn one file into one or more IR records | `extract(FileRef, ModelBackend) -> Iterable[CanonicalDocument]` |
| `ModelBackend` | Local inference (OCR/ASR/embed) | `ocr_image(...)`, `transcribe(...)`, `embed(...)` |
| `Transformer` | Post-extraction enrichment incl. chunking | `transform(CanonicalDocument) -> Iterable[CanonicalDocument]` |
| `Sink` | Write IR to a destination | `upsert(Iterable[CanonicalDocument])`, `delete(ids)` |
| `SearchProvider` | Query-time read side | `search(...)`, `suggest(...)`, `health()` |
| `StateStore` | Track processed state | `seen(path, hash)`, `mark(...)`, `deletions(...)` |
|
||||
|
||||
A trivial **FileSink** (writes IR JSONL to disk) and a **fake/mocked ModelBackend** ship in Sprint 0 so the standalone path and the unit-test path both work on day one.
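
To make the standalone path concrete, the following is a rough sketch of what a `Sink` protocol and a trivial `FileSink` might look like. `CanonicalDocument` is stood in for by a plain dict, and the append-then-rewrite strategy is an assumption chosen for brevity rather than the project's implementation.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Protocol

# Hypothetical alias: the real record shape is defined by ir-schema.json.
CanonicalDocument = dict[str, Any]


class Sink(Protocol):
    def upsert(self, docs: Iterable[CanonicalDocument]) -> None: ...
    def delete(self, ids: Iterable[str]) -> None: ...


class FileSink:
    """Writes IR records to a JSONL file, one JSON object per line."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def upsert(self, docs: Iterable[CanonicalDocument]) -> None:
        # Append-only for simplicity; a real sink would replace records by id.
        with self.path.open("a", encoding="utf-8") as fh:
            for doc in docs:
                # ensure_ascii=False keeps Arabic text readable on disk.
                fh.write(json.dumps(doc, ensure_ascii=False) + "\n")

    def delete(self, ids: Iterable[str]) -> None:
        # Rewrite the file without the deleted ids.
        doomed = set(ids)
        if not self.path.exists():
            return
        kept = [
            line
            for line in self.path.read_text(encoding="utf-8").splitlines()
            if line and json.loads(line)["id"] not in doomed
        ]
        self.path.write_text("".join(k + "\n" for k in kept), encoding="utf-8")
```

Because `Sink` is a `Protocol`, `FileSink` satisfies it structurally without inheriting from it, which matches the rule that no stage imports another stage's concrete implementation.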
|
||||
|
||||
## Consequences
|
||||
|
||||
- **Adding a new format = adding one `Extractor`** (and possibly a `ModelBackend`), with no change to other layers. This is the core extensibility property the brief demands.
|
||||
- **Swapping the search engine = writing one `Sink` + one `SearchProvider`.** Nothing upstream or in the UI changes.
|
||||
- **Testability is structural:** interfaces let unit tests inject fakes for models and sinks, keeping the default CI tier fast and offline (see [ADR 0010](0010-ci-and-windows-runner.md)).
|
||||
- The IR sits exactly at the architectural waist; its stability is paramount (see [ADR 0002](0002-intermediate-representation.md)).
|
||||
- Chunking lives in the Transformer, **not** in extractors, so it can be turned off for the standalone case ([ADR 0004](0004-chunking-strategy.md)).
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **An all-in-one toolkit (e.g. drive everything through Unstructured/Tika).** Rejected as the *architecture*: it couples extraction, OCR, and output shape together and fights the local-model and swappable-sink requirements. We still *use* such toolkits inside specific extractors ([ADR 0006](0006-document-conversion-routing.md)), but they sit behind the `Extractor` interface, not in place of it.
|
||||
- **Letting the sink own normalization/chunking.** Rejected: it would make the standalone (FileSink) path lossy and engine-specific, violating requirement 1.
|
||||
59
docs/decisions/0002-intermediate-representation.md
Normal file
@@ -0,0 +1,59 @@
# ADR 0002 — The Canonical Document (IR) is the contract
|
||||
|
||||
**Status:** Accepted (v1)
|
||||
**Schema:** [`ir-schema.json`](ir-schema.json) · **Examples:** [`ir-examples.jsonl`](ir-examples.jsonl)
|
||||
|
||||
## Context
|
||||
|
||||
The IR is the most important artifact in the repo: every extractor produces it, every sink consumes it, and the standalone path serializes it to disk. It must be rich enough to drive keyword search now and vector/hybrid search later, while staying independent of any search engine. The brief enumerates required content: identity, source metadata, provenance, content (flat text + structured segments), format-specific metadata, derived fields, and processing status.
|
||||
|
||||
Two engine-imposed limits and three confirmed decisions constrain the design:
|
||||
|
||||
- Meilisearch primary keys allow only `[a-zA-Z0-9_-]`, ≤511 bytes → IDs must be hashes, not paths.
|
||||
- Meilisearch filterable values cap at 468 bytes → raw paths cannot be filter values.
|
||||
- Confirmed: **one document per path** (dedup), **chunk long documents in v1**, **vectors designed-for but off**.
|
||||
|
||||
## Decision
|
||||
|
||||
Define a single, versioned, JSON-serializable **Canonical Document** schema (`schema_version: "1.0"`). One IR record is one indexable unit:
|
||||
|
||||
- A **whole document** = one record with `chunk_index=0`, `chunk_count=1`, `parent_id == id`.
|
||||
- A **chunked document** = N records sharing `parent_id` and `content_hash`, each with its own `content.content_text` span and `locator`.
|
||||
|
||||
### Field groups (full detail in the schema)
|
||||
|
||||
- **Identity & chunking:** `id`, `parent_id`, `chunk_index`, `chunk_count`, `content_hash`, `schema_version`.
|
||||
- **`source`:** `path` (display/open only), `filename`, `extension`, `source_folder` (derived ≤468-byte facet token), `mime_type`, `file_type` (facet enum), `size_bytes`, `created_at`/`modified_at` (+ `_epoch` ints for filter/sort), `host`, `drive`.
|
||||
- **`provenance`:** `extractor_name`/`extractor_version`, `processed_at`, optional `model{ocr_*, asr_*}`.
|
||||
- **`content`:** `content_text` (searchable), `content_truncated`, `title`, `language` (facet), `language_probability`, `tags`.
|
||||
- **`locator`** (per chunk, drives deep-links): `page_number`/`page_end`, `slide_number`, `sheet_name`, `timestamp_start`/`timestamp_end`, `speaker` (null in v1).
|
||||
- **`segments`:** structured native units (`page|slide|sheet|row|table|paragraph|transcript|note`) preserved for display and deep-linking; A/V segments carry `start`/`end` and optional word timestamps.
|
||||
- **`metadata`:** open, format-specific (author, keywords, sheet_names, duration_seconds, codec, exif, …); a few sub-keys are searchable.
|
||||
- **`embedding`:** reserved, **null in v1**; when populated in V2, the Meilisearch sink maps `vector` → `_vectors.digger_semantic`.
|
||||
- **`status`/`warnings`/`errors`:** `success|partial|failed|skipped` plus structured diagnostics.
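
The ADR does not spell out how `source_folder` is derived. One possible approach, shown here purely as an assumption, is to use the folder path as-is when it fits within 468 UTF-8 bytes and otherwise keep a readable prefix plus a short hash suffix. The function name, the `~` separator, and the 16-character digest length are all hypothetical.

```python
import hashlib

MAX_FILTER_BYTES = 468  # the Meilisearch filterable-value cap cited in this ADR


def source_folder_token(folder: str, limit: int = MAX_FILTER_BYTES) -> str:
    """Return a facet token for `folder` that is at most `limit` UTF-8 bytes."""
    raw = folder.encode("utf-8")
    if len(raw) <= limit:
        return folder
    # Too long: keep a readable prefix and append a short hash so that distinct
    # folders sharing the same prefix still get distinct tokens.
    digest = hashlib.sha256(raw).hexdigest()[:16]
    budget = limit - len(digest) - 1  # one byte for the "~" separator
    # errors="ignore" drops a multi-byte character cut in half by the slice.
    prefix = raw[:budget].decode("utf-8", errors="ignore")
    return f"{prefix}~{digest}"
```

Measuring in bytes rather than characters matters here because Arabic letters take two bytes each in UTF-8.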
|
||||
|
||||
### Identity formulas
|
||||
|
||||
```
content_hash = sha256_hex(file_bytes)
whole.id     = sha256_hex(canonical_path + "|" + content_hash)
chunk.id     = sha256_hex(canonical_path + "|" + content_hash + "|" + chunk_index)
parent_id    = sha256_hex(canonical_path + "|" + content_hash)   # for both
```
|
||||
|
||||
`canonical_path` is the absolute, normalized path (case/separator-normalized per OS). This makes IDs stable and idempotent: re-extracting an unchanged file yields identical IDs, so sink upserts are no-ops.
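
The identity formulas translate almost directly into Python. The sketch below assumes the concatenated string is UTF-8 encoded before hashing and that `chunk_index` is rendered as a decimal integer; neither detail is stated in the formulas, so treat both as illustrative choices.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def content_hash(file_bytes: bytes) -> str:
    return sha256_hex(file_bytes)


def whole_id(canonical_path: str, chash: str) -> str:
    # Also serves as parent_id for both whole documents and chunks.
    return sha256_hex(f"{canonical_path}|{chash}".encode("utf-8"))


def chunk_id(canonical_path: str, chash: str, chunk_index: int) -> str:
    return sha256_hex(f"{canonical_path}|{chash}|{chunk_index}".encode("utf-8"))
```

A SHA-256 hex digest is 64 characters drawn from `[0-9a-f]`, so it satisfies the primary-key character set and the 511-byte limit noted in the Context section.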
|
||||
|
||||
## Consequences
|
||||
|
||||
- **Engine-agnostic:** no Meilisearch concept appears in the IR. A different sink maps the same records to its own model.
|
||||
- **Standalone-ready:** the FileSink writes these records to JSONL unchanged; the IR *is* the on-disk format.
|
||||
- **Forward-compatible:** chunk fields, `locator`, `embedding`, and `tags` are present from v1 (mostly null/default in the whole-doc, keyword-only case), so enabling chunking and, later, vectors touches only the Transformer and the sink mapping, never the schema shape.
|
||||
- **Deep-linkable:** because page/slide/sheet/timestamp live on each record, the UI can jump to the exact location of a hit without re-parsing the source.
|
||||
- **Versioned:** `schema_version` plus `provenance.extractor_version`/`model` let the `reindex` command detect records produced by an older schema or model and reprocess them ([ADR 0008](0008-incremental-dedup-statestore.md)).
|
||||
- **Validation:** `ir-schema.json` is enforced in unit tests (round-trip serialize + schema-validate) so drift is caught in CI.
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **Path-derived IDs.** Rejected — illegal characters and the 511-byte limit; also brittle across OSes. Hash-of-(path|content_hash) is stable and valid.
|
||||
- **Separate schemas per file type.** Rejected — multiplies the contract and complicates the sink/UI. One schema with optional, type-appropriate fields (`locator`, `segments`, `metadata`) is simpler and keeps a single index viable.
|
||||
- **Embeddings inline as a top-level array.** Deferred — kept under `embedding{}` with model id/version so vectors are self-describing and the sink controls the `_vectors` mapping; avoids committing the top-level shape to one engine's convention.
|
||||
73
docs/decisions/0003-meilisearch-index-design.md
Normal file
@@ -0,0 +1,73 @@
# ADR 0003 — Meilisearch index design
|
||||
|
||||
**Status:** Accepted (v1)
|
||||
**Settings:** [`meilisearch-settings.json`](meilisearch-settings.json) · **Engine pin:** `getmeili/meilisearch:v1.48.3`
|
||||
|
||||
## Context
|
||||
|
||||
Meilisearch is the first sink. The index design must serve keyword search well for Arabic + English now, reserve cleanly for vectors later, and respect Meilisearch's hard limits — all while honoring the confirmed decisions (one document per path; chunk long docs in v1).
|
||||
|
||||
This ADR records the reconciliation of two perspectives the brief asked for:
|
||||
|
||||
- **Relevance/UX:** chunk-level granularity for long-document recall; rich facets (type, language, folder, date); Arabic-aware tokenization; highlighted, cropped snippets; typo tolerance tuned for Arabic.
|
||||
- **Data-modeling/pipeline:** stable IDs; one logical document per file; clean deletes; no lossy transforms; stay under the 468-byte filterable and 65,535-word-position limits; declare the full attribute set up front (changing it triggers a full reindex).
|
||||
|
||||
## Decision
|
||||
|
||||
**Single index `digger_documents`** with `localizedAttributes` declaring the `ara` and `eng` locales for mixed-language content.
|
||||
|
||||
### The chunk/parent reconciliation (the crux)
|
||||
|
||||
Index at **chunk granularity** (each IR record is one Meilisearch document) but set:
|
||||
|
||||
```
distinctAttribute = "parent_id"
```
|
||||
|
||||
So a query collapses to **one hit per logical file-at-path** while the **best-matching chunk** surfaces, carrying its own `locator` for deep-linking. Every record has a `parent_id` (equal to its own `id` for whole documents), so `distinct` is a no-op for unchunked files and a collapse for chunked ones. This satisfies the relevance view (chunk recall) and the data-modeling view (one result per file) at once, and is fully compatible with "one document per path" because each path has a unique `parent_id`.
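
The effect of collapsing on `parent_id` can be illustrated with a small client-side simulation. This is not how Meilisearch implements `distinctAttribute` internally; it only shows the observable result, assuming hits arrive already ordered by relevance. The function name and hit shape are hypothetical.

```python
from typing import Any, Iterable


def collapse_by_parent(ranked_hits: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only the first (best-ranked) hit for each parent_id."""
    seen: set[str] = set()
    collapsed: list[dict[str, Any]] = []
    for hit in ranked_hits:
        if hit["parent_id"] in seen:
            continue  # a better chunk of the same file was already kept
        seen.add(hit["parent_id"])
        collapsed.append(hit)
    return collapsed
```

For a whole document, `parent_id == id`, so it can never collide with another record and passes through untouched.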
|
||||
|
||||
### Attributes (see the settings file for the exact list)
|
||||
|
||||
- **Primary key:** `id`.
|
||||
- **Searchable (ordered for the `attribute` ranking rule):** `content.content_text`, `content.title`, `source.filename`, `content.tags`, `metadata.author/keywords/subject`. `source.path` is **not** searchable (it is display/open only).
|
||||
- **Filterable:** `source.file_type`, `source.mime_type`, `content.language`, `status`, `provenance.extractor_name`, `source.source_folder`, `source.host`, `source.drive`, `source.modified_at_epoch`, `source.created_at_epoch`, `source.size_bytes`, `content.content_truncated`. **Not `source.path`** (468-byte cap).
|
||||
- **Sortable:** `source.modified_at_epoch`, `source.created_at_epoch`, `source.size_bytes`, `source.filename`.
|
||||
- **Ranking rules:** Meilisearch defaults (`words, typo, proximity, attribute, sort, exactness`).
|
||||
- **Faceting:** `maxValuesPerFacet: 200`, sort by count.
|
||||
- **Pagination:** `maxTotalHits: 10000` (page-based pagination for exact counts in the UI).
|
||||
|
||||
### Arabic + English
|
||||
|
||||
- Charabia handles Arabic natively (definite-article segmentation, diacritic/tashkeel stripping) — no flag needed.
|
||||
- Typo tolerance left at the default `minWordSizeForTypos.oneTypo` of 5 (a typo is tolerated only in words of five or more characters, which protects short Arabic roots), disabled on `id`, `parent_id`, `content_hash`, `source.path`.
|
||||
- Stop words start empty; an Arabic + English list is added once a representative corpus exists (a maintained list ships as config, not hard-coded).
|
||||
|
||||
### Vectors reserved, disabled
|
||||
|
||||
Declare one embedder at index creation, in Sprint 0, at **zero cost**:
|
||||
|
||||
```json
"embedders": { "digger_semantic": { "source": "userProvided", "dimensions": 768 } }
```
|
||||
|
||||
`userProvided` means Meilisearch never calls out — the pipeline supplies vectors when ready (V2). Dimensions are **committed at 768** now; changing them later forces a full reindex. v1 ships no `_vectors`; documents are keyword-only until then. Hybrid is a per-query parameter (`semanticRatio`) requiring no settings change to enable later.
|
||||
|
||||
### Operations
|
||||
|
||||
- Index created with explicit `primaryKey: "id"`; settings applied via `PATCH`; all changes are async tasks tracked before the pipeline marks files done.
|
||||
- Documents written with **`POST` (add-or-replace)** in the main loop so reprocessing yields clean records; **`PUT` (add-or-update)** reserved for targeted enrichment (e.g. adding vectors later).
|
||||
- Deletes: by id, by batch, or by filter (e.g. purge a removed folder).
|
||||
- Three keys: master (`.env`, setup only), search-only (read API), indexer (pipeline). See [ADR 0007](0007-search-provider-and-ui.md) and [ADR 0009](0009-packaging-and-deployment.md).
|
||||
|
||||
## Consequences
|
||||
|
||||
- Long documents are fully searchable (no silent truncation) because each chunk is its own document under the 65,535-word ceiling; `content_truncated` remains a safety flag.
|
||||
- Folder faceting works via the bounded `source_folder` token; full paths remain available for display/open via `displayedAttributes: ["*"]`.
|
||||
- The full filterable/sortable set is declared up front, avoiding an expensive reindex of a 500k-document corpus when a facet is added later.
|
||||
- The `SearchProvider` interface hides all of this; switching to a per-language two-index strategy in V2 (if Arabic relevance demands it) does not touch the UI.
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **Whole-document indexing + truncate-and-flag.** Rejected for v1 per the confirmed "chunk long docs in v1" decision; it loses content past 65,535 words. Retained only as the `content_truncated` safety net.
|
||||
- **Dedup by content hash (`distinctAttribute = content_hash`, paths as array).** Rejected per the confirmed "one document per path" decision; `distinctAttribute` is instead spent on `parent_id` for chunk collapsing.
|
||||
- **Separate index per file type or per language now.** Deferred to V2; single index + `localizedAttributes` is simpler and adequate, and the read interface makes the swap transparent.
|
||||
47
docs/decisions/0004-chunking-strategy.md
Normal file
@@ -0,0 +1,47 @@
# ADR 0004 — Chunking strategy
|
||||
|
||||
**Status:** Accepted (v1)
|
||||
|
||||
## Context
|
||||
|
||||
Confirmed decision: **chunk long documents in v1.** The driver is Meilisearch's 65,535 word-position-per-field limit — a long scanned PDF or a multi-hour transcript would otherwise truncate silently. Chunking also improves long-document keyword relevance and is exactly the granularity future vector search wants, so it doubles as the vector seam.
|
||||
|
||||
The brief requires chunking to be a **Transformer concern, not baked into extractors**, so it can be switched off for the standalone case.
|
||||
|
||||
## Decision
|
||||
|
||||
Chunking is performed by a `Transformer` that runs **after** extraction and **before** the sink. Extractors always emit whole-document IR records (with `segments`); the chunking transformer splits them into chunk records when needed.
|
||||
|
||||
### Algorithm (v1)
|
||||
|
||||
1. **Pass-through when small.** If `content_text` fits comfortably under the limit (configurable `max_words_per_chunk`, default well below 65,535, e.g. 1,500–3,000 words to also suit future embedding context windows), emit the record unchanged: `chunk_count=1`, `chunk_index=0`, `parent_id=id`.
|
||||
2. **Structure-aware splitting when large.** Split along `segments` boundaries first (page, slide, sheet, transcript segment), packing consecutive segments into a chunk until the word budget is reached. Never split inside a word; avoid splitting mid-segment unless a single segment alone exceeds the budget (then hard-wrap with overlap).
|
||||
3. **Carry the locator.** Each chunk's `locator` is set from the first segment it contains (e.g. starting `page_number`, or `timestamp_start`), with `page_end`/`timestamp_end` from the last. This is what makes a search hit deep-link to the right page/slide/moment.
|
||||
4. **Assign identity.** `parent_id = whole-document id`; `chunk.id = sha256(path|content_hash|chunk_index)`; set `chunk_count` on every sibling.
|
||||
5. **Optional overlap.** A small configurable token/word overlap between adjacent chunks improves recall at boundaries (default modest; off for sheet/slide-aligned splits).
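
Steps 1 to 3 can be sketched as a greedy packer over segments. The segment keys `text` and `page` are hypothetical stand-ins for the IR's `segments` fields, words are counted by whitespace splitting, and the hard-wrap-with-overlap fallback and step 5 are deliberately left out to keep the example short.

```python
from typing import Any

Segment = dict[str, Any]


def pack_segments(segments: list[Segment], max_words: int) -> list[list[Segment]]:
    """Greedily pack consecutive segments into chunks of at most max_words words.

    A single segment larger than the budget becomes its own oversized chunk
    here; the ADR's hard-wrap fallback would split it further.
    """
    chunks: list[list[Segment]] = []
    current: list[Segment] = []
    used = 0
    for seg in segments:
        n = len(seg["text"].split())
        if current and used + n > max_words:
            chunks.append(current)  # budget reached: close the current chunk
            current, used = [], 0
        current.append(seg)
        used += n
    if current:
        chunks.append(current)
    return chunks


def chunk_locator(chunk: list[Segment]) -> dict[str, Any]:
    """Step 3: start from the first segment, end from the last."""
    return {"page_number": chunk[0]["page"], "page_end": chunk[-1]["page"]}
```

A document whose total word count fits the budget comes back as a single chunk, which corresponds to the pass-through case in step 1.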
|
||||
|
||||
### Configuration
|
||||
|
||||
```toml
[chunking]
enabled = true              # false => whole-document records only (standalone-friendly)
max_words_per_chunk = 2000
overlap_words = 100
split_on = ["page", "slide", "sheet", "transcript", "paragraph"]
```
|
||||
|
||||
When `enabled = false`, the transformer is a pass-through and the pipeline behaves in whole-document mode (with the `content_truncated` flag as the only guard). This keeps the standalone/simple case available.
|
||||
|
||||
## Consequences
|
||||
|
||||
- No silent content loss: every chunk stays under the field limit.
|
||||
- Deep-linking is precise because chunk boundaries follow native structure and carry locators.
|
||||
- The same chunk granularity is reused for V2 embeddings — when vectors arrive, each chunk record gets an `embedding`, no re-chunking or schema change.
|
||||
- Results stay clean because the Meilisearch sink sets `distinctAttribute = parent_id` ([ADR 0003](0003-meilisearch-index-design.md)), collapsing a document's chunks to one hit while surfacing the best-matching chunk.
|
||||
- Deletes/updates operate per parent: when a file changes or is removed, all chunk records sharing its `parent_id` are replaced or deleted together ([ADR 0008](0008-incremental-dedup-statestore.md)).
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **Fixed-size character/token windows ignoring structure.** Rejected as the default — it produces locators that don't map to a page/slide/timestamp, hurting the deep-link UX. Hard-wrap with overlap is used only as the fallback when a single segment exceeds the budget.
|
||||
- **Chunk inside extractors.** Rejected — violates the brief's separation and would prevent the standalone whole-document mode.
|
||||
- **Defer chunking to V2.** Rejected per the confirmed decision; long Arabic scans and long media are exactly the corpus that needs it.
|
||||
77
docs/decisions/0005-model-backends-and-ollama.md
Normal file
@@ -0,0 +1,77 @@
# ADR 0005 — Model backends and Ollama deployment
|
||||
|
||||
**Status:** Accepted (v1)
|
||||
**Research:** [`../research/B-local-model-tooling.md`](../research/B-local-model-tooling.md), [`../research/D-audio-video.md`](../research/D-audio-video.md)
|
||||
|
||||
## Context
|
||||
|
||||
All model inference (OCR, ASR, future embeddings) must run against **local** models — no file content leaves the machine. Extractors must depend on an interface, never a concrete model. Hardware is CPU-capable with optional GPU and auto-fallback (the dev box has no GPU but 128 GB RAM). Languages are Arabic + English. The user explicitly asked how to deploy Ollama (host service vs Dockerized) optimizing for ease of use, given a VM→host setup at `192.168.122.1`.
|
||||
|
||||
## Decision
|
||||
|
||||
### One `ModelBackend` interface, three capabilities
|
||||
|
||||
```python
from __future__ import annotations

from typing import Protocol

# Image (e.g. PIL's), OCRResult and ASRResult are defined elsewhere; the
# deferred annotations above keep this block importable without them.


class OCRBackend(Protocol):
    def ocr_image(self, image: Image, *, lang_hint: str | None = None) -> OCRResult: ...


class ASRBackend(Protocol):
    def transcribe(self, audio_path: str, *, language: str | None = None,
                   word_timestamps: bool = False) -> ASRResult: ...


class EmbedBackend(Protocol):  # declared in v1, used in V2
    def embed(self, texts: list[str]) -> list[list[float]]: ...


class ModelBackend(Protocol):
    ocr: OCRBackend
    asr: ASRBackend
    embed: EmbedBackend | None
```
|
||||
|
||||
Extractors receive a `ModelBackend` and call capabilities; they never import a model client. A fake backend (deterministic stub) is injected in unit tests so the default CI tier needs no models.
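
A deterministic fake along these lines might look like the sketch below. The class names and the plain-dict return values are hypothetical; the real `OCRResult` and `ASRResult` types are not defined in this ADR, so dicts stand in for them.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class FakeOCR:
    text: str = "fake ocr text"

    def ocr_image(self, image: Any, *, lang_hint: str | None = None) -> dict[str, Any]:
        # Deterministic: the image is ignored entirely.
        return {"text": self.text, "lang_hint": lang_hint}


@dataclass
class FakeASR:
    text: str = "fake transcript"

    def transcribe(self, audio_path: str, *, language: str | None = None,
                   word_timestamps: bool = False) -> dict[str, Any]:
        # Deterministic: no audio is read from audio_path.
        return {"text": self.text, "language": language, "segments": []}


class FakeModelBackend:
    """Stub backend for unit tests; needs no models and no network."""

    def __init__(self) -> None:
        self.ocr = FakeOCR()
        self.asr = FakeASR()
        self.embed = None  # embeddings are not invoked in v1
```

An extractor under test receives `FakeModelBackend()` in place of the Ollama-backed one and can assert on the canned text.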
|
||||
|
||||
### Default concrete backends (all overridable via config)
|
||||
|
||||
| Capability | Default backend | Notes |
|---|---|---|
| OCR | **Qwen2.5-VL via Ollama** (`qwen2.5vl:7b`) | Wraps the existing `arabic-ocr` flow: PDF→PIL happens in the extractor; the backend speaks single page images. Best evaluated option for Arabic handwriting + IDs/certificates/tables/forms. 3B variant selectable for speed. See [ADR 0006](0006-document-conversion-routing.md). |
| ASR | **faster-whisper `large-v3`, `int8`, CPU-first** | ~1.5 GB RAM, ~10× real-time on CPU; GPU = `device=cuda, compute_type=float16`. Avoid `large-v3-turbo` for Arabic; allow as opt-in. `Byne/whisper-large-v3-arabic` documented as a configurable override. |
| Embed | **bge-m3 via Ollama** (768-dim target), **V2 only** | Multilingual (Arabic+English). Interface present in v1; not invoked. Note that bge-m3's native dense embeddings are 1024-dimensional, so the model choice and the 768 dimensions committed in [ADR 0003](0003-meilisearch-index-design.md) need reconciling before V2. |
|
||||
|
||||
GPU/CPU selection is automatic with CPU fallback; the device is configurable. Per-file timeouts and max-size skips are enforced at the extractor level (killable `ProcessPoolExecutor` for ASR), so a hung or huge file never crashes a run.
|
||||
|
||||
### Ollama deployment: **host service** (default), endpoint configurable
|
||||
|
||||
Ollama runs as a **native host service** (`ollama serve`, bound on the host with `OLLAMA_HOST=0.0.0.0:11434`); the pipeline reads its own `OLLAMA_HOST` env var as the client endpoint URL. The Docker Compose pipeline service sets:
|
||||
|
||||
```yaml
environment:
  - OLLAMA_HOST=${OLLAMA_HOST:-http://host.docker.internal:11434}
extra_hosts:
  - "host.docker.internal:host-gateway"   # makes the same default resolve on Linux too
```
|
||||
|
||||
So Windows, macOS, and Linux converge on one default. The dev VM keeps `http://192.168.122.1:11434` (the libvirt gateway) as an override. Documented in `.env.example`:
|
||||
|
||||
```ini
OLLAMA_HOST=http://host.docker.internal:11434   # Windows/macOS Docker default
# OLLAMA_HOST=http://192.168.122.1:11434        # Linux KVM VM (current dev setup)
# OLLAMA_HOST=http://localhost:11434            # native, no Docker
```
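On the pipeline side, reading that setting could be sketched as follows. `resolve_ollama_host` and `DEFAULT_OLLAMA_HOST` are hypothetical names, and the normalisation rules (adding a scheme, trimming a trailing slash) are assumptions rather than decided behaviour.

```python
import os
from typing import Mapping, Optional

# Same default as the Compose file above.
DEFAULT_OLLAMA_HOST = "http://host.docker.internal:11434"


def resolve_ollama_host(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Ollama base URL from OLLAMA_HOST, falling back to the default."""
    env = os.environ if env is None else env
    raw = (env.get("OLLAMA_HOST") or "").strip()
    if not raw:
        return DEFAULT_OLLAMA_HOST
    # Assumption: accept bare host:port values and treat them as plain HTTP.
    if "://" not in raw:
        raw = "http://" + raw
    return raw.rstrip("/")
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.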
|
||||
|
||||
**Rationale (ease of use, the user's priority):** Windows is the primary target and native Ollama for Windows is a one-click install; Docker-Desktop GPU passthrough on Windows needs WSL2 + NVIDIA CUDA — high friction. The dev box is CPU-only, so containerizing Ollama buys nothing there. Models live in `~/.ollama` and are shared across projects. The endpoint stays fully overridable.
|
||||
|
||||
**Documented alternative (not default):** a Dockerized Ollama service (named volume + `--gpus=all`) for fully self-contained, GPU-reproducible deployments; the pipeline then targets `http://ollama:11434`. This is a Compose override, not a code change.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Local-first is enforced structurally: the only network egress of content is to the configured local Ollama/ASR endpoint, never a cloud API. Cloud SDKs (e.g. Unstructured's hosted client) are forbidden in code ([ADR 0006](0006-document-conversion-routing.md)).
|
||||
- The OCR prompt and behavior already validated in `arabic-ocr` are preserved; the wrapper adds config injection (host, model, ctx, timeout), nothing more.
|
||||
- Switching models (3B vs 7B OCR; a fine-tuned Arabic Whisper) is a config change; `provenance.model` records what produced each record so `reindex` can re-run when a model improves ([ADR 0008](0008-incremental-dedup-statestore.md)).
|
||||
- Embeddings are a drop-in: the same Ollama instance serves `bge-m3` in V2 with no new infrastructure.
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **Dockerized Ollama as the default.** Rejected for v1 ease-of-use on Windows/CPU; kept as a documented override.
|
||||
- **In-process OCR (Tesseract/docTR/PaddleOCR/Surya).** Rejected as primary — inadequate Arabic handwriting (Tesseract/docTR/Paddle) or non-commercial weights + printed-only (Surya). See [ADR 0006](0006-document-conversion-routing.md).
|
||||
- **whisper.cpp / WhisperX as the ASR default.** whisper.cpp offers no advantage in a Python pipeline; WhisperX is the natural V2 upgrade when word-level alignment/diarization is wanted.
|
||||
57
docs/decisions/0006-document-conversion-routing.md
Normal file
|
|
@ -0,0 +1,57 @@
|
|||
# ADR 0006 — Document conversion and routing
|
||||
|
||||
**Status:** Accepted (v1)
|
||||
**Research:** [`../research/B-local-model-tooling.md`](../research/B-local-model-tooling.md), [`../research/C-office-legacy.md`](../research/C-office-legacy.md)
|
||||
|
||||
## Context
|
||||
|
||||
"Adding a format = adding an Extractor." But within extractors we should prefer proven libraries over reinventing — provided they honor the local-model constraint. The research evaluated the major everything-to-structured toolkits (Docling, MarkItDown, Unstructured, Tika) and per-format Office libraries, and assessed OCR/VLM stacks for Arabic.
|
||||
|
||||
A key efficiency insight: most cost is OCR inference. Documents that already contain a text layer should never be sent to the VLM.
|
||||
|
||||
## Decision
|
||||
|
||||
### Route by content, into a tiered set of extractors
|
||||
|
||||
The `Router` selects an extractor by file type / MIME / content sniff. The PDF path additionally probes for an embedded text layer:
|
||||
|
||||
| Input | Route | Tool |
|---|---|---|
| PDF **with** text layer; DOCX, PPTX, XLSX, HTML, EPUB (native-digital) | structural extraction (no inference) | **Docling** (adopt wholesale) |
| PDF **without** text layer; JPEG/PNG; scanned/handwritten/IDs/forms | OCR via ModelBackend | **Qwen2.5-VL via Ollama** ([ADR 0005](0005-model-backends-and-ollama.md)) |
| `.docx` | structured text/tables/props | **docx2python** (+ **python-docx** for DOM) |
| `.xlsx` | sheets/rows/props | **openpyxl** read-only; **python-calamine** fast-path for large files |
| `.pptx` | slides/notes/tables/props | **python-pptx** |
| legacy `.doc`, `.ppt` (and `.xls` fallback) | convert → OOXML → parse | **unoserver** (persistent LibreOffice), isolated container |
| legacy `.xls` | direct | **xlrd** |
| `.mdb`/`.accdb` | Windows-only | **pyodbc + ACE** (capability-gated); cross-platform deferred to V2 |
| audio/video | extract + transcribe | **ffmpeg + faster-whisper** ([ADR 0005](0005-model-backends-and-ollama.md)) |
| email/edge (EML, MSG, CSV, ZIP) | per-format | **Unstructured** (local mode) — later sprint |
| Office file Docling skips/fails | fallback | **MarkItDown** |
|
||||
|
||||
Apache **Tika is not used** (JVM weight, Tesseract-only OCR, no local-model hooks).
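The routing table can be read as a lookup by extension plus one extra decision for PDFs. The sketch below is only an illustration: `route`, `ROUTES`, the route labels, and the audio/video extension set are hypothetical, the per-format Office rows are assumed to take precedence over the Docling row, and the text-layer probe is assumed to be done by the caller.

```python
from pathlib import Path

# Extension -> route label, loosely following the table above.
# Labels and the media extension list are illustrative assumptions.
ROUTES = {
    ".docx": "docx2python",
    ".xlsx": "openpyxl",
    ".pptx": "python-pptx",
    ".doc": "unoserver",
    ".ppt": "unoserver",
    ".xls": "xlrd",
    ".mdb": "access",
    ".accdb": "access",
    ".html": "docling",
    ".epub": "docling",
    ".jpg": "ocr",
    ".jpeg": "ocr",
    ".png": "ocr",
    ".mp3": "audio_video",
    ".wav": "audio_video",
    ".mp4": "audio_video",
    ".eml": "unstructured",
    ".msg": "unstructured",
    ".csv": "unstructured",
    ".zip": "unstructured",
}


def route(path: str, *, pdf_has_text_layer: bool = False) -> str:
    """Pick a route label; PDFs go to OCR only when no text layer was found."""
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        return "docling" if pdf_has_text_layer else "ocr"
    return ROUTES.get(ext, "unsupported")
```

A real `Router` would also sniff MIME type and content, as the decision above states; extension matching alone is the simplification here.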
|
||||
|
||||
### Legacy Office conversion (unoserver)
|
||||
|
||||
LibreOffice runs as a **persistent listener** (`unoserver`) in its **own container** (separate from the pipeline) with metric-compatible fonts (`fonts-crosextra-carlito/caladea`, `fonts-liberation`). Each conversion uses a unique `UserInstallation` profile (`Path(...).as_uri()`), verifies the output file exists (exit codes are unreliable), and runs under a subprocess timeout. Converted OOXML is then parsed by the libraries above.
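The three safeguards named above (unique profile, output check, timeout) can be sketched like this. Note the assumption: for brevity the sketch shells out to `soffice` directly instead of going through a unoserver client, and `build_convert_command` and `convert` are hypothetical helper names.

```python
import subprocess
import tempfile
from pathlib import Path


def build_convert_command(src: Path, out_dir: Path, profile_dir: Path,
                          target: str = "docx") -> list[str]:
    """Build a headless LibreOffice conversion command with its own profile."""
    return [
        "soffice",
        # A unique profile per conversion avoids lock clashes between runs.
        f"-env:UserInstallation={profile_dir.resolve().as_uri()}",
        "--headless",
        "--convert-to", target,
        "--outdir", str(out_dir),
        str(src),
    ]


def convert(src: Path, out_dir: Path, target: str = "docx",
            timeout_s: float = 120.0) -> Path:
    """Convert one file; trust the output file, not the exit code."""
    with tempfile.TemporaryDirectory(prefix="lo-profile-") as profile:
        cmd = build_convert_command(src, out_dir, Path(profile), target)
        # Raises subprocess.TimeoutExpired if the conversion hangs.
        subprocess.run(cmd, timeout=timeout_s, check=False)
    out = out_dir / f"{src.stem}.{target}"
    if not out.is_file() or out.stat().st_size == 0:
        raise RuntimeError(f"conversion produced no output for {src}")
    return out
```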
|
||||
|
||||
### Access (capability-gated, Windows-only in v1)
|
||||
|
||||
`.mdb`/`.accdb` are read via `pyodbc` + the free Microsoft ACE ODBC redistributable (no Office install). A startup capability check probes for the driver and, if absent or on a non-Windows platform, records the file as `status: skipped` with a clear message — never a crash. Cross-platform Access (`mdbtools`) is deferred to V2 because `.accdb` support is unreliable.
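A capability check along these lines could be kept as a pure function so it is testable on any platform. `Capability` and `check_access_capability` are hypothetical names; in a real run the inputs would presumably come from `platform.system()` and `pyodbc.drivers()`.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Capability:
    available: bool
    reason: Optional[str] = None  # message recorded with status: skipped


def check_access_capability(platform_name: str,
                            odbc_drivers: Iterable[str]) -> Capability:
    """Decide whether Access files can be read, without ever raising."""
    if platform_name != "Windows":
        return Capability(False, "Access extraction requires Windows with "
                                 "the ACE ODBC driver; file skipped on this platform")
    # The ACE driver registers as "Microsoft Access Driver (*.mdb, *.accdb)".
    if not any("Microsoft Access Driver" in name for name in odbc_drivers):
        return Capability(False, "Microsoft ACE ODBC driver not found; file skipped")
    return Capability(True)
```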
|
||||
|
||||
### Local-only enforcement
|
||||
|
||||
Unstructured runs in **local mode only** (`partition(strategy="fast"|"hi_res")`); the hosted `UnstructuredClient` is forbidden. MarkItDown's OCR plugin, if ever enabled, points only at the local Ollama endpoint. No extractor may call a cloud API with file content.
|
||||
|
||||
## Consequences
|
||||
|
||||
- OCR inference is spent only where needed; native-digital PDFs and Office files are parsed structurally and fast.
|
||||
- Each format's extractor produces the same IR (`content_text` + `segments` + `metadata` + `locator`), so the rest of the pipeline is uniform.
|
||||
- LibreOffice's ~800 MB footprint stays out of the pipeline image (separate service), and is gated by a capability check so its absence degrades gracefully.
|
||||
- Tables, slides, sheets, and pages are preserved as `segments`, feeding both deep-linking and structure-aware chunking ([ADR 0004](0004-chunking-strategy.md)).
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **Docling's own VLM pipeline for scanned content.** Equivalent inference cost to calling Qwen directly but more indirection; we route scanned content to the OCR backend directly and use Docling for native-digital only (cleaner boundary).
|
||||
- **office_oxide (Rust, all six formats).** Promising but v0.1.x; revisit as a performance/fallback option behind the `Extractor` interface.
|
||||
- **antiword/catdoc for legacy binaries.** Rejected — abandoned projects; unoserver is the reliable path.
|
||||
58
docs/decisions/0007-search-provider-and-ui.md
Normal file
|
|
@ -0,0 +1,58 @@
|
|||
# ADR 0007 — Read-side SearchProvider and UI
|
||||
|
||||
**Status:** Accepted (v1)
|
||||
**Research:** [`../research/E-frontend-ux.md`](../research/E-frontend-ux.md)
|
||||
|
||||
## Context
|
||||
|
||||
The UI is "just another client": it must not reach into the pipeline and must not hardcode Meilisearch. Confirmed decisions: **Option B** (UI → thin Python search API, engine-agnostic) and **FastAPI + Jinja2 + HTMX**. Single-user, no access control, but never expose the engine master key. Arabic + English (RTL). v1 is keyword-only; the interface must already express semantic/hybrid so adding them later is zero-churn for the UI.
|
||||
|
||||
## Decision
|
||||
|
||||
### The `SearchProvider` interface is the read-side firewall
|
||||
|
||||
```python
from typing import Literal, Protocol


class SearchProvider(Protocol):
    async def search(self, query: str, *, page: int = 1, hits_per_page: int = 20,
                     filters: dict[str, list[str]] | None = None,
                     sort: str | None = None,
                     mode: Literal["keyword", "semantic", "hybrid"] = "keyword",
                     facet_attributes: list[str] | None = None) -> SearchResult: ...
    async def suggest(self, prefix: str, *, limit: int = 5) -> list[str]: ...
    async def health(self) -> bool: ...
```
|
||||
|
||||
Engine-agnostic return types: `SearchResult{query, hits, total_hits, page, hits_per_page, total_pages, processing_time_ms, facets}`, `Hit{...}`, `FacetBucket{value, count}`.
|
||||
|
||||
- `Hit.snippet` is **pre-rendered, sanitized HTML containing only `<mark>` tags**. No Meilisearch field name (`_formatted`, `facetDistribution`, `estimatedTotalHits`, …) ever reaches a template — the adapter translates everything first.
|
||||
- `Hit` carries provenance/deep-link fields: `path`, `filename`, `mime_type`, `source_folder`, `language`, `modified_at`, plus `page_number`, `timestamp_seconds`, `slide_number`, `sheet_name`.
|
||||
- `filters` is a plain dict (`{"file_type": ["pdf"], "language": ["ar"]}`); the adapter builds the engine filter string.
|
||||
- `mode` is present now (always `"keyword"` in v1); semantic/hybrid is added by the adapter later with no UI change.
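As an illustration of the `filters` translation described above, a Meilisearch adapter might build the filter string roughly as follows. `build_filter` is a hypothetical helper; it assumes values within one attribute are OR-ed (via `IN`) and attributes are AND-ed together, which this ADR does not spell out.

```python
import re
from typing import Optional

_ATTR = re.compile(r"^[A-Za-z_][A-Za-z0-9_.]*$")


def _quote(value: str) -> str:
    """Quote a filter value, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_filter(filters: Optional[dict[str, list[str]]]) -> Optional[str]:
    """Turn {"file_type": ["pdf"]} into a Meilisearch filter expression."""
    if not filters:
        return None
    clauses = []
    for attr, values in filters.items():
        if not _ATTR.match(attr):
            raise ValueError(f"invalid filter attribute: {attr!r}")
        if not values:
            continue  # an empty selection does not constrain the query
        quoted = ", ".join(_quote(v) for v in values)
        clauses.append(f"{attr} IN [{quoted}]")
    return " AND ".join(clauses) or None
```

Validating attribute names and quoting values keeps user-supplied facet selections from altering the structure of the filter expression.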
|
||||
|
||||
### The Meilisearch adapter
|
||||
|
||||
Uses `meilisearch-python-sdk` (async, matches FastAPI). Maps `SearchResult`/`Hit` from raw responses; requests highlighting/cropping (`attributesToHighlight=["content.content_text"]`, `cropLength≈30`, `<mark>` tags) and translates `facetDistribution` into `FacetBucket` lists. Holds only the **search-only key**. A `sanitize_snippet()` (allowlist: `mark`) guards XSS before snippets reach templates.
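One simple way to get the `mark`-only allowlist is to escape everything and then restore the bare highlight tags. This is a sketch under that assumption, not necessarily how `sanitize_snippet()` will be implemented; it also assumes the engine is configured to emit plain `<mark>` and `</mark>` with no attributes.

```python
import html


def sanitize_snippet(raw: str) -> str:
    """Escape all markup, then re-enable only bare <mark> and </mark>."""
    escaped = html.escape(raw, quote=True)
    return (escaped
            .replace("&lt;mark&gt;", "<mark>")
            .replace("&lt;/mark&gt;", "</mark>"))
```

Any tag carrying attributes, including `<mark onclick=...>`, stays escaped because only the exact attribute-free forms are restored.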
|
||||
|
||||
### The UI (FastAPI + Jinja2 + HTMX)
|
||||
|
||||
- One `GET /search` route returns a **full page** or just the `#results` partial based on the `HX-Request` header → bookmarkable, back/forward-safe, and degrades without JS.
|
||||
- Search-as-you-type: `hx-trigger="input changed delay:300ms"`, `hx-sync="this:replace"` (abort stale), `hx-push-url="true"`.
|
||||
- Faceted filters (file type, language, source folder, date), page-based pagination with exact counts, empty/error states. No infinite scroll in v1.
|
||||
- **Open/locate the file:** a server-side `POST /api/open` (os-native open; PDF `#page=N`) instead of browser-blocked `file://`; audio/video via an embedded HTML5 player pre-seeked to `#t=N`. Provenance (path, page/slide/sheet, timestamp) shown on every hit.
|
||||
- **Admin `GET /status`:** last run, counts (indexed/processed/failed/skipped), recent failures, backend health — sourced from the StateStore and the engine stats.
|
||||
|
||||
### Arabic RTL
|
||||
|
||||
Root `dir="rtl"`; CSS **logical properties only** (no `margin-left`/`float`); `dir="auto"` on the search input and snippets; `<bdi>` around paths/numbers/timestamps embedded in RTL; self-hosted Cairo / Noto Sans Arabic; no `letter-spacing` on Arabic; pagination bar `dir="ltr"`.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Swapping the search engine = one new `SearchProvider` adapter; templates and routes are untouched.
|
||||
- The browser never sees the master key; with Option B it sees no engine key at all.
|
||||
- Adding semantic/hybrid in V2 is an adapter change behind the existing `mode` parameter.
|
||||
- `suggest()` returns an empty list in v1 (Meilisearch has no native suggestions API); a prefix-search implementation can fill it later without an interface change.
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **Option A — InstantSearch / instant-meilisearch direct to engine.** Best turnkey UX but couples the UI to Meilisearch, needs a Node build step, exposes a search key to the browser, and its widgets aren't RTL-aware. Rejected for v1; could sit in front of the same API in V2.
|
||||
- **Streamlit / NiceGUI.** Streamlit re-renders the whole page per keystroke and has no URL state; NiceGUI is WebSocket/Vue-bound and harder to keep engine-agnostic. HTMX+API is lighter, bookmarkable, RTL-native, and build-step-free.
|
||||
65
docs/decisions/0008-incremental-dedup-statestore.md
Normal file
|
|
@ -0,0 +1,65 @@
|
|||
# ADR 0008 — Deduplication, StateStore, incremental & reindex
|
||||
|
||||
**Status:** Accepted (v1)
|
||||
|
||||
## Context
|
||||
|
||||
Re-running over a directory must not reprocess unchanged files and must handle additions, changes, and deletions cleanly. The brief requires content-hash + mtime tracking, stable IDs for idempotent upserts, deletion propagation, an explicit reindex/backfill command, and a deduplication policy. Confirmed: **one document per path.**
|
||||
|
||||
## Decision
|
||||
|
||||
### Deduplication: one document per path
|
||||
|
||||
The same content at multiple paths produces **one logical document per path** (each with its own `parent_id`). This keeps deletion/move tracking simple and predictable. `distinctAttribute` is therefore free to be used for chunk collapsing (`parent_id`, [ADR 0003](0003-meilisearch-index-design.md)) rather than content-dedup. (Content-level dedup — one record, many paths — is recorded as a rejected alternative; revisit only if duplicate hits become a real problem.)
|
||||
|
||||
### StateStore (local SQLite)
|
||||
|
||||
A `StateStore` keyed by **canonical path** records, per file:
|
||||
|
||||
| Column | Use |
|---|---|
| `path` (PK) | canonical absolute path |
| `content_hash` | SHA-256 of bytes — detects content change |
| `mtime`, `size` | cheap pre-check before hashing |
| `parent_id` | links to the indexed logical document (and its chunk family) |
| `chunk_count` | how many sink records exist for this path |
| `status` | last outcome (success/partial/failed/skipped) |
| `extractor_name`, `extractor_version` | provenance for targeted reindex |
| `model_fingerprint` | OCR/ASR model id+version used (nullable) |
| `schema_version` | IR schema that produced the records |
| `last_seen_run` | run id/timestamp — drives deletion detection |
| `processed_at` | last processing time |
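For orientation, the columns above could map onto a SQLite table like the one in this sketch. The table name `files`, the column types, and the helpers `open_state_store` / `upsert_file` are assumptions for illustration; only the column names come from the table.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path              TEXT PRIMARY KEY,
    content_hash      TEXT NOT NULL,
    mtime             REAL NOT NULL,
    size              INTEGER NOT NULL,
    parent_id         TEXT,
    chunk_count       INTEGER NOT NULL DEFAULT 0,
    status            TEXT NOT NULL
        CHECK (status IN ('success', 'partial', 'failed', 'skipped')),
    extractor_name    TEXT,
    extractor_version TEXT,
    model_fingerprint TEXT,
    schema_version    TEXT NOT NULL,
    last_seen_run     TEXT NOT NULL,
    processed_at      TEXT
)
"""


def open_state_store(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (and if needed create) the StateStore database."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def upsert_file(conn: sqlite3.Connection, row: dict) -> None:
    """Insert or replace the row for one path (keys must be column names)."""
    columns = ", ".join(row)
    params = ", ".join(f":{key}" for key in row)
    conn.execute(f"INSERT OR REPLACE INTO files ({columns}) VALUES ({params})", row)
    conn.commit()
```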
|
||||
|
||||
### Incremental run logic
|
||||
|
||||
1. **Skip unchanged:** if `mtime` and `size` match the stored values, skip without hashing; if they differ, re-hash and skip only when the hash still equals the stored `content_hash`.
|
||||
2. **Process new/changed:** extract → transform/chunk → upsert via the sink (add-or-replace, which in Meilisearch is `POST` to the documents route). Because IDs are deterministic ([ADR 0002](0002-intermediate-representation.md)), upserts are idempotent.
|
||||
3. **Replace cleanly on change:** when a file's content changes, its `parent_id` changes (hash component changes), so the old chunk family is deleted (by old `parent_id`) and the new one written — no orphans.
|
||||
4. **Detect deletions:** any path in the StateStore not seen in the current run is treated as deleted → delete its records from the sink (by `parent_id`) and remove the row.
|
||||
5. **Quarantine failures:** `failed`/`skipped` files are recorded with structured reasons and surfaced in the `status` summary and a dead-letter list; one bad file never aborts the run.
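The skip/process/delete decisions in the steps above can be sketched as two small pure functions. `StoredFile`, `SeenFile`, `classify`, and `deleted_paths` are hypothetical names, and the sketch assumes that matching `mtime` and `size` is enough to skip without hashing.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class StoredFile:
    content_hash: str
    mtime: float
    size: int


@dataclass(frozen=True)
class SeenFile:
    mtime: float
    size: int


def classify(seen: SeenFile, stored: Optional[StoredFile],
             hash_file: Callable[[], str]) -> str:
    """Return 'new', 'unchanged', or 'changed' for one scanned path."""
    if stored is None:
        return "new"
    # Cheap pre-check: identical mtime and size means no hashing is needed.
    if seen.mtime == stored.mtime and seen.size == stored.size:
        return "unchanged"
    # Metadata differs (e.g. a touch or restore): the hash is authoritative.
    if hash_file() == stored.content_hash:
        return "unchanged"
    return "changed"


def deleted_paths(stored_paths: set[str], seen_paths: set[str]) -> set[str]:
    """Paths known to the StateStore but absent from the current run."""
    return stored_paths - seen_paths
```

Passing the hash as a callable means the file is only read when the cheap pre-check fails.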
|
||||
|
||||
### Concurrency
|
||||
|
||||
A configurable worker pool processes files in parallel, while GPU/CPU-bound model calls are **serialized/queued** through the `ModelBackend` (e.g. a bounded queue in front of Ollama / ASR) so inference isn't oversubscribed. Sink writes are batched (respecting Meilisearch payload limits) and their async tasks are tracked to `succeeded` before the StateStore marks files done.
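A minimal sketch of that arrangement, using threads and a semaphore in place of a real queue: `SerializedBackend` and `process_all` are hypothetical names, and `infer` stands in for whatever OCR or ASR call the `ModelBackend` makes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class SerializedBackend:
    """Wrap a model call so at most `max_concurrent` run at once."""

    def __init__(self, infer: Callable[[T], R], max_concurrent: int = 1) -> None:
        self._infer = infer
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def __call__(self, item: T) -> R:
        with self._slots:  # workers block here while inference is busy
            return self._infer(item)


def process_all(items: Iterable[T], backend: Callable[[T], R],
                workers: int = 4) -> list[R]:
    """Process files in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(backend, items))
```

File parsing and I/O still run on all workers; only the wrapped inference step is throttled.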
|
||||
|
||||
### Reindex / backfill command
|
||||
|
||||
`digger reindex` reprocesses or re-emits based on recorded provenance, without manual deletion. Selectors:
|
||||
|
||||
- `--schema-older-than 1.0` — records from an older IR schema.
|
||||
- `--extractor pdf_ocr --version-below 0.2.0` — a specific extractor upgrade.
|
||||
- `--model-changed` — when the configured OCR/ASR (or, in V2, embedding) model fingerprint differs from what produced the records.
|
||||
- `--embed` (V2) — generate/refresh vectors for existing records via `PUT` partial update (Meilisearch add-or-update), without re-extracting content.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Re-running is cheap and safe; large corpora aren't reprocessed wholesale.
|
||||
- Deletions and content changes propagate to the sink without orphaned chunks.
|
||||
- Model/schema upgrades have a first-class, targeted reprocessing path — no "manual scramble."
|
||||
- The `status` view ([ADR 0007](0007-search-provider-and-ui.md)) reads directly from the StateStore.
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **Dedup by content hash (one record, many paths).** Rejected per the confirmed decision; complicates moves/deletes and would consume `distinctAttribute` needed for chunk collapsing.
|
||||
- **Stateless full re-scan each run.** Rejected — violates the incremental requirement and is infeasible at 10k–500k files.
|
||||
- **mtime-only change detection.** Rejected as sole signal (mtime is unreliable across copies/restores); content hash is authoritative, with mtime/size as a fast pre-filter.
|
||||
50
docs/decisions/0009-packaging-and-deployment.md
Normal file
|
|
@ -0,0 +1,50 @@
|
|||
# ADR 0009 — Packaging and deployment
|
||||
|
||||
**Status:** Accepted (v1)
|
||||
|
||||
## Context
|
||||
|
||||
Zero-install by default: an operator should get a working system without manually installing dependencies. The stack pulls in heavy native pieces (Meilisearch, an OCR runtime, an ASR runtime + ffmpeg, LibreOffice for legacy Office). The lever is to **bundle these, not ask the user to install them** — while keeping every bundled piece overridable so modularity isn't lost. Confirmed: **Docker Compose is the default**; no Docker-free native installer needed for v1.
|
||||
|
||||
## Decision
|
||||
|
||||
### Primary distribution — one-command Docker Compose stack
|
||||
|
||||
`docker compose up` brings up, wired together:
|
||||
|
||||
| Service | Image / contents | Notes |
|---|---|---|
| `meilisearch` | `getmeili/meilisearch:v1.48.3` | CPU defaults; named volume; master key from `.env`; **index auto-created + settings applied on first run** ([ADR 0003](0003-meilisearch-index-design.md)). |
| `pipeline` | digger app + ffmpeg + Docling deps | Runs CLI `scan/extract/index/run/status/reindex`; reaches Ollama via `OLLAMA_HOST` ([ADR 0005](0005-model-backends-and-ollama.md)). |
| `ui` | FastAPI + HTMX app | Holds the search-only key; talks to Meilisearch via the `SearchProvider` adapter ([ADR 0007](0007-search-provider-and-ui.md)). |
| `converter` | Ubuntu + LibreOffice + unoserver + fonts | Isolated ~800 MB image for legacy Office; called over the internal network ([ADR 0006](0006-document-conversion-routing.md)). |
|
||||
|
||||
Zero-config first run: ship an example config and example `.env`, auto-create the index, default to CPU. Point it at a folder → working search.
|
||||
|
||||
### Models / GPU
|
||||
|
||||
The model runtime lives **outside Docker by default** (host Ollama; faster-whisper in the pipeline image on CPU). GPU is an explicit opt-in upgrade (documented), avoiding forcing every user through WSL2 + NVIDIA Container Toolkit. Because models sit behind `ModelBackend`, the runtime can be a host service or a container without code changes.
|
||||
|
||||
### Overridability (modularity preserved)
|
||||
|
||||
Every bundled service is independently overridable via config / Compose overrides: point at your own Meilisearch, your own Ollama, or your own model server. The **standalone, pip-installable pipeline** (no UI, no engine; FileSink → IR on disk) remains a supported lighter path for developers. Outside Docker, `static-ffmpeg` provides ffmpeg with zero manual install ([ADR 0005](0005-model-backends-and-ollama.md)).
|
||||
|
||||
### Secrets
|
||||
|
||||
Model endpoints and engine keys live in `.env` / config, never committed. `.env.example` documents every required key (`MEILI_MASTER_KEY`, the derived search-only and indexer keys, `OLLAMA_HOST`, optional `HF_TOKEN` for V2 diarization). `.gitignore` excludes `.env`.
|
||||
|
||||
### Dependency & version management
|
||||
|
||||
Pin dependencies with a lockfile; pin the Python version (3.12); pin every service image tag (never `latest`). Version pins are listed in [`../research/SYNTHESIS.md`](../research/SYNTHESIS.md) §5.
|
||||
|
||||
## Consequences
|
||||
|
||||
- The only prerequisite for the default path is a container runtime; the README states this honestly with a copy-paste quickstart (no multi-step "install Tesseract, then ffmpeg, then…" list).
|
||||
- CPU-only first run works out of the box; GPU is a documented upgrade.
|
||||
- Developers can `pip install` the pipeline and run it engine-free.
|
||||
- Bundled ≠ locked-in: each piece is swappable.
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **Native Windows installer (Inno Setup/MSI bundling meilisearch.exe, ffmpeg, OCR runtime, portable LibreOffice).** Friendliest for a non-technical Windows end user but the most work to build/maintain cross-platform. Deferred past v1 per the confirmed decision; the architecture doesn't preclude it later.
|
||||
- **GPU-in-Docker as the default.** Rejected — high friction on Windows; GPU stays opt-in with the model runtime able to live on the host.
|
||||
58
docs/decisions/0010-ci-and-windows-runner.md
Normal file
|
|
@ -0,0 +1,58 @@
|
|||
# ADR 0010 — CI and the Windows runner
|
||||
|
||||
**Status:** Accepted (v1)
|
||||
**Research:** [`../research/F-forgejo-ci-windows.md`](../research/F-forgejo-ci-windows.md)
|
||||
|
||||
## Context
|
||||
|
||||
We host on a local Forgejo instance and want CI working from the very first commit — part of Sprint 0 scaffolding, before any extractor. Primary deployment target is Windows, so Linux-only CI must not hide Windows breakage. Confirmed: a Linux/Docker runner already exists (`local-runner`, labels `docker`, `ubuntu-latest`); the user will provide a **Windows 11 KVM VM** for the Windows runner. Forgejo infra lives in the sibling repo `../forgejo-stack`.
|
||||
|
||||
## Decision
|
||||
|
||||
### Layered, fast-by-default test tiers
|
||||
|
||||
| Tier | When | Where | Contents |
|---|---|---|---|
| **Unit** | every push | Linux + Windows | ruff, ruff format --check, mypy, `pytest -m "not integration and not heavy"` + coverage. Fakes for models and sinks. |
| **Integration** | PRs to `main` | Linux only | real Meilisearch **service container** (pinned), mocked model backends; exercises index + query. |
| **Heavy (real models)** | manual `workflow_dispatch` | a `heavy`-labelled runner | OCR/ASR against real local models; **never** in the default pipeline. |
|
||||
|
||||
`pytest` markers in `pyproject.toml`: `integration`, `heavy`, `windows_compat`; `addopts` defaults to `-m "not integration and not heavy"` so bare `pytest` is fast everywhere.
|
||||
|
||||
### Workflows (`.forgejo/workflows/`)
|
||||
|
||||
- **`ci.yml` (Linux, `ubuntu-latest`):** `unit` job on every push; `integration` job gated to PRs targeting `main`, with a pinned `getmeili/meilisearch` **service container** reached at `http://meilisearch:7700`.
|
||||
- **`ci-windows.yml` (`windows` label):** Windows path/encoding/file-lock unit tests (`windows_compat`) on PRs and pushes to `main`. **No `services:`** (host-mode runner has no Docker daemon → Meilisearch integration stays Linux-only). Pre-install Python on the host and call it from PATH (skip `setup-python` quirks on host runners).
|
||||
- **`ci-heavy.yml`:** `workflow_dispatch` only, targets the `heavy` label.
|
||||
|
||||
**Forgejo specifics:** all `uses:` are fully-qualified `https://code.forgejo.org/actions/...` (Forgejo resolves bare refs to its own mirror, not GitHub); artifacts use `code.forgejo.org/forgejo/upload-artifact`. Pin all action versions. Quality gates (ruff, mypy, unit) must pass to merge; coverage is reported. The same checks run in **pre-commit hooks**.
|
||||
|
||||
### Windows runner (Crown0815/Forgejo-runner-windows-builder)
|
||||
|
||||
- **Unofficial, pinned `v12.12.0`.** Host-native (no containers). Runs on the user's **Windows 11 KVM VM** on the Linux box (≥4 GB RAM, ≥60 GB disk).
|
||||
- **Setup:** add a Windows Defender exclusion → download the pinned `forgejo-runner-windows-amd64.exe` → register with `--labels "windows:host,self-hosted:host"` against `http://<linux-host-lan-ip>:3000` → override `ACTIONS_RUNTIME_URL`/`ACTIONS_RESULTS_URL` to that LAN IP (the `.localhost` URL won't resolve on Windows) → run as a service (NSSM).
|
||||
- **Vendored on the Windows host:** Python 3.12, Git, poppler, ffmpeg (heavy model inference is *not* expected on Windows CI).
|
||||
|
||||
### forgejo-stack integration (no compose change)
|
||||
|
||||
The Windows runner cannot be a Compose service (native Windows binary). Integration is documentation + a helper:
|
||||
|
||||
- `scripts/register-windows-runner.sh` — generates a registration token via the Forgejo admin API and prints the ready-to-paste PowerShell block (Defender exclusion, download, register, `config.yaml`, NSSM).
|
||||
- `.env.example` — `WINDOWS_RUNNER_VERSION=v12.12.0` (informational).
|
||||
- `docs/windows-runner.md` — full KVM VM + setup + troubleshooting guide (written in Sprint 0).
|
||||
- `docker-compose.yml` — **unchanged**; the Linux/Docker runner stays as-is.
|
||||
|
||||
This work is tracked as a Sprint-0 infrastructure task and lands in `forgejo-stack`, alongside committing both digger workflow files.
|
||||
|
||||
## Consequences
|
||||
|
||||
- A fresh clone + documented setup yields a **green Linux pipeline (unit + Meilisearch integration)** — the Phase-1 acceptance bar — with the Windows tier running natively on the VM.
|
||||
- Windows-only breakage (paths, encodings, file locks) is caught by CI rather than in production.
|
||||
- Heavy GPU/CPU model tests are isolated and never slow the default pipeline.
|
||||
- The Windows runner's unofficial status is mitigated by pinning and the known fallback (cross-compile the official Go runner).
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **Linux-only CI, Windows later.** Offered but not chosen — the user opted to stand up the Windows runner now via the KVM VM.
|
||||
- **A second compose service for Windows CI.** Impossible — it's a native host binary, not a Linux container.
|
||||
- **GitHub-hosted Windows runners.** Out of scope: we host on a local Forgejo instance, local-first.
|
||||
34
docs/decisions/README.md
Normal file
|
|
@ -0,0 +1,34 @@
|
|||
# Architecture Decision Records
|
||||
|
||||
This directory holds the durable design decisions for digger, plus the two machine-readable contract artifacts. Read these before changing anything in the corresponding layer.
|
||||
|
||||
## Contract artifacts
|
||||
|
||||
| File | What it is |
|---|---|
| [`ir-schema.json`](ir-schema.json) | JSON Schema (draft 2020-12) for the **Canonical Document (IR) v1.0** — the contract between the pipeline and any sink. The most important file in the repo. |
| [`ir-examples.jsonl`](ir-examples.jsonl) | Worked IR records: an Arabic scanned-PDF page, an English chunked report, a mixed-language A/V transcript chunk, and a skipped Access file. |
| [`meilisearch-settings.json`](meilisearch-settings.json) | The concrete `digger_documents` index settings (keyword v1; vector embedder declared but dormant). |
|
||||
|
||||
## ADRs
|
||||
|
||||
| ADR | Decision |
|---|---|
| [0001](0001-architecture-and-layering.md) | Strict layered architecture and the seven core interfaces |
| [0002](0002-intermediate-representation.md) | The Canonical Document (IR) as the versioned contract |
| [0003](0003-meilisearch-index-design.md) | Single Meilisearch index; chunk granularity collapsed by `parent_id`; limits |
| [0004](0004-chunking-strategy.md) | Chunking in v1 as a Transformer concern (and the vector seam) |
| [0005](0005-model-backends-and-ollama.md) | `ModelBackend` interface; OCR/ASR/embed defaults; Ollama as host service |
| [0006](0006-document-conversion-routing.md) | Tiered, content-routed extraction (Docling / Qwen-OCR / Office libs / unoserver) |
| [0007](0007-search-provider-and-ui.md) | Read-side `SearchProvider` interface and the FastAPI + HTMX UI |
| [0008](0008-incremental-dedup-statestore.md) | Deduplication (one-per-path), the StateStore, incremental & delete semantics, reindex |
| [0009](0009-packaging-and-deployment.md) | Docker Compose as primary distribution; zero-install; overridability |
| [0010](0010-ci-and-windows-runner.md) | Layered CI on Forgejo and the Windows runner |
|
||||
|
||||
## Status legend
|
||||
|
||||
- **Accepted** — agreed and in force for v1.
|
||||
- **Proposed** — drafted, awaiting confirmation.
|
||||
- **Superseded** — replaced by a later ADR (linked).
|
||||
|
||||
All ADRs listed above are **Accepted** for v1 unless noted. They reflect the research in [`../research/`](../research/) (synthesized in [`../research/SYNTHESIS.md`](../research/SYNTHESIS.md)) and the project brief [`../digger-brief.md`](../digger-brief.md).
|
||||
4
docs/decisions/ir-examples.jsonl
Normal file
|
|
@ -0,0 +1,4 @@
|
|||
{"schema_version":"1.0","id":"a1b2c3d4e5f60718293a4b5c6d7e8f90112233445566778899aabbccddeeff00","parent_id":"a1b2c3d4e5f60718293a4b5c6d7e8f90112233445566778899aabbccddeeff00","chunk_index":0,"chunk_count":1,"content_hash":"9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08","source":{"path":"/data/scans/عقد_إيجار.pdf","relative_path":"scans/عقد_إيجار.pdf","filename":"عقد_إيجار.pdf","extension":"pdf","source_folder":"/data/scans","mime_type":"application/pdf","file_type":"pdf","size_bytes":418233,"created_at":"2026-03-11T09:12:00Z","modified_at":"2026-03-11T09:12:00Z","created_at_epoch":1773220320,"modified_at_epoch":1773220320,"host":"workstation-01","drive":null},"provenance":{"extractor_name":"pdf_ocr","extractor_version":"0.1.0","processed_at":"2026-07-01T12:00:00Z","model":{"ocr_model":"qwen2.5vl:7b","ocr_model_version":"ollama@2026-06","asr_model":null,"asr_model_version":null}},"content":{"content_text":"عقد إيجار بين الطرف الأول والطرف الثاني ...","content_truncated":false,"title":"عقد إيجار","language":"ar","language_probability":0.99,"tags":[]},"locator":{"page_number":1,"page_end":1,"slide_number":null,"sheet_name":null,"timestamp_start":null,"timestamp_end":null,"speaker":null},"segments":[{"kind":"page","index":0,"label":null,"text":"عقد إيجار بين الطرف الأول والطرف الثاني ...","page_number":1,"slide_number":null,"sheet_name":null,"start":null,"end":null,"speaker":null,"words":null}],"metadata":{"author":null,"page_count":1,"ocr_dpi":300},"embedding":null,"status":"success","warnings":[],"errors":[]}
|
||||
{"schema_version":"1.0","id":"bb11ee22cc33dd44aa55ff66001122334455667788990011223344556677a1c0","parent_id":"7d0a8468ed220400c0b8e6f335baa7e070ce880a37e2ac5995b9a97b809026de","chunk_index":3,"chunk_count":18,"content_hash":"3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b","source":{"path":"/data/reports/annual_review_2025.pdf","relative_path":"reports/annual_review_2025.pdf","filename":"annual_review_2025.pdf","extension":"pdf","source_folder":"/data/reports","mime_type":"application/pdf","file_type":"pdf","size_bytes":9412233,"created_at":"2026-01-04T08:00:00Z","modified_at":"2026-01-20T16:42:00Z","created_at_epoch":1767513600,"modified_at_epoch":1768927320,"host":"workstation-01","drive":null},"provenance":{"extractor_name":"docling","extractor_version":"0.1.0","processed_at":"2026-07-01T12:05:00Z","model":null},"content":{"content_text":"Section 4. Operating results for the fiscal year showed revenue growth of ...","content_truncated":false,"title":"Annual Review 2025","language":"en","language_probability":0.98,"tags":[]},"locator":{"page_number":12,"page_end":13,"slide_number":null,"sheet_name":null,"timestamp_start":null,"timestamp_end":null,"speaker":null},"segments":[{"kind":"page","index":11,"label":null,"text":"Section 4. Operating results ...","page_number":12,"slide_number":null,"sheet_name":null,"start":null,"end":null,"speaker":null,"words":null}],"metadata":{"author":"Finance Dept","page_count":42,"has_text_layer":true},"embedding":null,"status":"success","warnings":[],"errors":[]}
{"schema_version":"1.0","id":"f0e1d2c3b4a5968778695a4b3c2d1e0f00112233445566778899aabbccddee11","parent_id":"c4ca4238a0b923820dcc509a6f75849b8a2f3c557d3e9a4b6c1d0e2f3a4b5c6d","chunk_index":2,"chunk_count":40,"content_hash":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","source":{"path":"/data/media/interview.mp4","relative_path":"media/interview.mp4","filename":"interview.mp4","extension":"mp4","source_folder":"/data/media","mime_type":"video/mp4","file_type":"video","size_bytes":734003200,"created_at":"2026-05-02T14:00:00Z","modified_at":"2026-05-02T14:00:00Z","created_at_epoch":1777730400,"modified_at_epoch":1777730400,"host":"workstation-01","drive":null},"provenance":{"extractor_name":"audio_video","extractor_version":"0.1.0","processed_at":"2026-07-01T12:10:00Z","model":{"ocr_model":null,"ocr_model_version":null,"asr_model":"large-v3","asr_model_version":"openai/whisper-large-v3@abc123"}},"content":{"content_text":"مرحبا، شكرا لانضمامك إلينا اليوم. Let's start with your background ...","content_truncated":false,"title":"interview.mp4","language":"mixed","language_probability":0.91,"tags":[]},"locator":{"page_number":null,"page_end":null,"slide_number":null,"sheet_name":null,"timestamp_start":134.2,"timestamp_end":171.8,"speaker":null},"segments":[{"kind":"transcript","index":2,"label":null,"text":"مرحبا، شكرا لانضمامك إلينا اليوم.","page_number":null,"slide_number":null,"sheet_name":null,"start":134.2,"end":138.6,"speaker":null,"words":null},{"kind":"transcript","index":3,"label":null,"text":"Let's start with your background ...","page_number":null,"slide_number":null,"sheet_name":null,"start":139.0,"end":171.8,"speaker":null,"words":null}],"metadata":{"duration_seconds":3612.4,"codec":"h264/aac","sample_rate_hz":44100,"channels":2},"embedding":null,"status":"success","warnings":[],"errors":[]}
{"schema_version":"1.0","id":"99887766554433221100ffeeddccbbaa99887766554433221100ffeeddccbb22","parent_id":"99887766554433221100ffeeddccbbaa99887766554433221100ffeeddccbb22","chunk_index":0,"chunk_count":1,"content_hash":"2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae","source":{"path":"/data/legacy/old_db.accdb","relative_path":"legacy/old_db.accdb","filename":"old_db.accdb","extension":"accdb","source_folder":"/data/legacy","mime_type":"application/x-msaccess","file_type":"access","size_bytes":2310144,"created_at":"2019-08-01T00:00:00Z","modified_at":"2024-11-15T10:00:00Z","created_at_epoch":1564617600,"modified_at_epoch":1731664800,"host":"linux-ci","drive":null},"provenance":{"extractor_name":"access","extractor_version":"0.1.0","processed_at":"2026-07-01T12:15:00Z","model":null},"content":{"content_text":"","content_truncated":false,"title":"old_db.accdb","language":"und","language_probability":null,"tags":[]},"locator":null,"segments":[],"metadata":{},"embedding":null,"status":"skipped","warnings":["Access extraction requires Windows with the ACE ODBC driver; file skipped on this platform"],"errors":[]}
230
docs/decisions/ir-schema.json
Normal file
@ -0,0 +1,230 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://digger.local/schemas/canonical-document-1.0.json",
  "title": "Canonical Document (IR) v1.0",
  "description": "The intermediate representation produced by extractors and consumed by sinks. One record = one indexable unit. A whole document with no chunking is a single record with chunk_index=0, chunk_count=1, parent_id==id. A chunked document is N records that share parent_id and content_hash, each carrying its own content_text span and locator. This schema is the contract between the pipeline and any search engine; it is engine-agnostic and versioned via schema_version.",
  "type": "object",
  "required": [
    "schema_version",
    "id",
    "parent_id",
    "chunk_index",
    "chunk_count",
    "content_hash",
    "source",
    "provenance",
    "content",
    "status"
  ],
  "additionalProperties": false,
  "properties": {
    "schema_version": {
      "type": "string",
      "const": "1.0",
      "description": "IR schema version. Bumped when the contract changes; recorded so a reindex/backfill can detect records produced by an older schema."
    },
    "id": {
      "type": "string",
      "pattern": "^[a-zA-Z0-9_-]{1,511}$",
      "description": "Primary key. Meilisearch allows only [a-zA-Z0-9_-] and <=511 bytes. For a whole document: sha256_hex(canonical_path + '|' + content_hash). For a chunk: sha256_hex(canonical_path + '|' + content_hash + '|' + chunk_index). Always a 64-char lowercase hex string in v1."
    },
    "parent_id": {
      "type": "string",
      "pattern": "^[a-zA-Z0-9_-]{1,511}$",
      "description": "The logical document this record belongs to. Equals id for a whole document. For chunks, equals sha256_hex(canonical_path + '|' + content_hash) (i.e. the would-be whole-document id). Used as the Meilisearch distinctAttribute so results collapse to one hit per logical file-at-path while the best-matching chunk surfaces."
    },
    "chunk_index": {
      "type": "integer",
      "minimum": 0,
      "description": "0-based index of this chunk within the parent document. 0 for whole documents."
    },
    "chunk_count": {
      "type": "integer",
      "minimum": 1,
      "description": "Total number of chunks for the parent document. 1 for whole documents."
    },
    "content_hash": {
      "type": "string",
      "pattern": "^[a-f0-9]{64}$",
      "description": "SHA-256 hex of the raw file bytes. Drives incremental processing (skip unchanged files) and identity. Identical across all chunks of one file."
    },
    "source": {
      "type": "object",
      "description": "Filesystem provenance of the original file.",
      "required": ["path", "filename", "extension", "source_folder", "mime_type", "file_type", "size_bytes"],
      "additionalProperties": false,
      "properties": {
        "path": {
          "type": "string",
          "description": "Absolute, OS-canonical path to the original file. DISPLAY/OPEN ONLY — never a Meilisearch filterable attribute (paths can exceed the 468-byte filterable-value cap)."
        },
        "relative_path": {
          "type": ["string", "null"],
          "description": "Path relative to the scanned source root, when known. Display only."
        },
        "filename": { "type": "string", "description": "Basename including extension." },
        "extension": { "type": "string", "description": "Lowercased file extension without the dot, e.g. 'pdf'." },
        "source_folder": {
          "type": "string",
          "maxLength": 468,
          "description": "Derived, length-bounded folder token for faceted folder filtering (stays under Meilisearch's 468-byte filterable-value cap). Typically the parent directory path truncated/normalized, or a configured root label. Note that maxLength counts characters, not bytes, so producers must also bound the UTF-8 encoded length (Arabic paths are multi-byte). Filterable."
        },
        "mime_type": { "type": "string", "description": "Detected MIME type, e.g. 'application/pdf'. Filterable." },
        "file_type": {
          "type": "string",
          "enum": ["pdf", "image", "word", "excel", "powerpoint", "access", "audio", "video", "email", "text", "html", "other"],
          "description": "Coarse category used as the primary file-type facet. Filterable."
        },
        "size_bytes": { "type": "integer", "minimum": 0, "description": "File size in bytes. Filterable and sortable." },
        "created_at": { "type": ["string", "null"], "format": "date-time", "description": "ISO 8601 creation timestamp (display)." },
        "modified_at": { "type": ["string", "null"], "format": "date-time", "description": "ISO 8601 modification timestamp (display)." },
        "created_at_epoch": { "type": ["integer", "null"], "description": "Unix epoch seconds for range filtering/sorting without date parsing." },
        "modified_at_epoch": { "type": ["integer", "null"], "description": "Unix epoch seconds for range filtering/sorting." },
        "host": { "type": ["string", "null"], "description": "Machine/host identifier the file was read from. Filterable." },
        "drive": { "type": ["string", "null"], "description": "Drive/volume label (e.g. 'C' on Windows). Filterable." }
      }
    },
    "provenance": {
      "type": "object",
      "description": "Which extractor and models produced this record, and when. Drives the reindex/backfill command when a model or the schema improves.",
      "required": ["extractor_name", "extractor_version", "processed_at"],
      "additionalProperties": false,
      "properties": {
        "extractor_name": { "type": "string", "description": "e.g. 'pdf_ocr', 'docling', 'office_docx', 'audio_video'. Filterable." },
        "extractor_version": { "type": "string", "description": "Semantic version of the extractor that produced this record." },
        "processed_at": { "type": "string", "format": "date-time", "description": "ISO 8601 timestamp of extraction." },
        "model": {
          "type": ["object", "null"],
          "description": "Model identities used, when any. Null for pure structural extraction (e.g. native-digital DOCX). Used to detect when a model upgrade requires re-extraction/re-embedding.",
          "additionalProperties": false,
          "properties": {
            "ocr_model": { "type": ["string", "null"], "description": "e.g. 'qwen2.5vl:7b'." },
            "ocr_model_version": { "type": ["string", "null"] },
            "asr_model": { "type": ["string", "null"], "description": "e.g. 'large-v3'." },
            "asr_model_version": { "type": ["string", "null"], "description": "e.g. HF model id + commit." }
          }
        }
      }
    },
    "content": {
      "type": "object",
      "description": "The searchable payload for this record.",
      "required": ["content_text"],
      "additionalProperties": false,
      "properties": {
        "content_text": {
          "type": "string",
          "description": "Flat plain text indexed for keyword search. For a chunk, only this chunk's span. The chunking transformer keeps each chunk under Meilisearch's 65,535 word-position field limit."
        },
        "content_truncated": {
          "type": "boolean",
          "default": false,
          "description": "True if content_text was truncated to respect the field limit (should be rare once chunking is active; retained as a safety flag)."
        },
        "title": { "type": ["string", "null"], "description": "Extracted or derived title (document property, first heading, or filename fallback). Searchable." },
        "language": {
          "type": ["string", "null"],
          "description": "Detected primary language: an ISO 639-1 code ('ar', 'en'), the ISO 639-2 code 'und' when undetermined, or the non-ISO sentinel 'mixed' for multilingual content. Filterable facet."
        },
        "language_probability": { "type": ["number", "null"], "minimum": 0, "maximum": 1, "description": "Confidence of language detection, when available." },
        "tags": {
          "type": "array",
          "items": { "type": "string" },
          "default": [],
          "description": "Optional enrichment tags (future). Searchable/filterable."
        }
      }
    },
    "locator": {
      "type": ["object", "null"],
      "description": "Where this record's content sits inside the original file, for UI deep-linking. Populated per chunk/segment. Null when not applicable.",
      "additionalProperties": false,
      "properties": {
        "page_number": { "type": ["integer", "null"], "minimum": 1, "description": "1-based PDF/image page where this chunk starts." },
        "page_end": { "type": ["integer", "null"], "minimum": 1, "description": "1-based last page this chunk spans, when multi-page." },
        "slide_number": { "type": ["integer", "null"], "minimum": 1, "description": "1-based PPTX slide." },
        "sheet_name": { "type": ["string", "null"], "description": "XLSX sheet name." },
        "timestamp_start": { "type": ["number", "null"], "minimum": 0, "description": "A/V: seconds into the media where this chunk starts (deep-link target)." },
        "timestamp_end": { "type": ["number", "null"], "minimum": 0, "description": "A/V: seconds into the media where this chunk ends." },
        "speaker": { "type": ["string", "null"], "description": "Speaker label. Null in v1 (diarization deferred to V2); populated without schema change later." }
      }
    },
    "segments": {
      "type": "array",
      "description": "Structured native segments preserved for richer display/deep-linking. A chunk's content_text is assembled from one or more of these. Optional; extractors emit what is natural for the format.",
      "items": {
        "type": "object",
        "required": ["kind", "index", "text"],
        "additionalProperties": false,
        "properties": {
          "kind": { "type": "string", "enum": ["page", "slide", "sheet", "row", "table", "paragraph", "transcript", "note"], "description": "Native structural unit type." },
          "index": { "type": "integer", "minimum": 0, "description": "0-based ordinal within its kind." },
          "label": { "type": ["string", "null"], "description": "Human label, e.g. sheet name or slide title." },
          "text": { "type": "string", "description": "Plain text of the segment." },
          "page_number": { "type": ["integer", "null"], "minimum": 1 },
          "slide_number": { "type": ["integer", "null"], "minimum": 1 },
          "sheet_name": { "type": ["string", "null"] },
          "start": { "type": ["number", "null"], "minimum": 0, "description": "A/V segment start seconds." },
          "end": { "type": ["number", "null"], "minimum": 0, "description": "A/V segment end seconds." },
          "speaker": { "type": ["string", "null"], "description": "Null in v1." },
          "words": {
            "type": ["array", "null"],
            "description": "Optional word-level timestamps (A/V). Null unless word_timestamps were requested.",
            "items": {
              "type": "object",
              "required": ["word", "start", "end"],
              "additionalProperties": false,
              "properties": {
                "word": { "type": "string" },
                "start": { "type": "number", "minimum": 0 },
                "end": { "type": "number", "minimum": 0 },
                "probability": { "type": ["number", "null"], "minimum": 0, "maximum": 1 }
              }
            }
          }
        }
      }
    },
    "metadata": {
      "type": "object",
      "description": "Format-specific metadata. Documented per extractor; keys are stable but the set varies by format (e.g. author, keywords, subject, sheet_names, slide_count, duration_seconds, codec, sample_rate_hz, channels, exif). Selected sub-keys (author/keywords/subject) are searchable.",
      "additionalProperties": true
    },
    "embedding": {
      "type": ["object", "null"],
      "description": "Reserved for vector/hybrid search (V2). Null in v1. When present, the Meilisearch sink maps vector -> _vectors.digger_semantic; model_id/version drive reindex when the embedding model changes.",
      "additionalProperties": false,
      "properties": {
        "model_id": { "type": "string" },
        "model_version": { "type": "string" },
        "dimensions": { "type": "integer", "const": 768, "description": "Committed at index creation; changing this requires a full reindex." },
        "vector": { "type": "array", "items": { "type": "number" }, "minItems": 768, "maxItems": 768 }
      }
    },
    "status": {
      "type": "string",
      "enum": ["success", "partial", "failed", "skipped"],
      "description": "Processing outcome. 'partial' = some content extracted with warnings; 'failed' = extraction error (still recorded, never crashes the run); 'skipped' = size/timeout/capability gate. Filterable."
    },
    "warnings": {
      "type": "array",
      "items": { "type": "string" },
      "default": [],
      "description": "Non-fatal issues (e.g. 'content truncated', 'font substitution during LibreOffice conversion')."
    },
    "errors": {
      "type": "array",
      "default": [],
      "items": {
        "type": "object",
        "required": ["stage", "message"],
        "additionalProperties": false,
        "properties": {
          "stage": { "type": "string", "description": "Where it failed, e.g. 'ocr', 'pdf_render', 'asr', 'conversion'." },
          "message": { "type": "string" }
        }
      },
      "description": "Structured errors for failed/partial records. Drives the quarantine/dead-letter report."
    }
  }
}
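Several rules in the schema descriptions are cross-field (for example, a whole document has `parent_id == id`) and are therefore not enforced by the schema keywords alone. The following is a minimal Python sketch of checking them after schema validation. The helper name `check_invariants` is hypothetical, the `chunk_index < chunk_count` rule is inferred from the field descriptions rather than stated outright, and the byte check reflects that `maxLength` counts characters while the quoted Meilisearch cap is in bytes.

```python
import re

HEX64 = re.compile(r"[a-f0-9]{64}")
MAX_FILTERABLE_BYTES = 468  # cap quoted in the schema descriptions


def check_invariants(record: dict) -> list:
    """Return human-readable violations; an empty list means none were found.

    Hypothetical helper: it covers only cross-field rules taken from the
    schema descriptions and is not a substitute for JSON Schema validation.
    """
    problems = []
    index, count = record["chunk_index"], record["chunk_count"]
    # Inferred rule: a 0-based index must stay below the total chunk count.
    if not 0 <= index < count:
        problems.append("chunk_index must be in [0, chunk_count)")
    # Stated rule: an unchunked document is its own parent.
    if count == 1 and record["parent_id"] != record["id"]:
        problems.append("a whole document must have parent_id == id")
    if not HEX64.fullmatch(record["content_hash"]):
        problems.append("content_hash must be 64 lowercase hex characters")
    # maxLength counts characters, so the byte bound is checked separately.
    folder = record["source"]["source_folder"]
    if len(folder.encode("utf-8")) > MAX_FILTERABLE_BYTES:
        problems.append("source_folder exceeds 468 bytes when UTF-8 encoded")
    return problems
```

A sink could run such a check just before upserting and route any record with a non-empty result to the quarantine report instead of the index.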
85
docs/decisions/meilisearch-settings.json
Normal file
@ -0,0 +1,85 @@
{
  "_comment": "Concrete Meilisearch settings for the digger_documents index. Apply via PATCH /indexes/digger_documents/settings after creating the index with primaryKey 'id'. Strip this _comment key before sending (JSON has no comments; Meilisearch ignores unknown keys but keep payloads clean). See ADR 0003 for rationale. Pin engine: getmeili/meilisearch:v1.48.3.",

  "searchableAttributes": [
    "content.content_text",
    "content.title",
    "source.filename",
    "content.tags",
    "metadata.author",
    "metadata.keywords",
    "metadata.subject"
  ],

  "filterableAttributes": [
    "source.file_type",
    "source.mime_type",
    "content.language",
    "status",
    "provenance.extractor_name",
    "source.source_folder",
    "source.host",
    "source.drive",
    "source.modified_at_epoch",
    "source.created_at_epoch",
    "source.size_bytes",
    "content.content_truncated"
  ],

  "sortableAttributes": [
    "source.modified_at_epoch",
    "source.created_at_epoch",
    "source.size_bytes",
    "source.filename"
  ],

  "displayedAttributes": ["*"],

  "distinctAttribute": "parent_id",

  "rankingRules": [
    "words",
    "typo",
    "proximity",
    "attribute",
    "sort",
    "exactness"
  ],

  "faceting": {
    "maxValuesPerFacet": 200,
    "sortFacetValuesBy": { "*": "count" }
  },

  "typoTolerance": {
    "enabled": true,
    "minWordSizeForTypos": { "oneTypo": 5, "twoTypos": 9 },
    "disableOnWords": [],
    "disableOnAttributes": ["id", "parent_id", "content_hash", "source.path"],
    "disableOnNumbers": true
  },

  "synonyms": {},

  "stopWords": [],

  "localizedAttributes": [
    {
      "attributePatterns": ["content.content_text", "content.title", "metadata.*", "content.tags"],
      "locales": ["ara", "eng"]
    }
  ],

  "pagination": { "maxTotalHits": 10000 },

  "proximityPrecision": "byWord",

  "searchCutoffMs": 1500,

  "embedders": {
    "digger_semantic": {
      "source": "userProvided",
      "dimensions": 768
    }
  }
}
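The `_comment` key asks for itself to be stripped before the settings are sent. A minimal Python sketch of that step follows, using only the standard library. It builds the `PATCH` request but does not send it. The function name `build_settings_request`, the default base URL `http://localhost:7700`, and the optional API key argument are assumptions for illustration; the endpoint path is the one named in `_comment`.

```python
import json
import urllib.request
from typing import Optional


def build_settings_request(
    settings_text: str,
    base_url: str = "http://localhost:7700",  # assumed local Meilisearch address
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) the settings PATCH for digger_documents."""
    settings = json.loads(settings_text)
    # Drop documentation-only keys such as "_comment" before sending.
    payload = {key: value for key, value in settings.items() if not key.startswith("_")}
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=f"{base_url}/indexes/digger_documents/settings",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="PATCH",
    )
```

Sending the request (for example with `urllib.request.urlopen`) is left out on purpose, so the payload can be inspected or unit-tested without a running engine.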
267
docs/digger-brief.md
Normal file
@ -0,0 +1,267 @@
# Project Brief / Kickoff Prompt: File-Ingestion Search Pipeline

> Paste this whole document as your first message to Claude Code. It is written **to** Claude Code.
> Items in `<ANGLE BRACKETS>` are decisions for me (the human) — ask me about them before you build.

---

## 1. Mission

Build a **modular file-ingestion pipeline** that walks files on a machine, extracts their content (including scanned documents, Office files, and audio/video), normalizes everything into a single well-defined **intermediate document model**, and feeds that into **Meilisearch** for full-text search.

Two hard requirements shape every decision:

1. **The pipeline must be usable standalone, without Meilisearch.** It must be able to read files and emit the intermediate representation (IR) to disk on its own. Indexing into a search engine is a separate, swappable stage.
2. **The search backend must be swappable.** Meilisearch is the first target, but it sits behind an interface so it can be replaced by another full-text (or vector) search engine without touching the pipeline.

Target platform is **primarily Windows**, but the code must be **cross-platform** (Windows + Linux + macOS).

**Stack is Python** (chosen for its OCR / ML / document-parsing ecosystem). **Semantic / vector + hybrid search is a first-class goal:** it does not have to ship in v1, but the IR, the pipeline, and the index design must be built to accommodate it from day one so we never have to re-architect for it (see Section 9).

## 2. Non-negotiable principles

- **Strict layering.** Each stage talks to the next only through a documented interface, never through a concrete implementation.
- **The intermediate representation is the contract.** It is the most important artifact in the repo. Design it first, document it, version it, and keep it stable.
- **Local-first / privacy-first.** All content processing (OCR, transcription, any AI) runs against **local models**. No file content is sent to external services unless I explicitly opt in. State this constraint in the README and enforce it in code (no silent network egress of content).
- **Prefer existing, well-maintained tools** over reinventing — provided they can hook into our own local models. Evaluate before adopting.
- **Idempotent and incremental.** Re-running over a directory must not reprocess unchanged files and must handle additions, changes, and deletions cleanly.
- **Fail loud, fail isolated.** One unreadable file must never abort a run. Failures are logged, quarantined, and reportable.
- **Tested and CI-gated from day one.** The first commits stand up the test suite and a green CI pipeline on the local Forgejo instance (Section 11); features land behind passing tests, not after them.
- **Zero-install by default.** The operator should get a working system without manually installing dependencies — bundle the heavy native pieces rather than asking the user to install them (Section 12). Batteries-included, but every bundled piece stays overridable.
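
The "idempotent and incremental" principle above can be sketched as a small planning step against a local state table. This is a minimal illustration, assuming a SQLite table keyed by path that stores the last seen content hash. The names `open_state`, `plan_run`, and `record_run` are hypothetical and not part of any agreed interface.

```python
import sqlite3


def open_state(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the hypothetical state table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, content_hash TEXT NOT NULL)"
    )
    return conn


def plan_run(conn: sqlite3.Connection, seen: dict) -> tuple:
    """Compare the files seen on disk (path -> hash) with the stored state.

    Returns (to_process, to_delete): new or changed paths, and paths that
    were indexed before but no longer exist.
    """
    known = dict(conn.execute("SELECT path, content_hash FROM files"))
    to_process = sorted(p for p, h in seen.items() if known.get(p) != h)
    to_delete = sorted(p for p in known if p not in seen)
    return to_process, to_delete


def record_run(conn: sqlite3.Connection, seen: dict, to_delete: list) -> None:
    """Persist the outcome so the next run skips unchanged files."""
    conn.executemany(
        "INSERT OR REPLACE INTO files (path, content_hash) VALUES (?, ?)",
        list(seen.items()),
    )
    conn.executemany("DELETE FROM files WHERE path = ?", [(p,) for p in to_delete])
    conn.commit()
```

Running `plan_run` twice over an unchanged directory yields an empty work list on the second pass, which is the behaviour the principle asks for.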

## 3. How we work — superpowers-driven planning, then iterative delivery

You have the **`superpowers` plugin** (obra/superpowers). Use its **brainstorm → plan → execute** methodology, its TDD (red → green → refactor) discipline, and its subagent code-review throughout. Because superpowers prioritizes project-specific instructions, this brief should live where it will be honored: commit it as `docs/PROJECT_BRIEF.md` and anchor/reference it from `CLAUDE.md`.

**Do not build features until I approve the plan (step 6).**

1. **Read this brief**, run a **brainstorming** pass, and ask me the open questions in Section 14. Wait for my answers.
2. **Research** (Section 6): spawn subagents in parallel; each writes findings to `docs/research/`.
3. **Index-design discussion** (Section 7): produce the IR schema, the Meilisearch index/schema design (incl. the vector/embedder reservations from Section 9), the read-side search API + UI approach (Section 10), and ADRs — all in `docs/`.
4. **Plan** (superpowers `write-plan`): turn the design into an **agile, sprint-based V1** where every sprint ships a working **end-to-end** slice (Section 13). Use role-based agents (PM, Tech Lead, …) to detail it.
5. **Create the Forgejo V1 milestone with every task as an issue**, plus a **coarse V2 milestone** (Section 13).
6. **Present the plan + milestone to me for approval. Stop here until I sign off.**
7. **Deliver sprint by sprint** using the Forgejo dev loop (Section 13): pick an issue → branch + TDD → PR (`Closes #N`) → code-review + green CI → **I approve and merge**. The first sprint lands repo scaffolding, a **green CI pipeline** (Section 11), and the core interfaces before any extractor.

## 4. Proposed architecture (critique and refine this — don't take it as final)

Treat the following as a strong starting point that your design agents should challenge and improve, not a spec to implement blindly.

```
   file system
        │
        ▼
 [ Source / Walker ]              discovers files, yields file references + filesystem metadata
        │
        ▼
 [ Router ]                       picks an Extractor based on type/mime/content sniffing
        │
        ▼
 [ Extractor (per format) ]       uses Model Backends as needed (OCR / ASR / VLM / embeddings)
        │
        ▼
 [ Canonical Document (the "middle type" / IR) ]   <-- THE CONTRACT, serializable to JSONL on disk
        │
        ▼
 [ Transformer / Enricher ]       normalization, language detection, optional chunking, optional embeddings
        │
        ▼
 [ Sink / Indexer (interface) ]   Meilisearch adapter is one impl; a "file/null" sink writes IR to disk
```

Supporting components that cut across the stages:

- **Model Backend interface** — abstracts local models: OCR, automatic speech recognition (ASR/transcription), vision-language understanding, and (optionally) embeddings. Concrete backends are configured by endpoint/runtime (e.g. a local server or in-process library). Extractors depend on the interface, never a specific model.
- **SearchProvider interface (read side)** — the query-time mirror of the `Sink`. The UI and any search API talk only to this; the Meilisearch implementation is one adapter. Supports keyword, semantic, and hybrid query modes (see Sections 9 and 10) so swapping engines never touches the UI.
- **State Store** — records what has been processed (e.g. a local SQLite DB) keyed by a content hash + path, to drive incremental runs and deletion handling.
- **Config** — a single config file (TOML or YAML) with env-var overrides. Selects the active sink, model backends, source roots, concurrency, and per-format options.
- **CLI** — subcommands such as `scan`, `extract` (files → IR on disk), `index` (IR → sink), `run` (end-to-end), and `status`. The `extract` + `index` split is what makes the pipeline usable without a search engine.
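
The layering above, where stages meet only at interfaces and a file sink makes the standalone path work, might look roughly like this in Python. It is a sketch under stated assumptions: the method names `extract` and `write`, the `Record` alias, and the `JsonlFileSink` class are hypothetical, and a plain `dict` stands in for the Canonical Document.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, Protocol

Record = dict  # stand-in for the Canonical Document (IR)


class Extractor(Protocol):
    """Turns one file into zero or more IR records."""

    def extract(self, path: Path) -> Iterator[Record]: ...


class Sink(Protocol):
    """Consumes IR records; returns how many were written."""

    def write(self, records: Iterable[Record]) -> int: ...


class JsonlFileSink:
    """Standalone sink: appends each record as one JSON line on disk."""

    def __init__(self, out_path: Path) -> None:
        self.out_path = out_path

    def write(self, records: Iterable[Record]) -> int:
        written = 0
        with self.out_path.open("a", encoding="utf-8") as handle:
            for record in records:
                # ensure_ascii=False keeps Arabic text readable in the file.
                handle.write(json.dumps(record, ensure_ascii=False) + "\n")
                written += 1
        return written


def run(paths: Iterable[Path], extractor: Extractor, sink: Sink) -> int:
    """Wire the stages together using only the two interfaces."""
    return sum(sink.write(extractor.extract(path)) for path in paths)
```

Because `run` depends only on the two protocols, replacing `JsonlFileSink` with a search-engine adapter would not change the extractor side.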

### The Canonical Document (the "middle type")

This is the central deliverable of the design phase. It must be a serializable schema (JSON/JSONL) that is rich enough to drive search but independent of any search engine. Design it to include at least:

- **Identity:** a stable `id` derived from path + content hash; the `content_hash` itself.
- **Source metadata:** absolute/relative path, filename, extension, detected mime type, size, created/modified timestamps, host/drive, and (where relevant) network-share/UNC origin.
- **Provenance:** which extractor + version produced it, and when.
- **Content:** a plain-text field for search, **plus** optional structured segments that preserve native structure where it matters — pages (PDF), sheets/rows (spreadsheets), slides (PPTX), and tables. For audio/video, transcript segments with timestamps and (if available) speaker labels.
- **Format-specific metadata:** e.g. Office document properties, image EXIF, media duration/codec.
- **Derived fields:** detected language; optional tags; optional embeddings.
- **Processing status:** success/partial/failed, plus warnings and errors.

Decide explicitly: **whole-document vs. chunked indexing.** Long documents may need to be split into chunks for good search relevance. Whatever you choose, the chunking must be a Transformer concern, not baked into extractors, so it can be turned off for the standalone case.
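
As an illustration of chunking as a switchable Transformer step, the sketch below derives ids with the formula given in the IR schema descriptions in this commit (`sha256_hex(canonical_path + '|' + content_hash)` for a whole document, with `'|' + chunk_index` appended for a chunk). Splitting on a fixed word count and the names `chunk_records` and `max_words` are assumptions made for the example only; the actual chunking strategy is a design decision, not something this sketch settles.

```python
import hashlib
from typing import Optional


def sha256_hex(text: str) -> str:
    """SHA-256 of the UTF-8 encoded text, as 64 lowercase hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_records(
    canonical_path: str,
    content_hash: str,
    text: str,
    max_words: Optional[int] = None,  # None switches chunking off
) -> list:
    """Return one whole-document record, or N chunk records sharing parent_id."""
    if max_words is not None and max_words < 1:
        raise ValueError("max_words must be a positive integer or None")
    parent_id = sha256_hex(f"{canonical_path}|{content_hash}")
    words = text.split()
    if max_words is None or len(words) <= max_words:
        # Whole document: chunk 0 of 1, and it is its own parent.
        return [{"id": parent_id, "parent_id": parent_id, "chunk_index": 0,
                 "chunk_count": 1, "content_text": text}]
    spans = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    return [
        {"id": sha256_hex(f"{canonical_path}|{content_hash}|{index}"),
         "parent_id": parent_id, "chunk_index": index,
         "chunk_count": len(spans), "content_text": span}
        for index, span in enumerate(spans)
    ]
```

With `max_words=None` the function returns the single-record form, which is the shape the standalone, no-chunking case needs.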

## 5. Scope and priorities

Build in this order. The architecture must make adding a new format = adding a new Extractor, with no changes to other layers.

**Priority 1 — Scanned documents (PDF, JPEG, PNG).**
These are images of text, handled by a local OCR / document-understanding model. **The actual approach and model are decided by the research (Agents B/D and the synthesis), not assumed up front.** I've experimented locally and can hand you my findings and model setup as **one input to weigh** — ask me for them — but treat them as a starting point, not the answer; if the research points to something better, propose that. Build the full end-to-end path (walk → OCR → IR → Meilisearch) for this priority first so the whole architecture is exercised early.

**Priority 2 — Microsoft Office files** (`.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`, `.mdb`/`.accdb`, etc.).
Note the split between modern OOXML formats and legacy binary formats, and the cross-platform difficulty of Access databases — flag these in research.

**Priority 3 — Audio and video files.**
Extract audio, transcribe with a local ASR model, then treat the transcript like text (optionally with AI summarization/understanding), preserving timestamps in the IR. **The specific extraction/transcription approach and models are decided by the research (Agent D), not assumed** — the above is the shape of the pipeline, not a fixed tool choice.

## 6. Research subagents (run in parallel, write findings to `docs/research/`)

Spawn focused subagents. Each must read primary sources, verify current capabilities and versions (your training data may be stale — check the live docs), and produce a written findings doc with recommendations and trade-offs. At minimum:

- **Agent A — Meilisearch.** Read the Meilisearch documentation thoroughly, starting at https://www.meilisearch.com/ and its docs. Cover: index creation and the **settings API** (searchable / filterable / sortable attributes, ranking rules, primary key, distinct attribute), **faceting**, synonyms, stop words, typo tolerance, document upsert/delete semantics, batch/task handling, and **document size / field limits**. Also investigate **vector / hybrid / AI-powered search and embedders** — since we already run local models, semantic search may be worth supporting. Note version requirements.
- **Agent B — Local model integration & end-to-end tooling.** Research existing frameworks that convert many file types into a structured/markdown intermediate and that allow plugging in **our own local models**. Evaluate the major "everything-to-structured/markdown" toolkits explicitly — **Docling** (IBM), **MarkItDown** (Microsoft), **Unstructured**, and **Apache Tika** — alongside OCR stacks (Tesseract / PaddleOCR / docTR / Surya) and locally-served vision-language models. For each, assess: which file types it covers, whether it can use **our local models** (vs. calling out), output structure quality, and cross-platform/Windows support. Report which we can adopt wholesale, which to use per-format, and where we still need our own extractor.
- **Agent C — Office & legacy formats.** Best cross-platform libraries for OOXML (docx/xlsx/pptx), strategies for legacy binary `.doc/.xls/.ppt` (e.g. headless LibreOffice conversion), and Access `.mdb/.accdb` on Windows vs. Linux/macOS. Identify the Windows-only gotchas.
- **Agent D — Audio/video transcription.** Local ASR options (e.g. Whisper-family runtimes), audio extraction (e.g. ffmpeg), timestamps and optional speaker diarization, and GPU vs. CPU performance trade-offs.
- **Agent E — Search frontend & UX.** Evaluate the UI options in Section 10 (Python-native: Streamlit/NiceGUI; minimal-JS: FastAPI + Jinja2 + HTMX; full JS: InstantSearch/`instant-meilisearch`), how each handles faceting/highlighting/typo-tolerant search, and the direct-to-engine vs. through-the-API trade-off. Recommend a default and note what's needed to keep the UI engine-agnostic.

After research, **synthesize** the findings into a single recommendations doc.
|
||||
|
||||
## 7. Index-design discussion (the "agents debate the indexes" part)
|
||||
|
||||
Have at least two agents take **different perspectives** and propose competing designs, then reconcile:
|
||||
|
||||
- one optimizing for **search relevance / UX** (what should be searchable, filterable, sortable, facetable; how to rank; synonyms; typo tolerance; chunking for relevance), and
|
||||
- one optimizing for **data modeling / pipeline cleanliness** (how the IR maps to documents, stable IDs, field flattening, avoiding lossy transforms, incremental update semantics, document-size limits).
|
||||
|
||||
They should explicitly resolve: single index vs. multiple indexes (e.g. by file type), the IR→document field mapping, which fields are filterable/sortable/searchable, chunking strategy, primary key choice, and how updates/deletes propagate. Capture the outcome as an **ADR** plus a concrete proposed Meilisearch settings configuration, and a concrete IR JSON schema. Surface any genuinely open trade-offs to me rather than silently picking.
|
||||
|
||||
## 8. Implementation expectations (after I approve the design)
|
||||
|
||||
- **Repo scaffolding:** clear module boundaries matching the layers; a README documenting the architecture and the IR contract; the chosen config format with a sample config.
|
||||
- **Interfaces first:** define `Source`, `Extractor`, `ModelBackend`, `Transformer`, `Sink`/`Indexer`, `SearchProvider` (read side), and `StateStore` as explicit protocols/interfaces. Provide a trivial "file sink" (writes IR to disk) so the standalone path works on day one.
|
||||
- **Phased build:** Priority 1 fully end-to-end and tested before starting Priority 2, etc.
|
||||
- **Incremental indexing:** content-hash + mtime tracked in the State Store; stable IDs for idempotent upserts; detect and propagate deletions.
|
||||
- **Concurrency:** parallel file processing with a worker pool, while respecting limits on local model calls (e.g. serialize/queue GPU-bound work). Make concurrency configurable.
|
||||
- **Error handling & observability:** structured logging, per-file status, a quarantine/dead-letter path for failures, and a `status` summary (counts of processed/failed/skipped).
|
||||
- **Testing:** a small fixture corpus per format under `tests/`; unit tests for each extractor and the IR serialization; an integration test for the end-to-end path against a local Meilisearch (or a mock sink). Keep tests cross-platform.
|
||||
- **Cross-platform care:** handle Windows path separators, long paths, file encodings, and file-lock situations; gate any OS-specific dependency (Access, LibreOffice, ffmpeg) behind capability checks with clear error messages when missing.
|
||||
- **Per-file timeouts & size limits:** OCR and transcription can hang or explode on huge inputs. Enforce a per-file timeout and a configurable max-size skip, recorded as a skipped/failed status — never a crashed run.
|
||||
- **Reindex / backfill command:** because the IR records `extractor_version` and the embedding-model id/version, provide an explicit command to reprocess or re-embed when a model or the schema improves. Don't rely on manual deletion.
|
||||
- **Deduplication policy:** decide what happens when the same content hash appears at multiple paths (index once with multiple source paths, vs. one document per path).
|
||||
- **Run modes & triggering:** support a one-shot CLI run now, and keep the core run loop decoupled from how it's triggered so a watched-folder mode (filesystem events) or a scheduled/background service can be added later — on Windows this may eventually run as a service or scheduled task.
|
||||
- **Secrets & keys:** model endpoints and search-engine API keys live in config / `.env`, never committed. Document the keys required.
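The "interfaces first" item above can be sketched as follows. This is a minimal illustration, not the project's actual contract: the method names `upsert` and `delete`, and the one-JSON-file-per-document layout, are assumptions chosen for the example.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any, Iterable, Protocol, runtime_checkable


@runtime_checkable
class Sink(Protocol):
    """Write side: receives IR documents and persists them somewhere.

    Hypothetical shape; the real protocol is defined in the design phase.
    """

    def upsert(self, documents: Iterable[dict[str, Any]]) -> int: ...

    def delete(self, ids: Iterable[str]) -> int: ...


class FileSink:
    """Trivial sink: writes one JSON file per IR document, named by its id."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def upsert(self, documents: Iterable[dict[str, Any]]) -> int:
        count = 0
        for doc in documents:
            target = self.root / f"{doc['id']}.json"
            # ensure_ascii=False keeps Arabic text readable on disk.
            target.write_text(json.dumps(doc, ensure_ascii=False), encoding="utf-8")
            count += 1
        return count

    def delete(self, ids: Iterable[str]) -> int:
        count = 0
        for doc_id in ids:
            target = self.root / f"{doc_id}.json"
            if target.exists():
                target.unlink()
                count += 1
        return count
```

Because `FileSink` satisfies the protocol structurally, a Meilisearch-backed sink could later be swapped in without touching the extractors.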
|
||||
|
||||
## 9. Vector & hybrid search (design for it from day one)
|
||||
|
||||
We will support **semantic (vector)** and **hybrid (keyword + vector)** search. **v1 ships keyword-only** — but nothing in the design may make adding vectors later expensive, so the IR, pipeline, and index schema must accommodate them from day one (build the seams, leave the feature switched off).
|
||||
|
||||
- **Embeddings are generated locally**, behind the same Model Backend interface as OCR/ASR — never via an external API.
|
||||
- **Document-level vs. chunk-level embeddings:** good vector relevance usually wants chunk-level embeddings, which ties directly to the chunking decision in Section 7. Keep the two consistent.
|
||||
- The IR can carry embeddings **plus** the embedding-model id/version that produced them. Changing the embedding model or the chunking strategy invalidates existing vectors and requires a **reindex** — make that an explicit, supported operation (see the reindex command in Section 8), not a manual scramble.
|
||||
- **Index design must reserve for vectors now:** have Agent A research Meilisearch's embedder configuration, user-provided vs. auto-generated embeddings, hybrid ranking, and the version this requires, and bake the necessary fields/config into the proposed schema even if it ships disabled.
|
||||
- Keep it **engine-agnostic:** a replacement search engine might handle vectors differently (or not at all), so the read-side search interface (Section 10) must express **keyword**, **semantic**, and **hybrid** query modes generically rather than leaking Meilisearch specifics.
|
||||
|
||||
## 10. Frontend & search UX
|
||||
|
||||
This is the area I'm least familiar with, so propose options with clear trade-offs and a recommended default, then build something usable early even if basic.
|
||||
|
||||
**Core principle — the UI is just another client.** It must not reach into the pipeline and must not hardcode Meilisearch. Define a **read-side `SearchProvider` interface** that mirrors the write-side `Sink` (methods like `query`, `facets`, `suggest`, supporting the keyword/semantic/hybrid modes from Section 9) and expose it through a thin API. Swapping the search engine then means writing one new `SearchProvider` adapter; the UI never changes.
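A rough sketch of what such a read-side interface might look like is given here. The names `SearchMode`, `SearchRequest`, and `InMemorySearchProvider` are hypothetical, and the in-memory adapter does naive substring matching only; it exists to show that a fake can stand in for the engine in unit tests.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Protocol


class SearchMode(Enum):
    KEYWORD = "keyword"
    SEMANTIC = "semantic"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class SearchRequest:
    text: str
    mode: SearchMode = SearchMode.KEYWORD
    filters: dict[str, Any] = field(default_factory=dict)
    limit: int = 20


class SearchProvider(Protocol):
    """Engine-agnostic read side (hypothetical method signatures)."""

    def query(self, request: SearchRequest) -> list[dict[str, Any]]: ...

    def facets(self, request: SearchRequest, attributes: list[str]) -> dict[str, dict[str, int]]: ...

    def suggest(self, prefix: str, limit: int = 5) -> list[str]: ...


class InMemorySearchProvider:
    """Toy keyword-only adapter for tests; not a real search engine."""

    def __init__(self, documents: list[dict[str, Any]]) -> None:
        self._documents = list(documents)

    def _matches(self, request: SearchRequest) -> list[dict[str, Any]]:
        if request.mode is not SearchMode.KEYWORD:
            raise NotImplementedError(f"{request.mode.value} search is not supported")
        needle = request.text.casefold()
        hits = [d for d in self._documents if needle in d.get("content_text", "").casefold()]
        # Exact-match filters only, which is enough for a fake.
        for key, value in request.filters.items():
            hits = [d for d in hits if d.get(key) == value]
        return hits

    def query(self, request: SearchRequest) -> list[dict[str, Any]]:
        return self._matches(request)[: request.limit]

    def facets(self, request: SearchRequest, attributes: list[str]) -> dict[str, dict[str, int]]:
        counts: dict[str, dict[str, int]] = {a: {} for a in attributes}
        for doc in self._matches(request):
            for attr in attributes:
                if attr in doc:
                    value = str(doc[attr])
                    counts[attr][value] = counts[attr].get(value, 0) + 1
        return counts

    def suggest(self, prefix: str, limit: int = 5) -> list[str]:
        p = prefix.casefold()
        words = {
            w
            for d in self._documents
            for w in d.get("content_text", "").casefold().split()
            if w.startswith(p)
        }
        return sorted(words)[:limit]
```

Expressing the mode as an enum on the request keeps engine specifics out of the UI: an adapter that cannot serve a mode can refuse it explicitly.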
|
||||
|
||||
**Key decision to put to me (show the trade-off):**
|
||||
|
||||
- **Option A — UI queries the search engine directly** (e.g. Meilisearch's InstantSearch components / `instant-meilisearch`). Best turnkey UX — instant results, faceting, highlighting — for the least effort, *but* it couples the UI to Meilisearch and exposes a search-only key to the browser, which breaks our swappability promise on the read path.
|
||||
- **Option B — UI queries a thin Python search API** (FastAPI) that wraps `SearchProvider`. Keeps the system engine-agnostic end to end at the cost of reimplementing some glue (faceting, highlighting, pagination). **Recommended default**, given how central modularity is to this project. A rich JS search UI can still be added later in front of the same API.
|
||||
|
||||
**Frontend technology options** (I am not a frontend developer — bias toward staying in Python / minimal JS, and treat the UI as fully replaceable):
|
||||
|
||||
- **Fastest, Python-only:** Streamlit or NiceGUI — stand up a working search UI quickly; good for v1 and internal use. NiceGUI gives more real-app structure; Streamlit is the quickest to demo.
|
||||
- **Clean, minimal-JS (recommended starting point):** FastAPI serving server-rendered templates (Jinja2) enhanced with **HTMX** for search-as-you-type. Stays almost entirely in Python, is genuinely modular, and is production-reasonable.
|
||||
- **Best-in-class UX later:** a JS framework (React / Vue / Svelte) using Meilisearch's InstantSearch components, talking to the Python API (Option B) or the engine directly (Option A).
|
||||
|
||||
Have the research agents confirm current options and recommend one — but ship a basic working search page early rather than gold-plating.
|
||||
|
||||
**Search UX features to plan for** (not all in v1): search-as-you-type; typo-tolerant matching with **highlighted snippets**; **faceted filters** (file type, date, source folder, language); sorting; pagination or infinite scroll; result previews; and a clear way to **open or locate the original file** — including the page number for PDFs and the **timestamp for audio/video** so users can jump to the moment. Always show **provenance** (path, page/slide/sheet, timestamp). Handle empty-result and error states gracefully. An optional **admin view** can surface indexing status, last run, and failures (reuse the `status` summary from Section 8).
|
||||
|
||||
**Frontend security:** **single-user, no per-user access control** — so no tenant tokens or per-document ACLs are needed. Still, never expose the search engine's master key to a browser: if the UI ever queries the engine directly (Option A), use a **search-only key**. With Option B the API holds the key and the browser sees nothing.
|
||||
|
||||
## 11. CI/CD (Forgejo — set up on day one)
|
||||
|
||||
We host on a **local Forgejo instance** and want **continuous integration working from the very first commit**, not bolted on later. CI is part of Phase 1 scaffolding, before any extractor is written.
|
||||
|
||||
- **Forgejo Actions.** Use Forgejo Actions workflows (largely GitHub-Actions-compatible syntax) under `.forgejo/workflows/`, run by a **Forgejo Runner** (`forgejo-runner`; `act_runner` is the Gitea equivalent). Confirm with me which runners are registered and their labels/capabilities (Docker vs. host, and whether a Windows runner exists); don't assume. If syntax/feature gaps from GitHub Actions show up, flag them rather than guessing.
|
||||
- **Layered, fast-by-default test suite.** Separate the tiers so most pushes stay fast:
|
||||
- **Unit tests** — no external services, no real models; run on every push. The `ModelBackend` and `Sink`/`SearchProvider` interfaces make this clean: inject fakes/mocks.
|
||||
- **Integration tests** — spin up a real **Meilisearch** in the job (service container or `docker run`) and exercise the indexing + query path against it, still with **mocked model backends**. Run on every PR to the main branch.
|
||||
- **Heavy / real-model tests** — OCR/ASR against actual local models are GPU-bound and slow; **gate these behind a marker** (e.g. a pytest marker / env flag) so they only run locally or on a specifically-labelled runner, never in the default pipeline.
|
||||
- **Cross-platform CI with a real Windows runner.** Primary target is Windows, so don't let "passes in Linux CI" hide Windows-only breakage. The official Forgejo Runner is Linux-only; use the community **`Crown0815/Forgejo-runner-windows-builder`** prebuilt Windows runner, registered against our instance with a `windows` label, to run a dedicated Windows job. Notes to account for: it's an unofficial community build (pin a known release); antivirus/Defender may flag the binary, so an exception is needed on the runner host; and a native Windows runner runs jobs **on the host, not in containers**. Therefore split CI by runner: the **Linux/Docker runner** handles the Meilisearch integration tier (service container), and the **Windows host runner** handles the path/encoding/file-lock/Windows-only unit tests (host-native, with any needed tools vendored on that host). Confirm the runner labels with me.
|
||||
- **Quality gates on every push:** linting + formatting (**ruff**, and `ruff format` or **black**), **type checking** (**mypy**), and the unit suite — all must pass to merge. Add **pytest coverage** reporting.
|
||||
- **Dependency & tool management:** pin dependencies (lockfile), cache them in CI, and pin the Python version(s) tested. Mirror the same checks in **pre-commit hooks** so failures surface before CI.
|
||||
- **Pipeline-as-code, reviewed like code.** Workflows live in the repo; keep them minimal and documented in the README (how to run each tier locally, how the runner is expected to be configured).
|
||||
- **Fixtures, not real data.** Ship a tiny sanitized fixture corpus per format under `tests/`; never commit real/sensitive documents. Keep large/binary fixtures minimal.
|
||||
|
||||
The acceptance bar for Phase 1: a fresh clone, the documented setup, and a green pipeline (unit + Meilisearch integration) on the local Forgejo instance.
|
||||
|
||||
## 12. Packaging & setup (make it as close to zero-install as possible)
|
||||
|
||||
A top priority: an operator should be able to get a working system **without manually installing dependencies**. This is hard here because the stack pulls in heavy native pieces — Meilisearch, an OCR runtime, an ASR runtime + ffmpeg, and LibreOffice for legacy Office conversion. The lever for "no manual installs" is to **bundle those, not ask the user to install them.**
|
||||
|
||||
- **Primary distribution — one-command Docker Compose stack.** Ship a `docker compose up` that brings up everything wired together: the pipeline, Meilisearch, the search UI/API, and the converter/model services, with ffmpeg / LibreOffice / OCR baked into the images. The operator's only prerequisite is a container runtime (Docker Desktop or Podman) — nothing else to install. Provide **sensible CPU-only defaults and a zero-config first run** (auto-create the index, ship an example config) so that `docker compose up` + pointing at a folder yields working search.
|
||||
- **GPU is opt-in, and the model backend can live outside Docker.** Because models sit behind the `ModelBackend` interface, the model runtime can be either a container **or** a host service the stack points at (e.g. a local model server). This avoids forcing every user through GPU-in-Docker setup (which on Windows means WSL2 + the NVIDIA Container Toolkit). Default to CPU; document GPU as an explicit upgrade.
|
||||
- **Secondary option for a non-Docker, Windows-first install — flag the trade-off.** If you (the human) would rather hand a non-technical Windows user a double-click experience with no Docker concept at all, the alternative is a **bundled native installer** (e.g. Inno Setup/MSI) that vendors the frozen Python app plus `meilisearch.exe`, `ffmpeg`, the OCR runtime, and a portable LibreOffice. This is the friendliest for an end user but the most work to build and maintain cross-platform. Have the research/design agents weigh Docker Compose vs. native bundle and recommend; **default to Docker Compose for v1** unless I say otherwise.
|
||||
- **Zero-install must not break modularity.** Batteries-included is the *default*, not a lock-in: every bundled service (Meilisearch, the model backend, the UI) stays independently overridable via config / compose overrides, so a user can point at their own Meilisearch or their own model server. The standalone, pip-installable pipeline (no UI, no search engine) remains a supported lighter path for developers.
|
||||
- **Document the few prerequisites honestly.** Whatever the choice, the README states the single prerequisite (container runtime, or nothing for the native bundle) and gives a copy-paste quickstart. No multi-step "install Tesseract, then ffmpeg, then…" lists.
|
||||
|
||||
## 13. Agile delivery & Forgejo workflow
|
||||
|
||||
All planning and execution runs through Forgejo — agile and iterative, on top of the superpowers methodology.
|
||||
|
||||
**Role-based agents.** Spin up as many specialized agents as the work needs, both while *detailing* the plan and later while *executing* it: **PM** (scope, sprint/issue breakdown, acceptance criteria), **Tech Lead** (architecture, interface contracts, PR review), **Senior Fullstack** (implementation), **UX/UI** (the search frontend), **QA** (tests/fixtures), and any others that help. This composes with superpowers' own subagent-driven development and code-reviewer agent rather than replacing it.
|
||||
|
||||
**V1 milestone & issues** (after I approve the design):
|
||||
|
||||
- Create a **V1 milestone** on Forgejo containing **all V1 tasks as issues** — each with a clear title, description, acceptance criteria, labels (type / component / priority), and an assigned sprint.
|
||||
- Create a **V2 milestone** with **coarse, low-detail issues** as placeholders for known-later work. Don't over-specify V2 — it will change as V1 teaches us things. Refine it only once V1 is done or something forces it.
|
||||
|
||||
**Sprints are end-to-end.** V1 is a sequence of sprints, each shipping a **thin but complete e2e slice**, not a horizontal layer. Illustrative only — the agents set the real breakdown:
|
||||
|
||||
- *Sprint 0 — skeleton:* repo scaffolding, green CI (Section 11), the core interfaces, the file-sink, and a trivial walk → stub-extractor → IR → Meilisearch → search path that proves the whole pipe end-to-end.
|
||||
- *Sprint 1+ — Priority 1 (scanned docs):* a real OCR slice end to end, with tests, behind the same interfaces.
|
||||
- then **Priority 2 (Office)**, then **Priority 3 (A/V)** — each its own sprint(s).
|
||||
|
||||
Every sprint ends with something demonstrable, tested, and merged.
|
||||
|
||||
**Dev loop (per issue)** — unless I tell you otherwise for a given issue or sprint:
|
||||
|
||||
1. **Pick an issue** from the active sprint of the milestone.
|
||||
2. **Create a git worktree for a new branch** off the latest `main` (see below) and **implement test-first** there (superpowers TDD: red → green → refactor); keep the change scoped to that issue.
|
||||
3. **Open a PR** that links the issue (`Closes #N`).
|
||||
4. **Review:** the code-reviewer / Tech Lead agent reviews against the plan and standards; address blocking findings.
|
||||
5. **CI must be green** (Section 11).
|
||||
6. **I approve and merge.** I do the final approval/merge myself — **do not self-merge** unless I've explicitly allowed it for that issue or sprint.
|
||||
7. Update the issue + milestone status, then pick the next issue.
|
||||
|
||||
**Always use git worktrees for new branches.** Never commit on `main` and never reuse one working directory across branches. Every issue gets its own **git worktree** in a **sibling** folder next to the repo, created from an up-to-date `main`, using this convention:
|
||||
|
||||
- **Branch:** `<type>/<issue>-<slug>` where type ∈ `feat | fix | chore | docs | refactor | test` — e.g. `feat/42-pdf-ocr-extractor`.
|
||||
- **Worktree path:** `../<repo>.worktrees/<issue>-<slug>`: one flat directory per issue, sibling to the repo root (e.g. `../filesearch.worktrees/42-pdf-ocr-extractor`). Sibling placement keeps worktrees out of the repository tree, so no `.gitignore` entry is needed and no tooling traverses them.
|
||||
- **After merge:** remove the worktree and delete the branch (`git worktree remove …` + `git branch -d …`).
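The naming convention above is mechanical enough to compute. The helper below is a hypothetical illustration (the function name and return shape are assumptions); it returns the branch name, the worktree path, and a `git worktree add` argument list without running anything.

```python
from __future__ import annotations

ALLOWED_TYPES = {"feat", "fix", "chore", "docs", "refactor", "test"}


def worktree_plan(repo_name: str, branch_type: str, issue: int, slug: str) -> tuple[str, str, list[str]]:
    """Derive branch, sibling worktree path, and the git command for one issue."""
    if branch_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown branch type: {branch_type!r}")
    branch = f"{branch_type}/{issue}-{slug}"
    # Sibling folder next to the repo root, one flat directory per issue.
    path = f"../{repo_name}.worktrees/{issue}-{slug}"
    # `git worktree add -b <branch> <path> <start-point>` creates both at once.
    command = ["git", "worktree", "add", "-b", branch, path, "main"]
    return branch, path, command
```

The command assumes the local `main` has already been updated; fetching first is left to the caller.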
|
||||
|
||||
This keeps `main` clean, isolates each issue's work, and lets multiple role agents work in parallel without colliding. It matches superpowers' branch-finishing flow, which already creates and tidies up worktrees — let it manage them where it can.
|
||||
|
||||
**Forgejo access.** Use the **Forgejo MCP connector** (already installed) to create the milestone and issues, open PRs, post reviews, and update statuses. First **discover what the MCP actually exposes** and confirm it supports every operation this workflow needs (create milestone, create/label issues, open PR, request/post review, set status, and — if I ever delegate it — merge). If any needed operation is missing, flag it rather than working around it, and we'll add a Forgejo API-token + `tea` CLI fallback.
|
||||
|
||||
## 14. Questions to confirm with me before building
|
||||
|
||||
Ask me these (and anything else you need) up front:
|
||||
|
||||
- **Runtime:** Python — **confirmed.** Flag if any required dependency forces a non-Python component.
|
||||
- **Search type:** v1 ships **keyword-only** — **confirmed** (vectors designed-for but switched off; Section 9).
|
||||
- **Access control:** **single-user, no per-user access** — **confirmed** (no tenant tokens / ACLs needed).
|
||||
- **Forgejo access:** the **Forgejo MCP** is installed — **confirmed** as the way you'll create the milestone/issues and open PRs. Verify it exposes every operation the workflow needs and flag any gaps (Section 13).
|
||||
- **Forgejo runners:** I plan to use the `Crown0815/Forgejo-runner-windows-builder` Windows runner alongside the Linux/Docker runner — confirm the registered runner **labels** (e.g. `docker`, `windows`) so the workflows target them correctly.
|
||||
- **Distribution:** Docker Compose is the default — requiring a container runtime (Docker) is **confirmed acceptable**; no Docker-free native installer needed for v1 (Section 12).
|
||||
- **My existing OCR/scanned-doc setup:** which local model(s) and how they're served (in-process library? local server? Ollama-style?). I'll hand over my findings — how do you want them?
|
||||
- **UI approach & tech:** Option A (direct-to-engine) vs. Option B (Python search API), and a preference among Streamlit/NiceGUI vs. minimal-JS HTMX vs. a full JS frontend? `<FILL IN>`
|
||||
- **Indexing granularity:** whole documents or chunked? Any known max document/field sizes I care about? `<FILL IN>`
|
||||
- **Scale:** roughly how many files / total volume? (drives batching and concurrency design) `<FILL IN>`
|
||||
- **Document languages?** (affects OCR, transcription, and Meilisearch settings) `<FILL IN>`
|
||||
- **Hardware:** GPU available for local models? How much VRAM/RAM? `<FILL IN>`
|
||||
- **Where does Meilisearch run** — same machine, or a server? Is it already deployed? `<FILL IN>`
|
||||
- **File access:** local disks only, or network shares / UNC paths too? `<FILL IN>`
|
||||
- **How should indexing run** — manual CLI only for now, or do you want a watched-folder / scheduled / background-service mode? `<FILL IN>`
|
||||
|
||||
---
|
||||
|
||||
### Working agreement
|
||||
|
||||
Be pragmatic: don't over-engineer, prefer proven libraries that meet the local-model constraint, and keep the layering honest. When a decision has real trade-offs, show me the options and your recommendation instead of guessing. The intermediate representation and the swappable-sink boundary are the two things that must never be compromised for short-term convenience. Keep me in the loop at the gates that matter: I approve the plan before building, and I approve and merge each PR myself — don't self-merge unless I've said so.
|
||||
510
docs/research/A-meilisearch.md
Normal file
|
|
@ -0,0 +1,510 @@
|
|||
# Agent A — Meilisearch Research Findings
|
||||
|
||||
**Date:** 2026-07-01
|
||||
**Meilisearch version verified against:** v1.48.3 (released 2026-06-29)
|
||||
**Primary sources:** https://www.meilisearch.com/docs/, GitHub releases, llms.txt index
|
||||
|
||||
---
|
||||
|
||||
## 1. Summary and Concrete Recommendations
|
||||
|
||||
**Recommended version:** Pin to `getmeili/meilisearch:v1.48.3` (or the latest stable v1.48.x patch). The AI/vector search features (embedders, hybrid search) are production-stable in v1.x with no experimental flags needed; they do not need to be active in v1 — the embedder config can be declared at index creation and left dormant.
|
||||
|
||||
**Key decisions:**
|
||||
|
||||
| Decision | Recommendation |
|
||||
|---|---|
|
||||
| Index count | Single index `digger_documents` + `localizedAttributes` for Arabic+English |
|
||||
| Primary key | `id` field = SHA-256 hex of `abs_canonical_path + "\|" + content_hash` |
|
||||
| Embedder strategy | Declare `userProvided` embedder at index creation (no external calls, no cost); populate `_vectors.digger_semantic` in documents only when vectors are ready in v2 |
|
||||
| Arabic | Supported natively via Charabia; use `localizedAttributes` with `["ara","eng"]`; configure Arabic stop words manually |
|
||||
| Payload limit | Raise `MEILI_HTTP_PAYLOAD_SIZE_LIMIT` to 500MB; batch documents to stay well below that |
|
||||
| Long-doc limit | Plan the chunked indexing path now (65,535-word field limit); whole-document indexing is fine for v1 if you truncate `content_text` to the first 60K words and record the truncation in the file's status |
|
||||
| Deduplication | `distinctAttribute: null` by default (one doc per file path); `content_hash` as distinct attribute is optional — surface decision to human |
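The primary-key recipe in the table can be sketched in a few lines. This assumes the path has already been canonicalized by the caller and that `content_hash` is a hex string; the function name is hypothetical.

```python
import hashlib


def document_id(abs_canonical_path: str, content_hash: str) -> str:
    """SHA-256 hex of path + "|" + content hash, usable as the `id` field."""
    material = f"{abs_canonical_path}|{content_hash}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()
```

A hex digest contains only `0-9` and `a-f`, so it stays within the characters Meilisearch accepts for document ids. One consequence worth keeping in mind: because the content hash is part of the key, an edited file yields a new id, so the previous document has to be deleted explicitly rather than overwritten.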
|
||||
|
||||
---
|
||||
|
||||
## 2. Proposed Meilisearch Settings JSON
|
||||
|
||||
Apply this with `PATCH /indexes/digger_documents/settings` after index creation. The payload is plain JSON, which does not support comments, so the rationale for each setting is given in the notes that follow the block.
|
||||
|
||||
```json
|
||||
{
|
||||
"searchableAttributes": [
|
||||
"content_text",
|
||||
"filename",
|
||||
"title",
|
||||
"tags",
|
||||
"metadata.author",
|
||||
"metadata.keywords",
|
||||
"metadata.subject",
|
||||
"path"
|
||||
],
|
||||
"filterableAttributes": [
|
||||
"file_type",
|
||||
"mime_type",
|
||||
"detected_language",
|
||||
"processing_status",
|
||||
"extractor_name",
|
||||
"source.host",
|
||||
"source.drive",
|
||||
"modified_at_epoch",
|
||||
"created_at_epoch",
|
||||
"file_size_bytes"
|
||||
],
|
||||
"sortableAttributes": [
|
||||
"modified_at_epoch",
|
||||
"created_at_epoch",
|
||||
"file_size_bytes",
|
||||
"filename"
|
||||
],
|
||||
"displayedAttributes": ["*"],
|
||||
"rankingRules": [
|
||||
"words",
|
||||
"typo",
|
||||
"proximity",
|
||||
"attribute",
|
||||
"sort",
|
||||
"exactness"
|
||||
],
|
||||
"distinctAttribute": null,
|
||||
"faceting": {
|
||||
"maxValuesPerFacet": 100,
|
||||
"sortFacetValuesBy": {
|
||||
"*": "count"
|
||||
}
|
||||
},
|
||||
"typoTolerance": {
|
||||
"enabled": true,
|
||||
"minWordSizeForTypos": {
|
||||
"oneTypo": 5,
|
||||
"twoTypos": 9
|
||||
},
|
||||
"disableOnWords": [],
|
||||
"disableOnAttributes": ["path", "content_hash", "id"],
|
||||
"disableOnNumbers": true
|
||||
},
|
||||
"synonyms": {},
|
||||
"stopWords": [],
|
||||
"localizedAttributes": [
|
||||
{
|
||||
"attributePatterns": ["content_text", "title", "metadata.*", "tags"],
|
||||
"locales": ["ara", "eng"]
|
||||
}
|
||||
],
|
||||
"pagination": {
|
||||
"maxTotalHits": 10000
|
||||
},
|
||||
"proximityPrecision": "byWord",
|
||||
"embedders": {
|
||||
"digger_semantic": {
|
||||
"source": "userProvided",
|
||||
"dimensions": 768
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Notes on this config
|
||||
|
||||
- **`searchableAttributes` order matters** for the `attribute` ranking rule: `content_text` is first (dominant for relevance), `filename` second (boosts exact filename hits), `path` last (so path-matching doesn't override content relevance).
|
||||
- **`filterableAttributes`** — `source.parent_dir` is omitted because filterable attribute values are hard-limited to **468 bytes** (LMDB constraint). Full paths can exceed this; store a normalized parent-dir token (≤ 468 bytes) if you want faceted folder filtering.
|
||||
- **`modified_at_epoch` and `created_at_epoch`** are Unix epoch integers, enabling range filters (`modified_at_epoch > 1700000000`) without date parsing in the filter language.
|
||||
- **`embedders.digger_semantic`** with `userProvided` + `dimensions: 768` reserves the vector slot. Meilisearch does not call any external service for userProvided embedders; it simply expects documents to optionally include a `_vectors.digger_semantic` array of 768 floats. Documents without the field are indexed and searched normally in keyword mode.
|
||||
- **`stopWords: []`** — Start empty; add Arabic and English stop words once you have a representative corpus. See Section 4 for Arabic-specific guidance.
|
||||
- **`maxTotalHits: 10000`** — raised from the default 1000 to support pagination over large result sets without hitting the cap; keep bounded to protect performance.
|
||||
- **`distinctAttribute: null`** — one document per file path. If you decide "same content at multiple paths → single document", change this to `"content_hash"` and revisit the primary key design (you'd want path stored as an array).
|
||||
|
||||
### Index creation call
|
||||
|
||||
```http
|
||||
POST /indexes
|
||||
Content-Type: application/json
|
||||
|
||||
{
|
||||
"uid": "digger_documents",
|
||||
"primaryKey": "id"
|
||||
}
|
||||
```
|
||||
|
||||
Setting the primary key explicitly at creation avoids auto-detection races. This is an async operation — check the returned `taskUid`.
|
||||
|
||||
---
|
||||
|
||||
## 3. Detailed Findings
|
||||
|
||||
### 3.1 Index Creation and Settings API
|
||||
|
||||
**Source:** https://www.meilisearch.com/docs/reference/api/settings
|
||||
|
||||
Indexes are created via `POST /indexes` with `uid` (index name) and optional `primaryKey`. All settings changes are applied via `PATCH /indexes/{uid}/settings` (partial update: only supplied keys change) or the individual sub-endpoints (`/settings/searchableAttributes`, etc.). To reset a setting to its default, send `DELETE` to the sub-endpoint or pass `null` in a PATCH.
|
||||
|
||||
Settings updates are **asynchronous** — they return a task object with a `taskUid`; the index remains queryable but with the old settings until the task completes.
|
||||
|
||||
**Full settings API surface relevant to digger:**
|
||||
|
||||
| Setting | Default | Notes |
|
||||
|---|---|---|
|
||||
| `searchableAttributes` | `["*"]` | Order determines attribute ranking weight |
|
||||
| `filterableAttributes` | `[]` | Must declare before filtering/faceting; triggers re-indexing |
|
||||
| `sortableAttributes` | `[]` | Must declare before sorting; triggers re-indexing |
|
||||
| `displayedAttributes` | `["*"]` | Controls what's returned in search results (not document GET) |
|
||||
| `rankingRules` | see above | Custom sort rules also valid, e.g. `"modified_at_epoch:desc"` |
|
||||
| `distinctAttribute` | `null` | Returns max 1 doc per distinct value |
|
||||
| `typoTolerance` | enabled | See 3.4 |
|
||||
| `faceting` | `{maxValuesPerFacet:100}` | See 3.2 |
|
||||
| `synonyms` | `{}` | One-way per entry; declare both directions for mutual synonyms |
|
||||
| `stopWords` | `[]` | Case-insensitive; triggers re-indexing |
|
||||
| `localizedAttributes` | `[]` | Locale-specific tokenization per attribute pattern |
|
||||
| `embedders` | `{}` | AI/vector search config; see 3.5 |
|
||||
| `pagination` | `{maxTotalHits:1000}` | Cap on pageable results |
|
||||
| `proximityPrecision` | `"byWord"` | `"byAttribute"` is faster but less precise |
|
||||
| `searchCutoffMs` | `null` (uses 1500ms) | Hard timeout per search request |
|
||||
|
||||
Adding/changing `filterableAttributes` or `sortableAttributes` triggers a **full re-index** of that attribute. Plan the full set before first production import to avoid this cost on a large corpus.
|
||||
|
||||
### 3.2 Faceting, Synonyms, Stop Words, Typo Tolerance
|
||||
|
||||
**Faceting**
|
||||
|
||||
Configure via `faceting` setting. Facets are returned in search results when the `facets` search parameter lists the filterable attributes to aggregate. Key settings:
|
||||
|
||||
```json
|
||||
{
|
||||
"maxValuesPerFacet": 100,
|
||||
"sortFacetValuesBy": {"*": "count"}
|
||||
}
|
||||
```
|
||||
|
||||
`sortFacetValuesBy` can be `"count"` (most-common first) or `"alpha"`. Facet search (type-ahead on facet values) is enabled by default via `facetSearch: true`.
|
||||
|
||||
For digger, primary facet candidates: `file_type`, `detected_language`, `source.drive`, `modified_at_epoch` (use range filter not facet), `processing_status`.
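As a sketch of how those facet candidates would be used at query time, the helper below builds a search request body with the `q`, `facets`, `filter`, and `limit` search parameters. The function name and its arguments are assumptions for illustration, and values are quoted naively (no escaping of embedded quotes).

```python
from __future__ import annotations

from typing import Any, Optional, Sequence


def build_search_body(
    text: str,
    facet_attributes: Sequence[str],
    modified_after_epoch: Optional[int] = None,
    file_types: Sequence[str] = (),
    limit: int = 20,
) -> dict[str, Any]:
    """Build a JSON body for POST /indexes/{uid}/search."""
    filters: list[str] = []
    if modified_after_epoch is not None:
        # Dates are range-filtered on the epoch integer, not faceted.
        filters.append(f"modified_at_epoch > {int(modified_after_epoch)}")
    if file_types:
        quoted = ", ".join(f'"{t}"' for t in file_types)
        filters.append(f"file_type IN [{quoted}]")
    body: dict[str, Any] = {"q": text, "facets": list(facet_attributes), "limit": limit}
    if filters:
        body["filter"] = " AND ".join(filters)
    return body
```

Every attribute named in `facets` or `filter` must already be listed in `filterableAttributes`, otherwise the engine rejects the request.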
|
||||
|
||||
**Synonyms**
|
||||
|
||||
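Associations are one-way as declared: `"AI": ["artificial intelligence"]` makes a search for `AI` also match `artificial intelligence`, but not the reverse. For mutual synonyms, declare every direction, e.g. `"phone": ["mobile"]` together with `"mobile": ["phone"]`. Synonyms do not affect typo-tolerance or prefix search. Start empty; add as users report missed results.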
|
||||
|
||||
**Stop Words**
|
||||
|
||||
No built-in list for any language. Configure manually as an array of strings (case-insensitive). Updating stop words triggers re-indexing. For Arabic stop words see Section 4.
|
||||
|
||||
**Typo Tolerance**
|
||||
|
||||
- Enabled by default; words under 5 chars get 0 typo allowance; 5–8 chars get 1; 9+ chars get 2.
|
||||
- Disable on: `path`, `content_hash`, `id` (these should match exactly or not at all).
|
||||
- `disableOnNumbers: true` prevents `123` from matching `124`.
|
||||
- For Arabic: the 5-char minimum may not behave as expected for short Arabic roots (3–4 letters). Consider `disableOnAttributes: ["content_text"]` if Arabic content drives false positives, or raise `oneTypo` minimum to 7 for the corpus. Test empirically.
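The length thresholds above can be checked against sample words before committing to a setting. This is a minimal sketch, assuming the `typoTolerance` field names from the Meilisearch settings API and treating word length as a count of code points, which only approximates the engine's own counting. `typos_allowed` is a hypothetical helper, not part of any client library.

```python
# Sketch of a typoTolerance settings payload. The values mirror the defaults
# and the "disable on" list discussed above; they are assumptions for digger.
TYPO_TOLERANCE = {
    "enabled": True,
    "minWordSizeForTypos": {"oneTypo": 5, "twoTypos": 9},
    "disableOnAttributes": ["path", "content_hash", "id"],
}


def typos_allowed(word: str, settings: dict = TYPO_TOLERANCE) -> int:
    """Approximate how many typos a query word of this length may contain."""
    if not settings["enabled"]:
        return 0
    sizes = settings["minWordSizeForTypos"]
    length = len(word)  # code points; an approximation of the engine's count
    if length >= sizes["twoTypos"]:
        return 2
    if length >= sizes["oneTypo"]:
        return 1
    return 0
```

Running a list of representative Arabic and English query words through `typos_allowed` gives a quick view of which ones would receive any typo allowance under a candidate threshold.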
|
||||
|
||||
### 3.3 Document Operations and Tasks API
|
||||
|
||||
**Document semantics**
|
||||
|
||||
- `POST /indexes/{uid}/documents`: **add-or-replace**. If a document with the same primary key exists, it is completely replaced by the new document (fields missing from the new version are deleted).
- `PUT /indexes/{uid}/documents`: **add-or-update** (partial merge). If a document with the same primary key exists, only the provided fields are updated; absent fields are preserved.

For digger, use **POST** (add-or-replace) during the main ingestion loop so that re-processing a file produces a clean, complete document. Use **PUT** only for targeted partial updates (e.g. enriching with embeddings later without re-extracting content).
|
||||
|
||||
**Delete** is via `DELETE /indexes/{uid}/documents/{documentId}` (single) or `POST /indexes/{uid}/documents/delete-batch` (array of IDs) or `POST /indexes/{uid}/documents/delete` (filter expression — powerful for bulk deletes, e.g. remove all docs from a deleted directory).
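A delete-by-filter call for a removed directory could be sketched as below. It only builds the request, so it can be inspected without a running server. `source_folder` is a hypothetical short, filterable attribute (the full `path` is a poor filter candidate), and the quoting of the filter value is a simple assumption that should be checked against the filter expression reference.

```python
import json
import urllib.request


def build_delete_by_filter(base_url: str, index_uid: str, api_key: str,
                           folder_token: str) -> urllib.request.Request:
    """Build (but do not send) a bulk delete for one source folder token."""
    # Escape backslashes and double quotes so the value stays one quoted string.
    escaped = folder_token.replace("\\", "\\\\").replace('"', '\\"')
    body = {"filter": f'source_folder = "{escaped}"'}
    return urllib.request.Request(
        url=f"{base_url}/indexes/{index_uid}/documents/delete",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would enqueue a deletion task, which then has to be tracked like any other task.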
|
||||
|
||||
**Batching**
|
||||
|
||||
Send arrays of documents: `[{...}, {...}, ...]`. No strict per-batch document count limit, but keep payload under `MEILI_HTTP_PAYLOAD_SIZE_LIMIT` (default 100MB, recommend raising to 500MB for large OCR outputs). Meilisearch groups compatible sequential document-add tasks into internal batches automatically.
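One way to stay under the payload limit is to size batches by encoded bytes rather than by document count. The following is a sketch only; `batch_by_payload_size` is a hypothetical helper and the limit passed in is whatever the instance is configured with.

```python
import json
from typing import Iterable, Iterator


def batch_by_payload_size(docs: Iterable[dict], max_bytes: int) -> Iterator[bytes]:
    """Yield JSON array payloads, each at most max_bytes when UTF-8 encoded."""
    current: list[bytes] = []
    size = 2  # the enclosing "[" and "]"
    for doc in docs:
        encoded = json.dumps(doc, ensure_ascii=False).encode("utf-8")
        needed = len(encoded) + (1 if current else 0)  # comma separator
        if current and size + needed > max_bytes:
            # Flush the batch collected so far and start a new one.
            yield b"[" + b",".join(current) + b"]"
            current, size = [], 2
            needed = len(encoded)
        if size + needed > max_bytes:
            raise ValueError("single document exceeds the payload limit")
        current.append(encoded)
        size += needed
    if current:
        yield b"[" + b",".join(current) + b"]"
```

Each yielded payload can be sent as the body of one document-add request. A document that alone exceeds the limit is surfaced as an error instead of being silently dropped.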
|
||||
|
||||
**Tasks API**
|
||||
|
||||
Every mutating operation returns:
|
||||
```json
|
||||
{"taskUid": 12, "indexUid": "digger_documents", "status": "enqueued", "type": "documentAdditionOrUpdate"}
|
||||
```
|
||||
|
||||
Poll `GET /tasks/{taskUid}` until `status` is `succeeded`, `failed`, or `canceled`. Failed tasks include an `error` object with `code`, `message`, `type`, and a documentation link. Never assume a document is indexed until the task succeeds.
|
||||
|
||||
For the digger pipeline, implement a task-result callback that:
|
||||
1. On `succeeded`: mark files as indexed in the State Store.
|
||||
2. On `failed`: log the error, mark file as `index_failed` in State Store.
|
||||
3. Use `GET /tasks?indexUids=digger_documents&statuses=failed&limit=100` to poll for failures in bulk.
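The callback described above can be sketched as a polling loop with an injected fetch function, which keeps it testable without a live server. `fetch_task` is assumed to wrap `GET /tasks/{taskUid}` and return the parsed JSON; the State Store status `indexed` is a hypothetical name, and mapping `canceled` to `index_failed` is a choice made here, not something the design prescribes.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def wait_for_task(task_uid: int,
                  fetch_task: Callable[[int], dict],
                  interval_s: float = 0.5,
                  timeout_s: float = 300.0,
                  sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the task reaches a terminal status, then return it."""
    deadline = time.monotonic() + timeout_s
    while True:
        task = fetch_task(task_uid)
        if task["status"] in TERMINAL_STATUSES:
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_uid} still {task['status']}")
        sleep(interval_s)


def state_for(task: dict) -> str:
    """Map a finished task to the State Store status for its files."""
    return "indexed" if task["status"] == "succeeded" else "index_failed"
```

For large imports, per-task polling can be combined with the bulk failure query from step 3 so that a missed callback still gets reconciled.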
|
||||
|
||||
**Important task types for digger:**
|
||||
- `documentAdditionOrUpdate` / `documentDeletion` — document operations
|
||||
- `settingsUpdate` — settings changes (wait for this before first document import)
|
||||
- `indexCreation` — wait for this before any other operation on the index
|
||||
|
||||
### 3.4 Document and Field Limits
|
||||
|
||||
Source: https://www.meilisearch.com/docs/resources/help/known_limitations
|
||||
|
||||
| Limit | Value | Implication for digger |
|
||||
|---|---|---|
|
||||
| Default payload per HTTP request | 100MB | Raise to 500MB; batch carefully |
|
||||
| Max attributes per index | 65,536 | Not a concern for flat IR |
|
||||
| Max documents per index | ~4.3 billion | Covers any realistic corpus |
|
||||
| **Max positions per string field** | **65,535 words** | Long OCR'd docs / transcripts may be truncated silently |
|
||||
| Primary key value length | 511 bytes | SHA-256 hex (64 chars) is safe |
|
||||
| Filterable attribute value | **468 bytes** | Long path strings won't filter correctly |
|
||||
| Max concurrent searches | 1,000 | Single-user; not a concern |
|
||||
| Database size | ~80TiB (2TiB recommended) | Not a concern for ≤5TB corpus |
|
||||
| Max query terms | 10 | Queries longer than 10 tokens have later tokens ignored |
|
||||
|
||||
**Critical: 65,535 position limit on `content_text`.**
|
||||
A 3-hour audio transcript or a long PDF can exceed this. Words beyond position 65,535 are silently ignored — they are not indexed and will not appear in keyword search results. Options:
|
||||
1. **Truncate at index time**: Store first 60K words in `content_text`; store overflow in `content_text_overflow` (unindexed, displayed only). Simple, works for v1.
|
||||
2. **Chunked indexing**: Split long documents into chunk documents sharing a `parent_id` field; use `distinctAttribute: "parent_id"` to collapse chunks in results (one best-hit per parent). More complex but gives full coverage. This is the v2 path; design the IR to support it.
|
||||
|
||||
Recommendation: truncate in v1, design IR fields for chunked indexing from day one (include `chunk_index`, `chunk_count`, `parent_id` as optional fields that are null for whole-document mode).
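Both options can be sketched in a few lines. This assumes whitespace-separated words as a stand-in for Meilisearch positions (the real tokenizer may count differently, which is why the cut-off sits below the hard limit) and uses a hypothetical `parent_id` plus chunk index as the chunk document id. Joining words with single spaces normalizes the original whitespace.

```python
MAX_INDEXED_WORDS = 60_000  # safety margin below the 65,535 position limit


def split_for_index(text: str, max_words: int = MAX_INDEXED_WORDS) -> dict:
    """Option 1: keep the head searchable, park the tail in an overflow field."""
    words = text.split()
    if len(words) <= max_words:
        return {"content_text": text,
                "content_text_overflow": None,
                "content_text_truncated": False}
    return {"content_text": " ".join(words[:max_words]),
            "content_text_overflow": " ".join(words[max_words:]),
            "content_text_truncated": True}


def chunk_documents(parent_id: str, text: str,
                    max_words: int = MAX_INDEXED_WORDS) -> list[dict]:
    """Option 2: emit one chunk document per slice, sharing parent_id."""
    words = text.split()
    pieces = [words[i:i + max_words] for i in range(0, len(words), max_words)] or [[]]
    return [{"id": f"{parent_id}_{index}",
             "parent_id": parent_id,
             "chunk_index": index,
             "chunk_count": len(pieces),
             "content_text": " ".join(piece)}
            for index, piece in enumerate(pieces)]
```

Because both functions produce the same field names, a document can move from truncated to chunked indexing without changing the IR shape.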
|
||||
|
||||
**Primary key format constraint (critical for IR design):**
|
||||
Document IDs must contain **only alphanumeric characters, hyphens (`-`), and underscores (`_`)**. No slashes, dots, colons, spaces, or Unicode. Max 511 bytes.
|
||||
|
||||
Since our IR `id` is derived from `path + content_hash`:
|
||||
- A file path like `/home/user/docs/report 2024.pdf` cannot be used directly.
|
||||
- Recommended encoding: `id = sha256_hex(canonical_absolute_path + "|" + content_hash_hex)` — produces a 64-character lowercase hex string, well within the 511-byte limit and fully valid.
|
||||
- Store the original path in a separate `path` field for display and filtering.
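The recommended encoding is small enough to show in full. This sketch assumes the path has already been canonicalized by the caller; `document_id` and `VALID_ID` are illustrative names.

```python
import hashlib
import re

# Characters and length Meilisearch accepts for a primary key value.
VALID_ID = re.compile(r"[A-Za-z0-9_-]{1,511}")


def document_id(canonical_path: str, content_hash_hex: str) -> str:
    """Derive a primary-key-safe id from the path and the content hash."""
    material = f"{canonical_path}|{content_hash_hex}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()
```

The id changes whenever either the path or the content changes, so a moved or edited file yields a new document and the old one must be deleted explicitly.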
|
||||
|
||||
### 3.5 Vector / Hybrid Search and Embedders
|
||||
|
||||
**Stability:** Vector and hybrid search are production-stable in current Meilisearch v1.x releases; declaring embedder settings in v1.48 requires no experimental feature flag.
|
||||
|
||||
**Embedder sources available:**
|
||||
|
||||
| Source | Use case |
|
||||
|---|---|
|
||||
| `openAi` | Calls OpenAI API (requires network + API key) |
|
||||
| `huggingFace` | Downloads and runs HuggingFace model locally (in-process) |
|
||||
| `ollama` | Calls local Ollama server |
|
||||
| `rest` | Calls any HTTP embedding API (our local model server) |
|
||||
| `userProvided` | Embeddings pre-computed externally; included in document `_vectors` field |
|
||||
| `composite` | Combines two sources (indexing vs. search) |
|
||||
|
||||
**For digger:** `userProvided` is the correct choice for v1 vector readiness. Reasons:
|
||||
- All inference is local (requirement). Embeddings are generated by our pipeline, not by Meilisearch.
|
||||
- Decouples embedding generation from indexing: the pipeline can include/omit `_vectors` per document independently.
|
||||
- Meilisearch does not call any external service; it simply stores and searches the vectors.
|
||||
- `dimensions: 768` covers most local embedding models (e.g. BAAI/bge-base, nomic-embed-text, multilingual-e5-base).
|
||||
|
||||
**`userProvided` embedder config:**
|
||||
```json
|
||||
{
|
||||
"digger_semantic": {
|
||||
"source": "userProvided",
|
||||
"dimensions": 768
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Including vectors in documents** (v2, when embeddings are ready):
|
||||
```json
|
||||
{
|
||||
"id": "a3f9c1d2...",
|
||||
"_vectors": {
|
||||
"digger_semantic": [0.12, -0.34, 0.56, ...]
|
||||
}
|
||||
}
|
||||
```
|
||||
Documents without `_vectors.digger_semantic` are indexed normally and participate in keyword search but not semantic/hybrid search.
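Since a wrong vector length would only surface as a failed task, it may be worth validating before sending. A minimal sketch, assuming the embedder name and dimension from the config above; `with_vector` is a hypothetical helper.

```python
import math
from typing import Optional, Sequence

EMBEDDER_NAME = "digger_semantic"
DIMENSIONS = 768


def with_vector(doc: dict, vector: Optional[Sequence[float]]) -> dict:
    """Return a copy of doc, adding _vectors only when a valid vector exists."""
    out = dict(doc)
    if vector is None:
        return out  # keyword-only document
    if len(vector) != DIMENSIONS:
        raise ValueError(f"expected {DIMENSIONS} dimensions, got {len(vector)}")
    if not all(math.isfinite(x) for x in vector):
        raise ValueError("vector contains NaN or infinity")
    out["_vectors"] = {EMBEDDER_NAME: [float(x) for x in vector]}
    return out
```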
|
||||
|
||||
**REST embedder alternative** (if you want Meilisearch to auto-embed during indexing by calling your local model server):
|
||||
```json
|
||||
{
|
||||
"digger_semantic": {
|
||||
"source": "rest",
|
||||
"url": "http://embedding-service:8080/embed",
|
||||
"request": {
|
||||
"inputs": ["{{text}}", "{{..}}"]
|
||||
},
|
||||
"response": {
|
||||
"data": ["{{embedding}}", "{{..}}"]
|
||||
},
|
||||
"dimensions": 768,
|
||||
"documentTemplate": "{{doc.content_text | truncatewords: 200}}"
|
||||
}
|
||||
}
|
||||
```
|
||||
This makes Meilisearch call the embedding service on every document add/update. More convenient but adds coupling: indexing fails if the embedding service is down. For digger's standalone-pipeline requirement, `userProvided` is preferable.
|
||||
|
||||
**Warning:** Changing `source`, `model`, `dimensions`, or `documentTemplate` on an existing embedder triggers **complete re-generation** of all embeddings. Pick dimensions and stick with them. 768 is a safe, widely-supported value.
|
||||
|
||||
**Hybrid search at query time:**
|
||||
```json
|
||||
{
|
||||
"q": "quarterly financial report",
|
||||
"hybrid": {
|
||||
"semanticRatio": 0.5,
|
||||
"embedder": "digger_semantic"
|
||||
}
|
||||
}
|
||||
```
|
||||
`semanticRatio` ranges from 0.0 (pure keyword) to 1.0 (pure semantic); default 0.5 balances both. This is a per-query parameter, not a settings parameter — no re-indexing needed to tune it.
|
||||
|
||||
**Key design implication:** Because the `userProvided` embedder is declared at index creation with zero cost, do this in v1 Sprint 0 scaffolding. Changing dimensions later requires re-indexing everything. Pick 768 now.
|
||||
|
||||
---
|
||||
|
||||
## 4. Arabic-Specific Notes
|
||||
|
||||
**Tokenization support:** Arabic is supported via the Charabia tokenizer (Meilisearch's Rust tokenization library). Capabilities confirmed in current docs:
|
||||
- **Article segmentation**: the Arabic definite article (ال, al-) is segmented from the noun, enabling `باب` to match `الباب`.
|
||||
- **Normalization**: decomposition, digit conversion, and non-spacing mark (diacritic/tashkeel) removal. This means `كَتَبَ` and `كتب` are treated as equivalent — correct behavior for Arabic search.
|
||||
|
||||
**Mixed Arabic + English documents:**
|
||||
|
||||
Option A (recommended for digger v1): **Single index with `localizedAttributes`**
|
||||
|
||||
```json
|
||||
"localizedAttributes": [
|
||||
{
|
||||
"attributePatterns": ["content_text", "title", "metadata.*", "tags"],
|
||||
"locales": ["ara", "eng"]
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
The `locales` array tells Meilisearch to apply both Arabic and English tokenization pipelines to these fields. For fields not in any pattern (e.g. `path`, `filename`), the default pipeline applies. ISO 639-3 locale codes: `ara` for Arabic, `eng` for English.
|
||||
|
||||
Pass the `locales` parameter in search queries that are in a specific language:
|
||||
```json
|
||||
{
|
||||
"q": "الوثائق المالية",
|
||||
"locales": ["ara"]
|
||||
}
|
||||
```
|
||||
Without the search-time `locales` parameter, Meilisearch auto-detects language — this works but explicit is more reliable for short queries.
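A simple script check is one way to decide when to send `locales` explicitly. The sketch below is a heuristic based on Unicode ranges, not a language detector: it only commits to a locale when the query is purely Arabic-script or purely Latin-script, and otherwise leaves detection to Meilisearch. The function names are illustrative.

```python
from typing import Optional


def query_locales(q: str) -> Optional[list]:
    """Return ["ara"], ["eng"], or None when the script is mixed or unclear."""
    arabic = sum(1 for ch in q
                 if "\u0600" <= ch <= "\u06FF" or "\u0750" <= ch <= "\u077F")
    latin = sum(1 for ch in q if ch.isascii() and ch.isalpha())
    if arabic and not latin:
        return ["ara"]
    if latin and not arabic:
        return ["eng"]
    return None


def search_body(q: str) -> dict:
    """Build a search request body, adding locales only when confident."""
    body = {"q": q}
    locales = query_locales(q)
    if locales is not None:
        body["locales"] = locales
    return body
```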
|
||||
|
||||
Option B: **Two indexes** (`digger_documents_ar`, `digger_documents_en`) with per-language settings. Advantages: cleaner stop words, better facet isolation. Disadvantages: cross-language search requires federated search (`POST /multi-search`), more operational complexity. Recommended only if query quality is poor with single-index approach.
|
||||
|
||||
**Stop words for Arabic:**
|
||||
No built-in Arabic stop words in Meilisearch. Common Arabic stop words to configure manually (select based on corpus):
|
||||
|
||||
```json
|
||||
"stopWords": [
|
||||
"في", "من", "على", "إلى", "عن", "مع", "هذا", "هذه", "ذلك", "تلك",
|
||||
"التي", "الذي", "الذين", "اللتي", "اللذان", "و", "أو", "ثم",
|
||||
"لكن", "إن", "أن", "ما", "لا", "قد", "كان", "هو", "هي",
|
||||
"the", "a", "an", "is", "are", "was", "were", "of", "in", "to", "for"
|
||||
]
|
||||
```
|
||||
|
||||
Add English stop words to the same list (single index). This is safe because stop word matching is exact and case-insensitive; Arabic and English words do not collide.
|
||||
|
||||
**Typo tolerance and Arabic:**
|
||||
Arabic words are frequently 3–4 letters for common roots (كتب, فتح, علم). The default `oneTypo` threshold of 5 characters means most short Arabic words get zero typo tolerance — this is **correct behavior** (a typo in a 3-letter Arabic word would usually produce a different word). Do not lower the threshold below 5 for Arabic. Consider `disableOnAttributes: ["content_text"]` if you find false positives in Arabic content; enable typo only for `filename` and `title`.
|
||||
|
||||
**Right-to-left display:**
|
||||
Meilisearch returns plain text; RTL rendering is a UI concern, not a Meilisearch setting.
|
||||
|
||||
**Recommendation:** Use a single index with `localizedAttributes` for v1. If search quality is unsatisfactory for Arabic queries, migrate to two-index approach in v2 (the `SearchProvider` interface insulates the rest of the pipeline from this change).
|
||||
|
||||
---
|
||||
|
||||
## 5. Operational: Docker, Keys, and Environment
|
||||
|
||||
### Docker setup
|
||||
|
||||
```yaml
|
||||
# docker-compose.yml (excerpt)
|
||||
services:
|
||||
meilisearch:
|
||||
image: getmeili/meilisearch:v1.48.3
|
||||
ports:
|
||||
- "7700:7700"
|
||||
environment:
|
||||
MEILI_MASTER_KEY: "${MEILI_MASTER_KEY}"
|
||||
MEILI_ENV: "production"
|
||||
MEILI_HTTP_PAYLOAD_SIZE_LIMIT: "536870912" # 512MB
|
||||
MEILI_NO_ANALYTICS: "true"
|
||||
MEILI_DB_PATH: "/meili_data/data.ms"
|
||||
MEILI_LOG_LEVEL: "INFO"
|
||||
volumes:
|
||||
- meili_data:/meili_data
|
||||
restart: unless-stopped
|
||||
|
||||
volumes:
|
||||
meili_data:
|
||||
```
|
||||
|
||||
Pin to a specific version tag — never use `latest` in production. Update version tag deliberately after testing.
|
||||
|
||||
### Master key and API keys
|
||||
|
||||
- **Master key**: `MEILI_MASTER_KEY` — minimum 16 bytes, alphanumeric. Required in `production` mode (enforced). Keep in `.env`, never commit.
|
||||
- **API keys**: created via `POST /keys` using the master key. Granular per-action permissions.
|
||||
- For digger's **read-side SearchProvider API**: create a search-only key:
|
||||
|
||||
```http
|
||||
POST /keys
|
||||
Authorization: Bearer <master_key>
|
||||
Content-Type: application/json
|
||||
|
||||
{
|
||||
"name": "digger-search-readonly",
|
||||
"description": "Search-only key for the read-side API",
|
||||
"actions": ["search"],
|
||||
"indexes": ["digger_documents"],
|
||||
"expiresAt": null
|
||||
}
|
||||
```
|
||||
|
||||
The response includes the key value. It can be read back later through `GET /keys` with the master key, so treat the master key and any stored key values as secrets. Use this key in the Python `SearchProvider` implementation; never expose the master key to the search API or any browser.
|
||||
|
||||
- **Pipeline indexer key**: create a separate key with `documents.add`, `documents.delete`, `documents.get`, `tasks.get`, and `tasks.cancel` permissions, scoped to `digger_documents` only.
|
||||
|
||||
```http
|
||||
POST /keys
Authorization: Bearer <master_key>
Content-Type: application/json

{
|
||||
"name": "digger-indexer",
|
||||
"actions": ["documents.add", "documents.delete", "documents.get", "tasks.get", "tasks.cancel"],
|
||||
"indexes": ["digger_documents"],
|
||||
"expiresAt": null
|
||||
}
|
||||
```
|
||||
|
||||
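- **Setup key** (one-time): use the master key only to create the index, apply settings, and create the operational keys above. Note that Meilisearch derives API key values from the master key, so rotating the master key after setup also changes the operational key values; re-read them via `GET /keys` and redistribute them whenever the master key is rotated.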
|
||||
|
||||
### Environment variables summary
|
||||
|
||||
| Variable | Value for digger | Notes |
|
||||
|---|---|---|
|
||||
| `MEILI_MASTER_KEY` | from `.env` | Min 16 bytes; mandatory in production |
|
||||
| `MEILI_ENV` | `production` | Enables auth enforcement |
|
||||
| `MEILI_HTTP_PAYLOAD_SIZE_LIMIT` | `536870912` | 512MB; handles large OCR batches |
|
||||
| `MEILI_MAX_INDEXING_MEMORY` | default (2/3 RAM) | Override if co-located with other services |
|
||||
| `MEILI_MAX_INDEXING_THREADS` | default (half CPUs) | Override if CPU contention is an issue |
|
||||
| `MEILI_NO_ANALYTICS` | `true` | Privacy; no telemetry |
|
||||
| `MEILI_DB_PATH` | `/meili_data/data.ms` | Map to named volume |
|
||||
| `MEILI_SCHEDULE_SNAPSHOT` | `86400` (optional) | Daily snapshots for disaster recovery |
|
||||
|
||||
---
|
||||
|
||||
## 6. Risks and Open Questions
|
||||
|
||||
### Risks
|
||||
|
||||
1. **65,535-word field limit**: Long OCR'd PDFs (>~250 pages) and long transcripts will be silently truncated. This is invisible — Meilisearch will not warn you. Mitigate in v1 by storing a `content_text_truncated: true` flag in the IR when truncation occurs, and planning chunked indexing for v2.
|
||||
|
||||
2. **468-byte filterable attribute value limit**: File paths on Windows (UNC paths, deep nesting) can exceed this. Do NOT make `path` a filterable attribute; instead derive a `source_folder` or `parent_dir_hash` short token. Add `path` to `displayedAttributes` only.
|
||||
|
||||
3. **Re-indexing cost on setting changes**: Adding a new `filterableAttribute` or changing `searchableAttributes` order triggers a full re-index of the entire corpus. With 500K files this can take hours. Plan the full attribute list before first production import.
|
||||
|
||||
4. **Embedder config lock-in**: Changing embedder `dimensions` requires re-indexing all vectors. Commit to 768 dimensions now. If a better model uses different dimensions, you'll need a new embedder name (supported — multiple embedders are allowed on one index) rather than changing the existing one.
|
||||
|
||||
5. **Arabic query quality**: Diacritic normalization is good, but very short Arabic queries (1–2 words) may produce overly broad results due to root-sharing words. Test with representative queries; may need to add domain-specific synonyms.
|
||||
|
||||
6. **Tasks API polling**: All mutating operations are async. The pipeline must implement task tracking to confirm indexing success before marking files as done in the State Store. Failure to do this risks "indexed" files that Meilisearch never actually committed.
|
||||
|
||||
7. **`maxTotalHits` cap**: The default is 1,000. Even if it is raised to 10,000, a query matching millions of documents will only return the first 10,000. This is correct behavior (pagination cap), not a bug. Make this visible to the user ("showing top 10,000 of N results").
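The short-token idea from risk 2 could look like the sketch below: hash the parent directory into a fixed-length value that is always safe to filter on, and keep the full path for display only. `parent_dir_hash` and the 32-character truncation are assumptions for illustration.

```python
import hashlib

MAX_FILTER_BYTES = 468  # filterable attribute value limit


def parent_dir_hash(path: str) -> str:
    """Fixed-length token for the directory containing the file."""
    normalized = path.replace("\\", "/")  # treat Windows and POSIX paths alike
    parent = normalized.rsplit("/", 1)[0] if "/" in normalized else ""
    return hashlib.sha256(parent.encode("utf-8")).hexdigest()[:32]


def is_filter_safe(value: str) -> bool:
    """True when the UTF-8 encoded value fits within the filter limit."""
    return len(value.encode("utf-8")) <= MAX_FILTER_BYTES
```

Note that the limit is in bytes, so an Arabic path reaches it in roughly half as many characters as an ASCII one.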
|
||||
|
||||
### Open Questions (for human)
|
||||
|
||||
1. **Deduplication policy**: Same content at multiple paths — one document (with `content_hash` as `distinctAttribute` and paths stored as array) or one document per path? Recommendation: one per path (simpler, easier deletion tracking), but this is a product decision.
|
||||
|
||||
2. **Index language strategy**: Single index with `localizedAttributes` vs. separate per-language indexes? Recommend single index for v1; revisit in v2 based on Arabic query quality.
|
||||
|
||||
3. **Chunked indexing**: When? v2, or does the corpus include enough very long documents that it's needed in v1? Answer affects Sprint 1 scope.
|
||||
|
||||
4. **Snapshot / backup strategy**: Enable `MEILI_SCHEDULE_SNAPSHOT` for automated daily backups? Dump vs. snapshot? (Dumps are portable across versions; snapshots are faster but version-locked.)
|
||||
|
||||
5. **Meilisearch version upgrade policy**: Pin to patch version or minor version? Recommend pinning to exact tag (e.g. `v1.48.3`) and upgrading deliberately with a tested migration path.
|
||||
|
||||
---
|
||||
|
||||
*Research by Agent A. Verified against live Meilisearch docs and GitHub releases on 2026-07-01.*
|
||||
341
docs/research/B-local-model-tooling.md
Normal file
|
|
@ -0,0 +1,341 @@
|
|||
# Research B: Local-Model Tooling & End-to-End Document Conversion
|
||||
|
||||
**Agent B — July 2026**
|
||||
|
||||
---
|
||||
|
||||
## 1. Summary & Recommendations
|
||||
|
||||
### Document-Conversion Strategy
|
||||
|
||||
Use a tiered approach, not a single toolkit:
|
||||
|
||||
| Tier | What it handles | Tool |
|
||||
|---|---|---|
|
||||
| 1 – Native digital | PDF (digital), DOCX, PPTX, XLSX, HTML, EPUB, LaTeX | **Docling** (adopt wholesale) |
|
||||
| 2 – Scanned / image-heavy | Scanned PDFs, JPEG/PNG images, handwriting, certificates, IDs, forms | **Qwen2.5-VL via Ollama** (existing arabic-ocr approach, wrapped behind ModelBackend) |
|
||||
| 3 – Email + edge formats | EML, MSG, XML, CSV, ZIP | **Unstructured** (per-format, open-source mode only) |
|
||||
| 4 – Markdown fallback | Any Office file Docling skips or fails on | **MarkItDown** (per-format fallback) |
|
||||
| Rejected | Java heavyweight, Tesseract-only | ~~Apache Tika~~ — do not use as primary |
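
The tiering above can be expressed as a small routing function. This is a sketch under stated assumptions: the tool labels are hypothetical strings, the suffix sets are taken from the table, and whether a PDF is scanned is passed in by the caller because it cannot be derived from the file name.

```python
from pathlib import PurePath
from typing import Optional

DOCLING = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".htm", ".epub", ".tex"}
VLM_OCR = {".jpg", ".jpeg", ".png"}
UNSTRUCTURED = {".eml", ".msg", ".xml", ".csv", ".zip"}
OFFICE = {".docx", ".pptx", ".xlsx"}


def route(filename: str, scanned: bool = False) -> str:
    """Pick the first-choice converter for a file."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in VLM_OCR or (suffix == ".pdf" and scanned):
        return "qwen_vl_ollama"  # Tier 2
    if suffix in DOCLING:
        return "docling"  # Tier 1
    if suffix in UNSTRUCTURED:
        return "unstructured"  # Tier 3
    return "unsupported"


def fallback(filename: str, failed_tool: str) -> Optional[str]:
    """Tier 4: MarkItDown only for Office files that Docling failed on."""
    suffix = PurePath(filename).suffix.lower()
    if failed_tool == "docling" and suffix in OFFICE:
        return "markitdown"
    return None
```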
|
||||
|
||||
**OCR backend verdict**: Keep **Qwen2.5-VL via Ollama** as v1 OCR backend. It is the only option in the field that handles Arabic handwriting, certificates, IDs, tables, and forms reliably in a single model. Fine-tuned 3B variants achieve CER under 2% on Arabic handwriting — better than Google Vision API. Surya is a viable printed-only fallback for speed. Tesseract and docTR are not adequate for Arabic handwriting.
|
||||
|
||||
**Ollama deployment recommendation**: Run Ollama as a **host service** (native install, outside Docker). Expose it on `0.0.0.0:11434`. Inside the Docker Compose pipeline, reference it via the `OLLAMA_HOST` env var, with `extra_hosts: [host.docker.internal:host-gateway]` on Linux so all platforms converge on the same `http://host.docker.internal:11434` default. Keep the endpoint fully overridable.
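Endpoint resolution inside the pipeline container could be sketched as follows. It assumes `OLLAMA_HOST` may be given with or without a scheme and falls back to the `host.docker.internal` default described above; `ollama_base_url` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit

DEFAULT_OLLAMA_HOST = "http://host.docker.internal:11434"


def ollama_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the Ollama base URL from the environment, with a default."""
    env = os.environ if env is None else env
    raw = (env.get("OLLAMA_HOST") or DEFAULT_OLLAMA_HOST).strip().rstrip("/")
    if "://" not in raw:
        raw = "http://" + raw  # accept bare host:port values
    if not urlsplit(raw).hostname:
        raise ValueError(f"invalid OLLAMA_HOST: {raw!r}")
    return raw
```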
|
||||
|
||||
---
|
||||
|
||||
## 2. Toolkit Comparison Table
|
||||
|
||||
| | **Docling** (IBM) | **MarkItDown** (Microsoft) | **Unstructured** | **Apache Tika** |
|
||||
|---|---|---|---|---|
|
||||
| **Version (Jul 2026)** | v2.107.0 | v0.1.6 | v0.23.1 | v3.3.0 |
|
||||
| **License** | MIT | MIT | Apache-2.0 | Apache-2.0 |
|
||||
| **File formats** | PDF, DOCX, PPTX, XLSX, HTML, EPUB, WAV, MP3, images, LaTeX, EML, MSG, and more | PDF, PPTX, DOCX, XLSX, images (EXIF+OCR), audio, HTML, CSV, JSON, XML, ZIP, EPub, YouTube URLs | PDF, DOCX, PPTX, XLSX, HTML, images, EML, MSG, XML, TXT; extensible | 1,400+ MIME types (Java-based universal parser) |
|
||||
| **OCR / VLM** | Built-in pipeline: EasyOCR (default); VLM pipeline via HuggingFace Transformers or any OpenAI-compatible API including **Ollama**. Supported models: SmolDocling (256M), GraniteDocling (258M), Qwen2.5-VL (3B), Pixtral, Phi-4, Granite Vision, NanoNets. Custom models via `InlineVlmOptions` with any HF `repo_id`. | Image OCR via `markitdown-ocr` plugin which calls any OpenAI-compatible API — including **Ollama**. No built-in OCR otherwise. | Tesseract (local, system install) + poppler. No VLM integration. Cloud API mode optional but not required. | Tesseract (local install required). No VLM. No custom model hooks. |
|
||||
| **Bring your own local model** | **Yes**: Ollama backend natively supported in VLM pipeline | **Yes**: via any OpenAI-compatible API (Ollama serves one under `/v1` by default; no extra flag is needed) | **No**: Tesseract only for OCR; no pluggable model interface | **No**: Tesseract only; tight Java coupling |
|
||||
| **Output structure** | Rich: Markdown, HTML, JSON (DocTags), page references, bounding boxes, table objects, reading order, image refs | Markdown only — flat, no page refs, no bounding boxes | Python element objects (`Title`, `NarrativeText`, `Table`, `Image`, etc.) with metadata; JSON/HTML serializable | Plain text + metadata; no structural segmentation |
|
||||
| **Arabic support** | Not explicitly documented; EasyOCR backend lacks good Arabic; VLM pipeline inherits whatever model you point at (Qwen2.5-VL = excellent Arabic) | None built-in; OCR plugin delegates to Ollama model | Via Tesseract ara/Arabic.traineddata — printed only, ~15% CER; no handwriting | Via tesseract-langpack-ara; same limitations as bare Tesseract |
|
||||
| **Table handling** | **Excellent** — dedicated TableFormer model; Markdown + JSON table objects with cell coordinates | Basic — tables in Markdown when detectable by the LLM | Basic — Table element type, no cell-level structure | None |
|
||||
| **Windows support** | Yes (x86_64 + arm64; cross-platform Python) | Yes | Yes (with caveats on binary deps) | Yes (Java) |
|
||||
| **Maintenance** | Very active: 189 releases, 62.4k GitHub stars, Jun 2026 release | Active: 19 releases, May 2026 release | Active: 232 releases, 15k stars | Active: Apache project, v3.3.0 Mar 2026 |
|
||||
| **Verdict** | **Adopt wholesale for Tier 1** | **Keep as lightweight fallback** | **Adopt per-format (email/edge); open-source mode only** | **Do not adopt** (Java dependency, Tesseract-only, no model hooks) |
|
||||
|
||||
### Why reject Apache Tika as primary
|
||||
|
||||
Tika requires a JVM. The project's zero-install Docker Compose goal means pulling a JVM image just for format parsing is disproportionate when Docling covers the same formats better. Tika has no VLM integration and its OCR path is Tesseract with no escape hatch. Acceptable as a last-resort format detection layer if needed, not as a pipeline stage.
|
||||
|
||||
---
|
||||
|
||||
## 3. OCR/VLM Stacks for Arabic + English
|
||||
|
||||
### 3.1 Tesseract 5.x
|
||||
|
||||
- Arabic printed text: ~15% CER (character error rate) on modern prints per 2025 benchmarks — passable for simple printed text, unacceptable for production.
|
||||
- Arabic handwriting: effectively unusable without heavy domain fine-tuning.
|
||||
- Tables/forms: no structural understanding; flat text only.
|
||||
- CPU speed: fast (LSTM line recogniser, not an autoregressive generative model).
|
||||
- Right-to-left: supported (Bidi output iterators since v4).
|
||||
- Use case in digger: only as a preprocessing fallback to detect "is there any text at all" before routing to VLM. Not a primary OCR engine for Arabic.
|
||||
- License: Apache-2.0.
|
||||
|
||||
### 3.2 PaddleOCR (PP-OCRv5 / PaddleOCR 3.0)
|
||||
|
||||
- Arabic support: yes, 109 languages including a dedicated `arabic_PP-OCRv3_mobile_rec` model (2M params). PP-OCRv5 extended script coverage in 2025.
|
||||
- PaddleOCR-VL: a 0.9B VLM variant for document parsing, multilingual. CPU-capable but slower.
|
||||
- Handwriting: weak — primarily optimised for printed text; accuracy drops significantly on cursive handwriting.
|
||||
- Tables/forms: basic layout analysis; no semantic field extraction.
|
||||
- CPU speed: fast for the base models; PaddleOCR-VL slower.
|
||||
- Windows: workable via pip; binary deps (PaddlePaddle wheel) can be fiddly.
|
||||
- License: Apache-2.0.
|
||||
- Verdict for digger: viable as a CPU-fast fallback for simple printed Arabic text. Not suitable as the primary path for handwriting/IDs/certificates.
|
||||
|
||||
### 3.3 docTR (Mindee)
|
||||
|
||||
- Arabic support: **none**. Pre-trained recognition models cover English and French only. The text detector is script-agnostic but the recogniser has no Arabic vocabulary. Community discussions as of 2025 confirm this is still unresolved upstream. Custom fine-tuning is possible but requires labelled data and significant effort.
|
||||
- Tables: no.
|
||||
- License: Apache-2.0.
|
||||
- Verdict for digger: **exclude**. Arabic is a hard requirement; this would be a greenfield training project, not a tool adoption.
|
||||
|
||||
### 3.4 Surya OCR (datalab-to/surya v0.20.0 — "Surya OCR 2", May 2026)
|
||||
|
||||
- Arabic: **yes** — 90+ languages, Arabic at 72.7% on their internal benchmark. Supports RTL.
|
||||
- Capabilities: OCR, layout analysis (headers/tables/images/equations), reading order, table recognition (rows/columns/cells, HTML output).
|
||||
- Handwriting: not a stated strength; the model is trained on printed/typeset documents.
|
||||
- CPU speed: "5 pages/s on an RTX 5090" — CPU-only will be substantially slower; 650M params.
|
||||
- License: Code is Apache-2.0; **model weights are under Modified AI Pubs Open Rail-M** (free for personal/research/startups under $5M but not freely commercial). This is a licensing risk.
|
||||
- Integration with our ModelBackend: can be imported as a Python library; cannot be served via Ollama; would be a direct Python dep.
|
||||
- Verdict for digger: useful for printed multilingual documents and layout detection. The weight license is a commercial risk. Not suitable for Arabic handwriting. Consider for future layout/reading-order module, not as primary OCR.
|
||||
|
||||
### 3.5 Qwen2.5-VL via Ollama (current approach)
|
||||
|
||||
- Arabic performance: **best in class for our use case**. Benchmark data (2025-2026):
|
||||
- Fine-tuned `Arabic-handwritten-OCR-4bit-Qwen2.5-VL-3B-v2`: CER **1.78%**, outperforming Google Vision API by 57%.
|
||||
- Fine-tuned Qwen2.5-VL-3B on handwriting: CER 4.51%, WER ~9%, accuracy ~97.2%.
|
||||
- LoRA fine-tuned Qwen2.5-VL: 29% CER reduction on modern Arabic print, 17% on historical documents.
|
||||
- The base 7B model (as used in arabic-ocr) is untested against these but expected to perform at least as well given larger capacity.
|
||||
- Handwriting: **excellent**. This is the only evaluated open stack that handles cursive Arabic handwriting reliably.
|
||||
- Certificates, IDs, tables, forms: handled in the existing prompt (see Section 4).
|
||||
- CPU viability: The 7B Q4_K_M GGUF runs on 128 GB RAM CPU machines — slower than GPU but feasible. Per-page generation time will be 30-120s on pure CPU depending on context length.
|
||||
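- License: depends on model size, so check each model card before shipping. As far as could be verified, the 7B weights are released under Apache-2.0, the 3B weights under the more restrictive Qwen Research License, and the 72B weights under the Qwen License (commercial use permitted up to 100M monthly users). Ollama/GGUF wrapping is MIT/Apache.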
|
||||
- VLM peers worth noting:
|
||||
- **Qwen2.5-VL 3B**: smaller, faster, community fine-tuned for Arabic OCR. Good for CPU-constrained deployment.
|
||||
- **SmolDocling / GraniteDocling** (256-258M): tiny, intended for structured document conversion (DocTags), not general Arabic OCR. Can run in Docling's VLM pipeline. Not trained for Arabic handwriting.
|
||||
- **GLM-OCR (0.9B)**: lightweight, Ollama-available, multilingual. Less Arabic-specific data than Qwen. Good for simple cases, not handwriting.
|
||||
|
||||
**OCR stack recommendation**: Qwen2.5-VL via Ollama is the v1 OCR backend. Use 7B for maximum coverage; allow downgrade to 3B via config for speed. Lock model name behind a config key, not hardcoded.
|
||||
|
||||
---
|
||||
|
||||
## 4. Existing `arabic-ocr` Repo — Precise Summary
|
||||
|
||||
**Repository**: `/home/luffy/space/arabic-ocr/`
|
||||
|
||||
### What it does
|
||||
|
||||
Single-file script (`arabic_ocr_smart.py`) that OCRs scanned PDFs and images using a Qwen2.5-VL model running in Ollama, with a carefully designed Arabic-language prompt.
|
||||
|
||||
### Pipeline
|
||||
|
||||
```
|
||||
Input (PDF/JPEG/PNG)
|
||||
→ [PDF] poppler via pdf2image at 300 DPI → list of PIL Images
|
||||
→ [image] PIL.Image.open → single-element list
|
||||
→ For each page: PIL Image → base64-encode → POST to Ollama /api/chat (streaming)
|
||||
→ Concatenate streamed response chunks → per-page text
|
||||
→ Write pages to output file separated by === headers
|
||||
```
|
||||
|
||||
### The prompt (single-pass, not two API calls)
|
||||
|
||||
The CLAUDE.md mentions a "two-pass" design but the current `arabic_ocr_smart.py` uses **one single prompt** that instructs the model to perform two *mental* steps silently before outputting:
|
||||
|
||||
1. **Mental step 1** — identify the document type (the model reasons internally: handwritten, certificate, ID, form, table, mixed — but does not output this label).
|
||||
2. **Mental step 2** — recall typical vocabulary/phrases for that document type to improve ambiguous character resolution.
|
||||
3. Then transcribe the full page.
|
||||
|
||||
The prompt is written in Arabic ("أنت عالِم متخصص في قراءة المخطوطات العربية...") and persona-encodes expertise in all Arabic script styles: نسخ، رقعة، ديواني، إجازة، كوفي.
|
||||
|
||||
**Strict rules encoded in the prompt:**
|
||||
- Line-by-line transcription preserving original line breaks
|
||||
- RTL for Arabic, LTR for numerals/Latin
|
||||
- Use `[؟]` for illegible characters rather than hallucinating
|
||||
- No invented words
|
||||
- Official stamps: `[ختم: ...]`
|
||||
- Preserve tashkeel (diacritics) if visible
|
||||
|
||||
### Output format
|
||||
|
||||
**NOT plain text** — it is structured pseudo-markdown:
|
||||
|
||||
```
============================================================
Page 1
============================================================

** عنوان رئيسي ** (headers as **bold**)
نص عادي سطراً سطراً (plain text, line-preserved)
| عمود | عمود | (tables in Markdown)
| --- | --- |
اسم الحقل: القيمة (form fields as "field: value")
[فارغ] (empty fields)
✓ / ☐ (checkboxes)
[ختم: ...] (official stamps)
```
|
||||
|
||||
Sections are separated by `====...====\nPage N\n====...====\n\n`.
|
||||
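Because the page separator is regular, a downstream extractor can recover per-page text from the script's output file. The following is a minimal sketch under that assumption; `split_pages` and `PAGE_HEADER` are hypothetical names, and the pattern assumes the rule lines consist of at least ten `=` characters.

```python
import re

# One page header: a rule of '=' characters, "Page N", then another rule.
PAGE_HEADER = re.compile(r"^={10,}\nPage (\d+)\n={10,}\n", re.MULTILINE)


def split_pages(ocr_output: str) -> dict[int, str]:
    """Split the OCR output into {page_number: page_text}."""
    matches = list(PAGE_HEADER.finditer(ocr_output))
    pages: dict[int, str] = {}
    for i, match in enumerate(matches):
        # A page's text runs from the end of its header to the next header.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(ocr_output)
        pages[int(match.group(1))] = ocr_output[match.end():end].strip()
    return pages
```

Keeping the page number from the header (rather than counting matches) means a missing page in the output does not shift the numbering of later pages.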
|
||||
### Ollama integration
|
||||
|
||||
Uses raw `urllib.request` (zero Python HTTP dependency beyond stdlib). Sends to `POST {host}/api/chat` with `stream: true`. Reads NDJSON chunks, collects `message.content` fragments, monitors `done` flag.
|
||||
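The stream handling described above can be sketched as follows. This is an illustration, not the script's actual code: `collect_stream` is a hypothetical helper, and it assumes each NDJSON line carries an optional `message.content` fragment and a `done` flag, as stated in the paragraph.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Concatenate message.content fragments from an /api/chat NDJSON stream."""
    parts: list[str] = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk sets done=true
    return "".join(parts)
```

With `urllib.request`, the response object returned by `urlopen()` can be iterated line by line, so it can be passed to a function like this directly.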
|
||||
### Configurable options
|
||||
|
||||
| Option | Default |
|
||||
|---|---|
|
||||
| `--host` | `http://192.168.122.1:11434` |
|
||||
| `--model` | `qwen2.5vl:7b` |
|
||||
| `--dpi` | 300 |
|
||||
| `--ctx` | 12288 |
|
||||
| `--timeout` | 600 s |
|
||||
| `--poppler` | None (Linux auto-discovers) |
|
||||
|
||||
### How to wrap behind a ModelBackend OCR protocol
|
||||
|
||||
The extraction point is already narrow. Define a protocol:
|
||||
|
||||
```python
from typing import Protocol

from PIL import Image


class OCRBackend(Protocol):
    def ocr_image(self, image: Image.Image) -> str:
        """Return extracted text for a single page image."""
        ...


class ModelBackend(Protocol):
    ocr: OCRBackend
    # + asr: ASRBackend, embed: EmbedBackend, etc.
```
|
||||
|
||||
The `QwenOllamaOCRBackend` implementation wraps `call_ollama()` from the existing script with config injected at construction time (host, model, num_ctx, timeout). No changes to the prompt are needed for v1. The backend is instantiated once per pipeline run, not per file.
|
||||
|
||||
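```python
from dataclasses import dataclass

from PIL import Image

# call_ollama() is the existing helper in arabic_ocr_smart.py
# (assumed here to be importable as a module).
from arabic_ocr_smart import call_ollama


@dataclass
class QwenOllamaOCRBackend:
    host: str = "http://localhost:11434"
    model: str = "qwen2.5vl:7b"
    num_ctx: int = 12288
    timeout: int = 600

    def ocr_image(self, image: Image.Image) -> str:
        # call_ollama returns a 3-tuple; only the text is needed here.
        text, _, _ = call_ollama(self.host, self.model, image, self.timeout, self.num_ctx)
        return text
```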
|
||||
|
||||
PDF → PIL conversion (pdf2image/poppler) belongs in the **extractor layer**, not the backend — the backend only speaks PIL Images. The extractor calls `backend.ocr_image(page)` per page and assembles the IR.
|
||||
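The division of labour between extractor and backend might look like the sketch below. It is deliberately free of PIL so it runs anywhere: the page type is left as `Any`, `ocr_pages` is a hypothetical helper, and the `page` / `raw_text` keys are placeholder field names rather than the actual IR schema.

```python
from typing import Any, Iterable, Protocol


class OCRBackend(Protocol):
    def ocr_image(self, image: Any) -> str: ...


def ocr_pages(pages: Iterable[Any], backend: OCRBackend) -> list[dict]:
    """Call the backend once per page and return per-page records for the IR."""
    return [
        {"page": number, "raw_text": backend.ocr_image(page)}
        for number, page in enumerate(pages, start=1)
    ]
```

Because the extractor depends only on the protocol, tests can pass a stub backend and never touch Ollama.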
|
||||
**Keep Qwen2.5-VL-via-Ollama as v1**: it is empirically the best option for Arabic handwriting + printed + certificates/IDs/tables/forms, it is already working and tested in the arabic-ocr repo, and the prompt has been carefully engineered. Do not replace it without A/B benchmarking on representative document samples from the actual corpus.
|
||||
|
||||
---
|
||||
|
||||
## 5. Ollama Deployment: Host Service vs Docker Container
|
||||
|
||||
### Option A — Ollama as a host service (current setup)
|
||||
|
||||
Ollama is installed natively on the machine that has the GPUs/RAM. The Docker Compose pipeline points at `http://<host-ip>:11434` via the `OLLAMA_HOST` env var.
|
||||
|
||||
**Current specific case**: user runs the pipeline in a KVM/QEMU VM; host Ollama is reachable at `http://192.168.122.1:11434` (the libvirt bridge IP). This is just a special case of Option A where "host" = the KVM hypervisor.
|
||||
|
||||
| Factor | Option A (host service) | Option B (Docker container) |
|
||||
|---|---|---|
|
||||
| **Ease of use** | **Best** — `ollama pull`, `ollama serve`, done. No Compose changes per model. | More complex: GPU passthrough flags (`--gpus=all` on Linux; WSL2 on Windows; unavailable on macOS Docker Desktop), volume mounts, restart policies. |
|
||||
| **Model storage** | Models live in `~/.ollama/models` on the host — natural, easy to manage, shared across projects. | Requires a named Docker volume (`-v ollama:/root/.ollama`). Without it, models are lost on `docker compose down`. |
|
||||
| **GPU access** | Direct — no passthrough layer. Best latency. | NVIDIA: works on Linux via NVIDIA Container Toolkit; works on Windows via WSL2 (with drivers). AMD: `--device /dev/kfd`. macOS: **not available** (Docker Desktop VM has no GPU access). On CPU-only dev box: moot. |
|
||||
| **Model sharing between projects** | Free — one Ollama instance, multiple clients. | Each Compose stack that includes Ollama gets its own instance unless they share a named volume. |
|
||||
| **Isolation / reproducibility** | Less — Ollama version tied to host; team members might differ. | More — version pinned in Compose image tag. |
|
||||
| **Pipeline reachability (cross-platform)** | See networking table below. | Same networking applies, but Ollama URL is a compose service name if colocated. |
|
||||
| **Primary target (Windows)** | User installs Ollama for Windows natively — one `.exe`. **Simplest for end users.** | Docker Desktop for Windows + WSL2 required anyway; no GPU access unless NVIDIA + WSL2 CUDA stack. |
|
||||
|
||||
### Networking: how a Dockerized pipeline reaches a host-side Ollama
|
||||
|
||||
| Platform | Host Ollama address from inside container |
|
||||
|---|---|
|
||||
| **Windows (Docker Desktop)** | `http://host.docker.internal:11434` — works out of the box |
|
||||
| **macOS (Docker Desktop)** | `http://host.docker.internal:11434` — works out of the box |
|
||||
| **Linux (Docker Engine)** | `host.docker.internal` not defined by default. **Fix**: add to Compose service: `extra_hosts: ["host.docker.internal:host-gateway"]`. Then same URL works. |
|
||||
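| **Linux KVM VM (dev setup)** | `http://192.168.122.1:11434` (libvirt bridge); already working. Note that `host-gateway` resolves to the Docker host, which here is the VM itself, so `host.docker.internal` reaches Ollama only if Ollama runs inside the VM. For Ollama on the hypervisor, keep the bridge IP. |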
|
||||
|
||||
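Ollama must be started with `OLLAMA_HOST=0.0.0.0:11434` so that it listens on all interfaces and accepts connections from non-localhost clients; by default it binds to `127.0.0.1` only. `OLLAMA_ORIGINS` is not a substitute: it controls which browser origins pass CORS checks, not which addresses the server binds to.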
|
||||
|
||||
### Recommendation
|
||||
|
||||
**Default: Ollama as host service (Option A), endpoint configurable via env var.**
|
||||
|
||||
Rationale:
|
||||
1. Windows is the primary deployment target. Native Ollama for Windows is a one-click install. Docker Desktop GPU passthrough on Windows requires NVIDIA + WSL2 CUDA stack — high friction for non-technical users.
|
||||
2. The dev box has no GPU; CPU inference is the same whether Ollama is in Docker or not. No benefit to containerising it.
|
||||
3. Model storage on host is simpler and already familiar (the user already pulls models with `ollama pull`).
|
||||
4. The endpoint is already configurable (`--host` flag, easily an env var in the pipeline).
|
||||
|
||||
**Suggested defaults in pipeline config / `.env`:**
|
||||
|
||||
```ini
# .env (committed as .env.example, gitignored when real)
OLLAMA_HOST=http://host.docker.internal:11434 # Windows/macOS Docker default
# OLLAMA_HOST=http://192.168.122.1:11434 # Linux KVM VM override
# OLLAMA_HOST=http://localhost:11434 # native (no Docker) override
```
|
||||
|
||||
**Suggested Docker Compose snippet:**
|
||||
|
||||
```yaml
services:
  pipeline:
    build: .
    environment:
      - OLLAMA_HOST=${OLLAMA_HOST:-http://host.docker.internal:11434}
    extra_hosts:
      - "host.docker.internal:host-gateway" # makes it work on Linux too
```
|
||||
|
||||
The `host-gateway` magic value is resolved by Docker Engine at container start to the host's bridge IP. This makes the same env var default (`http://host.docker.internal:11434`) work identically on Windows, macOS, and Linux without platform-specific logic in the application code.
|
||||
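On the application side, the endpoint lookup can stay equally platform-neutral. The sketch below is one possible shape, not existing pipeline code: `resolve_ollama_host` and `DEFAULT_OLLAMA_HOST` are hypothetical names, and adding a missing `http://` scheme is an assumption made for convenience.

```python
import os
from typing import Mapping, Optional

DEFAULT_OLLAMA_HOST = "http://host.docker.internal:11434"


def resolve_ollama_host(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Ollama base URL from OLLAMA_HOST, falling back to the default."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", "").strip() or DEFAULT_OLLAMA_HOST
    if "://" not in host:
        host = f"http://{host}"  # accept bare host:port values
    return host.rstrip("/")
```

Passing the environment as a parameter keeps the function testable without mutating `os.environ`.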
|
||||
### When to consider Option B (Ollama in Docker)
|
||||
|
||||
- You want Ollama version-pinned alongside the pipeline for reproducible CI.
|
||||
- All target machines have NVIDIA GPUs with Container Toolkit.
|
||||
- You need a fully self-contained `docker compose up` for a demo/deployment with no pre-installed host software.
|
||||
- In that case: add an `ollama` service to Compose with GPU access (the Compose equivalent of `--gpus=all` is a `deploy.resources.reservations.devices` entry), a named volume, and a `healthcheck` on port 11434. The pipeline service references it as `http://ollama:11434`.
|
||||
|
||||
---
|
||||
|
||||
## 6. Embeddings: Designed-for, Not Shipped
|
||||
|
||||
Local embedding options servable via the same ModelBackend (Ollama or a small local server):
|
||||
|
||||
| Model | Ollama tag | Dimensions | Context | Multilingual | Notes |
|
||||
|---|---|---|---|---|---|
|
||||
| **bge-m3** (BAAI) | `bge-m3` | 1024 | 8192 tokens | 100+ languages incl. Arabic | Supports dense, sparse, and multi-vector retrieval. Best for multilingual hybrid search. **Recommended for Arabic+English.** |
|
||||
| **nomic-embed-text v1.5** | `nomic-embed-text` | 768 | 8192 tokens | Primarily English | Best-in-class English retrieval; 274 MB. Not ideal for Arabic. |
|
||||
| **nomic-embed-text-v2-moe** | `nomic-embed-text-v2-moe` | varies | — | ~100 languages | Mixture-of-experts; larger. Covers Arabic. |
|
||||
| **Qwen3-Embedding** | (HF, not yet Ollama native as of Jul 2026) | 1536 | — | Multilingual | State-of-the-art 2026 multilingual embedding; track for Ollama availability. |
|
||||
|
||||
**For digger v1 embed design**: define `EmbedBackend.embed(texts: list[str]) -> list[list[float]]` on the ModelBackend interface. The default implementation calls Ollama's native embedding endpoint (`POST {host}/api/embed`, which accepts a batch of inputs; the older `POST {host}/api/embeddings` takes one prompt per call). Plug in `bge-m3` as the Arabic-safe default. This costs no extra infrastructure: same Ollama instance, different model name.
|
||||
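A minimal sketch of such a backend follows, using only the standard library. It assumes Ollama's batch endpoint `/api/embed`, with an `input` list in the request and an `embeddings` list in the response; `OllamaEmbedBackend` and `build_request` are hypothetical names, and the defaults are illustrative.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Protocol


class EmbedBackend(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...


@dataclass
class OllamaEmbedBackend:
    host: str = "http://localhost:11434"
    model: str = "bge-m3"
    timeout: int = 600

    def build_request(self, texts: list[str]) -> urllib.request.Request:
        """Build the POST request; kept separate so it can be tested offline."""
        body = json.dumps({"model": self.model, "input": texts}).encode("utf-8")
        return urllib.request.Request(
            f"{self.host}/api/embed",
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )

    def embed(self, texts: list[str]) -> list[list[float]]:
        if not texts:
            return []  # avoid a network round trip for empty input
        with urllib.request.urlopen(self.build_request(texts), timeout=self.timeout) as resp:
            return json.load(resp)["embeddings"]
```

Since vectors are switched off in v1, a class like this would only be wired in behind a configuration flag.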
|
||||
---
|
||||
|
||||
## 7. Risks & Open Questions
|
||||
|
||||
1. **Docling + Arabic**: Docling's EasyOCR default path likely has poor Arabic coverage. The VLM pipeline (Qwen2.5-VL via Ollama) fixes this, but it means Docling in "smart" mode is effectively the same inference cost as the arabic-ocr script. Need to decide: run Docling VLM pipeline, or use Docling for native-digital PDFs only and route scanned content to the arabic-ocr backend directly. The cleaner architecture is the latter.
|
||||
|
||||
2. **Surya model license**: Surya weights are not Apache-licensed (Modified AI Pubs Open Rail-M). If digger is ever used commercially or distributed, verify the Surya license before including it. Docling's SmolDocling/GraniteDocling are MIT/Apache.
|
||||
|
||||
3. **Qwen2.5-VL CPU throughput**: On a 128 GB RAM CPU-only machine, 7B Q4_K_M processes pages at ~30-120s each. For large corpora, this is a bottleneck. Mitigation: run multiple Ollama workers (set `OLLAMA_NUM_PARALLEL`), or downgrade to 3B for speed. The 3B fine-tuned variant already achieves <2% CER.
|
||||
|
||||
4. **Windows poppler path**: The existing arabic-ocr script already handles this via `--poppler` flag. The pipeline wrapper should auto-discover poppler from PATH on Linux/macOS and require explicit config on Windows, or bundle it in the Docker image.
|
||||
|
||||
5. **Unstructured cloud-mode risk**: Unstructured's pip package supports both local (Tesseract) and cloud API modes. The pipeline config must explicitly disable cloud features to guarantee no data leaves the machine. Use `partition()` with `strategy="fast"` or `strategy="hi_res"` (local only). Never use `UnstructuredClient` (cloud SDK).
|
||||
|
||||
6. **MarkItDown OCR quality**: The `markitdown-ocr` plugin delegates to whatever LLM client you give it. Image-heavy documents should go through the proper VLM OCR path, not MarkItDown's image description. MarkItDown is best reserved for Office documents where the format layer already contains structured text.
|
||||
|
||||
7. **ModelBackend interface versioning**: The `ocr_image(PIL.Image) -> str` signature is sufficient for v1. Future versions may want `ocr_image(...) -> StructuredPage` to carry bounding boxes, confidence scores, and detected document type from the model's reasoning. Design the IR to accommodate richer metadata even if v1 only fills `raw_text`.
|
||||
|
||||
---
|
||||
|
||||
## Sources
|
||||
|
||||
- Docling GitHub: https://github.com/DS4SD/docling
|
||||
- Docling VLM pipeline: https://docling-project.github.io/docling/usage/vision_models/
|
||||
- MarkItDown GitHub: https://github.com/microsoft/markitdown
|
||||
- Unstructured GitHub: https://github.com/Unstructured-IO/unstructured
|
||||
- Apache Tika GitHub: https://github.com/apache/tika
|
||||
- Surya GitHub: https://github.com/datalab-to/surya
|
||||
- Ollama Docker docs: https://docs.ollama.com/docker
|
||||
- QARI-OCR (Arabic OCR benchmark, Jun 2025): https://arxiv.org/html/2506.02295
|
||||
- Arabic-handwritten-OCR-4bit-Qwen2.5-VL-3B-v2 (HuggingFace): https://huggingface.co/sherif1313/Arabic-handwritten-OCR-4bit-Qwen2.5-VL-3B-v2
|
||||
- PaddleOCR 3.0 technical report: https://arxiv.org/html/2507.05595v1
|
||||
- bge-m3 on Ollama: https://ollama.com/library/bge-m3
|
||||
- ibm/granite-docling on Ollama: https://ollama.com/ibm/granite-docling
|
||||
- Docling+Ollama VLM example: https://www.youtube.com/watch?v=rHiL3LxYWY8
|
||||
- docTR Arabic discussion: https://github.com/mindee/doctr/discussions/1434
|
||||
- host.docker.internal on Linux: https://github.com/ollama/ollama/issues/3652
|
||||
352
docs/research/C-office-legacy.md
Normal file
|
|
@ -0,0 +1,352 @@
|
|||
# Agent C Research: Office & Legacy Format Extraction
|
||||
|
||||
**Scope:** Microsoft Office file formats — modern OOXML (.docx/.xlsx/.pptx) and legacy binary (.doc/.xls/.ppt), plus Microsoft Access (.mdb/.accdb). Cross-platform extraction strategy for a local-first Python pipeline.
|
||||
|
||||
**Researched:** 2026-07-01. All versions and URLs verified against live sources.
|
||||
|
||||
---
|
||||
|
||||
## 1. Summary & Per-Format Recommendation Table
|
||||
|
||||
| Format | Primary Tool | Fallback / Legacy | Notes |
|
||||
|---|---|---|---|
|
||||
| `.docx` | **docx2python** (extraction) + **python-docx** (structured access) | office_oxide (Rust, fast) | docx2python for rich structured output; python-docx for fine-grained API |
|
||||
| `.xlsx` | **openpyxl** (read-only mode for perf) | **python-calamine** (Rust, faster, read-only) | openpyxl for full metadata; calamine for large-file throughput |
|
||||
| `.pptx` | **python-pptx** | office_oxide | python-pptx covers slides, tables, notes, core properties |
|
||||
| `.doc` (binary) | **unoserver → convert to .docx → python-docx** | **office_oxide** (native read) | LibreOffice via unoserver is proven; office_oxide v0.1.x is newer but promising |
|
||||
| `.xls` (binary) | **xlrd** (for BIFF .xls only) + **unoserver → .xlsx** | **office_oxide** | xlrd is the cleanest direct reader; unoserver for full-fidelity conversion |
|
||||
| `.ppt` (binary) | **unoserver → convert to .pptx → python-pptx** | **office_oxide** | No good pure-Python .ppt reader; LibreOffice is the only reliable path |
|
||||
| `.mdb` (Access, JET 3/4) | **Windows:** pyodbc + ACE driver; **Linux:** mdbtools / mdb-parser | access-parser (pure Python, limited) | Windows-primary in v1; mdbtools on Linux for CI |
|
||||
| `.accdb` (Access 2007+) | **Windows:** pyodbc + ACE ODBC redistributable | — | **Defer .accdb cross-platform to v2**; Windows-only path in v1 |
|
||||
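The table above maps naturally onto a small routing function keyed by file extension. The sketch below is illustrative only: the route labels are hypothetical identifiers for extractor implementations, not names from the codebase.

```python
from pathlib import Path
from typing import Optional

# Primary tool per extension, following the recommendation table.
ROUTES = {
    ".docx": "docx2python",
    ".xlsx": "openpyxl",
    ".pptx": "python-pptx",
    ".doc": "unoserver->docx",
    ".xls": "xlrd",
    ".ppt": "unoserver->pptx",
    ".mdb": "pyodbc-ace",
    ".accdb": "pyodbc-ace",
}


def route(path: str) -> Optional[str]:
    """Return the primary extractor label for a file, or None if unsupported."""
    return ROUTES.get(Path(path).suffix.lower())
```

Extension matching is only a first pass; a real router would presumably confirm the format from the file's content before dispatching.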
|
||||
---
|
||||
|
||||
## 2. Modern OOXML Libraries
|
||||
|
||||
### 2.1 Word (.docx) — python-docx + docx2python
|
||||
|
||||
**python-docx v1.2.0** (released 2025-06-16, MIT, Python 3.9+)
|
||||
- Canonical Word library. Reads paragraphs, runs, styles, tables, headers/footers, core properties.
|
||||
- `doc.core_properties` exposes: `author`, `title`, `subject`, `description`, `created`, `modified`, `revision`, `keywords`, `category`, `last_modified_by`.
|
||||
- Tables: iterate `doc.tables`, then `table.rows[i].cells[j].text`. Merged cells are approximated (repeated values), not structurally preserved. Nested tables accessible via `_Cell.tables`.
|
||||
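- **Reading order:** `doc.paragraphs` and `doc.tables` are separate collections, so neither preserves the interleaving of paragraphs and tables. Use `Document.iter_inner_content()` (available in the 1.x series) to yield `Paragraph` and `Table` objects in document order, or walk `doc.element.body` directly.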
|
||||
- References: https://pypi.org/project/python-docx/ | https://python-docx.readthedocs.io/
|
||||
|
||||
**docx2python v3.6.2** (released 2026-01-31, MIT, Python 3.10+)
|
||||
- Purpose-built for *extraction*. Extracts headers, footers, body, footnotes, endnotes, comments, images, core properties, hyperlinks, numbered/bulleted lists, tables, math equations, checkboxes, dropdown selections.
|
||||
- Tables returned as clean `n×m` nested lists (cells always rectangular after merge flattening).
|
||||
- Paragraph styles exposed (Heading 1 → h1 when `html=True`). Font/bold/italic/underline/color optionally emitted as HTML spans.
|
||||
- **Recommendation:** use docx2python for the Extractor's text/structure output (the IR body and structured segments), and python-docx only when you need the full DOM API.
|
||||
- References: https://pypi.org/project/docx2python/ | https://github.com/ShayHill/docx2python
|
||||
|
||||
### 2.2 Excel (.xlsx) — openpyxl (+ calamine for big files)
|
||||
|
||||
**openpyxl v3.1.5** (released 2024-06-28, MIT, Python 3.8+)
|
||||
- The standard xlsx read/write library. Replaces xlrd for .xlsx (xlrd v2+ dropped xlsx support entirely).
|
||||
- Sheet access: `wb.sheetnames` returns ordered list. `wb[name]` or `wb.active`.
|
||||
- Workbook metadata: `wb.properties` (a `DocumentProperties` object) exposes `title`, `creator`, `description`, `created` (datetime), `modified` (datetime), `lastModifiedBy`, `keywords`, `subject`, `category`.
|
||||
- Row iteration: `ws.iter_rows()` in **read-only mode** (`load_workbook(path, read_only=True)`) — significantly lower memory and faster; note `iter_cols()` and `ws.columns` are unavailable in read-only mode.
|
||||
- Named ranges: `wb.defined_names` (a dict-like `DefinedNameDict` since openpyxl 3.1). Tables (ListObject): `ws.tables`.
|
||||
- References: https://pypi.org/project/openpyxl/ | https://openpyxl.readthedocs.io/
|
||||
|
||||
**python-calamine** (Rust/calamine binding, read-only)
|
||||
- Python bindings for Rust's `calamine` library. Reads .xlsx, .xls, .xlsm, .ods. Significantly faster than openpyxl for large read-only workloads. No write support.
|
||||
- Good fit as a performance tier: fall back to it when openpyxl is slow on large files (>50 MB).
|
||||
- References: https://pypi.org/project/python-calamine/ | https://github.com/dimastbk/python-calamine
|
||||
|
||||
**xlrd v2.0.1** (legacy .xls only)
|
||||
- Since v2.0, xlrd reads only BIFF `.xls` files (Excel 97–2003). Use it as the dedicated .xls reader.
|
||||
- For `.xls` text-only extraction, xlrd is cleaner than routing through LibreOffice conversion. Exposes sheet names, rows, cell types (string/number/date/boolean/error), and basic formatting info.
|
||||
- References: https://pypi.org/project/xlrd/ | https://xlrd.readthedocs.io/
|
||||
|
||||
### 2.3 PowerPoint (.pptx) — python-pptx
|
||||
|
||||
**python-pptx v1.0.2** (released 2024-08-07, MIT, Python 3.8+)
|
||||
- Reads/writes .pptx. Slide enumeration: `prs.slides`, each slide has `slide.shapes` (text boxes, tables, images, charts, SmartArt, auto-shapes).
|
||||
- Speaker notes: `slide.notes_slide.notes_text_frame.text`.
|
||||
- Tables: `shape.table`, iterate `shape.table.rows[i].cells[j].text`.
|
||||
- Core properties: `prs.core_properties` — same API as python-docx (author, title, subject, created, modified, etc.). `len(prs.slides)` for slide count.
|
||||
- Slide titles: access the title placeholder via `slide.shapes.title.text` where it exists.
|
||||
- Limitations: does not support `.ppt` binary format at all.
|
||||
- References: https://pypi.org/project/python-pptx/ | https://python-pptx.readthedocs.io/
|
||||
|
||||
### 2.4 office_oxide — Cross-format Rust library (emerging option)
|
||||
|
||||
**office_oxide v0.1.2** (released 2026-05-15, MIT/Apache-2.0)
|
||||
- Written in Rust, Python bindings via PyO3. Handles all six formats: DOCX, XLSX, PPTX + **DOC, XLS, PPT**.
|
||||
- Outputs plain text, Markdown, HTML, and a structured IR. Legacy formats (DOC/XLS/PPT) are read-only with conversion to modern OOXML available via `save_as()`.
|
||||
- Claims 8–100× faster than python-docx/openpyxl/python-pptx; 100% pass rate on a 6,062-file corpus.
|
||||
- **Risk:** v0.1.x — immature API, limited community vetting. Consider as a future optimization or a fallback for legacy binary extraction in v1, not the primary path yet.
|
||||
- References: https://github.com/yfedoseev/office_oxide
|
||||
|
||||
---
|
||||
|
||||
## 3. Legacy Binary Strategy — LibreOffice via unoserver
|
||||
|
||||
### 3.1 Approach
|
||||
|
||||
The reliable cross-platform strategy for `.doc`, `.ppt` (and as a fallback for `.xls`) is: **convert via LibreOffice headless → parse the resulting OOXML with the standard libraries above**.
|
||||
|
||||
Do NOT use `antiword` (upstream source has disappeared) or `catdoc`: both projects are abandoned.
|
||||
|
||||
### 3.2 unoserver (recommended over raw soffice)
|
||||
|
||||
**unoserver v3.7** (released 2026-06-10, MIT, Python 3.8+) — the modern successor to the deprecated `unoconv`.
|
||||
|
||||
unoserver keeps LibreOffice running in **listener mode** and accepts conversion requests via XML-RPC, rather than spawning and killing `soffice` per file. This reduces CPU load by **50–75%** for batch processing.
|
||||
|
||||
```bash
# Start the server (runs LibreOffice persistently):
unoserver --interface 127.0.0.1 --port 2003 &

# Convert a file (the client command is `unoconvert`; it takes --host, not --interface):
unoconvert --host 127.0.0.1 --port 2003 \
  --convert-to docx input.doc output.docx
```
|
||||
|
||||
Python integration is available through the `UnoClient` class or via a subprocess call to `unoconvert`.
|
||||
|
||||
Important: unoserver must be installed into the **same Python that LibreOffice uses** (typically the system Python in a Docker image, not a virtualenv Python), because it imports LibreOffice's `uno` bindings. In Docker this is cleanest when LibreOffice and unoserver are co-installed in the base image.
|
||||
|
||||
References: https://pypi.org/project/unoserver/ | https://github.com/unoconv/unoserver
|
||||
|
||||
### 3.3 Direct soffice invocation (simpler, lower throughput)
|
||||
|
||||
For low-volume or one-off use, invoke `soffice` directly:
|
||||
|
||||
```bash
soffice --headless --norestore \
  "-env:UserInstallation=file:///tmp/lo_profile_${RANDOM}" \
  --convert-to docx \
  --outdir /tmp/output/ \
  input.doc
```
|
||||
|
||||
**Critical pitfalls and mitigations:**
|
||||
|
||||
| Pitfall | Detail | Mitigation |
|
||||
|---|---|---|
|
||||
| **Instance lock** | LibreOffice uses a single-instance lock per user profile | Pass unique `UserInstallation` URI per process/thread. On Windows use `file:///C:/temp/lo_<uuid>` |
|
||||
| **Font substitution** | Without MS metric-compatible fonts, Calibri→Liberation substitution corrupts layouts | Install `fonts-crosextra-carlito` and `fonts-crosextra-caladea` in Docker image |
|
||||
| **Silent failures** | Conversion can fail without non-zero exit code | Always verify output file exists; check `stderr`; wrap in `try/finally` to clean up |
|
||||
| **Timeout** | Large files or corrupt inputs can hang `soffice` indefinitely | Use `subprocess.run(..., timeout=120)` or equivalent; treat TimeoutExpired as a failure |
|
||||
| **Single-threaded** | One LibreOffice instance handles one file at a time per profile | For parallel conversion: either use unoserver + a task queue, or run multiple isolated containers |
|
||||
| **Windows path** | `UserInstallation` must be a proper `file://` URI; use `Path(...).as_uri()` to generate it | Use Python's `pathlib.Path.as_uri()` |
|
||||
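Several of these mitigations (unique profile, timeout, output verification) fit in one small wrapper. The following is a sketch under stated assumptions: `build_soffice_cmd` and `convert_with_soffice` are hypothetical helpers, `soffice` is assumed to be on `PATH`, and the output name is assumed to be the input stem plus the target extension.

```python
import subprocess
import tempfile
from pathlib import Path


def build_soffice_cmd(src: Path, out_dir: Path, profile_dir: Path, target: str = "docx") -> list[str]:
    """Build the soffice command line with an isolated user profile."""
    return [
        "soffice", "--headless", "--norestore",
        # as_uri() needs an absolute path and yields a proper file:// URI.
        f"-env:UserInstallation={profile_dir.resolve().as_uri()}",
        "--convert-to", target,
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_with_soffice(src: Path, out_dir: Path, target: str = "docx", timeout: int = 120) -> Path:
    """Convert one file; raise RuntimeError on timeout, error, or missing output."""
    out_dir.mkdir(parents=True, exist_ok=True)
    expected = out_dir / f"{src.stem}.{target}"
    # A throwaway profile per call avoids the single-instance lock.
    with tempfile.TemporaryDirectory(prefix="lo_profile_") as profile:
        cmd = build_soffice_cmd(src, out_dir, Path(profile), target)
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired as exc:
            raise RuntimeError(f"soffice timed out after {timeout}s on {src}") from exc
    # soffice can exit 0 without producing output, so check the file too.
    if proc.returncode != 0 or not expected.exists():
        raise RuntimeError(f"soffice failed on {src}: {proc.stderr.strip()}")
    return expected
```

The caller is expected to catch `RuntimeError` and record the failure against the file rather than abort the run.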
|
||||
### 3.4 Docker image strategy
|
||||
|
||||
Recommended base: **Ubuntu 22.04 LTS** + LibreOffice from apt.
|
||||
|
||||
```dockerfile
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y \
      libreoffice \
      fonts-crosextra-carlito fonts-crosextra-caladea \
      fonts-liberation \
      python3-pip \
    && pip3 install unoserver \
    && rm -rf /var/lib/apt/lists/*
```
|
||||
|
||||
Image size: ~800 MB for Ubuntu-based (LibreOffice is large). Alpine-based variants exist (~200–300 MB) but LibreOffice packaging for Alpine is less straightforward. For v1, prefer Ubuntu for reliability over Alpine for size.
|
||||
|
||||
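LibreOffice **25.8** (August 2025) added 30% faster file-open times and significantly improved XLSX handling, so prefer a recent release. Note that Ubuntu 22.04's own apt repository ships the older LibreOffice 7.3 series; getting a current release on that base means adding the LibreOffice PPA or installing upstream packages.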
|
||||
|
||||
The document-conversion service should be isolated as its own container (separate from the main pipeline), called over a local socket or HTTP. This keeps LibreOffice's large footprint out of the pipeline container and makes it independently scalable.
|
||||
|
||||
References: https://oneuptime.com/blog/post/2026-02-08-how-to-run-libreoffice-in-docker-for-document-conversion/view | https://github.com/unoconv/unoserver
|
||||
|
||||
### 3.5 Legacy .xls — dual strategy
|
||||
|
||||
For `.xls` (BIFF/Excel 97–2003):
|
||||
- **Direct:** use `xlrd` (reads BIFF natively, no LibreOffice needed). Good for text + cell-type extraction.
|
||||
- **Fallback:** convert via unoserver to `.xlsx`, then read with openpyxl if richer formatting metadata is needed.
|
||||
|
||||
`xlrd` is the cleaner path for the pipeline since it avoids LibreOffice for this common format.
|
||||
|
||||
---
|
||||
|
||||
## 4. Microsoft Access — Reality Check
|
||||
|
||||
### 4.1 The cross-platform problem
|
||||
|
||||
Access is the most platform-restricted format in the stack. The MDB/ACCDB format is a proprietary JET/ACE database container. No fully cross-platform, production-grade pure-Python solution exists today.
|
||||
|
||||
### 4.2 Windows path (recommended for v1)
|
||||
|
||||
**pyodbc + Microsoft ACE ODBC Redistributable**
|
||||
|
||||
- `pyodbc` connects to the system ODBC driver layer.
|
||||
- For `.mdb` (JET 3/4): `Microsoft Access Driver (*.mdb)` (32-bit, bundled with Windows). For `.accdb`: `Microsoft Access Driver (*.mdb, *.accdb)` — this is the ACE engine, **not bundled with Windows but available as a free redistributable** (`AccessDatabaseEngine.exe` or `AccessDatabaseEngine_X64.exe` from Microsoft).
|
||||
- **Architecture must match:** if you use 64-bit Python, install the 64-bit ACE redistributable. With mismatched 32/64-bit builds the driver is invisible to pyodbc, and the connection fails with a generic driver-not-found error.
|
||||
- Does **not** require a Microsoft Office installation — the redistributable is standalone.
|
||||
- References: https://github.com/mkleehammer/pyodbc/wiki/Connecting-to-Microsoft-Access
|
||||
|
||||
```python
import pyodbc


def open_access(path: str) -> pyodbc.Connection:
    conn_str = (
        r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
        f"DBQ={path};"
    )
    return pyodbc.connect(conn_str)
```
|
||||
|
||||
### 4.3 Linux/macOS path (limited, CI use only)
|
||||
|
||||
**mdbtools v1.0.1** (released 2024-12-26, LGPL)
|
||||
- C binary suite: `mdb-tables`, `mdb-export`, `mdb-schema`, `mdb-json`, `mdb-sql`.
|
||||
- `.mdb` (JET 3 and JET 4): solid support. `.accdb` (ACE/JET 5): added in v1.0 but described by community as "limited and unreliable" — encrypted files, multi-value fields, attachment columns not supported.
|
||||
- Use for Linux CI to run tests against `.mdb` fixtures. Do not depend on it for `.accdb` in production.
|
||||
- Python wrappers: `mdb-parser` (PyPI, wraps mdbtools CLI via subprocess) or call CLI tools directly.
|
||||
- References: https://github.com/mdbtools/mdbtools
|
||||
|
||||
**access-parser v0.0.6** (PyPI, 2025-01-23, Apache 2.0, pure Python)
|
||||
- Reverse-engineered pure-Python parser for the Access (JET/ACE) file format, covering both `.mdb` and `.accdb`. No native binaries or ODBC drivers required.
|
||||
- **Heavy caveat from the authors:** "tested on a limited subset of database files; we expect to find more parsing edge-cases." Version 0.0.6 is pre-production.
|
||||
- May be useful for simple fixtures in CI but should not be the primary production path.
|
||||
- References: https://pypi.org/project/access-parser/
|
||||
|
||||
**UCanAccess + JPype (cross-platform via JVM)**
|
||||
- Pure-Java JDBC driver supporting `.mdb` and `.accdb`. Forked and revived in 2022, actively maintained.
|
||||
- Callable from Python via `jpype` or `JayDeBeApi`. Works on any platform with a JVM.
|
||||
- Adds a JVM dependency — appropriate only if the pipeline already has Java available (e.g. for Tika). For this pipeline, the JVM adds 300+ MB and operational complexity; not recommended unless Access is a first-class requirement.
|
||||
- References: https://foojay.io/today/ucanaccess-java-ms-access-jdbc-guide/
|
||||
|
||||
### 4.4 Recommendation: support .mdb and .accdb in v1 on Windows only; defer cross-platform to v2
|
||||
|
||||
| Decision | Rationale |
|
||||
|---|---|
|
||||
| **Support .mdb in v1 on Windows** | pyodbc + ACE driver covers .mdb reliably; no Office install required |
|
||||
| **Support .accdb in v1 on Windows** | Same pyodbc + ACE redistributable path; gate behind capability check (driver detection) |
|
||||
| **Defer cross-platform .mdb/.accdb to v2** | mdbtools runs on Linux/macOS only and its .accdb support is unreliable; no pure-Python solution is at production quality |
|
||||
| **Gate behind capability check** | At startup, probe for ODBC driver: if not found, emit a clear error: `"Microsoft ACE ODBC driver not found; install AccessDatabaseEngine_X64.exe from Microsoft"` |
|
||||
|
||||
Emit a structured warning in the IR `processing_status` field when an Access file is encountered on a non-Windows platform: `"Access extraction requires Windows with ACE ODBC driver; file skipped"`.
|
||||
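The capability check and the non-Windows skip can share one function. The sketch below takes the installed driver list as a parameter so it stays testable without pyodbc; in the pipeline the list would presumably come from `pyodbc.drivers()`. `access_capability` and the `status` / `reason` keys are hypothetical, not the IR's actual `processing_status` schema.

```python
import sys

ACE_DRIVER = "Microsoft Access Driver (*.mdb, *.accdb)"


def access_capability(drivers: list[str], platform: str = sys.platform) -> dict:
    """Report whether Access files can be extracted on this machine."""
    if not platform.startswith("win"):
        return {
            "status": "skipped",
            "reason": "Access extraction requires Windows with ACE ODBC driver; file skipped",
        }
    if ACE_DRIVER not in drivers:
        return {
            "status": "error",
            "reason": "Microsoft ACE ODBC driver not found; "
                      "install AccessDatabaseEngine_X64.exe from Microsoft",
        }
    return {"status": "ok", "driver": ACE_DRIVER}
```

Running the probe once at startup and caching the result avoids repeating the same warning for every Access file in a corpus.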
|
||||
---
|
||||
|
||||
## 5. Per-Format Metadata to Capture in the IR
|
||||
|
||||
| Format | Metadata Fields |
|
||||
|---|---|
|
||||
| `.docx` | `author` (core_properties.author), `title`, `subject`, `description`, `created`, `modified`, `last_modified_by`, `revision`, `keywords`, `category`, `word_count` (from app properties), paragraph_count, section count |
|
||||
| `.xlsx` | `creator`, `title`, `subject`, `description`, `created`, `modified`, `last_modified_by`, `keywords`, `category`, `sheet_names` (ordered list), `sheet_count`, named ranges count |
|
||||
| `.pptx` | `author`, `title`, `subject`, `description`, `created`, `modified`, `last_modified_by`, `slide_count`, slide titles list, `notes_count` |
|
||||
| `.doc/.xls/.ppt` (legacy) | Same as OOXML counterpart after conversion; also capture: original_format (`doc`/`xls`/`ppt`), `extractor` = `libreoffice/{version}` |
|
||||
| `.mdb/.accdb` | `table_names` (ordered list), `table_count`, `record_counts` per table, `schema_version` (JET 3/4/5), Access engine version if detectable |
|
||||
|
||||
For all formats also capture: `file_size_bytes`, `detected_mime_type`, `extractor_name`, `extractor_version`, `extraction_timestamp`.
|
||||
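The common fields could be gathered by a helper like the one below. This is a sketch only: `CommonMetadata` and `common_metadata` are hypothetical names, and `mimetypes.guess_type` (extension-based) stands in for whatever content-based MIME detection the pipeline ends up using.

```python
import mimetypes
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class CommonMetadata:
    file_size_bytes: int
    detected_mime_type: str
    extractor_name: str
    extractor_version: str
    extraction_timestamp: str  # ISO 8601, UTC


def common_metadata(path: Path, extractor_name: str, extractor_version: str) -> CommonMetadata:
    """Collect the format-independent metadata fields for one file."""
    mime, _ = mimetypes.guess_type(path.name)
    return CommonMetadata(
        file_size_bytes=path.stat().st_size,
        detected_mime_type=mime or "application/octet-stream",
        extractor_name=extractor_name,
        extractor_version=extractor_version,
        extraction_timestamp=datetime.now(timezone.utc).isoformat(),
    )
```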
|
||||
---
|
||||
|
||||
## 6. Windows-Specific Gotchas
|
||||
|
||||
### 6.1 Long paths (MAX_PATH = 260 characters)
|
||||
|
||||
Windows limits paths to 260 characters by default. Files deep in folder hierarchies (common in corporate environments) will silently fail with `FileNotFoundError` or truncated paths.
|
||||
|
||||
Mitigations (apply all three):
|
||||
1. Enable long paths via Group Policy or registry: `HKLM\SYSTEM\CurrentControlSet\Control\FileSystem\LongPathsEnabled = 1`; or document this as a system requirement.
|
||||
2. Always use `pathlib.Path` — Python 3.6+ on Windows uses extended-path APIs when long-path support is enabled.
|
||||
3. As a defensive fallback, prefix absolute paths with `\\?\` when passing paths to Win32 APIs via subprocess (e.g. to `soffice`).
|
||||
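The prefixing in the third mitigation can be sketched with `PureWindowsPath`, which works on any platform and normalises forward slashes (the `\\?\` prefix disables that normalisation in Win32). `to_extended_path` is a hypothetical helper; whether a given external tool accepts prefixed paths still has to be verified per tool.

```python
from pathlib import PureWindowsPath


def to_extended_path(path: str) -> str:
    r"""Return an absolute Windows path with the \\?\ extended-length prefix."""
    if path.startswith("\\\\?\\"):
        return path  # already prefixed
    win_path = PureWindowsPath(path)
    if not win_path.is_absolute():
        raise ValueError("extended-length prefix requires an absolute path")
    text = str(win_path)  # backslash-normalised form
    if text.startswith("\\\\"):
        # UNC path \\server\share\... becomes \\?\UNC\server\share\...
        return "\\\\?\\UNC\\" + text[2:]
    return "\\\\?\\" + text
```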
|
||||
### 6.2 File locking
|
||||
|
||||
Windows uses mandatory (exclusive) file locking. If a `.docx` or `.xlsx` file is currently open in Word/Excel, any attempt to open it from Python raises `PermissionError: [Errno 13] Permission denied`.
|
||||
|
||||
Strategy: wrap every file open in a `try/except PermissionError` and log a structured warning to the IR's `processing_status`. Never crash the run. Optionally retry once after a short delay (the user may have saved and closed).
|
||||
|
||||
Do NOT use `fcntl` (Unix-only; the module does not exist on Windows). The `msvcrt.locking` / `win32file.LockFile` approach to explicit locking is unnecessary here; just catch the error from the OS.
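The catch-and-retry strategy might look like the sketch below. The function name, the `reader` parameter, and the `file_locked` warning text are assumptions made for illustration; the real extractor would record the warning in the IR's `processing_status`.

```python
import time
from pathlib import Path
from typing import Callable, Optional, Tuple


def read_with_lock_retry(
    path: Path,
    reader: Callable[[Path], bytes] = Path.read_bytes,
    retries: int = 1,
    delay_seconds: float = 2.0,
) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (data, None) on success or (None, warning) if the file stays locked."""
    for attempt in range(retries + 1):
        try:
            return reader(path), None
        except PermissionError as exc:
            if attempt < retries:
                time.sleep(delay_seconds)  # the user may save and close the file
                continue
            # Hypothetical warning text; the run continues with the next file.
            return None, f"file_locked: {exc}"
    return None, "file_locked"  # only reached when retries is negative
```

Returning a warning instead of raising keeps the "never crash the run" rule in one place; the caller decides how to log it.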
|
||||
|
||||
### 6.3 Encodings
|
||||
|
||||
Legacy `.doc`/`.xls` files from Windows environments may use CP-1252, CP-1251 (Cyrillic), or other ANSI code pages. LibreOffice handles this transparently during conversion. For `.docx`/`.xlsx`, everything is UTF-8 inside the ZIP container — no encoding issues.
|
||||
|
||||
CSV exports from `.mdb` via mdbtools default to the database's code page; pass `--encoding` if known.
|
||||
|
||||
### 6.4 COM / Office automation — AVOID
|
||||
|
||||
Do not use `win32com`, `comtypes`, or `pywin32` to automate Word/Excel. This approach:
|
||||
- Requires a licensed Microsoft Office installation on the host.
|
||||
- Is fragile (prompts, dialogs, licensing checks can block headless runs).
|
||||
- Is not containerizable.
|
||||
|
||||
All recommended libraries above (python-docx, openpyxl, python-pptx, unoserver) work entirely without Office. Enforce this as an invariant: no COM in the pipeline codebase.
|
||||
|
||||
### 6.5 ACE ODBC architecture mismatch
|
||||
|
||||
If the system has 32-bit ACE ODBC but 64-bit Python (or vice versa), pyodbc silently fails to find the driver. Document the requirement: **install `AccessDatabaseEngine_X64.exe` for 64-bit Python**. Add a driver-detection probe to the capability check that prints a helpful message rather than a cryptic ODBC error.
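A driver-detection probe could be as small as the sketch below. It takes the list of installed driver names as an argument so it can be exercised without ODBC; in the real capability check that list would presumably come from `pyodbc.drivers()`, which is expected to report only the drivers visible to the running interpreter's architecture. The function name and message wording are hypothetical.

```python
from typing import Iterable, Optional

# Driver name as registered by the Access Database Engine installer.
ACCESS_DRIVER_NAME = "Microsoft Access Driver (*.mdb, *.accdb)"


def check_access_driver(installed_drivers: Iterable[str]) -> Optional[str]:
    """Return None if the ACE driver is visible, else a human-readable hint."""
    if ACCESS_DRIVER_NAME in list(installed_drivers):
        return None
    return (
        f"ODBC driver '{ACCESS_DRIVER_NAME}' not found. "
        "Install AccessDatabaseEngine_X64.exe for 64-bit Python "
        "(a 32-bit driver is not visible to a 64-bit interpreter)."
    )
```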
|
||||
|
||||
### 6.6 LibreOffice on Windows
|
||||
|
||||
The `UserInstallation` path in the `-env:UserInstallation=...` flag must be a valid `file://` URI. Use `pathlib.Path(tmpdir).as_uri()` to generate it — this produces the correct `file:///C:/...` form on Windows. Passing a raw Windows path (backslashes, no scheme) causes silent failure.
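As a rough sketch of the URI rule, the helper below builds (but does not run) a conversion command line. The function name and the choice of `docx` as target format are illustrative assumptions; only `Path.as_uri()` and the `-env:UserInstallation=` flag come from the text above.

```python
import tempfile
from pathlib import Path
from typing import List


def build_soffice_command(source: Path, out_dir: Path, profile_dir: Path) -> List[str]:
    """Build a headless conversion command with a per-run user profile."""
    # as_uri() requires an absolute path, so resolve first.
    profile_uri = profile_dir.resolve().as_uri()
    return [
        "soffice",
        f"-env:UserInstallation={profile_uri}",
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(source),
    ]


with tempfile.TemporaryDirectory() as tmpdir:
    command = build_soffice_command(Path("legacy.doc"), Path("out"), Path(tmpdir))
```

The resulting list would be handed to `subprocess.run(command, timeout=...)`; a fresh profile directory per worker also keeps parallel conversions from sharing state.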
|
||||
|
||||
If running LibreOffice inside a Docker container on Windows (via Docker Desktop / WSL2), the Linux path rules apply inside the container.
|
||||
|
||||
---
|
||||
|
||||
## 7. Risks and Open Items
|
||||
|
||||
| Risk | Severity | Mitigation |
|---|---|---|
| office_oxide v0.1.x API instability | Medium | Pin version; wrap behind Extractor interface so swap is easy |
| mdbtools .accdb unreliability | High | Defer to v2; Windows-only pyodbc path for v1 |
| LibreOffice conversion quality loss | Medium | Test against fixture corpus; fall back to office_oxide for some formats |
| ACE ODBC architecture mismatch | Medium | Capability check at startup with clear error message |
| Corrupt/password-protected Office files | Medium | Catch exceptions per-file; record as `failed` in IR status; never abort run |
| LibreOffice Docker image size (~800 MB) | Low-medium | Isolate as separate service container; consider multi-stage build to reduce |
| python-docx document-order limitation | Low | Use `doc.element.body` iteration or docx2python which handles this |
| Long path failures on Windows | High | Registry key + pathlib + `\\?\` prefix; document as system requirement |
| unoserver requires same Python as LibreOffice | Medium | Install in Docker using system Python; document clearly in dev setup |
|
||||
|
||||
---
|
||||
|
||||
## 8. Recommended Bundled Dependencies (Docker)
|
||||
|
||||
| Dependency | How to bundle | Notes |
|---|---|---|
| LibreOffice | `apt install libreoffice` in Ubuntu base image | ~400 MB; isolate in separate service container |
| unoserver | `pip3 install unoserver` in LibreOffice container | Must use LibreOffice's own Python |
| Microsoft fonts (metric compat) | `apt install fonts-crosextra-carlito fonts-crosextra-caladea fonts-liberation` | Prevents layout corruption |
| mdbtools (Linux/CI) | `apt install mdbtools` | For Linux CI of .mdb fixtures |
| ACE ODBC (Windows host) | `AccessDatabaseEngine_X64.exe` pre-installed on runner | Not available in Linux containers; Windows runner only |
|
||||
|
||||
Python packages (all pip-installable, cross-platform):
|
||||
- `python-docx>=1.2.0`
|
||||
- `docx2python>=3.6.0`
|
||||
- `openpyxl>=3.1.5`
|
||||
- `python-calamine` (optional, for large-xlsx performance tier)
|
||||
- `xlrd>=2.0.1` (for legacy .xls)
|
||||
- `python-pptx>=1.0.2`
|
||||
- `pyodbc` (Windows, for Access)
|
||||
- `mdb-parser` (Linux/CI, for Access fixture reading)
|
||||
|
||||
---
|
||||
|
||||
## Sources
|
||||
|
||||
- python-docx PyPI: https://pypi.org/project/python-docx/
|
||||
- python-docx docs: https://python-docx.readthedocs.io/en/latest/
|
||||
- python-docx tables: https://python-docx.readthedocs.io/en/latest/user/tables.html
|
||||
- docx2python PyPI: https://pypi.org/project/docx2python/
|
||||
- docx2python GitHub: https://github.com/ShayHill/docx2python
|
||||
- openpyxl PyPI: https://pypi.org/project/openpyxl/
|
||||
- openpyxl docs: https://openpyxl.readthedocs.io/en/stable/tutorial.html
|
||||
- xlrd PyPI: https://pypi.org/project/xlrd/
|
||||
- python-calamine PyPI: https://pypi.org/project/python-calamine/
|
||||
- python-calamine GitHub: https://github.com/dimastbk/python-calamine
|
||||
- python-pptx PyPI: https://pypi.org/project/python-pptx/
|
||||
- python-pptx docs: https://python-pptx.readthedocs.io/en/latest/
|
||||
- office_oxide GitHub: https://github.com/yfedoseev/office_oxide
|
||||
- unoserver PyPI: https://pypi.org/project/unoserver/
|
||||
- unoserver GitHub: https://github.com/unoconv/unoserver
|
||||
- LibreOffice in Docker (2026): https://oneuptime.com/blog/post/2026-02-08-how-to-run-libreoffice-in-docker-for-document-conversion/view
|
||||
- LibreOffice parallel profiles: https://ask.libreoffice.org/t/multiple-user-profiles-for-parallel-processing-with-custom-configuration-changes-in-user-profiles/110834
|
||||
- mdbtools GitHub: https://github.com/mdbtools/mdbtools
|
||||
- access-parser PyPI: https://pypi.org/project/access-parser/
|
||||
- pyodbc Access wiki: https://github.com/mkleehammer/pyodbc/wiki/Connecting-to-Microsoft-Access
|
||||
- UCanAccess guide: https://foojay.io/today/ucanaccess-java-ms-access-jdbc-guide/
|
||||
- xlrd legacy note: https://xlrd.readthedocs.io/en/latest/
|
||||
- Windows MAX_PATH: https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation
|
||||
317
docs/research/D-audio-video.md
Normal file
|
|
@ -0,0 +1,317 @@
|
|||
# Research D: Audio / Video Transcription
|
||||
|
||||
**Agent D — July 2026**
|
||||
|
||||
---
|
||||
|
||||
## 1. Summary & Recommendations
|
||||
|
||||
### Recommended ASR runtime
|
||||
|
||||
**faster-whisper v1.2.1 (SYSTRAN/faster-whisper)** is the recommended runtime for digger.
|
||||
- Backend: CTranslate2 — C++ inference engine that replaces PyTorch for transformer models.
|
||||
- Default model: **`large-v3` with `compute_type=int8`** (CPU-first).
|
||||
- Why not turbo: `large-v3-turbo` is ~7x faster on CPU but has documented Arabic accuracy regressions (common words like "نعم" transcribed as "Naah/Naahe"). For an Arabic+English requirement, trade speed for accuracy and use full `large-v3`.
|
||||
- Memory: large-v3 with int8 requires ~1.5 GB RAM at runtime; trivial on a 128 GB machine.
|
||||
- Speed (CPU): Ryzen 7 7700X with 8 threads → ~10x real-time for large-v3 int8 (a 60-minute file in ~6 minutes). Perfectly acceptable for batch offline processing.
|
||||
|
||||
### GPU upgrade path
|
||||
|
||||
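Set `device="cuda"` and `compute_type="float16"` in the `ModelBackend` configuration. The model files are identical; only the runtime flags change. On an RTX 4070 (12 GB), large-v3 runs at ~12x real-time; note that this is the figure quoted for `int8` in the Section 2 comparison table, so `float16` throughput may differ.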
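The CPU/GPU switch can be kept to a single configuration flag. The sketch below is an assumption about how the `ModelBackend` configuration might be shaped: `AsrSettings` and `whisper_model_kwargs` are hypothetical names, while `model_size_or_path`, `device`, and `compute_type` are the `WhisperModel` constructor arguments.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AsrSettings:
    """Hypothetical slice of the ModelBackend configuration."""
    model: str = "large-v3"
    use_gpu: bool = False


def whisper_model_kwargs(settings: AsrSettings) -> Dict[str, str]:
    """Map the settings to WhisperModel keyword arguments."""
    if settings.use_gpu:
        device, compute_type = "cuda", "float16"
    else:
        device, compute_type = "cpu", "int8"  # CPU-first default
    return {
        "model_size_or_path": settings.model,
        "device": device,
        "compute_type": compute_type,
    }
```

Usage would then be `WhisperModel(**whisper_model_kwargs(settings))`, so no extractor code changes when the GPU is enabled.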
|
||||
|
||||
### Sources
|
||||
|
||||
- faster-whisper GitHub: https://github.com/SYSTRAN/faster-whisper
|
||||
- faster-whisper 2026 guide: https://localaimaster.com/blog/faster-whisper-guide
|
||||
- whisper.cpp vs faster-whisper 2026: https://www.promptquorum.com/power-local-llm/local-whisper-stt-comparison-2026
|
||||
- Whisper Arabic accuracy: https://novascribe.ai/how-accurate-is-whisper
|
||||
|
||||
---
|
||||
|
||||
## 2. ASR Runtime Comparison
|
||||
|
||||
| | **faster-whisper** | **WhisperX** | **whisper.cpp** | **openai/whisper** | **Parakeet TDT v3** | **Moonshine** |
|---|---|---|---|---|---|---|
| **Version (Jul 2026)** | 1.2.1 (Oct 2025) | 3.8.6 (May 2026) | 1.8.3 (Jan 2026) | 20240930 | 2.0 (2025) | 0.2 (2025) |
| **Backend** | CTranslate2 (C++) | faster-whisper + wav2vec2 | GGML (C++) | PyTorch | NeMo (PyTorch) | Custom PyTorch |
| **License** | MIT | BSD-2 | MIT | MIT | CC BY 4.0 | Apache-2.0 |
| **Language coverage** | 99 (Whisper parity) | 99 (Whisper parity) | 99 (Whisper parity) | 99 | 25 (EN + major European) | English only |
| **Arabic support** | Yes; WER ~15–25% on FLEURS (large-v3) | Segment-level yes; word-level alignment requires manual HF model (jonatasgrosman/wav2vec2-large-xlsr-53-arabic) | Yes (same models) | Yes (same models) | **No** | **No** |
| **CPU support** | Yes; int8 quantization, SIMD | Yes; `--compute_type int8 --device cpu` | Yes; AVX/AVX2/NEON SIMD | Yes but very slow | Slow on CPU | Designed for edge/CPU |
| **CPU speed (large-v3, int8)** | ~10x RT (good x86 CPU) | Same as faster-whisper core | Similar or ~30% slower on x86; better on ARM | 2–3x RT with PyTorch | N/A (EN only) | N/A (EN only) |
| **GPU speed (large-v3, int8)** | ~12x RT (RTX 4070) | Same core | CUDA + Vulkan | 10x RT (RTX 4070) | ~3,333x RT (EN, GPU) | Fast (EN, GPU) |
| **RAM (large-v3, int8)** | ~1.5 GB | ~1.5 GB + wav2vec2 model | Slightly lower overhead | ~6–10 GB (PyTorch) | ~3 GB | ~500 MB |
| **Word-level timestamps** | Yes (built-in `word_timestamps=True`) | Yes (sub-100 ms via phoneme alignment) | Yes | No | Yes | Yes |
| **Segment timestamps** | Yes | Yes | Yes | Yes | Yes | Yes |
| **VAD / silence filtering** | Yes (Silero VAD built-in, `vad_filter=True`) | Yes (Silero VAD) | Yes (built-in) | No | Yes | Yes |
| **Windows support** | Yes (pip wheels; `whisper-standalone-win` for CLI) | Yes (CUDA Toolkit 12.8+ for GPU; CPU works without) | Yes (prebuilt exe; CMake for custom builds) | Yes | Yes | Yes |
| **Cross-platform** | Linux, Windows, macOS, Docker | Linux, Windows, macOS | Linux, Windows, macOS, iOS, Android, WASM | Linux, Windows, macOS | Linux, Windows | Linux, Windows, macOS |
| **Python API** | Yes (native) | Yes (native) | Via `whispercpp` Python bindings or subprocess | Yes (native) | Via NeMo SDK | Yes (native) |
| **Speaker diarization** | No (use WhisperX or standalone pyannote) | Yes (pyannote/speaker-diarization-community-1) | No | No | No | No |
| **Verdict** | **Primary choice** | Good if word-level timestamps or diarization needed; wraps faster-whisper | Best for embedded/non-Python; no advantages over faster-whisper in a Python pipeline | **Reject**: too slow, too much memory | **Reject**: no Arabic | **Reject**: no Arabic |
|
||||
|
||||
### Notes
|
||||
|
||||
- **WhisperX** is not a separate model — it wraps faster-whisper and adds wav2vec2 alignment. If word-level timestamps and/or diarization become required, WhisperX is the natural extension, not a replacement. The Arabic alignment model must be supplied manually.
|
||||
- **whisper.cpp** shines for embedded, edge, or non-Python contexts (iOS, Rust, Go) where you cannot use CTranslate2. For a Python-first pipeline on a 128 GB machine, faster-whisper is simpler and marginally faster on x86.
|
||||
- **large-v3-turbo**: 7x faster on CPU and 1.5 GB model size, but has documented Arabic accuracy regression. Treat it as an optional override for English-only files where the user explicitly opts in.
|
||||
|
||||
---
|
||||
|
||||
## 3. Audio Extraction: ffmpeg Approach and Bundling
|
||||
|
||||
### Why ffmpeg
|
||||
|
||||
ffmpeg is the universal audio/video extraction standard. It handles every container and codec combination in the wild (MP4/H.264/AAC, MKV/VP9/Opus, MOV, AVI, M4A, FLAC, WAV, MP3, OGG, WMA, etc.) and can extract audio, normalize sample rate, and mono-convert in a single subprocess call.
|
||||
|
||||
### Extraction command pattern
|
||||
|
||||
```python
import subprocess


def extract_audio(video_path: str, output_wav: str, sample_rate: int = 16000) -> None:
    """
    Extract audio from any media file to a 16 kHz mono WAV.
    Whisper requires 16 kHz; mono reduces memory and avoids down-mix ambiguity.
    """
    cmd = [
        "ffmpeg",
        "-y",                    # overwrite output
        "-i", video_path,
        "-vn",                   # no video
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", str(sample_rate),
        "-ac", "1",              # mono
        output_wav,
    ]
    # Raises subprocess.TimeoutExpired (after killing ffmpeg) once 300 s elapse.
    result = subprocess.run(cmd, capture_output=True, timeout=300)
    if result.returncode != 0:
        # ffmpeg's stderr is not guaranteed to be valid UTF-8.
        raise RuntimeError(f"ffmpeg failed: {result.stderr.decode(errors='replace')}")
```
|
||||
|
||||
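Use `subprocess` directly rather than the `ffmpeg-python` library, a thin wrapper that adds a dependency without meaningful benefit for a simple extract-only use case. The `timeout` argument enforces the per-file limit: when it expires, `subprocess.run` kills the ffmpeg child process and raises `subprocess.TimeoutExpired`, which the caller should catch and record as a failure.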
|
||||
|
||||
### Python API choice: subprocess vs ffmpeg-python
|
||||
|
||||
| | `subprocess` (direct) | `ffmpeg-python` |
|---|---|---|
| Dependency | None (only ffmpeg binary) | `ffmpeg-python` PyPI package |
| Complexity | Low | Low (fluent API) |
| Timeout control | Native (`subprocess.run(timeout=…)`) | Requires workaround |
| Debugging | `stderr` available directly | Harder to inspect |
| Complex filter graphs | Manual string building | More convenient |
| **Verdict** | **Use for digger**: we only do extract+normalize, so subprocess is sufficient | Unnecessary layer |
|
||||
|
||||
### Bundling strategy
|
||||
|
||||
**Inside Docker Compose (primary distribution)**: Add `ffmpeg` to the Dockerfile base image — one line:
|
||||
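```dockerfile
# Debian/Ubuntu base; refresh the package index in the same layer as the install
RUN apt-get update \
    && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*
```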
|
||||
No Python bundling needed. The binary is always present in the container.
|
||||
|
||||
**Outside Docker (pip-install / developer mode)**: Use `static-ffmpeg` (PyPI), which bundles a fully static ffmpeg binary and downloads the platform-appropriate build on first use:
|
||||
|
||||
```bash
|
||||
pip install static-ffmpeg # version 2.13, released Jan 2026
|
||||
```
|
||||
|
||||
```python
|
||||
import static_ffmpeg
|
||||
static_ffmpeg.add_paths() # prepends the bundled ffmpeg to PATH
|
||||
# now subprocess("ffmpeg …") finds the bundled binary
|
||||
```
|
||||
|
||||
`static-ffmpeg` v2.13 supports Windows (win64), Linux (x86_64, aarch64), and macOS (x86_64, arm64). It does not require elevated permissions. An alternative is `ffmpeg-update` (Jun 2026), which manages ffmpeg static binary updates.
|
||||
|
||||
**Abstraction in code**: Introduce a thin `get_ffmpeg_path() -> str` helper in the `ModelBackend`/tooling layer that:
|
||||
1. Checks `config.ffmpeg_path` (explicit override).
|
||||
2. Falls back to `shutil.which("ffmpeg")` (system install or Docker).
|
||||
3. Falls back to `static_ffmpeg`'s resolved path (developer mode).
|
||||
|
||||
This keeps the Docker path clean and the non-Docker path zero-install.
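The three-step lookup above can be sketched as follows. This is an illustration rather than the project's implementation: the parameter name `configured_path` stands in for `config.ffmpeg_path`, and step 3 relies only on the `static_ffmpeg.add_paths()` call shown earlier, followed by a second `shutil.which` lookup.

```python
import shutil
from typing import Optional


def get_ffmpeg_path(configured_path: Optional[str] = None) -> str:
    """Resolve the ffmpeg binary: explicit override, then PATH, then static-ffmpeg."""
    # 1. Explicit override from configuration.
    if configured_path:
        return configured_path
    # 2. System install or Docker image.
    found = shutil.which("ffmpeg")
    if found:
        return found
    # 3. Developer mode: let static-ffmpeg put its bundled binary on PATH.
    try:
        import static_ffmpeg
    except ImportError as exc:
        raise RuntimeError(
            "ffmpeg not found; install ffmpeg or `pip install static-ffmpeg`"
        ) from exc
    static_ffmpeg.add_paths()
    found = shutil.which("ffmpeg")
    if found is None:
        raise RuntimeError("static-ffmpeg did not provide an ffmpeg binary")
    return found
```

Importing `static_ffmpeg` lazily keeps it an optional dependency, so the Docker image does not need it installed.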
|
||||
|
||||
### Sample rate, codec, format
|
||||
|
||||
Whisper (and all Whisper variants) expect **16 kHz mono PCM** as input. Extract to a temporary WAV file before passing it to faster-whisper, and delete the file after transcription. Piping ffmpeg's stdout straight into `numpy`/`soundfile` would avoid the disk write, but it is an optimization for a later iteration; v1 writes a temp file for simplicity and debuggability.
|
||||
|
||||
---
|
||||
|
||||
## 4. Speaker Diarization: v1 vs V2 Verdict
|
||||
|
||||
### Verdict: **defer diarization to V2**
|
||||
|
||||
Reasons:
|
||||
|
||||
1. **CPU cost is prohibitive for v1**. pyannote's speaker-diarization-community-1 has a real-time factor of ~2.5% on a high-end GPU (V100), i.e. roughly 1.5 minutes of processing per hour of audio. On CPU-first hardware it is dramatically slower: processing a 1-hour audio file could take hours. For a batch offline pipeline this may be technically tolerable, but it is a poor default out of the box.
|
||||
|
||||
2. **HuggingFace gated model**. `pyannote/speaker-diarization-community-1` requires (a) a HuggingFace account, (b) token creation at `hf.co/settings/tokens`, and (c) acceptance of the model's usage conditions. This breaks the zero-install/zero-registration ethos of the project for a feature that is optional. Models can be cloned locally for offline use, but the registration step remains manual.
|
||||
|
||||
3. **Complexity is additive**. Adding diarization means shipping a second heavy neural model (the embedding model used by pyannote) in addition to Whisper. Each adds download time, disk space, and memory. V1 should demonstrate the full pipeline end-to-end with minimal moving parts.
|
||||
|
||||
4. **The IR is already designed for it** (see Section 5). Diarization can be layered on as a post-transcription enrichment stage in V2 without touching extractors or the IR schema.
|
||||
|
||||
### For V2: recommended approach
|
||||
|
||||
- **WhisperX** as the diarization-aware wrapper: it combines faster-whisper transcription + wav2vec2 alignment + pyannote/speaker-diarization-community-1 into a single Python call.
|
||||
- For Arabic word-level alignment: supply `jonatasgrosman/wav2vec2-large-xlsr-53-arabic` manually via `whisperx.load_align_model(language_code="ar", device="cpu", model_name="jonatasgrosman/wav2vec2-large-xlsr-53-arabic")` (`device` is a required argument).
|
||||
- License: community-1 is CC-BY-4.0 (attribution required, no viral copyleft). Whisper is MIT. wav2vec2 models on HuggingFace vary — check per model.
|
||||
- Plan for HF token management: a dedicated `HF_TOKEN` env var in config/`.env`, documented in the README.
|
||||
|
||||
---
|
||||
|
||||
## 5. Long-Media Handling and IR Segment Mapping
|
||||
|
||||
### VAD-based chunking (built into faster-whisper)
|
||||
|
||||
faster-whisper integrates Silero VAD natively. Enable with:
|
||||
|
||||
```python
|
||||
from faster_whisper import WhisperModel
|
||||
|
||||
model = WhisperModel("large-v3", device="cpu", compute_type="int8")
|
||||
segments, info = model.transcribe(
|
||||
audio_path,
|
||||
language="ar", # or "en"; None = auto-detect
|
||||
vad_filter=True, # skip silent regions
|
||||
vad_parameters=dict(
|
||||
min_silence_duration_ms=500,
|
||||
speech_pad_ms=400,
|
||||
),
|
||||
word_timestamps=True, # enables word-level start/end
|
||||
beam_size=5,
|
||||
)
|
||||
```
|
||||
|
||||
Silero VAD is cheap: per its documentation, one audio chunk of about 30 ms is processed in under 1 ms on a single CPU thread. faster-whisper restores the original absolute timestamps after silence is removed, so segment `start`/`end` times always reference the original audio clock.
|
||||
|
||||
### Per-file timeout and size limit
|
||||
|
||||
Both enforced at the `AudioVideoExtractor` level before and during processing:
|
||||
|
||||
```python
import multiprocessing as mp
import os
import queue

MAX_MEDIA_BYTES = config.max_media_size_bytes  # e.g., 10 GB
TRANSCRIPTION_TIMEOUT = config.transcription_timeout_seconds  # e.g., 7200


def extract_audio_video(path: str, ...) -> CanonicalDocument:
    size = os.path.getsize(path)
    if size > MAX_MEDIA_BYTES:
        return CanonicalDocument.skipped(path, reason="file_too_large")

    # _transcribe_worker puts its CanonicalDocument on `results` when done.
    results = mp.Queue()
    worker = mp.Process(target=_transcribe_worker, args=(path, results, ...))
    worker.start()
    try:
        return results.get(timeout=TRANSCRIPTION_TIMEOUT)
    except queue.Empty:
        return CanonicalDocument.failed(path, reason="transcription_timeout")
    finally:
        if worker.is_alive():
            worker.kill()  # terminate a hung worker instead of waiting for it
        worker.join()
```

Running the transcription in a separate process (not a thread) is important: faster-whisper's CTranslate2 releases the GIL, but a hung model may not respond to thread interrupts, whereas a child process can be killed. A bare `concurrent.futures.ProcessPoolExecutor` is not enough for this, because `future.result(timeout=...)` only stops waiting and leaving the executor's `with` block then blocks until the worker finishes; an explicit `multiprocessing.Process` can be killed on timeout.
|
||||
|
||||
### IR segment schema
|
||||
|
||||
The `CanonicalDocument` for audio/video files carries a `transcript_segments` field that holds the structured segments, alongside the flat `content` string. This maps directly to the IR design from Section 7 / the ADR. Proposed field additions to the IR for audio/video:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "sha256:...",
|
||||
"source_type": "audio_video",
|
||||
"path": "/abs/path/to/recording.mp4",
|
||||
"detected_mime": "video/mp4",
|
||||
"media_metadata": {
|
||||
"duration_seconds": 3612.4,
|
||||
"codec": "h264/aac",
|
||||
"sample_rate_hz": 44100,
|
||||
"channels": 2
|
||||
},
|
||||
"extractor": "AudioVideoExtractor",
|
||||
"extractor_version": "0.1.0",
|
||||
"asr_model": "large-v3",
|
||||
"asr_model_version": "openai/whisper-large-v3@abc123",
|
||||
"detected_language": "ar",
|
||||
"language_probability": 0.97,
|
||||
"content": "Full concatenated transcript text for keyword search.",
|
||||
"transcript_segments": [
|
||||
{
|
||||
"start": 0.0,
|
||||
"end": 4.32,
|
||||
"text": "مرحبا، كيف حالك؟",
|
||||
"speaker": null,
|
||||
"words": [
|
||||
{"word": "مرحبا", "start": 0.0, "end": 1.1, "probability": 0.98},
|
||||
{"word": "كيف", "start": 1.4, "end": 1.9, "probability": 0.97},
|
||||
{"word": "حالك؟", "start": 2.0, "end": 2.8, "probability": 0.96}
|
||||
]
|
||||
},
|
||||
{
|
||||
"start": 5.1,
|
||||
"end": 9.8,
|
||||
"text": "Hello, how are you today?",
|
||||
"speaker": null,
|
||||
"words": null
|
||||
}
|
||||
],
|
||||
"processing_status": "success",
|
||||
"warnings": [],
|
||||
"errors": []
|
||||
}
|
||||
```
|
||||
|
||||
**Key design decisions:**
|
||||
|
||||
- `content` (flat string) is what gets indexed in the full-text search field. It is the concatenation of all `transcript_segments[].text` with spaces or newlines.
|
||||
- `transcript_segments` is the structured field preserved for UI deep-linking. The UI reads `start` from the segment containing the matched snippet to generate a `?t=<seconds>` jump link.
|
||||
- `words` (word-level timestamps) is optional — only populated when `word_timestamps=True` is passed to the ASR backend. Set to `null` when not available. This avoids bloating the IR for files where word-precision is not needed.
|
||||
- `speaker` is `null` in v1 (diarization deferred). In V2 the diarization enricher overwrites this field in the IR without changing any other layer.
|
||||
- Language detection is done per-file (not per-segment) by Whisper's own language detection pass. For mixed Arabic/English audio, Whisper handles code-switching within a segment reasonably well; a per-segment language field can be added in V2 if needed.
|
||||
- `asr_model_version` stores the HuggingFace model ID + commit hash so the reindex command in Section 8 can detect when a model upgrade requires re-transcription.
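The first two decisions above (flat `content` for indexing, `transcript_segments` for deep links) can be illustrated with a short sketch. The function names and the `/media/abc` URL are hypothetical, and matching the snippet by plain substring is an assumption; the real UI would work from the search engine's highlight data.

```python
from typing import Any, Dict, List, Optional


def build_content(segments: List[Dict[str, Any]]) -> str:
    """Concatenate segment texts into the flat, indexable `content` string."""
    return "\n".join(seg["text"].strip() for seg in segments)


def jump_link(base_url: str, segments: List[Dict[str, Any]], snippet: str) -> Optional[str]:
    """Return `base_url?t=<seconds>` for the first segment containing `snippet`."""
    for seg in segments:
        if snippet in seg["text"]:
            return f"{base_url}?t={int(seg['start'])}"
    return None


segments = [
    {"start": 0.0, "end": 4.32, "text": "مرحبا، كيف حالك؟"},
    {"start": 5.1, "end": 9.8, "text": "Hello, how are you today?"},
]
content = build_content(segments)
link = jump_link("/media/abc", segments, "how are you")
```

Truncating `start` to whole seconds lands the player slightly before the matched speech, which is usually what a reader wants from a jump link.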
|
||||
|
||||
### Progress for long files
|
||||
|
||||
faster-whisper's `transcribe()` returns a generator of segments. Consume it lazily and write each segment to the IR as it arrives:
|
||||
|
||||
```python
segments_gen, info = model.transcribe(path, ...)
segment_list = []
for segment in segments_gen:  # yields as they complete
    segment_list.append(segment)
    # optionally emit progress: (segment.start / info.duration) * 100
```
|
||||
|
||||
Transcription runs only as the generator is consumed, so segments can be written out before the whole file has been processed, and the loop gives a natural progress hook.
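A possible shape for that loop, mapping each segment to a `transcript_segments` entry and reporting progress, is sketched below. `to_ir_segments` is a hypothetical name; the attribute names (`start`, `end`, `text`, `words`, and per-word `word`/`start`/`end`/`probability`) follow faster-whisper's segment objects, and `SimpleNamespace` stand-ins are used so the sketch runs without a model.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Iterator, Tuple


def to_ir_segments(
    segments: Iterable[Any], duration: float
) -> Iterator[Tuple[Dict[str, Any], float]]:
    """Lazily yield (IR segment entry, percent complete) pairs."""
    for seg in segments:
        words = None
        if getattr(seg, "words", None):
            words = [
                {"word": w.word, "start": w.start, "end": w.end,
                 "probability": w.probability}
                for w in seg.words
            ]
        entry = {
            "start": seg.start,
            "end": seg.end,
            "text": seg.text.strip(),
            "speaker": None,  # diarization is deferred to V2
            "words": words,
        }
        percent = min(100.0, 100.0 * seg.end / duration) if duration > 0 else 100.0
        yield entry, percent


# Stand-ins for faster-whisper segments, for demonstration only.
fake_segments = [
    SimpleNamespace(start=0.0, end=5.0, text=" hello ", words=None),
    SimpleNamespace(
        start=5.0, end=10.0, text="world",
        words=[SimpleNamespace(word="world", start=5.0, end=6.0, probability=0.9)],
    ),
]
demo = list(to_ir_segments(fake_segments, duration=10.0))
```

With a real model the call would be `to_ir_segments(segments_gen, info.duration)`.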
|
||||
|
||||
---
|
||||
|
||||
## 6. Risks and Mitigations
|
||||
|
||||
| Risk | Severity | Mitigation |
|---|---|---|
| Arabic WER 15–25% with large-v3 is too high for production use | High | Use fine-tuned Arabic Whisper models (e.g. `Byne/whisper-large-v3-arabic` on HuggingFace) as a configurable model override in the `ModelBackend`. Document this as a v1 known limitation. |
| large-v3-turbo Arabic accuracy regression | Medium | Default to `large-v3`; allow turbo as an explicit opt-in per `config.asr_model`. |
| Very long files (feature films, 8-hour recordings) overwhelm RAM with in-memory audio | Medium | Stream via `ffmpeg` pipe + `soundfile` chunked read rather than loading full WAV to RAM. In v1, enforce `max_media_size_bytes` to avoid the problem entirely. |
| Transcription hangs on corrupt / unusual media | High | Worker-process timeout (see Section 5) kills the hung worker. Log as `failed` in State Store. |
| pyannote gated HuggingFace model (diarization) | Medium | Deferred to V2; document token requirement in advance so the operator can register. |
| ffmpeg not found outside Docker on Windows | Medium | `static-ffmpeg` `add_paths()` at startup; clear error message with install instructions if the binary is missing. |
| Whisper hallucinations on silence / noise | Medium | Enable `vad_filter=True` (Silero VAD removes silence before passing to Whisper). |
| Mixed Arabic-English audio (code-switching) per-segment language misdetection | Low | Whisper large-v3 handles code-switching at the model level. No action in v1; monitor WER in tests. |
| GPU-in-Docker complexity on Windows (WSL2 + NVIDIA Container Toolkit) | Medium | Per brief, GPU is opt-in and the model server can run on the host (outside Docker) behind the `ModelBackend` interface. CPU default avoids this entirely in v1. |
| `jonatasgrosman/wav2vec2-large-xlsr-53-arabic` model quality for Arabic word alignment | Low | V2 concern; in v1 we only use segment-level timestamps from faster-whisper directly. |
|
||||
|
||||
---
|
||||
|
||||
## Appendix: Package Versions and URLs
|
||||
|
||||
| Package | Version (Jul 2026) | PyPI / URL |
|
||||
|---|---|---|
|
||||
| faster-whisper | 1.2.1 | https://pypi.org/project/faster-whisper/ |
|
||||
| whisperx | 3.8.6 | https://pypi.org/project/whisperx/ |
|
||||
| pyannote-audio | 3.4.0 | https://pypi.org/project/pyannote-audio/ |
|
||||
| static-ffmpeg | 2.13 | https://pypi.org/project/static-ffmpeg/ |
|
||||
| ffmpeg-update | latest (Jun 2026) | https://pypi.org/project/ffmpeg-update/ |
|
||||
| openai-whisper | — | https://github.com/openai/whisper (reference only) |
|
||||
| whisper.cpp | 1.8.3 | https://github.com/ggml-org/whisper.cpp |
|
||||
| pyannote/speaker-diarization-community-1 | — | https://huggingface.co/pyannote/speaker-diarization-community-1 |
|
||||
| wav2vec2 Arabic alignment model | — | https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-arabic |
|
||||
| Byne/whisper-large-v3-arabic (fine-tuned) | — | https://huggingface.co/Byne/whisper-large-v3-arabic |
|
||||
563
docs/research/E-frontend-ux.md
Normal file
|
|
@ -0,0 +1,563 @@
|
|||
# Agent E Research: Search Frontend & UX
|
||||
|
||||
**Date:** 2026-07-01
|
||||
**Scope:** FastAPI + Jinja2 + HTMX search UI, SearchProvider interface, Arabic RTL, deep-link/provenance UX, alternatives.
|
||||
**Decision already made:** Option B — UI queries a thin Python API wrapping `SearchProvider`; engine-agnostic. Chosen tech: FastAPI + Jinja2 + HTMX. Single-user, no access control. Languages: Arabic + English (RTL matters). v1 = keyword search only.
|
||||
|
||||
---
|
||||
|
||||
## 1. Summary and Recommended v1 UI Structure
|
||||
|
||||
### Confirmed stack and versions (verified July 2026)
|
||||
|
||||
| Library | Latest stable | Notes |
|
||||
|---|---|---|
|
||||
| FastAPI | 0.136.1 | <https://pypi.org/project/fastapi/> |
|
||||
| HTMX | 2.0.10 | Stable; v4.0.0-alpha (fetch-based) targets early 2027 stable. Pin 2.x for v1. <https://htmx.org/> |
|
||||
| Jinja2 | 3.x (bundled with Starlette/FastAPI) | No breaking changes |
|
||||
| meilisearch (official sync) | 0.41.0 | <https://pypi.org/project/meilisearch/> |
|
||||
| meilisearch-python-sdk (community async) | 7.2.1 | <https://pypi.org/project/meilisearch-python-sdk/> — prefer this for async FastAPI |
|
||||
| @meilisearch/instant-meilisearch | 0.30.0 | npm; only relevant for Option A (not chosen for v1) |
|
||||
|
||||
**Recommendation:** use `meilisearch-python-sdk` (the async community SDK) for the Meilisearch adapter. It supports `async/await`, matches FastAPI's async model, and is actively maintained.
|
||||
|
||||
---
|
||||
|
||||
### Module layout
|
||||
|
||||
```
|
||||
src/digger/ui/
|
||||
├── app.py # FastAPI app factory; mounts /api, /static
|
||||
├── routes/
|
||||
│ ├── search.py # GET / and GET /search (full+partial)
|
||||
│ └── status.py # GET /status (admin view)
|
||||
├── providers/
|
||||
│ ├── base.py # SearchProvider Protocol + data classes
|
||||
│ └── meilisearch.py # Meilisearch adapter
|
||||
├── templates/
|
||||
│ ├── base.html # HTML skeleton with RTL, fonts, HTMX script
|
||||
│ ├── search.html # Full search page (extends base)
|
||||
│ ├── partials/
|
||||
│ │ ├── results.html # HTMX target: hit list + pagination bar
|
||||
│ │ ├── hit.html # One result card (included by results.html)
|
||||
│ │ ├── facets.html # Facet sidebar (HTMX-swapped on filter change)
|
||||
│ │ └── pagination.html # Page links (included by results.html)
|
||||
│ └── status.html # Admin/indexing status
|
||||
└── static/
|
||||
├── htmx.min.js # Vendored; pin 2.0.10
|
||||
└── style.css # Minimal CSS; logical properties for RTL
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Routes
|
||||
|
||||
| Route | Response type | HTMX trigger |
|
||||
|---|---|---|
|
||||
| `GET /` | Full HTML page (empty query) | — |
|
||||
| `GET /search?q=&page=&file_type=&lang=&sort=` | Full page **or** partial `#results` fragment | Detects `HX-Request` header |
|
||||
| `GET /status` | Full HTML page | — |
|
||||
|
||||
The single `/search` endpoint detects the `HX-Request: true` header sent by HTMX on every XHR. When present, it renders and returns only `partials/results.html` (and optionally updates the facets via `HX-Trigger` response header). When absent (direct URL access, back/forward navigation), it returns the full page with results already embedded — making deep links and bookmarks work out of the box.
|
||||
|
||||
```python
# routes/search.py (sketch)
from fastapi import APIRouter, Query, Request

router = APIRouter()
# `provider` (a SearchProvider) and `templates` (Jinja2Templates) are created in app.py.


@router.get("/search")
async def search(
    request: Request,
    q: str = "",
    page: int = 1,
    # The form fields are named `file_type[]` / `lang[]`, so alias the parameters.
    file_type: list[str] = Query(default=[], alias="file_type[]"),
    lang: list[str] = Query(default=[], alias="lang[]"),
    sort: str = "relevance",
):
    result = await provider.search(
        q,
        page=page,
        filters={"mime_type": file_type, "language": lang},
        sort=sort,
        facet_attributes=["mime_type", "language", "source_folder"],
    )

    # HTMX requests get only the results fragment; direct loads get the full page.
    is_htmx = request.headers.get("HX-Request") == "true"
    template = "partials/results.html" if is_htmx else "search.html"
    return templates.TemplateResponse(
        request=request,
        name=template,
        context={"result": result, "q": q},  # plus page, filters, sort
    )
```
|
||||
|
||||
---
|
||||
|
||||
### HTMX interaction patterns
|
||||
|
||||
#### Search-as-you-type with debounce
|
||||
|
||||
```html
|
||||
<!-- search.html (the input) -->
|
||||
<input
|
||||
id="q"
|
||||
type="search"
|
||||
name="q"
|
||||
value="{{ q }}"
|
||||
placeholder="ابحث… / Search…"
|
||||
dir="auto"
|
||||
autocomplete="off"
|
||||
hx-get="/search"
|
||||
hx-trigger="input changed delay:300ms, keyup[key=='Enter']"
|
||||
hx-target="#results"
|
||||
hx-swap="innerHTML"
|
||||
hx-push-url="true"
|
||||
hx-include="[name='page'],[name='file_type[]'],[name='lang[]'],[name='sort']"
|
||||
hx-sync="this:replace"
|
||||
/>
|
||||
```
|
||||
|
||||
Key attributes explained:
|
||||
- `hx-trigger="input changed delay:300ms"` — debounces: fires 300 ms after the last keystroke, resets timer on each new keystroke. `changed` prevents firing if value did not change (e.g., arrow keys). ([htmx.org/attributes/hx-trigger](https://htmx.org/attributes/hx-trigger/))
|
||||
- `hx-sync="this:replace"` — if a prior request is still in flight, abort it and send this one instead, preventing stale results overwriting fresh ones. ([htmx.org/attributes/hx-sync](https://htmx.org/attributes/hx-sync/))
|
||||
- `hx-push-url="true"` — updates the browser address bar (`/search?q=...`), enabling bookmarking and browser back/forward. ([htmx.org/attributes/hx-push-url](https://htmx.org/attributes/hx-push-url/))
|
||||
- `dir="auto"` — browser auto-detects text direction per input value; switches correctly between Arabic and English.
|
||||
|
||||
#### Facet filters
|
||||
|
||||
Each facet checkbox issues a GET to `/search` and replaces `#results`. The facets sidebar is refreshed via the `HX-Trigger` response header pattern:
|
||||
|
||||
```python
|
||||
# After search, attach a custom event so facets re-render
|
||||
response.headers["HX-Trigger"] = '{"facetsUpdated": {}}'
|
||||
```
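
The header value has to be valid JSON when it carries event details. A minimal sketch, assuming a hypothetical helper name `hx_trigger_value` (not part of the design), that builds the value with `json.dumps` rather than a hand-written string:

```python
import json


def hx_trigger_value(events: dict[str, dict]) -> str:
    """Serialise HTMX event names and their detail payloads for the HX-Trigger header."""
    # ensure_ascii=True keeps the value header-safe even if a detail carries Arabic text;
    # compact separators keep the header short.
    return json.dumps(events, separators=(",", ":"), ensure_ascii=True)


# Hypothetical usage inside the route: response.headers["HX-Trigger"] = header
header = hx_trigger_value({"facetsUpdated": {}})
```

Building the value from a dict means an event detail added later (for example the active query) cannot break the quoting.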
|
||||
|
||||
```html
|
||||
<!-- Facet checkbox -->
|
||||
<input type="checkbox" name="file_type[]" value="application/pdf"
|
||||
hx-get="/search"
|
||||
hx-trigger="change"
|
||||
hx-target="#results"
|
||||
hx-swap="innerHTML"
|
||||
hx-include="#q,[name='page'],[name='lang[]'],[name='sort']"
|
||||
hx-push-url="true"
|
||||
/>
|
||||
```
|
||||
|
||||
#### Pagination
|
||||
|
||||
Use standard page-based pagination links rendered by `partials/pagination.html`, enhanced with HTMX `hx-boost` so clicking a page link replaces only `#results` (not the full page). HTMX `hx-boost` is simpler than custom `hx-get` per link when the target is always the same:
|
||||
|
||||
```html
|
||||
<div id="results" hx-boost="true" hx-target="#results" hx-swap="innerHTML">
|
||||
<!-- pagination links here; hx-boost intercepts clicks -->
|
||||
</div>
|
||||
```
|
||||
|
||||
Alternatively: explicit `hx-get="/search?q=...&page=N"` on each page `<a>`. The server uses `page`+`hitsPerPage` (page-based Meilisearch pagination) which returns exact `totalHits` and `totalPages`.
|
||||
|
||||
**No infinite scroll in v1.** Infinite scroll degrades poorly with faceted search (resetting scroll position on filter changes is jarring) and is harder to make accessible. Keep it as a v2 option. If desired later, the HTMX pattern is `hx-trigger="revealed"` on the last result row. ([htmx.org/examples/infinite-scroll](https://htmx.org/examples/infinite-scroll/))
|
||||
|
||||
---
|
||||
|
||||
## 2. SearchProvider Interface
|
||||
|
||||
The interface is the firewall that prevents any Meilisearch concept from leaking into templates. The Jinja2 templates depend only on the types defined here.
|
||||
|
||||
```python
|
||||
# src/digger/ui/providers/base.py
|
||||
from __future__ import annotations
|
||||
from dataclasses import dataclass, field
|
||||
from typing import Literal, Protocol
|
||||
|
||||
|
||||
@dataclass
|
||||
class Hit:
|
||||
"""One search result, engine-agnostic."""
|
||||
id: str # stable document ID (from IR)
|
||||
path: str # absolute path to the source file
|
||||
filename: str # basename for display
|
||||
mime_type: str # e.g. "application/pdf"
|
||||
source_folder: str # parent directory (for facet display)
|
||||
language: str | None # BCP-47, e.g. "ar", "en"
|
||||
modified_at: str | None # ISO 8601 date string
|
||||
snippet: str | None # pre-rendered HTML with <mark> tags; mark safe in template
|
||||
# Provenance fields for deep links
|
||||
page_number: int | None = None # PDF: 1-based page
|
||||
timestamp_seconds: float | None = None # audio/video: seconds into media
|
||||
slide_number: int | None = None # PPTX: 1-based slide
|
||||
sheet_name: str | None = None # XLSX: sheet name
|
||||
score: float | None = None # relevance score 0–1 (optional)
|
||||
|
||||
|
||||
@dataclass
|
||||
class FacetBucket:
|
||||
value: str
|
||||
count: int
|
||||
|
||||
|
||||
@dataclass
|
||||
class SearchResult:
|
||||
query: str
|
||||
hits: list[Hit]
|
||||
total_hits: int
|
||||
page: int
|
||||
hits_per_page: int
|
||||
total_pages: int
|
||||
processing_time_ms: int
|
||||
facets: dict[str, list[FacetBucket]] = field(default_factory=dict)
|
||||
# dict key = facet attribute name (e.g. "mime_type"); value = sorted buckets
|
||||
|
||||
|
||||
class SearchProvider(Protocol):
|
||||
async def search(
|
||||
self,
|
||||
query: str,
|
||||
*,
|
||||
page: int = 1,
|
||||
hits_per_page: int = 20,
|
||||
filters: dict[str, list[str]] | None = None,
|
||||
sort: str | None = None,
|
||||
mode: Literal["keyword", "semantic", "hybrid"] = "keyword",
|
||||
facet_attributes: list[str] | None = None,
|
||||
) -> SearchResult:
|
||||
"""Full search with optional faceting and filtering."""
|
||||
...
|
||||
|
||||
async def suggest(
|
||||
self,
|
||||
prefix: str,
|
||||
*,
|
||||
limit: int = 5,
|
||||
) -> list[str]:
|
||||
"""Typeahead suggestions (future; v1 may return empty list)."""
|
||||
...
|
||||
|
||||
async def health(self) -> bool:
|
||||
"""True if the backend is reachable."""
|
||||
...
|
||||
```
|
||||
|
||||
**Design notes:**
|
||||
|
||||
- `filters` uses a plain dict (`{"mime_type": ["application/pdf"], "language": ["ar"]}`). The Meilisearch adapter converts this to `"mime_type = 'application/pdf' AND language = 'ar'"`; several values for the same attribute are OR-ed, and different attributes are AND-ed. A different adapter translates to its own syntax, so the UI never knows.
|
||||
- `sort` is a simple string token like `"modified_at:desc"` or `"relevance"`. Each adapter maps it to its own sort API.
|
||||
- `mode` is in the interface now (v1 always receives `"keyword"`); adding semantic/hybrid later is zero-churn on the UI.
|
||||
- `snippet` is already rendered HTML (safe to pass through `{{ hit.snippet | safe }}`). The adapter is responsible for converting engine-specific highlight markers into `<mark>` tags **before** returning the `Hit`. This keeps any Meilisearch field name (`_formatted`, `highlightPreTag`, etc.) out of templates.
|
||||
- `facets` returns a `list[FacetBucket]` per attribute (not a raw dict of dicts) so templates can iterate cleanly without knowing the engine's `facetDistribution` shape.
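
The `filters` translation described in the notes above can be sketched as follows. This is an illustration under stated assumptions, not the adapter's final code: the function name `build_filter` is hypothetical, it targets Meilisearch's nested-array filter form (inner lists OR-ed, outer list AND-ed), and the backslash escaping of quotes should be verified against the Meilisearch filter documentation.

```python
from __future__ import annotations


def build_filter(filters: dict[str, list[str]] | None) -> list[list[str]]:
    """Translate {"attr": [values]} into a nested-array filter.

    Each inner list is OR-ed (values of one attribute); the outer list is AND-ed.
    """
    groups: list[list[str]] = []
    for attr, values in (filters or {}).items():
        if not values:
            continue  # an empty selection means "no constraint on this attribute"
        group: list[str] = []
        for value in values:
            # Assumption: backslash-escape backslashes and double quotes in values.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            group.append(f'{attr} = "{escaped}"')
        groups.append(group)
    return groups
```

Skipping empty lists matters here because the route passes `[]` for every facet the user has not touched.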
|
||||
|
||||
---
|
||||
|
||||
## 3. Highlighting, Faceting, and Pagination Mechanics
|
||||
|
||||
### Highlighting flow (Meilisearch → adapter → template)
|
||||
|
||||
Meilisearch returns highlights in the `_formatted` sub-object when `attributesToHighlight` is set ([meilisearch.com/docs/reference/api/search](https://www.meilisearch.com/docs/reference/api/search)):
|
||||
|
||||
```python
|
||||
# Inside MeilisearchProvider.search()
|
||||
raw = await index.search(
|
||||
query,
|
||||
attributes_to_highlight=["content_text"],
|
||||
attributes_to_crop=["content_text"],
|
||||
crop_length=30, # words around match
|
||||
crop_marker="…", # boundary marker (default)
|
||||
highlight_pre_tag="<mark>", # use <mark> not <em>; CSS-controllable
|
||||
highlight_post_tag="</mark>",
|
||||
facets=facet_attributes or [],
|
||||
filter=_build_filter(filters),
|
||||
sort=_map_sort(sort),
|
||||
page=page,
|
||||
hits_per_page=hits_per_page,
|
||||
)
|
||||
|
||||
for raw_hit in raw.hits:
|
||||
formatted = raw_hit.get("_formatted", {})
|
||||
snippet = formatted.get("content_text") or html.escape(raw_hit.get("content_text", "")[:200])  # plain-text fallback is escaped; needs "import html"
|
||||
hits.append(Hit(
|
||||
id=raw_hit["id"],
|
||||
snippet=snippet, # HTML string with <mark> already in it
|
||||
...
|
||||
))
|
||||
```
|
||||
|
||||
The template renders:
|
||||
```html
|
||||
{% if hit.snippet %}
|
||||
<p class="snippet" lang="{{ hit.language or 'und' }}" dir="auto">
|
||||
{{ hit.snippet | safe }}
|
||||
</p>
|
||||
{% endif %}
|
||||
```
|
||||
|
||||
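`dir="auto"` on the snippet paragraph lets the browser choose RTL/LTR per content. `| safe` is acceptable only if the adapter guarantees that `snippet` contains no markup other than `<mark>`. Meilisearch inserts the highlight tags into the stored text and does not HTML-escape that text, so the adapter must escape document content itself before returning the `Hit`.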
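
One way to keep `snippet` limited to `<mark>` tags is to ask the engine for sentinel markers, escape everything, and only then turn the sentinels into tags. A minimal sketch, assuming hypothetical names (`HL_PRE`, `HL_POST`, `render_snippet`) and private-use code points as sentinels; it is not the adapter's confirmed implementation:

```python
import html

# Hypothetical sentinels: private-use code points that should not occur in real documents.
# The adapter would pass them as highlight_pre_tag / highlight_post_tag instead of <mark>.
HL_PRE = "\ue000"
HL_POST = "\ue001"


def render_snippet(formatted_text: str) -> str:
    """Escape document text, then convert highlight sentinels into <mark> tags."""
    escaped = html.escape(formatted_text, quote=False)  # neutralises <, > and &
    return escaped.replace(HL_PRE, "<mark>").replace(HL_POST, "</mark>")
```

With this ordering, markup that came from an indexed document is rendered as text, and the only tags reaching the template are the ones the adapter inserted.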
|
||||
|
||||
**What NEVER appears in templates:** `_formatted`, `highlightPreTag`, `facetDistribution`, `facetStats`, `estimatedTotalHits`. These are all translated in the adapter.
|
||||
|
||||
### Faceting flow
|
||||
|
||||
```python
|
||||
# Adapter translates facetDistribution
|
||||
facets_out: dict[str, list[FacetBucket]] = {}
|
||||
for attr, dist in (raw.facet_distribution or {}).items():
|
||||
facets_out[attr] = [
|
||||
FacetBucket(value=v, count=c)
|
||||
for v, c in sorted(dist.items(), key=lambda x: -x[1])
|
||||
]
|
||||
```
|
||||
|
||||
Template:
|
||||
```html
|
||||
<!-- partials/facets.html -->
|
||||
{% for attr, buckets in result.facets.items() %}
|
||||
<fieldset>
|
||||
<legend>{{ facet_labels[attr] }}</legend> {# human label from config #}
|
||||
{% for bucket in buckets %}
|
||||
<label>
|
||||
<input type="checkbox" name="{{ attr }}[]" value="{{ bucket.value }}"
|
||||
{% if bucket.value in active_filters.get(attr, []) %}checked{% endif %}
|
||||
hx-get="/search" hx-trigger="change" hx-target="#results"
|
||||
hx-include="#q,[name='page']" hx-push-url="true" />
|
||||
{{ bucket.value | display_mime }} {# template filter humanises mime type #}
|
||||
<span class="count">({{ bucket.count }})</span>
|
||||
</label>
|
||||
{% endfor %}
|
||||
</fieldset>
|
||||
{% endfor %}
|
||||
```
|
||||
|
||||
### Pagination
|
||||
|
||||
Use Meilisearch page-based pagination (`page` + `hitsPerPage`) rather than offset-based. Page-based returns **exact** `totalHits` and `totalPages` (at some extra cost). For a search UI showing "page 2 of 14", exact counts are required.
|
||||
|
||||
```
|
||||
GET /search → SearchResult.total_pages, .page, .hits_per_page → renders pagination links
|
||||
```
|
||||
|
||||
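**Index setting:** `pagination.maxTotalHits` in the Meilisearch index settings caps how many hits a query can page through (default 1000); `totalHits` and `totalPages` never exceed that cap. Keep it at a sensible value (e.g. 1000) to bound the cost of the exhaustive count.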
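
Meilisearch returns `totalPages` directly, so the adapter can copy it into `SearchResult`. For rendering the links, and for a future engine that only reports a hit count, the arithmetic can be sketched as below. The helper names `total_pages` and `page_window` are hypothetical, as is the default `radius` of two pages on each side.

```python
import math


def total_pages(total_hits: int, hits_per_page: int) -> int:
    """Number of pages needed to show total_hits results."""
    if hits_per_page <= 0:
        raise ValueError("hits_per_page must be positive")
    return math.ceil(total_hits / hits_per_page) if total_hits > 0 else 0


def page_window(page: int, total: int, radius: int = 2) -> list[int]:
    """Page numbers to render around the current page, clamped to 1..total."""
    if total <= 0:
        return []
    page = min(max(page, 1), total)  # tolerate out-of-range page numbers from the URL
    start = max(1, page - radius)
    end = min(total, page + radius)
    return list(range(start, end + 1))
```

Clamping `page` keeps a hand-edited URL such as `?page=99` from producing an empty pagination bar.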
|
||||
|
||||
---
|
||||
|
||||
## 4. Arabic RTL Notes
|
||||
|
||||
### Base layout
|
||||
|
||||
```html
|
||||
<!-- base.html -->
|
||||
<html lang="ar" dir="rtl">
|
||||
```
|
||||
|
||||
Setting `dir="rtl"` at the root cascades to all children. CSS Flexbox and Grid automatically mirror layout direction when `dir=rtl` is present — no manual `margin-left/right` needed if you use logical CSS properties throughout.
|
||||
|
||||
### CSS logical properties (required)
|
||||
|
||||
Never use `margin-left`, `padding-right`, `border-left`, `float: left` in the stylesheet. Use logical equivalents:
|
||||
|
||||
| Physical | Logical (use these) |
|
||||
|---|---|
|
||||
| `margin-left` | `margin-inline-start` |
|
||||
| `margin-right` | `margin-inline-end` |
|
||||
| `padding-left` | `padding-inline-start` |
|
||||
| `text-align: left` | `text-align: start` |
|
||||
| `float: left` | No float; use flex/grid |
|
||||
|
||||
### Mixed Arabic + English (bidi)
|
||||
|
||||
Arabic paragraphs containing English words or file paths are the most common case.
|
||||
|
||||
1. **Search input:** `dir="auto"` — browser detects from first strong bidi character. Switches correctly as the user types Arabic or English. ([w3.org/International](https://www.w3.org/International/articles/inline-bidi-markup/))
|
||||
2. **Result snippets:** `dir="auto"` on the `<p class="snippet">` element. Works for mixed content.
|
||||
3. **File paths and numbers in RTL context:** wrap with `<bdi>` to isolate:
|
||||
```html
|
||||
<span class="path"><bdi>{{ hit.path }}</bdi></span>
|
||||
<span class="page">p. <bdi>{{ hit.page_number }}</bdi></span>
|
||||
```
|
||||
`<bdi>` (Bidirectional Isolate) prevents a LTR path like `/home/user/docs/file.pdf` from garbling surrounding RTL text.
|
||||
4. **Numbers:** both Western (1,2,3) and Eastern Arabic (١,٢,٣) numerals read left-to-right even within RTL text; no special treatment needed for page numbers.
|
||||
|
||||
### Typography
|
||||
|
||||
- **Recommended font stack:**
|
||||
```css
|
||||
body {
|
||||
font-family: "Cairo", "Tajawal", "Noto Sans Arabic", system-ui, sans-serif;
|
||||
line-height: 1.7; /* Arabic needs ~20-30% more line-height than Latin */
|
||||
}
|
||||
```
|
||||
Cairo and Noto Sans Arabic are available on Google Fonts ([fonts.google.com/noto/specimen/Noto+Sans+Arabic](https://fonts.google.com/noto/specimen/Noto+Sans+Arabic), [fonts.google.com/specimen/Cairo](https://fonts.google.com/specimen/Cairo)). Self-host them (download and serve from `/static/fonts/`) to avoid external network calls consistent with the project's local-first principle.
|
||||
- **Never add `letter-spacing` to Arabic text.** Arabic characters are joined glyphs; letter-spacing breaks the joins and produces corrupted-looking text. Restrict any `letter-spacing` rules to Latin-script selectors if used at all.
|
||||
|
||||
### Search highlights in Arabic text
|
||||
|
||||
`<mark>` inside RTL text renders correctly without extra handling. The only edge case is when a highlight spans an Arabic-English word boundary — `dir="auto"` on the containing element handles this correctly.
|
||||
|
||||
---
|
||||
|
||||
## 5. Deep-link and Provenance UX
|
||||
|
||||
Every result card must show provenance and allow the user to locate the original file. The `Hit` dataclass carries all required fields.
|
||||
|
||||
### Result card structure
|
||||
|
||||
```html
|
||||
<!-- partials/hit.html -->
|
||||
<article class="hit">
|
||||
<header>
|
||||
<h3 class="hit-title">
|
||||
<span class="filename"><bdi>{{ hit.filename }}</bdi></span>
|
||||
{% if hit.page_number %}
|
||||
<span class="locator">— p. <bdi>{{ hit.page_number }}</bdi></span>
|
||||
{% elif hit.timestamp_seconds is not none %}
|
||||
<span class="locator">— <bdi>{{ hit.timestamp_seconds | format_timestamp }}</bdi></span>
|
||||
{% elif hit.slide_number %}
|
||||
<span class="locator">— slide <bdi>{{ hit.slide_number }}</bdi></span>
|
||||
{% elif hit.sheet_name %}
|
||||
<span class="locator">— <bdi>{{ hit.sheet_name }}</bdi></span>
|
||||
{% endif %}
|
||||
</h3>
|
||||
<p class="hit-path" dir="ltr"><small><bdi>{{ hit.path }}</bdi></small></p>
|
||||
</header>
|
||||
|
||||
{% if hit.snippet %}
|
||||
<p class="snippet" lang="{{ hit.language or '' }}" dir="auto">{{ hit.snippet | safe }}</p>
|
||||
{% endif %}
|
||||
|
||||
<footer class="hit-meta">
|
||||
<span class="mime">{{ hit.mime_type | display_mime }}</span>
|
||||
{% if hit.modified_at %}
|
||||
<span class="date"><bdi>{{ hit.modified_at | format_date }}</bdi></span>
|
||||
{% endif %}
|
||||
{% if hit.language %}
|
||||
<span class="lang">{{ hit.language | upper }}</span>
|
||||
{% endif %}
|
||||
|
||||
<!-- "Open file" affordance -->
|
||||
<a class="open-link" href="{{ hit | open_uri }}" title="Open original file">
|
||||
Open file
|
||||
</a>
|
||||
</footer>
|
||||
</article>
|
||||
```
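
The card relies on a `format_timestamp` template filter that is not defined in this document. A minimal sketch of what it could look like, assuming the "2m 14s" display style used elsewhere in these notes; the exact output format is an assumption:

```python
def format_timestamp(seconds: float) -> str:
    """Render a media offset in seconds as e.g. "2m 14s" or "1h 02m 05s"."""
    total = max(0, int(seconds))  # drop fractional seconds, never go negative
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}h {minutes:02d}m {secs:02d}s"
    return f"{minutes}m {secs:02d}s"


# Registration with Jinja2 would look like this (assuming a Jinja2Templates instance):
# templates.env.filters["format_timestamp"] = format_timestamp
```

The template wraps the result in `<bdi>`, so the Latin unit letters do not reorder inside an RTL card.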
|
||||
|
||||
### Open-file affordance
|
||||
|
||||
On a single-user local deployment, "open the file" can be:
|
||||
|
||||
1. **`file://` URI link**: `href="file:///path/to/document.pdf#page=3"`. PDF viewers (Acrobat, Evince) honour the `#page=N` fragment. Browsers generally refuse to follow `file://` links from HTML pages served over `http://`, on every platform and not only on Windows, so this option is unreliable for a web UI. A workaround is a small local helper endpoint.
|
||||
2. **`/api/open?path=...&page=...` local API endpoint** — FastAPI endpoint that calls `subprocess.Popen` (or `os.startfile` on Windows) to open the file in the OS default application. Simple to implement, works cross-platform, and keeps `file://` out of the browser. Add a `page` query param and let the OS/app handle it.
|
||||
3. **"Copy path" button** — always show a copy-to-clipboard button alongside the open link; the user can paste into an explorer or terminal.
|
||||
|
||||
Recommended: implement option 2 (`/api/open`) for v1. The endpoint is short and hides the per-platform launcher behind a single URL:
|
||||
|
||||
```python
|
||||
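@router.post("/api/open")
async def open_file(path: str, page: int | None = None):
    import os
    import subprocess
    import sys
    from pathlib import Path

    from fastapi import HTTPException

    # Security check: only open files under the configured source roots.
    # SOURCE_ROOTS is a placeholder for the project's config (a list of resolved Paths).
    target = Path(path).resolve()
    if not any(target.is_relative_to(root) for root in SOURCE_ROOTS):
        raise HTTPException(status_code=403, detail="path is outside the source roots")

    # `page` is accepted but not forwarded: the OS launchers below take a plain
    # path, and a "#page=N" suffix would be treated as part of the file name.
    if sys.platform == "win32":
        os.startfile(str(target))
    elif sys.platform == "darwin":
        subprocess.Popen(["open", str(target)])
    else:
        subprocess.Popen(["xdg-open", str(target)])
    return {"ok": True}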
|
||||
```
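
The source-root validation is the part of this endpoint most worth testing in isolation, because the endpoint launches a program on a caller-supplied path. A minimal sketch, assuming a hypothetical helper `resolve_within_roots` and a list of root directories taken from configuration:

```python
from __future__ import annotations

from pathlib import Path


def resolve_within_roots(raw_path: str, roots: list[Path]) -> Path | None:
    """Return the resolved path if it lies under one of the roots, else None."""
    # resolve() collapses ".." segments and follows symlinks, so traversal
    # tricks are judged by where the path really points.
    candidate = Path(raw_path).resolve()
    for root in roots:
        resolved_root = root.resolve()
        if candidate == resolved_root or resolved_root in candidate.parents:
            return candidate
    return None
```

The endpoint would return HTTP 403 when this helper yields `None` and pass the returned path to the launcher otherwise.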
|
||||
|
||||
### Audio/video deep links
|
||||
|
||||
For audio/video results, `timestamp_seconds` is the key field. Deep-linking into a media file is harder than PDFs because there's no universal URI scheme for media timestamp. Options:
|
||||
|
||||
- **Show timestamp in the result card** ("at 2m 14s") with a "copy path + timestamp" button. Simple and universal.
|
||||
- **HTML5 media preview** — if the media is served locally by the FastAPI app (via `FileResponse`), embed a `<video src="/media?path=...#t=134">` player directly in the result card, pre-seeked via the `#t=N` URI fragment. This works with any browser's HTML5 video player. Recommended for v1 as it requires no external player.
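
Building the `src` for the embedded player needs the path percent-encoded and the offset appended as a media fragment. A minimal sketch, assuming the `/media` route named above and a hypothetical helper `media_url`:

```python
from __future__ import annotations

from urllib.parse import urlencode


def media_url(path: str, timestamp_seconds: float | None = None) -> str:
    """Build the /media URL for a file, optionally pre-seeked with a #t= fragment."""
    url = "/media?" + urlencode({"path": path})  # encodes spaces, slashes, Arabic names
    if timestamp_seconds is not None:
        # Media fragments take whole or fractional seconds; whole seconds are enough here.
        url += f"#t={max(0, int(timestamp_seconds))}"
    return url
```

The fragment is never sent to the server, so the `/media` route only has to validate and stream the file; seeking is handled by the browser's player.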
|
||||
|
||||
### Admin / status view
|
||||
|
||||
`GET /status` shows:
|
||||
- Last pipeline run timestamp and duration
|
||||
- Counts: total indexed, processed since last run, failed, skipped
|
||||
- List of recent failures (path, error message, timestamp) — sourced from the State Store / quarantine log
|
||||
- Backend health indicator (search engine reachable: yes/no)
|
||||
- Index size and document count (from Meilisearch stats endpoint)
|
||||
|
||||
This is a read-only page; no HTMX needed (simple full render). A `<meta http-equiv="refresh" content="30">` tag, or an `hx-get="/status/partial" hx-trigger="every 30s"` on the counts section, provides live updates during a run.
|
||||
|
||||
---
|
||||
|
||||
## 6. Alternatives Comparison
|
||||
|
||||
### Streamlit
|
||||
|
||||
**What it is:** Python script → web app with zero HTML. Renders from top to bottom on every interaction.
|
||||
|
||||
**Cost:** Script-based re-render model means the entire page re-evaluates on each widget interaction. For a search UI, this produces visible flicker and re-renders the entire page (including facets) on every keystroke, even with `st.session_state` caching. No real debounce control. No URL state (direct links to searches not supported out of the box). Poor for RTL layout customisation without raw HTML injection. Font and direction customisation require `st.markdown(..., unsafe_allow_html=True)` hacks.
|
||||
|
||||
**Use if:** you need a 30-minute internal demo, not a production UI.
|
||||
|
||||
### NiceGUI
|
||||
|
||||
**What it is:** event-driven Python UI built on FastAPI + Vue.js (WebSocket-backed). NiceGUI 2.0 shipped in 2025 with Tailwind CSS integration and improved routing.
|
||||
|
||||
**Cost:** More control than Streamlit; supports multi-page apps and proper routing. However, it runs a persistent WebSocket connection per client tab — heavier than the stateless HTTP model. The Vue.js bridge means the server must stay alive (no simple static HTML fallback). RTL support requires configuring Tailwind's `dir` settings. Harder to make genuinely engine-agnostic: the UI logic is Python objects, so swapping to a different backend still requires changes to the event handlers. Less suited to the "UI is just an HTTP client" architecture this project mandates.
|
||||
|
||||
**Use if:** you prefer staying 100% in Python and are comfortable with the WebSocket/Vue runtime dependency.
|
||||
|
||||
### JS InstantSearch / instant-meilisearch (Option A)
|
||||
|
||||
**What it is:** `@meilisearch/instant-meilisearch` (v0.30.0) wraps Algolia's InstantSearch.js, bridging it to Meilisearch. Drop in React (`react-instantsearch`) or vanilla JS components for an off-the-shelf search UI with facets, highlighting, pagination, and search-as-you-type. ([github.com/meilisearch/meilisearch-js-plugins](https://github.com/meilisearch/meilisearch-js-plugins))
|
||||
|
||||
**Cost:**
|
||||
1. **Engine coupling (fatal for this project).** The browser talks directly to Meilisearch — requires exposing the search-only key to the browser. While Meilisearch's Default Search API Key is restricted to search-only actions ([meilisearch.com/docs/resources/self_hosting/security/master_api_keys](https://www.meilisearch.com/docs/resources/self_hosting/security/master_api_keys)), swapping the search engine later means rewriting the entire frontend.
|
||||
2. **Node.js build step.** Adds npm to the toolchain.
|
||||
3. **Arabic RTL:** InstantSearch default widgets are not RTL-aware; requires custom CSS overrides.
|
||||
|
||||
**Use if (v2):** you decide to commit permanently to Meilisearch and want the fastest, richest off-the-shelf search UX. Optionally put it in front of the Option B FastAPI API (rather than the engine directly) to preserve swappability — though you lose InstantSearch's direct real-time API then.
|
||||
|
||||
### Why HTMX + FastAPI + Jinja2 is the right v1 default
|
||||
|
||||
- **Stays in the Option B architecture.** The UI is stateless HTML over HTTP; swapping the search backend touches only the `MeilisearchProvider` class.
|
||||
- **No build step.** No npm, no webpack, no Node.js. The entire UI is Python + HTML + a single vendored `htmx.min.js`.
|
||||
- **RTL is native.** `dir="rtl"` and CSS logical properties work in server-rendered HTML without any framework involvement.
|
||||
- **Bookmarkable / shareable.** `hx-push-url` keeps the URL in sync; direct URL access works because the server renders the same content on full-page load.
|
||||
- **Accessible by default.** Server-rendered HTML with standard `<a>`, `<form>`, `<input>` elements degrades gracefully without JavaScript.
|
||||
- **Replaceability.** If a richer JS frontend is wanted in v2, it replaces only the templates/static layer; the FastAPI routes and `SearchProvider` interface stay unchanged.
|
||||
|
||||
---
|
||||
|
||||
## 7. Risks and Mitigations
|
||||
|
||||
| Risk | Likelihood | Mitigation |
|
||||
|---|---|---|
|
||||
| `dir="auto"` on search input surprises users when they paste a long English query into an RTL page | Medium | Default `dir="rtl"` on the input; add a small `<button>` to toggle `dir` manually if needed |
|
||||
| `file://` open links blocked by browser security policy (pages served over `http://` cannot navigate to `file://`) | High | Use the `/api/open` server-side endpoint instead |
|
||||
| `\| safe` on the `snippet` field is an XSS vector if the adapter ever passes through unescaped document content | Low | The adapter must reduce all HTML to `<mark>…</mark>` tags before populating `snippet`: HTML-escape the text and re-insert only the highlight tags, or use an allowlist sanitiser such as `bleach` (allowlist: `mark` only) |
|
||||
| Meilisearch `_formatted` field absent when no terms match | Medium | Adapter falls back to the first 200 characters of `content_text`, HTML-escaped and with no `<mark>` tags |
|
||||
| HTMX v4 breaks the 2.x API | Low (2025–2026) | Pin `htmx@2.0.10` in static/; upgrade deliberately after v4.0 hits stable (≈ early 2027) |
|
||||
| Large audio/video transcript snippets dominate search results | Medium | Set `cropLength=30` (words); crop to the matched segment; show "at 2m14s" provenance clearly |
|
||||
| Arabic text in Meilisearch snippets: diacritics (tashkeel) affect tokenisation | Medium | Meilisearch strips diacritics during indexing by default; verify Arabic query matches work with and without diacritics in integration tests |
|
||||
| RTL pagination controls render counter-intuitively (page 1 on right, last on left) | Low | Use `dir="ltr"` on the pagination bar element — numerically-ordered controls read left-to-right even in Arabic UIs (established convention) |
|
||||
| `suggest()` (typeahead): Meilisearch has no native suggestions API | Medium | v1: `suggest()` returns an empty list, documented as a known limitation. A later version can implement it as a second, fast search query with `hitsPerPage=5`, relying on prefix matching |
|
||||
|
||||
---
|
||||
|
||||
## Sources
|
||||
|
||||
- FastAPI release notes: <https://fastapi.tiangolo.com/release-notes/>
|
||||
- HTMX 2.x docs: <https://htmx.org/docs/>
|
||||
- HTMX hx-trigger (delay, changed, throttle): <https://htmx.org/attributes/hx-trigger/>
|
||||
- HTMX hx-sync (replace strategy): <https://htmx.org/attributes/hx-sync/>
|
||||
- HTMX hx-push-url: <https://htmx.org/attributes/hx-push-url/>
|
||||
- HTMX infinite scroll example: <https://htmx.org/examples/infinite-scroll/>
|
||||
- Meilisearch Search API reference: <https://www.meilisearch.com/docs/reference/api/search>
|
||||
- Meilisearch security / API keys: <https://www.meilisearch.com/docs/resources/self_hosting/security/master_api_keys>
|
||||
- Meilisearch hybrid/vector search: <https://www.meilisearch.com/docs/capabilities/hybrid_search/overview>
|
||||
- meilisearch-python-sdk (PyPI): <https://pypi.org/project/meilisearch-python-sdk/>
|
||||
- meilisearch official SDK (PyPI): <https://pypi.org/project/meilisearch/>
|
||||
- instant-meilisearch (npm): <https://www.npmjs.com/package/@meilisearch/instant-meilisearch>
|
||||
- meilisearch-js-plugins (GitHub): <https://github.com/meilisearch/meilisearch-js-plugins>
|
||||
- W3C inline bidi markup: <https://www.w3.org/International/articles/inline-bidi-markup/>
|
||||
- Google Fonts — Noto Sans Arabic: <https://fonts.google.com/noto/specimen/Noto+Sans+Arabic>
|
||||
- Google Fonts — Cairo: <https://fonts.google.com/specimen/Cairo>
|
||||
- Google Fonts — Tajawal: <https://fonts.google.com/specimen/Tajawal>
|
||||
- Meilisearch HTMX tutorial (Botmonster): <https://botmonster.com/posts/full-text-search-meilisearch-htmx/>
|
||||
- FastAPI + HTMX guide: <https://blakecrosley.com/guides/fastapi-htmx>
|
||||
- NiceGUI vs Streamlit: <https://www.bitdoze.com/streamlit-vs-nicegui/>
|
||||
570
docs/research/F-forgejo-ci-windows.md
Normal file
|
|
@ -0,0 +1,570 @@
|
|||
# Research F: Forgejo CI & Windows Runner
|
||||
|
||||
**Agent F — July 2026**
|
||||
**Sources consulted**: https://forgejo.org/docs/latest/user/actions/, https://forgejo.org/docs/latest/admin/actions/, https://code.forgejo.org/actions, https://github.com/Crown0815/Forgejo-runner-windows-builder, https://code.forgejo.org/forgejo/runner/releases, `/home/luffy/space/forgejo-stack/` (live inspection)
|
||||
|
||||
---
|
||||
|
||||
## 1. Summary & Recommendation
|
||||
|
||||
The existing forgejo-stack hosts a single Linux/Docker runner (`local-runner`) with labels `docker` and `ubuntu-latest`, using `code.forgejo.org/forgejo/runner:6` and job image `forgejo-stack/job:latest`. This runner can handle all Linux CI work for digger — unit tests, Meilisearch integration tests (via service containers), and caching — without any changes to the stack.
|
||||
|
||||
The **Windows runner requires a separate Windows host or VM**. It cannot be a Docker Compose service because the Crown0815 binary is a native Windows executable that runs CI jobs on the host itself (no container runtime). The most practical path on the user's existing Linux box is a **KVM/QEMU Windows 11 VM** running alongside the Docker stack. The VM registers the runner against Forgejo at the Linux host's LAN IP; no changes to docker-compose.yml are required. The integration is documented via a new script and README addition to forgejo-stack.
|
||||
|
||||
**Recommendation:**
|
||||
- Linux jobs: unit (every push) + Meilisearch integration (every PR to main) on `ubuntu-latest` label. Pin Python 3.12. Use `actions/cache` for pip. Gate heavy OCR/ASR behind `pytest -m heavy` + `workflow_dispatch` with a dedicated runner label.
|
||||
- Windows jobs: Windows-only path/encoding unit tests on `windows` label, triggered on every PR to main and every push to main. No service containers (not supported in host mode).
|
||||
- Stand up the Windows VM and runner registration BEFORE sprint 0 lands; the first commit should already have the workflow files for both tiers.
|
||||
|
||||
---
|
||||
|
||||
## 2. Linux CI Workflow Outline
|
||||
|
||||
### 2.1 Workflow file location
|
||||
|
||||
All workflow files live under `.forgejo/workflows/` in the repo root. The runner picks them up automatically once the repo has Actions enabled. Use fully-qualified `uses:` URLs (see Section 5) — avoid bare `actions/checkout@v4` style references because `DEFAULT_ACTIONS_URL` can be changed by the instance admin.
|
||||
|
||||
### 2.2 Recommended workflow structure
|
||||
|
||||
Two workflow files: `ci.yml` (Linux) and `ci-windows.yml` (Windows). Split by runner label so the Windows job never blocks the Linux job queue.
|
||||
|
||||
#### `.forgejo/workflows/ci.yml` — Linux unit + integration
|
||||
|
||||
```yaml
|
||||
name: CI (Linux)
|
||||
|
||||
on:
|
||||
push:
|
||||
branches: ["**"]
|
||||
pull_request:
|
||||
branches: [main]
|
||||
|
||||
concurrency:
|
||||
group: ${{ github.workflow }}-${{ github.ref }}
|
||||
cancel-in-progress: true
|
||||
|
||||
jobs:
|
||||
# ── Tier 1: Unit tests — runs on every push, no external services ─────────
|
||||
unit:
|
||||
name: Unit tests (Python ${{ matrix.python-version }})
|
||||
runs-on: ubuntu-latest
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
python-version: ["3.12"] # pin; add 3.13 once all dependencies support it
|
||||
|
||||
steps:
|
||||
- uses: https://code.forgejo.org/actions/checkout@v4
|
||||
|
||||
- uses: https://code.forgejo.org/actions/setup-python@v5
|
||||
with:
|
||||
python-version: ${{ matrix.python-version }}
|
||||
|
||||
# Cache pip download cache (not the venv) — keyed on lockfile hash
|
||||
- uses: https://code.forgejo.org/actions/cache@v4
|
||||
with:
|
||||
path: ~/.cache/pip
|
||||
key: pip-${{ matrix.python-version }}-${{ hashFiles('**/requirements*.txt', '**/pyproject.toml') }}
|
||||
restore-keys: pip-${{ matrix.python-version }}-
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
python -m pip install --upgrade pip
|
||||
pip install -e ".[dev]" # dev extras: ruff, mypy, pytest, pytest-cov
|
||||
|
||||
- name: Lint — ruff
|
||||
run: ruff check .
|
||||
|
||||
- name: Format check — ruff format
|
||||
run: ruff format --check .
|
||||
|
||||
- name: Type check — mypy
|
||||
run: mypy src/
|
||||
|
||||
- name: Unit tests
|
||||
run: |
|
||||
pytest -m "not integration and not heavy" \
|
||||
--cov=src/digger \
|
||||
--cov-report=xml \
|
||||
--cov-report=term-missing \
|
||||
-v
|
||||
|
||||
# ── Tier 2: Integration tests — only on PRs to main ──────────────────────
|
||||
integration:
|
||||
name: Integration tests (Meilisearch)
|
||||
runs-on: ubuntu-latest
|
||||
# Only run integration tier for PRs targeting main (not every push)
|
||||
if: github.event_name == 'pull_request' && github.base_ref == 'main'
|
||||
|
||||
services:
|
||||
meilisearch:
|
||||
image: getmeili/meilisearch:v1.11 # pinned to a minor line; use a full tag (e.g. v1.11.0) to pin an exact patch
|
||||
env:
|
||||
MEILI_NO_ANALYTICS: "true"
|
||||
MEILI_MASTER_KEY: "digger-test-key"
|
||||
options: >-
|
||||
--health-cmd "curl -fsS http://localhost:7700/health"
|
||||
--health-interval 5s
|
||||
--health-timeout 3s
|
||||
--health-retries 10
|
||||
|
||||
steps:
|
||||
- uses: https://code.forgejo.org/actions/checkout@v4
|
||||
|
||||
- uses: https://code.forgejo.org/actions/setup-python@v5
|
||||
with:
|
||||
python-version: "3.12"
|
||||
|
||||
- uses: https://code.forgejo.org/actions/cache@v4
|
||||
with:
|
||||
path: ~/.cache/pip
|
||||
key: pip-3.12-${{ hashFiles('**/requirements*.txt', '**/pyproject.toml') }}
|
||||
restore-keys: pip-3.12-
|
||||
|
||||
- name: Install dependencies
|
||||
run: pip install -e ".[dev]"
|
||||
|
||||
- name: Integration tests
|
||||
env:
|
||||
MEILI_URL: http://meilisearch:7700
|
||||
MEILI_MASTER_KEY: "digger-test-key"
|
||||
run: |
|
||||
pytest -m integration \
|
||||
--cov=src/digger \
|
||||
--cov-report=xml \
|
||||
-v
|
||||
```
|
||||
|
||||
**Notes on the Meilisearch service container:**
|
||||
- The service container hostname is the key in `services:` — here `meilisearch`. The job reaches it at `http://meilisearch:7700`.
|
||||
- The runner joins job containers to the compose internal network (`forgejo-stack_internal`) per the existing runner config — so the service container can also use `forgejo.localhost:3000` for any Forgejo API calls if needed.
|
||||
- `options` passes raw `docker run` flags; `--health-cmd` + related options must be in `options:` not as top-level keys (Forgejo Actions syntax matches GitHub Actions here).
|
||||
- Pin `getmeili/meilisearch` to an explicit version tag so a Meilisearch release doesn't silently break tests. `v1.11` follows the latest patch of that minor line; a full tag such as `v1.11.0` pins an exact patch. Update intentionally.
|
||||
|
||||
### 2.3 Heavy / real-model tier (gated)
|
||||
|
||||
```yaml
# In .forgejo/workflows/ci-heavy.yml — separate file, NOT included in ci.yml

name: CI (Heavy — real models)

on:
  workflow_dispatch:  # manual trigger only
    inputs:
      reason:
        description: "Why are you running heavy tests?"
        required: false

jobs:
  heavy:
    name: Heavy model tests
    # Target a runner with the 'heavy' label — either the Linux runner
    # if you add that label, or a dedicated GPU host in future.
    # For now, add 'heavy' to the existing linux runner's label list
    # so these jobs at least run somewhere (just slowly on CPU).
    runs-on: [ubuntu-latest, heavy]

    steps:
      - uses: https://code.forgejo.org/actions/checkout@v4

      - uses: https://code.forgejo.org/actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install -e ".[dev,models]"  # models extra: docling, whisper, etc.

      - name: Heavy tests
        env:
          RUN_HEAVY_TESTS: "1"
        run: |
          # --timeout comes from the pytest-timeout plugin; it must be in the dev extras
          pytest -m heavy -v --timeout=600
```
|
||||
|
||||
**pytest markers** — add to `pyproject.toml`:
|
||||
```toml
[tool.pytest.ini_options]
addopts = "-m 'not integration and not heavy'"  # safe default for local dev
markers = [
    "integration: requires a running Meilisearch instance",
    "heavy: requires local model inference (OCR/ASR); slow, never run in default CI",
    "windows_compat: Windows-specific path/encoding behaviour tests",
]
```
|
||||
|
||||
The `addopts` default means a bare `pytest` (no flags) skips integration and heavy tests everywhere, including local dev. CI selects a tier by passing `-m integration` or `-m heavy` on the command line; the later `-m` replaces the expression that came from `addopts`.
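
To make the selection rules concrete, here is a small model of which tests each invocation picks up. It is an illustration only: `is_selected` is a hypothetical helper that mimics the three marker expressions used in this document, not pytest's actual expression evaluator.

```python
from typing import FrozenSet


def is_selected(test_markers: FrozenSet[str], marker_expr: str = "default") -> bool:
    """Model which tests run for the invocations described above."""
    if marker_expr == "default":
        # addopts: -m 'not integration and not heavy'
        return not ({"integration", "heavy"} & test_markers)
    # Explicit `-m integration` or `-m heavy`: only tests carrying that marker.
    return marker_expr in test_markers


if __name__ == "__main__":
    tests = {
        "test_ir_schema": frozenset(),
        "test_long_paths": frozenset({"windows_compat"}),
        "test_meili_roundtrip": frozenset({"integration"}),
        "test_qwen_ocr": frozenset({"heavy"}),
    }
    for expr in ("default", "integration", "heavy"):
        chosen = [name for name, marks in tests.items() if is_selected(marks, expr)]
        print(f"{expr:12s} -> {chosen}")
```

Note that `windows_compat` tests are part of the default selection, which is why the Windows job can reuse the same `not integration and not heavy` expression.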
|
||||
|
||||
### 2.4 Caching pip dependencies
|
||||
|
||||
`actions/cache` is available at `code.forgejo.org/actions/cache` and supports Forgejo. The cache key must include a hash of the lockfile (or `pyproject.toml`) so that a dependency change invalidates the cache. Caches restore automatically across workflow runs on the same branch; the Linux runner stores them in the shared `forgejo-stack-hostedtoolcache` Docker volume that the runner already mounts.
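
The idea behind the key can be sketched in plain Python. This is an approximation for illustration: `pip_cache_key` is a hypothetical name, and the digest it produces is not byte-compatible with the `hashFiles()` workflow function; it only demonstrates that the key changes whenever a matched dependency file changes.

```python
import hashlib
from pathlib import Path
from typing import Sequence


def pip_cache_key(
    root: Path,
    python_version: str = "3.12",
    patterns: Sequence[str] = ("**/requirements*.txt", "**/pyproject.toml"),
) -> str:
    """Build a cache key from the content of every matched dependency file."""
    digest = hashlib.sha256()
    # Sort so the key is stable regardless of filesystem iteration order.
    files = sorted({p for pat in patterns for p in root.glob(pat) if p.is_file()})
    for path in files:
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return f"pip-{python_version}-{digest.hexdigest()}"
```

The `pip-3.12-` prefix matches the `restore-keys` fallback in the workflow, so a changed hash still restores the most recent cache for the same Python version.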
|
||||
|
||||
---
|
||||
|
||||
## 3. Windows Runner Setup Plan
|
||||
|
||||
### 3.1 What Crown0815/Forgejo-runner-windows-builder provides
|
||||
|
||||
Source: https://github.com/Crown0815/Forgejo-runner-windows-builder
|
||||
|
||||
This is an **unofficial community project** that cross-compiles the official Forgejo runner source (https://code.forgejo.org/forgejo/runner) for Windows using GitHub Actions. It makes no code changes to the upstream source — it's purely a build pipeline. The official Forgejo runner ships only Linux binaries; Crown0815 fills the gap.
|
||||
|
||||
**Available binaries (as of v12.12.0, June 27 2026):**
|
||||
- `forgejo-runner-windows-386.exe` — 32-bit Intel
|
||||
- `forgejo-runner-windows-amd64.exe` — 64-bit Intel/AMD (**use this**)
|
||||
- `forgejo-runner-windows-arm64.exe` — ARM64
|
||||
|
||||
**Version correspondence:** The Crown0815 version numbers track the official Forgejo runner release tags exactly (v12.12.0 = official runner v12.12.0). The stack's Linux runner image uses the tag `code.forgejo.org/forgejo/runner:6` — this is a Docker image major-version tag from a different numbering era. Verify compatibility; if in doubt, update the Linux runner image to match the Windows binary version to keep both runners in sync.
|
||||
|
||||
**Pinning a release:** Always pin to an exact tag, never use `latest`. Download URL pattern:
|
||||
```
https://github.com/Crown0815/Forgejo-runner-windows-builder/releases/download/v12.12.0/forgejo-runner-windows-amd64.exe
```
|
||||
Record the pinned version in `forgejo-stack/docs/windows-runner.md` and bump it deliberately.
|
||||
|
||||
**Risk note:** Crown0815 is a single-maintainer community project. If it goes unmaintained, build your own: the upstream runner is standard Go and cross-compilation is a one-line Makefile addition. Treat this as a convenience, not a hard dependency.
|
||||
|
||||
### 3.2 What the user must provide
|
||||
|
||||
The Windows runner runs jobs on the host OS — there is no container runtime involved. The user must provide:
|
||||
|
||||
**A. A Windows host or VM reachable from (and that can reach) the Forgejo instance.**
|
||||
|
||||
The most practical option on the user's existing Linux box: a **Windows 11 KVM/QEMU VM** managed by libvirt. This is standard on Linux and performs well with KVM acceleration.
|
||||
|
||||
- Install: `sudo apt install qemu-kvm libvirt-daemon-system virt-manager`
|
||||
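- Create a Windows 11 VM with ≥4 GB RAM and ≥64 GB disk (Windows 11's published minimums); the installer also expects UEFI firmware and a TPM 2.0 device, which libvirt can emulate
- With bridged networking the VM reaches Forgejo at `http://<linux-host-lan-ip>:3000`; on libvirt's default NAT network the host is reachable at the NAT gateway address instead (typically `192.168.122.1`)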
|
||||
- `forgejo.localhost` does NOT auto-resolve on Windows to the Linux host; add it to `C:\Windows\System32\drivers\etc\hosts`: `<linux-host-ip> forgejo.localhost`
|
||||
- Alternatively, register the runner pointing at the IP directly: `http://<linux-host-ip>:3000` (not the `.localhost` URL)
|
||||
|
||||
A physical spare Windows machine works equally well if available.
|
||||
|
||||
**B. Tools pre-installed on the Windows host (provisioned once, used by every CI job):**
|
||||
|
||||
The Windows runner executes jobs natively — there is no base container image. All tools referenced in a `run:` step must already exist on the host.
|
||||
|
||||
| Tool | Purpose | How to install |
|------|---------|----------------|
| Python 3.12 | Running tests | `winget install Python.Python.3.12` or Python.org installer |
| Git | `actions/checkout` | `winget install Git.Git` |
| Ruff | Linting/formatting | `pip install ruff` (part of dev deps) |
| Mypy | Type checking | `pip install mypy` (part of dev deps) |
| Poppler | PDF processing (tests) | Pre-built Windows binaries at https://github.com/oschwartz10612/poppler-windows/releases |
| FFmpeg | Audio/video tests | `winget install Gyan.FFmpeg` |
|
||||
|
||||
Note: Heavy model inference (OCR/ASR) is gated behind the `heavy` marker and NOT expected to run on the Windows runner. The Windows CI job is limited to path/encoding/cross-platform unit tests.
|
||||
|
||||
**C. Windows Defender / AV exception:**
|
||||
|
||||
Windows Defender (and other AV software) frequently flags unsigned Go binaries from GitHub as suspicious. Before running `forgejo-runner-windows-amd64.exe`:
|
||||
|
||||
1. Open "Windows Security" → "Virus & threat protection" → "Manage settings"
|
||||
2. Under "Exclusions", add the folder where the runner binary lives (e.g., `C:\forgejo-runner\`)
|
||||
3. Or use PowerShell (admin): `Add-MpPreference -ExclusionPath "C:\forgejo-runner"`
|
||||
|
||||
Without this, Defender may quarantine the binary at download, prevent execution, or kill it mid-job with no useful error message.
|
||||
|
||||
### 3.3 Registration
|
||||
|
||||
**Step 1 — Get a registration token** (run on the Linux host):
|
||||
```bash
# Uses the Forgejo admin API — same pattern as bootstrap-forgejo.sh
TOKEN=$(curl -fsS \
  -H "Authorization: token ${FORGEJO_ADMIN_TOKEN}" \
  "http://forgejo.localhost:3000/api/v1/admin/runners/registration-token" \
  | jq -r .token)
echo "Registration token: $TOKEN"
```
|
||||
|
||||
**Step 2 — Register on the Windows host** (run in an admin PowerShell):
|
||||
```powershell
cd C:\forgejo-runner

.\forgejo-runner-windows-amd64.exe register `
  --no-interactive `
  --instance "http://<linux-host-ip>:3000" `
  --token "<registration-token>" `
  --name "windows-runner" `
  --labels "windows:host,self-hosted:host"
```
|
||||
|
||||
The label format `<name>:host` tells the runner to execute jobs on the host OS rather than in a Docker container. `windows:host` creates the `windows` label; `self-hosted:host` creates the standard self-hosted label (workflows can target either).
|
||||
|
||||
**Step 3 — Write runner config** (`C:\forgejo-runner\config.yaml`):
|
||||
```yaml
runner:
  capacity: 1  # Windows VM is resource-constrained; keep low
  labels:
    - "windows:host"
    - "self-hosted:host"
  envs:
    ACTIONS_RUNTIME_URL: "http://<linux-host-ip>:3000/"
    ACTIONS_RESULTS_URL: "http://<linux-host-ip>:3000/"
```
|
||||
|
||||
The `ACTIONS_RUNTIME_URL` override is critical: without it, the runner inherits whatever Forgejo emits as `ROOT_URL` (`http://forgejo.localhost:3000/`). That URL resolves fine in a browser (RFC 6761 *.localhost) but NOT on a Windows host that doesn't have `forgejo.localhost` in its hosts file. Use the host's actual LAN IP here.
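
The rewrite from Forgejo's `ROOT_URL` to a LAN-reachable URL is mechanical and can be sketched as follows. Both function names are hypothetical, and the example IP is a placeholder; the sketch only swaps the host part and keeps scheme, port and trailing slash.

```python
from urllib.parse import urlsplit, urlunsplit


def is_localhost_name(host: str) -> bool:
    """True for 'localhost' and any '*.localhost' name (RFC 6761)."""
    host = host.lower().rstrip(".")
    return host == "localhost" or host.endswith(".localhost")


def runtime_url_for(root_url: str, lan_ip: str) -> str:
    """Replace the host of ROOT_URL with the Linux host's LAN IP."""
    parts = urlsplit(root_url)
    port = f":{parts.port}" if parts.port else ""
    # Keep a trailing slash, matching the values used in config.yaml above.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, f"{lan_ip}{port}", path, "", ""))


if __name__ == "__main__":
    root = "http://forgejo.localhost:3000/"
    if is_localhost_name(urlsplit(root).hostname or ""):
        # 192.0.2.10 is a documentation-only placeholder address.
        print(runtime_url_for(root, "192.0.2.10"))
```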
|
||||
|
||||
**Step 4 — Run as a service** (so it survives reboots):
|
||||
|
||||
Option A — Windows Task Scheduler (simpler, no extra tools):
|
||||
```powershell
schtasks /Create /SC ONSTART /TN "ForgejoRunner" /TR "C:\forgejo-runner\forgejo-runner-windows-amd64.exe daemon --config C:\forgejo-runner\config.yaml" /RU SYSTEM /RL HIGHEST /F
```
|
||||
|
||||
Option B — NSSM (Non-Sucking Service Manager, recommended for proper service semantics and log capture):
|
||||
```powershell
# Download nssm from https://nssm.cc/
nssm install ForgejoRunner "C:\forgejo-runner\forgejo-runner-windows-amd64.exe"
nssm set ForgejoRunner AppParameters "daemon --config C:\forgejo-runner\config.yaml"
nssm set ForgejoRunner AppDirectory "C:\forgejo-runner"
nssm start ForgejoRunner
```
|
||||
|
||||
**Step 5 — Verify**: On the Linux host, check the runner appears in Forgejo's admin UI at `http://forgejo.localhost:3000/-/admin/runners`.
|
||||
|
||||
### 3.4 Windows CI workflow
|
||||
|
||||
```yaml
# .forgejo/workflows/ci-windows.yml

name: CI (Windows)

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  windows-unit:
    name: Windows unit tests
    runs-on: windows  # matches the 'windows:host' label

    steps:
      # actions/checkout on a host runner: the runner clones via git directly
      - uses: https://code.forgejo.org/actions/checkout@v4

      # Do NOT use actions/setup-python here — it downloads Node.js-based
      # toolchain wrappers that don't play well with host-mode runners
      # without a proper AGENT_TOOLSDIRECTORY. Instead, call Python directly
      # from the PATH (pre-installed on the Windows host).
      - name: Verify Python
        shell: pwsh
        run: python --version

      - name: Install dependencies
        shell: pwsh
        run: pip install -e ".[dev]"

      - name: Lint (ruff)
        shell: pwsh
        run: ruff check .

      - name: Format check (ruff format)
        shell: pwsh
        run: ruff format --check .

      - name: Windows unit tests
        shell: pwsh
        run: |
          pytest -m "not integration and not heavy" `
            --cov=src/digger `
            --cov-report=xml `
            -v
```
|
||||
|
||||
**Notes:**
|
||||
- `shell: pwsh` uses PowerShell 7+ (Core). Ensure it's installed on the Windows host.
|
||||
- Do NOT add `services:` to the Windows job. Host-mode runners cannot spin up service containers (they have no Docker daemon). Meilisearch integration tests stay Linux-only.
|
||||
- `actions/setup-python` may work on a host runner but requires `AGENT_TOOLSDIRECTORY` to be set. The simpler approach is to pre-install Python and call it directly from PATH. Test this during sprint 0 and document what works.
|
||||
- The job should include tests tagged `@pytest.mark.windows_compat` — Windows path separator handling, long path (>260 char) awareness, file encoding (UTF-8 BOM, CP1252), file-lock behaviour under concurrent access.
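
As an example of the kind of logic a `windows_compat` test would exercise, the sketch below converts an absolute Windows path to its extended-length (`\\?\`) form, which lifts the 260-character `MAX_PATH` limit for most Win32 file APIs. `to_extended_length` is a hypothetical helper, not an existing digger function; it uses `PureWindowsPath`, so it behaves the same on Linux and Windows.

```python
from pathlib import PureWindowsPath

EXTENDED_PREFIX = "\\\\?\\"          # literal \\?\
EXTENDED_UNC_PREFIX = "\\\\?\\UNC\\"  # literal \\?\UNC\


def to_extended_length(path: str) -> str:
    """Return the extended-length form of an absolute Windows path."""
    if path.startswith(EXTENDED_PREFIX):
        return path  # already in extended-length form
    normalized = PureWindowsPath(path)
    if not normalized.is_absolute():
        raise ValueError(f"extended-length paths must be absolute: {path!r}")
    text = str(normalized)  # forward slashes become backslashes
    if text.startswith("\\\\"):
        # UNC share: \\server\share\x -> \\?\UNC\server\share\x
        return EXTENDED_UNC_PREFIX + text[2:]
    return EXTENDED_PREFIX + text
```

A test marked `@pytest.mark.windows_compat` could then assert, for instance, that a drive-letter path and a UNC path each receive the right prefix, and that relative paths are rejected.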
|
||||
|
||||
---
|
||||
|
||||
## 4. forgejo-stack Integration Plan
|
||||
|
||||
The Windows runner cannot be a Docker Compose service (it's a Windows binary). The integration is documentation + a helper script.
|
||||
|
||||
### 4.1 What changes in forgejo-stack
|
||||
|
||||
**A. New script: `scripts/register-windows-runner.sh`**
|
||||
|
||||
A Linux-side helper that generates a registration token and prints the PowerShell commands to run on the Windows host:
|
||||
|
||||
```bash
#!/usr/bin/env bash
# register-windows-runner.sh — generate a registration token and print
# the Windows PowerShell commands needed to set up the Windows runner.
set -euo pipefail
HERE="$(cd "$(dirname "$0")" && pwd)"
cd "$HERE/.."
source "$HERE/lib.sh"
load_env

log_info "fetching runner registration token from Forgejo..."
TOKEN=$(forgejo_api GET "/admin/runners/registration-token" | jq -r .token)
if [[ -z "$TOKEN" || "$TOKEN" == "null" ]]; then
  log_err "failed to get registration token (check FORGEJO_ADMIN_TOKEN)"
  exit 1
fi

HOST_IP=$(hostname -I | awk '{print $1}')
RUNNER_VERSION="${WINDOWS_RUNNER_VERSION:-v12.12.0}"

# The printed commands are kept at column 0 on purpose: PowerShell requires
# the here-string terminator ("@) to start at the beginning of a line.
cat <<EOF

== Windows Runner Registration ==

1. On the Windows host/VM, open an admin PowerShell and run:

mkdir C:\forgejo-runner
cd C:\forgejo-runner

# Add Windows Defender exclusion first:
Add-MpPreference -ExclusionPath "C:\forgejo-runner"

# Download the runner binary:
Invoke-WebRequest -Uri "https://github.com/Crown0815/Forgejo-runner-windows-builder/releases/download/${RUNNER_VERSION}/forgejo-runner-windows-amd64.exe" -OutFile "forgejo-runner-windows-amd64.exe"

# Register:
.\forgejo-runner-windows-amd64.exe register \`
  --no-interactive \`
  --instance "http://${HOST_IP}:3000" \`
  --token "${TOKEN}" \`
  --name "windows-runner" \`
  --labels "windows:host,self-hosted:host"

# Write config.yaml:
@"
runner:
  capacity: 1
  labels:
    - "windows:host"
    - "self-hosted:host"
  envs:
    ACTIONS_RUNTIME_URL: "http://${HOST_IP}:3000/"
    ACTIONS_RESULTS_URL: "http://${HOST_IP}:3000/"
"@ | Out-File -FilePath "config.yaml" -Encoding utf8

# Start as a Windows service (requires nssm — https://nssm.cc/):
nssm install ForgejoRunner "C:\forgejo-runner\forgejo-runner-windows-amd64.exe"
nssm set ForgejoRunner AppParameters "daemon --config C:\forgejo-runner\config.yaml"
nssm set ForgejoRunner AppDirectory "C:\forgejo-runner"
nssm start ForgejoRunner

2. Verify at: ${FORGEJO_ROOT_URL%/}/-/admin/runners
   The 'windows-runner' should appear online within ~30 seconds.

EOF
log_info "registration token used above will expire; re-run this script if setup is delayed."
```
|
||||
|
||||
**B. `.env.example` additions** (informational only — the Windows runner is not a compose service):
|
||||
|
||||
```dotenv
# ---------- Windows runner (host/VM — see docs/windows-runner.md) ----------
# Version of Crown0815/Forgejo-runner-windows-builder to deploy.
# Check https://github.com/Crown0815/Forgejo-runner-windows-builder/releases
WINDOWS_RUNNER_VERSION=v12.12.0
```
|
||||
|
||||
**C. README.md addition** (brief paragraph under a new "Windows CI Runner" section pointing to `docs/windows-runner.md`).
|
||||
|
||||
**D. `docs/windows-runner.md`** — full setup guide with KVM VM provisioning, PowerShell steps, NSSM, Defender exception, and troubleshooting. (Out of scope for this research; written at sprint 0 time.)
|
||||
|
||||
### 4.2 What does NOT change in docker-compose.yml
|
||||
|
||||
Nothing. The compose stack does not manage the Windows runner. The `runner` service in compose.yml remains the Linux/Docker runner. The Windows runner is an external process that happens to register against the same Forgejo instance.
|
||||
|
||||
### 4.3 Network topology
|
||||
|
||||
```
Linux host (forgejo-stack runs here)
├── Docker bridge: forgejo-stack_internal
│   ├── forgejo      (Forgejo, port 3000 → host 3000)
│   ├── db           (Postgres)
│   ├── runner       (Linux/Docker runner, labels: docker, ubuntu-latest)
│   └── forgejo-mcp
│
└── Host LAN IP: <linux-host-ip>:3000  (accessible from VM/Windows)

Windows VM (KVM on same host, or separate PC)
└── forgejo-runner-windows-amd64.exe
    ├── registers at: http://<linux-host-ip>:3000
    ├── labels: windows, self-hosted
    └── ACTIONS_RUNTIME_URL: http://<linux-host-ip>:3000/
```
|
||||
|
||||
---
|
||||
|
||||
## 5. GitHub Actions Gaps and Risks
|
||||
|
||||
### 5.1 Action URL resolution (important)
|
||||
|
||||
Forgejo Actions resolves `uses: actions/checkout@v4` by prepending `DEFAULT_ACTIONS_URL`, which defaults to `https://data.forgejo.org` (which redirects to `https://code.forgejo.org`). This means bare `actions/checkout@v4` resolves to `code.forgejo.org/actions/checkout@v4`, NOT `github.com/actions/checkout@v4`.
|
||||
|
||||
**For digger:** Always use fully-qualified `uses:` URLs:
|
||||
- `uses: https://code.forgejo.org/actions/checkout@v4`
|
||||
- `uses: https://code.forgejo.org/actions/setup-python@v5`
|
||||
- `uses: https://code.forgejo.org/actions/cache@v4`
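
The resolution rule can be modelled in a few lines. This is a simplified sketch of the behaviour described above, not Forgejo's implementation: `resolve_uses` is a hypothetical name, and the default base URL is the one quoted in this section.

```python
def resolve_uses(ref: str, default_actions_url: str = "https://data.forgejo.org") -> str:
    """Model how a `uses:` reference is turned into a fetchable location."""
    if ref.startswith(("http://", "https://")):
        # Fully-qualified: used as written, independent of instance config.
        return ref
    if ref.startswith("./"):
        # Local action inside the repository; nothing to resolve.
        return ref
    # Bare owner/repo@ref: prefixed with the instance's DEFAULT_ACTIONS_URL.
    return f"{default_actions_url.rstrip('/')}/{ref}"


if __name__ == "__main__":
    print(resolve_uses("actions/checkout@v4"))
    print(resolve_uses("https://code.forgejo.org/actions/checkout@v4"))
```

The point of the fully-qualified form is visible in the first branch: the result no longer depends on whatever `DEFAULT_ACTIONS_URL` an instance happens to be configured with.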
|
||||
|
||||
Confirmed available at `code.forgejo.org/actions`:
|
||||
- `checkout` ✓
|
||||
- `cache` ✓
|
||||
- `setup-python` ✓
|
||||
- `setup-node` ✓
|
||||
- `setup-go` ✓
|
||||
- `upload-artifact` — **use patched version**: `https://code.forgejo.org/forgejo/upload-artifact` (the standard `@v4` does not work from this mirror)
|
||||
- `download-artifact` — similar: use `code.forgejo.org/forgejo/download-artifact`
|
||||
|
||||
### 5.2 Service containers — only in Docker-mode runner jobs
|
||||
|
||||
Service containers (`jobs.<id>.services`) work only when the job runs in a Docker container (i.e., the runner launches a Docker container for the job). For the Windows host runner (`windows:host` label), there is no Docker daemon available and `services:` is unsupported. This is why the Meilisearch integration tests MUST run on the Linux runner, not the Windows runner.
|
||||
|
||||
### 5.3 actions/setup-python on host runners
|
||||
|
||||
On a Docker-mode runner, `setup-python` installs Python into `AGENT_TOOLSDIRECTORY` (`/opt/hostedtoolcache`). On a host-mode Windows runner, `AGENT_TOOLSDIRECTORY` may not be set, causing the action to fail or install Python to an unexpected location. The safer approach: pre-install Python on the Windows host and call it from `PATH` directly, skipping `setup-python` in the Windows workflow. Test at sprint 0.
|
||||
|
||||
### 5.4 actions/cache on the Windows host runner
|
||||
|
||||
`actions/cache` posts/restores cache to the Forgejo instance's cache API. This may work on a host runner but requires the runner to authenticate to `ACTIONS_RUNTIME_URL`. Since the Windows runner points at `http://<linux-host-ip>:3000/`, this should work, but is untested in this exact topology. If it fails, fall back to a local pip cache dir set via `PIP_CACHE_DIR` pointing at a fixed path on the Windows host.
|
||||
|
||||
### 5.5 `FORGEJO_*` environment variables
|
||||
|
||||
The runner v7+ exposes `FORGEJO_TOKEN`, `FORGEJO_REPOSITORY`, `FORGEJO_REF`, etc. alongside `GITHUB_*` aliases. The current stack uses runner image `:6` (older). If any workflow uses `FORGEJO_*` vars, ensure the runner image is at least v7. For digger workflows use `GITHUB_*` aliases (which are always available) for maximum compatibility.
|
||||
|
||||
### 5.6 Artifact and log retention
|
||||
|
||||
Artifacts default to 90 days, logs to 365 days. For digger's use case these defaults are fine; no config changes needed.
|
||||
|
||||
### 5.7 DST clock-change quirk
|
||||
|
||||
Scheduled workflows that would run during a "spring forward" DST transition are skipped; those in a "fall back" window run twice. Digger does not use scheduled triggers (only `push` and `pull_request`), so this is a non-issue.
|
||||
|
||||
### 5.8 No cross-instance reusable workflows
|
||||
|
||||
Reusable workflows only work within the same Forgejo instance. This is not a problem for digger, which has no cross-repo reusable workflows, but worth knowing if we later want to share CI logic with other repos on the stack.
|
||||
|
||||
### 5.9 Crown0815 runner is unofficial
|
||||
|
||||
The official Forgejo project ships Linux-only binaries. Crown0815's Windows builds are a community convenience. Risks:
|
||||
- Delayed releases (may lag upstream by days to weeks)
|
||||
- Single-maintainer bus factor
|
||||
- Unsigned binary (Defender flag)
|
||||
|
||||
Mitigation: Pin to a tested release version; document the version in `.env.example`; know that cross-compiling Go from the official source is a straightforward fallback if the project goes dark.
|
||||
|
||||
---
|
||||
|
||||
## 6. Pre-Sprint-0 Checklist
|
||||
|
||||
Before the first commit:
|
||||
|
||||
- [ ] Provision Windows VM (KVM) or confirm access to a Windows host
|
||||
- [ ] Run `scripts/register-windows-runner.sh` to get a registration token
|
||||
- [ ] Follow Windows setup steps (Defender exception → download binary → register → NSSM service)
|
||||
- [ ] Verify both runners are online in Forgejo admin UI
|
||||
- [ ] Commit `.forgejo/workflows/ci.yml` and `.forgejo/workflows/ci-windows.yml` in sprint 0 scaffolding
|
||||
- [ ] Confirm `actions/cache` works on both runner types (smoke test with a trivial workflow)
|
||||
- [ ] Pin `getmeili/meilisearch` version in the service container definition
|
||||
- [ ] Update `RUNNER_IMAGE` in `.env.example` to align with Crown0815 version if needed
|
||||
111
docs/research/SYNTHESIS.md
Normal file
|
|
@ -0,0 +1,111 @@
|
|||
# Research Synthesis — digger
|
||||
|
||||
**Date:** 2026-07-01
|
||||
**Inputs:** Research docs A–F in this directory, plus the existing `arabic-ocr` repo and the confirmed Section-14 answers.
|
||||
**Purpose:** Reconcile the six findings into one set of per-layer technology choices, surface the cross-cutting decisions, and list the genuinely open trade-offs for the human to resolve before the IR/index design is finalized.
|
||||
|
||||
---
|
||||
|
||||
## 1. Confirmed project parameters (from Section 14)
|
||||
|
||||
| Parameter | Decision |
|---|---|
| Runtime | Python |
| Search (v1) | Keyword only; vector/hybrid designed-for but switched off |
| Access control | Single-user, no ACLs |
| Hardware | CPU-capable, GPU-optional with auto-fallback. Dev box: no GPU, 128 GB unified RAM |
| Languages | Arabic + English (RTL matters) |
| Scale | Medium: 10k–500k files, ~0.5–5 TB |
| Meilisearch | Bundled in our Compose, same machine, zero-config first run |
| File access | Local disks only (still cross-platform path care) |
| Granularity | Chunk-capable architecture, whole-doc default in v1 |
| Read path | Option B — thin Python `SearchProvider` API; UI never knows the engine |
| UI tech | FastAPI + Jinja2 + HTMX |
| Run mode | CLI (`scan/extract/index/run/status`) now; seams for watched-folder/service later |
| CI | Linux/Docker runner (`docker`, `ubuntu-latest`) exists; add Windows runner |
|
||||
|
||||
---
|
||||
|
||||
## 2. Per-layer technology decisions
|
||||
|
||||
### Search engine — Meilisearch (Agent A)
|
||||
- **Pin `getmeili/meilisearch:v1.48.3`.** Vector/hybrid is production-stable in v1.x; no experimental flags needed.
|
||||
- **Single index `digger_documents`** + `localizedAttributes: ["ara","eng"]` for mixed Arabic/English.
|
||||
- **Primary key `id` = SHA-256 hex of `canonical_path + "|" + content_hash`** — Meili IDs allow only `[a-zA-Z0-9_-]`, so raw paths can't be used; the hash is 64 hex chars, safe.
|
||||
- **Reserve vectors at zero cost now:** declare a `userProvided` embedder, `dimensions: 768`, at index creation in Sprint 0. Populate `_vectors.digger_semantic` only in V2. Changing dimensions later forces a full reindex → **commit to 768 now**.
|
||||
- Arabic: native via Charabia (article segmentation, diacritic stripping). Configure Arabic+English stop words manually (no built-ins). Keep `typoTolerance.minWordSizeForTypos.oneTypo` at 5, the default (appropriate for short Arabic roots).
|
||||
- **Two hard limits that shape the IR:** (1) **65,535 word-positions per field** — long OCR'd PDFs / long transcripts silently truncate; (2) **468-byte filterable-value cap** — raw paths can't be filter values; derive a short `source_folder` token instead.
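
The identifier rule and the two limits can be sketched together. All three helpers are hypothetical names for illustration. The whitespace split in `truncate_words` is only a stand-in for Meilisearch's tokenizer, so real position counts (especially for Arabic) will differ, and the 511-byte ID length bound in the regex is an assumption not stated in this document.

```python
import hashlib
import re
from typing import Tuple

# Allowed primary-key alphabet quoted above; the length bound is an assumption.
MEILI_ID_RE = re.compile(r"^[A-Za-z0-9_-]{1,511}$")


def document_id(canonical_path: str, content_hash: str) -> str:
    """SHA-256 hex of canonical_path + "|" + content_hash (64 hex characters)."""
    return hashlib.sha256(f"{canonical_path}|{content_hash}".encode("utf-8")).hexdigest()


def source_folder_token(folder: str, max_bytes: int = 468) -> str:
    """Return a filter-safe folder token that never exceeds max_bytes in UTF-8."""
    raw = folder.encode("utf-8")
    if len(raw) <= max_bytes:
        return folder
    # Long folders: keep a readable head and append a short hash for uniqueness.
    suffix = "~" + hashlib.sha256(raw).hexdigest()[:16]
    head = raw[: max_bytes - len(suffix)].decode("utf-8", errors="ignore")
    return head + suffix


def truncate_words(text: str, limit: int = 65_535) -> Tuple[str, bool]:
    """Cut text at the word-position ceiling and report whether it was cut."""
    words = text.split()
    if len(words) <= limit:
        return text, False
    return " ".join(words[:limit]), True
```

The boolean returned by `truncate_words` is the kind of signal a `content_truncated` flag in the IR would record, so that silent truncation becomes visible.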
|
||||
|
||||
### Document conversion — tiered, not one toolkit (Agent B)
|
||||
| Tier | Handles | Tool | Adoption |
|---|---|---|---|
| 1 — native-digital | digital PDF, DOCX, PPTX, XLSX, HTML, EPUB | **Docling** | wholesale |
| 2 — scanned / image | scanned PDF, JPEG/PNG, handwriting, IDs, forms | **Qwen2.5-VL via Ollama** (existing `arabic-ocr`) | wrap behind ModelBackend |
| 3 — email/edge | EML, MSG, XML, CSV, ZIP | **Unstructured** (open-source/local mode only) | per-format, later |
| 4 — fallback | Office files Docling skips | **MarkItDown** | per-format fallback |
| rejected | — | ~~Apache Tika~~ (JVM, Tesseract-only, no model hooks) | do not use |
|
||||
|
||||
- **OCR backend verdict: keep Qwen2.5-VL via Ollama.** It is the only evaluated stack that reliably handles Arabic handwriting + certificates/IDs/tables/forms in one model (fine-tuned 3B variants hit <2% CER). Tesseract/docTR inadequate for Arabic; Surya printed-only + non-commercial weight license; PaddleOCR printed-only fallback at best.
|
||||
- **Architecture choice (resolves B-risk-1): route by content, not by one tool.** Native-digital PDFs → Docling (no inference). Scanned/image PDFs and images → Qwen-OCR directly. A cheap "has an embedded text layer?" probe decides the route. This keeps OCR cost off documents that don't need it.
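
The routing decision can be sketched as a small dispatcher. This is an assumption-laden illustration: `Route`, `choose_route` and the extension sets are hypothetical, and the text-layer probe is injected as a callable because the actual probe (for example, a PDF library check) is not specified here.

```python
from enum import Enum
from pathlib import Path
from typing import Callable


class Route(Enum):
    DOCLING = "docling"    # tier 1: native-digital, no model inference
    QWEN_OCR = "qwen_ocr"  # tier 2: scanned pages and images
    OTHER = "other"        # left to the per-format tiers


IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
NATIVE_EXTS = {".docx", ".pptx", ".xlsx", ".html", ".epub"}


def choose_route(path: Path, has_text_layer: Callable[[Path], bool]) -> Route:
    """Pick a conversion route from the file type and a cheap text-layer probe."""
    ext = path.suffix.lower()
    if ext in IMAGE_EXTS:
        return Route.QWEN_OCR
    if ext in NATIVE_EXTS:
        return Route.DOCLING
    if ext == ".pdf":
        # Only PDFs need the probe: digital ones skip OCR entirely.
        return Route.DOCLING if has_text_layer(path) else Route.QWEN_OCR
    return Route.OTHER
```

Because the probe runs only for PDFs, OCR cost stays off every document that already carries extractable text.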
|
||||
|
||||
### Office & legacy (Agent C)
|
||||
- `.docx` → **docx2python** (extraction/structure) + **python-docx** (DOM when needed); `.xlsx` → **openpyxl** read-only, **python-calamine** fast-path for big files; `.pptx` → **python-pptx**; legacy `.xls` → **xlrd**.
|
||||
- Legacy `.doc/.ppt` → **unoserver** (persistent LibreOffice listener, isolated container, metric-compatible fonts) → convert to OOXML → parse. Never antiword/catdoc (dead).
|
||||
- **Access: `.mdb`/`.accdb` Windows-only in v1** via pyodbc + ACE redistributable, gated behind a capability check; **cross-platform Access deferred to V2** (mdbtools `.accdb` unreliable). Non-Windows Access files → `skipped` with structured reason.
|
||||
- No COM/Office automation ever. Windows long-path (`\\?\`, registry key), file-lock (`catch PermissionError`), and encoding care throughout.
|
||||
|
||||
### Audio/video (Agent D)
|
||||
- **faster-whisper v1.2.x, model `large-v3`, `compute_type=int8`, CPU-first.** ~1.5 GB RAM, ~10× real-time on a good CPU. GPU upgrade = change two flags. Avoid `large-v3-turbo` for Arabic (accuracy regression); allow as opt-in override. Document `Byne/whisper-large-v3-arabic` as a configurable model override for better Arabic WER.
|
||||
- **ffmpeg** via plain `subprocess` (16 kHz mono WAV). Bundle in Docker; `static-ffmpeg` for non-Docker dev. Resolver: explicit config → `shutil.which` → `static_ffmpeg`.
|
||||
- **Diarization deferred to V2** (gated HF model, CPU-prohibitive). IR reserves `speaker` (null in v1) and optional `words` so V2 layers on without schema change.
|
||||
- Long media: `vad_filter=True`; transcribe in a killable `ProcessPoolExecutor` with timeout; enforce `max_media_size_bytes`.
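
The ffmpeg resolver order and the 16 kHz mono conversion can be sketched as below. Function names are hypothetical, and the `static_ffmpeg` step is represented by an injected `fallback` callable rather than a direct import, since the exact call into that package is not specified in this document.

```python
import shutil
from typing import Callable, List, Optional


def resolve_ffmpeg(
    configured: Optional[str] = None,
    fallback: Optional[Callable[[], Optional[str]]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Resolve the ffmpeg binary: explicit config, then PATH, then a fallback."""
    if configured:
        return configured
    found = which("ffmpeg")
    if found:
        return found
    if fallback is not None:
        path = fallback()  # e.g. a wrapper around static_ffmpeg (assumption)
        if path:
            return path
    raise FileNotFoundError("ffmpeg not found via config, PATH, or fallback")


def wav_command(ffmpeg: str, src: str, dst: str) -> List[str]:
    """Argument list that extracts 16 kHz mono WAV audio from any media file."""
    return [
        ffmpeg,
        "-nostdin",      # never wait for interactive input
        "-y",            # overwrite the output file if it exists
        "-i", src,
        "-vn",           # drop any video stream
        "-ac", "1",      # mono
        "-ar", "16000",  # 16 kHz sample rate
        dst,
    ]
```

The list form is meant to be passed to `subprocess.run(cmd, check=True, timeout=...)` without `shell=True`, which avoids quoting problems with Arabic or space-containing paths.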
|
||||
|
||||
### Frontend (Agent E)
|
||||
- **FastAPI 0.136.x + HTMX 2.0.x + Jinja2**, `meilisearch-python-sdk` (async) for the adapter.
|
||||
- **`SearchProvider` Protocol** with `search()/suggest()/health()` returning engine-agnostic `Hit`/`SearchResult`/`FacetBucket` dataclasses. `Hit.snippet` is **pre-rendered HTML with `<mark>` only** (sanitized) — no Meilisearch field name (`_formatted`, `facetDistribution`) ever reaches a template. `mode: keyword|semantic|hybrid` is in the signature now (always `keyword` in v1).
|
||||
- Search-as-you-type: `hx-trigger="input changed delay:300ms"`, `hx-sync="this:replace"`, `hx-push-url="true"`; one `/search` route returns full page or `#results` partial based on the `HX-Request` header (bookmarkable + degrades gracefully).
|
||||
- **Arabic RTL:** root `dir="rtl"`, CSS logical properties only, `dir="auto"` on input + snippets, `<bdi>` around paths/numbers, self-hosted Cairo/Noto Sans Arabic, no letter-spacing.
|
||||
- **Open-file / deep-link:** server-side `/api/open` (os-native open) instead of browser-blocked `file://`; PDF `#page=N`; audio/video embedded HTML5 player pre-seeked `#t=N`. Always show provenance (path, page/slide/sheet, timestamp).
|
||||
- `suggest()` returns empty in v1 (Meili has no native suggest API).
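
A possible shape for the read-side contract is sketched below. Only the method names, the dataclass names and the `mode` values come from this document; every field name, the synchronous signatures, and the sentinel-based `render_snippet` helper are illustrative assumptions (the real adapter uses an async SDK).

```python
from dataclasses import dataclass, field
from html import escape
from typing import Dict, List, Literal, Optional, Protocol, runtime_checkable

Mode = Literal["keyword", "semantic", "hybrid"]


@dataclass(frozen=True)
class Hit:
    id: str
    path: str
    snippet_html: str              # pre-rendered, contains <mark> only
    page: Optional[int] = None     # provenance for PDFs
    timestamp_s: Optional[float] = None  # provenance for audio/video


@dataclass(frozen=True)
class FacetBucket:
    value: str
    count: int


@dataclass(frozen=True)
class SearchResult:
    hits: List[Hit]
    total: int
    facets: Dict[str, List[FacetBucket]] = field(default_factory=dict)


@runtime_checkable
class SearchProvider(Protocol):
    def search(self, query: str, *, mode: Mode = "keyword",
               limit: int = 20, offset: int = 0) -> SearchResult: ...
    def suggest(self, prefix: str) -> List[str]: ...
    def health(self) -> bool: ...


def render_snippet(text: str, start: str = "\x02", end: str = "\x03") -> str:
    """Escape all HTML, then turn sentinel markers into <mark> tags."""
    return escape(text).replace(start, "<mark>").replace(end, "</mark>")


class EmptyProvider:
    """Trivial provider: useful as a test double for templates and routes."""

    def search(self, query: str, *, mode: Mode = "keyword",
               limit: int = 20, offset: int = 0) -> SearchResult:
        return SearchResult(hits=[], total=0)

    def suggest(self, prefix: str) -> List[str]:
        return []  # matches the v1 behaviour described above

    def health(self) -> bool:
        return True
```

Escaping first and inserting `<mark>` afterwards is one way to guarantee that no markup from the indexed document can reach a template, provided the engine adapter is configured to emit the sentinel characters as its highlight tags.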
|
||||
|
||||
### CI / Windows runner (Agent F)
|
||||
- Two workflows under `.forgejo/workflows/`: `ci.yml` (Linux) — **unit on every push** (ruff, ruff format, mypy, pytest+coverage), **Meilisearch integration on PRs to main** (service container, pinned image); `ci-windows.yml` — Windows path/encoding/file-lock unit tests on the `windows` label. Heavy OCR/ASR gated behind a `heavy` marker + `workflow_dispatch`, never in the default pipeline.
|
||||
- **All `uses:` must be fully-qualified** `https://code.forgejo.org/actions/...` (Forgejo resolves bare refs to its own mirror). Use `code.forgejo.org/forgejo/upload-artifact` for artifacts.
|
||||
- **Windows runner = Crown0815/Forgejo-runner-windows-builder** (unofficial, pin `v12.12.0`), runs **host-native (no containers)** → no service containers on Windows (integration stays Linux-only). **User must provide a Windows host or KVM Windows 11 VM**; pre-install Python 3.12, Git, poppler, ffmpeg; add a Defender exclusion; register with `--labels "windows:host,self-hosted:host"` and override `ACTIONS_RUNTIME_URL` to the Linux host LAN IP. forgejo-stack gets a `register-windows-runner.sh` helper + `docs/windows-runner.md`; **no docker-compose.yml change**.
|
||||
|
||||
---
|
||||
|
||||
## 3. Cross-cutting architecture implications
|
||||
|
||||
1. **The IR must carry chunking seams from day one** even though v1 indexes whole-doc: `chunk_index`, `chunk_count`, `parent_id` (null in whole-doc mode); plus a `content_truncated` flag for the 65,535-word ceiling.
|
||||
2. **The IR must carry vector seams:** optional `_vectors.digger_semantic` + the `embedding_model_id/version` that produced them (null in v1).
|
||||
3. **`source_folder` is a derived, length-bounded token** (≤468 bytes) so folder faceting works within Meili's filterable-value cap; raw `path` is displayed-only.
|
||||
4. **Provenance is structural:** page (PDF), slide (PPTX), sheet (XLSX), and transcript-segment timestamps (A/V) live in the IR's structured segments so the UI can deep-link.
|
||||
5. **Model access is uniform:** OCR, ASR, and (future) embeddings all sit behind one `ModelBackend` interface; the default concrete backends talk to **Ollama (host service)** and **faster-whisper**, both endpoint/flag-configurable.
|
||||
6. **Ollama deployment default = host service** (resolves the user's open question): native install, `OLLAMA_HOST=0.0.0.0:11434`, pipeline reaches it via `host.docker.internal` with `extra_hosts: host-gateway` so Windows/macOS/Linux converge on one default; the dev VM keeps `192.168.122.1` as an override. Fully overridable; Dockerized-Ollama is documented as the alternative for GPU-reproducible setups.
|
||||
|
||||
---
|
||||
|
||||
## 4. Open trade-offs to resolve with the human (before finalizing the IR/index)
|
||||
|
||||
1. **Deduplication policy.** Same content hash at multiple paths → (a) **one document per path** (simpler, cleaner deletion tracking — recommended) vs (b) **one document, multiple source paths** (`distinctAttribute=content_hash`, paths as array). Affects primary-key and delete semantics.
|
||||
2. **Chunking in v1 or V2.** The 65,535-word field limit means very long scanned PDFs and long transcripts truncate silently. Options: (a) **v1 = whole-doc + truncate-and-flag, chunk in V2** (recommended, matches "whole-doc default") vs (b) **chunk long docs in v1** (full coverage + better long-doc relevance, more Sprint-1 complexity). The IR carries the seams either way.
|
||||
3. **Windows CI host.** The Windows runner needs a Windows host or a KVM Windows 11 VM on the Linux box. Confirm the human can provision one (and whether a Windows license/ISO is available), or accept "Linux CI now, wire Windows job when the runner exists."
|
||||
|
||||
These three are carried into the design presentation; everything else above is decided.
|
||||
|
||||
---
|
||||
|
||||
## 5. Key version pins (verified 2026-07-01)
|
||||
|
||||
| Component | Pin |
|---|---|
| Meilisearch | `getmeili/meilisearch:v1.48.3` |
| Docling | 2.107.x |
| faster-whisper | 1.2.x (`large-v3`, int8) |
| FastAPI / HTMX | 0.136.x / 2.0.x |
| python-docx / docx2python / openpyxl / python-pptx / xlrd | 1.2.x / 3.6.x / 3.1.5 / 1.0.2 / 2.0.1 |
| unoserver | 3.7 |
| static-ffmpeg | 2.13 |
| Crown0815 Windows runner | v12.12.0 |
| Python | 3.12 |
|
||||