Randa 8b3d7ea7e3 fix: correct relative links in brief Section 14 (docs/ root, no ../)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-07-01 14:03:15 +04:00

Project Brief / Kickoff Prompt: File-Ingestion Search Pipeline

Paste this whole document as your first message to Claude Code. It is written to Claude Code. Items in <ANGLE BRACKETS> are decisions for me (the human) — ask me about them before you build.

1. Mission

Build a modular file-ingestion pipeline that walks files on a machine, extracts their content (including scanned documents, Office files, and audio/video), normalizes everything into a single well-defined intermediate document model, and feeds that into Meilisearch for full-text search.

Two hard requirements shape every decision:

The pipeline must be usable standalone, without Meilisearch. It must be able to read files and emit the intermediate representation (IR) to disk on its own. Indexing into a search engine is a separate, swappable stage.
The search backend must be swappable. Meilisearch is the first target, but it sits behind an interface so it can be replaced by another full-text (or vector) search engine without touching the pipeline.

Target platform is primarily Windows, but the code must be cross-platform (Windows + Linux + macOS).

Stack is Python (chosen for its OCR / ML / document-parsing ecosystem). Semantic / vector + hybrid search is a first-class goal: it does not have to ship in v1, but the IR, the pipeline, and the index design must be built to accommodate it from day one so we never have to re-architect for it (see Section 9).

2. Non-negotiable principles

Strict layering. Each stage talks to the next only through a documented interface, never through a concrete implementation.
The intermediate representation is the contract. It is the most important artifact in the repo. Design it first, document it, version it, and keep it stable.
Local-first / privacy-first. All content processing (OCR, transcription, any AI) runs against local models. No file content is sent to external services unless I explicitly opt in. State this constraint in the README and enforce it in code (no silent network egress of content).
Prefer existing, well-maintained tools over reinventing — provided they can hook into our own local models. Evaluate before adopting.
Idempotent and incremental. Re-running over a directory must not reprocess unchanged files and must handle additions, changes, and deletions cleanly.
Fail loud, fail isolated. One unreadable file must never abort a run. Failures are logged, quarantined, and reportable.
Tested and CI-gated from day one. The first commits stand up the test suite and a green CI pipeline on the local Forgejo instance (Section 11); features land behind passing tests, not after them.
Zero-install by default. The operator should get a working system without manually installing dependencies — bundle the heavy native pieces rather than asking the user to install them (Section 12). Batteries-included, but every bundled piece stays overridable.
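
As a rough illustration of the "fail loud, fail isolated" principle, the sketch below shows a run loop that records a per-file failure instead of aborting. The names RunReport, process_all, and handle are hypothetical and not part of any agreed interface; a real implementation would also log and quarantine the failed file.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class RunReport:
    """Hypothetical per-run summary: what succeeded and what failed."""
    processed: List[str] = field(default_factory=list)
    failed: Dict[str, str] = field(default_factory=dict)  # path -> error text


def process_all(paths: Iterable[str], handle: Callable[[str], None]) -> RunReport:
    """Call handle() on every path; one bad file never aborts the run."""
    report = RunReport()
    for path in paths:
        try:
            handle(path)
        except Exception as exc:
            # Fail isolated: remember the error and move on to the next file.
            report.failed[path] = f"{type(exc).__name__}: {exc}"
        else:
            report.processed.append(path)
    return report
```

The report is what a status summary could be built from: counts of processed and failed files fall straight out of the two collections.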

3. How we work — superpowers-driven planning, then iterative delivery

You have the superpowers plugin (obra/superpowers). Use its brainstorm → plan → execute methodology, its TDD (red → green → refactor) discipline, and its subagent code-review throughout. Because superpowers prioritizes project-specific instructions, this brief should live where it will be honored: commit it as docs/PROJECT_BRIEF.md and anchor/reference it from CLAUDE.md.

Do not build features until I approve the plan (step 6).

Read this brief, run a brainstorming pass, and ask me the open questions in Section 14. Wait for my answers.
Research (Section 6): spawn subagents in parallel; each writes findings to docs/research/.
Index-design discussion (Section 7): produce the IR schema, the Meilisearch index/schema design (incl. the vector/embedder reservations from Section 9), the read-side search API + UI approach (Section 10), and ADRs — all in docs/.
Plan (superpowers write-plan): turn the design into an agile, sprint-based V1 where every sprint ships a working end-to-end slice (Section 13). Use role-based agents (PM, Tech Lead, …) to detail it.
Create the Forgejo V1 milestone with every task as an issue, plus a coarse V2 milestone (Section 13).
Present the plan + milestone to me for approval. Stop here until I sign off.
Deliver sprint by sprint using the Forgejo dev loop (Section 13): pick an issue → branch + TDD → PR (Closes #N) → code-review + green CI → I approve and merge. The first sprint lands repo scaffolding, a green CI pipeline (Section 11), and the core interfaces before any extractor.

4. Proposed architecture (critique and refine this — don't take it as final)

Treat the following as a strong starting point that your design agents should challenge and improve, not a spec to implement blindly.

 file system
     │
     ▼
[ Source / Walker ]   discovers files, yields file references + filesystem metadata
     │
     ▼
[ Router ]            picks an Extractor based on type/mime/content sniffing
     │
     ▼
[ Extractor (per format) ]   uses Model Backends as needed (OCR / ASR / VLM / embeddings)
     │
     ▼
[ Canonical Document (the "middle type" / IR) ]   <-- THE CONTRACT, serializable to JSONL on disk
     │
     ▼
[ Transformer / Enricher ]   normalization, language detection, optional chunking, optional embeddings
     │
     ▼
[ Sink / Indexer (interface) ]   Meilisearch adapter is one impl; a "file/null" sink writes IR to disk

Supporting components that cut across the stages:

Model Backend interface — abstracts local models: OCR, automatic speech recognition (ASR/transcription), vision-language understanding, and (optionally) embeddings. Concrete backends are configured by endpoint/runtime (e.g. a local server or in-process library). Extractors depend on the interface, never a specific model.
SearchProvider interface (read side) — the query-time mirror of the Sink. The UI and any search API talk only to this; the Meilisearch implementation is one adapter. Supports keyword, semantic, and hybrid query modes (see Sections 9 and 10) so swapping engines never touches the UI.
State Store — records what has been processed (e.g. a local SQLite DB) keyed by a content hash + path, to drive incremental runs and deletion handling.
Config — a single config file (TOML or YAML) with env-var overrides. Selects the active sink, model backends, source roots, concurrency, and per-format options.
CLI — subcommands such as scan, extract (files → IR on disk), index (IR → sink), run (end-to-end), and status. The extract + index split is what makes the pipeline usable without a search engine.
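
To make the layering concrete, here is a minimal sketch of how the Extractor and Sink boundaries could be expressed as Python protocols, with a file sink that writes IR to disk and needs no search engine. Everything here is an assumption for illustration: the trimmed Document type, the method names extract and write, and the toy PlainTextExtractor are placeholders that the design phase is expected to replace.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Iterable, Protocol


@dataclass
class Document:
    """Hypothetical, heavily trimmed stand-in for the Canonical Document."""
    id: str
    path: str
    text: str


class Extractor(Protocol):
    def extract(self, path: Path) -> Document: ...


class Sink(Protocol):
    def write(self, docs: Iterable[Document]) -> int: ...


class PlainTextExtractor:
    """Toy extractor: reads a UTF-8 text file as-is."""

    def extract(self, path: Path) -> Document:
        text = path.read_text(encoding="utf-8")
        return Document(id=path.name, path=str(path), text=text)


class FileSink:
    """Appends one JSON object per line (JSONL); no search engine involved."""

    def __init__(self, out_path: Path) -> None:
        self.out_path = out_path

    def write(self, docs: Iterable[Document]) -> int:
        count = 0
        with self.out_path.open("a", encoding="utf-8") as fh:
            for doc in docs:
                fh.write(json.dumps(asdict(doc), ensure_ascii=False) + "\n")
                count += 1
        return count


def run(paths: Iterable[Path], extractor: Extractor, sink: Sink) -> int:
    """Depends only on the two protocols, never on a concrete class."""
    return sink.write(extractor.extract(p) for p in paths)
```

A Meilisearch adapter would be another class with the same write method, so run would not change when the backend does.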

The Canonical Document (the "middle type")

This is the central deliverable of the design phase. It must be a serializable schema (JSON/JSONL) that is rich enough to drive search but independent of any search engine. Design it to include at least:

Identity: a stable id derived from path + content hash; the content_hash itself.
Source metadata: absolute/relative path, filename, extension, detected mime type, size, created/modified timestamps, host/drive, and (where relevant) network-share/UNC origin.
Provenance: which extractor + version produced it, and when.
Content: a plain-text field for search, plus optional structured segments that preserve native structure where it matters — pages (PDF), sheets/rows (spreadsheets), slides (PPTX), and tables. For audio/video, transcript segments with timestamps and (if available) speaker labels.
Format-specific metadata: e.g. Office document properties, image EXIF, media duration/codec.
Derived fields: detected language; optional tags; optional embeddings.
Processing status: success/partial/failed, plus warnings and errors.
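
One possible shape for the field list above, sketched as dataclasses that serialize to a single JSONL line. This is a starting point for the design discussion, not the schema: the field names, the schema_version value, the 32-character id length, and the exact way the id is derived from path and content hash are all assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple

IR_SCHEMA_VERSION = "0.1"  # hypothetical version tag


@dataclass
class Segment:
    kind: str                        # e.g. "page", "sheet", "slide", "table", "transcript"
    text: str
    page: Optional[int] = None       # PDFs
    start_s: Optional[float] = None  # audio/video timestamps, in seconds
    end_s: Optional[float] = None
    speaker: Optional[str] = None


@dataclass
class CanonicalDocument:
    id: str
    content_hash: str
    path: str
    mime_type: str
    size_bytes: int
    extractor: str
    extractor_version: str
    text: str
    segments: List[Segment] = field(default_factory=list)
    language: Optional[str] = None
    status: str = "success"          # "success" | "partial" | "failed"
    warnings: List[str] = field(default_factory=list)
    schema_version: str = IR_SCHEMA_VERSION

    def to_jsonl(self) -> str:
        """Serialize to one JSON line; nested segments become plain dicts."""
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


def make_id(path: str, content: bytes) -> Tuple[str, str]:
    """Return (id, content_hash); the id is stable for a given path + content."""
    content_hash = hashlib.sha256(content).hexdigest()
    doc_id = hashlib.sha256(f"{path}\0{content_hash}".encode("utf-8")).hexdigest()[:32]
    return doc_id, content_hash
```

With this derivation, the same bytes at two paths share a content_hash but get different ids, which leaves the deduplication policy open rather than deciding it in the schema.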

Decide explicitly: whole-document vs. chunked indexing. Long documents may need to be split into chunks for good search relevance. Whatever you choose, the chunking must be a Transformer concern, not baked into extractors, so it can be turned off for the standalone case.
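
A minimal sketch of chunking as a Transformer step that can be switched off, assuming a simple word-window strategy with overlap. The window sizes, the record field names, and the "<doc id>-<index>" chunk id format are placeholders; the real strategy is for the index-design discussion to settle.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping word windows; short texts come back whole."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    if len(words) <= max_words:
        return [text] if words else []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def chunk_records(doc_id: str, text: str, enabled: bool = True, **kw) -> List[dict]:
    """Turn one document into index records; disabled means one whole-document record."""
    pieces = chunk_words(text, **kw) if enabled else [text]
    return [
        {"id": f"{doc_id}-{i}", "parent_id": doc_id, "chunk_index": i, "text": piece}
        for i, piece in enumerate(pieces)
    ]
```

Because extractors never call this, the standalone extract path can emit unchunked IR simply by passing enabled=False or skipping the step.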

5. Scope and priorities

Build in this order. The architecture must ensure that adding a new format means adding a new Extractor, with no changes to other layers.

Priority 1 — Scanned documents (PDF, JPEG, PNG). These are images of text, handled by a local OCR / document-understanding model. The actual approach and model are decided by the research (Agents B/D and the synthesis), not assumed up front. I've experimented locally and can hand you my findings and model setup as one input to weigh — ask me for them — but treat them as a starting point, not the answer; if the research points to something better, propose that. Build the full end-to-end path (walk → OCR → IR → Meilisearch) for this priority first so the whole architecture is exercised early.

Priority 2 — Microsoft Office files (.doc, .docx, .xls, .xlsx, .ppt, .pptx, .mdb/.accdb, etc.). Note the split between modern OOXML formats and legacy binary formats, and the cross-platform difficulty of Access databases — flag these in research.

Priority 3 — Audio and video files. Extract audio, transcribe with a local ASR model, then treat the transcript like text (optionally with AI summarization/understanding), preserving timestamps in the IR. The specific extraction/transcription approach and models are decided by the research (Agent D), not assumed — the above is the shape of the pipeline, not a fixed tool choice.

6. Research subagents (run in parallel, write findings to docs/research/)

Spawn focused subagents. Each must read primary sources, verify current capabilities and versions (your training data may be stale — check the live docs), and produce a written findings doc with recommendations and trade-offs. At minimum:

Agent A — Meilisearch. Read the Meilisearch documentation thoroughly, starting at https://www.meilisearch.com/ and its docs. Cover: index creation and the settings API (searchable / filterable / sortable attributes, ranking rules, primary key, distinct attribute), faceting, synonyms, stop words, typo tolerance, document upsert/delete semantics, batch/task handling, and document size / field limits. Also investigate vector / hybrid / AI-powered search and embedders — since we already run local models, semantic search may be worth supporting. Note version requirements.
Agent B — Local model integration & end-to-end tooling. Research existing frameworks that convert many file types into a structured/markdown intermediate and that allow plugging in our own local models. Evaluate the major "everything-to-structured/markdown" toolkits explicitly — Docling (IBM), MarkItDown (Microsoft), Unstructured, and Apache Tika — alongside OCR stacks (Tesseract / PaddleOCR / docTR / Surya) and locally-served vision-language models. For each, assess: which file types it covers, whether it can use our local models (vs. calling out), output structure quality, and cross-platform/Windows support. Report which we can adopt wholesale, which to use per-format, and where we still need our own extractor.
Agent C — Office & legacy formats. Best cross-platform libraries for OOXML (docx/xlsx/pptx), strategies for legacy binary .doc/.xls/.ppt (e.g. headless LibreOffice conversion), and Access .mdb/.accdb on Windows vs. Linux/macOS. Identify the Windows-only gotchas.
Agent D — Audio/video transcription. Local ASR options (e.g. Whisper-family runtimes), audio extraction (e.g. ffmpeg), timestamps and optional speaker diarization, and GPU vs. CPU performance trade-offs.
Agent E — Search frontend & UX. Evaluate the UI options in Section 10 (Python-native: Streamlit/NiceGUI; minimal-JS: FastAPI + Jinja2 + HTMX; full JS: InstantSearch/instant-meilisearch), how each handles faceting/highlighting/typo-tolerant search, and the direct-to-engine vs. through-the-API trade-off. Recommend a default and note what's needed to keep the UI engine-agnostic.

After research, synthesize the findings into a single recommendations doc.

7. Index-design discussion (the "agents debate the indexes" part)

Have at least two agents take different perspectives and propose competing designs, then reconcile:

one optimizing for search relevance / UX (what should be searchable, filterable, sortable, facetable; how to rank; synonyms; typo tolerance; chunking for relevance), and
one optimizing for data modeling / pipeline cleanliness (how the IR maps to documents, stable IDs, field flattening, avoiding lossy transforms, incremental update semantics, document-size limits).

They should explicitly resolve: single index vs. multiple indexes (e.g. by file type), the IR→document field mapping, which fields are filterable/sortable/searchable, chunking strategy, primary key choice, and how updates/deletes propagate. Capture the outcome as an ADR plus a concrete proposed Meilisearch settings configuration, and a concrete IR JSON schema. Surface any genuinely open trade-offs to me rather than silently picking.

8. Implementation expectations (after I approve the design)

Repo scaffolding: clear module boundaries matching the layers; a README documenting the architecture and the IR contract; the chosen config format with a sample config.
Interfaces first: define Source, Extractor, ModelBackend, Transformer, Sink/Indexer, SearchProvider (read side), and StateStore as explicit protocols/interfaces. Provide a trivial "file sink" (writes IR to disk) so the standalone path works on day one.
Phased build: Priority 1 fully end-to-end and tested before starting Priority 2, etc.
Incremental indexing: content-hash + mtime tracked in the State Store; stable IDs for idempotent upserts; detect and propagate deletions.
Concurrency: parallel file processing with a worker pool, while respecting limits on local model calls (e.g. serialize/queue GPU-bound work). Make concurrency configurable.
Error handling & observability: structured logging, per-file status, a quarantine/dead-letter path for failures, and a status summary (counts of processed/failed/skipped).
Testing: a small fixture corpus per format under tests/; unit tests for each extractor and the IR serialization; an integration test for the end-to-end path against a local Meilisearch (or a mock sink). Keep tests cross-platform.
Cross-platform care: handle Windows path separators, long paths, file encodings, and file-lock situations; gate any OS-specific dependency (Access, LibreOffice, ffmpeg) behind capability checks with clear error messages when missing.
Per-file timeouts & size limits: OCR and transcription can hang or explode on huge inputs. Enforce a per-file timeout and a configurable max-size skip, recorded as a skipped/failed status — never a crashed run.
Reindex / backfill command: because the IR records extractor_version and the embedding-model id/version, provide an explicit command to reprocess or re-embed when a model or the schema improves. Don't rely on manual deletion.
Deduplication policy: decide what happens when the same content hash appears at multiple paths (index once with multiple source paths, vs. one document per path).
Run modes & triggering: support a one-shot CLI run now, and keep the core run loop decoupled from how it's triggered so a watched-folder mode (filesystem events) or a scheduled/background service can be added later — on Windows this may eventually run as a service or scheduled task.
Secrets & keys: model endpoints and search-engine API keys live in config / .env, never committed. Document the keys required.
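
The incremental-indexing expectation above could be sketched with a small SQLite-backed State Store that classifies each walk into new, changed, unchanged, and deleted paths. The table layout and the function names open_state, plan_run, and commit_run are hypothetical; this version compares content hashes only and stores mtime without using it as a shortcut.

```python
import sqlite3
from typing import Dict, List, Tuple

Seen = Dict[str, Tuple[str, float]]  # path -> (content_hash, mtime)


def open_state(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the state database; in-memory by default for tests."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, content_hash TEXT NOT NULL, mtime REAL NOT NULL)"
    )
    return conn


def plan_run(conn: sqlite3.Connection, seen: Seen) -> Dict[str, List[str]]:
    """Compare the current walk against stored state and classify every path."""
    known = {row[0]: row[1] for row in conn.execute("SELECT path, content_hash FROM files")}
    plan: Dict[str, List[str]] = {"new": [], "changed": [], "unchanged": [], "deleted": []}
    for path, (content_hash, _mtime) in seen.items():
        if path not in known:
            plan["new"].append(path)
        elif known[path] != content_hash:
            plan["changed"].append(path)
        else:
            plan["unchanged"].append(path)
    # Paths recorded earlier but missing from this walk must be removed downstream.
    plan["deleted"] = sorted(set(known) - set(seen))
    return plan


def commit_run(conn: sqlite3.Connection, seen: Seen, plan: Dict[str, List[str]]) -> None:
    """Persist the outcome in one transaction, after the sink has accepted it."""
    with conn:
        for path in plan["new"] + plan["changed"]:
            content_hash, mtime = seen[path]
            conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)", (path, content_hash, mtime))
        conn.executemany("DELETE FROM files WHERE path = ?", [(p,) for p in plan["deleted"]])
```

Only the new and changed paths need extraction, and the deleted list is what the sink would use to remove stale documents.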

9. Vector & hybrid search (design for it from day one)

We will support semantic (vector) and hybrid (keyword + vector) search. v1 ships keyword-only — but nothing in the design may make adding vectors later expensive, so the IR, pipeline, and index schema must accommodate them from day one (build the seams, leave the feature switched off).

Embeddings are generated locally, behind the same Model Backend interface as OCR/ASR — never via an external API.
Document-level vs. chunk-level embeddings: good vector relevance usually wants chunk-level embeddings, which ties directly to the chunking decision in Section 7. Keep the two consistent.
The IR can carry embeddings plus the embedding-model id/version that produced them. Changing the embedding model or the chunking strategy invalidates existing vectors and requires a reindex — make that an explicit, supported operation (see the reindex command in Section 8), not a manual scramble.
Index design must reserve for vectors now: have Agent A research Meilisearch's embedder configuration, user-provided vs. auto-generated embeddings, hybrid ranking, and the version this requires, and bake the necessary fields/config into the proposed schema even if it ships disabled.
Keep it engine-agnostic: a replacement search engine might handle vectors differently (or not at all), so the read-side search interface (Section 10) must express keyword, semantic, and hybrid query modes generically rather than leaking Meilisearch specifics.
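
As a small illustration of the invalidation rule above, the sketch below compares the embedding settings stored with a document against the currently configured ones. EmbeddingSpec, its fields, and the string encoding of the chunking strategy are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EmbeddingSpec:
    """Hypothetical record of what produced a document's vectors."""
    model_id: str
    model_version: str
    chunking: str  # e.g. "words:200/20"; the encoding is an assumption


def needs_reembed(stored: Optional[EmbeddingSpec], current: Optional[EmbeddingSpec]) -> bool:
    """True when existing vectors are missing or were built with different settings."""
    if current is None:
        # Vectors are switched off (the v1 default), so nothing needs embedding.
        return False
    return stored != current
```

A reindex command could use a check like this to select only the documents whose vectors are stale.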

10. Frontend & search UX

This is the area I'm least familiar with, so propose options with clear trade-offs and a recommended default, then build something usable early even if basic.

Core principle — the UI is just another client. It must not reach into the pipeline and must not hardcode Meilisearch. Define a read-side SearchProvider interface that mirrors the write-side Sink (methods like query, facets, suggest, supporting the keyword/semantic/hybrid modes from Section 9) and expose it through a thin API. Swapping the search engine then means writing one new SearchProvider adapter; the UI never changes.
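
A minimal sketch of what the read-side interface might look like, with the three query modes expressed generically and an in-memory fake that a UI or unit test could run against. Only query is shown; facets and suggest are omitted. SearchRequest, SearchResponse, QueryMode, and InMemoryProvider are hypothetical names, and the substring matching stands in for a real engine.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Protocol


class QueryMode(str, Enum):
    KEYWORD = "keyword"
    SEMANTIC = "semantic"
    HYBRID = "hybrid"


@dataclass
class SearchRequest:
    q: str
    mode: QueryMode = QueryMode.KEYWORD
    filters: Dict[str, str] = field(default_factory=dict)  # exact-match filters
    limit: int = 20
    offset: int = 0


@dataclass
class SearchResponse:
    hits: List[dict]
    total: int


class SearchProvider(Protocol):
    def query(self, request: SearchRequest) -> SearchResponse: ...


class InMemoryProvider:
    """Fake provider for tests: case-insensitive substring match, keyword mode only."""

    def __init__(self, docs: List[dict]) -> None:
        self._docs = docs

    def query(self, request: SearchRequest) -> SearchResponse:
        if request.mode is not QueryMode.KEYWORD:
            raise NotImplementedError(f"{request.mode.value} search is not enabled")
        needle = request.q.lower()
        matches = [
            d for d in self._docs
            if needle in d["text"].lower()
            and all(d.get(k) == v for k, v in request.filters.items())
        ]
        page = matches[request.offset:request.offset + request.limit]
        return SearchResponse(hits=page, total=len(matches))
```

A Meilisearch adapter would implement the same query method, so the API layer and UI would depend on SearchProvider alone.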

Key decision to put to me (show the trade-off):

Option A — UI queries the search engine directly (e.g. Meilisearch's InstantSearch components / instant-meilisearch). Best turnkey UX — instant results, faceting, highlighting — for the least effort, but it couples the UI to Meilisearch and exposes a search-only key to the browser, which breaks our swappability promise on the read path.
Option B — UI queries a thin Python search API (FastAPI) that wraps SearchProvider. Keeps the system engine-agnostic end to end at the cost of reimplementing some glue (faceting, highlighting, pagination). Recommended default, given how central modularity is to this project. A rich JS search UI can still be added later in front of the same API.

Frontend technology options (I am not a frontend developer — bias toward staying in Python / minimal JS, and treat the UI as fully replaceable):

Fastest, Python-only: Streamlit or NiceGUI — stand up a working search UI quickly; good for v1 and internal use. NiceGUI gives more real-app structure; Streamlit is the quickest to demo.
Clean, minimal-JS (recommended starting point): FastAPI serving server-rendered templates (Jinja2) enhanced with HTMX for search-as-you-type. Stays almost entirely in Python, is genuinely modular, and is production-reasonable.
Best-in-class UX later: a JS framework (React / Vue / Svelte) using Meilisearch's InstantSearch components, talking to the Python API (Option B) or the engine directly (Option A).

Have the research agents confirm current options and recommend one — but ship a basic working search page early rather than gold-plating.

Search UX features to plan for (not all in v1): search-as-you-type; typo-tolerant matching with highlighted snippets; faceted filters (file type, date, source folder, language); sorting; pagination or infinite scroll; result previews; and a clear way to open or locate the original file — including the page number for PDFs and the timestamp for audio/video so users can jump to the moment. Always show provenance (path, page/slide/sheet, timestamp). Handle empty-result and error states gracefully. An optional admin view can surface indexing status, last run, and failures (reuse the status summary from Section 8).

Frontend security: single-user, no per-user access control — so no tenant tokens or per-document ACLs are needed. Still, never expose the search engine's master key to a browser: if the UI ever queries the engine directly (Option A), use a search-only key. With Option B the API holds the key and the browser sees nothing.

11. CI/CD (Forgejo — set up on day one)

We host on a local Forgejo instance and want continuous integration working from the very first commit, not bolted on later. CI is part of Phase 1 scaffolding, before any extractor is written.

Forgejo Actions. Use Forgejo Actions workflows (largely GitHub-Actions-compatible syntax) under .forgejo/workflows/, run by a Forgejo Runner (act_runner). Confirm with me which runners are registered and their labels/capabilities (Docker vs. host, and whether a Windows runner exists) — don't assume. If syntax/feature gaps from GitHub Actions show up, flag them rather than guessing.
Layered, fast-by-default test suite. Separate the tiers so most pushes stay fast:
- Unit tests — no external services, no real models; run on every push. The ModelBackend and Sink/SearchProvider interfaces make this clean: inject fakes/mocks.
- Integration tests — spin up a real Meilisearch in the job (service container or docker run) and exercise the indexing + query path against it, still with mocked model backends. Run on every PR to the main branch.
- Heavy / real-model tests — OCR/ASR against actual local models are GPU-bound and slow; gate these behind a marker (e.g. a pytest marker / env flag) so they only run locally or on a specifically-labelled runner, never in the default pipeline.
Cross-platform CI with a real Windows runner. Primary target is Windows, so don't let "passes in Linux CI" hide Windows-only breakage. The official Forgejo Runner is Linux-only; use the community Crown0815/Forgejo-runner-windows-builder prebuilt Windows runner, registered against our instance with a windows label, to run a dedicated Windows job. Notes to account for: it's an unofficial community build (pin a known release); antivirus/Defender may flag the binary, so an exception is needed on the runner host; and a native Windows runner runs jobs on the host, not in containers. Therefore split CI by runner: the Linux/Docker runner handles the Meilisearch integration tier (service container), and the Windows host runner handles the path/encoding/file-lock/Windows-only unit tests (host-native, with any needed tools vendored on that host). Confirm the runner labels with me.
Quality gates on every push: linting + formatting (ruff, and ruff format or black), type checking (mypy), and the unit suite — all must pass to merge. Add pytest coverage reporting.
Dependency & tool management: pin dependencies (lockfile), cache them in CI, and pin the Python version(s) tested. Mirror the same checks in pre-commit hooks so failures surface before CI.
Pipeline-as-code, reviewed like code. Workflows live in the repo; keep them minimal and documented in the README (how to run each tier locally, how the runner is expected to be configured).
Fixtures, not real data. Ship a tiny sanitized fixture corpus per format under tests/; never commit real/sensitive documents. Keep large/binary fixtures minimal.

The acceptance bar for Phase 1: a fresh clone, the documented setup, and a green pipeline (unit + Meilisearch integration) on the local Forgejo instance.

12. Packaging & setup (make it as close to zero-install as possible)

A top priority: an operator should be able to get a working system without manually installing dependencies. This is hard here because the stack pulls in heavy native pieces — Meilisearch, an OCR runtime, an ASR runtime + ffmpeg, and LibreOffice for legacy Office conversion. The lever for "no manual installs" is to bundle those, not ask the user to install them.

Primary distribution — one-command Docker Compose stack. Ship a docker compose up that brings up everything wired together: the pipeline, Meilisearch, the search UI/API, and the converter/model services, with ffmpeg / LibreOffice / OCR baked into the images. The operator's only prerequisite is a container runtime (Docker Desktop or Podman) — nothing else to install. Provide sensible CPU-only defaults and a zero-config first run (auto-create the index, ship an example config) so that docker compose up + pointing at a folder yields working search.
GPU is opt-in, and the model backend can live outside Docker. Because models sit behind the ModelBackend interface, the model runtime can be either a container or a host service the stack points at (e.g. a local model server). This avoids forcing every user through GPU-in-Docker setup (which on Windows means WSL2 + the NVIDIA Container Toolkit). Default to CPU; document GPU as an explicit upgrade.
Secondary option for a non-Docker, Windows-first install — flag the trade-off. If you (the human) would rather hand a non-technical Windows user a double-click experience with no Docker concept at all, the alternative is a bundled native installer (e.g. Inno Setup/MSI) that vendors the frozen Python app plus meilisearch.exe, ffmpeg, the OCR runtime, and a portable LibreOffice. This is the friendliest for an end user but the most work to build and maintain cross-platform. Have the research/design agents weigh Docker Compose vs. native bundle and recommend; default to Docker Compose for v1 unless I say otherwise.
Zero-install must not break modularity. Batteries-included is the default, not a lock-in: every bundled service (Meilisearch, the model backend, the UI) stays independently overridable via config / compose overrides, so a user can point at their own Meilisearch or their own model server. The standalone, pip-installable pipeline (no UI, no search engine) remains a supported lighter path for developers.
Document the few prerequisites honestly. Whatever the choice, the README states the single prerequisite (container runtime, or nothing for the native bundle) and gives a copy-paste quickstart. No multi-step "install Tesseract, then ffmpeg, then…" lists.

13. Agile delivery & Forgejo workflow

All planning and execution runs through Forgejo — agile and iterative, on top of the superpowers methodology.

Role-based agents. Spin up as many specialized agents as the work needs, both while detailing the plan and later while executing it: PM (scope, sprint/issue breakdown, acceptance criteria), Tech Lead (architecture, interface contracts, PR review), Senior Fullstack (implementation), UX/UI (the search frontend), QA (tests/fixtures), and any others that help. This composes with superpowers' own subagent-driven development and code-reviewer agent rather than replacing it.

V1 milestone & issues (after I approve the design):

Create a V1 milestone on Forgejo containing all V1 tasks as issues — each with a clear title, description, acceptance criteria, labels (type / component / priority), and an assigned sprint.
Create a V2 milestone with coarse, low-detail issues as placeholders for known-later work. Don't over-specify V2 — it will change as V1 teaches us things. Refine it only once V1 is done or something forces it.

Sprints are end-to-end. V1 is a sequence of sprints, each shipping a thin but complete e2e slice, not a horizontal layer. Illustrative only — the agents set the real breakdown:

Sprint 0 — skeleton: repo scaffolding, green CI (Section 11), the core interfaces, the file-sink, and a trivial walk → stub-extractor → IR → Meilisearch → search path that proves the whole pipe end-to-end.
Sprint 1+ — Priority 1 (scanned docs): a real OCR slice end to end, with tests, behind the same interfaces.
Then Priority 2 (Office), then Priority 3 (A/V), each in its own sprint(s).

Every sprint ends with something demonstrable, tested, and merged.

Dev loop (per issue) — unless I tell you otherwise for a given issue or sprint:

Pick an issue from the active sprint of the milestone.
Create a git worktree for a new branch off the latest main (see below) and implement test-first there (superpowers TDD: red → green → refactor); keep the change scoped to that issue.
Open a PR that links the issue (Closes #N).
Review: the code-reviewer / Tech Lead agent reviews against the plan and standards; address blocking findings.
CI must be green (Section 11).
I approve and merge. I do the final approval/merge myself — do not self-merge unless I've explicitly allowed it for that issue or sprint.
Update the issue + milestone status, then pick the next issue.

Always use git worktrees for new branches. Never commit on main and never reuse one working directory across branches. Every issue gets its own git worktree in a sibling folder next to the repo, created from an up-to-date main, using this convention:

Branch: <type>/<issue>-<slug> where type ∈ feat | fix | chore | docs | refactor | test — e.g. feat/42-pdf-ocr-extractor.
Worktree path: ../<repo>.worktrees/<issue>-<slug>, one flat directory per issue, sibling to the repo root (e.g. ../filesearch.worktrees/42-pdf-ocr-extractor). Sibling placement keeps worktrees out of the repository tree, so no .gitignore entry is needed and no tooling traverses them.
After merge: remove the worktree and delete the branch (git worktree remove … + git branch -d …).
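
The naming convention above is mechanical enough to derive in code. The helper below is a sketch only: worktree_names and slugify are hypothetical names, and the slug rule (lowercase, with runs of non-alphanumeric characters collapsed to a hyphen) is an assumption, since the brief does not define one.

```python
import re
from pathlib import PurePosixPath
from typing import Tuple

ALLOWED_TYPES = {"feat", "fix", "chore", "docs", "refactor", "test"}


def slugify(title: str) -> str:
    """Lowercase the title and collapse anything non-alphanumeric into single hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title yields an empty slug")
    return slug


def worktree_names(repo: str, change_type: str, issue: int, title: str) -> Tuple[str, str]:
    """Return (branch, worktree_path) following the <type>/<issue>-<slug> convention."""
    if change_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown change type: {change_type!r}")
    slug = slugify(title)
    branch = f"{change_type}/{issue}-{slug}"
    # The worktree is a sibling of the repo root, expressed relative to it.
    path = PurePosixPath("..") / f"{repo}.worktrees" / f"{issue}-{slug}"
    return branch, str(path)
```

The two values map onto git worktree add -b <branch> <path> main, run from the repo root.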

This keeps main clean, isolates each issue's work, and lets multiple role agents work in parallel without colliding. It matches superpowers' branch-finishing flow, which already creates and tidies up worktrees — let it manage them where it can.

Forgejo access. Use the Forgejo MCP connector (already installed) to create the milestone and issues, open PRs, post reviews, and update statuses. First discover what the MCP actually exposes and confirm it supports every operation this workflow needs (create milestone, create/label issues, open PR, request/post review, set status, and — if I ever delegate it — merge). If any needed operation is missing, flag it rather than working around it, and we'll add a Forgejo API-token + tea CLI fallback.

14. Questions to confirm with me before building

These are all answered. The resolved values are filled in below and consolidated in research/SYNTHESIS.md §1, with the rationale in the ADRs under decisions/.

Runtime: Python — confirmed. Flag if any required dependency forces a non-Python component.
Search type: v1 ships keyword-only — confirmed (vectors designed-for but switched off; Section 9).
Access control: single-user, no per-user access — confirmed (no tenant tokens / ACLs needed).
Forgejo access: the Forgejo MCP is installed — confirmed (used to create the milestone/issues and open PRs).
Forgejo runners: Linux/Docker runner registered with labels docker + ubuntu-latest; a windows runner (Crown0815/Forgejo-runner-windows-builder) to be stood up on a KVM Win11 VM (ADR 0010).
Distribution: Docker Compose is the default — a container runtime is confirmed acceptable; no native installer for v1 (Section 12, ADR 0009).
My existing OCR/scanned-doc setup: Qwen2.5-VL served via Ollama (host service; document-aware Arabic prompt; poppler for PDF rasterization), handed over as the arabic-ocr repo (ADR 0005).
UI approach & tech: Option B (thin Python SearchProvider API) + FastAPI + Jinja2 + HTMX (ADR 0007).
Indexing granularity: chunk long docs in v1 (a whole document is the degenerate single-chunk case); no hard doc/field size cap beyond Meilisearch's 65,535-word-per-field limit, which chunking respects (ADR 0004, ADR 0003).
Scale: medium — 10k–500k files, ~0.5–5 TB.
Document languages: Arabic (RTL) + English.
Hardware: CPU-capable, GPU-optional with auto-fallback; dev box has no GPU but 128 GB unified RAM.
Where does Meilisearch run: bundled in our Docker Compose on the same machine (not pre-deployed; zero-config first run) (ADR 0009).
File access: local disks only for v1 (still cross-platform path care).
How should indexing run: manual CLI now (scan/extract/index/run/status/reindex), with the run loop decoupled so a watched-folder / scheduled / background-service mode can be added later.

Working agreement

Be pragmatic: don't over-engineer, prefer proven libraries that meet the local-model constraint, and keep the layering honest. When a decision has real trade-offs, show me the options and your recommendation instead of guessing. The intermediate representation and the swappable-sink boundary are the two things that must never be compromised for short-term convenience. Keep me in the loop at the gates that matter: I approve the plan before building, and I approve and merge each PR myself — don't self-merge unless I've said so.
