Commit graph

30 commits

Author SHA1 Message Date
8adc70b727 Merge pull request 'chore: sprint skill + self-review gate in the dev loop' (#48) from chore/sprint-skill into main 2026-07-01 17:23:41 +04:00
Randa
cd9c710026 chore: add mid-sprint resume procedure + live groom-doc status
- sprint skill: explicit "Resume — session closed mid-sprint" section
  that reconstructs the frontier from groom doc -> Forgejo -> git, and a
  rule to keep per-issue Status live (todo -> in progress -> PR open ->
  merged) and commit/push per TDD step so state is never archaeology.
- docs/sprints/README.md: add a Status column to the groom template.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 17:21:42 +04:00
Randa
5500c111d9 chore: add sprint skill + self-review gate to the dev loop
- CLAUDE.md dev loop: after opening the PR, wait for CI green, then
  self-review the diff (/code-review) before saying it's ready; never
  claim ready before CI-green + self-review; still never self-merge.
- .claude/skills/sprint: thin per-project skill so "start sprint N" /
  "groom sprint N" / "continue the sprint" deterministically runs the
  groom -> per-issue dev-loop -> retro flow, enforcing the guardrails
  (plan is source of truth, groom-first, CI-green + self-review before
  ready, human merges). Composes with the superpowers skills.
- docs/sprints/README.md: fold the CI-green + self-review gate into the
  per-sprint Definition of Done.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 17:19:35 +04:00
b47deb86d5 Merge pull request 'docs(plan): v1 sprint-based implementation plan' (#3) from docs/v1-plan into main 2026-07-01 17:11:02 +04:00
Randa
8363549f69 docs: make the plan a living doc + add docs/sprints grooming convention
- CLAUDE.md: plan is the source of truth (keep it in sync with Forgejo
  issues both ways); groom each sprint just-in-time into
  docs/sprints/sprint-<n>-<slug>.md from plan + current state + prior
  retro; scope changes flow groom -> plan -> issues; close with a retro.
- Plan doc: add a "Living plan & sprint grooming" section.
- docs/sprints/README.md: the grooming rules + a template.
- docs/sprints/sprint-0-skeleton.md: the first groom (Sprint 0), with
  PR sequencing, the Windows-runner human-touch risk, and DoD.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 17:04:05 +04:00
Randa
f37e035554 docs(plan): move Compose engine bring-up into Sprint 0
Split the Docker Compose work so the engine (Meilisearch + pipeline)
lands in Sprint 0 as S0-14, making the Sprint-0 e2e slice runnable with
one command and delivering ADR 0009's zero-config first run early. S2-9
slims to extending Compose with the ui service once the UI exists.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 16:59:03 +04:00
Randa
ec8d3a4000 docs(plan): add v1 sprint-based implementation plan
Turn the approved design (ADRs 0001-0010, IR + Meilisearch contracts,
research SYNTHESIS) into an agile, sprint-based v1 roadmap where every
sprint ships a working end-to-end slice:

- Sprint 0: skeleton — scaffolding, green Linux + native Windows CI,
  the seven interfaces, FileSink, config + SQLite StateStore,
  MeilisearchSink + index/settings (dormant embedder), and a trivial
  walk -> stub -> IR -> Meilisearch -> search slice.
- Sprint 1: Priority 1 scanned-doc OCR e2e (harness bake-off early,
  chunking transformer, incremental/replace/delete semantics).
- Sprint 2: Priority 2 native-digital + Office (Docling + metadata
  augmenter) and the FastAPI/HTMX SearchProvider UI; Compose stack.
- Sprint 3: Priority 3 A/V via Docling ASR (large-v3, segment times).
- Coarse v2 milestone for deferred work.

Honors the invariants: IR is the sole contract; Docling/Meilisearch/
models swappable behind interfaces; local-only inference; v1
keyword-only (vectors dormant); fail-isolated; CI green from commit one.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 16:47:29 +04:00
705380ed89 Merge pull request 'docs(plan): make Windows CI runner wiring in-scope Sprint-0 issues' (#2) from docs/plan-windows-runner into main 2026-07-01 16:41:26 +04:00
Randa
016f17e121 docs(plan): make Windows CI runner wiring in-scope Sprint-0 issues
The Win11 KVM VM is provisioned and ready, so the runner wiring is no longer
a human prerequisite. Turn it into three concrete, pickup-able Sprint-0 infra
issues (register the Crown0815 runner with a windows label; vendor the host
toolchain; wire the runs-on: windows CI job), each with acceptance criteria.
Also refresh the status note now that the design PR is merged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 16:39:51 +04:00
a0a9329134 Merge pull request 'docs: research findings and v1 design (IR contract, index design, ADRs)' (#1) from docs/research-and-design into main 2026-07-01 16:27:11 +04:00
Randa
068dad5340 docs(ir): add v2 example record with a populated bge-m3 1024-dim embedding
ir-examples.jsonl gains a fifth record — a Word chunk whose embedding{} is
populated (model_id bge-m3, dimensions 1024, a real 1024-length vector) —
so the embedding shape and the sink's vector->_vectors.digger_semantic
mapping have a concrete, schema-valid fixture. The four v1 records stay
embedding:null. id/parent_id are computed per the path-stable formula.
decisions/README labels it as the v2 illustration.
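
To make the vector->_vectors.digger_semantic mapping above concrete, here is a minimal sketch. It assumes the IR record is a plain dict whose `embedding` object carries `dimensions` and `vector` (the field names this message mentions). It also assumes the sink copies every other field through unchanged, which is not confirmed by the source. The function name `to_meilisearch_document` is hypothetical.

```python
def to_meilisearch_document(record: dict) -> dict:
    """Map an IR record to a Meilisearch document (illustrative sketch only)."""
    # Assumption: all IR fields except `embedding` are copied through as-is.
    doc = {key: value for key, value in record.items() if key != "embedding"}

    embedding = record.get("embedding")
    if embedding is None:
        # v1 records carry embedding: null, so no _vectors entry is written.
        return doc

    vector = embedding["vector"]
    if len(vector) != embedding["dimensions"]:
        raise ValueError(
            f"vector length {len(vector)} != declared dimensions "
            f"{embedding['dimensions']}"
        )

    # The embedder name `digger_semantic` comes from the commit message.
    doc["_vectors"] = {"digger_semantic": vector}
    return doc
```

Checking the vector length against `dimensions` before writing means a mismatched fixture is rejected in the sink rather than by the search engine.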

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 16:17:00 +04:00
Randa
2e1f8f46bd docs: set bge-m3 (1024-dim) as the planned default embedder
Switch the committed userProvided embedder dimension 768 -> 1024 to match
bge-m3, the planned v2 default (research B already recommended it as the
Arabic+English choice; ADR 0005 previously carried e5-base/768 as the
example default, an inconsistency this resolves). Free to change now since
v1 writes no vectors.

Updated: meilisearch-settings.json, ADR 0003/0005, ir-schema.json,
SYNTHESIS, README, and A-meilisearch config/prose. Remaining 768 mentions
are legitimate (alternative 768-dim models; nomic-embed-text's real dim).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 16:10:43 +04:00
Randa
d65751e12d docs(adr): record v2 UI-driven model selection (ADR 0005 + 0007 admin view)
Capture the operator-picks-model-from-available-Ollama-models idea as a v2
feature so the seams stay honored. Constraints noted: OCR limited to
vision-capable models (prompt is model-coupled); v2 embedder selection is
reindex-gated (dimension change invalidates vectors); ASR is out of scope
(served by Docling/faster-whisper, not Ollama). provenance.model +
reindex --model-changed already handle post-swap consistency.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 16:03:18 +04:00
Randa
96d4409845 docs(adr): path-stable parent_id for clean chunk-family replace/delete
Decouple the two identifiers so a changed file's whole chunk family can be
replaced/deleted by one stable key, with no orphans and no transient double-hit:

- id (primary key) = sha256(path|content_hash|chunk_index) — content-addressed
  per the brief; chunk_index=0 for a whole doc; changes when content changes.
- parent_id = sha256(path) — path-stable, identical across chunks and edits.
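
A minimal sketch of the two formulas above. It assumes the `|` in the formula is a literal separator between UTF-8 strings and that the digest is stored as lowercase hex; the ADRs may specify a different encoding. The helper names are hypothetical.

```python
import hashlib


def _sha256_hex(text: str) -> str:
    # Assumption: inputs are UTF-8 encoded and the digest is kept as hex.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_id(path: str, content_hash: str, chunk_index: int = 0) -> str:
    """Content-addressed primary key; chunk_index=0 stands for a whole doc."""
    return _sha256_hex(f"{path}|{content_hash}|{chunk_index}")


def parent_id(path: str) -> str:
    """Path-stable family key: identical across chunks and across edits."""
    return _sha256_hex(path)
```

With these definitions, editing a file changes every `chunk_id` (because `content_hash` changes) while `parent_id` stays fixed, so the whole chunk family can be addressed by one stable key.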

- ADR 0003 + meilisearch-settings.json: add parent_id to filterableAttributes
  (delete-by-filter needs it; it was previously distinctAttribute-only, so the
  documented delete-by-parent_id could not have executed).
- ADR 0003/0008: replace-on-change = delete family by filter(parent_id) then
  PUT new chunks; sink tasks confirmed before StateStore commit (crash-convergent).
- ADR 0002/0004 + ir-schema.json + ir-examples.jsonl: updated formulas, dropped
  the old 'parent_id == id for whole docs' framing (kept as a rejected alternative).
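
The replace-on-change ordering described above can be sketched against an in-memory stand-in. `InMemorySink`, `replace_family`, and the dict used as a StateStore are all hypothetical; the real sink talks to Meilisearch and waits for its tasks to be confirmed, which this sketch only imitates by running synchronously.

```python
import hashlib
from dataclasses import dataclass, field


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class InMemorySink:
    """Stand-in for the search sink; keyed by document id."""
    docs: dict = field(default_factory=dict)

    def delete_by_parent(self, parent_id: str) -> None:
        # Mirrors delete-by-filter on parent_id.
        self.docs = {
            doc_id: doc
            for doc_id, doc in self.docs.items()
            if doc["parent_id"] != parent_id
        }

    def put(self, documents: list) -> None:
        for doc in documents:
            self.docs[doc["id"]] = doc


def replace_family(sink, state: dict, path: str, content_hash: str, chunks: list) -> None:
    """Delete the old chunk family, write the new one, then commit state."""
    pid = sha256_hex(path)
    sink.delete_by_parent(pid)
    sink.put([
        {
            "id": sha256_hex(f"{path}|{content_hash}|{index}"),
            "parent_id": pid,
            "chunk_index": index,
            "content": text,
        }
        for index, text in enumerate(chunks)
    ])
    # State is committed last: a crash before this line leaves the old hash
    # in the store, so the next run sees a change and redoes the replace.
    state[path] = content_hash
```

Because the delete targets `parent_id` rather than individual ids, a file that shrinks from three chunks to one leaves no orphaned chunks behind.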

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 15:42:39 +04:00
Randa
daa7c0de85 docs(adr): state Ollama support plainly in ADR 0001; drop 'earlier misstatement' meta-note
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 15:28:43 +04:00
Randa
5e9420e8ed docs(research): reconcile research docs with final decisions; drop banners
Fold each research doc's superseded 'reconciliation banner' corrections into
the body so nothing contradicts the ADRs, then remove the banners:

- I: ScannedOcrExtractor rename; routing table gets v1/v2 phase markers;
  .msg marked unsupported; A/V verdicts reframed v1=Docling ASR / v2=faster-whisper.
- A: settings JSON matched to committed meilisearch-settings.json (path not
  searchable, distinctAttribute=parent_id, maxValuesPerFacet=200, typo disable
  list); truncate->chunk in v1; open questions resolved.
- B: tier tables corrected (Docling owns EML/CSV/WebVTT; Unstructured .msg/ZIP/
  edge-XML V2-only); scanned OCR = Sprint-1 bake-off.
- C: OOXML = Docling-sole + metadata augmenter, docx2python dropped; legacy
  binary Office + Access deferred to v2 (v1 -> skipped).
- D: v1=Docling built-in ASR / v2=faster-whisper upgrade throughout.
- F: Meili CI service image -> v1.48.3.
- G: Unstructured niche note trimmed (.msg/ZIP/edge-XML V2; EML is Docling's).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 15:25:39 +04:00
Randa
d8326103ab docs: rewrite SYNTHESIS §2 to state final decisions; drop stale Agent-B tier table + G/H/I chronology
The per-layer section presented the original Agent-B tiered table (wrong on
EML/CSV routing, docx2python, scanned->Qwen-directly) with only a bolt-on note,
then re-stated the real decision in separate G/H/I subsections. Replaced with a
single clean 'Extraction' subsection (Docling front-half + narrow custom
extractors), collapsed G/H/I to a one-line pointer to the research docs, and
fixed the embedder-dims wording. No decision change; SYNTHESIS is now a clean
statement of the final per-layer decisions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 15:09:24 +04:00
Randa
10d540bc41 docs: prefer small, functional PRs; split large tasks into multiple PRs
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 15:00:26 +04:00
Randa
364de11549 docs: add PLANNING_KICKOFF.md (prompt for the planning session) + README pointer
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 14:08:44 +04:00
Randa
8b3d7ea7e3 fix: correct relative links in brief Section 14 (docs/ root, no ../)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 14:03:15 +04:00
Randa
791232a656 docs: fill in brief Section 14 answers + point to SYNTHESIS/ADRs
The <FILL IN> placeholders in the kickoff brief are replaced with the
confirmed answers (UI=Option B + HTMX, chunk-in-v1, medium scale, Arabic+
English, CPU/GPU-optional, bundled Meilisearch, local disks, CLI-now), with
a pointer to SYNTHESIS §1 and the relevant ADRs. No open placeholders remain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 14:02:53 +04:00
Randa
c4b9842e1e docs: second full sweep — fix factual errors, tighten contracts, normalize
Three independent Opus reviewers (consistency/gaps, contract artifacts,
factual skeptic) + a mechanical scan. Fixes applied:

Factual errors (were WRONG in the authoritative layer):
- Docling ASR default is WHISPER_TINY, not WHISPER_TURBO (ADR 0005/0006).
- Docling does NOT ingest Outlook .msg (only EML/CSV/WebVTT); .msg -> Unstructured
  in v2 (ADR 0006, SYNTHESIS).
- Forgejo bare `uses:` references expand against DEFAULT_ACTIONS_URL = data.forgejo.org, not
  code.forgejo.org; keep code.forgejo.org/forgejo/upload-artifact (ADR 0010, SYNTHESIS).
- ADR 0009 said "faster-whisper in the pipeline image" — v1 ASR is Docling; fixed.

Overstated claims softened (skeptic, with sources):
- faster-whisper large-v3 int8 CPU ~5-6x real-time (not 10x), 2-3 GB peak RAM.
- large-v3-turbo Arabic regression: reported in third-party benchmarks (OpenAI
  documents Thai/Cantonese), not a documented Arabic fact.
- Qwen2.5-VL "best for Arabic handwriting" qualified to fine-tuned; base is only
  moderate (Sprint-1 corpus-validation risk).
- Docling ASR word-level timestamps are computed but not surfaced (not "absent").

Contract artifacts:
- Embedder dimension reconciled: bge-m3 is 1024-dim, not 768; embed model is a
  v2 TBD that must emit 768 (or commit 1024 before first vector). ADR 0003/0005.
- ir-schema: relaxed embedding.dimensions/vector from const-768 to
  dimensions-driven + added embedding `required`; extractor_name id list +
  ADR 0006 id<->class mapping; tags searchable-only (not filterable).
- ir-examples: extractor_name pdf_ocr -> scanned_ocr; fixed the empty-string
  content_hash on the video example; v1 A/V record uses `docling`.
- settings _comment: softened the "ignores unknown keys" claim.

Gaps & consistency:
- Added a per-file limits config block (timeout/max-size) to ADR 0005.
- ADR 0010: pin CI Meilisearch service container to v1.48.3; noted the `heavy`
  runner is an unregistered placeholder.
- SYNTHESIS vector-seam described as the IR `embedding{}` object (sink maps to
  _vectors), not the Meilisearch field.
- Normalized v1/v2 casing across the authoritative layer.
- Reconciliation banners added to research B/D/F/I and C §4.4; A banner narrowed.

Verified: JSON parses, all IR examples schema-valid, all links resolve.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 13:58:08 +04:00
Randa
6210b9464f docs: consistency sweep — align all docs, fix drift, state swappability once
Full-sweep review (mechanical scan + independent Opus reviewer) found drift
accumulated across the many revisions. Fixes:

Contradictions resolved:
- ADR 0005: removed the leftover "default OCR path stays the direct wrapper"
  claim that contradicted the same ADR's Sprint-1 bake-off row; now consistent
  (Docling-VLM leading, wrapper fallback, bake-off decides).
- README: Office/legacy line (docx2python dropped; legacy + Access are V2) and
  A/V line (v1 = Docling ASR, faster-whisper is V2) corrected.
- ADR 0001: default ASR is Docling ASR in v1 (faster-whisper = V2), not the
  previous "faster-whisper (ASR)".
- SYNTHESIS: "whole-doc default in v1" -> "chunk long docs in v1"; the "open
  trade-offs" section rewritten as "resolved" (dedup=one-per-path,
  chunk-in-v1, KVM Win11 VM); cross-cutting items corrected (chunk seams,
  ASR default); tier table annotated as refined by Agent I / ADR 0006.

Schema/settings:
- ir-schema.json: dropped the nonexistent 'office_docx' extractor example.
- ir-examples.jsonl: v1 A/V record relabeled extractor_name -> 'docling'.
- ADR 0003: note that meilisearch-settings.json is authoritative for exact
  shapes (localizedAttributes object form, proximityPrecision, searchCutoffMs,
  typoTolerance extras).

Swappability stated once, authoritatively:
- ADR 0001 gains a "Swappability invariant" block (search engine / model
  runtime / extraction engine incl. Docling all behind interfaces; DoclingDocument
  never escapes; CanonicalDocument is the sole contract). Echoed in README and
  added to CLAUDE.md invariants.

Evidence trail kept honest (no deletions):
- Supersession banners added to research A (chunk-in-v1 / distinctAttribute) and
  research C (docx2python dropped; legacy+Access V2); research I naming/count note.

Verified: JSON parses, all IR examples schema-valid, all links resolve, stale
phrases gone.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 13:29:57 +04:00
Randa
9de81f8949 docs: defer legacy-binary Office and Access to V2 for a leaner V1
Scope decision to keep V1 as thin as possible while still exercising the whole
pipe end-to-end:

- Modern OOXML (.docx/.xlsx/.pptx) stays in V1 — it comes free through Docling's
  front-half, so V1 still delivers Office search for the common case.
- Legacy binary Office (.doc/.xls/.ppt) and Access (.mdb/.accdb) move to V2
  (LegacyOfficeExtractor, LegacyXlsExtractor, AccessExtractor; unoserver +
  xlrd + pyodbc/ACE). In V1 these files route to status: skipped with a clear
  reason and appear in the status summary, so a legacy-heavy corpus is visible
  and can be re-prioritized — never a crash.
- Bonus: the ~800 MB LibreOffice/unoserver converter service drops out of the
  V1 Docker Compose, lightening the zero-install footprint.
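
The V1 routing rule for Office files in the list above can be illustrated with a small sketch. The extension sets come from this message; the function name, the result shape, and the `not_covered` status for non-Office files are assumptions made for illustration, not the project's actual router.

```python
from collections import Counter
from pathlib import PurePath

# Extensions as listed in this commit message.
OOXML_V1 = {".docx", ".xlsx", ".pptx"}
LEGACY_V2 = {".doc", ".xls", ".ppt", ".mdb", ".accdb"}


def route_office_file(path: str) -> dict:
    """Return a routing decision for one file; never raises on legacy input."""
    suffix = PurePath(path).suffix.lower()
    if suffix in OOXML_V1:
        return {"path": path, "status": "routed", "extractor": "docling"}
    if suffix in LEGACY_V2:
        return {
            "path": path,
            "status": "skipped",
            "reason": f"{suffix} is deferred to V2",
        }
    # Assumption: other formats are handled by routes outside this sketch.
    return {"path": path, "status": "not_covered"}


def status_summary(paths: list) -> Counter:
    """Count decisions per status so a legacy-heavy corpus is visible."""
    return Counter(route_office_file(p)["status"] for p in paths)
```

Returning a decision record instead of raising keeps the run fail-isolated, and the summary gives the counts needed to decide whether legacy support should be re-prioritized.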

Net: V1 custom code shrinks to one ScannedOcrExtractor (harness bake-off) plus
the thin OfficeMetadataAugmenter; everything else is Docling + Meilisearch +
UI + CI.

Touches ADR 0006 (routing rows + legacy/Access subsections marked V2), ADR 0009
(converter service = V2), SYNTHESIS (Office/legacy bullets, Agent-I summary,
pins). IR (0002), architecture (0001), chunking (0004) unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 13:14:27 +04:00
Randa
8434b5449b docs: A/V ASR via Docling in v1; OCR harness as a Sprint-1 bake-off
Two design refinements from review feedback, both leaning further on Docling
and both fully behind the Extractor interface (IR unchanged):

1. Audio/video ASR — use Docling's built-in ASR for v1 (preset overridden to
   large-v3, since TURBO regresses on Arabic), as just-another-Docling-format.
   Accepted v1 limitations: segment-level timestamps only, no Silero VAD, no
   diarization. The dedicated ffmpeg + faster-whisper extractor (word-level
   timestamps, VAD, fine-tuned Arabic, diarization) becomes the v2 upgrade.
   IR already carries segment times and reserves words/speaker null, so the
   swap needs no schema change.

2. Scanned Arabic OCR — do NOT pre-commit to a bespoke extractor. The model
   (Qwen2.5-VL) and the arabic-ocr prompt are fixed; the invocation harness is
   an implementation detail decided by a Sprint-1 bake-off on the real Arabic
   corpus: Docling's VLM pipeline (ApiVlmOptions -> Ollama, MARKDOWN, ~300 DPI)
   is the leading candidate (unifies OCR under the Docling path); a thin ~50-line
   direct Ollama wrapper is the fallback if Docling's Markdown re-parsing
   distorts the Arabic output, can't render high-enough DPI, or handles
   slow-CPU streaming poorly. Renamed the route to ScannedOcrExtractor.

Rationale: after correcting the ApiVlmOptions deprecation claim (it is NOT deprecated)
and recognizing the MARKDOWN-path bbox-zeroing is a wash (Qwen returns text),
the objections to Docling-VLM mostly dissolved; the remainder are empirical and
cheap to test, so this becomes a "let the documents decide" detail, not a fork.

Touches ADR 0005, ADR 0006, SYNTHESIS (+ a decision-note in research I). IR
(0002) and architecture (0001)/chunking (0004) unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 13:07:38 +04:00
Randa
2db29ffe5d docs: Docling as the front-half engine (Agent I) + fold into ADRs
Adds docs/research/I-docling-front-half.md answering the human's final check:
"is there OSS that reads files -> local models of choice -> IR, that we can
extend?" Answer: Docling, with a precisely bounded perimeter.

Findings (verified at Docling source level, 2.107.0):
- Docling is the sole content extractor for every native-digital format
  (PDF-text, DOCX, PPTX, XLSX, HTML, EPUB, ODF, LaTeX, MD, EML, MSG, CSV,
  WebVTT). Resolves the ADR 0006 redundancy that listed both Docling AND the
  dedicated Office libs.
- Office libs DEMOTED, not dropped: docx2python removed; python-docx/pptx +
  openpyxl shrink to ~5-line metadata augmenters (core properties + sheet
  names Docling doesn't expose); calamine deferred; xlrd kept for legacy .xls.
- Keep the dedicated faster-whisper extractor: Docling ASR lacks word-level
  timestamps (needed for A/V deep-links) and defaults to WHISPER_TURBO.
- Corrects H/ADR: ApiVlmOptions is NOT formally deprecated in 2.107.0 (only
  HuggingFaceVlmOptions is); it stays the only in-Docling custom-prompt path.
- 5 custom extractors remain: Qwen Arabic OCR, A/V, legacy-Office (.doc/.ppt
  via unoserver->Docling), legacy .xls (xlrd), Access (Windows-gated).

Folded in: ADR 0006 routing table rewritten (Docling front-half + augmenter,
Arabic-vs-printed scanned split, corrected ApiVlmOptions note); ADR 0005
deprecation wording corrected; ADR 0002 note on binary_hash (compute SHA-256
independently); H gets a correction banner; SYNTHESIS + pins updated.
ADR 0001/0004 unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 12:47:39 +04:00
Randa
32ffdb7147 docs: fork/extend re-evaluation (Agent H) + correct Ollama facts in ADRs
Adds docs/research/H-framework-fork-reevaluation.md — a source-level re-eval
prompted by the human's pushback that (1) the Ollama claim was wrong and (2)
forking is allowed (incl. forking LlamaIndex to use our own IR).

Findings, all verified at source:
- CORRECTED: Ollama (LLM+embeddings) is universally supported; Docling's VLM
  pipeline can drive Qwen2.5-VL via Ollama with a custom prompt.
- Verdict (A) build-our-own STANDS on the correct grounds: no framework ships
  a Meilisearch keyword sink; adopting one demotes the IR to a passenger;
  forking llama-index-core to make CanonicalDocument native is a whole-framework
  fork (BaseNode threads through ~74 core files; dedup on computed node.hash;
  no schema versioning) with a permanent rebase tax — to buy only the cheap
  dedup we want to own. Forking LlamaIndex for our IR: NO.
- Docling-VLM-as-OCR: keep the direct Ollama wrapper as default; Docling-VLM
  is an opt-in alternate backend (its custom-prompt ApiVlmOptions is deprecated
  in docling 2.107.0; PLAINTEXT response raises, MARKDOWN zeroes bboxes).

Folded in as wording corrections (no architecture change):
- ADR 0001: correct the Ollama claim; record the fork-rejection rationale; ref H.
- ADR 0005: note Ollama ubiquity + Docling-VLM as a documented alt OCR backend;
  default stays the direct wrapper.
- ADR 0006: strengthen the Docling-VLM rejection with the deprecation +
  response_format findings.
- ADR 0002 (IR) and ADR 0004 (chunking) reaffirmed unchanged.
- SYNTHESIS + README updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 04:54:15 +04:00
Randa
8624e5de52 docs: RAG ingestion-framework survey (Agent G) + fold into ADRs
Adds docs/research/G-rag-ingestion-frameworks.md answering "should we adopt
an existing RAG ingestion framework as the backbone?" Verdict: (A) build our
own thin pipeline; the ecosystem is vector-store-centric and no framework
provides a Meilisearch keyword sink, pluggable Arabic VLM OCR, or a persisted
engine-agnostic IR as the contract.

Folds the concrete reuse findings into the design:
- ADR 0001: records why no framework backbone (necessary, not optional).
- ADR 0004: docling-core HybridChunker/HierarchicalChunker as the chunking
  vehicle for Docling-sourced content; segment packer for OCR/ASR.
- ADR 0006: DoclingDocument as an internal Extractor->Transformer transport
  (never escapes; CanonicalDocument stays the sole external contract).
- ADR 0008: LlamaIndex doc_id+doc_hash cited as prior art for the StateStore.
- SYNTHESIS + README updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 04:03:36 +04:00
Randa
5cc8c99109 docs: research findings and v1 design (IR contract, index, ADRs)
Design phase output for digger — no implementation yet.

Research (docs/research/): six findings docs (Meilisearch, local-model
tooling incl. the existing arabic-ocr setup, Office/legacy, audio/video,
frontend/UX, Forgejo CI + Windows runner) plus SYNTHESIS.md.

Design (docs/decisions/): the Canonical Document IR JSON Schema v1.0
(the contract) with worked examples, the concrete Meilisearch settings,
and ADRs 0001–0010 covering architecture/layering, the IR, index design
(single index, chunk-granularity collapsed by parent_id), chunking,
model backends + Ollama deployment, conversion routing, the read-side
SearchProvider + HTMX UI, dedup/StateStore/incremental/reindex,
Docker-Compose packaging, and layered CI with a native Windows runner.

Confirmed decisions baked in: Arabic+English; one document per path;
chunk long docs in v1; vectors designed-for but switched off; Ollama as
a host service; Windows CI on a KVM VM.

Also adds project README, CLAUDE.md, the brief, and .gitignore.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 03:14:06 +04:00
Randa
57b51329f7 Initial commit: empty README 2026-07-01 00:15:49 +04:00