- sprint skill: explicit "Resume — session closed mid-sprint" section
that reconstructs the frontier from groom doc -> Forgejo -> git, and a
rule to keep per-issue Status live (todo -> in progress -> PR open ->
merged) and commit/push per TDD step so state is never archaeology.
- docs/sprints/README.md: add a Status column to the groom template.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- CLAUDE.md dev loop: after opening the PR, wait for CI green, then
self-review the diff (/code-review) before saying it's ready; never
claim ready before CI-green + self-review; still never self-merge.
- .claude/skills/sprint: thin per-project skill so "start sprint N" /
"groom sprint N" / "continue the sprint" deterministically runs the
groom -> per-issue dev-loop -> retro flow, enforcing the guardrails
(plan is source of truth, groom-first, CI-green + self-review before
ready, human merges). Composes with the superpowers skills.
- docs/sprints/README.md: fold the CI-green + self-review gate into the
per-sprint Definition of Done.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- CLAUDE.md: plan is the source of truth (keep it in sync with Forgejo
issues both ways); groom each sprint just-in-time into
docs/sprints/sprint-<n>-<slug>.md from plan + current state + prior
retro; scope changes flow groom -> plan -> issues; close with a retro.
- Plan doc: add a "Living plan & sprint grooming" section.
- docs/sprints/README.md: the grooming rules + a template.
- docs/sprints/sprint-0-skeleton.md: the first groom (Sprint 0), with
PR sequencing, the Windows-runner human-touch risk, and DoD.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Split the Docker Compose work so the engine (Meilisearch + pipeline)
lands in Sprint 0 as S0-14, making the Sprint-0 e2e slice runnable with
one command and delivering ADR 0009's zero-config first run early. S2-9
slims to extending Compose with the ui service once the UI exists.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Turn the approved design (ADRs 0001-0010, IR + Meilisearch contracts,
research SYNTHESIS) into an agile, sprint-based v1 roadmap where every
sprint ships a working end-to-end slice:
- Sprint 0: skeleton — scaffolding, green Linux + native Windows CI,
the seven interfaces, FileSink, config + SQLite StateStore,
MeilisearchSink + index/settings (dormant embedder), and a trivial
walk -> stub -> IR -> Meilisearch -> search slice.
- Sprint 1: Priority 1 scanned-doc OCR e2e (harness bake-off early,
chunking transformer, incremental/replace/delete semantics).
- Sprint 2: Priority 2 native-digital + Office (Docling + metadata
augmenter) and the FastAPI/HTMX SearchProvider UI; Compose stack.
- Sprint 3: Priority 3 A/V via Docling ASR (large-v3, segment times).
- Coarse v2 milestone for deferred work.
Honors the invariants: IR is the sole contract; Docling/Meilisearch/
models swappable behind interfaces; local-only inference; v1
keyword-only (vectors dormant); fail-isolated; CI green from commit one.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Win11 KVM VM is provisioned and ready, so the runner wiring is no longer
a human prerequisite. Turn it into three concrete, pickup-able Sprint-0 infra
issues (register the Crown0815 runner with a windows label; vendor the host
toolchain; wire the runs-on: windows CI job), each with acceptance criteria.
Also refresh the status note now that the design PR is merged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ir-examples.jsonl gains a fifth record — a Word chunk whose embedding{} is
populated (model_id bge-m3, dimensions 1024, a real 1024-length vector) —
so the embedding shape and the sink's vector->_vectors.digger_semantic
mapping have a concrete, schema-valid fixture. The four v1 records stay
embedding:null. id/parent_id are computed per the path-stable formula.
decisions/README labels it as the v2 illustration.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Switch the committed userProvided embedder dimension 768 -> 1024 to match
bge-m3, the planned v2 default (research B already recommended it as the
Arabic+English choice; ADR 0005 previously carried e5-base/768 as the
example default, an inconsistency this resolves). Free to change now since
v1 writes no vectors.
Updated: meilisearch-settings.json, ADR 0003/0005, ir-schema.json,
SYNTHESIS, README, and A-meilisearch config/prose. Remaining 768 mentions
are legitimate (alternative 768-dim models; nomic-embed-text's real dim).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Capture the operator-picks-model-from-available-Ollama-models idea as a v2
feature so the seams stay honored. Constraints noted: OCR limited to
vision-capable models (prompt is model-coupled); v2 embedder selection is
reindex-gated (dimension change invalidates vectors); ASR is out of scope
(served by Docling/faster-whisper, not Ollama). provenance.model +
reindex --model-changed already handle post-swap consistency.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Decouple the two identifiers so a changed file's whole chunk family can be
replaced/deleted by one stable key, with no orphans and no transient double-hit:
- id (primary key) = sha256(path|content_hash|chunk_index) — content-addressed
per the brief; chunk_index=0 for a whole doc; changes when content changes.
- parent_id = sha256(path) — path-stable, identical across chunks and edits.
- ADR 0003 + meilisearch-settings.json: add parent_id to filterableAttributes
(delete-by-filter needs it; it was previously distinctAttribute-only, so the
documented delete-by-parent_id could not have executed).
- ADR 0003/0008: replace-on-change = delete family by filter(parent_id) then
PUT new chunks; sink tasks confirmed before StateStore commit (crash-convergent).
- ADR 0002/0004 + ir-schema.json + ir-examples.jsonl: updated formulas, dropped
the old 'parent_id == id for whole docs' framing (kept as a rejected alternative).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The per-layer section presented the original Agent-B tiered table (wrong on
EML/CSV routing, docx2python, scanned->Qwen-directly) with only a bolt-on note,
then re-stated the real decision in separate G/H/I subsections. Replaced with a
single clean 'Extraction' subsection (Docling front-half + narrow custom
extractors), collapsed G/H/I to a one-line pointer to the research docs, and
fixed the embedder-dims wording. No decision change; SYNTHESIS is now a clean
statement of the final per-layer decisions.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The <FILL IN> placeholders in the kickoff brief are replaced with the
confirmed answers (UI=Option B + HTMX, chunk-in-v1, medium scale, Arabic+
English, CPU/GPU-optional, bundled Meilisearch, local disks, CLI-now), with
a pointer to SYNTHESIS §1 and the relevant ADRs. No open placeholders remain.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Three independent Opus reviewers (consistency/gaps, contract artifacts,
factual skeptic) + a mechanical scan. Fixes applied:
Factual errors (were WRONG in the authoritative layer):
- Docling ASR default is WHISPER_TINY, not WHISPER_TURBO (ADR 0005/0006).
- Docling does NOT ingest Outlook .msg (only EML/CSV/WebVTT); .msg -> Unstructured
in v2 (ADR 0006, SYNTHESIS).
- Forgejo bare `uses:` expand against DEFAULT_ACTIONS_URL = data.forgejo.org, not
code.forgejo.org; keep code.forgejo.org/forgejo/upload-artifact (ADR 0010, SYNTHESIS).
- ADR 0009 said "faster-whisper in the pipeline image" — v1 ASR is Docling; fixed.
Overstated claims softened (skeptic, with sources):
- faster-whisper large-v3 int8 CPU ~5-6x real-time (not 10x), 2-3 GB peak RAM.
- large-v3-turbo Arabic regression: reported in third-party benchmarks (OpenAI
documents Thai/Cantonese), not a documented Arabic fact.
- Qwen2.5-VL "best for Arabic handwriting" qualified to fine-tuned; base is only
moderate (Sprint-1 corpus-validation risk).
- Docling ASR word-level timestamps are computed but not surfaced (not "absent").
Contract artifacts:
- Embedder dimension reconciled: bge-m3 is 1024-dim, not 768; embed model is a
v2 TBD that must emit 768 (or commit 1024 before first vector). ADR 0003/0005.
- ir-schema: relaxed embedding.dimensions/vector from const-768 to
dimensions-driven + added embedding `required`; extractor_name id list +
ADR 0006 id<->class mapping; tags searchable-only (not filterable).
- ir-examples: extractor_name pdf_ocr -> scanned_ocr; fixed the empty-string
content_hash on the video example; v1 A/V record uses `docling`.
- settings _comment: softened the "ignores unknown keys" claim.
Gaps & consistency:
- Added a per-file limits config block (timeout/max-size) to ADR 0005.
- ADR 0010: pin CI Meilisearch service container to v1.48.3; noted the `heavy`
runner is an unregistered placeholder.
- SYNTHESIS vector-seam described as the IR `embedding{}` object (sink maps to
_vectors), not the Meilisearch field.
- Normalized v1/v2 casing across the authoritative layer.
- Reconciliation banners added to research B/D/F/I and C §4.4; A banner narrowed.
Verified: JSON parses, all IR examples schema-valid, all links resolve.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Full-sweep review (mechanical scan + independent Opus reviewer) found drift
accumulated across the many revisions. Fixes:
Contradictions resolved:
- ADR 0005: removed the leftover "default OCR path stays the direct wrapper"
claim that contradicted the same ADR's Sprint-1 bake-off row; now consistent
(Docling-VLM leading, wrapper fallback, bake-off decides).
- README: Office/legacy line (docx2python dropped; legacy + Access are V2) and
A/V line (v1 = Docling ASR, faster-whisper is V2) corrected.
- ADR 0001: default ASR is Docling ASR in v1 (faster-whisper = V2), not the
previous "faster-whisper (ASR)".
- SYNTHESIS: "whole-doc default in v1" -> "chunk long docs in v1"; the "open
trade-offs" section rewritten as "resolved" (dedup=one-per-path,
chunk-in-v1, KVM Win11 VM); cross-cutting items corrected (chunk seams,
ASR default); tier table annotated as refined by Agent I / ADR 0006.
Schema/settings:
- ir-schema.json: dropped the nonexistent 'office_docx' extractor example.
- ir-examples.jsonl: v1 A/V record relabeled extractor_name -> 'docling'.
- ADR 0003: note that meilisearch-settings.json is authoritative for exact
shapes (localizedAttributes object form, proximityPrecision, searchCutoffMs,
typoTolerance extras).
Swappability stated once, authoritatively:
- ADR 0001 gains a "Swappability invariant" block (search engine / model
runtime / extraction engine incl. Docling all behind interfaces; DoclingDocument
never escapes; CanonicalDocument is the sole contract). Echoed in README and
added to CLAUDE.md invariants.
Evidence trail kept honest (no deletions):
- Supersession banners added to research A (chunk-in-v1 / distinctAttribute) and
research C (docx2python dropped; legacy+Access V2); research I naming/count note.
Verified: JSON parses, all IR examples schema-valid, all links resolve, stale
phrases gone.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Scope decision to keep V1 as thin as possible while still exercising the whole
pipe end-to-end:
- Modern OOXML (.docx/.xlsx/.pptx) stays in V1 — it comes free through Docling's
front-half, so V1 still delivers Office search for the common case.
- Legacy binary Office (.doc/.xls/.ppt) and Access (.mdb/.accdb) move to V2
(LegacyOfficeExtractor, LegacyXlsExtractor, AccessExtractor; unoserver +
xlrd + pyodbc/ACE). In V1 these files route to status: skipped with a clear
reason and appear in the status summary, so a legacy-heavy corpus is visible
and can be re-prioritized — never a crash.
- Bonus: the ~800 MB LibreOffice/unoserver converter service drops out of the
V1 Docker Compose, lightening the zero-install footprint.
Net: V1 custom code shrinks to one ScannedOcrExtractor (harness bake-off) plus
the thin OfficeMetadataAugmenter; everything else is Docling + Meilisearch +
UI + CI.
Touches ADR 0006 (routing rows + legacy/Access subsections marked V2), ADR 0009
(converter service = V2), SYNTHESIS (Office/legacy bullets, Agent-I summary,
pins). IR (0002), architecture (0001), chunking (0004) unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two design refinements from review feedback, both leaning further on Docling
and both fully behind the Extractor interface (IR unchanged):
1. Audio/video ASR — use Docling's built-in ASR for v1 (preset overridden to
large-v3, since TURBO regresses on Arabic), as just-another-Docling-format.
Accepted v1 limitations: segment-level timestamps only, no Silero VAD, no
diarization. The dedicated ffmpeg + faster-whisper extractor (word-level
timestamps, VAD, fine-tuned Arabic, diarization) becomes the v2 upgrade.
IR already carries segment times and reserves words/speaker null, so the
swap needs no schema change.
2. Scanned Arabic OCR — do NOT pre-commit to a bespoke extractor. The model
(Qwen2.5-VL) and the arabic-ocr prompt are fixed; the invocation harness is
an implementation detail decided by a Sprint-1 bake-off on the real Arabic
corpus: Docling's VLM pipeline (ApiVlmOptions -> Ollama, MARKDOWN, ~300 DPI)
is the leading candidate (unifies OCR under the Docling path); a thin ~50-line
direct Ollama wrapper is the fallback if Docling's Markdown re-parsing
distorts the Arabic output, can't render high-enough DPI, or handles
slow-CPU streaming poorly. Renamed the route to ScannedOcrExtractor.
Rationale: after correcting the ApiVlmOptions deprecation (it is NOT deprecated)
and recognizing the MARKDOWN-path bbox-zeroing is a wash (Qwen returns text),
the objections to Docling-VLM mostly dissolved; the remainder are empirical and
cheap to test, so this becomes a "let the documents decide" detail, not a fork.
Touches ADR 0005, ADR 0006, SYNTHESIS (+ a decision-note in research I). IR
(0002) and architecture (0001)/chunking (0004) unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds docs/research/I-docling-front-half.md answering the human's final check:
"is there OSS that reads files -> local models of choice -> IR, that we can
extend?" Answer: Docling, with a precisely bounded perimeter.
Findings (verified at Docling source level, 2.107.0):
- Docling is the sole content extractor for every native-digital format
(PDF-text, DOCX, PPTX, XLSX, HTML, EPUB, ODF, LaTeX, MD, EML, MSG, CSV,
WebVTT). Resolves the ADR 0006 redundancy that listed both Docling AND the
dedicated Office libs.
- Office libs DEMOTED, not dropped: docx2python removed; python-docx/pptx +
openpyxl shrink to ~5-line metadata augmenters (core properties + sheet
names Docling doesn't expose); calamine deferred; xlrd kept for legacy .xls.
- Keep the dedicated faster-whisper extractor: Docling ASR lacks word-level
timestamps (needed for A/V deep-links) and defaults to WHISPER_TURBO.
- Corrects H/ADR: ApiVlmOptions is NOT formally deprecated in 2.107.0 (only
HuggingFaceVlmOptions is); it stays the only in-Docling custom-prompt path.
- 5 custom extractors remain: Qwen Arabic OCR, A/V, legacy-Office (.doc/.ppt
via unoserver->Docling), legacy .xls (xlrd), Access (Windows-gated).
Folded in: ADR 0006 routing table rewritten (Docling front-half + augmenter,
Arabic-vs-printed scanned split, corrected ApiVlmOptions note); ADR 0005
deprecation wording corrected; ADR 0002 note on binary_hash (compute SHA-256
independently); H gets a correction banner; SYNTHESIS + pins updated.
ADR 0001/0004 unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds docs/research/H-framework-fork-reevaluation.md — a source-level re-eval
prompted by the human's pushback that (1) the Ollama claim was wrong and (2)
forking is allowed (incl. forking LlamaIndex to use our own IR).
Findings, all verified at source:
- CORRECTED: Ollama (LLM+embeddings) is universally supported; Docling's VLM
pipeline can drive Qwen2.5-VL via Ollama with a custom prompt.
- Verdict (A) build-our-own STANDS on the correct grounds: no framework ships
a Meilisearch keyword sink; adopting one demotes the IR to a passenger;
forking llama-index-core to make CanonicalDocument native is a whole-framework
fork (BaseNode threads through ~74 core files; dedup on computed node.hash;
no schema versioning) with a permanent rebase tax — to buy only the cheap
dedup we want to own. Forking LlamaIndex for our IR: NO.
- Docling-VLM-as-OCR: keep the direct Ollama wrapper as default; Docling-VLM
is an opt-in alternate backend (its custom-prompt ApiVlmOptions is deprecated
in docling 2.107.0; PLAINTEXT response raises, MARKDOWN zeroes bboxes).
Folded in as wording corrections (no architecture change):
- ADR 0001: correct the Ollama claim; record the fork-rejection rationale; ref H.
- ADR 0005: note Ollama ubiquity + Docling-VLM as a documented alt OCR backend;
default stays the direct wrapper.
- ADR 0006: strengthen the Docling-VLM rejection with the deprecation +
response_format findings.
- ADR 0002 (IR) and ADR 0004 (chunking) reaffirmed unchanged.
- SYNTHESIS + README updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds docs/research/G-rag-ingestion-frameworks.md answering "should we adopt
an existing RAG ingestion framework as the backbone?" Verdict: (A) build our
own thin pipeline; the ecosystem is vector-store-centric and no framework
provides a Meilisearch keyword sink, pluggable Arabic VLM OCR, or a persisted
engine-agnostic IR as the contract.
Folds the concrete reuse findings into the design:
- ADR 0001: records why no framework backbone (necessary, not optional).
- ADR 0004: docling-core HybridChunker/HierarchicalChunker as the chunking
vehicle for Docling-sourced content; segment packer for OCR/ASR.
- ADR 0006: DoclingDocument as an internal Extractor->Transformer transport
(never escapes; CanonicalDocument stays the sole external contract).
- ADR 0008: LlamaIndex doc_id+doc_hash cited as prior art for the StateStore.
- SYNTHESIS + README updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Design phase output for digger — no implementation yet.
Research (docs/research/): six findings docs (Meilisearch, local-model
tooling incl. the existing arabic-ocr setup, Office/legacy, audio/video,
frontend/UX, Forgejo CI + Windows runner) plus SYNTHESIS.md.
Design (docs/decisions/): the Canonical Document IR JSON Schema v1.0
(the contract) with worked examples, the concrete Meilisearch settings,
and ADRs 0001–0010 covering architecture/layering, the IR, index design
(single index, chunk-granularity collapsed by parent_id), chunking,
model backends + Ollama deployment, conversion routing, the read-side
SearchProvider + HTMX UI, dedup/StateStore/incremental/reindex,
Docker-Compose packaging, and layered CI with a native Windows runner.
Confirmed decisions baked in: Arabic+English; one document per path;
chunk long docs in v1; vectors designed-for but switched off; Ollama as
a host service; Windows CI on a KVM VM.
Also adds project README, CLAUDE.md, the brief, and .gitignore.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>