2 KiB
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Setup
# Create and activate the virtualenv
python3 -m venv arabic_ocr_env
source arabic_ocr_env/bin/activate
# Install dependencies
pip install -r requirements.txt
# System dep required for pdf2image:
sudo apt-get install -y poppler-utils
Running
source arabic_ocr_env/bin/activate
# Auto-detect document type per page
python arabic_ocr_smart.py document.pdf [output.txt]
# Force a specific document type
python arabic_ocr_smart.py scan.pdf --type [handwritten|certificate|id|table|form|mixed]
# Custom Ollama host (default: http://192.168.122.1:11434)
python arabic_ocr_smart.py scan.pdf --host http://localhost:11434
Architecture
Single-file script (arabic_ocr_smart.py) with no tests or build system.
Pipeline: PDF → PIL images (via pdf2image/poppler at 300 DPI) → base64 → Ollama /api/chat → structured text output.
Two-pass per page (when no --type forced):
- Detection pass — sends the
"detect"prompt toqwen2.5vl:7bto classify the page into one of:handwritten,certificate,id,table,form,mixed. - Extraction pass — sends the type-specific prompt from the
PROMPTSdict to extract structured text.
Key constants (top of file):
DEFAULT_HOST— Ollama endpoint (VM bridged network by default)MODEL—qwen2.5vl:7bDPI— render resolution (300)TIMEOUT— 300 s per Ollama request
Prompts are stored in the PROMPTS dict. Arabic prompts are used for Arabic-language outputs (handwritten, certificate, id, form, mixed); the table prompt is English to get Markdown output. Adding a new document type means adding a key to PROMPTS and it is automatically available via --type.
Output format: pages separated by === headers, each labeled with page number and detected type. Default output filename is <input_stem>_ocr.txt alongside the input PDF.