arabic-ocr/CLAUDE.md
Randa 5aec8a5c6c Initial commit: smart Arabic OCR script with document-aware prompting
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-26 18:31:44 +04:00

2 KiB

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Setup

# Create and activate the virtualenv
python3 -m venv arabic_ocr_env
source arabic_ocr_env/bin/activate

# Install dependencies
pip install -r requirements.txt
# System dep required for pdf2image:
sudo apt-get install -y poppler-utils

Running

source arabic_ocr_env/bin/activate

# Auto-detect document type per page
python arabic_ocr_smart.py document.pdf [output.txt]

# Force a specific document type
python arabic_ocr_smart.py scan.pdf --type [handwritten|certificate|id|table|form|mixed]

# Custom Ollama host (default: http://192.168.122.1:11434)
python arabic_ocr_smart.py scan.pdf --host http://localhost:11434

Architecture

Single-file script (arabic_ocr_smart.py) with no tests or build system.

Pipeline: PDF → PIL images (via pdf2image/poppler at 300 DPI) → base64 → Ollama /api/chat → structured text output.

Two-pass per page (when no --type forced):

  1. Detection pass — sends the "detect" prompt to qwen2.5vl:7b to classify the page into one of: handwritten, certificate, id, table, form, mixed.
  2. Extraction pass — sends the type-specific prompt from the PROMPTS dict to extract structured text.

Key constants (top of file):

  • DEFAULT_HOST — Ollama endpoint (VM bridged network by default)
  • MODELqwen2.5vl:7b
  • DPI — render resolution (300)
  • TIMEOUT — 300 s per Ollama request

Prompts are stored in the PROMPTS dict. Arabic prompts are used for Arabic-language outputs (handwritten, certificate, id, form, mixed); the table prompt is English to get Markdown output. Adding a new document type means adding a key to PROMPTS and it is automatically available via --type.

Output format: pages separated by === headers, each labeled with page number and detected type. Default output filename is <input_stem>_ocr.txt alongside the input PDF.