CURIO: Curvature-Aligned and Efficient OCR for Low-Resource Historical Manuscripts
Abstract
We present CURIO, an OCR system for low-resource historical manuscripts. In many challenging cases, manuscripts feature curved text lines, unsegmented lines with lack of spacing between words, and highly variable line lengths — conditions under which existing OCR methods fail. To tackle this challenge, we first extract lines and corresponding curvature profiles from manuscripts, then straighten them using a rectification procedure to reduce redundant background within each line. Because data is scarce, we compliment real data with synthetic data. To bridge the synthetic–real gap, we generate line images by warping rendered straight text along the rectified profiles, ensuring both real and synthetic lines align in their curvature characteristics. Our recognizer is a lightweight CNN–Transformer with padding-aware null activations, sparse attention and optimized with CTC loss for efficient training. We evaluate our method on challenging manuscript collections written in Sharada, a rare and endangered Indic script. CURIO outperforms strong CNN+RNN and Transformer baselines, with the largest gains on high-curvature lines and long lines. CURIO further transfers zero-shot to printed Sharada text, indicating robustness beyond manuscript domain.