Learning Beyond Labels: Self-Supervised Handwritten Text Recognition
Abstract
This paper addresses a key challenge in Handwritten Text Recognition (HTR): the dependence on large volumes of labeled data. To overcome this, we propose a self-supervised learning (SSL) framework, LoGo-HTR, that minimizes labeling requirements while achieving strong recognition performance. We introduce SSL-HWD, a large-scale dataset of 10 million word-level handwritten images drawn from diverse scanned documents, partitioned into a small labeled subset and a much larger unlabeled subset. LoGo-HTR combines a local contrastive loss that enforces spatial consistency with a global decorrelation loss that promotes feature diversity. This dual objective enables robust, invariant, and spatially discriminative feature learning. After self-supervised pretraining, we fine-tune a transformer-based decoder using limited labeled data. Extensive experiments on the standard HTR benchmarks IAM and GHNK demonstrate that, after SSL pretraining on our unlabeled dataset, our method consistently outperforms state-of-the-art approaches, even when fine-tuned on only 80% and 20% of the available labeled training data from the respective benchmarks. Ablation studies confirm the effectiveness of the dual-loss design and demonstrate the potential of scalable, label-efficient handwritten text recognition. Both SSL-HWD and LoGo-HTR will be released publicly for community use and research advancement.
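To make the dual objective concrete, the following is a minimal PyTorch sketch of a loss of the kind the abstract describes: an InfoNCE-style local contrastive term over spatial feature positions plus a decorrelation term (in the style of Barlow Twins) over globally pooled features. Function names, tensor shapes, and the weighting `lam` are illustrative assumptions, not the exact LoGo-HTR formulation, which is defined in the method section.

```python
# Hypothetical sketch of a dual SSL objective: local contrast over spatial
# positions of two augmented views + decorrelation of globally pooled features.
import torch
import torch.nn.functional as F


def local_contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: (B, C, H, W) feature maps of two augmented views of the same images.
    Each spatial position in z1 is pulled toward the same position in z2 and
    pushed away from all other positions in the batch (InfoNCE)."""
    B, C, H, W = z1.shape
    q = F.normalize(z1.permute(0, 2, 3, 1).reshape(B * H * W, C), dim=1)
    k = F.normalize(z2.permute(0, 2, 3, 1).reshape(B * H * W, C), dim=1)
    logits = q @ k.t() / temperature                   # (N, N) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)  # matching positions are positives
    return F.cross_entropy(logits, targets)


def global_decorrelation_loss(z1, z2, eps=1e-5):
    """Penalize off-diagonal entries of the cross-correlation between the
    pooled, batch-normalized features of the two views (feature diversity)."""
    g1 = z1.mean(dim=(2, 3))                           # (B, C) global average pool
    g2 = z2.mean(dim=(2, 3))
    g1 = (g1 - g1.mean(0)) / (g1.std(0) + eps)
    g2 = (g2 - g2.mean(0)) / (g2.std(0) + eps)
    c = g1.t() @ g2 / g1.size(0)                       # (C, C) cross-correlation
    off_diag = c - torch.diag(torch.diagonal(c))
    return (off_diag ** 2).sum() / c.size(0)


def dual_ssl_loss(z1, z2, lam=0.05):
    """Combined objective: local spatial contrast + global decorrelation."""
    return local_contrastive_loss(z1, z2) + lam * global_decorrelation_loss(z1, z2)


if __name__ == "__main__":
    # Toy example: batch of 4 word images, 64-channel encoder output, 8x32 grid.
    z1, z2 = torch.randn(4, 64, 8, 32), torch.randn(4, 64, 8, 32)
    print(dual_ssl_loss(z1, z2).item())
```

In such a setup the contrastive term keeps corresponding character-level regions of the two views aligned, while the decorrelation term discourages redundant feature dimensions after pooling; the balance between the two is controlled by the assumed weight `lam`.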