UniTabBank: A Large-Scale Multi-Lingual, Multi-Layout, Multi-Type, Multi-Format Dataset for Table Detection
Abstract
Tables play a key role in conveying structured data across documents. Accurate table detection is crucial for downstream tasks such as table structure recognition and information extraction. However, current datasets lack diversity in format, language, and layout, limiting real-world generalization. This underscores the need for well-annotated datasets that are multi-lingual, layout-diverse, document-agnostic, and format-rich. To address these limitations, we introduce UniTabBank, a large-scale, diverse table detection dataset designed to reflect realistic use cases. UniTabBank is characterized by five key attributes: (i) Multi-Lingual --- supporting 28 languages (including Arabic, English, and Hindi); (ii) Multi-Layout --- encompassing both single-column and multi-column documents; (iii) Multi-Type --- covering a wide range of document genres such as annual reports, books, newspapers, and magazines; (iv) Multi-Format --- comprising scanned documents, photographed pages, and PDFs; and (v) Scale and Annotation Quality --- consisting of 55,443 document page images with 81,179 accurately annotated table instances. Additionally, we introduce UniTabDet, a YOLO-based table detection model that outperforms state-of-the-art methods on eight of nine table detection benchmarks. Evaluated in a zero-shot setting on four layout analysis benchmarks, UniTabDet also shows strong generalization across diverse documents without additional fine-tuning. The UniTabBank dataset and the UniTabDet model will be released publicly for community use and research advancement.