CoreCaption: Core Caption based Text-to-Video Retrieval
Abstract
The explosive growth of online video content has heightened the need for efficient text-to-video retrieval systems, which rely heavily on accurate video-text representations. However, challenges such as the multi-thematic nature of videos and inconsistent caption quality make it difficult to capture comprehensive video-text relationships, hindering retrieval performance. To address these issues, we propose CoreCaption, a novel framework that extracts and leverages core captions, i.e., the most representative captions capturing a video's essential themes. Our approach introduces a core caption extraction method based on similarity-based density estimation and clustering, together with a Core Caption Guided Attention (CCGA) mechanism that integrates video-specific semantic information into text queries while preserving their original intent. Furthermore, we employ a teacher-student architecture for efficient inference without reliance on core captions during deployment. Extensive experiments on benchmark datasets including MSR-VTT, VATEX, and MSVD demonstrate that CoreCaption outperforms state-of-the-art methods. These results validate the effectiveness and robustness of our framework across diverse video-text datasets.
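The abstract only names the extraction procedure; the following is a minimal sketch of what similarity-based density estimation and clustering for core caption selection could look like, assuming captions are already embedded by a text encoder as fixed-size vectors. The function name, the mean-similarity density score, and the use of k-means are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_core_captions(caption_embs: np.ndarray, n_clusters: int = 3):
    """Illustrative sketch (not the paper's exact method): pick one 'core'
    caption per cluster, chosen as the caption with the highest
    similarity-based density, i.e. the highest mean similarity to the
    other captions of the same video."""
    # Normalize so dot products are cosine similarities.
    embs = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sim = embs @ embs.T                                   # pairwise cosine similarity
    density = (sim.sum(axis=1) - 1.0) / (len(embs) - 1)   # mean similarity to others

    # Group captions into themes; k-means is an assumed stand-in for the
    # clustering step described in the paper.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embs)

    core_ids = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        if len(members) == 0:
            continue
        # The densest caption in each cluster serves as that theme's core caption.
        core_ids.append(int(members[np.argmax(density[members])]))
    return core_ids

# Toy usage: 20 random "caption embeddings" for one video.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embs = rng.normal(size=(20, 512))
    print(extract_core_captions(fake_embs, n_clusters=3))
```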