Saliency-Guided DETR for Moment Retrieval and Highlight Detection
Abstract
With the rapid growth of available video content, the ability to search for specific moments within videos using textual queries has become increasingly relevant. This capability is crucial in many scenarios, from locating specific events in extensive surveillance footage to finding exciting scenes in movies. However, existing approaches for video Moment Retrieval and Highlight Detection often struggle to align text and video features effectively, limiting their performance. We argue that recent video foundation models designed for video-text alignment can overcome these limitations, and we propose a novel architecture built on such models to test this hypothesis. Combined with our Saliency-Guided Cross Attention mechanism and a hybrid DETR architecture, our approach yields significantly improved results. To further enhance our approach, we developed InterVid-MR, a large-scale, high-quality dataset specifically designed for pretraining. Extensive experiments and comparisons with current state-of-the-art methods confirm the effectiveness of our approach, which achieves 58.8 mAP on QVHighlights, 60.7 R@1 mIoU on Charades-STA, and 42.4 R@1 mIoU on TACoS. These results highlight the efficiency and scalability of our method for video-language tasks in both zero-shot and fine-tuning scenarios.