Beyond the Highlights: Video Retrieval with Salient and Surrounding Contexts
Abstract
When searching for videos, users often rely on surrounding context, such as background elements or temporal details, beyond the salient content. However, existing video models struggle with fine-grained spatio-temporal understanding, particularly of surrounding contexts, and no datasets effectively evaluate their performance in this setting. We introduce the SS Datasets, a collection of three video retrieval datasets that provide detailed salient and surrounding captions aligned with semantically segmented clips. To capture rich, temporally localized contexts aligned with meaningful scene changes, we segment videos at scene transitions and generate captions with a vision-language model. We then analyze current video models, revealing their difficulty in matching surrounding-context queries and in handling temporally complex videos. To address these challenges, we propose simple yet effective baselines that improve retrieval across diverse query types, enabling models to generalize robustly to real-world scenarios.