Oral Session
Oral Session 6B: Video Recognition and Understanding I
Ego-EXTRA: video-language Egocentric Dataset for EXpert-TRAinee assistance
Francesco Ragusa ⋅ Michele Mazzamuto ⋅ Rosario Forte ⋅ Irene D'Ambra ⋅ James Fort ⋅ Jakob Engel ⋅ Antonino Furnari ⋅ Giovanni Farinella
We present Ego-EXTRA, a video-language Egocentric Dataset for EXpert-TRAinee assistance. Ego-EXTRA features 50 hours of unscripted egocentric videos of subjects performing procedural activities (the trainees) while guided by real-world experts who provide guidance and answer specific questions using natural language. Following a "Wizard of Oz" data collection paradigm, the expert enacts a wearable intelligent assistant, observing the trainee's activities exclusively from the trainee's egocentric point of view, answering questions when asked by the trainee, or proactively offering suggestions during the procedures. This unique data collection protocol enables Ego-EXTRA to capture high-quality dialogue in which expert-level feedback is provided to the trainee. Two-way dialogues between experts and trainees are recorded, transcribed, and used to create a novel benchmark comprising more than 45k high-quality Visual Question Answer sets, which we use to evaluate Multimodal Large Language Models. The results show that Ego-EXTRA is challenging and highlight the limitations of current models when used to provide expert-level assistance to the user. The Ego-EXTRA dataset will be publicly shared. We believe that Ego-EXTRA will support the benchmarking of egocentric video-language assistants.
Similarity-aware Probabilistic Embeddings Modeling for Video-Text Retrieval
Yuliang Huang ⋅ Pengxu Wei ⋅ Zhicheng Dong ⋅ Liang Lin
Video-text retrieval is a fundamental task in multi-modal learning, aiming to accurately retrieve videos that match given textual descriptions. While recent contrastive methods have made significant progress by embedding videos and texts into a joint space, they often suffer from semantic over-clustering, a phenomenon where semantically distinct videos are mapped to overly similar embeddings due to dominant but uninformative visual patterns (e.g., recurring backgrounds or common objects). This effect becomes particularly problematic under short or ambiguous queries, where it suppresses fine-grained semantics and degrades retrieval precision. To address this, we propose Similarity-aware Probabilistic Embeddings Modeling (SPEM), a novel framework that refines video representations by modeling them as adaptive probability distributions rather than static vectors. SPEM incorporates cross-modal attention to highlight text-relevant visual content and suppress irrelevant patterns, and leverages multi-level similarity features to dynamically adjust the embedding variance, thereby preserving subtle but critical semantic cues. To further improve alignment, we employ a Semantic-Distribution Contrastive Loss to optimize the alignment structure in the probabilistic space, encouraging more discriminative separation across hard negatives. Extensive experiments on five widely used video-text retrieval benchmarks (MSRVTT, DiDeMo, VATEX, MSVD, and Charades) demonstrate that SPEM consistently outperforms strong CLIP-based baselines.
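As a minimal sketch of what it means to treat a video embedding as a distribution rather than a static vector, the snippet below models a video as a diagonal Gaussian and scores it against a text vector by averaging cosine similarity over samples. This is not the SPEM implementation: the function names, the diagonal-Gaussian choice, and the Monte Carlo scoring are all assumptions made for illustration, and the abstract does not say how the variance is predicted.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sample_embedding(mean, std, rng):
    """Draw one sample from a diagonal Gaussian as mean + std * eps."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def expected_similarity(video_mean, video_std, text_vec, n_samples=64, seed=0):
    """Monte Carlo estimate of the expected video-text cosine similarity.

    Hypothetical helper: a stand-in for scoring in a probabilistic space,
    not the loss or similarity described in the abstract.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        sample = sample_embedding(video_mean, video_std, rng)
        total += cosine(sample, text_vec)
    return total / n_samples
```

With a zero standard deviation the estimate reduces to the ordinary cosine similarity of a static vector; a larger standard deviation spreads the samples and lowers the expected similarity to a perfectly aligned text vector.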
PromptGAR: Flexible Promptive Group Activity Recognition
Zhangyu Jin ⋅ Andrew Feng ⋅ Ankur Chemburkar ⋅ Celso de Melo
We present PromptGAR, a novel framework for Group Activity Recognition (GAR) that offers both input flexibility and high recognition accuracy. Existing approaches suffer from limited real-world applicability due to their reliance on full prompt annotations, a fixed number of frames and instances, and a lack of actor consistency. To bridge this gap, we propose PromptGAR, the first GAR model to provide input flexibility across prompts, frames, and instances without the need for retraining. We leverage diverse visual prompts, such as bounding boxes, skeletal keypoints, and instance identities, by unifying them as point prompts. A recognition decoder then cross-updates class and prompt tokens for enhanced performance. To ensure actor consistency over extended activity durations, we also introduce a relative instance attention mechanism that directly encodes instance identities. Comprehensive evaluations demonstrate that PromptGAR achieves competitive performance on both full and partial prompt inputs, establishing its effectiveness in input flexibility and its generalization ability for real-world applications.
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Zitian Tang ⋅ Rohan Krishnan ⋅ Zhiqiu Yu ⋅ Chen Sun
Learning from (procedural) videos has increasingly served as a pathway for embodied agents to acquire skills from human demonstrations. To do this, video understanding models must be able to obtain structured understanding, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize that understanding to novel environments, tasks, and problem domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) video question answering, over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to (1) generalize to novel domains and (2) utilize long temporal context and multimodal (e.g., visual and speech) information. Our extensive experimental analysis highlights the challenges of Spacewalk-18, but also suggests best practices for domain generalization and long-form understanding. Notably, we discover a promising adaptation-via-summarization technique that leads to significant performance improvement without model fine-tuning.
Broadcast2Pitch: Game State Reconstruction from Unconstrained Soccer Videos
Yin May Oo ⋅ Yewon Hwang ⋅ Muhammad Robbani ⋅ VANYI CHAO ⋅ Ankhzaya Jamsrandorj ⋅ Hoang Nguyen ⋅ Kyung-Ryoul Mun ⋅ Jinwook Kim
Game State Reconstruction (GSR) aims to reconstruct the 2D positions and identities of all athletes from broadcast soccer videos, requiring robust tracking, localization, and identity association under dynamic and unconstrained camera motions. We propose a modular GSR framework that integrates a multi-task keypoint and line detection model with an optimization-based homography estimation module. This approach leverages dense geometric cues from lines, circles, and keypoints to achieve robust spatial localization on a frame-by-frame basis, providing reliable alignment in diverse broadcast scenarios. To address identity consistency, we use appearance-based re-identification and a vision-language-guided tracklet refinement strategy to reduce ID switches and enforce temporal coherence. Comprehensive ablation studies validate the contribution of each component, and our framework achieves state-of-the-art performance on the SoccerNet-GSR benchmark, outperforming existing baselines by a significant margin. The proposed framework demonstrates strong robustness, generalization across scenes, and practical utility for structured game understanding in real-world broadcast sports analytics.
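The role of the homography in localization can be illustrated with a short sketch: once a 3x3 image-to-pitch homography is available for a frame, a pixel is mapped to 2D pitch coordinates by a projective transform. The helper names, the use of a bounding box's bottom-centre as the ground-contact point, and the example matrix are assumptions for illustration; the abstract does not state how athlete positions are projected.

```python
def project_to_pitch(H, pixel):
    """Map an image pixel (u, v) to pitch coordinates with a 3x3 homography H."""
    u, v = pixel
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to 2D.
    return (x / w, y / w)


def foot_point(box):
    """Bottom-centre of a box (x1, y1, x2, y2); an assumed ground-contact proxy."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, y2)


# Made-up homography that only scales pixels by 0.5 (illustrative values).
H_example = [[0.5, 0.0, 0.0],
             [0.0, 0.5, 0.0],
             [0.0, 0.0, 1.0]]
player_box = (100.0, 50.0, 140.0, 200.0)
pitch_xy = project_to_pitch(H_example, foot_point(player_box))
```

Because the homography is estimated per frame, the same projection would be applied with a different matrix for each frame as the broadcast camera moves.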