Oral Session
Oral Session 8B: Video Recognition and Understanding II
CAST: Evaluating Multi-Object Trackers with Context-Aware Switch and Transfer Scores
Jin Bai ⋅ Gregory Hager
Multi-object tracking (MOT) has been a subject of intensive research for decades. Multiple standard datasets and benchmarks have been set up, and several evaluation metrics, such as MOTA, IDF1 and HOTA, have been proposed. These metrics have become the de facto standard for comparing and ranking trackers on standardized datasets to measure progress. In this paper, we focus on MOTA and HOTA, and present a study of cases where these metrics' behaviors may not be desirable. In addition, we demonstrate how they might not be ideal when used as a tool to inspect a tracker's failure cases. We point out that these issues are related to the sizes of the context windows in which they measure association quality, where MOTA is too nearsighted while HOTA can be too holistic depending on the task settings. We rethink the familiar notion of identity switches (IDSw) proposed in MOTA, and propose a generalized version of it by introducing a context window when evaluating the ID assignment choice for each detection. We show that the proposed metric, named CAST, mitigates the limitations of HOTA and MOTA, and demonstrate its usefulness when diagnosing model failures. Our code and toolkit will be available for the community to advance both the development and application of MOT.
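For readers unfamiliar with the identity-switch notion that the abstract builds on, the following is a minimal sketch of how IDSw is conventionally counted for MOTA, where each match is compared only against the most recent match of the same ground-truth object. It is not the CAST metric, whose definition the abstract does not give. The function names and the per-frame dictionary layout are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def count_id_switches(frames: List[Dict[int, Optional[int]]]) -> int:
    """Count identity switches in the conventional MOTA sense.

    frames[t] maps each ground-truth object ID to the predicted track ID
    matched to it at frame t (None if the object is unmatched).
    This input layout is an assumption for illustration.
    """
    last_match: Dict[int, int] = {}  # most recent predicted ID per ground-truth ID
    switches = 0
    for matches in frames:
        for gt_id, pred_id in matches.items():
            if pred_id is None:
                continue  # a miss is a false negative, not a switch
            prev = last_match.get(gt_id)
            if prev is not None and prev != pred_id:
                switches += 1
            last_match[gt_id] = pred_id
    return switches


def mota(num_fn: int, num_fp: int, num_idsw: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + IDSw) / GT, summed over all frames."""
    return 1.0 - (num_fn + num_fp + num_idsw) / num_gt


# Object 1 briefly flips from track 10 to track 11 and back again.
frames = [
    {1: 10, 2: 20},
    {1: 10, 2: None},
    {1: 11, 2: 20},
    {1: 10, 2: 20},
]
idsw = count_id_switches(frames)
score = mota(num_fn=1, num_fp=0, num_idsw=idsw, num_gt=8)
```

In this toy sequence the one-frame flip is counted as two switches, because only the immediately preceding match is consulted. This is the kind of short context window the abstract calls nearsighted.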
Advancing Player Identification and Tracking with Global ID Fusion (GIF)
Karol Wojtulewicz ⋅ Minxing Liu ⋅ Niklas Carlsson
Rapid player motion, occlusions, substitutions, and jersey changes pose major challenges for identity-consistent tracking in sports. Existing multi-object tracking (MOT) methods struggle in long-term and multi-perspective settings like broadcast footage, where views change frequently. To address this, we first introduce MuPNIT, the first MOT and ReID benchmark to capture these multi-perspective dynamics along with long-term appearance variations across seasons, teams, and jersey changes. Second, we propose Global ID Fusion (GIF), a novel context-aware tracking-with-identification framework that enables robust tracking of both seen and unseen players. Unlike prior approaches, GIF performs single-pass global ID association and supports zero-shot identity recognition. Our approach achieves state-of-the-art results, improving HOTA by 25.3% and IDF1 by 79.5% over OC-SORT. Finally, to assess identity consistency, we introduce five Global ID metrics that reveal tradeoffs in tracking stability. By bridging MOT and ReID, our work advances identity-aware player tracking in sports and sets a new benchmark with applications in sports analytics, surveillance, and long-term person search.
Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs
SAINITHIN ARTHAM ⋅ Avijit Dasgupta ⋅ Shankar Gangisetty ⋅ Jawahar CV
Predicting a driver’s intent (e.g., turns, lane changes) is a critical capability for modern Advanced Driver Assistance Systems (ADAS). While recent Multimodal Large Language Models (MLLMs) show promise in general vision-language tasks, we find that zero-shot MLLMs still lag behind domain-specific approaches for Driver Intention Prediction (DIP). To address this, we introduce DriveXplain, a zero-shot framework based on MLLMs that leverages rich visual cues such as optical flow and road semantics to automatically generate both the intended maneuver (what) and rich natural language explanations (why). These maneuver-explanation pairs are then distilled into a compact MLLM, which jointly learns to predict intentions and corresponding explanations. We show that incorporating explanations during training leads to substantial gains over models trained solely on labels, as distilling explanations instills reasoning capabilities by enabling the model to understand not only what decisions to make but also why those decisions are made. Comprehensive experiments across structured (Brain4Cars, AIDE) and unstructured (DAAD) datasets demonstrate that our approach achieves state-of-the-art results on the DIP task, outperforming zero-shot and domain-specific baselines. We also present ablation studies to evaluate key design choices in our framework. This work sets a direction for more explainable and generalizable intention prediction in autonomous driving systems. We plan to release our codebase to support research.
LASER: Lip Landmark Assisted Speaker Detection for Robustness
Le Thien Phuc Nguyen ⋅ Zhuoran Yu ⋅ Yong Jae Lee
Active Speaker Detection (ASD) aims to identify who is speaking in complex visual scenes. While humans naturally rely on lip-audio synchronization, existing ASD models often misclassify non-speaking instances when lip movements and audio are unsynchronized. To address this, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER), which explicitly incorporates lip landmarks during training to guide the model’s attention to speech-relevant regions. Given a face track, LASER extracts visual features and encodes 2D lip landmarks into dense maps. To handle failure cases such as low resolution or occlusion, we introduce an auxiliary consistency loss that aligns lip-aware and face-only predictions, removing the need for landmark detectors at test time. LASER outperforms state-of-the-art models across both in-domain and out-of-domain benchmarks. To further evaluate robustness in realistic conditions, we introduce LASER-bench, a curated dataset of modern video clips with varying levels of background noise. On the high-noise subset, LASER improves mAP by 3.3 and 4.3 points over LoCoNet and TalkNet, respectively, demonstrating strong resilience to real-world acoustic challenges.
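The abstract does not specify the form of the auxiliary consistency loss. As a rough illustration of the general idea (penalizing disagreement between a lip-aware prediction and a face-only prediction for the same frames), the sketch below uses a mean Bernoulli KL divergence over per-frame speaking probabilities. The choice of KL, its direction, and all function names are assumptions, not the LASER implementation.

```python
import math
from typing import Sequence


def bernoulli_kl(p: float, q: float, eps: float = 1e-7) -> float:
    """KL(Bernoulli(p) || Bernoulli(q)), with clamping for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def consistency_loss(lip_aware: Sequence[float], face_only: Sequence[float]) -> float:
    """Mean per-frame divergence between the two branches' speaking probabilities.

    Hypothetical formulation: the lip-aware branch is treated as the reference
    distribution and the face-only branch is pulled toward it.
    """
    if len(lip_aware) != len(face_only):
        raise ValueError("both branches must score the same number of frames")
    if not lip_aware:
        return 0.0
    total = sum(bernoulli_kl(p, q) for p, q in zip(lip_aware, face_only))
    return total / len(lip_aware)


# Agreement yields zero loss; disagreement yields a positive loss.
agree = consistency_loss([0.9, 0.2], [0.9, 0.2])
disagree = consistency_loss([0.9, 0.2], [0.5, 0.5])
```

A loss of this kind would let the face-only branch be used alone at inference, which is consistent with the abstract's statement that landmark detectors are not needed at test time.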
VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
Ying Cheng ⋅ Yu-Ho Lin ⋅ Min-Hung Chen ⋅ Fu-En Yang ⋅ Shang-Hong Lai
Video anomaly understanding (VAU) aims to provide detailed interpretation and semantic comprehension of anomalous events within videos, addressing limitations of traditional methods that focus solely on detecting and localizing anomalies. However, existing approaches often neglect the deeper causal relationships and interactions between objects, which are critical for understanding anomalous behaviors. In this paper, we propose VADER, an LLM-driven framework for Video Anomaly unDErstanding, which integrates keyframe object Relation features with visual cues to enhance anomaly comprehension from video. Specifically, VADER first applies an Anomaly Scorer to assign per-frame anomaly scores, followed by a Context-AwarE Sampling (CAES) strategy to capture the causal context of each anomalous event. A Relation Feature Extractor and a COntrastive Relation Encoder (CORE) jointly model dynamic object interactions, producing compact relational representations for downstream reasoning. These visual and relational cues are integrated with LLMs to generate detailed, causally grounded descriptions and support robust anomaly-related question answering. Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks, advancing the frontier of explainable video anomaly analysis.
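The abstract describes per-frame anomaly scoring followed by sampling of the surrounding context, without giving the sampling rule. The sketch below shows one simple way such a step could look: take the highest-scoring frame and select evenly spaced frames before and after it. This is an illustrative stand-in, not the CAES strategy itself. The function name and the `window` and `stride` parameters are hypothetical.

```python
from typing import List


def sample_context_frames(scores: List[float], window: int = 2, stride: int = 1) -> List[int]:
    """Return frame indices around the peak anomaly score.

    Selects up to `window` frames on each side of the peak, spaced by `stride`,
    so that frames leading up to and following the anomaly are both kept.
    Hypothetical rule for illustration only.
    """
    if not scores:
        return []
    # Index of the highest per-frame anomaly score (first one on ties).
    peak = max(range(len(scores)), key=scores.__getitem__)
    offsets = range(-window * stride, window * stride + 1, stride)
    # Drop indices that fall outside the video.
    return [peak + o for o in offsets if 0 <= peak + o < len(scores)]


scores = [0.1, 0.1, 0.2, 0.9, 0.4, 0.1]
context = sample_context_frames(scores, window=2)
```

Here the peak is at frame 3, so the selected context covers frames 1 through 5.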