Track: Oral Session 8B: Video Recognition and Understanding II

Tue 10 March 13:30 - 13:42 PDT

CAST: Evaluating Multi-Object Trackers with Context-Aware Switch and Transfer Scores

Jin Bai ⋅ Gregory Hager

Multi-object tracking (MOT) has been a subject of intensive research for decades. Multiple standard datasets and benchmarks have been set up, and several evaluation metrics, such as MOTA, IDF1 and HOTA, have been proposed. These metrics have become the de facto standard for comparing and ranking trackers on standardized datasets to measure progress. In this paper, we focus on MOTA and HOTA, and present a study of cases where these metrics' behaviors may not be desirable. In addition, we demonstrate how they might not be ideal when used as a tool to inspect a tracker's failure cases. We point out that these issues are related to the sizes of the context windows in which they measure association quality, where MOTA is too nearsighted while HOTA can be too holistic depending on the task settings. We rethink the familiar notion of identity switches (IDSw) proposed in MOTA, and propose a generalized version of it by introducing a context window when evaluating the ID assignment choice for each detection. We show that the proposed metric, named CAST, mitigates the limitations of HOTA and MOTA, and demonstrate its usefulness when diagnosing model failures. Our code and toolkit will be available for the community to advance both the development and application of MOT.
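For readers unfamiliar with the identity switch count that the abstract builds on, the following is a minimal sketch of the classic IDSw idea for a single ground-truth track. It is not the CAST metric, whose definition the abstract does not give. The function name and input layout (one predicted ID per frame, `None` when the object is unmatched) are assumptions made for illustration, and the sketch follows the common convention of comparing each match against the last matched ID.

```python
from typing import Hashable, Optional, Sequence


def count_id_switches(assigned_ids: Sequence[Optional[Hashable]]) -> int:
    """Count identity switches along one ground-truth track.

    assigned_ids[t] is the predicted track ID matched to the ground-truth
    object in frame t, or None if the object was unmatched in that frame.
    (Hypothetical input layout, chosen for illustration only.)
    """
    switches = 0
    last_id = None  # most recent matched ID; unmatched frames do not reset it
    for pred_id in assigned_ids:
        if pred_id is None:
            continue  # a miss is not a switch by itself
        if last_id is not None and pred_id != last_id:
            switches += 1  # the ID differs from the previous match
        last_id = pred_id
    return switches
```

Because each decision looks only at the previous match, a one-frame slip such as `[1, 1, 2, 1, 1]` is counted as two switches, while a permanent change such as `[1, 1, 2, 2, 2]` is counted as one. This is one way to read the "nearsighted" behavior the abstract attributes to MOTA.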

Tue 10 March 13:42 - 13:54 PDT

Advancing Player Identification and Tracking with Global ID Fusion (GIF)

Karol Wojtulewicz ⋅ Minxing Liu ⋅ Niklas Carlsson

Rapid player motion, occlusions, substitutions, and jersey changes pose major challenges for identity-consistent tracking in sports. Existing multi-object tracking (MOT) methods struggle in long-term and multi-perspective settings like broadcast footage, where views change frequently. To address this, we first introduce MuPNIT, the first MOT and ReID benchmark to capture these multi-perspective dynamics along with long-term appearance variations across seasons, teams, and jersey changes. Second, we propose Global ID Fusion (GIF), a novel context-aware tracking-with-identification framework that enables robust tracking of both seen and unseen players. Unlike prior approaches, GIF performs single-pass global ID association and supports zero-shot identity recognition. Our approach achieves state-of-the-art results, improving HOTA by 25.3% and IDF1 by 79.5% over OC-SORT. Finally, to assess identity consistency, we introduce five Global ID metrics that reveal tradeoffs in tracking stability. By bridging MOT and ReID, our work advances identity-aware player tracking in sports and sets a new benchmark with applications in sports analytics, surveillance, and long-term person search.

Tue 10 March 13:54 - 14:06 PDT

Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs

SAINITHIN ARTHAM ⋅ Avijit Dasgupta ⋅ Shankar Gangisetty ⋅ Jawahar CV

Predicting a driver’s intent (e.g., turns, lane changes) is a critical capability for modern Advanced Driver Assistance Systems (ADAS). While recent Multimodal Large Language Models (MLLMs) show promise in general vision-language tasks, we find that zero-shot MLLMs still lag behind domain-specific approaches for Driver Intention Prediction (DIP). To address this, we introduce DriveXplain, a zero-shot framework based on MLLMs that leverages rich visual cues such as optical flow and road semantics to automatically generate both the intention maneuver (what) and rich natural language explanations (why). These maneuver-explanation pairs are then distilled into a compact MLLM, which jointly learns to predict intentions and corresponding explanations. We show that incorporating explanations during training leads to substantial gains over models trained solely on labels, as distilling explanations instills reasoning capabilities by enabling the model to understand not only what decisions to make but also why those decisions are made. Comprehensive experiments across structured (Brain4Cars, AIDE) and unstructured (DAAD) datasets demonstrate that our approach achieves state-of-the-art results on the DIP task, outperforming zero-shot and domain-specific baselines. We also present ablation studies to evaluate key design choices in our framework. This work sets a direction for more explainable and generalizable intention prediction in autonomous driving systems. We plan to release our codebase to support research.

Tue 10 March 14:06 - 14:18 PDT

LASER: Lip Landmark Assisted Speaker Detection for Robustness

Le Thien Phuc Nguyen ⋅ Zhuoran Yu ⋅ Yong Jae Lee

Active Speaker Detection (ASD) aims to identify who is speaking in complex visual scenes. While humans naturally rely on lip-audio synchronization, existing ASD models often misclassify non-speaking instances when lip movements and audio are unsynchronized. To address this, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER), which explicitly incorporates lip landmarks during training to guide the model’s attention to speech-relevant regions. Given a face track, LASER extracts visual features and encodes 2D lip landmarks into dense maps. To handle failure cases such as low resolution or occlusion, we introduce an auxiliary consistency loss that aligns lip-aware and face-only predictions, removing the need for landmark detectors at test time. LASER outperforms state-of-the-art models across both in-domain and out-of-domain benchmarks. To further evaluate robustness in realistic conditions, we introduce LASER-bench, a curated dataset of modern video clips with varying levels of background noise. On the high-noise subset, LASER improves mAP by 3.3 and 4.3 points over LoCoNet and TalkNet, respectively, demonstrating strong resilience to real-world acoustic challenges.
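The auxiliary consistency loss mentioned above can be pictured with a small sketch. The abstract does not state the form of the loss, so the KL divergence between the two predicted distributions used here is an assumption, and the function names and logit inputs are hypothetical rather than taken from LASER.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def consistency_loss(lip_aware_logits: Sequence[float],
                     face_only_logits: Sequence[float]) -> float:
    """KL(p_lip || p_face) between two predicted class distributions.

    Assumed form only: the loss is zero when both branches agree and
    grows as the face-only prediction drifts from the lip-aware one.
    """
    log_p = log_softmax(lip_aware_logits)
    log_q = log_softmax(face_only_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

Under this assumed form, minimizing the loss during training pulls the face-only branch toward the lip-aware branch, which is consistent with the abstract's claim that no landmark detector is needed at test time.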

Tue 10 March 14:18 - 14:30 PDT

VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models

Ying Cheng ⋅ Yu-Ho Lin ⋅ Min-Hung Chen ⋅ Fu-En Yang ⋅ Shang-Hong Lai

Video anomaly understanding (VAU) aims to provide detailed interpretation and semantic comprehension of anomalous events within videos, addressing limitations of traditional methods that focus solely on detecting and localizing anomalies. However, existing approaches often neglect the deeper causal relationships and interactions between objects, which are critical for understanding anomalous behaviors. In this paper, we propose VADER, an LLM-driven framework for Video Anomaly unDErstanding, which integrates keyframe object Relation features with visual cues to enhance anomaly comprehension from video. Specifically, VADER first applies an Anomaly Scorer to assign per-frame anomaly scores, followed by a Context-AwarE Sampling (CAES) strategy to capture the causal context of each anomalous event. A Relation Feature Extractor and a COntrastive Relation Encoder (CORE) jointly model dynamic object interactions, producing compact relational representations for downstream reasoning. These visual and relational cues are integrated with LLMs to generate detailed, causally grounded descriptions and support robust anomaly-related question answering. Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks, advancing the frontier of explainable video anomaly analysis.