Fused Similarity Measure Based Alignment with Dual-Scale Adaptive Selection for Weakly Supervised Video Anomaly Detection
Yuegao Lu · Hong-Jie Xing · Chun-Guo Li
Abstract
In most multi-modal weakly supervised video anomaly detection (WSVAD) methods, cross-modal alignment relies on the cosine similarity measure, which captures only directional consistency and neglects feature magnitude information. Moreover, existing multi-instance learning frameworks usually adopt the Top-$k$ selection strategy, which struggles to adapt to the varying anomaly proportions across videos, resulting in missed anomalous segments and the introduction of label noise. To address these problems, an alignment and selection enhanced WSVAD (ASE-WSVAD) method is proposed. ASE-WSVAD combines a cross-modal alignment method based on the fused similarity measure (CA-FSM) with dual-scale adaptive selection (DSAS) to improve semantic consistency and detection performance. Specifically, visual-textual alignment is implemented by fusing complementary similarity measures (i.e., the cosine and Euclidean distance similarity measures), so that the alignment objective jointly leverages both direction and magnitude. DSAS combines local instance-level selection (LILS) with global batch-aware selection (GBAS) to efficiently detect anomalous segments and handle videos with varying proportions of anomalies. Experimental results demonstrate that ASE-WSVAD achieves state-of-the-art performance, with an AUC of 89.00\% on UCF-Crime and an AP of 85.44\% on XD-Violence.
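The fused similarity measure described above can be illustrated with a minimal sketch. The exact fusion rule and weighting are not given in the abstract, so the convex combination with weight `lam` below is an assumption, as is the `1 / (1 + distance)` mapping used to turn Euclidean distance into a bounded similarity.

```python
import numpy as np

def fused_similarity(v, t, lam=0.5):
    """Hypothetical fusion of cosine and Euclidean-distance similarity.

    v, t : 1-D feature vectors (e.g., visual and textual embeddings).
    lam  : assumed fusion weight in [0, 1]; not specified in the paper.
    """
    # Cosine similarity: captures directional consistency only.
    cos_sim = np.dot(v, t) / (np.linalg.norm(v) * np.linalg.norm(t) + 1e-8)
    # Euclidean-distance similarity: sensitive to feature magnitude;
    # mapped into (0, 1] via 1 / (1 + distance).
    euc_sim = 1.0 / (1.0 + np.linalg.norm(v - t))
    return lam * cos_sim + (1.0 - lam) * euc_sim
```

Note how the two terms complement each other: for `v = [1, 0]` and `t = [2, 0]` the cosine term is 1 (same direction), but the Euclidean term penalizes the magnitude gap, so the fused score drops below 1.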