Unified Video Anomaly Detection Model for Detecting Different Anomaly Types
Abstract
Video anomaly detection (VAD) is a crucial task for public safety and for reducing the human workload of surveillance. Due to the rarity of abnormal events and the high cost of data collection, one-class classification (OCC) methods are extensively used. OCC methods are divided into object-centric and frame-centric approaches, each with its own limitations. Object-centric methods fail to detect nonobject anomalies because they focus solely on objects, whereas frame-centric methods struggle to identify anomalies because the background dominates the foreground in video frames. To this end, we define three types of abnormal events, namely, human, appearance, and nonobject anomalies, and propose a unified VAD (UniVAD) model that effectively detects each defined anomaly type. UniVAD comprises three streams, namely, skeleton, local-visual, and global-visual, and each stream focuses on a specific type of anomaly. In addition, each stream uses an autoencoder; thus, we introduce the feature future-past prediction task, which predicts past and future features from the present feature to suppress the strong generalization capacity of autoencoders. We validate the proposed model on three public benchmarks, ShanghaiTech, UBnormal, and NWPU Campus, and demonstrate that it achieves state-of-the-art performance by a significant margin.
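The feature future-past prediction idea can be illustrated with a minimal sketch: given the feature of the present frame, a model predicts the features of the past and future frames, and the prediction error serves as the anomaly score. The linear weights, dimensions, and function names below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions, not taken from the paper).
feat_dim = 16   # per-frame feature dimension
hidden = 8      # bottleneck dimension of the autoencoder

# Toy linear weights standing in for a trained encoder/decoders.
W_enc = rng.standard_normal((hidden, feat_dim)) * 0.1
W_past = rng.standard_normal((feat_dim, hidden)) * 0.1
W_future = rng.standard_normal((feat_dim, hidden)) * 0.1

def predict_past_future(present):
    """Encode the present feature, then decode estimates of the
    past and future features from the shared bottleneck."""
    z = np.tanh(W_enc @ present)
    return W_past @ z, W_future @ z

def anomaly_score(past, present, future):
    """Mean squared prediction error on the past and future features;
    larger errors indicate frames the model generalizes to poorly."""
    p_hat, f_hat = predict_past_future(present)
    return np.mean((p_hat - past) ** 2) + np.mean((f_hat - future) ** 2)

# Toy triplet of consecutive frame features.
past, present, future = (rng.standard_normal(feat_dim) for _ in range(3))
score = anomaly_score(past, present, future)
```

Because the network must extrapolate in both temporal directions rather than merely reconstruct its input, it cannot trivially copy the present feature, which is the stated motivation for suppressing the autoencoder's over-generalization.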