SVD-Det: A Lightweight Framework for Video Forgery Detection Using Semantic and Visual Defect Cues
Abstract
With the rapid proliferation of AI-generated content (AIGC) on multimedia platforms, efficient and reliable video forgery detection has become increasingly important. Existing approaches often rely on either visual artifacts or semantic inconsistencies, but suffer from high computational costs, limiting their deployment at scale. In this work, we propose SVD-Det, a lightweight and efficient pipeline that leverages both semantic and Visual Defect cues to detect forged videos. SVD-Det fuses spatiotemporal representations from raw RGB frames and compression-induced distortions using a 3D-Swin Transformer, and augments semantic understanding via CLIP-based embeddings. To integrate these heterogeneous modalities, we introduce Domain-Query Attention (DoQA), a novel attention mechanism that hierarchically aggregates spatial and temporal features. Experiments across seven video generation domains demonstrate that SVD-Det not only achieves state-of-the-art detection performance but also reduces model size and inference time by over 97% and 98%, respectively, compared to LMM-based baselines. Our results highlight the practicality and robustness of SVD-Det for scalable AIGC detection in real-world scenarios.