Training-free Detection of Text-to-video Generations via Over-coherence
Abstract
Text-to-video generative models have emerged as powerful tools in content creation, capable of synthesizing highly realistic videos from textual prompts. However, the rapid advancement of these models introduces significant security and trust concerns, as traditional detection methods struggle to generalize to unseen generative techniques. Existing approaches commonly rely on supervised learning, requiring continuous dataset curation and retraining, which is impractical given the fast-paced evolution of generative models. In this work, we introduce the first training-free detection method for AI-generated videos, eliminating the need for labeled training data or prior exposure to specific generation techniques. Our approach exploits a fundamental weakness of text-to-video models: unnatural temporal over-coherence in frame transitions. By leveraging a novel time-coherence detection criterion, our method identifies artifacts in video embeddings that are absent from real videos. We extensively evaluate our approach, demonstrating that it significantly outperforms existing baselines while remaining robust to unseen generative models. This work establishes a new direction for training-free detection of text-to-video generated content, providing a scalable and time-resilient solution.
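To make the over-coherence idea concrete, the sketch below shows one minimal way such a criterion could be instantiated: embed each frame, measure the similarity of consecutive embeddings, and flag videos whose transitions are unnaturally smooth. The embedding model, the mean cosine-similarity statistic, and the threshold value are illustrative assumptions, not the paper's exact criterion.

```python
# Minimal sketch of an over-coherence check. Assumes per-frame embeddings
# (e.g., from a CLIP-style encoder) are already computed; the statistic and
# threshold here are placeholders for the paper's time-coherence criterion.
import numpy as np


def temporal_coherence(frame_embeddings: np.ndarray) -> float:
    """Mean cosine similarity between consecutive frame embeddings."""
    e = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    sims = np.sum(e[:-1] * e[1:], axis=1)  # similarity of frame t to frame t+1
    return float(sims.mean())


def flag_as_generated(frame_embeddings: np.ndarray, threshold: float = 0.98) -> bool:
    """Flag a video as AI-generated when its transitions are overly smooth."""
    return temporal_coherence(frame_embeddings) > threshold
```

The intuition is that real footage carries natural frame-to-frame variation (camera shake, motion blur, sensor noise) that lowers consecutive-frame similarity, whereas text-to-video outputs tend to be overly smooth; a decision threshold could be calibrated on real videos alone, so no generated training data would be required.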