AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization
Abstract
With the rapid advancement of sophisticated synthetic audio-visual content, including subtle malicious manipulations, ensuring the perceptual integrity of digital media has become paramount. This work presents a novel approach to the temporal localization of deepfakes by leveraging Audio-Visual Speech Representation Reconstruction (AuViRe). Specifically, our approach reconstructs the speech representations of one modality (e.g., visual lip movements) based on the other (e.g., the audio waveform). This cross-modal reconstruction becomes significantly more challenging in manipulated regions, leading to amplified discrepancies that provide robust discriminative cues for precise forgery localization. AuViRe outperforms the state of the art by +8.9 AP@0.95 on LAV-DF, +9.6 AP@0.5 on AV-Deepfake1M, and +5.1 AUC in an in-the-wild experiment. Code will be made publicly available upon acceptance.
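To make the core idea concrete, below is a minimal PyTorch sketch of frame-level cross-modal discrepancy scoring as the abstract describes it. This is an illustrative assumption of the pipeline, not the paper's actual model: the architecture, feature dimensions, and all function names (`CrossModalReconstructor`, `frame_discrepancy`, `localize`) are hypothetical. A network predicts per-frame audio speech representations from visual features; the per-frame reconstruction error then serves as a manipulation score that is thresholded into forged segments.

```python
import torch
import torch.nn as nn

class CrossModalReconstructor(nn.Module):
    """Illustrative sketch: predicts per-frame audio speech
    representations from visual (lip-region) features.
    Architecture and dimensions are hypothetical."""
    def __init__(self, visual_dim=512, audio_dim=256, hidden_dim=384):
        super().__init__()
        self.encoder = nn.GRU(visual_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, audio_dim)

    def forward(self, visual_feats):
        # visual_feats: (batch, time, visual_dim)
        h, _ = self.encoder(visual_feats)
        return self.head(h)  # (batch, time, audio_dim)

def frame_discrepancy(model, visual_feats, audio_feats):
    """Per-frame reconstruction error; larger values indicate regions
    where cross-modal prediction fails, i.e., likely manipulations."""
    with torch.no_grad():
        recon = model(visual_feats)
    return torch.linalg.norm(recon - audio_feats, dim=-1)  # (batch, time)

def localize(scores, threshold):
    """Threshold a 1-D frame-score tensor into contiguous
    (start, end) frame-index segments flagged as fake."""
    mask = scores > threshold
    segments, start = [], None
    for t, fake in enumerate(mask.tolist()):
        if fake and start is None:
            start = t
        elif not fake and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(mask)))
    return segments

if __name__ == "__main__":
    model = CrossModalReconstructor()
    v = torch.randn(1, 100, 512)  # 100 frames of visual features
    a = torch.randn(1, 100, 256)  # aligned audio speech representations
    scores = frame_discrepancy(model, v, a)[0]
    print(localize(scores, threshold=(scores.mean() + scores.std()).item()))
```

In this sketch the threshold is a simple mean-plus-std heuristic over the untrained model's scores, purely for demonstration; in practice the reconstructor would be trained on pristine audio-visual pairs so that discrepancies concentrate in manipulated regions.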