GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection
Abstract
Active Speaker Detection (ASD) seeks to determine who is speaking at each moment by modeling the complex interplay between the audio and visual modalities. Most state-of-the-art approaches rely on late fusion, combining features from the two modalities only at high semantic levels, and thus often miss the fine-grained cross-modal interactions at lower layers that are critical for robust performance in unconstrained scenarios. In this work, we introduce \textbf{GateFusion}, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: a Masked Alignment Loss (MAL) that aligns unimodal outputs with multimodal predictions, and an Over-Positive Penalty (OPP) that suppresses spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8\% mAP (+9.4\%), 86.1\% mAP (+2.9\%), and 96.1\% mAP (+0.5\%) on Ego4D, UniTalk, and WASD, respectively, and delivers competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments further demonstrate the generalization capability of our model, while comprehensive ablations validate the complementary benefits of each proposed component.
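To make the gating idea concrete, the sketch below illustrates one plausible form of a bimodally-conditioned gate that injects context from one modality into the other via a gated residual connection; this is an illustrative assumption based on the abstract, not the authors' implementation, and the names (\texttt{BimodalGate}, the single-linear gate) are hypothetical.

\begin{verbatim}
# Minimal sketch (assumptions, not the paper's code) of a bimodally-conditioned
# gate that injects source-modality context into a target modality.
import torch
import torch.nn as nn


class BimodalGate(nn.Module):
    """Gated residual injection of a source modality into a target modality,
    with the gate conditioned on both modalities."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)       # project source features
        self.gate = nn.Linear(2 * dim, dim)   # gate conditioned on both streams

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # Sigmoid gate in [0, 1], computed from the concatenated modalities.
        g = torch.sigmoid(self.gate(torch.cat([target, source], dim=-1)))
        return target + g * self.proj(source)


# Usage: one gate per decoder layer would give progressive, multi-depth fusion.
if __name__ == "__main__":
    gate = BimodalGate(dim=256)
    visual = torch.randn(2, 32, 256)   # (batch, frames, dim)
    audio = torch.randn(2, 32, 256)    # temporally aligned audio features
    fused = gate(visual, audio)
    print(fused.shape)                 # torch.Size([2, 32, 256])
\end{verbatim}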