Alignment and Distillation: A Robust Framework for Multimodal Domain Generalizable Human Action Recognition
Abstract
Human Action Recognition (HAR) in real-world scenarios is significantly challenged by unseen domain shifts, such as variations in camera viewpoint, illumination, or background. Although recent advances in video domain generalization have shown promise for HAR by introducing models that are robust to such shifts, existing methods often fall short: they typically depend on a single modality or employ static frame-level fusion, which inherently limits the capture of multi-scale temporal dependencies and the alignment of the asynchronous modalities frequently present in video data. To address these limitations, we propose Multimodal Alignment and Distillation for Domain Generalization (MAD-DG), a novel framework that explicitly models multi-scale temporal relationships and aligns asynchronous modalities through a segment-wise temporal binding-window contrastive alignment mechanism. Furthermore, we integrate online self-distillation to extract robust, domain-invariant representations. Extensive experiments on widely recognized benchmarks demonstrate that MAD-DG achieves state-of-the-art performance and superior generalization in both single-source and multi-source domain generalization settings.
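To make the two components concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it pairs an InfoNCE-style contrastive loss over a segment-wise temporal binding window (segments of the second modality within a small temporal neighbourhood are treated as positives) with an online self-distillation term against an EMA teacher. The window size, temperature, feature dimensions, EMA momentum, and helper names are illustrative assumptions; the abstract does not specify these details.

```python
# Minimal sketch (not the authors' code): segment-wise binding-window contrastive
# alignment between two modality streams plus an online self-distillation term.
# Window size, temperature, dimensions, and the EMA teacher are assumptions.
import copy
import torch
import torch.nn.functional as F


def binding_window_contrastive_loss(feats_a, feats_b, window=2, temperature=0.07):
    """InfoNCE-style alignment between per-segment features of two modalities.

    feats_a, feats_b: (B, T, D) segment-level features from two modalities.
    Segments of modality B within +/- `window` of the same temporal index in the
    same video are treated as positives; all other segments act as negatives.
    """
    B, T, D = feats_a.shape
    a = F.normalize(feats_a, dim=-1).reshape(B * T, D)
    b = F.normalize(feats_b, dim=-1).reshape(B * T, D)
    logits = a @ b.t() / temperature                      # (B*T, B*T) similarities

    # Positive mask: same video and |t_i - t_j| <= window (temporal binding window).
    idx = torch.arange(B * T)
    vid, t = idx // T, idx % T
    pos = (vid[:, None] == vid[None, :]) & ((t[:, None] - t[None, :]).abs() <= window)

    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(log_prob[pos].sum() / pos.sum())


def self_distillation_loss(student_logits, teacher_logits, tau=2.0):
    """KL divergence between softened student and EMA-teacher predictions."""
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau


@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Online teacher update: exponential moving average of student weights."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)


if __name__ == "__main__":
    # Toy usage with random segment features and a linear classifier head
    # (shapes and modalities, e.g. RGB and optical flow, are illustrative).
    rgb, flow = torch.randn(4, 8, 256), torch.randn(4, 8, 256)
    student = torch.nn.Linear(256, 10)
    teacher = copy.deepcopy(student)

    align = binding_window_contrastive_loss(rgb, flow)
    with torch.no_grad():
        teacher_logits = teacher(rgb.mean(1))
    distill = self_distillation_loss(student(rgb.mean(1)), teacher_logits)
    total = align + distill
    total.backward()
    ema_update(teacher, student)
    print(f"alignment={align.item():.3f}, distillation={distill.item():.3f}")
```

In this sketch the two losses are simply summed; how the alignment and distillation terms are weighted, and how the teacher is constructed, would follow the framework's actual design.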