Adaptive Residual Graph Attention for Contrastive Multimodal Representation Learning
Abstract
Multimodal representation learning is becoming increasingly important as diverse multimodal data become available across many domains. In particular, the ability to adapt to arbitrary numbers and types of modalities is key to building flexible fusion models. We propose CLARGA, a general-purpose multimodal representation learning architecture that builds a learned attention-weighted graph over modality features and fuses them with Graph Attention Networks. CLARGA is trained end-to-end with a combined supervised and contrastive objective, which aligns modalities in a shared embedding space while preserving each modality's distinctive information. We demonstrate CLARGA's effectiveness on diverse multimodal representation learning tasks across seven datasets spanning finance, human-computer interaction, general multimedia classification, and complex affective computing. It consistently outperforms strong baselines, ablated variants, and recent state-of-the-art approaches. In particular, CLARGA achieves the highest known performance on the DAIC-WoZ dataset for multimodal depression identification. Our results show that CLARGA is an accurate and robust general-purpose fusion framework suitable for a wide range of complex multimodal learning tasks.
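To make the recipe summarized above concrete, the sketch below shows one way the core ideas could be wired together in PyTorch: per-modality projections into a shared space, a residual graph-attention update over a fully connected graph of modality nodes, and a combined supervised plus contrastive objective. This is a minimal illustrative sketch under our own assumptions; the names (ModalityGATFusion, contrastive_loss), layer sizes, and the specific InfoNCE-style contrastive term are hypothetical and are not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityGATFusion(nn.Module):
    """Illustrative GAT-style fusion over a fully connected graph of modality nodes.

    Each sample contributes M modality feature vectors; attention weights over all
    modality pairs are learned, and a residual connection preserves each modality's
    original embedding. Details are assumptions, not the paper's exact architecture.
    """

    def __init__(self, in_dims, hidden_dim=128, num_classes=2):
        super().__init__()
        # Per-modality projections into a shared embedding space.
        self.proj = nn.ModuleList([nn.Linear(d, hidden_dim) for d in in_dims])
        self.attn_src = nn.Linear(hidden_dim, 1, bias=False)  # GAT-style scoring terms
        self.attn_dst = nn.Linear(hidden_dim, 1, bias=False)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, modality_feats):
        # modality_feats: list of M tensors, each of shape (batch, in_dims[m]).
        h = torch.stack([p(x) for p, x in zip(self.proj, modality_feats)], dim=1)  # (B, M, H)
        # Pairwise attention logits e_ij = LeakyReLU(a_src^T h_i + a_dst^T h_j).
        e = F.leaky_relu(self.attn_src(h) + self.attn_dst(h).transpose(1, 2))      # (B, M, M)
        alpha = torch.softmax(e, dim=-1)        # attention over modality neighbours
        fused_nodes = h + alpha @ h             # residual graph-attention update    (B, M, H)
        z = fused_nodes.mean(dim=1)             # pool modality nodes -> joint rep   (B, H)
        return z, h, self.out(z)


def contrastive_loss(h, temperature=0.1):
    """InfoNCE-style term aligning modality embeddings of the same sample."""
    B, M, H = h.shape
    z = F.normalize(h, dim=-1).reshape(B * M, H)
    sim = z @ z.t() / temperature
    eye = torch.eye(B * M, dtype=torch.bool, device=h.device)
    sim = sim.masked_fill(eye, float('-inf'))           # exclude self-similarity
    labels = torch.arange(B, device=h.device).repeat_interleave(M)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye  # other modalities, same sample
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -log_prob[pos_mask].mean()


# Toy usage: two modalities (e.g. audio and text features) for a batch of 8 samples.
model = ModalityGATFusion(in_dims=[40, 300], hidden_dim=64, num_classes=2)
audio, text = torch.randn(8, 40), torch.randn(8, 300)
y = torch.randint(0, 2, (8,))
z, h, logits = model([audio, text])
loss = F.cross_entropy(logits, y) + 0.5 * contrastive_loss(h)  # combined objective
loss.backward()
```

Because the graph is built over modality nodes rather than a fixed input layout, adding or dropping a modality in this sketch only changes the length of the input list, which mirrors the flexibility the abstract claims.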