Robust Multimodal Emotion Recognition from Incomplete Modalities via Query-Based Unimodal and Cross-Modal Learning
Abstract
Multimodal emotion recognition (MER) aims to identify human emotions from inputs such as text, vision, and audio. However, existing methods often assume complete modality availability during training and inference, which is unrealistic in real-world scenarios due to sensor failures or privacy constraints. We propose Dual-Query Fusion (DQF), a framework that enables robust MER using only incomplete modality inputs, without relying on reconstruction or knowledge distillation. DQF introduces two types of learnable queries: Q-UA for extracting informative unimodal features, and Q-CA for adaptive cross-modal integration. These modules are designed to operate effectively even when some modalities are missing. Experiments on two public datasets demonstrate that DQF achieves superior performance and robustness compared to existing methods, even when trained exclusively on incomplete inputs. These results highlight the effectiveness and practicality of DQF for real-world MER tasks.
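To make the query-based idea concrete, the following is a minimal PyTorch-style sketch of learnable unimodal queries (cf. Q-UA) and cross-modal queries (cf. Q-CA) attending over whichever modalities happen to be present. It is not the authors' implementation: the class name, dimensions, and the choice to handle missing modalities by simply omitting them from the input dictionary are all assumptions, since the abstract gives no implementation details.

```python
import torch
import torch.nn as nn

class QueryBasedFusion(nn.Module):
    """Illustrative sketch only (assumed design, not the paper's DQF code):
    learnable queries attend to the available modality features; a missing
    modality is simply left out of the input dict."""
    def __init__(self, dim=256, n_queries=8, n_heads=4):
        super().__init__()
        # One learnable unimodal query set per modality (cf. Q-UA)
        self.q_ua = nn.ParameterDict({
            m: nn.Parameter(torch.randn(n_queries, dim))
            for m in ("text", "vision", "audio")
        })
        # Shared learnable cross-modal queries (cf. Q-CA)
        self.q_ca = nn.Parameter(torch.randn(n_queries, dim))
        self.uni_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, feats):
        # feats: dict mapping modality name -> (B, T_m, dim) tensor;
        # absent modalities are omitted from the dict.
        uni_outputs = []
        for m, x in feats.items():
            q = self.q_ua[m].unsqueeze(0).expand(x.size(0), -1, -1)
            out, _ = self.uni_attn(q, x, x)       # queries pull unimodal cues
            uni_outputs.append(out)
        # Fuse whatever unimodal summaries exist via the cross-modal queries
        memory = torch.cat(uni_outputs, dim=1)    # (B, n_avail * n_queries, dim)
        q = self.q_ca.unsqueeze(0).expand(memory.size(0), -1, -1)
        fused, _ = self.cross_attn(q, memory, memory)
        return fused.mean(dim=1)                  # (B, dim) emotion representation
```

Because the fused representation is built from however many modalities are supplied, the same forward pass runs unchanged whether one, two, or all three modalities are present, which mirrors the robustness-to-incomplete-inputs property the abstract claims for DQF.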