Bridging the Domain Gap in Small Multimodal Models: A Dual-level Alignment Perspective
Abstract
Small Multimodal Models (SMMs) fine-tuned with Low-Rank Adaptation (LoRA) perform well on vision–language tasks, yet LoRA-tuned models remain vulnerable to distribution shift. Unsupervised Domain Adaptation (UDA) is a common remedy, but existing theory and methods target single- or dual-encoder architectures and overlook the encoder–decoder structure of SMMs, whose fusion mechanism introduces an additional source of shift. We bridge this gap in two steps. First, we derive a dual-divergence risk bound that separates encoder divergence from fusion divergence, and we demonstrate its tightness relative to the classical encoder-only bound with a negation-flip example. Second, motivated by this theory, we propose Dual-level Adversarial Alignment (DuAA), a two-stage alignment algorithm that inserts domain-discriminative adapters after the encoder and within the decoder to minimize both divergences, and further employs selective pseudo-labeling to refine target semantics. From existing datasets we compile twelve new cross-domain VQA tasks with distinct visual and textual shifts, and observe that DuAA consistently outperforms standard fine-tuning on all of them.
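The dual-divergence bound mentioned above can be sketched in the spirit of classical Ben-David-style UDA bounds; the symbols below are illustrative placeholders, not the paper's exact statement:

```latex
% Illustrative form of a dual-divergence risk bound (assumed notation):
% \epsilon_S, \epsilon_T are source/target risks of hypothesis h,
% d_{\mathrm{enc}} measures divergence between encoder feature
% distributions, d_{\mathrm{fus}} between fused (encoder--decoder)
% representations, and \lambda is the joint optimal risk.
\[
  \epsilon_T(h) \;\le\; \epsilon_S(h)
  \;+\; d_{\mathrm{enc}}(\mathcal{D}_S, \mathcal{D}_T)
  \;+\; d_{\mathrm{fus}}(\mathcal{D}_S, \mathcal{D}_T)
  \;+\; \lambda .
\]
```

Under this reading, an encoder-only bound collapses the two divergence terms into one, so it can be loose when the fusion mechanism itself shifts across domains even though encoder features align.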