R-MMA: Enhancing Vision-Language Models with Recurrent Adapters for Few-Shot and Cross-Domain Generalization
Abstract
Despite the strong generalization capabilities of pre-trained vision-language models (VLMs) such as CLIP, adapting these models to few-shot tasks remains a fundamental challenge. This challenge, often termed the discrimination–generalization dilemma, requires learning task-specific knowledge while preserving the model's general pre-trained knowledge. Prompt learning offers a partial solution but often struggles to capture rich visual-textual interactions. Adapter-based methods such as MMA improve alignment by adding learnable modules, but their use of multiple independent adapters increases parameter overhead and can limit transferability due to naive fusion of adapted and frozen features. To address these limitations, we introduce the \textbf{R}ecurrent \textbf{M}ulti-\textbf{M}odal \textbf{A}dapter (R-MMA), a lightweight and efficient extension of MMA that improves both performance and generalization. R-MMA employs a recurrent adapter module whose weights are shared across multiple layers of the image and text encoders, substantially reducing the parameter count while maintaining high expressive capacity. In addition, R-MMA integrates an attention-based alignment mechanism that harmonizes the adapter outputs with the frozen encoder features before fusion, better preserving the pre-trained representations and enhancing cross-modal consistency. Extensive experiments on 15 datasets across diverse tasks, including few-shot learning, generalization to novel classes and domains, and cross-dataset transfer, demonstrate that R-MMA consistently surpasses state-of-the-art baselines, achieving strong performance with improved efficiency and a better balance between adaptation and generalization. R-MMA is also highly parameter-efficient, requiring only three trainable weight matrices for the entire network, regardless of encoder depth.
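To make the parameter-sharing and fusion ideas concrete, the following is a minimal, hypothetical PyTorch sketch of a recurrent bottleneck adapter with an attention-style alignment step before residual fusion with the frozen features. The module names, dimensions, and the exact form of the alignment are illustrative assumptions, not the R-MMA implementation; the sketch only shows how a single set of three weight matrices can be reused across layer positions so that the trainable parameter count is independent of network depth.
\begin{verbatim}
import torch
import torch.nn as nn

class RecurrentAdapter(nn.Module):
    """One bottleneck adapter whose weights are reused at every adapted layer."""
    def __init__(self, d_model: int = 512, d_bottleneck: int = 64):
        super().__init__()
        # Exactly three trainable weight matrices (sizes are assumptions):
        # down-projection, up-projection, and an alignment projection.
        self.down = nn.Linear(d_model, d_bottleneck, bias=False)
        self.up = nn.Linear(d_bottleneck, d_model, bias=False)
        self.align = nn.Linear(d_model, d_model, bias=False)

    def forward(self, frozen_feat: torch.Tensor) -> torch.Tensor:
        # Bottleneck transform of the frozen encoder tokens (B, N, D).
        adapted = self.up(torch.relu(self.down(frozen_feat)))
        # Assumed attention-based alignment: frozen tokens attend to the
        # adapted tokens, and the aligned output is fused residually.
        attn = torch.softmax(
            frozen_feat @ self.align(adapted).transpose(-2, -1)
            / frozen_feat.size(-1) ** 0.5,
            dim=-1,
        )
        return frozen_feat + attn @ adapted

# The same adapter instance is reused at several frozen encoder layers,
# so the number of trainable weights does not grow with depth.
adapter = RecurrentAdapter()
tokens = torch.randn(2, 197, 512)   # e.g. a batch of encoder tokens (assumed width 512)
for _ in range(3):                  # applied recurrently at three layer positions
    tokens = adapter(tokens)
print(tokens.shape)                 # torch.Size([2, 197, 512])
\end{verbatim}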