Track: Oral Session 7B: Vision+Language and Other Modalities II

Tue 10 March 9:45 - 9:57 PDT

DREAM: Dynamic Prompts and GuidedMix for Efficient Continual Adaptation of Visual-Language Models

Evelyn Chee ⋅ Mong-Li Lee ⋅ Wynne Hsu

Vision-language models (VLMs) exhibit impressive zero-shot transfer, but remain static and cannot adapt when exposed to new tasks. Meanwhile, conventional continual learning often overlooks preserving this zero-shot capability during adaptation. In this work, we present DREAM, a parameter-efficient framework that enables continual adaptation of VLMs while minimizing forgetting and preserving zero-shot performance. DREAM employs a dynamic prompt system with lightweight, task-specific parameters managed by two modules: a prompt composition module that dynamically generates prompts to adapt the VLM, and a query-key module that uses learned token weights to reliably activate the appropriate parameters at inference. To enhance robustness, we propose GuidedMix, which creates semantically meaningful mixed images and pairs them with mixture-aware text embeddings to strengthen representation learning through image-text alignment. We further leverage the GuidedMix samples to estimate task-specific query-key similarity thresholds that identify samples of unseen tasks and prevent spurious prompt usage on the VLM, thereby safeguarding its zero-shot behavior. Experiments show that our method adapts efficiently, mitigates forgetting, and maintains strong zero-shot transfer with substantially fewer trainable parameters, showing consistent gains even under partial supervision.
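
The thresholded query-key routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the two-dimensional toy vectors, and the threshold values are all assumptions made for illustration. The idea shown is only that a sample activates a task's prompt when its similarity to that task's key clears the task-specific threshold, and otherwise falls back to the unmodified zero-shot model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_task(query, task_keys, thresholds):
    """Return the best-matching task name, or None for the zero-shot fallback.

    task_keys and thresholds are hypothetical dictionaries keyed by task name.
    """
    best_task, best_sim = None, -1.0
    for task, key in task_keys.items():
        sim = cosine(query, key)
        if sim > best_sim:
            best_task, best_sim = task, sim
    # Only activate the task's prompt if the match clears that task's threshold.
    if best_task is not None and best_sim >= thresholds[best_task]:
        return best_task
    return None

# Toy keys and thresholds (assumed values, not taken from the paper).
task_keys = {"birds": [1.0, 0.0], "cars": [0.0, 1.0]}
thresholds = {"birds": 0.8, "cars": 0.8}

seen = select_task([0.9, 0.1], task_keys, thresholds)    # close to the "birds" key
unseen = select_task([0.7, 0.7], task_keys, thresholds)  # close to neither key
```

In this toy setup the first query is routed to the "birds" prompt, while the second clears no threshold and would be handled by the original VLM without any learned prompt.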

Tue 10 March 9:57 - 10:09 PDT

brat: Aligned Multi-View Embeddings for Brain MRI Analysis

Maxime Kayser ⋅ Maksim Gridnev ⋅ Wanting Wang ⋅ Max Bain ⋅ Aneesh Rangnekar ⋅ Avijit Chatterjee ⋅ Aleksandr Petrov ⋅ Harini Veeraraghavan ⋅ Nathaniel Swinburne

We present brat (brain report alignment transformer), a multi-view representation learning framework for brain magnetic resonance imaging (MRI) trained on MRIs paired with clinical reports. Brain MRIs present unique challenges due to the presence of numerous, highly varied, and often subtle abnormalities that are localized to a few slices within a 3D volume. To address these challenges, we introduce a brain MRI dataset $10\times$ larger than existing ones, containing approximately 80,000 3D scans with corresponding radiology reports, and propose a multi-view pre-training approach inspired by advances in document retrieval. We develop an implicit query-feature matching mechanism and adopt concepts from quality-diversity to obtain multi-view embeddings of MRIs that are aligned with the clinical features given by report sentences. We evaluate our approach across multiple vision-language and vision tasks, demonstrating substantial performance improvements. By publicly releasing our suite of model weights, we aim to facilitate further research in brain MRI analysis.

Tue 10 March 10:09 - 10:21 PDT

Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score

Eman Ali ⋅ Sathira Silva ⋅ Chetan Arora ⋅ Muhammad Haris Khan

Vision-language models (VLMs) like CLIP excel in zero-shot learning by aligning image and text representations through contrastive pretraining. Existing approaches to unsupervised adaptation (UA) for fine-grained classification with VLMs either rely on fixed alignment scores that may not capture evolving, subtle class distinctions or rely on computationally expensive pseudo-labeling strategies that limit scalability. In contrast, we show that modeling fine-grained cross-modal interactions during adaptation produces more accurate, class-discriminative pseudo-labels and substantially improves performance over state-of-the-art (SOTA) methods. We introduce Fine-grained Alignment and Interaction Refinement (FAIR), an innovative approach that dynamically aligns localized image features with descriptive language embeddings through a set of Class Description Anchors (CDA). This enables the definition of a Learned Alignment Score (LAS), which incorporates CDA as an adaptive classifier, facilitating cross-modal interactions to improve self-training in unsupervised adaptation. Furthermore, we propose a self-training weighting mechanism designed to refine pseudo-labels in the presence of inter-class ambiguities. Our approach, FAIR, delivers a substantial performance boost in fine-grained unsupervised adaptation, achieving a notable overall gain of 2.78% across 13 fine-grained datasets compared to SOTA methods.
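
As a rough illustration of weighting pseudo-labels under inter-class ambiguity, the sketch below turns per-class alignment scores into a pseudo-label and a confidence weight. It is not FAIR's weighting mechanism, which the abstract does not specify: the margin between the two most probable classes is a stand-in chosen for illustration, and the function names and score values are hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of alignment scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label(scores):
    """Return (class index, weight) for one sample; needs at least two classes.

    The weight is the probability margin between the top two classes,
    so ambiguous samples contribute less to self-training (an assumption).
    """
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top, runner_up = ranked[0], ranked[1]
    weight = probs[top] - probs[runner_up]
    return top, weight

# Toy alignment scores for three classes (assumed values).
clear_label, clear_weight = pseudo_label([4.0, 0.5, 0.2])
ambiguous_label, ambiguous_weight = pseudo_label([2.0, 1.9, 0.1])
```

Both samples receive class 0 as their pseudo-label, but the second, whose top two scores are nearly tied, gets a much smaller weight than the first.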

Tue 10 March 10:21 - 10:33 PDT

Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation

Liu He ⋅ Xiao Zeng ⋅ Yizhi Song ⋅ Albert Chen ⋅ Lu Xia ⋅ Shashwat Verma ⋅ Sankalp Dayal ⋅ Min Sun ⋅ Cheng-Hao Kuo ⋅ Daniel Aliaga

Multimodal Large Language Models (MLLMs) struggle with accurately capturing camera-object relations, especially for object orientation, camera viewpoint, and camera shots. This stems from the fact that existing MLLMs are trained on images with a limited diversity of camera-object relations and corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. Our framework takes 3D assets as input and uses rendering and diffusion-based image generation models to create photorealistic images preserving precise camera-object relations. Additionally, large language models (LLMs) are used to generate text prompts for guiding visual instruction tuning and controlling image generation. We create Ultimate3D, a dataset of 240K VQAs with precise camera-object annotations, and a corresponding benchmark. MLLMs fine-tuned on our proposed dataset outperform commercial models by a large margin, achieving an average accuracy improvement of 33.4% on camera-object relation recognition tasks. Our code, dataset, and benchmark will contribute to broad MLLM applications.

Tue 10 March 10:33 - 10:45 PDT

CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering

Ben Vardi ⋅ Oron Nir ⋅ Ariel Shamir

Vision-Language Models (VLMs) demonstrate remarkable capabilities in visual understanding and reasoning, such as in Visual Question Answering (VQA), where the model is asked a question related to a visual input. Still, these models can make distinctly unnatural errors, for example, providing (wrong) answers to unanswerable VQA questions, such as questions asking about objects that do not appear in the image. To address this issue, we propose CLIP-UP: CLIP-based Unanswerable Problem detection, a novel lightweight method for equipping VLMs with the ability to withhold answers to unanswerable questions. CLIP-UP leverages CLIP-based similarity measures to extract question-image alignment information to detect unanswerability, requiring efficient training of only a few additional layers, while keeping the original VLMs' weights unchanged. Tested across several models, CLIP-UP achieves significant improvements on benchmarks assessing unanswerability in both multiple-choice and open-ended VQA, surpassing other methods, while preserving original performance on other tasks. We will release our code and training data to support future research.