Enhancing Vision Language Corruption Robustness using Cross Distribution & Prompted Denoisers
Abstract
The current generation of Vision Language Models (VLMs) excels under idealized conditions, but their performance drops significantly when exposed to realistic multimodal corruptions, e.g., blurry images or grammatically incorrect text. Our work addresses this by establishing a novel multimodal corruption and denoising benchmark, with a rich suite of 18 visual and 18 textual corruption functions, to evaluate the robustness of VLM systems. To enhance robustness, we employ: (i) cross-distribution visual denoisers, inspired by the Mixture of Experts (MoE) architecture, and (ii) a prompted zero-shot textual denoiser. Experimental results show up to a 5.5% overall accuracy gain and up to 9% improvement on certain VL tasks. Our experiments reveal the vulnerability of models to specific corruptions and their over-reliance on the textual modality. We envision that the detailed behavioral insights from our benchmark will aid the development of robust VLM systems.