The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs
Abstract
Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale language decoders while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), leaving open whether progress reflects genuine visual grounding or language-side scaling. Existing evaluations emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present the Perceptual Observatory, a framework that benchmarks MLLMs across three verticals: (i) simple vision tasks, such as face matching and OCR; (ii) local vs. global understanding, encompassing image matching and a grid pointing game; and (iii) attribution, which tests general perceptual grounding. Each vertical is instantiated with ground-truth datasets of faces and words, systematically perturbed through low-level augmentations and high-level style-transfer illusions. The Perceptual Observatory moves beyond leaderboard accuracy to yield holistic insights into how MLLMs preserve perceptual grounding and relational structure under perturbation, providing a principled foundation for analyzing the strengths and weaknesses of current and future models.
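To make the perturbation protocol concrete, the sketch below shows how a ground-truth image (a face or rendered word) might be degraded at controlled severities and scored against any image-to-label model. This is a minimal illustration assuming a Pillow/NumPy pipeline; the names `perturb`, `robustness_curve`, and `model_fn` are hypothetical and do not correspond to the framework's actual API.

```python
# Hypothetical sketch of a low-level perturbation sweep for one Observatory vertical.
# Names and severity scales are illustrative assumptions, not the framework's API.
from PIL import Image, ImageFilter, ImageEnhance
import numpy as np


def perturb(image: Image.Image, kind: str, severity: float) -> Image.Image:
    """Apply one low-level augmentation at a severity in [0, 1]."""
    if kind == "blur":
        return image.filter(ImageFilter.GaussianBlur(radius=4 * severity))
    if kind == "noise":
        arr = np.asarray(image).astype(np.float32)
        arr += np.random.normal(0.0, 50 * severity, arr.shape)
        return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    if kind == "contrast":
        return ImageEnhance.Contrast(image).enhance(1.0 - 0.8 * severity)
    raise ValueError(f"unknown perturbation: {kind}")


def robustness_curve(image, label, model_fn, kind,
                     levels=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Score a model (any callable image -> predicted label) across severities,
    returning (severity, correct) pairs that trace a robustness curve."""
    return [(s, float(model_fn(perturb(image, kind, s)) == label)) for s in levels]
```

High-level style-transfer illusions would slot into the same interface as additional `kind` values, so accuracy under low-level and high-level perturbations can be compared on a common severity axis.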