Understanding the Visual Projection Space of Multimodal LLMs
SungHeon Jeong · Yoojeong Song
Abstract
What does a single vision token really do inside a multimodal large language model (MLLM)? Despite their success, today’s MLLMs rely on a surprisingly simple mechanism: a single projected visual feature $z=P(f_x)$ prepended to the text tokens. Whether this vector merely adds context or actively steers generation remains an open question. In this work, we peel back the layers of this design and introduce a geometric probing framework that reveals how $z$ shapes the model’s output token space. Through latent–token alignment, subspace sensitivity, and controlled perturbations, we uncover distinct vision–language coupling patterns across popular MLLMs (LLaVA, BLIP-2, Kosmos-2). Our findings show that projected vision features lie in low-dimensional, anisotropic cones and can either dominate or defer to the language model depending on the architecture. Some models treat $z$ as a rigid prior; others allow it to guide generation more flexibly. These geometric signatures offer a new lens on MLLM behavior, revealing not just what models say, but how vision tells them to say it.
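As an illustration only (not the authors' released code), one simple way to probe whether projected features $z = P(f_x)$ occupy a low-dimensional, anisotropic cone is to examine the singular value spectrum and mean pairwise cosine similarity of a batch of such features. The sketch below assumes a NumPy array `Z` of projected features; the function name `cone_statistics` and the synthetic data are hypothetical.

```python
import numpy as np


def cone_statistics(Z: np.ndarray):
    """Illustrative geometry probe for projected vision features.

    Z: (N, d) array of projected features z = P(f_x), one row per image.
    Returns the effective rank of the feature cloud and the mean pairwise
    cosine similarity; low effective rank together with high similarity
    is consistent with a narrow, anisotropic cone.
    """
    # Center the features and take the singular value spectrum.
    Zc = Z - Z.mean(axis=0, keepdims=True)
    s = np.linalg.svd(Zc, compute_uv=False)
    p = (s ** 2) / np.sum(s ** 2)  # variance fraction per principal direction
    effective_rank = float(np.exp(-np.sum(p * np.log(p + 1e-12))))

    # Mean pairwise cosine similarity of the (uncentered) features.
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    cos = Zn @ Zn.T
    n = len(Z)
    mean_cos = float((cos.sum() - n) / (n * (n - 1)))

    return effective_rank, mean_cos


if __name__ == "__main__":
    # Stand-in for real projector outputs: a narrow cone around a shared direction.
    rng = np.random.default_rng(0)
    base = rng.normal(size=4096)
    Z = base + 0.1 * rng.normal(size=(256, 4096))
    print(cone_statistics(Z))
```

In practice `Z` would be collected by running a model's projector over a set of images; the synthetic example above simply demonstrates that features clustered around a shared direction yield a small effective rank and a cosine similarity near one.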