Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination
Abstract
Rapid progress in large vision-language models (LVLMs) has achieved unprecedented performance on vision-language tasks. However, due to the strong priors of large language models (LLMs) and misaligned attention across modalities, LVLMs often generate outputs inconsistent with the visual content, a phenomenon termed hallucination. To address this, we propose Scalpel, a method that reduces hallucination by refining attention activation distributions toward more credible regions. Scalpel predicts trusted attention directions for each head in the Transformer layers during inference and adjusts activations accordingly. It employs a Gaussian mixture model to capture the multi-peak distributions of attention in the trust and hallucination manifolds, and uses entropic optimal transport (equivalent to the Schrödinger bridge problem) to map Gaussian components precisely. During mitigation, Scalpel dynamically adjusts the intervention strength and direction based on component membership and the mapping relationships between hallucination and trust activations. Extensive experiments across multiple datasets and benchmarks demonstrate that Scalpel effectively mitigates hallucinations, outperforming previous methods and achieving state-of-the-art performance. Moreover, Scalpel is model- and data-agnostic, requiring no additional computation beyond a single decoding step.
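The component-level mapping described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: entropic optimal transport (plain Sinkhorn iterations) couples the means of hypothetical hallucination-manifold and trust-manifold Gaussian components, and a barycentric projection of the coupling yields a per-component steering direction. All names, dimensions, component counts, and statistics below are assumptions made for the example.

```python
import numpy as np

def sinkhorn(a, b, C, eps=1.0, iters=200):
    """Entropic OT between discrete weights a (n,) and b (m,)
    with cost matrix C (n, m); returns the coupling P (n, m)."""
    K = np.exp(-C / eps)            # Gibbs kernel from the cost
    u = np.ones_like(a)
    for _ in range(iters):          # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Hypothetical GMM component means for one attention head (dim 4):
rng = np.random.default_rng(0)
mu_h = rng.normal(size=(3, 4))      # 3 hallucination-manifold components
mu_t = mu_h[:2] + 0.3               # 2 trust-manifold components (toy offset)
a = np.full(3, 1.0 / 3.0)           # component weights (hallucination GMM)
b = np.full(2, 0.5)                 # component weights (trust GMM)

# Squared-Euclidean cost between component means
C = ((mu_h[:, None, :] - mu_t[None, :, :]) ** 2).sum(-1)

P = sinkhorn(a, b, C)               # entropic OT coupling of components

# Barycentric projection: where each hallucination component is mapped
T = (P @ mu_t) / P.sum(axis=1, keepdims=True)
direction = T - mu_h                # per-component steering direction
```

In this toy setup, an activation assigned to hallucination component `i` would be nudged along `direction[i]`; the coupling `P` plays the role of the component-to-component mapping that, per the abstract, modulates intervention strength and direction.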