A Universal Self-Attention Enhancement for Bridging Low-bit Quantization and Vision Transformers
Abstract
Low-bit quantization of Vision Transformers presents significant challenges due to the intrinsic properties of their Multi-head Self-attention (MHSA) modules. In this work, we investigate the quantization issues specific to this mechanism and identify two critical challenges. First, quantization amplifies discrepancies among attention heads, impairing the model’s ability to focus on the most informative regions. Second, high-magnitude values in the attention maps, which encode essential relational information, follow an extremely sharp distribution that makes them especially prone to substantial information loss during quantization. To address these challenges, we propose Quantization-aware Multi-head Self-attention (Q-MHSA), a universal self-attention enhancement module that integrates two lightweight components within MHSA. The Cross-head Concordance Module (CCM) enforces adaptive consistency across attention heads, while the Learnable Smoothness Controller (LSC) replaces the fixed normalization factor with an adaptive mechanism that selectively smooths the distribution of high-magnitude attention values while disregarding less informative low-magnitude values. Designed for seamless integration with any Vision Transformer architecture and quantization-aware training method, Q-MHSA incurs minimal overhead while consistently improving accuracy. For instance, applying Q-MHSA to a 4-bit Swin-S model under the VVTQ quantization framework yields a top-1 accuracy of 83.5\%, a 0.9\% improvement over the baseline, at a marginal cost of 0.01\% in both parameters and FLOPs.
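The abstract does not specify the exact formulations of CCM and LSC, so the following is only a minimal PyTorch sketch under stated assumptions: the LSC is approximated by a learnable per-head temperature that replaces the fixed $1/\sqrt{d_k}$ scaling, and the CCM by a learnable cross-head mixing matrix (initialized to the identity) applied to the attention logits. The class name QMHSASketch and the parameters lsc_scale and ccm_mix are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class QMHSASketch(nn.Module):
    """Illustrative Q-MHSA-style attention block (assumed formulation)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=True)
        self.proj = nn.Linear(dim, dim)
        # LSC (assumed form): learnable per-head scale replacing the fixed
        # 1/sqrt(d_k) factor, initialized to that value.
        self.lsc_scale = nn.Parameter(
            torch.full((num_heads, 1, 1), self.head_dim ** -0.5)
        )
        # CCM (assumed form): learnable cross-head mixing matrix, initialized
        # to the identity so training starts from standard MHSA behaviour.
        self.ccm_mix = nn.Parameter(torch.eye(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                  # each: (B, H, N, d_k)
        logits = (q @ k.transpose(-2, -1)) * self.lsc_scale   # adaptive smoothing
        logits = torch.einsum("gh,bhnm->bgnm", self.ccm_mix, logits)  # cross-head mixing
        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Example usage: drop-in replacement for a standard MHSA block in a ViT.
x = torch.randn(2, 196, 384)              # (batch, tokens, embedding dim)
y = QMHSASketch(dim=384, num_heads=6)(x)
print(y.shape)                            # torch.Size([2, 196, 384])
```

Initializing lsc_scale at $1/\sqrt{d_k}$ and ccm_mix at the identity keeps the module equivalent to standard MHSA at the start of training, which is one plausible way such lightweight components could be integrated with negligible parameter and FLOP overhead.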