MR-Pruner: Training-free Multi-resolution Visual Token Pruning for Multi-modal Large Language Models
Abstract
Large Language Models (LLMs) extended to multi-modal inputs have led to Multi-modal LLMs (MLLMs) that perform strongly on vision-language tasks. Recent MLLMs adopt multi-resolution inputs to capture both global context and local details, but this substantially increases the number of visual tokens and the computational cost. Existing pruning methods reduce this redundancy but are designed for single-resolution settings and overlook the characteristics of multi-resolution tokens. We observe two key properties: tokens from different resolutions follow distinct distributions of information content, and tokens across resolutions are mutually complementary, so that information lost by pruning tokens at one resolution can often be compensated by tokens at another. Based on these observations, we propose the Multi-Resolution Token Pruning method (MR-Pruner), a training-free, graph-based pruning framework for multi-resolution MLLMs. MR-Pruner comprises three components: Intra-resolution Token Scoring, Cross-resolution Token Scoring, and Informativeness-aware Token Pruning, which together adaptively allocate pruning ratios and facilitate information propagation across resolutions. Experiments on eight benchmarks show that MR-Pruner achieves superior efficiency–performance trade-offs; for example, retaining only 10\% of the visual tokens incurs an average performance degradation of just 3.6\%. For reproducibility, the source code is available at https://anonymous.4open.science/r/MR-Pruner.
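To make the general idea concrete, the following is a minimal, illustrative sketch of multi-resolution visual token pruning with an adaptive per-resolution keep budget. It is not the MR-Pruner algorithm itself: the scoring heuristic (cosine similarity to a pooled query feature), the proportional budget split, and all function names are assumptions introduced purely for illustration.

```python
# Illustrative sketch only; the informativeness score and budget allocation
# below are assumptions, not the paper's actual MR-Pruner components.
import torch


def prune_multires_tokens(low_res, high_res, query, keep_ratio=0.1):
    """Keep roughly `keep_ratio` of all visual tokens across two resolutions.

    low_res, high_res: (N_low, d) and (N_high, d) visual token features.
    query:             (d,) pooled text/query feature used for scoring.
    """
    def scores(tokens):
        # Stand-in informativeness score: cosine similarity to the query.
        return torch.nn.functional.cosine_similarity(
            tokens, query.unsqueeze(0), dim=-1
        )

    s_low, s_high = scores(low_res), scores(high_res)
    total_budget = max(1, int(keep_ratio * (len(low_res) + len(high_res))))

    # Adaptive allocation: a resolution with more total informativeness
    # receives a proportionally larger share of the keep budget.
    mass_low = s_low.clamp(min=0).sum()
    mass_high = s_high.clamp(min=0).sum()
    frac_low = (mass_low / (mass_low + mass_high + 1e-6)).item()
    k_low = min(len(low_res), max(1, round(total_budget * frac_low)))
    k_high = min(len(high_res), max(1, total_budget - k_low))

    # Retain the top-scoring tokens within each resolution.
    keep_low = low_res[s_low.topk(k_low).indices]
    keep_high = high_res[s_high.topk(k_high).indices]
    return keep_low, keep_high


if __name__ == "__main__":
    torch.manual_seed(0)
    low, high, q = torch.randn(64, 32), torch.randn(256, 32), torch.randn(32)
    kl, kh = prune_multires_tokens(low, high, q, keep_ratio=0.1)
    print(kl.shape, kh.shape)  # roughly 10% of 320 tokens retained overall
```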