LVM-Lite: Training Large Vision Models with Efficient Sequential Modeling
Xianhang Li · Hongru Zhu · Sucheng Ren · Linjie Yang · Peng Wang · Heng Wang · Xiaohui Shen · Qing Liu · Cihang Xie
Abstract
Generative pre-training has significantly advanced natural language understanding. Building upon this success, recent research has begun to develop Large Vision Models (LVMs) by leveraging large-scale pre-training on visual sequences, where jointly modeling image token sequences within single images and across sets of images is of key importance. This paper shows that sequential modeling within single images and across multiple images can be efficiently and effectively decoupled. We introduce a two-stage learning pipeline: pre-training on single images, followed by fine-tuning on long image/video sequences. We term this method Large Vision Model Lite (LVM-Lite). Extensive experiments show that LVM-Lite performs strongly across various generative and discriminative benchmarks, matching specifically trained models without the need for task-specific training. Importantly, LVM-Lite accelerates training by up to $2.7\times$ and demonstrates strong scalability.
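To make the two-stage pipeline concrete, below is a minimal sketch of what decoupled sequential modeling could look like, assuming a GPT-style decoder over discrete visual tokens (e.g., from a VQ-style tokenizer). All names and hyperparameters here (`VisualGPT`, `TOKENS_PER_IMAGE`, layer sizes) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 8192        # size of the visual-token codebook (assumed)
TOKENS_PER_IMAGE = 256   # tokens per image after quantization (assumed)

class VisualGPT(nn.Module):
    """Decoder-only transformer that autoregressively predicts visual tokens."""
    def __init__(self, d_model=512, n_layers=4, n_heads=8, max_len=4096):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)

def next_token_loss(model, tokens):
    """Standard autoregressive cross-entropy on shifted token sequences."""
    logits = model(tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))

model = VisualGPT()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Stage 1: pre-train on short sequences, each covering a single image
# (random tokens stand in for real tokenized data).
single_image_batch = torch.randint(0, VOCAB_SIZE, (8, TOKENS_PER_IMAGE))
loss = next_token_loss(model, single_image_batch)
loss.backward(); opt.step(); opt.zero_grad()

# Stage 2: fine-tune on long sequences that concatenate many images
# (e.g., video frames), so the model learns cross-image dependencies.
multi_image_batch = torch.randint(0, VOCAB_SIZE, (2, 8 * TOKENS_PER_IMAGE))
loss = next_token_loss(model, multi_image_batch)
loss.backward(); opt.step(); opt.zero_grad()
```

The efficiency argument follows from attention's quadratic cost in sequence length: running most of training on short single-image sequences is far cheaper than training on long multi-image sequences throughout, which long contexts are reserved for in a brief fine-tuning stage.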