Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis
Abstract
While representation learning and generative modeling both seek to understand visual data, unifying the two domains remains underexplored. Recent Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between both paradigms. However, they either extract information from discriminative pretrained models or rely solely on semantic token reconstruction, which requires an external tokenizer during training, introducing a significant computational overhead. In this work, we introduce Sorcen, a novel Unified SSL framework incorporating a synergistic Contrastive-Reconstruction objective. Our novel Contrastive objective leverages the generative capabilities of Sorcen and eliminates the need for additional image crops or augmentations during training. Sorcen generates contrastive positive samples, called Echoes, directly in the semantic token space using the reconstruction objective. This on-the-fly Echo generation enables Sorcen to operate exclusively on precomputed tokens, eliminating the need for an online tokenizer during training. Sorcen reduces the computational overhead by 60.8% compared to the token-reconstruction SoTA. Extensive experiments on ImageNet-1k demonstrate that Sorcen outperforms the previous Unified SSL SoTA by 0.4%, 1.48 FID, 1.76%, and 1.53% on linear probing, unconditional image generation, few-shot learning, and transfer learning, respectively. Additionally, Sorcen establishes itself as a new single-crop MIM SoTA in linear probing and achieves SoTA performance in unconditional image generation, highlighting significant improvements and breakthroughs in Unified SSL models.
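The core idea of the abstract, generating contrastive positives ("Echoes") in token space from the reconstruction head instead of using image crops, can be illustrated with a toy sketch. This is a minimal, hypothetical NumPy mock-up, not Sorcen's actual implementation: the encoder, reconstruction head, shapes, and the InfoNCE formulation here are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: B images, each pre-tokenized offline into T discrete
# tokens from a codebook of size V, embedded in D dimensions. Working on
# precomputed tokens is what removes the online tokenizer from training.
B, T, V, D = 4, 16, 64, 32

tokens = rng.integers(0, V, size=(B, T))   # precomputed semantic tokens
embed = rng.normal(size=(V, D))            # stand-in token embedding table

def encode(tok):
    """Toy 'encoder': mean-pooled token embeddings (placeholder for a real backbone)."""
    return embed[tok].mean(axis=1)         # (B, D)

def reconstruct_logits(tok):
    """Toy reconstruction head: noisy logits over the codebook per position."""
    return embed[tok] @ embed.T + 0.1 * rng.normal(size=(tok.shape[0], tok.shape[1], V))

# Echo generation: decode the reconstruction back into token space to obtain
# a positive view of each sample -- no extra crops or augmentations needed.
echoes = reconstruct_logits(tokens).argmax(axis=-1)   # (B, T)

def info_nce(z1, z2, tau=0.1):
    """InfoNCE between two batches of embeddings; positives sit on the diagonal."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                      # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

loss = info_nce(encode(tokens), encode(echoes))
```

In this sketch, each image's Echo serves as its positive pair while the rest of the batch provides negatives; in the actual framework this contrastive term is trained jointly with the token-reconstruction objective.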