OSEG: Improving Diffusion Sampling through Orthogonal Smoothed Energy Guidance
Abstract
Pre-trained text-to-image (T2I) diffusion models produce high-quality images from random noise guided by textual prompts, but classifier-free guidance (CFG) often yields suboptimal samples. Retraining diffusion models can enhance performance, but the associated computational cost is often prohibitive. To address this, researchers have proposed modifying the CFG step with perturbation techniques that improve sampling at little extra cost. For instance, Smoothed Energy Guidance (SEG) enhances image quality by reducing the curvature of the energy landscape in self-attention layers. However, SEG’s indiscriminate curvature reduction can suppress fine details and cause prompt misalignment. We introduce Orthogonal Smoothed Energy Guidance (OSEG), which selectively smooths the orthogonal components of token embeddings, preserving critical details while maintaining prompt alignment. We provide a theoretical analysis that justifies OSEG’s effectiveness and support it with empirical comparisons against recent methods. When plugged into text-to-video (T2V) models, OSEG consistently improves visual and temporal coherence. Finally, we present extensive metric comparisons demonstrating the effectiveness of OSEG in both the image and video domains.
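To make the core idea concrete, the sketch below shows one plausible reading of the smoothing step: each token embedding is split into a component parallel to a reference direction (assumed here to be a pooled prompt embedding) and an orthogonal residual, and only the residual is Gaussian-smoothed along the token axis, analogous to SEG's blur. The function name `orthogonal_smooth`, the choice of reference, and the kernel parameters are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def orthogonal_smooth(tokens: torch.Tensor,
                      reference: torch.Tensor,
                      kernel_size: int = 5,
                      sigma: float = 2.0) -> torch.Tensor:
    """Hypothetical sketch: smooth only the component of each token
    embedding that is orthogonal to a reference direction.

    tokens:    (batch, num_tokens, dim) token embeddings
    reference: (batch, dim) reference direction (assumed: prompt embedding)
    """
    # Unit reference direction per sample.
    ref = F.normalize(reference, dim=-1).unsqueeze(1)            # (B, 1, D)

    # Split tokens into parallel and orthogonal components.
    coeff = (tokens * ref).sum(dim=-1, keepdim=True)             # (B, N, 1)
    parallel = coeff * ref                                       # (B, N, D)
    orthogonal = tokens - parallel                               # (B, N, D)

    # 1D Gaussian kernel applied along the token axis.
    half = kernel_size // 2
    x = torch.arange(-half, half + 1,
                     dtype=tokens.dtype, device=tokens.device)
    kernel = torch.exp(-x.pow(2) / (2 * sigma ** 2))
    kernel = (kernel / kernel.sum()).view(1, 1, -1)              # (1, 1, K)

    # Channel-wise convolution: treat every embedding dimension as
    # an independent 1D signal over the token sequence.
    B, N, D = orthogonal.shape
    signal = orthogonal.permute(0, 2, 1).reshape(B * D, 1, N)
    smoothed = F.conv1d(signal, kernel, padding=half)
    smoothed = smoothed.reshape(B, D, N).permute(0, 2, 1)        # (B, N, D)

    # Recombine: the prompt-aligned component passes through exactly;
    # only the orthogonal residual has been smoothed.
    return parallel + smoothed
```

Under this reading, preserving the parallel component untouched is what would let OSEG retain prompt alignment while still flattening the energy landscape in the directions SEG smooths indiscriminately. A shape check with placeholder inputs:

```python
tokens = torch.randn(2, 77, 768)   # e.g., text-token embeddings
prompt = torch.randn(2, 768)       # assumed pooled prompt embedding
out = orthogonal_smooth(tokens, prompt)
assert out.shape == tokens.shape
```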