SphereEdit: Spherical Semantic Editing in Diffusion Models
Abstract
Despite significant advances in diffusion models, achieving precise and composable image editing without task-specific training remains a challenge. Existing approaches often rely on iterative optimization or linear latent operations, which are slow, brittle, and prone to attribute entanglement (e.g., editing “lipstick” inadvertently alters skin tone). We introduce SphereEdit, a training-free framework that leverages the spherical geometry of diffusion embeddings and token-aware cross-attention to enable interpretable, fine-grained control. We represent semantic attributes as unit-norm directions in the denoiser’s prediction space and show that antipodal symmetry (“old” is approximately the negation of “young”) naturally supports bidirectional edits, while approximate orthogonality enables clean composition through spherical coefficients. At inference, these directions modulate cross-attention activations, producing spatially localized edits without optimization or fine-tuning. SphereEdit achieves sharper, more disentangled edits than prior baselines while remaining plug-and-play and applicable across diverse image editing tasks.
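To make the geometric idea concrete, the sketch below illustrates one plausible reading of the abstract: attribute directions as unit vectors in the denoiser's prediction space, composition by a normalized weighted sum of approximately orthogonal directions (the “spherical coefficients”), and bidirectional edits via negation. All function names (`attribute_direction`, `compose`, `edit_prediction`) and the toy tensors are hypothetical illustrations, not the paper's implementation.

```python
# Minimal sketch of spherical attribute directions, assuming edits act on the
# denoiser's epsilon prediction. Not the authors' code; an illustration only.
import torch


def unit(v: torch.Tensor) -> torch.Tensor:
    """Project a tensor onto the unit sphere (global L2 norm)."""
    return v / v.norm().clamp_min(1e-8)


def attribute_direction(eps_attr: torch.Tensor, eps_base: torch.Tensor) -> torch.Tensor:
    """Hypothetical direction estimate: normalized difference between denoiser
    predictions conditioned on prompts with and without the attribute."""
    return unit(eps_attr - eps_base)


def compose(directions: list[torch.Tensor], weights: list[float]) -> torch.Tensor:
    """Spherical composition: weight approximately orthogonal unit directions,
    then re-project the sum onto the sphere."""
    mixed = sum(w * d for w, d in zip(weights, directions))
    return unit(mixed)


def edit_prediction(eps: torch.Tensor, direction: torch.Tensor, strength: float) -> torch.Tensor:
    """Shift the prediction along an attribute direction; a negative strength
    reverses the edit (antipodal symmetry, e.g. "young" <-> "old")."""
    shifted = eps + strength * direction
    return shifted * (eps.norm() / shifted.norm())  # preserve overall magnitude


# Toy usage with random stand-ins for denoiser predictions.
shape = (4, 64, 64)
eps, eps_neutral = torch.randn(shape), torch.randn(shape)
young = attribute_direction(torch.randn(shape), eps_neutral)
smile = attribute_direction(torch.randn(shape), eps_neutral)
eps_edited = edit_prediction(eps, compose([young, smile], [0.7, 0.5]), strength=2.0)
```

Renormalizing after each step keeps edits on the sphere, which is what lets negation act as the opposite attribute and lets near-orthogonal directions compose without interfering; the abstract's cross-attention modulation, which localizes these edits spatially, is not shown here.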