Dragonite: Single-Step Drag-based Image Editing with Geometric-Semantic Guidance
Abstract
Precision and efficiency are crucial in image editing, yet existing methods trade one for the other. Drag-based editing techniques enable precise pixel-level manipulation but often suffer from semantic ambiguity and require time-consuming iterative optimization. Conversely, text-based editing methods provide global semantic guidance but lack spatial precision. To address this fundamental trade-off, we introduce Dragonite, a unified single-step image editing framework that seamlessly integrates geometric and semantic control. Our key innovation is a Dual Guidance Module that computes geometric guidance vectors through latent deformation mapping and projects semantic guidance derived from CLIP losses into the same vector space. An angle-aware fusion strategy then combines these guidance vectors, yielding a unified representation that preserves both semantic cues and geometric constraints. We further propose a Latent Optimization Module that performs single-step latent relocation followed by mean-adjusted interpolation, enhancing editing quality while minimizing distortion. Finally, a Latent Stability Control mechanism maintains image consistency throughout the diffusion inversion process. Comprehensive evaluations on the DragBench benchmark demonstrate that Dragonite resolves the conventional trade-off between semantic accuracy and geometric precision, offering an intuitive, real-time solution for image editing.
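The abstract only names the angle-aware fusion strategy without specifying it. As a rough illustration of what fusing two guidance vectors by their angle could look like, the following is a minimal PyTorch sketch; the function name, the conflict-projection rule, and the blending weight `alpha` are all assumptions made for illustration, not the paper's actual mechanism.

```python
import torch
import torch.nn.functional as F

def angle_aware_fusion(g_geo: torch.Tensor, g_sem: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical angle-aware blend of two guidance vectors.

    If the geometric and semantic guidance directions agree
    (cosine similarity >= 0), they are blended directly; if they
    conflict, the component of the semantic guidance that opposes
    the geometric one is projected out first, so semantic cues
    cannot undo the user's geometric edit.
    """
    g = g_geo.flatten()
    s = g_sem.flatten()
    cos = F.cosine_similarity(g, s, dim=0)
    if cos < 0:
        # Remove the anti-parallel component of the semantic guidance.
        s = s - (s @ g) / (g @ g).clamp_min(1e-8) * g
    fused = (1 - alpha) * g + alpha * s
    return fused.view_as(g_geo)

# Usage on dummy guidance fields shaped like a diffusion latent.
g_geo = torch.randn(4, 64, 64)
g_sem = torch.randn(4, 64, 64)
fused = angle_aware_fusion(g_geo, g_sem, alpha=0.5)
```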