SmoothDiffusion-VE: Real-time Generative Video Editing Using Adaptive Feature Cache
Mustafa Munir · Sophia Zalewski · Shiqiu Liu · David Tarjan · Sushmitha Belede · Anjul Patney · Radu Marculescu
Abstract
Video editing with diffusion models presents significant challenges, especially under real-time constraints. Current methods either enhance temporal consistency at the cost of slow processing or rely on frame-by-frame editing, leading to flickering and temporal artifacts. To address both challenges, we propose SmoothDiffusion-VE, a streaming-based editing approach that improves temporal consistency and processing speed through an Adaptive Feature Cache (AFC) and motion-guided attention. The AFC dynamically adjusts caching behavior based on perceptual similarity (LPIPS) between frames, shifting to a mini-cache mode for similar frames to reduce computational load. Conversely, significant frame changes trigger deeper caching to maintain robust temporal coherence. Our motion-guided attention selectively focuses on dynamic regions using optical flow, reducing unnecessary computation in static areas and accelerating processing. SmoothDiffusion-VE runs at 60 FPS on a single A100 GPU and 28 FPS on a single RTX 4090 GPU, achieving a 1564$\times$ speedup over Plug-and-Play Diffusion (PNP) and a 1916$\times$ speedup over Diffusion Motion Transfer (DMT), delivering a powerful solution for fast and consistent video editing.
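To make the LPIPS-gated caching policy described above concrete, the sketch below shows how a similarity threshold could switch between a mini-cache and a deep-cache pass. This is a minimal illustration, not the authors' implementation: the `AdaptiveFeatureCache` class, the `threshold` value, and the `"mini"`/`"deep"` mode names are assumptions; only the perceptual distance, computed with the publicly available `lpips` package, is taken from the abstract.

```python
import torch
import lpips  # pip install lpips

class AdaptiveFeatureCache:
    """Hypothetical sketch of LPIPS-gated feature caching (illustrative only)."""

    def __init__(self, threshold=0.1, net="alex", device="cuda"):
        # Perceptual distance; expects NCHW image tensors scaled to [-1, 1].
        self.lpips_fn = lpips.LPIPS(net=net).to(device).eval()
        self.threshold = threshold        # assumed switching threshold
        self.prev_frame = None
        self.cached_features = None

    @torch.no_grad()
    def mode_for(self, frame):
        """Return 'mini' when the new frame is perceptually close to the
        previous one (reuse cached diffusion features), otherwise 'deep'
        (recompute features and refresh the cache)."""
        if self.prev_frame is None:
            self.prev_frame = frame
            return "deep"
        dist = self.lpips_fn(self.prev_frame, frame).item()
        self.prev_frame = frame
        return "mini" if dist < self.threshold else "deep"

    def update(self, features):
        # Store features produced during a deep pass for reuse in mini mode.
        self.cached_features = features
```

In this reading, a deep pass runs the full diffusion feature computation and refreshes the cache, while a mini pass reuses `cached_features` for perceptually similar frames, trading a small amount of fidelity for the large speedups reported above.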