MuseDance: A Diffusion-based Music-Driven Image Animation System
Abstract
Image animation is a rapidly developing area in multimodal research, with a focus on generating videos from reference images. While much of the work has emphasized generic video generation guided by text, music-driven dance image animation remains underexplored. In this paper, we introduce MuseDance, an end-to-end model that animates reference images using both music and text inputs. By integrating music as a conditioning modality, MuseDance generates personalized videos that not only adhere to textual descriptions but also synchronize character movements with the rhythm and dynamics of the music. Unlike existing methods, MuseDance eliminates the need for explicit motion guidance, such as pose sequences or depth maps, reducing the complexity of video generation while enhancing accessibility and flexibility. To support further research in this field, we present a new multimodal dataset comprising 2,904 dance videos, each paired with its corresponding background music and a text description. Our approach leverages diffusion-based methods to achieve robust generalization, precise control, and temporal consistency, setting a new benchmark for the music-driven image animation task.
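To make the conditioning idea concrete, the sketch below shows one way a diffusion sampler could be driven by music and text embeddings in place of pose or depth guidance. This is a minimal toy illustration, not the MuseDance architecture: the module names, embedding sizes, noise schedule, and the simple MLP denoiser are all assumptions introduced here for exposition.

```python
# Minimal sketch (assumed, not the authors' implementation): a toy DDPM-style
# sampling loop whose denoiser is conditioned on pooled music and text
# embeddings rather than explicit pose/depth maps. Sizes are illustrative.
import torch
import torch.nn as nn

T = 50        # diffusion steps (assumed)
LATENT = 64   # per-frame latent size (assumed)
COND = 32     # music/text embedding size (assumed)
FRAMES = 16   # frames denoised jointly, a simple stand-in for temporal modeling

class CondDenoiser(nn.Module):
    """Predicts noise for all frame latents given music + text conditions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT + 2 * COND + 1, 256), nn.SiLU(),
            nn.Linear(256, LATENT),
        )

    def forward(self, z, t, music_emb, text_emb):
        # z: (FRAMES, LATENT); conditions are broadcast to every frame.
        t_feat = torch.full((z.size(0), 1), float(t) / T)
        cond = torch.cat([music_emb, text_emb], dim=-1).expand(z.size(0), -1)
        return self.net(torch.cat([z, cond, t_feat], dim=-1))

def sample(model, music_emb, text_emb):
    """DDPM ancestral sampling over per-frame latents with a toy schedule."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    z = torch.randn(FRAMES, LATENT)                 # start from pure noise
    for t in reversed(range(T)):
        eps = model(z, t, music_emb, text_emb)      # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        z = (z - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return z  # in a full system these latents would be decoded into frames

if __name__ == "__main__":
    model = CondDenoiser()
    music_emb = torch.randn(1, COND)  # stand-in for a music encoder output
    text_emb = torch.randn(1, COND)   # stand-in for a text encoder output
    latents = sample(model, music_emb, text_emb)
    print(latents.shape)              # torch.Size([16, 64])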