Mobile-Oriented Video Diffusion: Enabling Text-to-Video Generation on Mobile Devices Without Retraining, Compression, or Pruning
Abstract
We present the Mobile-Oriented Video Diffusion (MOVD) framework, the first diffusion-based text-to-video generation framework designed for efficient on-device execution on smartphone-grade hardware, without requiring retraining, compression, or pruning. To address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices, MOVD applies two novel techniques to pre-trained video generative models. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes the intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Furthermore, by integrating Concurrent Inference with Dynamic Loading (CI-DL), which splits large models into smaller segments for execution in limited-memory environments, into the MOVD framework, we enable a text-to-video diffusion model to run on an iPhone 15 Pro, achieving performance comparable to that of high-end GPUs. These results show that MOVD enables efficient, high-quality video generation on resource-constrained mobile devices. We envision MOVD as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on mobile and embedded devices without resource-intensive optimization procedures (e.g., retraining, compression, or pruning).
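For intuition, the sketch below illustrates the general idea behind merging consecutive tokens along the temporal dimension, as described for TDTM: tokens from neighboring frames are averaged before an attention block and duplicated back afterwards, so attention operates on fewer temporal tokens. The tensor layout, merge factor, and helper names (`merge_temporal_tokens`, `unmerge_temporal_tokens`) are our own illustrative assumptions and are not taken from the paper's implementation.

```python
import torch

def merge_temporal_tokens(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Average groups of r consecutive frames' tokens along the temporal axis.

    x: (batch, frames, tokens_per_frame, dim) -- assumed layout, not from the paper.
    Returns a tensor whose frame dimension is reduced by a factor of r.
    """
    b, f, n, d = x.shape
    assert f % r == 0, "frame count must be divisible by the merge factor"
    # Group r consecutive frames and average them, shrinking the attention input.
    return x.view(b, f // r, r, n, d).mean(dim=2)

def unmerge_temporal_tokens(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Duplicate merged tokens back to the original number of frames."""
    b, f_merged, n, d = x.shape
    return x.unsqueeze(2).expand(b, f_merged, r, n, d).reshape(b, f_merged * r, n, d)

# Usage: merge before an attention block and unmerge after it, so the
# quadratic attention cost is paid on roughly 1/r of the temporal tokens.
x = torch.randn(1, 16, 256, 320)            # (batch, frames, tokens, dim)
merged = merge_temporal_tokens(x, r=2)      # (1, 8, 256, 320)
restored = unmerge_temporal_tokens(merged)  # (1, 16, 256, 320)
```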