MoSCo: Real-time and Efficient Text-to-Motion Synthesis via Delta Training
Zhiyuan Zhang · Lingqiao Liu
Abstract
Generating expressive, fine-grained human motion from text remains a formidable challenge, particularly when aiming for high fidelity without excessive computational cost. Existing methods often rely on complex, multi-stage pipelines with slow inference and large memory footprints, hindering real-time deployment. To address these limitations, we introduce MoSCo, a simple autoregressive text-to-motion framework that discretizes motion into part-level token sequences and models temporal dynamics via a $\textbf{Delta-based training}$ strategy, i.e., predicting the motion difference from the previous time step. These tokens are fused with textual embeddings through our $\textbf{Part-Aware Coordinator (PAO)}$, and the full sequence is generated with a single, lightweight transformer decoder. MoSCo sets a new milestone in text-to-motion inference speed, achieving an AITS of just $\textbf{0.002s}$ (vs. 0.03s), over an order of magnitude faster than all prior methods, while maintaining a compact model footprint and delivering highly realistic motions ($\textbf{FID 0.085}$), making real-time, high-quality generation practical.
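The delta-based target described above can be sketched as follows. This is a minimal illustration under our own assumptions (motion as a `(T, D)` array of frame features; the function names `to_deltas`/`from_deltas` are hypothetical, not from the paper): the model is trained to predict the frame-to-frame difference, and the absolute motion is recovered by cumulative summation.

```python
import numpy as np

def to_deltas(motion: np.ndarray) -> np.ndarray:
    """Convert an absolute motion sequence of shape (T, D) into per-step
    deltas. The first frame is kept as-is so the sequence is invertible."""
    deltas = np.empty_like(motion)
    deltas[0] = motion[0]
    deltas[1:] = motion[1:] - motion[:-1]
    return deltas

def from_deltas(deltas: np.ndarray) -> np.ndarray:
    """Invert to_deltas: a cumulative sum over time recovers the motion."""
    return np.cumsum(deltas, axis=0)
```

A model trained on `to_deltas(motion)` only has to capture the (typically small, smooth) frame-to-frame change rather than absolute poses, which is the intuition behind supervising on differences.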