ATM: Enhanced Alignment for Text-to-Motion Generation
Ke Han · Yueming Lyu · Weichen Yu · Nicu Sebe
Abstract
Existing text-to-motion (T2M) generation methods primarily rely on regression-based objectives, such as minimizing positional errors. However, they lack effective semantic supervision and correction mechanisms, often leading to substantial misalignment between text and motion. To address this, we propose $\textbf{Aligned Text-to-Motion (ATM)}$, a semantics-aware generation framework that automatically identifies and corrects text-motion misalignment. ATM incorporates two key components: (1) $\textbf{Inter-motion alignment}$, which detects semantic contradictions across motions and applies adaptive corrections based on the degree of semantic discrepancy, flexibly handling diverse misalignments and ensuring global text-motion consistency; (2) $\textbf{Intra-motion alignment}$, which refines locally missing or inaccurate motion semantics in an unsupervised manner by inferring semantic proxies, effectively addressing the absence of localized textual annotations. ATM is model-agnostic and can be seamlessly integrated into various T2M methods as a plug-and-play module. Extensive experiments on HumanML3D and KIT demonstrate that ATM consistently improves both generation quality and text-motion alignment. Code will be released upon acceptance.
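To make the inter-motion alignment idea concrete, the following is a minimal sketch (not the authors' actual formulation, which the abstract does not specify) of how a discrepancy-adaptive correction loss could look. It assumes hypothetical pretrained `text_encoder` and `motion_encoder` modules producing batch-aligned embeddings, and weights a contrastive text-motion correction more strongly for pairs with larger semantic gaps.

```python
import torch
import torch.nn.functional as F

def inter_motion_alignment_loss(text_emb: torch.Tensor,
                                motion_emb: torch.Tensor,
                                tau: float = 0.07) -> torch.Tensor:
    """Hypothetical sketch of a discrepancy-adaptive alignment loss.

    text_emb, motion_emb: (B, D) embeddings from assumed pretrained
    text/motion encoders; row i of each is a matched text-motion pair.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    # Pairwise similarity between every text and every motion in the batch.
    sim = text_emb @ motion_emb.t() / tau                       # (B, B)
    # Semantic discrepancy of each matched pair: undo the temperature so
    # sim.diag()*tau is cosine similarity in [-1, 1]; discrepancy in [0, 2].
    discrepancy = 1.0 - sim.diag() * tau
    # Adaptive weight: apply a stronger correction where the gap is larger.
    weight = discrepancy.detach()
    targets = torch.arange(sim.size(0), device=sim.device)
    per_pair = F.cross_entropy(sim, targets, reduction="none")  # text -> motion
    return (weight * per_pair).mean()
```

Under this reading, well-aligned pairs contribute little gradient while semantically contradictory pairs are corrected aggressively, matching the abstract's description of corrections scaled by the degree of semantic discrepancy.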