BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis
Abstract
Text-to-motion generation enables language-driven animation, yet current models struggle to deliver long-range coherence and fine-grained limb coordination. A competitive system must (i) preserve temporal consistency across hundreds of frames, (ii) synchronize limb motions, and (iii) align nuanced sentences with a spectrum of plausible trajectories. We introduce BiPO, the first part-based bidirectional autoregressive network trained with a lightweight Partial Occlusion regularizer. Each limb attends to both past and future frames for anticipatory coordination, while stochastic masking of cross-part context weakens spurious dependencies and encourages varied solutions. On HumanML3D and KIT-ML, BiPO lowers FID by 15–30\% relative to MoMask and BAMM, achieves the highest human-rated realism scores, and sets new state-of-the-art results on motion-editing tasks that require in-filling partial sequences. These findings demonstrate that bidirectional reasoning coupled with Partial Occlusion yields a length-agnostic, high-fidelity framework for expressive, language-conditioned motion synthesis.
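To make the Partial Occlusion idea concrete, the following minimal training-time sketch randomly hides other body parts' features so each part cannot rely on spurious cross-part cues. The tensor layout, masking probability, and all names (partially_occlude, part_tokens, occlusion_prob) are illustrative assumptions for exposition, not the authors' released implementation.

# Sketch of Partial Occlusion as described in the abstract (assumed layout):
# part_tokens holds per-part motion features of shape (batch, parts, time, dim).
# With probability occlusion_prob, an entire part is zeroed for a sequence,
# forcing the remaining parts to be modeled without conditioning on it.
import torch

def partially_occlude(part_tokens: torch.Tensor, occlusion_prob: float = 0.3) -> torch.Tensor:
    b, p, t, d = part_tokens.shape
    # One Bernoulli draw per (sequence, part); 1 = keep the part, 0 = occlude it.
    keep = (torch.rand(b, p, 1, 1, device=part_tokens.device) > occlusion_prob).float()
    return part_tokens * keep

# Example: occlude parts of a batch of 4 sequences, 5 body parts, 196 frames.
features = torch.randn(4, 5, 196, 64)
occluded = partially_occlude(features, occlusion_prob=0.3)

Applied only during training, this acts as a dropout-style regularizer over body parts; at inference all parts are kept, so no extra cost is incurred.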