A Unified Diffusion-Based Framework for Multi-Agent Trajectory Prediction Integrating Structured Multi-Modal Representations
Abstract
Multi-agent trajectory prediction for autonomous driving in open-world scenarios presents persistent challenges, including high behavioral uncertainty, long-horizon dependencies, and the lack of structured guidance during generation. Existing generative approaches often compromise behavioral fidelity in favor of accuracy or diversity, resulting in predictions that are either unrealistic or difficult to control. We propose M²Traj, a unified framework that couples a closed-loop conditional diffusion model with structured trajectory reasoning and behavior-driven constraints. M²Traj features a history-guided encoder that captures long-range cross-agent dependencies and scene semantics, and a dynamic closed-loop rollout mechanism that refines predictions through goal-conditioned denoising with iterative feedback. To enable fine-grained control, we introduce a learnable behavior guidance module that softly enforces constraints on velocity, collision risk, comfort, and traffic rule adherence. By jointly modeling agent interactions, future constraints, and uncertainty within a structured generative process, M²Traj delivers controllable and reliable predictions across diverse urban scenarios. Extensive experiments on three large-scale benchmarks (Waymo, HighD, and MoCAD) demonstrate that M²Traj achieves competitive or superior performance across standard accuracy, diversity, and behavior-sensitive metrics, highlighting its potential as a generalizable solution for controllable, structure-aware trajectory prediction in complex multi-agent environments.
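To make the idea of softly enforced constraints during denoising concrete, the following is a minimal sketch of generic gradient-based cost guidance, not the M²Traj implementation. It assumes a single agent, a hand-written speed-limit penalty in place of the learnable behavior guidance module, and a placeholder `denoiser` callable; the names `speed_penalty`, `speed_penalty_grad`, `guided_denoise`, `guidance_scale`, `v_max`, and `dt` are all hypothetical.

```python
import numpy as np


def speed_penalty(traj, v_max, dt):
    """Soft cost: sum over segments of max(speed - v_max, 0)^2."""
    vel = np.diff(traj, axis=0) / dt              # (T-1, 2) finite-difference velocity
    speed = np.linalg.norm(vel, axis=1)           # (T-1,)
    return float(np.sum(np.maximum(speed - v_max, 0.0) ** 2))


def speed_penalty_grad(traj, v_max, dt):
    """Analytic gradient of speed_penalty with respect to the waypoints."""
    vel = np.diff(traj, axis=0) / dt
    speed = np.linalg.norm(vel, axis=1)
    excess = np.maximum(speed - v_max, 0.0)
    # d(excess^2)/d(vel) = 2 * excess * vel / speed (zero where the limit is met)
    g_vel = (2.0 * excess / np.maximum(speed, 1e-8))[:, None] * vel
    grad = np.zeros_like(traj)
    grad[1:] += g_vel / dt                        # each segment depends on its end point
    grad[:-1] -= g_vel / dt                       # ... and on its start point
    return grad


def guided_denoise(x, denoiser, steps, guidance_scale, v_max, dt):
    """Alternate a denoising step with a small step down the constraint cost."""
    for k in range(steps, 0, -1):
        x = denoiser(x, k)                        # assumed: returns a cleaner sample
        x = x - guidance_scale * speed_penalty_grad(x, v_max, dt)
    return x


# Toy usage: an identity "denoiser" isolates the effect of the guidance term.
x0 = np.stack([np.arange(6) * 3.0, np.zeros(6)], axis=1)   # 3 m per step, too fast
x1 = guided_denoise(x0, lambda x, k: x, steps=20,
                    guidance_scale=0.1, v_max=1.0, dt=1.0)
```

Because the penalty is applied as a gradient nudge and not as a hard projection, the constraint is encouraged but not guaranteed, which is the sense of "softly" here. Collision, comfort, or rule-adherence costs could be added as further differentiable terms under the same pattern.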