VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction
Abstract
Multi-agent trajectory prediction is a key task in computer vision for autonomous systems, particularly in dense and socially interactive environments. Existing methods often struggle to jointly model goal-driven behavior and complex social dynamics, which leads to unrealistic predictions. In this paper, we introduce \textbf{VISTA}, a recursive goal-conditioned transformer architecture that features (1) a cross-attention fusion mechanism to integrate long-term goals with past motion, (2) a social-token attention module enabling fine-grained interaction modeling across agents, and (3) pairwise attention maps to show social influence patterns during inference. Our model enhances the single-agent goal-conditioned approach into a cohesive multi-agent forecasting framework. In addition to the standard evaluation metrics, we also consider trajectory collision rates, which capture the realism of the joint predictions. Evaluated on the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy with dramatically improved interaction modeling. On MADRAS, our approach reduces the average collision rate of strong baselines from 2.14\% to 0.03\%, and on SDD, it achieves a 0\% collision rate while outperforming SOTA models in terms of ADE/FDE and minFDE. These results highlight the model’s ability to generate socially compliant, goal-aware, and interpretable trajectory predictions, making it well-suited for deployment in safety-critical autonomous systems.