iMotion-LLM: Instruction-Conditioned Trajectory Generation
Abstract
We introduce iMotion-LLM, a multimodal large language model (LLM) integrated with trajectory prediction modules for interactive motion generation. Unlike conventional approaches, it generates feasible, safety-aligned trajectories conditioned on textual instructions, enabling adaptable and context-aware driving behavior. The model combines an encoder-decoder trajectory predictor with a pre-trained LLM fine-tuned using LoRA: scene features are projected into the LLM input space, and special tokens are mapped to a multimodal trajectory decoder, enabling text-based interaction and interpretable justification of driving behavior. To support this framework, we introduce two datasets: (1) InstructWaymo, an extension of the Waymo Open Motion Dataset with direction-based motion instructions, and (2) Open-Vocabulary InstructNuPlan, which features safety-aligned instruction-caption pairs and corresponding safe-trajectory scenarios. Our experiments validate that instruction conditioning produces trajectories that follow the intended instruction. iMotion-LLM further demonstrates strong contextual comprehension, achieving 84\% average accuracy in direction feasibility detection and 96\% average accuracy in safety evaluation of open-vocabulary instructions. This work lays the foundation for text-guided motion generation in autonomous driving, supporting simulated data generation, model interpretability, and robust safety-alignment testing for trajectory generation models. Code, datasets, and models will be made publicly available.
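The abstract names two interface points between the LLM and the trajectory modules: a projection of scene-encoder features into the LLM input space, and a mapping from the hidden states of special tokens to a multimodal trajectory decoder. The PyTorch sketch below illustrates one plausible realization of these two pieces; the module names (SceneProjector, TrajectoryHead), all dimensions, and the choice of six modes over an 80-step horizon are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SceneProjector(nn.Module):
    """Projects scene-encoder features into the LLM token-embedding space (assumed design)."""
    def __init__(self, scene_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(scene_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, scene_feats: torch.Tensor) -> torch.Tensor:
        # scene_feats: (batch, num_scene_tokens, scene_dim)
        return self.proj(scene_feats)  # (batch, num_scene_tokens, llm_dim)


class TrajectoryHead(nn.Module):
    """Maps the LLM hidden state at a special trajectory token to multimodal trajectories."""
    def __init__(self, llm_dim: int, num_modes: int = 6, horizon: int = 80):
        super().__init__()
        self.num_modes, self.horizon = num_modes, horizon
        self.decoder = nn.Linear(llm_dim, num_modes * horizon * 2)  # (x, y) per step
        self.mode_scores = nn.Linear(llm_dim, num_modes)            # per-mode confidence logits

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, llm_dim) hidden state at the special token position
        trajs = self.decoder(hidden).view(-1, self.num_modes, self.horizon, 2)
        scores = self.mode_scores(hidden)
        return trajs, scores


# Toy usage with random tensors standing in for real scene features and LLM states.
scene_feats = torch.randn(2, 64, 256)   # 2 scenes, 64 scene tokens, encoder dim 256
llm_states = torch.randn(2, 4096)       # hidden state at the special trajectory token
proj = SceneProjector(scene_dim=256, llm_dim=4096)
head = TrajectoryHead(llm_dim=4096)
scene_tokens = proj(scene_feats)        # ready to prepend to the LLM input sequence
trajs, scores = head(llm_states)        # shapes: (2, 6, 80, 2) and (2, 6)
```

In this reading, the projector supplies the scene context as soft tokens to the instruction-conditioned LLM, while the head recovers continuous multimodal trajectories from designated output positions; how the authors actually wire these components is detailed in the paper body, not here.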