V2XScene: Multi-View Consistent 3D Scene Simulation for Collaborative Perception
Abstract
Realistic scene simulation is a promising way to improve autonomous driving. While existing diffusion-based 2D augmentation methods and 3D asset libraries show potential for synthesizing diverse driving scenarios, they often struggle with multi-view photorealistic rendering and consistency. These limitations are particularly challenging for vehicle-to-everything (V2X) collaborative perception, whose effectiveness relies on precise geometric alignment and visual coherence across multiple viewpoints. To address these challenges, we propose V2XScene, a 3D driving scene editing framework that enhances V2X collaborative perception through high-quality 3D vehicle asset generation and consistent multi-view insertion. V2XScene consists of three components: a visual question answering (VQA)-guided generation module for query-driven 3D vehicle asset synthesis; a 3D object mapping module for vehicle placement optimization and occlusion reasoning; and a realistic insertion module for lighting estimation and virtual vehicle insertion. Extensive experiments demonstrate that V2XScene generates multi-view-consistent, realistic driving scenes that significantly improve V2X perception accuracy.
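As a reading aid, the three-stage pipeline named in the abstract can be pictured as a simple sequential composition. The sketch below is purely illustrative: every function name, signature, and data type (`SceneView`, `vqa_guided_generation`, `map_object`, `insert_realistically`) is a hypothetical placeholder, not the authors' actual interface.

```python
import numpy as np
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of the three-stage V2XScene pipeline; all names here
# are placeholders for illustration, not the paper's implementation.

@dataclass
class SceneView:
    image: np.ndarray        # one agent's camera frame, H x W x 3
    camera_pose: np.ndarray  # 4 x 4 extrinsics in the shared world frame

def vqa_guided_generation(query: str):
    """Stage 1: query-driven 3D vehicle asset synthesis."""
    raise NotImplementedError  # e.g. a generative 3D asset model

def map_object(asset, views: List[SceneView]):
    """Stage 2: placement optimization and occlusion reasoning across views."""
    raise NotImplementedError  # returns a 6-DoF pose consistent in all views

def insert_realistically(asset, placement,
                         views: List[SceneView]) -> List[SceneView]:
    """Stage 3: lighting estimation and rendering into every viewpoint."""
    raise NotImplementedError

def v2xscene_edit(query: str, views: List[SceneView]) -> List[SceneView]:
    asset = vqa_guided_generation(query)  # VQA-guided generation module
    placement = map_object(asset, views)  # 3D object mapping module
    return insert_realistically(asset, placement, views)  # insertion module
```

The key structural point the sketch conveys is that one asset and one placement are shared by all collaborating viewpoints, which is what makes the edited scene geometrically aligned and visually coherent across agents.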