SynPlay: Large-Scale Synthetic Human Data with Real-World Diversity for Aerial-View Perception
Abstract
We introduce \textbf{SynPlay}, a large-scale synthetic human dataset purpose-built for advancing multi-perspective human identification, with a predominant focus on aerial-view perception. SynPlay departs from traditional synthetic datasets by addressing a critical yet underexplored challenge: identifying humans in aerial scenes where subjects often occupy only tens of pixels. At such scales, fine-grained cues such as facial features and textures become irrelevant, shifting the burden of recognition to human motion, behavior, and interactions. To meet this need, SynPlay implements a novel rule-guided motion generation framework that combines real-world motion capture with motion evolution graphs. This design lets human actions evolve dynamically under high-level game rules rather than predefined scripts, yielding a virtually unbounded space of motion variations. Unlike existing synthetic datasets, which either focus on static visual traits or reuse a limited set of mocap-driven actions, SynPlay captures a wide spectrum of spontaneous behaviors, including complex interactions that emerge naturally from unscripted gameplay. SynPlay also introduces an extensive multi-camera setup spanning UAVs at random altitudes, CCTVs, and a freely roaming UGV, achieving true near-to-far perspective coverage within a single dataset. The majority of instances are captured from aerial viewpoints at varying scales, directly supporting the development of models for long-range human analysis, a setting where existing datasets fall short. The dataset contains over 73k images and 6.5M human instances, with detailed annotations for detection, segmentation, and keypoint tasks. Extensive experiments demonstrate that training with SynPlay significantly improves human identification performance, especially in few-shot and data-scarce scenarios. SynPlay will be publicly released upon acceptance.
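To make the rule-guided motion generation concrete, the minimal Python sketch below illustrates one plausible reading of a motion evolution graph steered by a high-level game rule: nodes are mocap clips, edges are allowed transitions, and a toy "chase" objective biases which edge is taken so that action sequences emerge from play rather than from a fixed script. All node names, the chase rule, and the distance heuristic are our own illustrative assumptions, not SynPlay's actual implementation.

\begin{verbatim}
import random

# Hypothetical motion evolution graph: each node is a mocap clip,
# each edge a transition permitted by the motion data.
MOTION_GRAPH = {
    "idle":        ["walk", "look_around"],
    "walk":        ["run", "idle", "turn"],
    "run":         ["jump", "walk", "dodge"],
    "turn":        ["walk", "idle"],
    "jump":        ["run", "walk"],
    "dodge":       ["run"],
    "look_around": ["idle", "walk"],
}

def chase_rule(state, candidates):
    """Toy game rule: prefer faster motions when the target is far."""
    if state["distance_to_target"] > 10.0:
        fast = [c for c in candidates if c in ("run", "jump", "dodge")]
        if fast:
            return random.choice(fast)
    return random.choice(candidates)

def rollout(start="idle", steps=8):
    """Evolve a motion sequence by walking the graph under the rule."""
    state = {"distance_to_target": 20.0}
    sequence, node = [start], start
    for _ in range(steps):
        node = chase_rule(state, MOTION_GRAPH[node])
        sequence.append(node)
        # Faster motions close the distance, altering later decisions,
        # so no two rollouts need follow the same scripted sequence.
        state["distance_to_target"] -= {"run": 3.0, "jump": 2.0}.get(node, 0.5)
    return sequence

print(rollout())  # e.g. ['idle', 'walk', 'run', 'dodge', ...]
\end{verbatim}

Because the rule reacts to evolving game state rather than replaying fixed clips, repeated rollouts produce distinct sequences, which is the intuition behind the "virtually unbounded" motion variation claimed above.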