Oral Session 7A: Biometrics, Face, Gesture, and Body Pose II
Motion-Aware Graph Fusion NetWork for 3D Human Pose Estimation
Yen Pham ⋅ Xiaohui Yuan ⋅ Chengyuan Zhuang
Recent state-of-the-art (SOTA) methods in 3D human pose estimation (HPE) typically prioritize lifting 2D pose coordinates to 3D but tend to underemphasize the importance of generalizing under real-world conditions with noisy 2D inputs from off-the-shelf 2D detectors. In this paper, we introduce the Graph Attention Fusion Network (GAtFuN), a novel motion-aware framework that integrates our spatial and temporal graph attention mechanisms to explicitly model joint velocities and motion transformations, resulting in more stable and coherent 3D pose predictions despite being trained with the same dataset pipeline as other SOTA methods. GAtFuN achieves a 7.8% improvement in MPJPE over the current SOTA on the Human3.6M dataset and a 1.9% improvement on the MPI-INF-3DHP dataset, while demonstrating more robust performance on the in-the-wild 3DPW dataset.
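For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. The following is a minimal sketch on toy data, not the authors' evaluation code; the function names `mpjpe` and `relative_improvement` are hypothetical, and reading the reported percentages as relative error reductions is an assumption.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error for one pose.

    pred, gt: sequences of (x, y, z) joint positions in the same unit.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("poses must be non-empty and have the same number of joints")
    # Average the per-joint Euclidean distances.
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_improvement(baseline_error, new_error):
    """Percentage reduction of an error metric relative to a baseline (assumed reading)."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy three-joint pose (values are illustrative only, not from any dataset).
gt = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0), (100.0, 0.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (0.0, 100.0, 0.0), (100.0, 0.0, 12.0)]

error = mpjpe(pred, gt)  # per-joint distances are 5, 0, and 12
```

In practice the metric is averaged over all frames of a test set, often after aligning the root joint; those steps are omitted here.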
UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training
Jiawei Qin ⋅ Xucong Zhang ⋅ Yusuke Sugano
Despite decades of research on data collection and model architectures, current gaze estimation models encounter significant challenges in generalizing across diverse data domains. Recent advances in self-supervised pre-training have demonstrated remarkable generalization across various vision tasks. However, their effectiveness in gaze estimation remains unexplored. We propose UniGaze, which for the first time leverages large-scale in-the-wild facial datasets for gaze estimation through self-supervised pre-training. Through systematic investigation, we clarify the factors that are critical for effective pre-training in gaze estimation. Our experiments reveal that self-supervised approaches designed for semantic tasks fail when applied to gaze estimation, while our carefully designed pre-training pipeline consistently improves cross-domain performance. Through comprehensive experiments on challenging cross-dataset evaluations and novel protocols, including leave-one-dataset-out and joint-dataset settings, we demonstrate that UniGaze significantly improves generalization across multiple data domains while minimizing reliance on costly labeled data. Source code and model will be available.
Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities
Fan Yang ⋅ Quanting Xie ⋅ Atsunori Moteki ⋅ Shoichi Masui ⋅ Shan Jiang ⋅ Kanji Uchino ⋅ Yonatan Bisk ⋅ Graham Neubig
Periodic human activities with implicit workflows are common in manufacturing, sports, and daily life. While short-term periodic activities (characterized by simple structures and high-contrast patterns) have been widely studied, long-term periodic workflows with low-contrast patterns remain largely underexplored. To bridge this gap, we introduce the first benchmark comprising 580 multimodal human activity sequences featuring long-term periodic workflows. The benchmark supports three evaluation tasks aligned with real-world applications: unsupervised periodic workflow detection, task completion tracking, and procedural anomaly detection. We also propose a lightweight, training-free baseline for modeling diverse periodic workflow patterns. Experiments show that: (i) our benchmark presents significant challenges to both unsupervised periodic detection methods and zero-shot approaches based on powerful large language models (LLMs); (ii) our baseline outperforms competing methods by a substantial margin in all evaluation tasks; and (iii) in real-world applications, our baseline demonstrates deployment advantages on par with traditional supervised workflow detection approaches, eliminating the need for annotation and retraining. Our code and dataset will be available on GitHub and HuggingFace.
VAST-ReID: A Low-Light Benchmark Dataset for Person Re-Identification with Visual and Attribute-Rich Semantic Tracking
Hammad Khan ⋅ Rakesh Giri ⋅ Kamalakar Thakare ⋅ Heeseung Choi ⋅ Hyungjoo Jung ⋅ Debi Dogra ⋅ Ig-Jae Kim
The person re-identification (ReID) task is important for designing intelligent surveillance systems. ReID can be highly challenging in low-light and low-resolution scenarios. Existing ReID datasets predominantly feature cropped pedestrian images captured in well-lit environments, often lacking semantic richness, frame-level temporal continuity, and robustness to adverse conditions. To address these limitations, we introduce VAST-ReID, a new benchmark dataset specifically designed for the low-light person ReID task in real-world surveillance contexts. VAST-ReID consists of 1,211 surveillance videos collected at 21 different locations, capturing 169 distinct pedestrians of various age groups. The dataset emphasizes naturally low-light and visually degraded scenarios. Each identity is annotated with dense bounding boxes and enriched with auxiliary semantic labels, including pedestrian attributes and LLM-generated descriptions. While these annotations are not used during supervised training, they provide valuable semantic context for advancing research in language-guided retrieval and attribute-aware modeling. Additionally, we release identity-aligned image crops under the BoxTrack-ReID subset, which has over 14.3K frames sampled at 1 fps from the raw videos, with standard training, gallery, and query splits compatible with the Market-1501 evaluation protocol, enabling straightforward benchmarking. The dataset has been benchmarked against SOTA methods, and experiments reveal substantial room for improvement in ReID research.
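As an illustration of what a Market-1501-style evaluation over query and gallery splits computes, the sketch below ranks gallery entries by distance for each query, discards gallery entries with the same identity and the same camera as the query, and reports rank-1 accuracy and mean average precision (mAP). It is a simplified sketch on toy data, not the benchmark's official script: the function name `evaluate_reid`, the tuple layout, and the distance values are all hypothetical, and the handling of junk or distractor images is omitted.

```python
def evaluate_reid(query, gallery, dist):
    """Rank-1 accuracy and mAP in the style of the Market-1501 protocol.

    query, gallery: lists of (person_id, camera_id) tuples.
    dist[i][j]: distance between query i and gallery entry j (smaller = more similar).
    """
    average_precisions = []
    rank1_hits = 0
    for qi, (q_pid, q_cam) in enumerate(query):
        # Exclude gallery entries showing the same person from the same camera.
        candidates = [
            j for j, (g_pid, g_cam) in enumerate(gallery)
            if not (g_pid == q_pid and g_cam == q_cam)
        ]
        candidates.sort(key=lambda j: dist[qi][j])
        matches = [gallery[j][0] == q_pid for j in candidates]
        if not any(matches):
            continue  # a query with no valid match is skipped
        rank1_hits += int(matches[0])
        # Average precision: mean of precision values at each correct match.
        hits, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precisions.append(hits / rank)
        average_precisions.append(sum(precisions) / len(precisions))
    if not average_precisions:
        raise ValueError("no query has a valid gallery match")
    valid = len(average_precisions)
    return rank1_hits / valid, sum(average_precisions) / valid


# Toy example: two queries, four gallery entries (illustrative values only).
query = [(1, 0), (2, 0)]
gallery = [(1, 0), (1, 1), (2, 1), (3, 1)]
dist = [
    [0.0, 0.5, 0.2, 0.9],
    [0.7, 0.6, 0.1, 0.3],
]
rank1, mean_ap = evaluate_reid(query, gallery, dist)
```

In a real benchmark the distance matrix would come from embeddings produced by the ReID model under test.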
DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors
Kaustubh Kundu ⋅ Hrishav Barua ⋅ Lucy Robertson-Bell ⋅ Zhixi Cai ⋅ Kalin Stefanov
The trend in sign language generation is centered around data-driven generative methods. These methods require vast amounts of precise 2D and 3D human pose data to achieve a generation quality acceptable to the Deaf community. However, most sign language datasets are currently video-based, limited to automatically reconstructed 2D human poses (i.e., keypoints), and lack accurate 3D information. Moreover, manual production of accurate 2D and 3D human pose information from videos is a labor-intensive process. Furthermore, existing state-of-the-art methods for automatic 3D human pose estimation from sign language videos are prone to occlusion, noise, and motion blur effects, resulting in poor reconstruction quality. In response, we introduce DexAvatar, a novel framework to reconstruct biomechanically accurate fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, guided by learned 3D hand and body priors. DexAvatar achieves strong performance on the SGNify motion capture dataset, the only benchmark available for this task, reaching an improvement of 35.11% in the estimation of body and hand poses compared to the state-of-the-art.