SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering
Abstract
Accurate 3D human pose estimation is fundamental for applications such as autonomous driving, augmented reality, and human-robot interaction. State-of-the-art multi-view methods learn to fuse predictions across views by training on large annotated datasets, which leads to poor generalization when the test scenario differs from the training one. To overcome these limitations, we propose SkelSplat, a novel framework for multi-view 3D human pose estimation based on differentiable Gaussian rendering. SkelSplat models the human pose as a skeleton of 3D Gaussians, one per joint, leveraging Gaussian Splatting to obtain a volumetric pose representation. The Gaussians are then optimized by minimizing a differentiable rendering loss, enabling seamless fusion of information from arbitrary numbers and configurations of cameras without any training on 3D ground truth. Since Gaussian Splatting was originally designed for dense scene reconstruction, we propose a novel one-hot encoding scheme that enables independent optimization of individual human joints. SkelSplat outperforms prior approaches, achieving 20.3\,mm error on Human3.6M and 20.9\,mm on CMU Panoptic Studio, without relying on any 3D ground-truth supervision. Experiments on Human3.6M-Occ and Occlusion-Person demonstrate its robustness to occlusions without scenario-specific fine-tuning. Our code is available here: [removed for blind review].
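To make the core idea concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: each joint is a learnable 3D Gaussian splatted into its own heatmap channel per view (a simplified stand-in for the one-hot encoding), and the Gaussian parameters are optimized by a differentiable rendering loss against per-view target heatmaps. The joint count, heatmap resolution, camera matrices, and isotropic 2D splatting are all illustrative assumptions.

```python
import torch

# Illustrative sizes: joints, views, heatmap height/width (assumptions).
J, V, H, W = 17, 4, 64, 64

# Learnable per-joint Gaussian parameters: 3D means and log-scales.
mu = torch.zeros(J, 3, requires_grad=True)
log_sigma = torch.zeros(J, requires_grad=True)

# Toy 3x4 projection matrices, one per view (assumed known; random here).
P = torch.randn(V, 3, 4)

# Pixel grid used to splat Gaussians into 2D heatmaps.
ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                        torch.arange(W, dtype=torch.float32), indexing="ij")

def render(mu, log_sigma):
    """Splat each joint's Gaussian into its own channel for every view."""
    mu_h = torch.cat([mu, torch.ones(J, 1)], dim=1)      # (J, 4) homogeneous
    proj = torch.einsum("vij,kj->vki", P, mu_h)          # (V, J, 3)
    uv = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)   # (V, J, 2) pixels
    sigma = log_sigma.exp().view(1, J, 1, 1)
    d2 = ((xs - uv[..., 0].view(V, J, 1, 1)) ** 2 +
          (ys - uv[..., 1].view(V, J, 1, 1)) ** 2)
    return torch.exp(-d2 / (2 * sigma ** 2))             # (V, J, H, W)

# Target heatmaps would come from a 2D pose detector; random pose here.
target = render(torch.randn(J, 3), torch.zeros(J)).detach()

# Optimize the Gaussians by minimizing the differentiable rendering loss;
# no 3D ground truth is used, only per-view 2D evidence.
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    loss = ((render(mu, log_sigma) - target) ** 2).mean()
    loss.backward()
    opt.step()
```

Because each joint occupies its own rendered channel, the loss gradients for one joint do not interfere with those of another, which is the property the one-hot encoding described above is meant to provide; adding or removing camera views only changes the number of rendered heatmaps summed into the loss.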