4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos
Abstract
Reconstructing animatable 3D animals from videos has traditionally depended on sparse semantic keypoints to fit parametric models. However, acquiring these keypoints is labor-intensive, and keypoint detectors trained on limited animal data are often unreliable. We propose \textbf{4D-Animal}, a keypoint-free framework that reconstructs animatable 3D animals directly from videos. Our method employs a dense feature network that maps 2D image representations to SMAL parameters, improving both efficiency and stability. In addition, we introduce a hierarchical alignment strategy that leverages silhouette, part-level, pixel-level, and temporal cues from pretrained 2D models, ensuring accurate and temporally coherent reconstructions. Extensive experiments show that 4D-Animal outperforms both model-based and model-free baselines on a dog dataset. Moreover, the high-quality 3D assets it generates can benefit other 3D tasks, underscoring its potential for large-scale applications. The code will be released online.