Visibility-Guided Self-Supervised Occlusion-Resilient Human Pose Estimation
Abstract
Occlusion remains a significant challenge for existing human pose estimation algorithms, often resulting in inaccurate and anatomically implausible predictions. Although recent occlusion-robust methods report strong performance, they typically rely heavily on supervised learning and privileged information such as multi-view data or temporal sequences. Furthermore, these models often fail under domain changes. Domain-adaptive human pose estimation seeks to mitigate this issue; however, when occlusions are present in the target domain, a common occurrence in real-world applications, the performance of these algorithms deteriorates significantly. To address these challenges, we propose VisOR, a novel Visibility-guided Self-supervised algorithm for Occlusion-Resilient human pose estimation. VisOR achieves robustness to both domain shifts and occlusions by integrating contextual reasoning with iterative pseudo-label refinement. It mitigates overfitting to noisy pseudo-labels from occluded regions via a visibility-driven curriculum learning strategy that progressively exposes the model to increasingly occluded training samples. Additionally, VisOR is regularized by a learned human pose prior that maintains anatomical plausibility throughout the adaptation process. Recognizing the scarcity of human pose datasets with realistic occlusions, we introduce BOW: Blended Occlusions in-the-Wild, a rigorously constructed, context-aware synthetic benchmark designed to evaluate the occlusion resilience of human pose estimation algorithms. BOW offers a diverse range of context-aware occlusions across both indoor and outdoor environments, simulating real-world conditions. Through extensive experiments, we demonstrate that VisOR outperforms current state-of-the-art methods by ~7% on challenging occluded human pose estimation benchmarks, and we provide baseline results for existing algorithms on BOW.