Confidence Through Parallel Attention for Depth and Uncertainty Estimation in Dynamic Environments
Abstract
Monocular depth estimation is crucial for robotics, offering a lightweight and scalable alternative to stereo- or LiDAR-based systems. While recent methods achieve high accuracy, their performance degrades under real-world conditions such as occlusion, texture ambiguity, and domain shift. We introduce ConFiDeNet, a unified framework that jointly predicts metric depth and its associated uncertainty, enabling risk-aware robotic perception. ConFiDeNet employs a lightweight parallel attention module that efficiently fuses semantic cues from DINOv2 dense descriptors with SAM2-based segmentation of densely occluded objects, enhancing structural understanding without sacrificing real-time performance. Furthermore, we explicitly condition the model on environment type, improving generalization across diverse indoor and outdoor scenes without retraining. Our method achieves state-of-the-art results on six datasets under both supervised and zero-shot settings, outperforming nine prior techniques, including Marigold, ZoeDepth, PatchFusion, and MonoProb. With significantly faster inference and high prediction confidence, ConFiDeNet is readily deployable for embodied AI, self-driving applications, and robotic manipulation tasks.
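The abstract does not spell out implementation details, so the following is a minimal PyTorch sketch of how a parallel attention fusion of DINOv2 and SAM2 features, paired with a heteroscedastic depth-and-uncertainty head, could look. All module names, feature dimensions, and the gating scheme are illustrative assumptions, not the released ConFiDeNet code.

```python
# Minimal sketch, NOT the official ConFiDeNet implementation: two attention
# branches run in parallel over shared image tokens, one attending to DINOv2
# dense descriptors and one to SAM2 mask features, and a learned gate mixes
# their outputs so neither semantic cue dominates.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelAttentionFusion(nn.Module):
    """Fuse DINOv2 and SAM2 semantic cues via two parallel attention branches."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # One cross-attention branch per semantic source (assumed design).
        self.attn_dino = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_sam = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens, dino_feats, sam_feats):
        # tokens:     (B, N,  dim) image tokens from the depth backbone
        # dino_feats: (B, Nd, dim) projected DINOv2 dense descriptors
        # sam_feats:  (B, Ns, dim) projected SAM2 mask/instance features
        a, _ = self.attn_dino(tokens, dino_feats, dino_feats)
        b, _ = self.attn_sam(tokens, sam_feats, sam_feats)
        g = self.gate(torch.cat([a, b], dim=-1))  # per-token mixing weights
        return self.norm(tokens + g * a + (1.0 - g) * b)


class DepthUncertaintyHead(nn.Module):
    """Predict per-pixel metric depth and its log-variance from fused tokens."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.depth = nn.Linear(dim, 1)
        self.log_var = nn.Linear(dim, 1)  # log sigma^2, for numerical stability

    def forward(self, fused):
        depth = F.softplus(self.depth(fused))  # keep predicted depth positive
        return depth, self.log_var(fused)


if __name__ == "__main__":
    fuse = ParallelAttentionFusion()
    head = DepthUncertaintyHead()
    tokens = torch.randn(2, 1024, 256)
    dino = torch.randn(2, 1024, 256)
    sam = torch.randn(2, 64, 256)
    depth, log_var = head(fuse(tokens, dino, sam))
    print(depth.shape, log_var.shape)  # torch.Size([2, 1024, 1]) for both
```

A head of this form would typically be trained with a Gaussian negative log-likelihood over depth (in the style of standard heteroscedastic uncertainty estimation), so that the predicted variance down-weights pixels the model cannot explain, such as occluded or texture-ambiguous regions; whether ConFiDeNet uses exactly this loss is not stated in the abstract.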