Oral Session
Oral Session 6A: 3D Computer Vision II
OMeGa: Joint Optimization of Explicit Meshes and Gaussian Splats for Robust Scene-Level Surface Reconstruction
Yuhang Cao ⋅ Haojun Yan ⋅ Danya Yao
Neural rendering algorithms for Novel View Synthesis and Scene Reconstruction tasks have recently received much attention with the advancement of 3D Gaussian Splatting. For mesh reconstruction, most existing works over-fit a Gaussian Splatting model to multi-view images and obtain the triangle mesh from the model using post-optimization extraction strategies. However, such methods exhibit the following limitations: 1) Gaussian Splats often yield inaccurate geometry in indoor scene reconstructions, particularly in texture-less regions, leading to suboptimal triangle mesh quality; 2) the mesh extraction is entirely decoupled from the optimization process, neglecting the potential of using mesh geometry as constraints to guide the optimization of Gaussian Splats. To address these challenges, this paper introduces a novel end-to-end differentiable framework for both rendering and geometry reconstruction tasks. Our key contribution involves jointly optimizing 2D splats and an explicit 3D mesh representation through a flexible binding strategy during the training process. This allows our approach to effectively leverage mesh geometry constraints to guide the optimization of 2D splats while preserving sufficient flexibility, resulting in both accurate alignment with scene surfaces and expressive texture representation. Furthermore, as another core component of our method, we design an iterative mesh refinement technique, including a novel gradient-based subdivision strategy and a mesh face removal strategy, to further improve the detail and accuracy of the reconstructed mesh. Extensive experiments show that our joint-representation framework achieves overall state-of-the-art performance on challenging benchmarks, effectively addressing prior limitations associated with indoor scene reconstruction.
Confidence Through Parallel Attention for Depth and Uncertainty Estimation in Dynamic Environments
Onkar Susladkar ⋅ Rohit Pawar ⋅ Chirag Sehgal ⋅ Samaksh Ujjawal ⋅ Sparsh Mittal
Monocular depth estimation is crucial for robotics, offering a lightweight and scalable alternative to stereo or LiDAR-based systems. While recent methods have achieved high accuracy, their efficacy degrades under real-world conditions such as occlusion, texture ambiguity, and domain shifts. We introduce ConFiDeNet, a unified framework that jointly predicts metric depth and associated uncertainty, enabling risk-aware robotic perception. ConFiDeNet employs a lightweight parallel attention module that efficiently fuses semantic cues from DINOv2 dense descriptors and SAM2-based segmentation for densely occluded objects, enhancing structural understanding without sacrificing real-time performance. Furthermore, we explicitly condition the model on environment type, improving generalization across diverse indoor and outdoor scenes without retraining. Our method achieves state-of-the-art results across six datasets under both supervised and zero-shot settings, outperforming nine prior techniques, including Marigold, ZoeDepth, PatchFusion, and MonoProb. With significantly faster inference and high prediction confidence, ConFiDeNet is readily deployable for embodied AI, self-driving applications, and robotic manipulation tasks.
BiNAR: A Bi-Modal Framework for Non-Aligned RGB-IR 3D Reconstruction via Gaussian Splatting
Zhongwen Wang ⋅ Han Ling ⋅ Weihao Zhang ⋅ Yinghui Sun ⋅ Quansen Sun
Existing RGB-IR (infrared) bi-modal 3D reconstruction methods generally have difficulty simultaneously processing non-aligned multi-modal data with significant differences in resolution and spectral characteristics and achieving high-precision pixel-level reconstruction. Non-aligned RGB-IR 3D reconstruction and rendering represents a new domain. To this end, we propose BiNAR, a bi-modal framework that can directly process non-aligned data collected by conventional RGB and IR cameras and generate high-resolution, pixel-level aligned renderings. BiNAR first uses cross-modal multi-camera joint calibration to accurately estimate the intrinsic and extrinsic parameters of the RGB and IR cameras and unify the coordinate system; then, it fuses the features of different modalities in the Unified Gaussian Field and jointly optimizes the Gaussians to achieve a cross-modally consistent 3D scene representation. Experimental results show that BiNAR significantly outperforms traditional single-modal and bi-modal Gaussian splatting methods in rendering quality, achieving a sub-pixel average reprojection error of 0.242 px and improving IR PSNR by 12.22 dB. We also build a pixel-level aligned RGB-IR dataset covering a variety of indoor and outdoor scenes and including real temperature data, providing a reliable benchmark for subsequent multi-modal research. The code and dataset will be available.
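For readers unfamiliar with the rendering metric quoted in the BiNAR abstract, the following is a minimal sketch of how PSNR is conventionally computed and of what a gain in dB means in terms of mean squared error. It is not taken from the paper; the function names `psnr` and `mse_ratio_for_gain`, the flat-list image format, and the peak value of 1.0 are assumptions made for illustration.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images.

    Images are flat lists of pixel intensities (a hypothetical format chosen
    to keep the sketch dependency-free); max_value is the assumed peak value.
    """
    if len(reference) != len(rendered) or not reference:
        raise ValueError("images must be non-empty and equally sized")
    # Mean squared error over all pixels.
    mse = sum((r - p) ** 2 for r, p in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain at a fixed peak value."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    # A uniform error of 0.1 on a [0, 1] image gives about 20 dB.
    print(round(psnr([0.5] * 4, [0.6] * 4), 2))
    # MSE reduction factor implied by the 12.22 dB gain quoted above.
    print(round(mse_ratio_for_gain(12.22), 2))
```

Because PSNR is logarithmic in the mean squared error, the 12.22 dB improvement reported above corresponds to an MSE that is more than an order of magnitude lower, assuming the same peak value for both measurements.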
Spec-Gloss Surfels and Normal–Diffuse Priors for Relightable Glossy Objects
Georgios Kouros ⋅ Minye Wu ⋅ Tinne Tuytelaars
Accurate reconstruction and relighting of glossy objects remain a longstanding challenge, as object shape, material properties, and illumination are inherently difficult to disentangle. Existing neural rendering approaches often rely on simplified BRDF models or parameterizations that couple diffuse and specular components, which restricts faithful material recovery and limits relighting fidelity. We propose a relightable framework that integrates a microfacet BRDF with the specular–glossiness parameterization into 2D Gaussian Splatting with deferred shading. This formulation enables more physically consistent material decomposition, while diffusion-based priors for surface normals and diffuse color guide early-stage optimization and mitigate ambiguity. Furthermore, a coarse-to-fine optimization of the environment map accelerates convergence and preserves high-dynamic-range specular reflections. Extensive experiments on complex glossy scenes demonstrate that our method achieves high-quality geometry and material reconstruction and delivers substantially more realistic and consistent relighting under novel illumination compared to existing Gaussian splatting methods. The code will be released upon acceptance.
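The abstract above does not spell out the exact BRDF terms, so the following is only an illustrative sketch of two building blocks commonly found in microfacet models driven by a specular-glossiness parameterization. It assumes the GGX normal distribution, the Schlick Fresnel approximation, and the common convention roughness = 1 - glossiness; these choices and all function names are assumptions, not the authors' implementation.

```python
import math


def glossiness_to_alpha(glossiness):
    """Map glossiness in [0, 1] to the GGX alpha parameter.

    Assumed convention: perceptual roughness = 1 - glossiness, alpha = roughness^2.
    Alpha is clamped away from zero to avoid a division by zero for a perfect mirror.
    """
    roughness = 1.0 - glossiness
    return max(roughness * roughness, 1e-4)


def ggx_ndf(n_dot_h, glossiness):
    """GGX (Trowbridge-Reitz) normal distribution term D(h)."""
    alpha2 = glossiness_to_alpha(glossiness) ** 2
    denom = n_dot_h * n_dot_h * (alpha2 - 1.0) + 1.0
    return alpha2 / (math.pi * denom * denom)


def schlick_fresnel(specular, v_dot_h):
    """Schlick Fresnel approximation with the specular color as reflectance at normal incidence."""
    weight = (1.0 - v_dot_h) ** 5
    return tuple(f0 + (1.0 - f0) * weight for f0 in specular)


if __name__ == "__main__":
    # Higher glossiness concentrates the highlight around the surface normal.
    print(ggx_ndf(1.0, 0.8) > ggx_ndf(1.0, 0.2))
    # At grazing angles the reflectance approaches 1 regardless of the specular color.
    print(schlick_fresnel((0.04, 0.04, 0.04), 0.0))
```

In a specular-glossiness parameterization the specular color is stored separately from the diffuse color, which is one way to keep the two components decoupled, as the abstract describes.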
Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning
Lintao XU ⋅ Yinghao WANG ⋅ Chaohui Wang
Occlusion Boundary Estimation (OBE) identifies boundaries arising from both inter-object occlusions and self-occlusion within individual objects, distinguishing them from ordinary edges and semantic contours to support more accurate scene understanding. This task is closely related to Monocular Depth Estimation (MDE), which infers depth from a single image, as Occlusion Boundaries (OBs) provide critical geometric cues for resolving depth ambiguities, while depth can conversely refine occlusion reasoning. In this paper, we propose MoDOT, a novel method that jointly estimates depth and OBs from a single image for the first time. MoDOT incorporates a new module, CASM, which combines cross-attention and multi-scale strip convolutions to leverage mid-level OB features for improved depth prediction. It also includes an occlusion-aware loss, OBDCL, which encourages more accurate boundaries in the predicted depth map. Extensive experiments demonstrate the mutual benefits of jointly estimating depth and OBs, and validate the effectiveness of MoDOT's design. Our method achieves state-of-the-art (SOTA) performance on two synthetic datasets and the widely used NYUD-v2 real-world dataset, significantly outperforming multi-task baselines. Furthermore, the cross-domain results of MoDOT on real-world depth prediction, obtained with a model trained solely on our synthetic dataset, are promising: they preserve sharp OBs in the predicted depth maps and demonstrate improved geometric fidelity compared to competitors. We will release our code, pre-trained models, and dataset at [link].