Pointmap-Conditioned Diffusion for Consistent Novel View Synthesis
Abstract
Synthesizing extrapolated views remains a difficult task, especially in urban driving scenes, where the only reliable sources of data are forward-facing RGB captures and sparse LiDAR points. To address this problem, we present PointmapDiffusion, a novel view synthesis framework built on pre-trained 2D diffusion models. Our method uses pointmaps (i.e., rasterized 3D scene coordinates) as a conditioning signal, capturing geometric and photometric priors from the reference images to guide image generation. With our proposed reference attention blocks and a ControlNet branch for pointmap features, the model generates accurate and consistent results across varying viewpoints while respecting the underlying scene geometry. Experiments on real-life driving data demonstrate that PointmapDiffusion achieves high-quality generation with flexible control over the pointmap conditioning signal (e.g., dense depth maps or even sparse LiDAR points) and can be distilled into 3D representations such as 3D Gaussian Splatting to improve view extrapolation.
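To make the conditioning signal concrete, the sketch below shows one common way to build a pointmap: unprojecting a per-pixel depth map through known camera intrinsics and extrinsics into an (H, W, 3) image of world-space XYZ coordinates. This is a minimal illustration under assumed pinhole-camera conventions, not the paper's implementation; the function and argument names are hypothetical.

```python
# Minimal sketch (illustrative, not the paper's code): build a pointmap,
# i.e. an (H, W, 3) image of 3D scene coordinates, from a depth map.
import numpy as np

def depth_to_pointmap(depth: np.ndarray, K: np.ndarray, cam_to_world: np.ndarray) -> np.ndarray:
    """Unproject a depth map (H, W) into a pointmap (H, W, 3) of world-space XYZ.

    Args:
        depth: per-pixel metric depth, shape (H, W); zeros mark missing values
               (e.g. pixels without a LiDAR return).
        K: 3x3 pinhole intrinsics (assumed).
        cam_to_world: 4x4 camera-to-world extrinsic matrix (assumed).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))              # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # homogeneous pixel coords
    rays = pix @ np.linalg.inv(K).T                              # camera-space viewing rays
    pts_cam = rays * depth[..., None]                            # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((H, W, 1))], -1)    # homogeneous 3D points
    pts_world = pts_h @ cam_to_world.T                           # transform to world frame
    pointmap = pts_world[..., :3]
    pointmap[depth == 0] = 0.0                                   # keep missing pixels empty
    return pointmap
```

The resulting pointmap can then be rasterized from a target viewpoint and fed to the conditioning branch alongside the reference images; a sparse LiDAR sweep would simply yield a pointmap with many empty pixels.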