SAIL: Self-supervised Learning of Lighting-Invariant Representations from Real Images with Latent Diffusion
Abstract
Intrinsic image decomposition aims to separate an image into its underlying albedo and shading components, isolating the base color from lighting effects to enable downstream applications such as virtual relighting and scene editing. Despite the rise and success of learning-based approaches, intrinsic image decomposition from real-world images remains highly challenging due to the scarcity of labeled ground-truth data. Most existing solutions rely on synthetic data in supervised setups, which limits their ability to generalize to real-world scenes. Self-supervised methods, on the other hand, often produce albedo-like maps that retain reflections and lack consistency under different lighting conditions. To address this, we propose SAIL, an approach that estimates illumination-invariant representations from single-view real-world images, specifically targeting plausible relighting. We repurpose the prior knowledge of a latent diffusion model for unconditioned scene relighting as a surrogate objective for learning light-invariant estimates. To achieve this, we introduce a novel intrinsic image decomposition formulated entirely in the latent space. To guide the training of our latent diffusion model, we introduce regularization terms that constrain both the lighting-dependent and lighting-independent components of our latent image decomposition. Our experiments demonstrate that SAIL produces stable albedo-like representations under varying lighting conditions and generalizes to multiple scenes, using only unlabeled multi-illumination data available online.
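For context, the classical pixel-space formulation of intrinsic image decomposition assumes a Lambertian image model; we sketch the standard convention below (the notation $I$, $A$, $S$ is the common one from the literature, not necessarily this paper's, and SAIL instead formulates an analogous decomposition in latent space):

% Standard Lambertian intrinsic image model (literature convention;
% SAIL performs an analogous decomposition on latent representations).
% I: observed image, A: lighting-invariant albedo, S: shading,
% \odot: element-wise (Hadamard) product over pixels x.
\[
  I(x) \;=\; A(x) \odot S(x)
\]

Under this model, $A$ is the lighting-independent component the method seeks to keep stable across illuminations, while $S$ absorbs the lighting-dependent effects.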