Domain Generalizing DINO for Visual Regression via Latent Distractor Subspace Consistency
Abstract
Vision Foundation Models such as \dino~\cite{Dinov2} have demonstrated remarkable generalization in classification; however, their application to out-of-domain visual regression remains a significant and underexplored challenge. Unlike classification, domain generalization in regression poses distinct difficulties: regression produces continuous outputs and is particularly sensitive to high-variance, label-irrelevant factors (e.g., illumination, blur, or contrast) that can entangle with task-relevant features and induce spurious correlations. While recent regression methods~\cite{c-mixup,ranksim,circe,fds,conr} have shown promise, they typically rely on CNN backbones and require known distractors to be specified in advance, which demands significant domain expertise and fails to address spurious correlations that emerge during training. To address these challenges, we propose \proposedapproach, a \textbf{L}atent \textbf{D}istractor \textbf{S}ubspace \textbf{C}onsistency framework that disentangles intermediate feature representations into task-relevant and latent distractor subspaces and regularizes the latter under photometric perturbations, suppressing spurious correlations while preserving discriminative features during training. \proposedapproach is the first method to effectively adapt the powerful \dino backbone for domain-generalized visual regression, and it achieves state-of-the-art results on seven benchmark regression datasets, with percentage improvements of (41.75\%, 20.12\%, 52.05\%, 8.27\%, 22.21\%, 3.55\%) over state-of-the-art \DG regression methods. Source code is provided in the supplementary material.
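The core idea of the abstract, splitting backbone features into a task-relevant and a latent distractor subspace and penalizing perturbation-induced drift, can be sketched with a toy NumPy example. Everything here (dimensions, random projection matrices, the mean-squared drift penalty, the function name \texttt{subspace\_consistency\_loss}) is an illustrative assumption, not the paper's actual formulation, which would use learned projections inside a \dino backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only, not from the paper).
feat_dim, task_dim, dist_dim = 16, 10, 6

# Stand-in projection matrices onto the task-relevant and latent
# distractor subspaces; in the real method these would be learned.
P_task = rng.standard_normal((task_dim, feat_dim))
P_dist = rng.standard_normal((dist_dim, feat_dim))


def split_subspaces(f):
    """Project backbone features f (batch, feat_dim) into the
    task-relevant and latent distractor subspaces."""
    return f @ P_task.T, f @ P_dist.T


def subspace_consistency_loss(f_clean, f_aug):
    """Toy consistency penalty: the task-relevant projection of a
    photometrically perturbed image should match that of the clean
    image, so label-irrelevant variation is pushed into the
    distractor subspace rather than the task-relevant one."""
    z_clean, _ = split_subspaces(f_clean)
    z_aug, _ = split_subspaces(f_aug)
    return float(np.mean((z_clean - z_aug) ** 2))


# Simulated features for a batch of 4 images; additive noise stands
# in for a photometric perturbation such as brightness jitter or blur.
f = rng.standard_normal((4, feat_dim))
f_aug = f + 0.1 * rng.standard_normal((4, feat_dim))
loss = subspace_consistency_loss(f, f_aug)
```

In a training loop, a loss of this shape would be added to the regression objective so that the backbone's task-relevant subspace becomes invariant to photometric nuisance factors while the distractor subspace absorbs them.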