UniDiff: Parameter-Efficient Adaptation of Diffusion Models for Land Cover Classification with Multi-Modal Remotely Sensed Imagery and Sparse Annotations
Abstract
Sparse annotations fundamentally constrain multi-modal remote sensing: even recent state-of-the-art supervised methods such as MSFMamba are limited by the availability of labeled data, which restricts their practical deployment despite architectural advances. ImageNet-pretrained models provide rich visual representations, but adapting them to heterogeneous modalities such as hyperspectral imagery (HSI) and synthetic aperture radar (SAR) without large labeled datasets remains challenging. We propose UniDiff, a parameter-efficient framework that adapts a single ImageNet-pretrained diffusion model to multiple sensing modalities using only target-domain data. UniDiff combines FiLM-based timestep-modality conditioning, parameter-efficient adaptation of approximately 5\% of model parameters, and pseudo-RGB anchoring to preserve pre-trained representations and prevent catastrophic forgetting. This design enables effective feature extraction from remote sensing data under sparse annotations. Our experiments on two established multi-modal benchmark datasets demonstrate that unsupervised adaptation of a pre-trained diffusion model mitigates annotation constraints and achieves effective fusion of multi-modal remotely sensed data.
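As a minimal sketch of what the FiLM-based timestep-modality conditioning mentioned above could look like (notation is ours and is an assumption, not taken from the paper), each intermediate feature map $h$ of the diffusion backbone would be modulated by scale and shift vectors predicted jointly from the diffusion timestep $t$ and a learned modality embedding $e_m$:
\[
\mathrm{FiLM}(h \mid t, m) = \gamma\big([\psi(t);\, e_m]\big) \odot h + \beta\big([\psi(t);\, e_m]\big),
\]
where $\psi(t)$ denotes a sinusoidal timestep embedding, $[\cdot\,;\cdot]$ denotes concatenation, $\odot$ is element-wise multiplication, and $\gamma$, $\beta$ are small learned projections that, under this sketch, would be among the roughly 5\% of parameters updated during adaptation.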