MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
Abstract
Self-supervised learning (SSL) holds great promise for Earth observation (EO), but standard SSL methods must be adapted to the unique characteristics of EO data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and reconstruction target normalizations for multimodal, multitemporal, and multispectral data. Based on these findings, we propose MAESTRO, a novel adaptation of the Masked Autoencoder (MAE) that combines optimized fusion strategies with a target normalization scheme, introducing an effective multispectral prior as a self-supervisory signal to learn better deep representations. Evaluated on four diverse EO datasets, MAESTRO sets a new state of the art on tasks with strong multitemporal components, while remaining highly competitive on tasks dominated by a single mono-temporal modality. Code and pretrained models will be released publicly upon publication.