X-JEPA: A Joint Embedding Predictive Alignment Framework for Cross-Modal Remote Sensing Image Retrieval
Abstract
The growing scale and heterogeneity of remote sensing (RS) imagery demand robust, scalable frameworks for content-based image retrieval across sensor modalities. We introduce X-JEPA, a novel predictive self-supervised architecture designed explicitly for cross-modal remote sensing image retrieval (RS-CMIR) and the first to extend joint embedding predictive paradigms beyond unimodal domains. Unlike prior contrastive or reconstruction-based methods, X-JEPA formulates representation learning as a latent forecasting task: predicting the semantic embedding of a target modality from the context of another. To enforce modality-invariant alignment, we propose a geometry-aware Prediction Space Alignment (PSA) loss, which captures the structure of the latent space without requiring pixel-level reconstruction or modality pairing. We evaluate X-JEPA on two large-scale benchmarks, BEN-14K (Sentinel-1/Sentinel-2) and fMoW (RGB/Sentinel), on both unimodal and cross-modal retrieval tasks. X-JEPA consistently outperforms state-of-the-art self-supervised baselines, including MAE, SatMAE, CrossMAE, CSMAE-SESD, CROMA, SkySense, DeCUR, and REJEPA, achieving F1-score improvements of up to 11.0% in cross-modal retrieval and 9.8% in unimodal settings, with average gains of 8–10%. Despite this accuracy, the model remains lightweight, requiring fewer parameters than competing baselines, and establishes a new state of the art for scalable, sensor-agnostic RS-CMIR.
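For intuition, a JEPA-style cross-modal predictive objective of the kind sketched above can be written as follows; the notation is illustrative rather than the paper's own, assuming a context encoder $f_\theta$, an exponential-moving-average target encoder $\bar{f}_{\bar{\theta}}$, a latent predictor $g_\phi$, and co-registered inputs $x^{(a)}, x^{(b)}$ from two sensor modalities:

\[
\mathcal{L}_{\mathrm{pred}}
= \mathbb{E}_{(x^{(a)},\, x^{(b)})}
\left\| \, g_\phi\!\left( f_\theta\!\left( x^{(a)} \right) \right) - \bar{f}_{\bar{\theta}}\!\left( x^{(b)} \right) \right\|_2^2,
\qquad
\bar{\theta} \leftarrow \mu\,\bar{\theta} + (1 - \mu)\,\theta .
\]

The full training objective would additionally include the geometry-aware PSA term introduced above; its exact form is not reproduced in this sketch.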