

Oral Session

Oral Session 5B: Remote Sensing and Sensors

Mon 9 Mar 1:30 p.m. PDT — 2:30 p.m. PDT

Mon 9 March 13:30 - 13:42 PDT

CalibBEV: LiDAR-Camera Calibration via BEV Alignment

Filippo D'Addeo ⋅ Lorenzo Cipelli ⋅ Adriano Cardace ⋅ Emanuele Ghelfi ⋅ Andrea Zinelli ⋅ Massimo Bertozzi

We present $\textbf{CalibBEV}$, a novel Bird's Eye View (BEV) alignment approach for LiDAR-camera calibration. Our method unifies LiDAR and camera data into a shared 3D spatial representation, enabling accurate and robust cross-modal calibration. CalibBEV extracts sensor-wise BEV features from each modality using domain-specific architectures and estimates the calibration matrix through a two-step alignment process. First, we perform an implicit alignment by regressing a coarse calibration matrix directly from the BEV features. To ease this alignment, we enforce semantic consistency between BEV representations across modalities using a contrastive loss inspired by CLIP, guiding both networks toward a unified feature space. In the second step, we leverage our BEV formulation to explicitly align the features of one modality with the other, refining the initial coarse estimate into a final, more accurate calibration matrix. CalibBEV significantly outperforms prior point-to-pixel matching methods, achieving state-of-the-art calibration accuracy. On the KITTI and nuScenes benchmarks, our method reduces the Relative Rotation Error (RRE) by 51% and 68%, and the Relative Translation Error (RTE) by 80% and 91%, respectively, compared to previous methods.
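
The CLIP-inspired contrastive loss mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `clip_style_loss`, the temperature value, and the use of flat feature vectors in place of BEV feature maps are all assumptions made for illustration. The sketch shows the general CLIP-style recipe, a symmetric cross-entropy over the cosine-similarity matrix of paired embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    peak = max(row)
    lse = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - lse for x in row]

def clip_style_loss(lidar_feats, camera_feats, temperature=0.07):
    """Symmetric contrastive loss over paired features (hypothetical helper).

    lidar_feats[i] and camera_feats[i] are assumed to come from the same
    scene; every other pairing in the batch is treated as a negative.
    The temperature of 0.07 is an assumed value, not taken from the paper.
    """
    a = [l2_normalize(v) for v in lidar_feats]
    b = [l2_normalize(v) for v in camera_feats]
    n = len(a)
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # LiDAR -> camera direction: row i should pick column i.
    loss_l2c = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Camera -> LiDAR direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_c2l = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_l2c + loss_c2l)
```

With matching pairs the loss approaches zero, and it grows when the features of one modality are shuffled relative to the other, which is the pressure that pulls both encoders toward a shared feature space.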

Mon 9 March 13:42 - 13:54 PDT

X-JEPA: A Novel Joint Learning Cross-Modal Predictive Alignment Framework for Remote Sensing Image Retrieval

Shabnam Choudhury ⋅ Yash Salunkhe ⋅ Vaibhav Rajan ⋅ Subhasis Chaudhuri ⋅ Biplab Banerjee

The growing scale and heterogeneity of remote sensing (RS) imagery demand robust, scalable frameworks for content-based image retrieval across sensor modalities. We introduce X-JEPA, a novel predictive self-supervised architecture explicitly designed for cross-modal remote sensing image retrieval (RS-CMIR), and the first to extend joint embedding predictive paradigms beyond unimodal domains. Unlike prior contrastive or reconstruction-based methods, X-JEPA formulates representation learning as a latent forecasting task: predicting the semantic embedding of a target modality given context from another. To enforce modality-invariant alignment, we propose a geometry-aware Prediction Space Alignment (PSA) loss, which captures the structure of the latent space without requiring pixel-level reconstruction or modality pairing. We evaluate X-JEPA on two large-scale benchmarks, BEN-14K (Sentinel-1/Sentinel-2) and fMoW (RGB/Sentinel), across both unimodal and cross-modal retrieval tasks. X-JEPA consistently outperforms state-of-the-art self-supervised baselines, including MAE, SatMAE, CrossMAE, CSMAE-SESD, CROMA, SkySense, DeCUR, and REJEPA, achieving up to 11.0% F1-score improvement in cross-modal retrieval and 9.8% in unimodal settings. Despite its high retrieval accuracy, the model remains lightweight, requiring fewer parameters and yielding 8–10% F1-score gains on average, establishing a new state-of-the-art for scalable, sensor-agnostic RS-CMIR.

Mon 9 March 13:54 - 14:06 PDT

SSMRadNet: A Sample-wise State-Space Framework for Efficient and Ultra-Light Radar Segmentation and Object Detection

Anuvab Sen ⋅ Mir Sayeed Mohammad ⋅ Saibal Mukhopadhyay

We introduce SSMRadNet, the first multi-scale State Space Model (SSM) based detector for Frequency Modulated Continuous Wave (FMCW) radar that sequentially processes raw ADC samples through two SSMs. One SSM learns a chirp-wise feature by sequentially processing samples from all receiver channels within one chirp, and a second SSM learns a representation of a frame by sequentially processing chirp-wise features. The latent representations of a radar frame are decoded to perform segmentation and detection tasks. Comprehensive evaluations on the RADIal dataset show SSMRadNet has 10–33× fewer parameters and 60–88× less computation (GFLOPs) while being 3.7× faster than state-of-the-art transformer and convolution-based radar detectors at competitive performance for segmentation tasks.
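
The two-level sequential processing described in the abstract can be sketched with a toy linear recurrence. This is only an illustration under strong assumptions: the names `ssm_scan` and `encode_frame` are hypothetical, the recurrence h_t = decay * h_{t-1} + B x_t uses fixed hand-set parameters rather than learned ones, and the final hidden state stands in for whatever feature the actual model extracts.

```python
def ssm_scan(inputs, decay, in_proj):
    """Run a diagonal linear state-space recurrence and return the final state.

    For each input vector x_t the state is updated as
        h_t[k] = decay[k] * h_{t-1}[k] + dot(in_proj[k], x_t).
    """
    state = [0.0] * len(decay)
    for x in inputs:
        state = [
            decay[k] * state[k] + sum(w * xi for w, xi in zip(in_proj[k], x))
            for k in range(len(decay))
        ]
    return state

def encode_frame(frame, sample_ssm, chirp_ssm):
    """Encode one radar frame with two stacked scans (hypothetical layout).

    frame[c][s] is assumed to hold the raw ADC values of all receiver
    channels for sample s of chirp c. Each SSM is a (decay, in_proj) pair.
    """
    # First scan: samples within a chirp -> one chirp-wise feature.
    chirp_feats = [ssm_scan(chirp, *sample_ssm) for chirp in frame]
    # Second scan: chirp-wise features -> one frame representation.
    return ssm_scan(chirp_feats, *chirp_ssm)
```

In this sketch the frame representation returned by `encode_frame` would then be passed to task-specific decoders for segmentation and detection, which are omitted here.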

Mon 9 March 14:06 - 14:18 PDT

Rank-based Geographical Regularization: Revisiting Contrastive Self-Supervised Learning for Multispectral Remote Sensing Imagery

Tom Burgert ⋅ Leonard Hackel ⋅ Paolo Rota ⋅ Begüm Demir

Self-supervised learning (SSL) has become a powerful paradigm for learning from large, unlabeled datasets, particularly in computer vision (CV). However, applying SSL to multispectral remote sensing (RS) images presents unique challenges and opportunities due to the geographical and temporal variability of the data. In this paper, we introduce GeoRank, a novel regularization method for contrastive SSL that improves upon prior techniques by directly optimizing spherical distances to embed geographical relationships into the learned feature space. GeoRank outperforms or matches prior methods that integrate geographical metadata and consistently improves diverse contrastive SSL algorithms (e.g., BYOL, DINO). Beyond this, we present a systematic investigation of key adaptations of contrastive SSL for multispectral RS images, including the effectiveness of data augmentations, the impact of dataset cardinality and image size on performance, and the task dependency of temporal views. All code and models will be made publicly available upon acceptance.
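
The spherical distances referred to in the abstract can be computed with the standard haversine formula, sketched below. How GeoRank turns these distances into a regularizer is not described in the abstract, so the `rank_by_distance` helper is purely a hypothetical illustration of ordering samples by geographic proximity, and the mean Earth radius of 6371 km is an assumed constant.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp guards against rounding pushing the value marginally above 1.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))

def rank_by_distance(anchor, others):
    """Return indices of `others` ordered from geographically nearest to farthest.

    anchor and each entry of others are (lat, lon) tuples in degrees.
    """
    dists = [haversine_km(anchor[0], anchor[1], lat, lon) for lat, lon in others]
    return sorted(range(len(others)), key=lambda i: dists[i])
```

A ranking of this kind gives a geographic ordering of image locations that a regularizer could ask the learned feature distances to respect.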