Beyond Vision: Multimodal Perspectives for Cross-View Geo-Localization
Abstract
The increasing availability of geospatial data from heterogeneous modalities, including aerial and satellite imagery, ground-level views, and textual descriptions, has made cross-view geo-localization a critical research area with applications in autonomous navigation, urban monitoring, and augmented reality. Despite progress, challenges remain in handling extreme viewpoint variations, scaling across diverse domains, and integrating multimodal information. Recent developments in multimodal learning and Generative AI, such as Large Multimodal Models (LMMs), have introduced new paradigms for geo-localization. LMMs enable more generalized cross-view matching by incorporating language as an additional modality, supporting tasks such as text-based geo-localization, scene description, and multimodal reasoning. These capabilities not only improve performance but also expand the scope of cross-view geo-localization to broader multimodal applications. This tutorial provides a comprehensive overview of these developments, highlighting the latest methodologies, datasets, and open research directions that are shaping the future of cross-view geo-localization.