Towards Unconstrained Cross-View Pose Estimation
Abstract
Cross-view pose estimation entails predicting the 3 Degrees-of-Freedom (3DoF) pose of a ground-view image relative to an aerial view. Existing work focuses on imagery captured in controlled settings with highly constrained camera parameters. In contrast, a wide variety of camera parameterizations appear in-the-wild across the tasks where such estimation is useful. To address this gap, we propose a method capable of performing cross-view pose estimation in these less constrained scenarios, handling ground-view images of unknown field of view (FoV), pitch, roll, and projection type (panoramic or rectilinear). Specifically, our method avoids common assumptions, such as the gravity/horizon alignment required by geometry-based projections, and instead relies purely on a transformer to learn cross-view relationships in a data-driven manner, paired with prediction modules that enable continuous querying of the pose search space. Evaluations on the VIGOR benchmark demonstrate that our approach performs competitively with the state of the art while maintaining performance in the harder, less constrained scenarios. This supports our work as the first generalized approach to this task capable of operating on less-constrained imagery. The code will be made available at a later date.