RegionAligner: Bridging Ego-Exo Views for Object Correspondence via Unified Text-Visual Learning
Abstract
Establishing object correspondence between egocentric (ego) and exocentric (exo) views is a critical capability for robotic learning and human-robot interaction. The core task is to segment an object in one view given a query mask from the other view. This is notoriously difficult: scenes are cluttered with task-irrelevant objects, and object appearance changes drastically across perspectives. To address this, we introduce RegionAligner, a unified text-visual framework that focuses learning on task-relevant regions. Our method first uses a large vision-language model to identify and name salient objects, effectively filtering out visual distractors. These object phrases are then fused with visual features from both views. We further propose a region-guided supervision strategy that promotes focus on task-relevant regions, enforces spatial alignment, and minimizes the appearance disparity between the ego and exo views. Our framework also adapts seamlessly to unsupervised settings by automatically generating pseudo-labels from matched mask proposals, drastically reducing annotation costs. Extensive experiments on the challenging Ego-Exo4D dataset show that RegionAligner significantly outperforms existing baselines, improving IoU by 10.16% (ego-to-exo) and 6.04% (exo-to-ego).
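To make the pipeline concrete, below is a minimal PyTorch sketch of the text-visual fusion and mask-prediction stages summarized above. All names here (CrossViewFusion, RegionAlignerSketch, phrase_emb, etc.) are illustrative assumptions, not the authors' implementation; the actual backbone, vision-language model, and region-guided losses are specified in the method section.

```python
# Hedged sketch only: module names and shapes are assumptions, not the
# paper's released code. Object phrases are assumed already embedded by a VLM.
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """Fuse object-phrase tokens with one view's visual tokens:
    text tokens attend to visual tokens via cross-attention."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (B, T, D); visual_tokens: (B, N, D)
        fused, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        return self.norm(text_tokens + fused)

class RegionAlignerSketch(nn.Module):
    """High-level flow: phrase embeddings (salient objects named by a VLM)
    are grounded in the query view, aligned to the target view, and matched
    against target-view tokens to produce mask logits."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse_query = CrossViewFusion(dim)
        self.fuse_target = CrossViewFusion(dim)

    def forward(self, phrase_emb, query_feat, target_feat):
        # phrase_emb:  (B, T, D) embeddings of object names from the VLM
        # query_feat:  (B, N, D) flattened features of the view with the query mask
        # target_feat: (B, M, D) flattened features of the view to be segmented
        q = self.fuse_query(phrase_emb, query_feat)    # ground phrases in query view
        q = self.fuse_target(q, target_feat)           # align phrases to target view
        # Dot product between fused phrase queries and target-view tokens gives
        # per-token logits; max over phrases yields one mask per image.
        logits = torch.einsum('btd,bmd->btm', q, target_feat)
        return logits.max(dim=1).values                # (B, M) mask logits

# Usage with random tensors (batch 2, 5 phrases, 196 visual tokens, dim 256):
model = RegionAlignerSketch(dim=256)
mask_logits = model(torch.randn(2, 5, 256),
                    torch.randn(2, 196, 256),
                    torch.randn(2, 196, 256))
```

In this sketch, filtering distractors happens implicitly: only the VLM-named object phrases serve as queries, so tokens unrelated to those phrases contribute little to the final mask logits.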