KMOPS: Keypoint-Driven Method for Multi-Object Pose and Metric Size Estimation from Stereo Images
Abstract
Estimating the six-degree-of-freedom (6-DoF) pose and metric size of multiple objects from RGB images alone remains challenging, particularly because of large variations in object shape and appearance and frequent occlusions in complex scenes. To address these challenges, we introduce KMOPS, a Keypoint-driven method for occlusion-robust Multi-Object Pose and metric Size estimation from a single calibrated stereo image pair. Leveraging the stereo input, our approach first extracts the 2D keypoints of each object’s enclosing bounding box in each view and then triangulates them to obtain their 3D positions. A pose fitting module then recovers each object’s rotation, translation, and dimensions by registering the triangulated 3D keypoints to their canonical counterparts using a closed-form weighted Procrustes alignment. This formulation eliminates the need for predefined 3D search spaces or volumetric anchors, which other methods often require to constrain the vast 3D solution space. In extensive experiments on the challenging Transparent Object Dataset (TOD) and the large-scale StereOBJ-1M benchmark, the proposed method consistently achieves state-of-the-art results, outperforming other monocular and stereo methods with a simple and effective architecture.
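The triangulate-then-register idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a rectified stereo pair with a shared pinhole model (focal length `f`, principal point `cx`, `cy`, and `baseline` are assumed parameters), eight box-corner keypoints given in a fixed order, and per-keypoint confidence weights. Estimating the dimensions from mean edge lengths before a weighted Kabsch alignment is an assumption of this sketch, and all function names are hypothetical.

```python
import itertools
import numpy as np

# Canonical keypoints (assumption): corners of a unit cube centred at the origin,
# ordered so that corner index = 4*bx + 2*by + bz with bits in {0, 1}.
UNIT_CORNERS = np.array(list(itertools.product([-0.5, 0.5], repeat=3)))

def triangulate_rectified(uv_left, uv_right, f, cx, cy, baseline):
    """Triangulate matched pixels from a rectified stereo pair (left-camera frame)."""
    uv_left = np.asarray(uv_left, dtype=float)
    uv_right = np.asarray(uv_right, dtype=float)
    disparity = uv_left[:, 0] - uv_right[:, 0]
    z = f * baseline / disparity
    x = (uv_left[:, 0] - cx) * z / f
    y = (uv_left[:, 1] - cy) * z / f
    return np.stack([x, y, z], axis=1)

def box_dimensions(corners):
    """Estimate box extents as the mean length of the four parallel edges per axis."""
    dx = np.linalg.norm(corners[4:] - corners[:4], axis=1).mean()
    dy = np.linalg.norm(corners[[2, 3, 6, 7]] - corners[[0, 1, 4, 5]], axis=1).mean()
    dz = np.linalg.norm(corners[1::2] - corners[0::2], axis=1).mean()
    return np.array([dx, dy, dz])

def weighted_procrustes(canonical, observed, weights):
    """Closed-form weighted rigid fit so that observed ~= R @ canonical + t."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mu_c = w @ canonical                      # weighted centroids
    mu_o = w @ observed
    xc = canonical - mu_c
    xo = observed - mu_o
    h = (xc * w[:, None]).T @ xo              # 3x3 weighted cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))    # guard against reflections
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    trans = mu_o - rot @ mu_c
    return rot, trans

def fit_box_pose(uv_left, uv_right, weights, f, cx, cy, baseline):
    """Triangulate the corner keypoints, then recover rotation, translation, size."""
    corners = triangulate_rectified(uv_left, uv_right, f, cx, cy, baseline)
    dims = box_dimensions(corners)            # unweighted, for simplicity
    rot, trans = weighted_procrustes(UNIT_CORNERS * dims, corners, weights)
    return rot, trans, dims
```

In this sketch the weights only influence the rigid alignment; a fuller treatment might also down-weight unreliable corners when estimating the dimensions.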