Modeling and Learning Multiple Hypotheses for Monocular 3D Object Detection
Hyeonjeong Park · Peixi Xiong · Pei Yu · Wei Tang
Abstract
Detecting objects in 3D space from a monocular image is an inherently ill-posed problem: multiple plausible 3D bounding boxes can explain the same 2D observation of an object. Existing approaches typically follow a single-point prediction paradigm, failing to capture this multimodal nature and often regressing to an implausible mean solution. This paper introduces MonoMH, a novel multi-hypothesis framework for monocular 3D object detection. By explicitly modeling and learning the multimodal distribution of plausible 3D object configurations, MonoMH not only significantly improves detection performance but also provides richer information to support downstream decision-making. MonoMH introduces three key innovations: (1) a novel multi-hypothesis predictor that leverages spatially diverse features across different windows within an RoI to generate a rich variety of hypotheses without increasing model complexity; (2) a new multi-hypothesis learning approach that derives diverse and relevant hypotheses from single-modal ground truth by integrating uncertainty modeling with best-of-many learning; and (3) a novel adaptive hypothesis filtering mechanism that enhances detection capability by dynamically retaining a variable number of plausible hypotheses based on each object's uncertainty. Experimental results demonstrate the effectiveness of our approach. Notably, MonoMH achieves 29.12 (easy), 20.88 (mod.), and 17.93 (hard) Car AP$_{3D}$ on the KITTI test set, a significant improvement over previous state-of-the-art methods. We will make the code publicly available.
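To make the learning idea in innovation (2) concrete, the sketch below illustrates a generic best-of-many objective combined with per-hypothesis aleatoric uncertainty, a common formulation in the literature. It is not the authors' implementation; the function name, the Laplace likelihood choice, and all tensor names are assumptions introduced purely for illustration.

```python
# Illustrative sketch only (not the paper's code): a best-of-many loss over K
# hypotheses, each with predicted heteroscedastic (aleatoric) uncertainty.
import torch

def best_of_many_uncertainty_loss(preds, log_sigmas, target):
    """preds: (K, D) hypothesis regressions; log_sigmas: (K, D) predicted log scales;
    target: (D,) single-modal ground truth. Returns the loss of the best hypothesis."""
    # Laplace negative log-likelihood per hypothesis (uncertainty-weighted L1)
    nll = (torch.abs(preds - target) * torch.exp(-log_sigmas) + log_sigmas).sum(dim=1)  # (K,)
    # "Best of many": back-propagate only through the closest-matching hypothesis,
    # so the remaining hypotheses stay free to cover other plausible modes
    return nll.min()

# Hypothetical usage: K = 4 hypotheses for a 7-DoF 3D box (x, y, z, w, h, l, yaw)
K, D = 4, 7
preds = torch.randn(K, D, requires_grad=True)
log_sigmas = torch.zeros(K, D, requires_grad=True)
target = torch.randn(D)
loss = best_of_many_uncertainty_loss(preds, log_sigmas, target)
loss.backward()
```

Under this kind of objective, only the winning hypothesis receives gradient for a given ground-truth box, which is what allows a set of hypotheses to diversify even though each object has a single annotated configuration; the predicted uncertainties can then drive a filtering step such as the adaptive mechanism described in innovation (3).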