Track: Oral Session 3B: Image Recognition and Understanding I

Sun 8 March 15:00 - 15:12 PDT

Layout Anything: One Transformer for Universal Room Layout Estimation

Md Sohag Mia ⋅ Muhammad Abdullah Adnan

We present Layout Anything, a transformer-based framework for indoor layout estimation that adapts OneFormer's universal segmentation architecture to geometric structure prediction. Our approach integrates OneFormer's task-conditioned queries and contrastive learning with two key modules: (1) a layout degeneration strategy that augments training data while preserving Manhattan-world constraints through topology-aware transformations, and (2) differentiable geometric losses that directly enforce planar consistency and sharp boundary predictions during training. By unifying these components in an end-to-end framework, the model eliminates complex post-processing pipelines while achieving real-time inference at 114 ms. Extensive experiments demonstrate state-of-the-art performance across standard benchmarks, with a pixel error (PE) of 5.43% and a corner error (CE) of 4.02% on the LSUN dataset, and a PE of 7.04% and a CE of 5.17% on the Hedau dataset. The framework's combination of geometric awareness and computational efficiency makes it particularly suitable for augmented reality applications and large-scale 3D scene reconstruction tasks.
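For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how pixel error and corner error are commonly computed in room layout benchmarks. It is not taken from the paper: the function names are hypothetical, the predicted surface labels are assumed to be already aligned with the ground-truth labels, and the predicted corners are assumed to be already matched one-to-one with the ground-truth corners.

```python
import numpy as np


def pixel_error(pred_labels, gt_labels):
    """Fraction of pixels whose predicted surface label differs from ground truth.

    Assumption: labels are already aligned (no label permutation matching here).
    """
    pred = np.asarray(pred_labels)
    gt = np.asarray(gt_labels)
    return float(np.mean(pred != gt))


def corner_error(pred_corners, gt_corners, height, width):
    """Mean Euclidean corner distance, normalized by the image diagonal.

    Assumption: corners are (x, y) rows, already matched one-to-one in order.
    """
    pred = np.asarray(pred_corners, dtype=float)
    gt = np.asarray(gt_corners, dtype=float)
    distances = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(distances) / np.hypot(height, width))


# Toy usage: a 4x4 label map with 4 mislabeled pixels, and two corners.
gt_map = np.zeros((4, 4), dtype=int)
pred_map = gt_map.copy()
pred_map[0, :] = 1  # mislabel the first row
pe = pixel_error(pred_map, gt_map)
ce = corner_error([[0, 0], [3, 4]], [[0, 0], [0, 0]], height=3, width=4)
```

Both values are fractions; multiplying by 100 gives percentages in the form quoted in the abstract.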

Sun 8 March 15:12 - 15:24 PDT

BOP-Distrib: Revisiting 6D Pose Estimation Benchmarks for Better Evaluation under Visual Ambiguities

Boris Meden ⋅ Asma Brazi ⋅ Fabrice Mayran de Chamisso ⋅ Steve Bourgeois ⋅ Vincent Lepetit

6D pose estimation aims at determining the object pose that best explains the camera observation. The unique solution for non-ambiguous objects can turn into a multi-modal pose distribution for symmetrical objects or, depending on the viewpoint, when symmetry-breaking elements are occluded. Currently, 6D pose estimation methods are benchmarked on datasets whose ground truth annotations treat visual ambiguities as related only to global object symmetries, whereas they should be defined per image to account for the camera viewpoint. We thus first propose an automatic method to re-annotate those datasets with a 6D pose distribution specific to each image, taking into account the object surface visibility in the image to correctly determine the visual ambiguities. Second, given this improved ground truth, we re-evaluate the state-of-the-art single pose methods and show that this greatly modifies the ranking of these methods. Third, as some recent works focus on estimating the complete set of solutions, we derive a precision/recall formulation to evaluate them against our image-wise distribution ground truth, making it the first benchmark for pose distribution methods on real images.
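The abstract does not spell out its precision/recall formulation, so the sketch below only illustrates the general idea of scoring a predicted set of poses against a per-image set of ground-truth modes. Everything in it is an assumption made for illustration: it considers rotations only, uses the geodesic angle as the distance, and applies a hypothetical threshold `tau`. It should not be read as the benchmark's actual metric.

```python
import numpy as np


def rot_z(angle):
    """Rotation matrix about the z-axis (helper for the toy example)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def geodesic_angle(R1, R2):
    """Angle in radians of the relative rotation between R1 and R2."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def precision_recall(pred_rotations, gt_rotations, tau):
    """Set-based precision/recall under an assumed angular threshold tau.

    precision: share of predictions within tau of at least one ground-truth mode.
    recall: share of ground-truth modes within tau of at least one prediction.
    """
    d = np.array([[geodesic_angle(p, g) for g in gt_rotations]
                  for p in pred_rotations])
    precision = float(np.mean(d.min(axis=1) <= tau))
    recall = float(np.mean(d.min(axis=0) <= tau))
    return precision, recall


# Toy usage: an object with two ground-truth modes (0 and 180 degrees about z),
# and two predictions, only one of which is close to a mode.
gt = [rot_z(0.0), rot_z(np.pi)]
pred = [rot_z(0.01), rot_z(np.pi / 2)]
precision, recall = precision_recall(pred, gt, tau=0.1)
```

In this toy case one of the two predictions lands near a mode and one of the two modes is covered, so both scores come out at one half. A single-pose method can reach full precision but cannot exceed a recall of one half here, which is the kind of gap a distribution-aware evaluation is meant to expose.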

Sun 8 March 15:24 - 15:36 PDT

Cosine Similarity is Almost All You Need (for Prototypical-Part Models)

Luke Moffett ⋅ Frank Willard ⋅ Maximillian Machado ⋅ Emmanuel Mokel ⋅ Jon Donnelly ⋅ Zhicheng Guo ⋅ Adam Costarino ⋅ Julia Yang ⋅ Giyoung Kim ⋅ Alina Barnett ⋅ Cynthia Rudin

Prototypical-part networks are a popular interpretable alternative to black-box deep learning models for computer vision because of their faithful, prototype-based self-explanations. However, in practice, they have proven difficult to train because they are highly sensitive to hyperparameter tuning and difficult to comprehend because they contain a large number of prototypes. We show that replacing $\ell_2$ distance with an angular prototype similarity in the original ProtoPNet greatly improves robustness to hyperparameter selection and is sufficient to produce accuracy and sparsity competitive with the state of the art on many backbones and datasets. We also show that cosine similarity leads to superior accuracy for five different ProtoPNet architectures (ProtoPNet, TesNet, Deformable ProtoPNet, ProtoTree, and ST-ProtoPNet). Finally, we demonstrate that ProtoPNet with cosine similarity produces better semantics than $\ell_2$: prototypes from cosine models score better on prototype quality metrics and are perceived as more similar 3:2 in a user study.
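As a rough illustration of the swap the abstract describes, the sketch below compares an $\ell_2$ distance map with a cosine similarity map between a grid of feature vectors and a set of prototypes. It is a NumPy sketch under assumed shapes (features of shape `(H, W, D)`, prototypes of shape `(P, D)`), with hypothetical function names, and it is not the authors' implementation.

```python
import numpy as np


def cosine_similarity_map(features, prototypes, eps=1e-8):
    """Cosine similarity of every spatial feature to every prototype.

    features: (H, W, D); prototypes: (P, D); returns (P, H, W).
    """
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + eps)
    p = prototypes / (np.linalg.norm(prototypes, axis=-1, keepdims=True) + eps)
    return np.einsum("hwd,pd->phw", f, p)


def l2_distance_map(features, prototypes):
    """Euclidean distance of every spatial feature to every prototype.

    features: (H, W, D); prototypes: (P, D); returns (P, H, W).
    """
    diff = features[None, :, :, :] - prototypes[:, None, None, :]
    return np.linalg.norm(diff, axis=-1)


# Toy usage: rescaling the features leaves the cosine map (almost) unchanged
# but changes the l2 map.
rng = np.random.default_rng(0)
features = rng.normal(size=(7, 7, 16))
prototypes = rng.normal(size=(5, 16))
sim = cosine_similarity_map(features, prototypes)
sim_scaled = cosine_similarity_map(3.0 * features, prototypes)
dist = l2_distance_map(features, prototypes)
dist_scaled = l2_distance_map(3.0 * features, prototypes)
```

The cosine map depends only on the angle between a feature and a prototype, so it is insensitive to feature magnitude, whereas the $\ell_2$ map is not. This invariance is one plausible reason an angular similarity could be less sensitive to training hyperparameters, though the abstract itself does not state the mechanism.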

Sun 8 March 15:36 - 15:48 PDT

Orca: Object Recognition and Comprehension for Archiving Marine Species

Yuk Kwan Wong ⋅ Liang Haixin ⋅ Zeyu Ma ⋅ Yiwei Chen ⋅ Ziqiang Zheng ⋅ Rinaldi Gotama ⋅ Pascal Sebastian ⋅ Lauren Sparks ⋅ Sai-Kit Yeung

Marine visual understanding is essential for monitoring and protecting marine ecosystems, enabling automatic and scalable biological surveys. However, progress is hindered by limited training data and the lack of a systematic task formulation that aligns domain-specific marine challenges with well-defined computer vision tasks, thereby limiting effective model application. To address this gap, we present Orca, a multi-modal benchmark for marine research comprising 14,647 images from 478 species, with 42,217 bounding box annotations and 22,321 expert-verified instance captions. The dataset provides fine-grained visual and textual annotations that capture morphology-oriented attributes across diverse marine species. To catalyze methodological advances, we evaluate 18 state-of-the-art models on three tasks: object detection (closed-set and open-vocabulary), instance captioning, and visual grounding. Results highlight key challenges, including species diversity, morphological overlap, and specialized domain demands, underscoring the difficulty of marine understanding. Orca thus establishes a comprehensive benchmark to advance research in the marine domain.