Track: Oral Session 4A: Image Recognition and Understanding II

Mon 9 March 9:45 - 9:57 PDT

Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing

Saarthak Kapse ⋅ Robin Betz ⋅ Srinivasan Sivanandan

State Space Models (SSMs) with selective scan (Mamba) have been adapted into efficient vision models. Mamba, unlike Vision Transformers, achieves linear complexity for token interactions through a recurrent hidden state process. This sequential processing is enhanced by a parallel scan algorithm, which reduces the computational time of the recurrence from $L$ sequential steps to $\log(L)$ parallel steps, where $L$ is the number of input tokens. In this work, we propose Fast Vision Mamba (FastVim), which further reduces the computational time of the SSM block by reducing the number of recurrent steps in Vision Mamba models while still retaining model performance. By alternately pooling tokens along image dimensions across Mamba blocks, we obtain a 2$\times$ reduction in the number of parallel steps in the SSM block. Our model offers up to a 72.5% inference speedup compared to baseline Vision Mamba models on high-resolution (2048$\times$2048) images. Our experiments demonstrate state-of-the-art performance with dramatically improved throughput in a range of tasks such as image classification, cell perturbation prediction, segmentation, and object detection.
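The step-count claim in this abstract can be illustrated with a minimal sketch of a parallel scan over a scalar linear recurrence h_t = a_t * h_{t-1} + b_t. This is not the FastVim or Mamba implementation: the scalar recurrence stands in for the selective SSM, and the names `combine`, `sequential_scan`, and `parallel_scan` are hypothetical, chosen only for illustration.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b):
    """Reference recurrence: L dependent steps, starting from h = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def parallel_scan(a, b):
    """Hillis-Steele inclusive scan over the affine maps.

    Every pass of the while loop could run in parallel across positions,
    so the number of passes is the number of parallel steps.
    """
    elems = list(zip(a, b))
    steps = 0
    offset = 1
    while offset < len(elems):
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(len(elems))
        ]
        offset *= 2
        steps += 1
    # With h_0 = 0, the hidden state is the offset term of each composed map.
    return [b_t for _, b_t in elems], steps
```

In this sketch, 16 tokens need 4 passes and 4 tokens need 2. As an assumption about how the 2x figure might arise, pooling an N x N token grid along one image dimension leaves N tokens, so log2(N^2) = 2 * log2(N) passes become log2(N).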

Mon 9 March 9:57 - 10:09 PDT

Extreme Amodal Face Detection

Changlin Song ⋅ Yunzhong Hou ⋅ Michael Barnes ⋅ Rahul Shome ⋅ Dylan Campbell

Extreme amodal detection is the task of inferring the 2D location of objects that are not fully visible in the input image but are visible within an expanded field-of-view. This differs from amodal detection, where the object is partially visible within the input image, but is occluded. In this paper, we consider the sub-problem of face detection, since this class provides motivating applications involving safety and privacy, but do not tailor our method specifically to this class. Existing approaches rely on image sequences so that missing detections may be interpolated from surrounding frames or make use of generative models to sample possible completions. In contrast, we consider the single-image task and propose a more efficient, sample-free approach that makes use of the contextual cues from the image to infer the presence of unseen faces. We design a heatmap-based extreme amodal object detector that addresses the problem of efficiently predicting a lot (the out-of-frame region) from a little (the image) with a selective coarse-to-fine decoder. Our method establishes strong results for this new task, even outperforming less efficient generative approaches.

Mon 9 March 10:09 - 10:21 PDT

ENCORE: A Neural Collapse Perspective on Out-of-Distribution Detection in Deep Neural Networks

A. Q. M. Sazzad Sayyed ⋅ Nathaniel Bastian ⋅ Francesco Restuccia

Out-of-Distribution (OOD) detection is of paramount importance in guaranteeing safe and reliable deployment of a Deep Neural Network (DNN) model in real-world settings. However, most OOD detection approaches still lack motivation rooted in established properties of the DNNs. This disconnect between the proposed approach and theoretical underpinning to measurable DNN properties makes these approaches unreliable. To bridge this gap, we take a different perspective to using energy scoring for OOD detection. Specifically, we look at energy score through the lens of the properties of neural collapse and observe that simple feature scaling can improve the separation between In-Distribution (ID) and OOD inputs. Based on this observation, we propose ENCORE, which scales features of each input adaptively and uses them to obtain modified logits based on insights from theory of neural collapse. We show that ENCORE outperforms state-of-the-art approaches across a variety of benchmarks; for example, by 1.37% on CIFAR10 and by 1.07% on ImageNet benchmarks.
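For readers unfamiliar with energy scoring, the following is a minimal sketch of the standard energy score (the negative log-sum-exp of the logits) computed on logits from scaled features. It is not ENCORE itself: the abstract does not give the adaptive scaling rule, so the fixed `scale` argument is a placeholder assumption, and the function names and the linear classifier head are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def logits_from_features(features, weights, biases, scale=1.0):
    """Linear head applied to scaled penultimate features.

    `scale` is a fixed placeholder; how a per-input scale would be
    chosen adaptively is not specified in the abstract.
    """
    scaled = [scale * f for f in features]
    return [
        sum(w * f for w, f in zip(row, scaled)) + b
        for row, b in zip(weights, biases)
    ]


def energy_score(logits):
    """Standard energy score: lower values suggest in-distribution inputs."""
    return -logsumexp(logits)
```

A typical use would threshold `energy_score` on held-out data: inputs whose energy exceeds the threshold are flagged as OOD. In this toy setup, increasing `scale` on a confidently classified feature vector lowers its energy.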

Mon 9 March 10:21 - 10:33 PDT

Performance of Conformal Prediction in Capturing Aleatoric Uncertainty

Misgina Tsighe Hagos ⋅ Claes Lundström

Conformal prediction is a model-agnostic approach to generating prediction sets that cover the true class with a high probability. Although its prediction set size is expected to capture aleatoric uncertainty, there is a lack of evidence regarding its effectiveness. The literature suggests that prediction set size can upper-bound aleatoric uncertainty or that prediction sets are larger for difficult instances and smaller for easy ones, but a validation of this attribute of conformal predictors is missing. This work investigates how effectively conformal predictors quantify aleatoric uncertainty, specifically the inherent ambiguity in datasets caused by overlapping classes. We do this by measuring the correlation between prediction set sizes and the number of distinct labels assigned by human annotators per instance. We further assess the similarity between prediction sets and human-provided annotations. We use three conformal prediction approaches to generate prediction sets for eight deep learning models trained on four datasets. The datasets contain annotations from multiple human annotators (ranging from five to fifty participants) per instance, enabling the identification of class overlap. We show that the vast majority of the conformal prediction outputs show a very weak to weak correlation with human annotations, with only a few showing moderate correlation. These findings underscore the necessity of critically reassessing the prediction sets generated using conformal predictors. While they can provide a higher coverage of the true classes, their capability in capturing aleatoric uncertainty remains limited.
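As background for the prediction sets discussed in this abstract, here is a minimal sketch of split conformal prediction for classification, using one minus the true-class probability as the nonconformity score. The abstract does not name the three conformal approaches evaluated, so this particular score is an assumption, and the function names are hypothetical.

```python
import math


def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Return the conformal score threshold from a held-out calibration set.

    The score is 1 - p(true class); the threshold is the
    ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: include every class.
        return 1.0
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All classes whose score 1 - p does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]
```

The size of `prediction_set(...)` for each test instance is the quantity that the abstract compares against the number of distinct labels assigned by human annotators.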