Robust and scalable visual out-of-distribution detection via label name mining using CLIP models
Abstract
Large-scale visual out-of-distribution (OOD) detection has witnessed remarkable progress by leveraging vision-language models such as CLIP. However, existing methods require access to the in-distribution ground-truth label names (positives), while widely available text corpora are used only to mine unrelated concepts (negatives) that are likely OOD. In this work, we present a general framework for mining both positive and negative concepts from a text corpus. Additionally, we propose a novel label mining method, ClusterMine, the first to achieve state-of-the-art OOD detection performance when ground-truth label names are inaccessible. ClusterMine extracts in-distribution-related concepts from a large text corpus by combining zero-shot CLIP inference with visual sample consistency. Our extensive experimental study reveals that ClusterMine i) scales across a wide range of CLIP models, and ii) achieves state-of-the-art robustness to in-distribution covariate shifts on average. The code will be released.
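To make the mining idea concrete, below is a minimal Python sketch (an illustration under stated assumptions, not the paper's exact ClusterMine algorithm): candidate concepts from a corpus are scored against unlabeled in-distribution images via zero-shot CLIP inference, and only concepts that images consistently vote for are kept as positives. The candidate list, prompt template, and 5% consistency threshold are illustrative assumptions.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical inputs: a candidate word list from a large text corpus
# (e.g., WordNet nouns) and a batch of unlabeled in-distribution images
# already preprocessed to CLIP's input format (placeholders here).
candidate_concepts = ["dog", "airplane", "sofa", "galaxy"]
images = torch.randn(8, 3, 224, 224, device=device).type(model.dtype)

with torch.no_grad():
    text = clip.tokenize([f"a photo of a {c}" for c in candidate_concepts]).to(device)
    text_feat = model.encode_text(text)
    img_feat = model.encode_image(images)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    # Zero-shot assignment: each image votes for its most similar concept.
    sims = img_feat @ text_feat.T          # (num_images, num_concepts)
    votes = sims.argmax(dim=-1)

# Visual sample consistency: keep concepts that a sufficient fraction of
# images consistently select, and treat them as mined positives
# (in-distribution label names). The threshold is an illustrative choice.
counts = torch.bincount(votes, minlength=len(candidate_concepts))
positives = [c for c, n in zip(candidate_concepts, counts.tolist())
             if n / len(images) >= 0.05]
print("mined positive concepts:", positives)
```

The mined positives can then play the role that ground-truth label names play in existing CLIP-based OOD detectors, while the remaining corpus concepts serve as negatives.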