Sea-CLIP: Mining Semantic-Aware Representations for Few-Shot Anomaly Detection with CLIP
Abstract
Few-shot Anomaly Detection (FSAD) is a challenging computer vision task, and recent FSAD methods utilize the pre-trained vision-language model, \textit{i.e.}, CLIP, to achieve remarkable performance. However, existing CLIP-based methods overlook object semantics, a crucial cue that could guide comparisons between semantically consistent patches and thereby enhance FSAD performance. To address this limitation, we propose Sea-CLIP, a novel method that incorporates semantic-aware representations from DINOv2 to improve FSAD representation learning. Specifically, Sea-CLIP first feeds the semantic-aware representations obtained from DINOv2 into a patch-matching module for segmenting anomalies. Second, a lightweight anomaly matching decoder converts CLIP and DINOv2 features into an anomaly mask, formulating FSAD as a feature-matching task. In addition, Stable Diffusion is leveraged for data augmentation, enabling Sea-CLIP to capture diverse anomaly patterns. Sea-CLIP achieves state-of-the-art FSAD performance on the MVTec AD and VisA datasets. The source code will be released upon acceptance.
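To make the patch-matching formulation concrete, the following is a minimal PyTorch sketch of the general memory-bank style matching that the abstract describes; it is not the paper's implementation. The feature extractor interface and the helper \texttt{extract\_patches} are hypothetical: any backbone (e.g., DINOv2) that maps an image to L2-normalizable patch tokens of shape $(1, N, D)$ would fit, and a patch's anomaly score is taken as one minus its maximum cosine similarity to any patch from the few normal reference images.

\begin{verbatim}
# Minimal sketch (assumed interfaces, not Sea-CLIP itself) of patch matching
# for few-shot anomaly segmentation: query patches with low similarity to
# every normal reference patch are scored as anomalous.

import torch
import torch.nn.functional as F

def extract_patches(model, image: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: returns L2-normalized patch tokens, shape (N, D).
    `model` is assumed to map an image tensor to (1, N, D) patch features."""
    with torch.no_grad():
        tokens = model(image)
    return F.normalize(tokens.squeeze(0), dim=-1)

def patch_matching_scores(model, refs, query: torch.Tensor) -> torch.Tensor:
    """Per-patch anomaly score: 1 - max cosine similarity between each query
    patch and the memory bank built from few-shot normal reference images."""
    bank = torch.cat([extract_patches(model, r) for r in refs], dim=0)  # (M, D)
    q = extract_patches(model, query)                                   # (N, D)
    sim = q @ bank.T                  # cosine similarity (features normalized)
    return 1.0 - sim.max(dim=1).values  # (N,): higher => more anomalous patch
\end{verbatim}

Reshaping the $(N,)$ score vector back to the backbone's patch grid and upsampling it to image resolution yields a dense anomaly map, which is the feature-matching view of FSAD that Sea-CLIP's anomaly matching decoder refines.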