Training-Free Few-Shot Segmentation via Vision-Language Guided Prompting
Euihyun Yoon · Taejin Park · Jaekoo Lee
Abstract
Object segmentation relies heavily on costly pixel-level annotations and struggles to generalize to unseen domains. The recently introduced Segment Anything Model (SAM), a foundation model for segmentation, offers a prompt-driven, zero-shot capability that has been applied across domains (e.g., autonomous driving, satellite imagery, medical imaging) and extended to Few-Shot Segmentation (FSS) tasks. However, existing SAM-based FSS methods typically generate prompts by measuring support–query image similarity with a vision encoder, which biases the prompts toward the support images and fails under significant support–query context shifts. To address this limitation, we propose a training-free FSS approach that combines visual and textual cues to generate effective prompts for the target class. By leveraging both vision and language information, our approach bridges the support–query gap and guides SAM to segment novel objects more reliably. Without any additional training, our method outperforms previous state-of-the-art FSS methods on the established COCO-$20^i$ and PASCAL-$5^i$ benchmarks, demonstrating its effectiveness and robust generalization. Our code is publicly available on GitHub.
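To make the prompt-guidance idea concrete, the following is a minimal, hypothetical sketch rather than the paper's released pipeline: it scores class-agnostic SAM mask proposals on the query image against both a CLIP image embedding of the support foreground (visual cue) and a CLIP text embedding of the class name (textual cue), keeping all models frozen. The checkpoint path, the `alpha` weighting, and the use of `SamAutomaticMaskGenerator` are illustrative assumptions.

```python
# Illustrative sketch only (not the authors' code). Assumes the `segment_anything`
# and OpenAI `clip` packages plus a downloaded SAM checkpoint.
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen foundation models: no additional training is performed.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)


def clip_embed(pil_image: Image.Image) -> torch.Tensor:
    """L2-normalized CLIP image embedding for one PIL image."""
    with torch.no_grad():
        x = clip_preprocess(pil_image).unsqueeze(0).to(device)
        feat = clip_model.encode_image(x)
    return feat / feat.norm(dim=-1, keepdim=True)


def segment_few_shot(query_rgb: np.ndarray,
                     support_crop: Image.Image,
                     class_name: str,
                     alpha: float = 0.5) -> np.ndarray:
    """Select the SAM proposal that best matches both the support crop
    (visual cue) and the class name (textual cue); `alpha` balances the two."""
    # 1) Class-agnostic mask proposals from SAM on the query image (HWC uint8 RGB).
    proposals = mask_generator.generate(query_rgb)

    # 2) Textual cue: CLIP text embedding of the target class.
    with torch.no_grad():
        tokens = clip.tokenize([f"a photo of a {class_name}"]).to(device)
        text_feat = clip_model.encode_text(tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    # 3) Visual cue: CLIP embedding of the support-set foreground crop.
    support_feat = clip_embed(support_crop)

    best_score, best_mask = -1.0, None
    for p in proposals:
        x, y, w, h = map(int, p["bbox"])            # XYWH box of the proposal
        crop = Image.fromarray(query_rgb[y:y + h, x:x + w])
        feat = clip_embed(crop)
        vis_sim = (feat @ support_feat.T).item()    # support-proposal similarity
        txt_sim = (feat @ text_feat.T).item()       # class-name-proposal similarity
        score = alpha * vis_sim + (1.0 - alpha) * txt_sim
        if score > best_score:
            best_score, best_mask = score, p["segmentation"]
    return best_mask  # boolean HxW mask for the predicted target object
```

In this sketch, the textual score keeps the selection anchored to the class semantics even when the query context differs sharply from the support image, which is the failure mode of purely vision-similarity prompting described above.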