Structure-Aware Feature Rectification with Region Adjacency Graphs for Training-free Open-Vocabulary Semantic Segmentation
Abstract
Open-vocabulary semantic segmentation (OVSS) aims to segment images using arbitrary text queries without retraining. Recent approaches leverage vision-language models such as CLIP to enable training-free segmentation. However, CLIP is trained primarily for global image-text alignment, which limits its ability to capture fine-grained regional semantics and leads to inconsistent predictions across image regions. This work introduces a feature rectification strategy that injects localised structural priors via spherical linear interpolation on the unit hypersphere. Specifically, we construct a region adjacency graph guided by low-level image cues, such as colour differences, gradients, and textures, to encode localised appearance priors. This encourages more region-aware feature alignment and complements CLIP’s global alignment bias. Extensive experiments demonstrate the effectiveness of the proposed method in reducing segmentation noise and preserving fine-grained structures. Further generalisation analysis confirms that our approach maintains strong performance across diverse training-free open-vocabulary semantic segmentation benchmarks.
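To make the rectification idea concrete, the following minimal Python sketch (an illustration under stated assumptions, not the released implementation) blends each region's CLIP feature toward an affinity-weighted mean of its neighbours on a region adjacency graph, using spherical linear interpolation on the unit hypersphere. The function names, the affinity matrix construction, and the interpolation weight t are assumptions for illustration only.

```python
import numpy as np

def slerp(a, b, t):
    """Spherical linear interpolation between unit vectors a and b."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if omega < 1e-6:                      # nearly parallel: linear blend is fine
        return (1 - t) * a + t * b
    so = np.sin(omega)
    return (np.sin((1 - t) * omega) / so) * a + (np.sin(t * omega) / so) * b

def rectify_features(feats, adjacency, t=0.3):
    """Hypothetical sketch of structure-aware feature rectification.

    feats     : (R, D) L2-normalised per-region CLIP features
    adjacency : (R, R) edge affinities from low-level cues
                (colour difference, gradient, texture); 0 for non-neighbours
    t         : weight pulling each feature toward its structural prior
    """
    rectified = np.empty_like(feats)
    for r in range(len(feats)):
        w = adjacency[r]
        if w.sum() < 1e-8:                # isolated region: keep feature as-is
            rectified[r] = feats[r]
            continue
        prior = (w[:, None] * feats).sum(axis=0)   # affinity-weighted neighbour mean
        prior /= np.linalg.norm(prior) + 1e-8
        rectified[r] = slerp(feats[r], prior, t)   # blend on the unit hypersphere
    return rectified
```

In this sketch, regions could come from any superpixel or segment proposal step, and the adjacency affinities would be computed from local colour, gradient, and texture differences between neighbouring regions; both choices are placeholders rather than the paper's exact pipeline.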