Unsupervised Segmentation by Diffusing, Walking and Cutting
Abstract
We propose a zero-shot unsupervised image segmentation method that utilises self-attention activations extracted from Stable Diffusion. We demonstrate that self-attention can be directly interpreted as the transition probabilities of a Markov random walk between image patches. This property enables us to modulate multi-hop relationships through matrix powers, which capture k-step transitions between patches. We then construct a graph representation based on self-attention feature similarity and apply Normalised Cuts to cluster the patches. We quantitatively analyse the effect of incorporating multi-node paths when constructing the NCuts adjacency matrix, showing that higher-order transitions enhance hierarchical relationships in the resulting segmentations. Finally, we describe an approach to automatically determine the NCut threshold criterion, avoiding the need to tune it manually. Our approach surpasses all existing methods for zero-shot unsupervised segmentation based on pre-trained diffusion model features, achieving state-of-the-art results on COCO-Stuff-27, Cityscapes and ADE20K.
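The core pipeline sketched in the abstract (row-stochastic attention as a random walk, k-step transitions via a matrix power, then a Normalised Cut on the resulting adjacency) can be illustrated with a toy example. This is a hedged sketch using random data in place of Stable Diffusion attention maps; the variable names, the symmetrisation step, and the two-way spectral bipartition shown here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a self-attention score matrix over N image patches.
# (Random data; in the method these come from Stable Diffusion activations.)
N = 8
scores = rng.normal(size=(N, N))

# Row-wise softmax: each row sums to 1, so the attention map is a
# row-stochastic transition matrix P of a Markov random walk over patches.
P = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# k-step transition probabilities via the matrix power P^k,
# injecting multi-hop relationships between patches.
k = 3
Pk = np.linalg.matrix_power(P, k)

# Symmetrise to obtain an undirected adjacency matrix W for Normalised Cuts
# (an assumed choice here; NCuts expects a symmetric affinity).
W = 0.5 * (Pk + Pk.T)

# Normalised Cuts (Shi & Malik): bipartition by the sign of the eigenvector
# for the second-smallest eigenvalue of L = I - D^{-1/2} W D^{-1/2}.
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(N) - D_inv_sqrt @ W @ D_inv_sqrt
eigvals, eigvecs = np.linalg.eigh(L)
labels = (eigvecs[:, 1] > 0).astype(int)  # two-way patch segmentation
```

Note that a power of a row-stochastic matrix is itself row-stochastic, so `Pk` remains a valid k-step transition matrix before symmetrisation.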