Decoupling Shape and Texture in SAM-2 via Controlled Texture Replacement
Abstract
Segment Anything Models (SAM) have demonstrated strong generalization in object segmentation across diverse datasets. However, their training on large-scale semantic segmentation data induces a shape bias, leading to over-segmentation in texture-dominant scenes and severely limiting performance on texture-defined segmentation tasks. This limitation is particularly pronounced in domains such as remote sensing and metallographic imaging, where meaningful boundaries are defined by texture variations rather than semantic structure. In this study, we investigate SAM-2's shape bias and show that a simple fine-tuning strategy, based on incremental texture augmentations of semantically labeled data, can effectively calibrate this bias and guide the model toward texture-aware segmentation. By interpolating and replacing textures within \textbf{semantically} labeled regions, we generate texture-diverse instances of the same semantic category, enabling effective fine-tuning without additional manual annotations. We show that this fine-tuning approach mitigates SAM-2's shape bias, improving segmentation performance on both real-world (RWTD, +0.20 mIoU) and synthetic (STMD, +0.18 mIoU) texture segmentation benchmarks. We release both the texture-oriented variant of SAM (“TextureSAM”) and the texture-augmented dataset used in our experiments to support reproducibility and facilitate further research on shape–texture bias in foundation models.
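The texture interpolation and replacement described in the abstract can be summarized as a per-region blend between an image and a substitute texture. The sketch below is illustrative only and is not the released augmentation pipeline; the function name, the tiling strategy, and the interpolation weight `alpha` are assumptions made for clarity.

```python
import numpy as np

def replace_texture(image, mask, texture, alpha=0.5):
    """Blend a replacement texture into one semantically labeled region.

    image   : (H, W, 3) float array in [0, 1]
    mask    : (H, W) boolean array selecting the labeled region
    texture : (h, w, 3) float array in [0, 1], tiled to cover the region
    alpha   : interpolation weight; 0 keeps the original appearance,
              1 fully replaces it (intermediate values give the
              incremental augmentations referred to in the abstract).
    Hypothetical helper for illustration, not the authors' released code.
    """
    H, W = mask.shape
    # Tile the texture so it covers the whole image, then crop to size.
    reps = (H // texture.shape[0] + 1, W // texture.shape[1] + 1, 1)
    tiled = np.tile(texture, reps)[:H, :W]
    out = image.copy()
    # Linearly interpolate between the original content and the new
    # texture, restricted to the labeled region.
    out[mask] = (1 - alpha) * image[mask] + alpha * tiled[mask]
    return out

# Example: generate a sequence of texture-diverse instances of the same
# semantic region by sweeping the interpolation weight.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((256, 256, 3))
    region = np.zeros((256, 256), dtype=bool)
    region[64:192, 64:192] = True
    tex = rng.random((32, 32, 3))
    augmented = [replace_texture(img, region, tex, a)
                 for a in np.linspace(0.25, 1.0, 4)]
```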