Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention
Abstract
Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals, each consisting of multiple overlapping leaves, remains challenging. Existing methods for this problem, referred to as a hierarchical segmentation task, typically require annotated training datasets, which are often species-specific and demand considerable human labor. To address this problem, we introduce ZeroPlantSeg, a zero-shot segmentation method for rosette-shaped plant individuals in top-view crop images. Our method integrates a foundation segmentation model, which extracts leaf instances, with a vision-language model (VLM) that reasons about plant structure to extract plant instances without additional training. Evaluations on real-world datasets covering multiple plant species (i.e., sugar beets and cauliflowers), growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Our implementation will be publicly available upon acceptance.