Uncertainty-Aware Vision-Language Segmentation for Medical Imaging
Abstract
Medical image segmentation is crucial for computer-aided diagnosis, surgical planning, and clinical research, requiring precise delineation of anatomical structures and pathological regions across diverse imaging modalities. Traditional deep learning approaches, based primarily on convolutional neural networks (CNNs) and transformers, achieve strong performance but rely on visual features alone, which limits their ability to generalize and to incorporate clinical knowledge. The growing availability of multimodal medical data, including paired image-text records from electronic health record systems, offers a promising way to address these limitations. Vision-language segmentation (VLS) uses natural language inputs, such as radiology reports or anatomical queries, to guide the segmentation process. This multimodal approach bridges the gap between low-level visual cues and high-level clinical concepts, reduces the need for task-specific supervision, and enables more intuitive human-AI interaction in medical workflows. Despite recent advances, VLS in the medical domain faces significant challenges, including the subtlety of pathological features, inter-reader variability, and the need for fine-grained spatial accuracy. Privacy constraints and the scarcity of well-aligned, high-quality image-text pairs further complicate model training and evaluation, limiting the applicability of general-purpose VLS models in clinical settings. This paper introduces an uncertainty-guided multimodal vision-language segmentation model designed to address these challenges. Our model integrates visual and textual data through cross-modal learning, improving segmentation accuracy and robustness. The uncertainty guidance sharpens spatial precision and helps the model capture domain-specific visual-linguistic relationships, making it better suited to clinical applications.
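To make the idea of uncertainty-guided vision-language segmentation concrete, the following is a minimal illustrative sketch, not the paper's actual architecture: it fuses image features with a text embedding via cross-attention and predicts a per-pixel log-variance alongside the segmentation logits, which weights the training loss. All module names, feature dimensions, and the heteroscedastic-style loss are assumptions chosen for clarity; the text embedding is treated as a precomputed vector (e.g., from a frozen report encoder).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    """Fuse image feature maps with a text embedding via cross-attention (assumed design)."""

    def __init__(self, img_dim: int = 256, txt_dim: int = 512, heads: int = 8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, img_dim)
        self.attn = nn.MultiheadAttention(img_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(img_dim)

    def forward(self, img_feats: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, C, H, W), txt_emb: (B, txt_dim)
        b, c, h, w = img_feats.shape
        tokens = img_feats.flatten(2).transpose(1, 2)           # (B, H*W, C)
        txt = self.txt_proj(txt_emb).unsqueeze(1)               # (B, 1, C)
        fused, _ = self.attn(query=tokens, key=txt, value=txt)  # text-conditioned attention
        tokens = self.norm(tokens + fused)                      # residual fusion
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class UncertaintyAwareVLSHead(nn.Module):
    """Predict a segmentation logit map plus a per-pixel log-variance (uncertainty) map."""

    def __init__(self, img_dim: int = 256, txt_dim: int = 512):
        super().__init__()
        self.fusion = CrossModalFusion(img_dim, txt_dim)
        self.seg_head = nn.Conv2d(img_dim, 1, kernel_size=1)     # foreground logits
        self.logvar_head = nn.Conv2d(img_dim, 1, kernel_size=1)  # aleatoric uncertainty

    def forward(self, img_feats: torch.Tensor, txt_emb: torch.Tensor):
        fused = self.fusion(img_feats, txt_emb)
        return self.seg_head(fused), self.logvar_head(fused)


def uncertainty_weighted_bce(logits, logvar, target):
    """Heteroscedastic-style loss: down-weight pixels the model flags as noisy,
    while the +logvar term penalises unbounded uncertainty."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return (torch.exp(-logvar) * bce + logvar).mean()


if __name__ == "__main__":
    img_feats = torch.randn(2, 256, 32, 32)                    # backbone features (assumed shape)
    txt_emb = torch.randn(2, 512)                              # report/query embedding (assumed)
    target = torch.randint(0, 2, (2, 1, 32, 32)).float()       # toy ground-truth mask

    model = UncertaintyAwareVLSHead()
    logits, logvar = model(img_feats, txt_emb)
    loss = uncertainty_weighted_bce(logits, logvar, target)
    print(logits.shape, logvar.shape, loss.item())
```

In this sketch, regions where the predicted log-variance is high contribute less to the segmentation loss, which is one common way to let estimated uncertainty guide training; the paper's specific uncertainty mechanism may differ.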