Multimodal Medical Image Binding via Shared Text Embeddings
Yunhao Liu · Suyang Xi · Shiqi Liu · Hong Ding · Chicheng Jin · Zhong Chong · Junjun He · Catherine Liu · Yiqing Shen
Abstract
Medical image analysis increasingly relies on the integration of multiple imaging modalities to capture complementary anatomical and functional information, enabling more accurate diagnosis and treatment planning. Achieving aligned feature representations across these diverse modalities is therefore important for effective multimodal analysis. While contrastive language-image pre-training (CLIP) and its variants have enabled image-text alignment, they require explicitly paired data between any two modalities, which is difficult to acquire in medical contexts. To address this gap, we present Multimodal Medical Image Binding with Text (M$^3$Bind), a novel pre-training framework that aligns multiple medical imaging modalities through a shared text representation space without requiring explicit paired data between any two medical image modalities. Specifically, based on the insight that different images can naturally bind with text, M$^3$Bind first fine-tunes pre-trained CLIP-like image-text models, each derived from a different medical modality, to align their modality-specific text embedding spaces while preserving their original image-text alignments. Subsequently, we distill these modality-specific text encoders into a unified model, creating a shared text embedding space. Notably, M$^3$Bind is a flexible framework in which the selection of CLIP-like models is not fixed and can be adapted to the requirements of the task. Experiments on X-ray, CT, retina, ECG, and pathological images across multiple downstream tasks demonstrate that M$^3$Bind achieves competitive or even superior performance in zero-shot classification, few-shot classification, and cross-modal retrieval compared to its CLIP-like counterparts. These results validate M$^3$Bind's effectiveness in achieving cross-image-modal alignment for medical analysis.
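The two stages described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify the loss functions, so everything here is an assumption made for illustration: the symmetric InfoNCE objective, the cosine-based text alignment term, the averaging of teacher embeddings during distillation, and all function names and parameters (`clip_loss`, `text_alignment_loss`, `stage1_loss`, `distillation_loss`, `temperature`, `weight`) are hypothetical and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch where row i of img matches row i of txt."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def text_alignment_loss(txt_a, txt_b):
    """Mean (1 - cosine similarity) between two embeddings of the same texts."""
    a, b = l2_normalize(txt_a), l2_normalize(txt_b)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))

def stage1_loss(img, txt_own, txt_other, weight=1.0):
    """Assumed stage-1 objective: preserve the model's own image-text
    alignment while pulling its text embeddings toward those of another
    modality's text encoder."""
    return clip_loss(img, txt_own) + weight * text_alignment_loss(txt_own, txt_other)

def distillation_loss(student_txt, teacher_txts):
    """Assumed stage-2 objective: match the unified text encoder's output
    to the normalized mean of the modality-specific teachers' outputs."""
    target = l2_normalize(np.mean([l2_normalize(t) for t in teacher_txts], axis=0))
    return text_alignment_loss(student_txt, target)
```

In this sketch, `txt_own` and `txt_other` would be the outputs of two modality-specific text encoders on the same batch of text, so no paired images from two modalities are needed; only text is shared between the models.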