Human-knowledge-integrated multi-modal learning for single-source domain generalization
Abstract
Generalized image classification performance across datasets from different domains has been elusive in critical applications such as fundus-image-based grading of diabetic retinopathy (DR). Theoretically, if data from two domains differ in unknown causal factors, it is difficult to achieve generalized performance across the two domains. Traditionally, there has been no methodology for evaluating whether two domains differ in causal factors without access to the data collection sources, which is often infeasible. This paper first proposes a novel theoretical framework of domain conformal bounds (DCB) to evaluate whether two domains differ in unknown causal factors. It then proposes GenEval, a multi-modal visual language model (VLM)-based technique that integrates foundational image classification models such as MedGemma-4B with human knowledge about specific classes through parameter-efficient LoRA adaptation, bridging the causal gap between domains and achieving single-source domain generalization (SDG) performance superior to the state of the art. A comprehensive SDG evaluation across four major DR datasets (APTOS, EyePACS, Messidor, Messidor-2) demonstrates GenEval's superiority: it achieves \textbf{76.0\%} average accuracy, surpassing the strongest baseline by \textbf{10.5\%} on the DR task under the SDG setting.
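As a concrete illustration of the adaptation step named above (the abstract gives no implementation details), the following is a minimal sketch of parameter-efficient LoRA adaptation of a VLM using the Hugging Face peft library. The model identifier, target modules, and hyperparameters are illustrative assumptions, not the paper's actual configuration.

\begin{verbatim}
from transformers import AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

# Load the base VLM (model id assumed: MedGemma-4B on the Hugging Face Hub).
model = AutoModelForImageTextToText.from_pretrained("google/medgemma-4b-it")

# LoRA configuration: rank, scaling, dropout, and target modules are
# assumed values for illustration only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the model so that only the low-rank adapter weights are trainable,
# keeping the foundation model's parameters frozen.
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
\end{verbatim}

Freezing the base weights and training only the low-rank adapters is what makes the adaptation parameter-efficient; how GenEval injects human knowledge about specific classes is described in the body of the paper.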