DermEVAL: A Dermatologist-Reviewed Benchmark for Multimodal Large Language Models
Abstract
Clinical photographs play a crucial role in conversational computer-aided diagnosis, particularly in dermatology. However, existing skin disease benchmarks have notable limitations, including insufficient dataset size, labels restricted to disease categories, the absence of expert review, and limited annotation diversity. To address these shortcomings, we introduce DermEVAL, a large-scale benchmark designed to evaluate the performance of Multimodal Large Language Models (MLLMs) in dermatology. Our benchmark comprises image-text pairs covering 16 distinct skin diseases, with a total of 11,347 representative images drawn from multiple dermatological datasets, carefully selected and annotated under the guidance of dermatologists. DermEVAL supports two primary tasks, visual question answering (VQA) and medical report generation (MRG), designed to simulate real-world diagnostic scenarios. We evaluate MLLMs on both tasks using traditional metrics as well as GPT-4V-based assessments. Our results indicate that accurately diagnosing skin diseases remains challenging for state-of-the-art MLLMs, and we demonstrate that fine-tuning MLLMs on DermEVAL significantly improves their accuracy on dermatological tasks. We will release our code and benchmark.