Deep Image Decomposition for Medical Imaging Anonymization and Curation
Abstract
Medical scans often include patient identifiers and clinical annotations that must be removed before data sharing or use in downstream model training. With machine learning now central to clinical imaging analysis, reliable removal of such non-imaging artifacts is essential for preserving patient privacy, reducing bias, and improving data quality. However, this crucial curation step is frequently overlooked or addressed heuristically.

We present a deep learning framework that automatically detects and removes overlaid text, markers, and other non-imaging elements from clinical scans while restoring the underlying image content. The model comprises two components: a detection module that localizes non-imaging regions, and a dual-generator architecture for unsupervised image decomposition, in which one generator reconstructs the imaging content and the other produces the non-imaging components. Unlike conventional inpainting, our method bypasses explicit segmentation by leveraging explainable AI (XAI) maps from the detection module to guide artifact masking and restoration.

We demonstrate robust curation performance on three datasets, one MRI and two ultrasound, drawn from both public and private sources. Results show high visual quality (validated by a Turing test) and strong quantitative scores (SSIM, PSNR, FID). Importantly, training downstream classification and segmentation models on scans curated by our method substantially improves results compared with models trained on data containing overlaid annotations; across various metrics (e.g., accuracy, F1 score, IoU, and Dice), performance is comparable to that obtained with clean, marker-free training data. Code is included with the submission, and our private dataset will be released upon acceptance.
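The decomposition idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the two generators here are stand-in callables (the actual model uses learned networks), and the soft mask stands in for the XAI map from the detection module. The sketch only shows how the two generator outputs recompose the input under mask gating, which is the self-supervision signal the text describes.

```python
import numpy as np


def decompose(x, mask, g_image, g_overlay):
    """Split a scan x into imaging content and non-imaging overlay.

    x:         2-D array, the input scan
    mask:      soft map in [0, 1] (stand-in for an XAI saliency map)
               highlighting non-imaging regions
    g_image:   stand-in for the generator restoring imaging content
    g_overlay: stand-in for the generator reproducing annotations
    """
    content = g_image(x)    # restored imaging content
    overlay = g_overlay(x)  # reconstructed text/markers
    # Self-supervision: the two outputs should recompose the input,
    # with the mask gating where the overlay may contribute.
    recon = (1.0 - mask) * content + mask * overlay
    return content, overlay, recon


# Toy usage with identity "generators" on a 4x4 scan: with both
# generators acting as identity, the recomposition equals the input.
x = np.ones((4, 4))
mask = np.zeros((4, 4))
mask[0, :2] = 1.0  # hypothetical overlaid-text region
content, overlay, recon = decompose(x, mask, lambda a: a, lambda a: a)
```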