Q-Former Autoencoder: A Modern Framework for Medical Anomaly Detection
Abstract
Unsupervised anomaly detection in medical images is an important yet challenging task due to the diversity of possible anomalies and the practical impossibility of collecting comprehensively annotated datasets. In this paper, we propose a modernized autoencoder-based framework, the Q-Former Autoencoder, that leverages state-of-the-art pretrained vision foundation models for medical anomaly detection. Instead of training encoders from scratch, we directly use frozen foundation models as feature extractors, obtaining rich, multi-stage, high-level representations without domain-specific fine-tuning. We introduce the Q-Former architecture as the bottleneck, which lets us control the length of the reconstruction sequence while efficiently aggregating multi-scale features. Additionally, we incorporate a perceptual loss computed using features from a pretrained Masked Autoencoder, guiding the reconstruction towards semantically meaningful structures. We evaluate our framework on four diverse medical anomaly detection benchmarks (BraTS2021, RESC, RSNA, and LiverCT), achieving state-of-the-art results. These results highlight the potential of foundation model encoders, pretrained on natural images, to generalize effectively to medical image analysis tasks without further fine-tuning. Code and models will be released upon acceptance.
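For concreteness, the core design can be sketched in a few lines of PyTorch: a frozen foundation model yields patch tokens, a fixed set of learnable queries cross-attends to them to form the bottleneck, a light decoder reconstructs the input from that latent, and the training loss adds a perceptual term in the feature space of a frozen Masked Autoencoder. This is a minimal, illustrative sketch, not the exact architecture of the paper; the encoder interface (returning tokens of shape (B, N, 768)), layer counts, dimensions, and the helper names below are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QFormerBottleneck(nn.Module):
    """Learnable queries cross-attend to encoder tokens, compressing a
    variable-length feature sequence into a fixed num_queries slots."""
    def __init__(self, num_queries=32, dim=768, heads=8, layers=2):
        super().__init__()
        self.queries = nn.Parameter(0.02 * torch.randn(1, num_queries, dim))
        block = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.xattn = nn.TransformerDecoder(block, layers)

    def forward(self, tokens):                       # tokens: (B, N, dim)
        q = self.queries.expand(tokens.size(0), -1, -1)
        return self.xattn(q, tokens)                 # (B, num_queries, dim)

class QFormerAutoencoder(nn.Module):
    def __init__(self, frozen_encoder, dim=768, num_queries=32,
                 num_patches=196, patch_dim=16 * 16 * 3):
        super().__init__()
        self.encoder = frozen_encoder                # pretrained, kept frozen
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.bottleneck = QFormerBottleneck(num_queries, dim)
        # Learnable output queries set the reconstruction sequence length,
        # independent of how many tokens the encoder produces.
        self.out_queries = nn.Parameter(0.02 * torch.randn(1, num_patches, dim))
        block = nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.decoder = nn.TransformerDecoder(block, 4)
        self.to_pixels = nn.Linear(dim, patch_dim)   # patchified image output

    def forward(self, images):
        with torch.no_grad():
            tokens = self.encoder(images)            # assumed: (B, N, dim) patch tokens
        z = self.bottleneck(tokens)                  # fixed-length latent
        q = self.out_queries.expand(images.size(0), -1, -1)
        return self.to_pixels(self.decoder(q, z))    # (B, num_patches, patch_dim)

def reconstruction_loss(recon_patches, target_patches, mae_encoder,
                        recon_image, image, weight=0.1):
    """Pixel loss plus a perceptual term computed with a frozen MAE encoder
    (mae_encoder is assumed frozen; `weight` is an illustrative value)."""
    pixel = F.mse_loss(recon_patches, target_patches)
    with torch.no_grad():
        target_feats = mae_encoder(image)
    perceptual = F.mse_loss(mae_encoder(recon_image), target_feats)
    return pixel + weight * perceptual

At inference, as is standard for reconstruction-based detectors, a per-patch reconstruction error map can serve as the anomaly score.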