MEDAL: Multi-modal MEta-space Distillation and ALignment for Visual Compatibility Learning
Abstract
Visual compatibility recommendation systems aim to surface compatible items (e.g., pants, shoes) that harmonise with a user-selected product (e.g., a shirt). Existing methods struggle in three key aspects: they rely on global CNN representations that overlook the fine-grained local cues critical for visual pairing; they force all categories into a single latent space, ignoring the fact that compatibility rules differ across product-type pairs; and they demand costly, expert-annotated outfit labels. We introduce MEDAL (Meta-space Distillation and Alignment), a self-supervised framework that addresses all three challenges simultaneously. MEDAL (i) employs a local–global augmentation curriculum inside a teacher–student ViT to emphasise patch-level texture and pattern similarities while suppressing confounding global shape cues; (ii) partitions the joint feature manifold into learnable, pair-specific meta-spaces so that, for example, {shirt, pants} and {pants, shoes} relationships are modelled with distinct projection masks; and (iii) replaces manual labels with distantly supervised knowledge distillation (KD), harvesting pseudo-compatible sets via object detection on web images and thus scaling to millions of real-world examples. We further fuse perceptually uniform LUV colour histograms to capture global colour harmony that is often missed by pure vision transformers. Extensive experiments on the Polyvore disjoint/non-disjoint splits and a 2M-image in-house dataset show state-of-the-art gains of up to +3.72/+2.7 FITB and +9.58 R@10 over the strongest baseline, whilst cutting annotation cost to zero. Qualitative studies confirm that MEDAL retrieves stylistically coherent outfits and correctly penalises mismatched colour palettes.
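As a rough illustration of the pair-specific meta-spaces in (ii), the sketch below applies a learnable soft mask per category pair to shared item embeddings before scoring compatibility. The class name `PairSpecificMetaSpaces`, the sigmoid-gated masking, the embedding dimension, and the cosine-similarity scoring are illustrative assumptions, not the exact MEDAL formulation described in the paper.

```python
# Minimal sketch only: the abstract does not specify MEDAL's exact projection
# scheme, so the mask design and scoring function here are assumptions.
import torch
import torch.nn as nn


class PairSpecificMetaSpaces(nn.Module):
    """Projects shared item embeddings through a learnable mask per
    category pair (e.g. {shirt, pants}), so each product-type pair
    gets its own compatibility subspace."""

    def __init__(self, embed_dim: int, category_pairs: list[tuple[str, str]]):
        super().__init__()
        self.pair_index = {frozenset(p): i for i, p in enumerate(category_pairs)}
        # One learnable mask (pre-sigmoid logits) per category pair.
        self.mask_logits = nn.Parameter(torch.zeros(len(category_pairs), embed_dim))

    def forward(self, emb_a, emb_b, cat_a: str, cat_b: str):
        idx = self.pair_index[frozenset((cat_a, cat_b))]
        mask = torch.sigmoid(self.mask_logits[idx])        # soft feature selection
        za, zb = emb_a * mask, emb_b * mask                # project into the pair's meta-space
        return nn.functional.cosine_similarity(za, zb, dim=-1)  # compatibility score


# Usage with hypothetical embeddings from a ViT backbone:
pairs = [("shirt", "pants"), ("pants", "shoes"), ("shirt", "shoes")]
meta = PairSpecificMetaSpaces(embed_dim=768, category_pairs=pairs)
shirt_emb, pants_emb = torch.randn(4, 768), torch.randn(4, 768)
scores = meta(shirt_emb, pants_emb, "shirt", "pants")      # shape: (4,)
```

The design choice sketched here keeps a single shared embedding space and learns only which dimensions matter for each category pair, which is one plausible reading of "distinct projection masks" in the abstract.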