CLoCKDistill: Consistent Location-and-Context-aware Knowledge Distillation for DETRs
Abstract
Detection Transformers (DETRs) have advanced object detection but are resource-intensive, which limits their deployment in embedded settings such as self-driving cars. While knowledge distillation (KD) effectively compresses CNN detectors, it remains underexplored for DETRs, and most KD methods fail to capture global context. Moreover, existing KD methods often trust the teacher model blindly, even though its predictions can be misleading. To bridge these gaps, this paper proposes Consistent Location-and-Context-aware Knowledge Distillation (CLoCKDistill) for DETR detectors, which comprises two components: (1) feature distillation targets the context-rich transformer encoder output (memory) and enriches it with ground-truth object cues, enabling the student to focus on relevant regions with balanced attention across object sizes; (2) logit distillation uses ground truth to generate target-aware decoder queries, ensuring that the teacher and student attend to consistent and accurate parts of the encoder memory. Experiments on KITTI and COCO show that CLoCKDistill improves a wide range of DETRs (e.g., single-scale DAB-DETR, multi-scale Deformable DETR, and denoising-based DINO) by 2.2\%–6.4\%.