Learning Mask-Aware Offsets: Two-Branch Deformable Attention Networks for Inpainting with Masked Region Avoidance
Abstract
Image inpainting aims to restore missing regions of an image in a visually plausible and structurally consistent manner. However, existing methods often struggle with irregular holes and complex structural patterns due to the limitations of fixed-kernel convolutions and static attention mechanisms. In this paper, we propose the Mask-Aware Deformable Inpainting Network (MADIN), an image inpainting framework that enables position-aware control conditioned on mask information. The model employs a Two-Stage Offset Estimator that jointly exploits query features and mask signals to predict reliable reference positions even inside masked regions. In addition, Adaptive Offset Range Scaling lets the model flexibly access broader contextual information by adjusting the offset magnitude according to the masking ratio. By effectively combining convolutional operations with attention mechanisms, MADIN integrates both local and global information while preserving spatial structure without requiring explicit positional encoding. Extensive quantitative and qualitative experiments on the CelebA-HQ and Places2 datasets demonstrate that MADIN achieves superior restoration performance despite its lightweight design of only 29M parameters and an inference time under 48 ms. Our method outperforms existing state-of-the-art approaches across key metrics including PSNR, SSIM, LPIPS, and FID.
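To make the two mechanisms named above concrete, the sketch below shows one plausible PyTorch realization of a mask-conditioned offset estimator with adaptive range scaling: a coarse stage predicts offsets from query features alone, a refinement stage corrects them using the mask, and the final offset magnitude grows with the masking ratio. The module name TwoStageOffsetEstimator, the layer shapes, and the scaling rule (scale = 1 + masking ratio) are illustrative assumptions under this reading of the abstract, not the paper's actual implementation.

```python
# Minimal sketch, assuming a (B, C, H, W) feature map and a binary hole mask.
# All names, shapes, and the scaling rule are hypothetical, not MADIN's code.
import torch
import torch.nn as nn


class TwoStageOffsetEstimator(nn.Module):
    def __init__(self, dim: int, num_points: int = 9, max_range: float = 8.0):
        super().__init__()
        self.max_range = max_range
        # Stage 1: coarse (x, y) offsets per sampling point from query features.
        self.coarse = nn.Conv2d(dim, 2 * num_points, kernel_size=3, padding=1)
        # Stage 2: refine the coarse estimate using features, offsets, and mask.
        self.refine = nn.Conv2d(dim + 2 * num_points + 1, 2 * num_points,
                                kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) query features; mask: (B, 1, H, W), 1 = hole.
        coarse = self.coarse(feat)
        refined = coarse + self.refine(torch.cat([feat, coarse, mask], dim=1))
        # Adaptive offset range scaling: larger holes -> wider sampling reach.
        mask_ratio = mask.mean(dim=(1, 2, 3))          # (B,) fraction masked
        scale = 1.0 + mask_ratio.view(-1, 1, 1, 1)     # grows with hole size
        return torch.tanh(refined) * self.max_range * scale


# Usage: per-pixel sampling offsets for a deformable attention/conv layer.
feat = torch.randn(2, 64, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.7).float()
offsets = TwoStageOffsetEstimator(dim=64)(feat, mask)
print(offsets.shape)  # torch.Size([2, 18, 32, 32]): (x, y) for 9 points
```

In this sketch the tanh bound keeps offsets within a learnable-free maximum reach, while the ratio-dependent scale widens that reach for heavily masked inputs, matching the stated intent of sampling reliable context outside the hole.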