Learnable Query-Enhanced Pose Transformation
Abstract
Pose-Guided Person Image Synthesis (PGPIS) aims to transfer a person from a source image to a target pose (e.g., a skeleton) while preserving the person's original appearance. Although existing methods produce results that look high-quality at first glance, they often suffer from noticeable distortions in fine details. We identify the root cause of these artifacts as a heavy reliance on pre-trained encoders for extracting visual features from the source image. To address this, we propose a novel Query Enhancement Network composed of two key components: a Query-based Feature Fusion Transformer (QFFT) and Pose-Masked Attention (PMA). The QFFT uses learnable queries to fuse the multi-scale features, from high to low resolution, extracted by the backbone encoder, significantly enhancing the realism of texture details in the generated images. To better capture the relationship between pose information and the visual features of the source image, PMA uses the pose skeleton as a mask that guides the attention mechanism to focus on pose regions. Our method produces high-quality, visually coherent results and outperforms existing approaches on the DeepFashion dataset under standard evaluation metrics, including FID, SSIM, and LPIPS.
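To make the query-based fusion concrete, the following is a minimal PyTorch sketch of learnable queries cross-attending to multi-scale backbone features in the spirit of the QFFT. The module name `QueryFeatureFusion`, the embedding dimension, the number of queries, and the per-scale layer layout are illustrative assumptions, not the paper's actual configuration.

```python
# A minimal sketch of query-based multi-scale feature fusion (QFFT-style),
# assuming PyTorch. Shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class QueryFeatureFusion(nn.Module):
    """Learnable queries cross-attend to backbone features, high to low resolution."""

    def __init__(self, dim=256, num_queries=64, num_heads=8, num_scales=3):
        super().__init__()
        # One shared set of learnable query embeddings.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # One cross-attention layer per feature scale.
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_scales)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_scales))

    def forward(self, feats):
        # feats: list of (B, dim, H_i, W_i) backbone maps, ordered high -> low resolution.
        B = feats[0].shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)  # (B, num_queries, dim)
        for attn, norm, f in zip(self.cross_attn, self.norms, feats):
            kv = f.flatten(2).transpose(1, 2)  # (B, H_i*W_i, dim)
            out, _ = attn(q, kv, kv)           # queries attend to this scale
            q = norm(q + out)                  # residual update of the queries
        return q  # fused appearance tokens


# Usage with dummy multi-scale features (channel count must match dim).
fusion = QueryFeatureFusion(dim=256)
feats = [torch.randn(2, 256, s, s) for s in (32, 16, 8)]
tokens = fusion(feats)  # (2, 64, 256)
```

A single query set is updated residually scale by scale, so lower-resolution scales refine what higher-resolution scales contributed; the paper's actual fusion rule may differ from this sketch.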
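Similarly, here is a minimal sketch of pose-masked attention, assuming the pose skeleton has been rasterized into a binary mask over the feature grid. The module name `PoseMaskedAttention` and the use of `key_padding_mask` to implement the masking are illustrative choices, not necessarily the paper's exact formulation.

```python
# A minimal sketch of pose-masked attention (PMA-style), assuming PyTorch and a
# binary skeleton mask (1 = on/near a limb, 0 = background). Names and shapes
# are hypothetical.
import torch
import torch.nn as nn


class PoseMaskedAttention(nn.Module):
    """Cross-attention whose keys are restricted to pose regions of the source."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, q, src_feat, pose_mask):
        # q:         (B, N, dim) query tokens (e.g., the fused appearance tokens)
        # src_feat:  (B, dim, H, W) source-image feature map
        # pose_mask: (B, H, W) binary skeleton mask over the same grid;
        #            assumed non-empty per sample (an all-zero mask would leave
        #            no attendable keys).
        kv = src_feat.flatten(2).transpose(1, 2)      # (B, H*W, dim)
        # Keys outside the skeleton are padded out so attention cannot see them.
        key_padding_mask = pose_mask.flatten(1) == 0  # True = ignore this key
        out, _ = self.attn(q, kv, kv, key_padding_mask=key_padding_mask)
        return out
```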