Oral Session
Oral Session 5A: Generative Models II
CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization
Anh-Duy Le ⋅ Van Pham ⋅ Thanh Vo ⋅ Mai Toan ⋅ Tuan-Anh Tran
One-shot styled handwriting image generation, despite impressive results in recent years, remains challenging because the intricate and diverse characteristics of human handwriting are difficult to capture from a single reference image. Existing methods still struggle to generate visually appealing and realistic handwritten images, to adapt to complex, unseen writer styles, and to isolate invariant style features (e.g., slant, stroke width, curvature) while ignoring irrelevant noise. To tackle this problem, we introduce Patch \textbf{Con}trastive Enhancement and \textbf{St}yle-\textbf{A}ware Qua\textbf{nt}ization via Denoising Diffusion (\textbf{CONSTANT}), a novel diffusion-based method for one-shot handwriting generation. CONSTANT leverages three key innovations: 1) a Style-Aware Quantization (SAQ) module that models style as discrete visual tokens capturing distinct concepts; 2) a contrastive objective that keeps these tokens well-separated and meaningful in the style embedding space; 3) a latent patch-based contrastive objective ($L_{LatentPCE}$) that improves quality and local structure by aligning multiscale spatial patches of generated and real features in latent space. Extensive experiments and analysis on benchmark datasets in multiple languages, including English, Chinese, and our proposed ViHTGen dataset for Vietnamese, demonstrate that our method outperforms state-of-the-art approaches in adapting to new reference styles and producing highly detailed images. Code is available at \href{https://github.com/anonymous6399/CONSTANT}{https://github.com/anonymous6399/CONSTANT}.
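For readers unfamiliar with the idea of representing style as discrete tokens, the following is a minimal sketch of generic nearest-neighbor vector quantization. It is not the authors' SAQ module: the function name `quantize`, the toy codebook, and the 2-D feature vectors are all hypothetical, and a real system would learn the codebook jointly with the encoder.

```python
import math


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Distance is plain Euclidean (L2); this is an assumption of the sketch,
    not a detail taken from the abstract.
    """
    indices = []
    for feature in features:
        dists = [math.dist(feature, code) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


# Hypothetical toy codebook of three 2-D "style tokens".
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical continuous style features extracted from a reference image.
features = [(0.9, 0.1), (0.1, 0.8), (0.05, -0.1)]

tokens = quantize(features, codebook)        # discrete token ids
quantized = [codebook[i] for i in tokens]    # features snapped to the codebook
```

The discrete ids in `tokens` play the role of the "visual tokens"; a contrastive objective like the one the abstract mentions would then act on the codebook vectors to keep them apart.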
DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy
Jaewoo Song ⋅ Jooyoung Choi ⋅ Kanghyun Baek ⋅ Sangyub Lee ⋅ Daemin Park ⋅ Sungroh Yoon
Although recent text-to-image models achieve high-fidelity text rendering, they still struggle to generate long or multiple texts because global attention becomes diluted. To address this, we propose DCText, a training-free method for visual text generation that employs a divide-and-conquer strategy, inspired by the reliable short-text generation of Multi-Modal Diffusion Transformer models. To effectively render long or multiple texts, our method first decomposes a global prompt by extracting and dividing the target text, then assigns each decomposed text to a designated region for generation. To ensure these text segments are accurately rendered within their regions while preserving overall image coherence, we introduce two attention masks, the Text-Focus Attention Mask and the Context-Expansion Attention Mask, that are applied sequentially during denoising. In addition, our Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive evaluations across diverse datasets, covering both single-sentence and multi-sentence cases, demonstrate that DCText consistently achieves the highest text accuracy without compromising image quality, while also delivering the lowest generation latency.
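The region assignment and sequential masking described above can be illustrated with a small sketch. The abstract does not define the two masks, so everything here is an assumption: `region_mask`, `scheduled_mask`, the box format, and the single `switch_step` are hypothetical stand-ins that only show the general pattern of a restrictive mask early in denoising followed by a permissive one.

```python
def region_mask(height, width, boxes):
    """Build mask[p][s]: True when image token p lies inside box s.

    Image tokens are indexed row-major over a height x width latent grid.
    Boxes are (top, left, bottom, right) with bottom/right exclusive.
    """
    mask = []
    for row in range(height):
        for col in range(width):
            mask.append([
                top <= row < bottom and left <= col < right
                for (top, left, bottom, right) in boxes
            ])
    return mask


def scheduled_mask(step, switch_step, focus, n_tokens, n_segments):
    """Return the restrictive mask early, then allow every token-segment pair."""
    if step < switch_step:
        return focus
    return [[True] * n_segments for _ in range(n_tokens)]


# Hypothetical 2x4 latent grid split into a left and a right text region.
boxes = [(0, 0, 2, 2), (0, 2, 2, 4)]
focus = region_mask(2, 4, boxes)

early = scheduled_mask(step=3, switch_step=10, focus=focus, n_tokens=8, n_segments=2)
late = scheduled_mask(step=12, switch_step=10, focus=focus, n_tokens=8, n_segments=2)
```

In this toy setup, `early` lets each image token attend only to the text segment assigned to its region, while `late` removes the restriction so that the segments can blend into one coherent image.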
VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping
Sanoojan Baliah ⋅ Yohan Abeysinghe ⋅ Rusiru Thushara ⋅ Khan Muhammad ⋅ Abhinav Dhall ⋅ Karthik Nandakumar ⋅ Muhammad Haris Khan
We present VFace, a training-free, plug-and-play method for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique to facilitate generation while keeping key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection to better align the structural features of the target frame with the generation. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that enforces spatiotemporal coherence without modifying the underlying diffusion model, reducing the temporal inconsistencies typically encountered in frame-wise generation. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that our method significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code will be released upon acceptance.
VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework
Donglin Huang ⋅ Yongyuan Li ⋅ Tianhang Liu ⋅ Junming Huang ⋅ Xiaoda Yang ⋅ Chi Wang ⋅ Weiwei Xu
Existing audio- and pose-driven human animation methods often struggle with stiff head movements and blurry hands, primarily due to the weak correlation between audio and head movements and the structural complexity of hands. To address these issues, we propose \textbf{VividAnimator}, an end-to-end framework for generating high-quality, half-body human animations driven by audio and sparse hand pose conditions. Our framework introduces three key innovations. First, to overcome the instability and high cost of online codebook training, we pre-train a Hand Clarity Codebook (HCC) that encodes rich, high-fidelity hand texture priors, significantly mitigating hand degradation. Second, we design a Dual-Stream Audio-Aware Module (DSAA) to model lip synchronization and natural head pose dynamics separately while enabling interaction. Third, we introduce a Pose Calibration Trick (PCT) that refines and aligns pose conditions by relaxing rigid constraints, ensuring smooth and natural gesture transitions. Extensive experiments demonstrate that VividAnimator achieves state-of-the-art performance, producing videos with superior hand detail, gesture realism, and identity consistency, validated by both quantitative metrics and qualitative evaluations.
Fine-grained Defocus Blur Control for Generative Image Models
Ayush Shrivastava ⋅ Connelly Barnes ⋅ Cecilia Zhang ⋅ Lingzhi Zhang ⋅ Andrew Owens ⋅ Sohrab Amirghodsi ⋅ Eli Shechtman
Current text-to-image diffusion models excel at generating diverse, high-quality images, yet they struggle to incorporate fine-grained camera metadata such as precise aperture settings. In this work, we introduce a novel text-to-image diffusion framework that leverages camera metadata, or EXIF data, which is often embedded in image files, with an emphasis on generating controllable lens blur. Our method mimics the physical image formation process by first generating an all-in-focus image, estimating its monocular depth, predicting a plausible focus distance with a novel focus distance transformer, and then forming a defocused image with an existing differentiable lens blur model. Gradients flow backwards through this whole process, allowing the model to learn, without explicit supervision, to generate defocus effects based on content elements and the provided EXIF data. At inference time, this enables precise control over defocus effects while preserving scene contents, which is not achievable with existing diffusion models. Experimental results demonstrate that our model enables superior fine-grained control without altering the depicted scene.
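To make the link between aperture, focus distance, and depth concrete, here is a minimal sketch of the standard thin-lens circle-of-confusion formula, which gives the blur diameter on the sensor for a point at a given depth. The abstract does not say which lens blur model is used, so treating it as a thin lens is an assumption, and the function name `coc_diameter_mm` and the example values are hypothetical.

```python
def coc_diameter_mm(focal_length_mm, f_number, focus_dist_mm, depth_mm):
    """Thin-lens circle-of-confusion diameter on the sensor, in millimetres.

    c = A * m * |d - s| / d, where
      A = f / N          aperture diameter,
      m = f / (s - f)    magnification at the focus distance s,
      d                  depth of the scene point.
    Assumes focus_dist_mm > focal_length_mm and depth_mm > 0.
    """
    aperture = focal_length_mm / f_number
    magnification = focal_length_mm / (focus_dist_mm - focal_length_mm)
    return aperture * magnification * abs(depth_mm - focus_dist_mm) / depth_mm


# Hypothetical example: 50 mm lens focused at 2 m, scene point at 4 m.
wide_open = coc_diameter_mm(50.0, 2.0, 2000.0, 4000.0)     # f/2
stopped_down = coc_diameter_mm(50.0, 8.0, 2000.0, 4000.0)  # f/8
in_focus = coc_diameter_mm(50.0, 2.0, 2000.0, 2000.0)      # point on the focal plane
```

A pipeline like the one described would evaluate such a quantity per pixel from the estimated depth map and the predicted focus distance; because every operation is differentiable almost everywhere, gradients can pass back to both inputs.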