DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy
Abstract
Although recent text-to-image models achieve high-fidelity text rendering, they still struggle to generate long or multiple texts due to diluted global attention. To address this, we propose DCText, a training-free method for visual text generation that employs a divide-and-conquer strategy, inspired by the reliable short-text generation of Multi-Modal Diffusion Transformer models. To effectively render long or multiple texts, our method first decomposes the global prompt by extracting and dividing the target text, then assigns each decomposed text segment to a designated region for generation. To ensure these text segments are accurately rendered within their regions while preserving overall image coherence, we introduce two attention masks, the Text-Focus Attention Mask and the Context-Expansion Attention Mask, which are applied sequentially during denoising. In addition, our Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive evaluations across diverse datasets, covering both single-sentence and multi-sentence cases, demonstrate that DCText consistently achieves the highest text accuracy without compromising image quality, while also delivering the lowest generation latency.
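To make the scheduling idea concrete, the following is a minimal sketch, not the paper's released implementation: it assumes an MM-DiT-style joint attention over concatenated text and image tokens, and it uses hypothetical helpers (`build_masks`, `scheduled_mask`), hypothetical per-segment token slices and region boxes, and an assumed switch-over ratio to illustrate how a Text-Focus mask could be applied in early denoising steps and a Context-Expansion mask in later ones.

```python
# Illustrative sketch only: mask construction, segment/region assignment, and the
# switch-over step are assumptions, not DCText's official implementation.
import torch

def region_mask(h, w, box):
    """Boolean mask over an h*w image-token grid for a (top, left, bottom, right) box."""
    m = torch.zeros(h, w, dtype=torch.bool)
    t, l, b, r = box
    m[t:b, l:r] = True
    return m.flatten()

def build_masks(n_text_tokens, seg_token_slices, seg_boxes, h, w):
    """Build hypothetical Text-Focus and Context-Expansion attention masks.

    seg_token_slices: per-segment slice into the text tokens.
    seg_boxes: per-segment region on the h x w latent grid.
    Returns (focus, expand) boolean masks of shape [N, N], N = n_text + h*w,
    where True means attention is allowed.
    """
    n = n_text_tokens + h * w
    focus = torch.zeros(n, n, dtype=torch.bool)
    # Text-Focus: each text segment and its assigned region attend only to each other,
    # concentrating attention on rendering that segment inside its region.
    for sl, box in zip(seg_token_slices, seg_boxes):
        img_idx = torch.nonzero(region_mask(h, w, box)).squeeze(1) + n_text_tokens
        txt_idx = torch.arange(sl.start, sl.stop)
        idx = torch.cat([txt_idx, img_idx])
        focus[idx.unsqueeze(1), idx.unsqueeze(0)] = True
    # Context-Expansion: image tokens attend globally to restore overall coherence,
    # while text tokens keep their region binding.
    expand = focus.clone()
    expand[n_text_tokens:, n_text_tokens:] = True
    return focus, expand

def scheduled_mask(step, total_steps, focus, expand, switch_ratio=0.4):
    """Use the Text-Focus mask early in denoising, then switch to Context-Expansion."""
    return focus if step < int(switch_ratio * total_steps) else expand

# Example: two text segments, each bound to half of an 8x8 latent grid.
if __name__ == "__main__":
    slices = [slice(0, 6), slice(6, 12)]
    boxes = [(0, 0, 8, 4), (0, 4, 8, 8)]
    focus, expand = build_masks(12, slices, boxes, 8, 8)
    for t in range(50):
        mask = scheduled_mask(t, 50, focus, expand)
        # attn_bias = torch.where(mask, 0.0, float("-inf"))  # would be fed to joint attention
```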