CraftSVG: Multi-Object Text-to-SVG Synthesis via Layout Guided Diffusion
Abstract
Generating SVGs from text prompts is a challenging vision task, requiring diverse yet realistic depictions of both seen and unseen entities. However, existing research has largely been limited to generating single objects rather than comprehensive scenes comprising multiple elements. In response, CraftSVG introduces an end-to-end framework for creating SVGs that depict entire scenes from textual descriptions. Using a pre-trained LLM to generate layouts from text prompts via iterative in-context learning, CraftSVG introduces a technique for producing masked latents within specified bounding boxes for accurate object placement. It further introduces a fusion mechanism that integrates attention maps and employs a diffusion U-Net for coherent composition, speeding up stroke initialization. Recognizing the importance of abstract SVGs in communication, we incorporate an MLP-based mechanism to simplify the resulting SVGs, combining alignment and perceptual losses with differentiable rendering and opacity modulation to maximize similarity. CraftSVG outperforms previous methods in abstraction, recognizability, and detail, as demonstrated by its performance metrics: CLIP-T of 0.5013, cosine similarity of 0.7091, and an aesthetic score of 7.0779, among others.