Analysis of Text Accuracy and Visual Alignment in Vision-Language Models for Artistic Text Generation
Abstract
Artistic text generation involves rendering textual content in visually creative and contextually appropriate designs, such as floral or geometric patterns. Despite advances in large multimodal generative models such as DALL-E and Qwen-VL, maintaining both text accuracy and aesthetic alignment remains a persistent challenge in this domain. This paper presents an analysis of current Vision-Language Models (VLMs) in generating artistic text, aiming to diagnose performance gaps and guide improvements in text accuracy and multimodal alignment. We propose a comprehensive analysis framework that evaluates artistic text generation along three key dimensions: character-level text accuracy, semantic similarity, and visual-context alignment. To support systematic evaluation, we construct a custom dataset of 1,000 prompts, each specifying a target word and an artistic style. We benchmark DALL-E, Qwen-VL, and Qwen-2.5B on text accuracy by comparing the generated text to the original prompt, and we measure perceptual alignment using CLIP embeddings. The results show that DALL-E outperforms the other models, particularly in the accuracy of the generated text. This study highlights the potential of adopting large generative models for artistic text generation. Furthermore, we present exploratory results using ControlNet as an enhancement module, demonstrating its potential to improve text structure and alignment in generated images. Our findings expose critical limitations in current VLMs and offer actionable insights for advancing multimodal generation architectures, datasets, and evaluation strategies, providing a framework for improving text accuracy and aesthetic coherence in creative applications and paving the way for future multimodal AI systems.
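The following is a minimal sketch of the two automatic metrics referenced above: character-level text accuracy against the prompted word, and CLIP-based prompt-image alignment. It assumes the openai/clip-vit-base-patch32 checkpoint from the Hugging Face transformers library; the target word, prompt, image path, and the use of an external OCR step to recover the rendered text are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the evaluation metrics: character-level accuracy and CLIP alignment.
# Assumes openai/clip-vit-base-patch32; prompt, word, and image path are hypothetical.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

def character_accuracy(target: str, recognized: str) -> float:
    """1 minus the normalized Levenshtein distance between the prompted
    word and the text recognized in the generated image."""
    m, n = len(target), len(recognized)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if target[i - 1] == recognized[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return 1.0 - dp[m][n] / max(m, n, 1)

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment(prompt: str, image: Image.Image) -> float:
    """Cosine similarity between CLIP text and image embeddings."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    v = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return float((t * v).sum(dim=-1))

# Hypothetical usage on one generated sample (OCR step not shown):
# acc = character_accuracy("bloom", ocr_text_from_image)
# sim = clip_alignment("the word 'bloom' in a floral style",
#                      Image.open("generated/bloom_floral.png"))
```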