MBTI: Metric-Based Textual Inversion for Fine-Grained Image Generation
Abstract
Textual Inversion has recently gained attention for generating diverse text-guided images by learning a custom class from just a few reference images. However, the generated images often fail to capture the differences between fine-grained classes with similar visual characteristics. To address this challenge, we propose a novel technique for fine-grained image generation called Metric-Based Textual Inversion (MBTI). MBTI leverages inter-class relationships among reference images of different classes to encode each class's characteristics into a new pseudo-word, enhancing fine-grained image generation. It learns this inter-class information by maximizing the distances between the pseudo-words in the text embedding space, using only a simple selection rule for embeddings and a basic distance metric. Experimental results demonstrate that MBTI successfully generates images of fine-grained classes with the distinct characteristics that are crucial for accurately identifying each class. Used as a data augmentation technique, MBTI's ability to highlight and preserve fine-grained details also significantly improves the performance of fine-grained image classification.
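To make the idea of maximizing distances between pseudo-words concrete, the following is a minimal sketch and not the method as implemented in this work. It assumes Euclidean distance as the "basic distance metric" and, because the abstract does not specify the selection rule, simply averages over all pairs of pseudo-word embeddings. The function names `euclidean` and `separation_loss` and the pseudo-word tokens are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def separation_loss(embeddings):
    """Negative mean pairwise distance between pseudo-word embeddings.

    Minimizing this value pushes the embeddings apart. `embeddings` maps a
    hypothetical pseudo-word token (e.g. "<class-a>") to its vector.
    Assumption: all pairs are used; the actual selection rule may differ.
    """
    pairs = list(combinations(embeddings.values(), 2))
    if not pairs:
        return 0.0  # a single class has nothing to be separated from
    return -sum(euclidean(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D embeddings for two hypothetical fine-grained classes.
toy = {"<class-a>": [0.0, 0.0], "<class-b>": [3.0, 4.0]}
print(separation_loss(toy))
```

In a training loop, a term like this would presumably be added, with some weight, to the usual Textual Inversion reconstruction objective. That combination is also an assumption here, since the abstract does not state the full loss.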