Towards Photorealistic Style Transfer with Multimodal Guidance and Robustness to Content Images in Arbitrary Styles
Abstract
Existing photorealistic style transfer methods fall broadly into two groups: image-guided and text-guided approaches. The image-guided paradigm requires a reference style image and excels when the target style is difficult to describe precisely in text; unfortunately, such references are not always available in practice. The text-guided paradigm offers greater flexibility, but existing text-guided methods often fail to preserve the details of the content image or degrade when the content image itself deviates from an ordinary photographic style. In this paper, we present a novel multimodal-guided photorealistic style transfer framework that supports flexible switching between and fusion of the two modalities while remaining robust to content images in arbitrary styles. Specifically, we adopt a two-stage pipeline: the Style Removal Module first strips the original style from the content image, and the Style Injection Module then applies stylization according to the style guidance (an image, a text prompt, or their fusion). To make the text-guided branch compatible with this pipeline, we propose the Image-Assisted Textual Style Injection (IATSI) strategy. We further design a Dual-Residual Adaptive MLP (DRA-MLP), which offers strong color-mapping capacity while avoiding spatial distortion. Extensive experiments show that our method achieves state-of-the-art (SOTA) performance in both image-guided and text-guided settings. Moreover, we pioneer multimodal fusion-guided photorealistic style transfer and achieve promising results.
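The abstract does not specify the internal structure of the DRA-MLP. As a rough illustration only, the PyTorch sketch below shows one plausible reading: a per-pixel color MLP (implemented with 1x1 convolutions) with two residual paths. All names here (`DualResidualMLP`, `hidden_dim`) are hypothetical and not taken from the paper; because such a mapping is strictly pointwise, it can only recolor pixels, which is one simple way a color-mapping module can be free of spatial distortion by construction.

```python
# Hypothetical sketch of a dual-residual per-pixel color MLP
# (an assumed reading of the DRA-MLP, not the paper's implementation).
import torch
import torch.nn as nn


class DualResidualMLP(nn.Module):
    """Per-pixel color mapping with two residual connections.

    Operates on (B, 3, H, W) RGB tensors pointwise via 1x1 convolutions,
    so pixels are recolored but never moved or mixed spatially.
    """

    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        # Each block is a small MLP applied independently at every pixel.
        self.block1 = nn.Sequential(
            nn.Conv2d(3, hidden_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, 3, kernel_size=1),
        )
        self.block2 = nn.Sequential(
            nn.Conv2d(3, hidden_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, 3, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # First residual path: predict a color offset on top of the input.
        h = x + self.block1(x)
        # Second residual path: refine the result, again as an offset.
        return h + self.block2(h)


# Usage example on a dummy image batch.
if __name__ == "__main__":
    out = DualResidualMLP()(torch.rand(1, 3, 256, 256))
    print(out.shape)  # torch.Size([1, 3, 256, 256])
```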