Efficient Text-Guided Convolutional Adapter for the Diffusion Model
Abstract
We propose Nexus Adapters, a family of efficient, text-guided adapters for diffusion-based Structure-Preserving Conditional Generation (SPCG). Existing structure-preserving methods typically pair a base diffusion model conditioned on a text prompt with a separate adapter for structural inputs such as sketches or depth maps. Such approaches are computationally expensive, however: the adapters are often comparable in size to the base model, making training impractical given the already high cost of diffusion models. Moreover, conventional adapters operate independently of the input prompt, leaving them ill-suited to capturing the semantic alignment between textual and structural cues. To address these limitations, we introduce Nexus Prime and Nexus Slim, two prompt-aware adapters that jointly leverage text and structural inputs. Both are composed of modular Nexus Blocks, which use cross-attention to enable effective multimodal conditioning, allowing the adapters to maintain structural fidelity while aligning more closely with the intended prompt. Experiments show that Nexus Prime significantly improves performance with only 8M parameters more than the minimal T2I baseline. Nexus Slim, a lightweight variant with 18M fewer parameters than T2I, still achieves competitive, state-of-the-art results, validating the efficiency and effectiveness of our proposed design.
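The abstract describes Nexus Blocks as cross-attention modules in which the adapter's structural features are conditioned on the text prompt. The paper does not give the block's exact layout here, so the following is only a minimal NumPy sketch of the generic mechanism: flattened structural features act as queries over prompt-token embeddings, with a residual connection. All projection matrices and dimensions are illustrative stand-ins, not the authors' actual parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_block(struct_feats, text_embeds, d_k=64, seed=0):
    """Sketch of prompt-aware conditioning via cross-attention.

    struct_feats: (n_tokens, d) flattened spatial features from the
        structural input (e.g. a sketch or depth map).
    text_embeds: (n_words, d) prompt token embeddings.

    The random projections below stand in for learned weights; a real
    Nexus Block would use trained parameters and additional layers.
    """
    rng = np.random.default_rng(seed)
    d = struct_feats.shape[-1]
    W_q = rng.standard_normal((d, d_k)) / np.sqrt(d)  # query projection
    W_k = rng.standard_normal((d, d_k)) / np.sqrt(d)  # key projection
    W_v = rng.standard_normal((d, d)) / np.sqrt(d)    # value projection

    Q = struct_feats @ W_q          # (n_tokens, d_k)
    K = text_embeds @ W_k           # (n_words, d_k)
    V = text_embeds @ W_v           # (n_words, d)

    # Scaled dot-product attention: structure attends to the prompt.
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (n_tokens, n_words)

    # Residual connection: prompt-conditioned update of structural features,
    # so structural fidelity is preserved while injecting text semantics.
    return struct_feats + attn @ V
```

In this reading, making the adapter prompt-aware costs only the three projection matrices per block, which is consistent with the small parameter counts (8M extra for Nexus Prime) reported above.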