PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval
Abstract
Zero-shot Composed Image Retrieval (ZS-CIR) enables image search using a reference image and text prompt without requiring specialized text-image composition networks trained on large-scale paired data. However, current ZS-CIR approaches face three critical limitations in their reliance on composed text embeddings: static query embedding representations, insufficient utilization of image embeddings, and suboptimal performance when fusing text and image embeddings. To address these challenges, we introduce the Prompt Directional Vector (PDV), a simple yet effective training-free enhancement that captures semantic modifications induced by user prompts. PDV enables three key improvements: (1) dynamic composed text embeddings where prompt adjustments are controllable via a scaling factor, (2) composed image embeddings through semantic transfer from text prompts to image features, and (3) weighted fusion of composed text and image embeddings that enhances retrieval by balancing visual and semantic similarity. Our approach serves as a plug-and-play enhancement for existing ZS-CIR methods with minimal computational overhead. Extensive experiments across multiple benchmarks demonstrate that PDV consistently improves retrieval performance when integrated with state-of-the-art ZS-CIR approaches, particularly for methods that generate accurate compositional embeddings. The code will be released upon publication.
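The three improvements listed above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the sketch assumes that the PDV is the difference between the composed text embedding and the text embedding of the reference image's description. The function names, the scaling factor `alpha`, and the fusion weight `beta` are hypothetical and chosen only for illustration; this is not the released implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def prompt_directional_vector(composed_text_emb, base_text_emb):
    """Assumed PDV: the shift in text-embedding space induced by the prompt."""
    return [c - b for c, b in zip(composed_text_emb, base_text_emb)]


def shift(base_emb, pdv, alpha):
    """Move a base embedding along the PDV by a scaling factor alpha.

    With a text base this gives a dynamic composed text embedding (1);
    with an image base it gives a composed image embedding (2).
    """
    return normalize([b + alpha * d for b, d in zip(base_emb, pdv)])


def fused_score(candidate_emb, composed_text_emb, composed_image_emb, beta):
    """Weighted fusion (3): beta weights the text-side similarity,
    1 - beta weights the image-side similarity."""
    text_sim = cosine(candidate_emb, composed_text_emb)
    image_sim = cosine(candidate_emb, composed_image_emb)
    return beta * text_sim + (1.0 - beta) * image_sim
```

Under this assumption, `alpha = 0` leaves the base embedding unchanged, and `alpha = 1` applied to the text base recovers the direction of the original composed text embedding, so a baseline ZS-CIR method corresponds to one setting of the scaling factor.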