Crafting Descriptive Information for a Zero-Shot Method to Improve Knowledge-Based Visual Question Answering Performance
Abstract
We present GC-KBVQA, a zero-shot framework for knowledge-based visual question answering (KB-VQA) that requires no additional training. GC-KBVQA leverages pre-trained models together with carefully designed, context-aware descriptive information. The framework integrates three modules: (i) question-guided visual grounding, (ii) semantics-based caption filtering, and (iii) inter-stage feedback. Working together, these modules produce concise, question-relevant prompts while reducing hallucinations and noisy auxiliary text. Despite its lightweight design, GC-KBVQA outperforms strong zero-shot baselines by up to +10.97% on OK-VQA, A-OKVQA, and VQAv2, and approaches the performance of few-shot systems while using no labeled data. The framework is model-agnostic, maintaining effectiveness with minimal degradation across LLMs ranging from TinyLLaMA-1B to Llama3-8B. Ablation studies confirm that grounding, dual-caption generation, intra-stage filtering, and inter-stage feedback each contribute to accuracy gains. By combining efficiency, robustness, and modularity, GC-KBVQA offers a practical and scalable direction for zero-shot KB-VQA.
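To make the three-stage design concrete, the following Python sketch shows one plausible way the modules could be composed into an answering loop. It is a conceptual illustration only: all interfaces, names, and thresholds (ground, caption, relevance, llm, min_relevance, max_retries) are hypothetical placeholders and do not reflect the authors' implementation.

```python
# Conceptual sketch of a three-stage KB-VQA pipeline of the kind described in
# the abstract: grounding -> caption filtering -> prompting, with a simple
# inter-stage feedback loop. All callables are assumed/injected, not real APIs.

from typing import Callable, List


def gc_kbvqa_answer(
    image,
    question: str,
    ground: Callable,        # hypothetical: (image, question) -> question-relevant image region
    caption: Callable,       # hypothetical: (region) -> list of candidate captions
    relevance: Callable,     # hypothetical: (caption, question) -> similarity score in [0, 1]
    llm: Callable,           # hypothetical: (prompt) -> answer string from a frozen LLM
    min_relevance: float = 0.5,
    max_retries: int = 2,
) -> str:
    """Compose grounding, caption filtering, and inter-stage feedback (illustrative only)."""
    for _ in range(max_retries + 1):
        # Stage 1: question-guided visual grounding narrows the image to the
        # region the question actually asks about.
        region = ground(image, question)

        # Stage 2: generate candidate captions for that region and keep only
        # those semantically relevant to the question (intra-stage filtering).
        candidates: List[str] = caption(region)
        kept = [c for c in candidates if relevance(c, question) >= min_relevance]

        if kept:
            # Stage 3: build a concise prompt from the surviving descriptive
            # text and query the frozen LLM.
            context = " ".join(kept)
            prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
            return llm(prompt)

        # Inter-stage feedback: if nothing survives filtering, relax the
        # threshold and re-run the earlier stages instead of prompting the
        # LLM with empty or noisy context.
        min_relevance *= 0.8

    # Fallback: answer from the question alone if no usable context emerged.
    return llm(f"Question: {question}\nAnswer:")
```

In practice the injected callables would wrap pre-trained components (a grounding/detection model, an image captioner, a sentence-similarity scorer, and an LLM), which keeps the pipeline training-free and lets any of the models be swapped without changing the control flow.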