TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning
Abstract
Traditional vision-language models struggle with fine-grained taxonomic reasoning, particularly when distinguishing visually similar species within the same genus or family. We propose a reinforcement learning approach using Group Relative Policy Optimization (GRPO) with intermediate rewards that decompose the reasoning process into hierarchical taxonomic predictions. Our method incentivizes models to explicitly reason about species-level, genus-level, and family-level features before making a final classification. This structured approach is designed not only to improve accuracy but also to yield a transparent, verifiable decision-making process. On the challenging Birds-to-Words dataset, our approach achieves 91.7% accuracy on same-species verification, surpassing human performance (77.3%) while generating interpretable reasoning traces. We further demonstrate cross-domain generalization, with substantial gains on primate verification accompanied by the same explainable traces. These results indicate that structured biological reasoning, enforced through intermediate rewards, provides a powerful framework for fine-grained visual discrimination.
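As a rough illustration of the intermediate reward idea described above, the sketch below assigns partial credit at each taxonomic level and normalizes rewards group-relatively in the style of GRPO. The class names, weights, and the nesting of the credit structure are illustrative assumptions, not the implementation evaluated in the paper.

```python
# Hypothetical sketch of a hierarchical intermediate reward (assumed
# structure; not the paper's exact implementation). Partial credit is
# given for correct family- and genus-level predictions in addition to
# the final species-level answer.
import statistics
from dataclasses import dataclass


@dataclass
class TaxonPrediction:
    family: str
    genus: str
    species: str


def intermediate_reward(pred: TaxonPrediction,
                        gold: TaxonPrediction,
                        w_family: float = 0.2,
                        w_genus: float = 0.3,
                        w_species: float = 0.5) -> float:
    """Scalar reward in [0, 1] crediting each taxonomic level (weights are assumptions)."""
    reward = 0.0
    if pred.family == gold.family:
        reward += w_family
        # Genus credit only applies within the correct family.
        if pred.genus == gold.genus:
            reward += w_genus
            if pred.species == gold.species:
                reward += w_species
    return reward


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each sampled response's reward
    against the mean and standard deviation of its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]


# Example: correct family and genus but wrong species yields partial credit.
pred = TaxonPrediction("Fringillidae", "Spinus", "Spinus pinus")
gold = TaxonPrediction("Fringillidae", "Spinus", "Spinus tristis")
print(intermediate_reward(pred, gold))                 # 0.5
print(group_relative_advantages([0.5, 1.0, 0.2, 0.5]))  # per-sample advantages
```

Under this assumed scheme, a rollout that identifies the right genus but the wrong species still receives a stronger learning signal than one that is wrong at every level, which is the intuition behind rewarding intermediate taxonomic steps rather than only the final verification answer.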