Oral Session
Oral Session 2A: Vision+Language and Other Modalities I
MageBench: Bridging Large Multimodal Models to Agents
Miaosen Zhang ⋅ Qi Dai ⋅ Yifan Yang ⋅ Jianmin Bao ⋅ Dongdong Chen ⋅ Kai Qiu ⋅ Chong Luo ⋅ Xin Geng ⋅ Baining Guo
Recent models like OpenAI's O1 and DeepSeek's R1, which utilize test-time scaling techniques, have demonstrated remarkable improvements in reasoning capabilities. We anticipate that in the near future, multimodal models will also experience significant breakthroughs in multimodal reasoning. This will require some highly challenging and specialized evaluations. As one of the most crucial real-world applications of multimodal models, visual agents require complex and comprehensive capabilities such as spatial planning and vision-in-the-chain type reasoning. These capabilities are currently lacking in existing multimodal benchmarks. In this paper, we introduce MageBench, a Multimodal reasoning benchmark built upon light-weight AGEnt environments that pose significant reasoning challenges and hold substantial practical value. The results show that only a few product-level models are better than random acting, and all of them are far inferior to human level. We analyze and summarize their errors and capability gaps in visual planning. Furthermore, we found that rule-based RL can significantly boost visual reasoning capabilities. This highlights that our benchmark could serve as a valuable testing ground for the emerging field of agentic RL research.
You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction
Logan Lawrence ⋅ Oindrila Saha ⋅ Megan Wei ⋅ Chen Sun ⋅ Subhransu Maji ⋅ Grant Horn
Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of autoregressive models remains a persistent challenge. Most existing works focus on language-only tasks or don't consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.
InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation
Sreehari Rajan ⋅ Kunal Bhosikar ⋅ Charu Sharma
Generating realistic human motions that naturally respond to both spoken language and physical objects is crucial for interactive digital experiences. Current methods, however, address speech-driven gestures or object interactions independently, limiting real-world applicability due to a lack of integrated, comprehensive datasets. To overcome this, we introduce InteracTalker, a novel framework that seamlessly integrates prompt-based object-aware interactions with co-speech gesture generation. We achieve this by employing a multi-stage training process to learn a unified motion, speech, and prompt embedding space. To support this, we curate a rich human-object interaction dataset, formed by augmenting an existing text-to-motion dataset with detailed object interaction annotations. Our framework utilizes a Generalized Motion Adaptation Module that enables independent training, adapting to the corresponding motion condition, which is then dynamically combined during inference. To address the imbalance between heterogeneous conditioning signals, we propose an adaptive fusion strategy, which dynamically reweights the conditioning signals during diffusion sampling. InteracTalker successfully unifies these previously separate tasks, outperforming prior methods, including gesture-focused diffusion methods, in both co-speech gesture generation and object-interaction synthesis, and yielding realistic, object-aware full-body motions with enhanced flexibility and control.
ITSELF: Attention Guided Fine-Grained Alignment for Vision–Language Retrieval
TIEN-HUY NGUYEN ⋅ Huu-Loc Tran ⋅ Thanh Ngo
Vision–language models (VLMs) have rapidly advanced and show strong promise for text-based person search (TBPS), a task that requires capturing fine-grained relationships between images and text to distinguish individuals. Previous methods address these challenges through local alignment, yet they are often prone to shortcut learning and spurious correlations, yielding misalignment. Moreover, injecting prior knowledge can distort intra-modality structure. Motivated by our observation that encoder attention surfaces spatially precise evidence from the earliest training epochs and to alleviate these issues, we introduce ITSELF, an attention-guided framework for implicit local alignment. At its core, Guided Representation with Attentive Bank (GRAB) converts the model’s own attention into an Attentive Bank of high-saliency tokens and applies local objectives on this bank, learning fine-grained correspondences without extra supervision. To make the selection reliable and non-redundant, we introduce Multi-Layer Attention for Robust Selection (MARS), which aggregates attention across layers and performs diversity-aware top-k selection; and Adaptive Token Scheduler (ATS), which schedules the retention budget from coarse to fine over training, preserving context early while progressively focusing on discriminative details. Extensive experiments on three widely used TBPS benchmarks show state-of-the-art performance and strong cross-dataset generalization, confirming the effectiveness and robustness of our approach without additional prior supervision.
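The selection-and-scheduling idea in the ITSELF abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how attention is aggregated, how diversity is enforced, or what shape the schedule takes. The sketch assumes mean aggregation across layers, plain top-k with no diversity term, and a linear schedule. The function names (aggregate_attention, retention_budget, select_tokens) are hypothetical.

```python
def aggregate_attention(layer_attn):
    """Average per-token attention scores across layers.

    layer_attn: list of layers, each a list of per-token scores.
    Mean aggregation is an assumption made for this sketch.
    """
    num_layers = len(layer_attn)
    num_tokens = len(layer_attn[0])
    return [
        sum(layer[t] for layer in layer_attn) / num_layers
        for t in range(num_tokens)
    ]


def retention_budget(step, total_steps, k_start, k_end):
    """Number of tokens to keep at a given training step.

    Moves linearly from k_start (coarse, more context) to k_end
    (fine, fewer tokens). The linear shape is an assumption.
    """
    frac = step / max(total_steps - 1, 1)
    frac = min(max(frac, 0.0), 1.0)  # clamp to [0, 1]
    return round(k_start + frac * (k_end - k_start))


def select_tokens(layer_attn, k):
    """Return indices of the k highest-scoring tokens, in token order.

    Plain top-k only; no diversity-aware step is modeled here.
    """
    scores = aggregate_attention(layer_attn)
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


# Toy usage: two layers, four tokens, budget shrinking from 4 to 2.
toy_attn = [[0.1, 0.5, 0.2, 0.2], [0.3, 0.3, 0.1, 0.3]]
for step in range(5):
    k = retention_budget(step, total_steps=5, k_start=4, k_end=2)
    kept = select_tokens(toy_attn, k)
```

In this toy setup every token is kept early in training, and the kept set narrows to the highest-attention tokens as the budget shrinks, which mirrors the coarse-to-fine behavior the abstract describes.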
MarineEval: Assessing the Marine Intelligence of Vision-Language Models
Yuk Kwan Wong ⋅ Tuan-An To ⋅ Jipeng Zhang ⋅ Ziqiang Zheng ⋅ Sai-Kit Yeung
We have witnessed promising progress led by large language models (LLMs) and, subsequently, visual language models (VLMs) in handling various queries as general-purpose assistants. VLMs, as a bridge connecting the visual world and language corpora, receive both visual content and various text-only user instructions to generate corresponding responses. Though great success has been achieved by VLMs in various fields, in this work, we ask whether existing VLMs can act as domain experts, accurately answering marine questions, which require significant domain expertise and address special domain challenges/requirements. To comprehensively evaluate the effectiveness and explore the boundary of existing VLMs, we construct the first large-scale marine VLM dataset and benchmark called MarineEval, with 2,000 image-based question-answering pairs. During our dataset construction, we ensure the diversity and coverage of the constructed data: 7 task dimensions and 20 capacity dimensions. The domain requirements are specially integrated into the data construction and further verified by the corresponding marine domain experts. We comprehensively benchmark 13 existing VLMs on our MarineEval and also investigate the limitations of existing models in answering marine research questions. The experimental results reveal that existing VLMs cannot effectively answer the domain-specific questions, and there is still substantial room for further performance improvements. We hope our new benchmark and observations will facilitate future research.