Track: Poster Session 2 + Refreshments

1

MageBench: Bridging Large Multimodal Models to Agents

Miaosen Zhang ⋅ Qi Dai ⋅ Yifan Yang ⋅ Jianmin Bao ⋅ Dongdong Chen ⋅ Kai Qiu ⋅ Chong Luo ⋅ Xin Geng ⋅ Baining Guo

Recent models like OpenAI's O1 and DeepSeek's R1, which utilize test-time scaling techniques, have demonstrated remarkable improvements in reasoning capabilities. We anticipate that in the near future, multimodal models will also experience significant breakthroughs in multimodal reasoning. This will require highly challenging and specialized evaluations. As one of the most crucial real-world applications of multimodal models, visual agents require complex and comprehensive capabilities such as spatial planning and vision-in-the-chain type reasoning. These capabilities are currently not covered by existing multimodal benchmarks. In this paper, we introduce MageBench, a Multimodal reasoning benchmark built upon light-weight AGEnt environments that pose significant reasoning challenges and hold substantial practical value. The results show that only a few product-level models perform better than acting randomly, and all of them fall far short of human-level performance. We analyze and summarize their errors and capability gaps in visual planning. Furthermore, we find that rule-based RL can significantly boost visual reasoning capabilities. This highlights that our benchmark could serve as a valuable testing ground for the emerging field of agentic RL research.

2

You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction

Logan Lawrence ⋅ Oindrila Saha ⋅ Megan Wei ⋅ Chen Sun ⋅ Subhransu Maji ⋅ Grant Horn

Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of autoregressive models remains a persistent challenge. Most existing works focus on language-only tasks or do not consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC), where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.

3

InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation

Sreehari Rajan ⋅ Kunal Bhosikar ⋅ Charu Sharma

Generating realistic human motions that naturally respond to both spoken language and physical objects is crucial for interactive digital experiences. Current methods, however, address speech-driven gestures or object interactions independently, limiting real-world applicability due to a lack of integrated, comprehensive datasets. To overcome this, we introduce InteracTalker, a novel framework that seamlessly integrates prompt-based object-aware interactions with co-speech gesture generation. We achieve this by employing a multi-stage training process to learn a unified motion, speech, and prompt embedding space. To support this, we curate a rich human-object interaction dataset, formed by augmenting an existing text-to-motion dataset with detailed object interaction annotations. Our framework utilizes a Generalized Motion Adaptation Module that enables independent training, adapting to the corresponding motion condition, which is then dynamically combined during inference. To address the imbalance between heterogeneous conditioning signals, we propose an adaptive fusion strategy, which dynamically reweights the conditioning signals during diffusion sampling. InteracTalker successfully unifies these previously separate tasks. It outperforms prior methods in both co-speech gesture generation and object-interaction synthesis, as well as gesture-focused diffusion methods, yielding highly realistic, object-aware full-body motions with enhanced flexibility and control.

4

ITSELF: Attention Guided Fine-Grained Alignment for Vision–Language Retrieval

TIEN-HUY NGUYEN ⋅ Huu-Loc Tran ⋅ Thanh Ngo

Vision–language models (VLMs) have rapidly advanced and show strong promise for text-based person search (TBPS), a task that requires capturing fine-grained relationships between images and text to distinguish individuals. Previous methods address these challenges through local alignment, yet they are often prone to shortcut learning and spurious correlations, yielding misalignment. Moreover, injecting prior knowledge can distort intra-modality structure. Motivated by our observation that encoder attention surfaces spatially precise evidence from the earliest training epochs, and to alleviate these issues, we introduce ITSELF, an attention-guided framework for implicit local alignment. At its core, Guided Representation with Attentive Bank (GRAB) converts the model’s own attention into an Attentive Bank of high-saliency tokens and applies local objectives on this bank, learning fine-grained correspondences without extra supervision. To make the selection reliable and non-redundant, we introduce Multi-Layer Attention for Robust Selection (MARS), which aggregates attention across layers and performs diversity-aware top-k selection; and Adaptive Token Scheduler (ATS), which schedules the retention budget from coarse to fine over training, preserving context early while progressively focusing on discriminative details. Extensive experiments on three widely used TBPS benchmarks show state-of-the-art performance and strong cross-dataset generalization, confirming the effectiveness and robustness of our approach without additional prior supervision.

5

MarineEval: Assessing the Marine Intelligence of Vision-Language Models

Yuk Kwan Wong ⋅ Tuan-An To ⋅ Jipeng Zhang ⋅ Ziqiang Zheng ⋅ Sai-Kit Yeung

We have witnessed promising progress led by large language models (LLMs) and, more recently, visual language models (VLMs) in handling various queries as general-purpose assistants. VLMs, as a bridge connecting the visual world and language corpora, receive both visual content and various text-only user instructions to generate corresponding responses. Though VLMs have achieved great success in various fields, in this work we ask whether existing VLMs can act as domain experts, accurately answering marine questions, which require significant domain expertise and address special domain challenges and requirements. To comprehensively evaluate the effectiveness and explore the boundary of existing VLMs, we construct the first large-scale marine VLM dataset and benchmark, called MarineEval, with 2,000 image-based question-answering pairs. During dataset construction, we ensure the diversity and coverage of the constructed data: 7 task dimensions and 20 capacity dimensions. The domain requirements are specially integrated into the data construction and further verified by the corresponding marine domain experts. We comprehensively benchmark 13 existing VLMs on MarineEval and also investigate the limitations of existing models in answering marine research questions. The experimental results reveal that existing VLMs cannot effectively answer the domain-specific questions, and that there is still substantial room for improvement. We hope our new benchmark and observations will facilitate future research.

6

Identity Verification from Human Scent using Channel Representation of 2D Gas Chromatography-Mass Spectrometry Data

Radim Spetlik ⋅ Jan Hlavsa ⋅ Jana Čechová ⋅ Petra Pojmanová ⋅ Jiri Matas ⋅ Štěpán Urban

This study examines the feasibility of employing raw two-dimensional gas chromatography/time-of-flight mass spectrometry (GCxGC ToF-MS) data for the purpose of human scent identity verification. Unlike techniques that require expert-driven identification of compounds, our framework transforms each GCxGC sample into a multi-channel image. A comprehensive assessment has been conducted on ten channel-encoding schemes, five spatial-alignment strategies, and ten feature-embedding methods. The evaluation is performed on a newly assembled dataset of 252 individuals, comprising 2,528 raw samples and aggregating around 7.5 TB of data. In contrast to conventional methodologies employed in chemical analysis, our research demonstrates that alignment to a common spatial reference frame is unnecessary. The best performing method reaches an approximately 53% true positive rate at a 5% false positive rate. Although this performance is below that of well-established biometrics (e.g., iris verification), our results underscore the feasibility of raw-odor-based verification for scenarios where direct line-of-sight or cooperation may be limited, thereby revealing opportunities for interdisciplinary research. We will release the code and datasets with the camera-ready version.
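
The operating point quoted above (true positive rate at a fixed false positive rate) can be computed from verification scores roughly as follows. This is a minimal sketch, not the authors' evaluation code: the function name `tpr_at_fpr`, the score lists, and the convention that a pair is accepted when its similarity score is strictly above the threshold are all assumptions made for illustration.

```python
def tpr_at_fpr(genuine_scores, impostor_scores, target_fpr=0.05):
    """Return (threshold, tpr, fpr) at the lowest threshold whose FPR <= target_fpr.

    genuine_scores:  similarity scores of same-identity pairs (hypothetical input)
    impostor_scores: similarity scores of different-identity pairs (hypothetical input)
    A pair is accepted when its score is strictly greater than the threshold.
    """
    best = None
    # Candidate thresholds are the observed scores, visited from strictest to loosest.
    candidates = sorted(set(genuine_scores) | set(impostor_scores), reverse=True)
    for thr in candidates:
        fpr = sum(s > thr for s in impostor_scores) / len(impostor_scores)
        if fpr > target_fpr:
            break  # FPR only grows as the threshold drops, so stop here
        tpr = sum(s > thr for s in genuine_scores) / len(genuine_scores)
        best = (thr, tpr, fpr)
    return best


# Toy scores, invented purely for illustration.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.75, 0.5] + [0.1] * 18
threshold, tpr, fpr = tpr_at_fpr(genuine, impostor, target_fpr=0.05)
```

Lowering the threshold raises both rates together, so the sketch keeps the loosest threshold that still respects the false positive budget; that threshold gives the highest true positive rate available at that budget.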

7

milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion

Niraj Prakash Kini ⋅ Shiau-Rung Tsai ⋅ Guan-Hsun Lin ⋅ Wen-Hsiao Peng ⋅ Ching-Wen Ma ⋅ Jenq-Neng Hwang

Millimeter-wave radar offers a privacy-preserving and lighting-invariant alternative to RGB sensors for the Human Pose Estimation (HPE) task. However, radar signals are often sparse due to specular reflection, making the extraction of robust features highly challenging. To address this, we present milliMamba, a radar-based 2D human pose estimation framework that jointly models spatio-temporal dependencies across both the feature extraction and decoding stages. Specifically, given the high dimensionality of radar inputs, we adopt a Cross-View Fusion Mamba encoder to efficiently extract spatio-temporal features from longer sequences with linear complexity. A Spatio-Temporal-Cross Attention decoder then predicts joint coordinates across multiple frames. Together, this spatio-temporal modeling pipeline enables the model to leverage contextual cues from neighboring frames and joints to infer missing joints caused by specular reflections. To reinforce motion smoothness, we incorporate a velocity loss alongside the standard keypoint loss during training. Experiments on the TransHuPR and HuPR datasets demonstrate that our method achieves significant performance improvements, exceeding the baselines by 11.0 AP and 14.6 AP, respectively, while maintaining reasonable complexity. Our code will be released upon publication.
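
One common way to combine a keypoint loss with a velocity term is sketched below. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the mean-squared-error form of both terms, the `velocity_weight` parameter, and the `[frames][joints][(x, y)]` nested-list layout.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def flatten(seq):
    """Flatten a [frames][joints][2] nested sequence into one flat list."""
    return [c for frame in seq for joint in frame for c in joint]


def velocities(seq):
    """Per-joint displacement between consecutive frames."""
    return [
        [(j1[0] - j0[0], j1[1] - j0[1]) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]


def pose_loss(pred, target, velocity_weight=1.0):
    """Keypoint MSE plus a weighted MSE on frame-to-frame joint velocities."""
    keypoint = mse(flatten(pred), flatten(target))
    velocity = mse(flatten(velocities(pred)), flatten(velocities(target)))
    return keypoint + velocity_weight * velocity


# One joint moving right by one unit per frame (toy data).
target = [[(0.0, 0.0)], [(1.0, 0.0)], [(2.0, 0.0)]]
# Same motion with a constant offset: position error, but no velocity error.
shifted = [[(x + 1.0, y + 1.0) for x, y in frame] for frame in target]
loss = pose_loss(shifted, target)
```

In this sketch, a constant positional offset is penalized only by the keypoint term, while frame-to-frame jitter is additionally penalized by the velocity term, which is how such a term encourages temporally smooth predictions.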

8

OpenCowID: Zero-Shot Visual Identification of Dairy Cows

Omkar Prabhune ⋅ Younghyun Kim

Accurate identification of individual cows is essential to precision dairy farming. While computer vision offers a non-invasive alternative to ear tags and RFID systems, its practical deployment remains limited by the need for zero-shot identification in dynamic herds where test identities are unseen during training. In this work, we propose OpenCowID, a unified framework that addresses this challenge. First, we introduce a stochastic cow coat synthesis pipeline that efficiently generates large-scale, diverse images. Second, using the generated large-scale high-quality data, we present a centroid-guided feature learning strategy that forms a well-structured embedding space using virtual class centroids, enabling generalization to unseen identities. OpenCowID achieves state-of-the-art zero-shot and open-set identification on real-world cow benchmarks, without requiring any real labeled training data. This work contributes to the advancement of automated livestock monitoring, enabling robust, non-invasive identification. The code for reproducing our results is provided in the supplementary material.

9

QCFace: Image Quality Control for boosting Face Representation & Recognition

Duc-Phuong Doan-Ngo ⋅ Thanh-Dang Diep ⋅ Thanh Nguyen-Duc ⋅ Thanh-Sach LE ⋅ Nam Thoai

Recognizability, a key perceptual factor in human face processing, strongly affects the performance of face recognition (FR) systems in both verification and identification. Effectively using recognizability to enhance feature representation remains challenging. In deep FR, the loss function plays a crucial role in shaping how features are embedded. However, current methods have two main drawbacks: (i) recognizability is only partially captured through soft margin constraints, resulting in weaker quality representation and lower discrimination, especially for low-quality or ambiguous faces; (ii) mutual overlapping gradients between feature direction and magnitude introduce undesirable interactions during optimization, causing instability and confusion in hypersphere planning, which may result in poor generalization and entangled representations where recognizability and identity are not cleanly separated. To address these issues, we introduce a hard-margin strategy, Quality Control Face (QCFace), that overcomes the mutual overlapping gradient problem and enables clear decoupling of recognizability from identity representation. Based on this strategy, a novel hard-margin-based loss function employs a guidance factor for hypersphere planning, simultaneously optimizing for recognition ability and explicit recognizability representation. Extensive experiments confirm that QCFace not only provides robust and quantifiable recognizability encoding but also achieves state-of-the-art performance in both verification and identification benchmarks compared to existing recognizability-based losses.

10

MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions

Kaen Kazawa (Kogashi) ⋅ Anoop Cherian ⋅ Meng-Yu Jennifer Kuo

Real‑world scenes often feature multiple humans interacting with multiple objects in ways that are causal, goal‑oriented, or cooperative. Yet existing 3D human-object interaction (HOI) benchmarks consider only a fraction of these complex interactions. To close this gap, we present MMHOI, a large-scale Multi-human Multi-object Interaction dataset consisting of images from 12 everyday scenarios. MMHOI offers complete 3D shape and pose annotations for every person and object, along with labels for 78 action categories and 14 interaction‑specific body parts, providing a comprehensive testbed for next-generation HOI research. Building on MMHOI, we present MMHOI-Net, an end-to-end transformer-based neural network for jointly estimating human–object 3D geometries, their interactions, and associated actions. A key innovation in our framework is a structured dual-patch representation for modeling objects and their interactions, combined with action recognition to enhance the interaction prediction. Experiments on MMHOI and the recently proposed CORE4D datasets demonstrate that our approach achieves state-of-the-art performance in multi-HOI modeling, excelling in both accuracy and reconstruction quality.

11

BrightRate: Quality Assessment for User-Generated HDR Videos

Shreshth Saini ⋅ Bowen Chen ⋅ Yilin Wang ⋅ Neil Birkbeck ⋅ Balu Adsumilli ⋅ Alan Bovik

High Dynamic Range (HDR) videos offer superior luminance and color fidelity compared to Standard Dynamic Range (SDR) content. The rapid growth of User-Generated Content (UGC) on platforms such as YouTube, Instagram, and TikTok has brought a significant increase in the volume of streamed and shared UGC videos. This newer category of videos brings new challenges to the development of effective No-Reference (NR) video quality assessment (VQA) models specialized to HDR UGC, because of the extreme variety and severity of distortions arising from diverse capture, editing, and processing outcomes. Towards addressing this issue, we introduce BrightVQ, a sizeable new psychometric data resource. It is the first large-scale subjective video quality database dedicated to the quality modelling of HDR UGC videos. BrightVQ comprises 2,100 videos, on which we collected 73,794 perceptual quality ratings. Using this dataset, we also developed BrightRate, a novel video quality prediction model designed to capture UGC-specific distortions that coexist with HDR-specific artifacts. Extensive experimental results demonstrate that BrightRate achieves state-of-the-art performance across HDR databases. Project page: https://brightvqa.github.io/BrightVQ/

12

Reviving Unsupervised Optical Flow: Concept Reevaluation, Multi-Scale Advances and Full Open-Source Release

Azin Jahedi ⋅ Marc Rivinius ⋅ Noah Senn ⋅ Andres Bruhn

Unsupervised optical flow methods have become more popular in the last decade, enabling the training of models across domains without ground truth data. Although RAFT and its successors have achieved significant success in supervised settings, many unsupervised approaches continue to use older backbones such as PWC-Net. One reason for this architectural stagnation is that the current RAFT-based SOTA approach has proven challenging for the community to reproduce. In this paper, we revive and advance unsupervised optical flow. First, we introduce Sun-RAFT: a simple unsupervised RAFT. Second, building on Sun-RAFT, we present Muun-RAFT: a novel multi-scale unsupervised RAFT, where we propose a gradual context-based upsampling to refine the flow, further improving both accuracy and preservation of details. Third, we reexamine previously advised unsupervised strategies to identify effective training settings. In terms of results, both our methods demonstrate strong generalization capabilities and set a new SOTA for unsupervised two-frame approaches on MPI-Sintel, with Muun-RAFT surpassing even the current multi-frame SOTA by up to 28%. Finally, we open-source our PyTorch code, enabling further developments in the field: https://cv-stuttgart.github.io/Reviving-Unsupervised-OpticalFlow.

13

UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations

Debabrata Mandal ⋅ Soumitri Chattopadhyay ⋅ Guansen Tong ⋅ Praneeth Chakravarthula

Image restoration is essential for enhancing degraded images across computer vision tasks. However, most existing methods address only a single type of degradation (e.g., blur, noise, or haze) at a time, limiting their real-world applicability where multiple degradations often occur simultaneously. In this paper, we propose UniCoRN, a unified image restoration approach capable of handling multiple degradation types simultaneously using a multi-head diffusion model. Specifically, we uncover the potential of low-level visual cues extracted from images in guiding a controllable diffusion model for real-world image restoration, and design a multi-head control network adaptable via a mixture-of-experts strategy. We train our model without any prior assumption of specific degradations, through a carefully designed curriculum learning recipe. Additionally, we introduce MetaRestore, a metalens imaging benchmark containing images with multiple degradations and artifacts. Extensive evaluations on several challenging datasets, including our benchmark, demonstrate that our method achieves significant performance gains and can robustly restore images with severe degradations. Our code and datasets will be open-sourced upon acceptance.

14

DRWKV: Focusing on Object Edges for Low-Light Image Enhancement

Xuecheng Bai ⋅ Yuxiang Wang ⋅ Boyu Hu ⋅ Qinyuan Jie ⋅ Chuanzhi Xu ⋅ Kechen Li ⋅ Hongru Xiao ⋅ Yuk Chung

Low-light image enhancement remains a challenging task, particularly in preserving object edge continuity and fine structural details under extreme illumination degradation. In this paper, we propose a novel model, DRWKV (Detailed Receptance Weighted Key Value). First, it integrates our proposed Global Edge Retinex (GER) theory, enabling effective decoupling of illumination and edge structures for enhanced edge fidelity. Second, we introduce Evolving WKV Attention, a spiral-scanning mechanism that captures spatial edge continuity and models irregular structures more effectively. Third, we design the Bilateral Spectrum Aligner (Bi-SAB) and a tailored MS²-Loss to jointly align luminance and chrominance features, improving visual naturalness and mitigating artifacts. Extensive experiments on five LLIE benchmarks demonstrate that DRWKV achieves leading performance in PSNR, SSIM, and NIQE while maintaining low computational complexity. Furthermore, DRWKV enhances downstream performance in low-light multi-object tracking tasks, validating its generalization capabilities.

15

Layout Anything: One Transformer for Universal Room Layout Estimation

Md Sohag Mia ⋅ Muhammad Abdullah Adnan

We present Layout Anything, a transformer-based framework for indoor layout estimation that adapts OneFormer's universal segmentation architecture to geometric structure prediction. Our approach integrates OneFormer's task-conditioned queries and contrastive learning with two key modules: (1) a layout degeneration strategy that augments training data while preserving Manhattan-world constraints through topology-aware transformations, and (2) differentiable geometric losses that directly enforce planar consistency and sharp boundary predictions during training. By unifying these components in an end-to-end framework, the model eliminates complex post-processing pipelines while achieving real-time inference at 114 ms. Extensive experiments demonstrate state-of-the-art performance across standard benchmarks, with a pixel error (PE) of 5.43% and a corner error (CE) of 4.02% on the LSUN dataset, and a PE of 7.04% (CE 5.17%) on the Hedau dataset. The framework's combination of geometric awareness and computational efficiency makes it particularly suitable for augmented reality applications and large-scale 3D scene reconstruction tasks.

16

BOP-Distrib: Revisiting 6D Pose Estimation Benchmarks for Better Evaluation under Visual Ambiguities

Boris Meden ⋅ Asma Brazi ⋅ Fabrice Mayran de Chamisso ⋅ Steve Bourgeois ⋅ Vincent Lepetit

6D pose estimation aims at determining the object pose that best explains the camera observation. The unique solution for non-ambiguous objects can turn into a multi-modal pose distribution for symmetrical objects or when occlusions of symmetry-breaking elements happen, depending on the viewpoint. Currently, 6D pose estimation methods are benchmarked on datasets that consider, for their ground truth annotations, visual ambiguities as only related to global object symmetries, whereas they should be defined per-image to account for the camera viewpoint. We thus first propose an automatic method to re-annotate those datasets with a 6D pose distribution specific to each image, taking into account the object surface visibility in the image to correctly determine the visual ambiguities. Second, given this improved ground truth, we re-evaluate the state-of-the-art single pose methods and show that this greatly modifies the ranking of these methods. Third, as some recent works focus on estimating the complete set of solutions, we derive a precision/recall formulation to evaluate them against our image-wise distribution ground truth, making it the first benchmark for pose distribution methods on real images.

17

Cosine Similarity is Almost All You Need (for Prototypical-Part Models)

Luke Moffett ⋅ Frank Willard ⋅ Maximillian Machado ⋅ Emmanuel Mokel ⋅ Jon Donnelly ⋅ Zhicheng Guo ⋅ Adam Costarino ⋅ Julia Yang ⋅ Giyoung Kim ⋅ Alina Barnett ⋅ Cynthia Rudin

Prototypical-part networks are a popular interpretable alternative to black-box deep learning models for computer vision because of their faithful, prototype-based self-explanations. However, in practice, they have proven difficult to train because they are highly sensitive to hyperparameter tuning, and difficult to comprehend because they contain a large number of prototypes. We show that replacing $\ell_2$ distance with an angular prototype similarity in the original ProtoPNet greatly improves robustness to hyperparameter selection and is sufficient to produce accuracy and sparsity competitive with state-of-the-art on many backbones and datasets. We also show cosine similarity leads to superior accuracy for five different ProtoPNet architectures (ProtoPNet, TesNet, Deformable ProtoPNet, ProtoTree, and ST-ProtoPNet). Finally, we demonstrate ProtoPNet with cosine similarity produces better semantics than $\ell_2$: prototypes from cosine models score better on prototype quality metrics and are perceived as more similar 3:2 in a user study.
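
One property that separates the two measures is that cosine similarity ignores feature magnitude, whereas $\ell_2$ distance does not. The sketch below shows only that general property on toy 2D vectors; it is not the ProtoPNet implementation, and the vectors and function names are invented for the example.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


prototype = [1.0, 0.0]
same_direction = [3.0, 0.0]  # same direction as the prototype, larger magnitude
orthogonal = [0.0, 1.0]      # unrelated direction, same magnitude

# Cosine similarity ranks the same-direction patch first (1.0 vs 0.0).
cos_same = cosine_similarity(prototype, same_direction)
cos_orth = cosine_similarity(prototype, orthogonal)

# L2 distance ranks the orthogonal patch as closer (about 1.41 vs 2.0).
l2_same = l2_distance(prototype, same_direction)
l2_orth = l2_distance(prototype, orthogonal)
```

In this toy case the two measures disagree about which patch best matches the prototype, purely because of activation magnitude.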

18

Orca: Object Recognition and Comprehension for Archiving Marine Species

Yuk Kwan Wong ⋅ Liang Haixin ⋅ Zeyu Ma ⋅ Yiwei Chen ⋅ Ziqiang Zheng ⋅ Rinaldi Gotama ⋅ Pascal Sebastian ⋅ Lauren Sparks ⋅ Sai-Kit Yeung

Marine visual understanding is essential for monitoring and protecting marine ecosystems, enabling automatic and scalable biological surveys. However, progress is hindered by limited training data and the lack of a systematic task formulation that aligns domain-specific marine challenges with well-defined computer vision tasks, thereby limiting effective model application. To address this gap, we present Orca, a multi-modal benchmark for marine research comprising 14,647 images from 478 species, with 42,217 bounding box annotations and 22,321 expert-verified instance captions. The dataset provides fine-grained visual and textual annotations that capture morphology-oriented attributes across diverse marine species. To catalyze methodological advances, we evaluate 18 state-of-the-art models on three tasks: object detection (closed-set and open-vocabulary), instance captioning, and visual grounding. Results highlight key challenges, including species diversity, morphological overlap, and specialized domain demands, underscoring the difficulty of marine understanding. Orca thus establishes a comprehensive benchmark to advance research in the marine domain.

19

Multimodal Medical Image Binding via Shared Text Embeddings

Yunhao Liu ⋅ Suyang Xi ⋅ Shiqi Liu ⋅ Hong Ding ⋅ Chicheng Jin ⋅ Zhong Chong ⋅ Junjun He ⋅ Catherine Liu ⋅ Yiqing Shen

Medical image analysis increasingly relies on the integration of multiple imaging modalities to capture complementary anatomical and functional information, enabling more accurate diagnosis and treatment planning. Achieving aligned feature representations across these diverse modalities is therefore important for effective multimodal analysis. While contrastive language-image pre-training (CLIP) and its variants have enabled image-text alignment, they require explicitly paired data between any two modalities, which is difficult to acquire in medical contexts. To address this gap, we present Multimodal Medical Image Binding with Text (M$^3$Bind), a novel pre-training framework that enables seamless alignment of multiple medical imaging modalities through a shared text representation space without requiring explicit paired data between any two medical image modalities. Specifically, based on the insight that different images can naturally bind with text, M$^3$Bind first fine-tunes pre-trained CLIP-like image-text models, which are derived from different medical modalities, to align their modality-specific text embedding spaces while preserving their original image-text alignments. Subsequently, we distill these modality-specific text encoders into a unified model, creating a shared text embedding space. Notably, M$^3$Bind is a flexible framework in which the selection of CLIP-like models is not fixed and can be adapted according to the requirements of the task. Experiments on X-ray, CT, retina, ECG, and pathological images on multiple downstream tasks demonstrate that M$^3$Bind achieves competitive or even superior performance in zero-shot classification, few-shot classification, and cross-modal retrieval tasks compared to its CLIP-like counterparts. These results validate M$^3$Bind's effectiveness in achieving cross-image-modal alignment for medical analysis.

20

PHYSPLAT: a Framework for Photorealistic Hybrid Simulation of Real and Synthetic Elements using 3D Gaussian Splatting

Mario Alfonso-Arsuaga ⋅ Henar Dominguez-Elvira ⋅ Jorge Guerrero ⋅ Andrea Castiella-Aguirrezabala ⋅ Lorenzo Domínguez ⋅ Jorge García-González ⋅ Maria Naranjo-Almeida ⋅ Marc Comino-Trinidad ⋅ Jorge Lopez-Moreno

We present an integrated, end-to-end system that enables photorealistic real-world objects, reconstructed using 3D Gaussian Splatting (3DGS), to interact seamlessly with synthetic elements such as polygonal meshes, fluids, fabrics, and robotic systems within a unified simulation environment. By leveraging Material Point Method (MPM) simulation, our system ensures the compatibility of the 3DGS representation with established physics engines while remaining extensible. Our workflow begins by capturing real-world scenes “in the wild” using 3DGS, from which we derive a simplified, appearance-agnostic particle proxy suitable for physics simulation. These particles, along with synthetic primitives, are imported into the system, where the simulator computes positions and deformation gradients for all bodies (including 3DGS-derived particles) at each timestep. We validate our system through collision and deformation scenarios, and showcase a robotics application in which a manipulator plans and executes tasks involving both captured objects and synthetic elements. By selecting the most appropriate solver and constitutive model for each material (such as MPM for granular media and deformables, PBD for cloth, or SPH for fluids), our approach delivers: (i) high visual fidelity, (ii) accurate, material-specific physical behavior, and (iii) minimal performance overhead. Our pipeline streamlines scene preparation, offering a significant advantage over traditional mesh-centric photogrammetry for time-sensitive reconstruction and emergency scenarios. This combination of flexibility and realism makes our system well-suited for robot task planning, photorealistic multiview dataset generation for autonomous navigation, and other embodied AI applications.

21

ArchitectHead: Continuous Level of Detail Control for 3D Gaussian Head Avatars

Peizhi Yan ⋅ Rabab Ward ⋅ Qiang Tang ⋅ Shan Du

3D Gaussian Splatting (3DGS) has enabled photorealistic and real-time rendering of 3D head avatars. Existing 3DGS-based avatars typically rely on tens of thousands of 3D Gaussian points (Gaussians), with the number of Gaussians fixed after training. However, many practical applications require adjustable levels of detail (LOD) to balance rendering efficiency and visual quality. In this work, we propose "ArchitectHead", the first framework for creating 3D Gaussian head avatars that support continuous control over LOD. Our key idea is to parameterize the Gaussians in a 2D UV feature space and propose a UV feature field composed of multi-level learnable feature maps to encode their latent features. A lightweight neural network-based decoder then transforms these latent features into 3D Gaussian attributes for rendering. ArchitectHead controls the number of Gaussians by dynamically resampling feature maps from the UV feature field at the desired resolutions. This method enables efficient and continuous control of LOD without retraining. Experimental results show that ArchitectHead achieves state-of-the-art (SOTA) quality in self and cross-identity reenactment tasks at the highest LOD, while maintaining near SOTA performance at lower LODs. At the lowest LOD, our method uses only 6.2% of the Gaussians while the quality degrades moderately (L1 Loss +7.9%, PSNR –0.97%, SSIM –0.6%, LPIPS Loss +24.1%), and the rendering speed nearly doubles. The code will be released.

22

Uncertainty-Aware Subset Selection for Robust Visual Explainability under Distribution Shifts

Madhav Gupta ⋅ Vishak Prasad C ⋅ Ganesh Ramakrishnan

Subset selection–based methods are widely used to explain deep vision models: they attribute predictions by highlighting the most influential image regions and support object-level explanations. While these methods perform well in in-distribution (ID) settings, their behavior under out-of-distribution (OOD) conditions remains poorly understood. Through extensive experiments across multiple ID–OOD sets, we find that the reliability of existing subset-based methods degrades markedly, yielding redundant, unstable, and uncertainty-sensitive explanations. To address these shortcomings, we introduce a framework that combines submodular subset selection with layer-wise, gradient-based uncertainty estimation to improve robustness and fidelity without requiring additional training or auxiliary models. Our approach estimates uncertainty via adaptive weight perturbations and uses these estimates to guide submodular optimization, ensuring diverse and informative subset selection. Empirical evaluations show that, beyond mitigating the weaknesses of existing methods under OOD scenarios, our framework also yields improvements in ID settings. These findings highlight limitations of current subset-based approaches and demonstrate how uncertainty-driven optimization can enhance attribution and object-level interpretability, paving the way for more transparent and trustworthy AI in real-world vision applications.

23

The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs

Tejas Anvekar ⋅ Fenil Bardoliya ⋅ Pavan Turaga ⋅ Chitta Baral ⋅ Vivek Gupta

Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale language decoders while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), leaving open whether progress reflects genuine visual grounding or language-side scaling. Existing evaluations emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present the Perceptual Observatory, a framework that benchmarks MLLMs across three verticals: (i) simple vision tasks, such as face matching and OCR capabilities; (ii) local vs. global understanding, encompassing image matching, grid pointing game, and attribution, which tests general perceptual grounding. Each vertical is instantiated with ground-truth datasets of faces and words, systematically perturbed through low-level augmentations and high-level style-transfer illusions. The Perceptual Observatory moves beyond leaderboard accuracy to yield holistic insights into how MLLMs preserve perceptual grounding and relational structure under perturbations, providing a principled foundation for analyzing strengths and weaknesses of current and future models.

24

FAST-EQA: Efficient Embodied Question Answering with Global and Local Region Relevancy

Haochen Zhang ⋅ Nirav Savaliya ⋅ Faizan Siddiqui ⋅ Enna Sachdeva

Embodied Question Answering (EQA) combines visual scene understanding, goal-directed exploration, and spatial and temporal reasoning under partial observability. A central challenge is to confine physical search to question-relevant subspaces while maintaining a compact, actionable memory of observations. Furthermore, for real-world deployment, fast inference time during exploration is crucial. We introduce FAST-EQA, a question-conditioned framework that (i) identifies likely visual targets, (ii) scores global regions of interest to guide navigation, and (iii) employs Chain-of-Thought (CoT) reasoning over visual memory to answer confidently. FAST-EQA maintains a bounded scene memory that stores a fixed-capacity set of region–target hypotheses and updates them online, enabling robust handling of both single- and multi-target questions without unbounded growth. To expand coverage efficiently, a global exploration policy treats narrow openings and doors as high-value frontiers, complementing local target seeking with minimal computation. Together, these components focus the agent’s attention, improve scene coverage, and improve answer reliability while running substantially faster than prior approaches. On HMEQA and EXPRESS-Bench, FAST-EQA achieves state-of-the-art performance, while performing competitively on OpenEQA and MT-HM3D.

25

RapidMV: Leveraging Spatio-Angular Latent Space for Efficient and Consistent Text-to-Multi-View Synthesis

Seungwook Kim ⋅ Yichun Shi ⋅ Kejie Li ⋅ Minsu Cho ⋅ Peng Wang

Generating synthetic multi-view images from a text prompt is an essential bridge to generating synthetic 3D assets. In this work, we introduce RapidMV, a novel text-to-multi-view generative model that can produce 32 multi-view synthetic images in just around 5 seconds. In essence, we propose a novel spatio-angular latent space, encoding both the spatial appearance and angular viewpoint deviations into a single latent for improved efficiency and multi-view consistency. We achieve effective training of RapidMV by strategically decomposing our training process into multiple steps. We demonstrate that RapidMV outperforms existing methods in terms of consistency and latency, with competitive quality and text-image alignment.

26

Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning

Bolutife Atoki ⋅ Iuliia Tkachenko ⋅ Bertrand Kerautret ⋅ Carlos Crispim-Junior

Counterfeiting affects diverse industries, including pharmaceuticals, electronics, and food, posing serious health and economic risks. Printable unclonable codes, such as Copy Detection Patterns (CDPs), are widely used as an anti-counterfeiting measure and are directly applied to products and packaging. However, the increasing availability of high-resolution printing and scanning devices, along with advances in generative deep learning, undermines traditional authentication systems, which often fail to distinguish high-quality counterfeits from genuine prints. In this work, we propose a diffusion-based authentication framework that jointly leverages the original binary template, the printed CDP, and a semantically meaningful representation of printer identity. By formulating authentication as a multi-class classification task over printer signatures, our model captures fine-grained, device-specific features through both spatial and textual conditioning. We extend ControlNet by repurposing the denoising process for class-conditioned noise prediction, enabling effective printer classification. Experiments on the Indigo 1 x 1 Base dataset show that our method outperforms traditional similarity metrics and prior deep learning approaches. Results further demonstrate that the framework generalises robustly to counterfeit types not seen during training.

27

AD2: Analysis and Detection of Adversarial Threats in Visual Perception for End-to-End Autonomous Driving Systems

Ishan Sahu ⋅ Somnath Hazra ⋅ Somak Aditya ⋅ Soumyajit Dey

End-to-end autonomous driving systems have achieved significant progress, yet their adversarial robustness remains largely underexplored. In this work, we conduct a closed-loop evaluation of state-of-the-art autonomous driving agents under black-box adversarial threat models in CARLA. Specifically, we consider three representative attack vectors on the visual perception pipeline: (i) a physics-based blur attack induced by acoustic waves, (ii) an electromagnetic interference attack that distorts captured images, and (iii) a digital attack that adds ghost objects as carefully crafted bounded perturbations on images. Our experiments on two advanced agents, Transfuser and Interfuser, reveal severe vulnerabilities to such attacks, with driving scores dropping by up to 99% in the worst case, raising valid safety concerns. To help mitigate such threats, we further propose a lightweight Attack Detection model for Autonomous Driving systems (AD²) based on attention mechanisms that capture spatial–temporal consistency. Comprehensive experiments across multi-camera inputs on CARLA show that our detector achieves superior detection capability and computational efficiency compared to existing approaches.

28

SSplain: Sparse and Smooth Explainer for Retinopathy of Prematurity Classification

Elifnur Sunger ⋅ Tales Imbiriba ⋅ J. Campbell ⋅ Deniz Erdogmus ⋅ Stratis Ioannidis ⋅ Jennifer Dy

Neural networks are frequently used in medical diagnosis. However, due to their black-box nature, model explainers are used to help clinicians better understand and trust model outputs. This paper introduces an explainer method for classifying Retinopathy of Prematurity (ROP) from fundus images. Previous methods fail to generate explanations that preserve input image structures such as smoothness and sparsity. We introduce Sparse and Smooth Explainer (SSplain), a method that generates pixel-wise explanations while preserving image structures by enforcing smoothness and sparsity. This results in realistic explanations that enhance the understanding of the given black-box model. To achieve this goal, we define an optimization problem with combinatorial constraints and solve it using the Alternating Direction Method of Multipliers (ADMM). Experimental results show that SSplain outperforms commonly used explainers in terms of both post-hoc accuracy and smoothness analyses. Additionally, SSplain identifies features that are consistent with domain-understandable features that clinicians consider as discriminative factors for ROP. We also show SSplain's generalization to other domains by applying it to additional publicly available datasets.

29

Tables Guide Vision: Learning to See the Heart through Tabular Data

Marta Hasny ⋅ Maxime Di Folco ⋅ Keno Bressem ⋅ Julia Schnabel

Contrastive learning methods in computer vision typically rely on augmented views of the same image or multimodal pretraining strategies that align paired modalities. However, these approaches often overlook semantic relationships between distinct instances, leading to false negatives when semantically similar samples are treated as negatives. This limitation is especially critical in medical imaging domains such as cardiology, where demographic and clinical attributes play a critical role in assessing disease risk and patient outcomes. We introduce a tabular-guided contrastive learning framework that leverages clinically relevant tabular data to identify patient-level similarities and construct more meaningful pairs, enabling semantically aligned representation learning without requiring joint embeddings across modalities. Additionally, we adapt the k-NN algorithm for zero-shot prediction to overcome the lack of zero-shot capability in unimodal representations. We demonstrate the strength of our methods using a large cohort of short-axis cardiac MR images and clinical attributes, where tabular data helps to more effectively distinguish between patient subgroups. Evaluation on downstream tasks, including fine-tuning, linear probing, and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes, shows that incorporating tabular data guidance yields stronger visual representations than conventional methods that rely solely on image augmentation or combined image-tabular embeddings. The code will be available on GitHub.
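
As a rough illustration of k-NN prediction on top of a frozen image encoder, the sketch below classifies a query embedding by majority vote among its nearest labelled neighbours. It is a generic k-NN under stated assumptions, not the adaptation proposed in the abstract: the cosine similarity, the value of `k`, the toy 2-D embeddings, and the labels `"A"`/`"B"` are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_predict(query, bank, labels, k=3):
    """Predict a label for `query` by majority vote over its k most similar bank embeddings."""
    ranked = sorted(
        range(len(bank)),
        key=lambda i: cosine_similarity(query, bank[i]),
        reverse=True,
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical embeddings standing in for encoder outputs of labelled reference images.
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["A", "A", "B", "B"]

prediction = knn_predict([1.0, 0.05], bank, labels, k=3)
```

No classifier head is trained here: the prediction depends only on the geometry of the embedding space, which is why this kind of evaluation is often used to probe representation quality directly.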

30

SAFER-AiD: Saccade-Assisted Foveal-peripheral vision Enhanced Reconstruction for Adversarial Defense

Jiayang Liu ⋅ Daniel Tso ⋅ Yiming Bu ⋅ Qinru Qiu

Adversarial attacks significantly challenge the safe deployment of deep learning models, particularly in real-world applications. Traditional defenses often rely on computationally intensive optimization (e.g., adversarial training or data augmentation) to improve robustness, whereas the human visual system achieves inherent robustness to adversarial perturbations through evolved biological mechanisms. We hypothesize that attention-guided non-homogeneous sparse sampling and predictive coding play a key role in this robustness. To test this hypothesis, we propose a novel defense framework incorporating three key biological mechanisms: foveal-peripheral processing, saccadic eye movements, and cortical filling-in. Our approach employs reinforcement learning-guided saccades to selectively capture multiple foveal-peripheral glimpses, which are integrated into a reconstructed image before classification. This biologically inspired preprocessing effectively mitigates adversarial noise, preserves semantic integrity, and notably requires no retraining or fine-tuning of downstream classifiers, enabling seamless integration with existing systems. Experiments on the ImageNet dataset demonstrate that our method improves system robustness across diverse classifiers and attack types, while significantly reducing training overhead compared to both biologically and non-biologically inspired defense techniques.

31

Enhancing Object Detection Training via Joint Image-Annotation Generation

Roy Uziel ⋅ Oded Bialer

Incorporating generated annotated data into training sets can improve object detection. Prior approaches either condition image generation on annotation layouts, limiting diversity and often causing misalignment, or generate images independently and annotate them afterward, reducing accuracy. We introduce a diffusion model that jointly generates images and annotations, enabling their co-evolution and mutual dependency throughout the process. This design achieves tight image-annotation alignment and produces diverse scenarios beyond the original training set, enhancing object detection performance when used in training.

32

Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions

Jintang Xue ⋅ Ganning Zhao ⋅ Jie-En Yao ⋅ Hong-En Chen ⋅ Yue Hu ⋅ Meida Chen ⋅ Suya You ⋅ Chung Chieh Kuo

Understanding 3D scenes goes beyond simply recognizing objects; it requires reasoning about the spatial and semantic relationships between them. Current 3D scene-language models often struggle with this relational understanding, particularly when visual embeddings alone do not adequately convey the roles and interactions of objects. In this paper, we introduce Descrip3D, a novel and powerful framework that explicitly encodes the relationships between objects using natural language. Unlike previous methods that rely only on 2D and 3D embeddings, Descrip3D enhances each object with a textual description that captures both its intrinsic attributes and contextual relationships. These relational cues are incorporated into the model through a dual-level integration: embedding fusion and prompt-level injection. This allows for unified reasoning across various tasks such as grounding, captioning, and question answering, all without the need for task-specific heads or additional supervision. When evaluated on five benchmark datasets, including ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap, Descrip3D consistently outperforms strong baseline models, demonstrating the effectiveness of language-guided relational representation for understanding complex indoor scenes.

33

From SAM to DINOv2: Towards Distilling Foundation Models to Lightweight Baselines for Generalized Polyp Segmentation

Shivanshu Agnihotri ⋅ Snehashis Majhi ⋅ Deepak Nayak ⋅ Debesh Jha

Accurate polyp segmentation during colonoscopy is critical for the early detection of colorectal cancer, yet it remains challenging due to significant size, shape, and color variations, and the camouflaged nature of polyps. While lightweight baseline models such as U-Net, U-Net++, and PraNet offer advantages in terms of easy deployment and low computational cost, they struggle to deal with the above issues, leading to limited segmentation performance. In contrast, large-scale vision foundation models such as SAM, DINOv2, OneFormer, and Mask2Former have exhibited impressive generalization performance across natural image domains. However, their direct transfer to medical imaging tasks (e.g., colonoscopic polyp segmentation) is not straightforward, primarily due to the scarcity of large-scale datasets and lack of domain-specific knowledge. To bridge this gap, we propose a novel distillation framework, Polyp-DiFoM, that transfers the rich representations of foundation models into lightweight segmentation baselines, allowing efficient yet accurate deployment in clinical settings. In particular, we infuse semantic priors from the foundation models into canonical architectures such as U-Net and U-Net++ and further perform frequency domain encoding for enhanced distillation, corroborating their generalization capability. Extensive experiments are performed across five benchmark datasets: Kvasir-SEG, CVC-ClinicDB, ETIS, ColonDB, and CVC-300. Notably, Polyp-DiFoM consistently and significantly outperforms the respective baseline models, as well as the state-of-the-art model, with nearly 9× reduced computation overhead.

34

Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model

Sangjune Park ⋅ Inhyeok Choi ⋅ Donghyeon Soon ⋅ Youngwoo Jeon ⋅ Kyungdon Joo

Dance is a form of human motion characterized by emotional expression and communication, playing a role in various fields such as music, virtual reality, and content creation. Existing methods for dance generation often fail to adequately capture the inherently sequential, rhythmical, and music-synchronized characteristics of dance. In this paper, we propose MambaDance, a new dance generation approach that leverages a Mamba-based diffusion model. Mamba, well-suited to handling long and autoregressive sequences, is integrated into our two-stage diffusion architecture, replacing the off-the-shelf Transformer. Additionally, considering the critical role of musical beats in dance choreography, we propose a Gaussian-based beat representation to explicitly guide the decoding of dance sequences. Experiments on the AIST++ and FineDance datasets for each sequence length show that, compared to previous methods, our proposed method effectively generates plausible dance movements while reflecting essential characteristics, consistently from short to long dances.
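
One plausible reading of a Gaussian-based beat representation is a per-frame signal that peaks on each beat and decays smoothly around it, instead of a sparse binary beat indicator. The sketch below follows that reading only; the function name `gaussian_beat_curve`, the width `sigma`, and the use of a maximum over beats are assumptions, and the abstract does not specify the actual formulation.

```python
import math


def gaussian_beat_curve(num_frames, beat_frames, sigma=2.0):
    """Return a per-frame beat signal in [0, 1] that equals 1.0 exactly on each beat frame."""
    curve = []
    for t in range(num_frames):
        # Keep the strongest response among all beats so overlapping bumps never exceed 1.
        value = max(
            (math.exp(-((t - b) ** 2) / (2.0 * sigma ** 2)) for b in beat_frames),
            default=0.0,
        )
        curve.append(value)
    return curve


# Hypothetical 12-frame clip with beats detected at frames 3 and 9.
curve = gaussian_beat_curve(num_frames=12, beat_frames=[3, 9], sigma=1.0)
```

Compared with a one-hot beat vector, a smooth curve of this kind gives the model a graded sense of how close each frame is to the nearest beat, which may be easier to condition on.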

35

BoxSplitGen: A Generative Model for 3D Part Bounding Boxes in Varying Granularity

Juil Koo ⋅ Wei-Tung Lin ⋅ Chanho Park ⋅ Chanhyeok Park ⋅ Minhyuk Sung

Human creativity follows a perceptual process, moving from abstract ideas to finer details during creation. While 3D generative models have advanced dramatically, models specifically designed to assist human imagination in 3D creation--particularly for detailing abstractions from coarse to fine--have not been explored. We propose a framework that enables intuitive and interactive 3D shape generation by iteratively splitting bounding boxes to refine the set of bounding boxes. The main technical components of our framework are two generative models: the box-splitting generative model and the box-to-shape generative model. The first model, named BoxSplitGen, generates a collection of 3D part bounding boxes with varying granularity by iteratively splitting coarse bounding boxes. It utilizes part bounding boxes created through agglomerative merging and learns the reverse of the merging process—the splitting sequences. The model consists of two main components: the first learns the categorical distribution of the box to be split, and the second learns the distribution of the two new boxes, given the set of boxes and the indication of which box to split. The second model, the box-to-shape generative model, is trained by leveraging the 3D shape priors learned by an existing 3D diffusion model while adapting the model to incorporate bounding box conditioning. In our experiments, we demonstrate that the box-splitting generative model outperforms token prediction models and the inpainting approach with an unconditional diffusion model. Also, we show that our box-to-shape model, based on a state-of-the-art 3D diffusion model, provides superior results compared to a previous model.

36

3D Gaussian Point Encoders

Jim James ⋅ Benjamin Wilson ⋅ Simon Lucey ⋅ James Hays

In this work, we introduce the 3D Gaussian Point Encoder, an explicit per-point embedding built on mixtures of learned 3D Gaussians. This explicit geometric representation for 3D recognition tasks is a departure from widely used implicit representations such as PointNet. However, it is difficult to learn 3D Gaussian encoders in an end-to-end fashion with standard optimizers. We develop optimization techniques based on natural gradients and distillation from PointNets to find a Gaussian basis that can reconstruct PointNet activations. The resulting 3D Gaussian Point Encoders are faster and more parameter-efficient than traditional PointNets. As in the 3D reconstruction literature, where there has been considerable interest in the move from implicit (e.g., NeRF) to explicit (e.g., Gaussian Splatting) representations, we can take advantage of computational geometry heuristics to accelerate 3D Gaussian Point Encoders further. We extend filtering techniques from 3D Gaussian Splatting to construct encoders that run 2.7× faster than a PointNet of comparable accuracy while using 46% less memory and 88% fewer FLOPs. Furthermore, we demonstrate the effectiveness of 3D Gaussian Point Encoders as a component in Mamba3D, running 1.27× faster and achieving a reduction in memory and FLOPs by 42% and 54% respectively. 3D Gaussian Point Encoders are lightweight enough to achieve high framerates on CPU-only devices.

37

HumanGuideNet: Adapter-Based Alignment of Deep Neural Networks with Human Similarity Judgments

Xufu Liu ⋅ Yifan Yang ⋅ Zhengxin Zhang

Aligning deep neural network (DNN) representations with human perception is essential for cognitively aligned and robust AI. We introduce HumanGuideNet, an adapter-based architecture with a human-aligned branch, HumReg, trained jointly on standard class labels (e.g., ImageNet-1k) and human similarity judgments (THINGS data) to align model representations with human similarity structure. Unlike traditional alignment methods based on linear transforms, HumanGuideNet preserves the pretrained backbone and fuses human-aligned features with backbone representations to retain general visual knowledge while injecting perceptual alignment. We show that the HumReg representations better capture human representational similarity matrices (RSMs) and lead to fused features that significantly improve generalization and robustness. Specifically, the fused features boost few-shot classification and anomaly detection accuracy across a range of datasets, while also exhibiting robustness to natural image corruptions. Our results show that modular human alignment can effectively enhance large pretrained models, providing a scalable and interpretable approach to building human-aligned visual intelligence.

38

Low-Rank Expert Merging for Multi-Source Domain Adaptation in Person Re-Identification

Taha Mustapha Nehdi ⋅ Nairouz Mrabah ⋅ Atif Belal ⋅ Marco Pedersoli ⋅ Eric Granger

Adapting person re-identification (reID) models to new target environments remains a challenging problem that is typically addressed using unsupervised domain adaptation (UDA) methods. Recent works show that when labeled data originates from several distinct sources (e.g., datasets and cameras), considering each source separately and applying multi-source domain adaptation (MSDA) typically yields higher accuracy and robustness compared to blending the sources and performing conventional UDA. However, state-of-the-art MSDA methods learn domain-specific backbone models or require access to source domain data during adaptation, resulting in significant growth in training parameters and computational cost. In this paper, a Source-free Adaptive Gated Experts (SAGE-reID) method is introduced for person reID. Our SAGE-reID is a cost-effective, source-free MSDA method that first trains individual source-specific low-rank adapters (LoRA) through source-free UDA. Next, a lightweight gating network is introduced and trained to dynamically assign optimal merging weights for fusion of LoRA experts, enabling effective cross-domain knowledge transfer. While the number of backbone parameters remains constant across source domains, LoRA experts scale linearly but remain negligible in size (≤2% of the backbone), reducing both the memory consumption and risk of overfitting. Extensive experiments conducted on three challenging benchmarks (Market-1501, DukeMTMC-reID, and MSMT17) indicate that SAGE-reID outperforms state-of-the-art methods, while being computationally efficient.
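
The merging step can be pictured with a small sketch. It assumes the common LoRA form, in which each expert contributes a low-rank update B·A to a frozen weight matrix, and it assumes that the gating network's scores are turned into merging weights with a softmax. Both assumptions, together with the helper names `merge_lora_experts`, `matmul`, and `softmax`, are illustrative and not taken from the abstract.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_lora_experts(base_weight, experts, gate_scores):
    """Return base_weight + sum_i g_i * (B_i @ A_i), with g = softmax(gate_scores)."""
    gates = softmax(gate_scores)
    merged = [row[:] for row in base_weight]   # copy so the frozen backbone weight is untouched
    for gate, (b_mat, a_mat) in zip(gates, experts):
        delta = matmul(b_mat, a_mat)           # low-rank update contributed by this expert
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += gate * value
    return merged


# Hypothetical 2x2 backbone weight with two rank-1 experts (one per source domain).
base = [[0.0, 0.0], [0.0, 0.0]]
expert_a = ([[1.0], [0.0]], [[1.0, 0.0]])      # (B, A) with B: 2x1, A: 1x2
expert_b = ([[0.0], [1.0]], [[0.0, 1.0]])

merged = merge_lora_experts(base, [expert_a, expert_b], gate_scores=[0.0, 0.0])
```

Only the small B and A factors and the gate scores differ per source domain, which is consistent with the abstract's point that the backbone parameter count stays constant while the experts add little overhead.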

39

MEGA-PCC: A Mamba-based Efficient Approach for Joint Geometry and Attribute Point Cloud Compression

Kai-Hsiang Hsieh ⋅ Monyneath Yim ⋅ Wen-Hsiao Peng ⋅ Jui-Chiu Chiang

Joint compression of point cloud geometry and attributes is essential for efficient 3D data representation. Existing methods often rely on recoloring procedures and manually tuned bitrate allocation between geometry and attribute compressions in inference, which hinder end-to-end optimization and add system complexity. To overcome these limitations, we propose MEGA-PCC, a fully end-to-end, learning-based framework featuring two specialized models for joint compression. The main compression model employs a shared encoder that embeds both geometry and attribute information into a unified latent space, followed by dual decoders that sequentially reconstruct geometry and then attributes. Complementing this, the Mamba-based Entropy Model (MEM) enhances entropy coding by capturing spatial and channel-wise correlations to improve probability estimation. Both models are built on the Mamba architecture to effectively model long-range dependencies and rich contextual features. By eliminating the need for recoloring and heuristic bitrate tuning, MEGA-PCC enables data-driven bitrate allocation during training and simplifies the overall pipeline. Extensive experiments demonstrate that MEGA-PCC achieves superior rate-distortion performance and runtime efficiency compared to both traditional and learning-based baselines, offering a powerful solution for AI-driven point cloud compression.

40

HyPCA-Net: Advancing Multimodal Fusion in Medical Image Analysis

Joy Dhar ⋅ Manish Pandey ⋅ Debashis Das Chakladar ⋅ Maryam Haghighat ⋅ Azadeh Alavi ⋅ Sajib Mistry ⋅ Nayyar Zaidi

Multimodal fusion frameworks, which integrate diverse medical imaging modalities (e.g., MRI, CT), have shown great potential in applications such as skin cancer detection, dementia diagnosis, and brain tumor prediction. However, existing multimodal fusion methods face significant challenges. First, they often rely on computationally expensive models, limiting their applicability in low-resource environments. Second, they often employ cascaded attention modules, which potentially increase the risk of information loss during inter-module transitions and hinder their capacity to effectively capture robust shared representations across modalities. This restricts their generalization in multi-disease analysis tasks. To address these limitations, we propose a Hybrid Parallel-Fusion Cascaded Attention Network (HyPCA-Net), composed of two core blocks: (a) an efficient residual adaptive learning attention block for capturing refined modality-specific representations, and (b) a dual-view cascaded attention block aimed at learning robust shared representations across diverse modalities. Extensive experiments on ten publicly available datasets show that HyPCA-Net significantly outperforms existing methods, achieving performance improvements of up to 9.34%, while reducing computational costs by up to 78.3%.

41

CycleSL: Server-Client Cyclical Update Driven Scalable Split Learning

Mengdi Wang ⋅ Efe Bozkir ⋅ Enkelejda Kasneci

Split learning emerges as a promising paradigm for collaborative distributed model training, akin to federated learning, by partitioning neural networks between clients and a server without raw data exchange. However, sequential split learning suffers from poor scalability, while parallel variants like parallel split learning and split federated learning often incur high server resource overhead due to model duplication and aggregation, and generally exhibit reduced model performance and convergence owing to factors like client drift and lag. To address these limitations, we introduce CycleSL, a novel aggregation-free split learning framework that enhances scalability and performance and can be seamlessly integrated with existing methods. Inspired by alternating block coordinate descent, CycleSL treats server-side training as an independent higher-level machine learning task, resampling client-extracted features (smashed data) to mitigate heterogeneity and drift. It then performs cyclical updates, namely optimizing the server model first, followed by client updates using the updated server for gradient computation. We integrate CycleSL into previous algorithms and benchmark them on four publicly available datasets with non-iid data distribution and partial client attendance. Our empirical findings highlight the effectiveness of CycleSL in enhancing model performance. We provide our source code and appendix in the supplementary materials.

42

3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting

Ziyang Yan ⋅ Yihua Shao ⋅ Minwen Liao ⋅ Siyu Chen ⋅ Nan Wang ⋅ Muyuan Lin ⋅ Jenq-Neng Hwang ⋅ Hao Zhao ⋅ Fabio Remondino ⋅ Lei Li

The creation of 3D scenes has traditionally been both labor-intensive and costly, requiring designers to meticulously configure 3D assets and environments. Recent advancements in generative AI, including text-to-3D and image-to-3D methods, have dramatically reduced the complexity and cost of this process. However, current techniques for editing complex 3D scenes continue to rely on generally interactive multi-step, 2D-to-3D projection methods and diffusion-based techniques, which often lack precision in control and hamper interactive-rate performance. In this work, we propose 3DSceneEditor, a fully 3D-based paradigm for interactive-rate, precise editing of intricate 3D scenes using Gaussian Splatting. Unlike conventional methods, 3DSceneEditor operates through a streamlined 3D pipeline, enabling direct Gaussian-based manipulation for efficient, high-quality edits based on input prompts. The proposed framework (i) integrates a pre-trained instance segmentation model for semantic labeling; (ii) employs a zero-shot grounding approach with CLIP to align target objects with user prompts; and (iii) applies scene modifications, such as object addition, repositioning, recoloring, replacing, and removal, directly on Gaussians. Extensive experimental results show that 3DSceneEditor surpasses existing state-of-the-art techniques in terms of both editing precision and efficiency, establishing a new benchmark for efficient and interactive 3D scene customization.

43

Segmentation-Aware Latent Diffusion for Satellite Image Super-Resolution: Enabling Smallholder Farm Boundary Delineation

Aditi Agarwal ⋅ Anjali Jain ⋅ Nikita Saxena ⋅ Ishan Deshpande ⋅ Michal Kazmierski ⋅ Abigail Annkah ⋅ Nadav Sherman ⋅ Karthikeyan Shanmugam ⋅ Alok Talekar ⋅ Vaibhav Rajan

Delineating farm boundaries through segmentation of satellite images is a fundamental step in many agricultural applications. The task is particularly challenging for smallholder farms, where accurate delineation requires the use of high resolution (HR) imagery, which is available only at low revisit frequencies (e.g., annually). To support more frequent (sub-)seasonal monitoring, HR images could be combined as references (ref) with low resolution (LR) images, which have a higher revisit frequency (e.g., weekly), using reference-based super resolution (Ref-SR) methods. However, current Ref-SR methods optimize perceptual quality and smooth over crucial features needed for downstream tasks, and are unable to meet the large scale-factor requirements for this task. Further, previous two-step approaches of SR followed by segmentation do not effectively utilize diverse satellite sources as inputs. We address these problems through a new approach, SEED-SR, which uses a combination of conditional latent diffusion models and large-scale multi-spectral, multi-source geo-spatial foundation models. Our key innovation is to bypass the explicit SR task in the pixel space and instead perform SR in a segmentation-aware latent space. This unique approach enables us to generate segmentation maps at an unprecedented 20× scale factor, and rigorous experiments on two large, real datasets demonstrate up to 25.5% and 12.9% relative improvement in instance and semantic segmentation metrics respectively over approaches based on state-of-the-art Ref-SR methods.

44

mmWeaver: Environment-Specific mmWave Signal Synthesis from a Photo and Activity Description

Mahathir Monjur ⋅ Shahriar Nirjon

Realistic signal generation and dataset augmentation are essential for advancing mmWave radar applications such as activity recognition and pose estimation, which rely heavily on diverse, environment-specific signal datasets. However, mmWave signals are inherently complex, sparse, and high-dimensional, making physical simulation computationally expensive. This paper presents mmWeaver, a novel framework that synthesizes realistic, environment-specific complex mmWave signals by modeling them as continuous functions using Implicit Neural Representations (INRs), achieving up to 49-fold compression. mmWeaver incorporates hypernetworks that dynamically generate INR parameters based on environmental context (extracted from RGB-D images) and human motion features (derived from text-to-pose generation via MotionGPT), enabling efficient and adaptive signal synthesis. By conditioning on these semantic and geometric priors, mmWeaver generates diverse I/Q signals at multiple resolutions, preserving information critical for downstream tasks such as point cloud estimation and activity classification. Extensive experiments show that mmWeaver achieves a complex SSIM of 0.88 and a PSNR of 35 dB, outperforming existing methods in signal realism while improving activity recognition accuracy by up to 7\% and reducing human pose estimation error by up to 15\%, all while operating $6$–$35\times$ faster than simulation-based approaches.

45

Detection-Driven Object Count Optimization for Text-to-Image Diffusion Models

Oz Zafar ⋅ Yuval Cohen ⋅ Lior Wolf ⋅ Idan Schwartz

Accurately controlling object count in text-to-image generation remains a key challenge. Supervised methods often fail, as training data rarely covers all count variations. Methods that manipulate the denoising process to add or remove objects can help; however, they still require labeled data, limit robustness and image quality, and rely on a slow, iterative process. Pre-trained differentiable counting models that rely on soft object density summation exist and could steer generation, but employing them presents three main challenges: (i) they are pre-trained on clean images, making them less effective during denoising steps that operate on noisy inputs; (ii) they are not robust to viewpoint changes; and (iii) optimization is computationally expensive, requiring repeated model evaluations per image. We propose a new framework that uses pre-trained object counting techniques and object detectors to guide generation. First, we optimize a counting token using an outer-loop loss computed on fully generated images. Second, we introduce a detection-driven scaling term that corrects errors caused by viewpoint and proportion shifts, etc., without requiring backpropagation through the detection model. Third, we show that the optimized parameters can be reused for new prompts, removing the need for repeated optimization. Our method provides efficiency through token reuse, flexibility via compatibility with various detectors, and accuracy with improved counting across diverse object categories.

46

Roadside Monocular 3D Detection Prompted by 2D Detection

Yechi Ma ⋅ Wei Hua ⋅ Yanan Li ⋅ Shu Kong

Roadside monocular 3D detection requires detecting objects of predefined classes in an RGB frame and predicting their 3D attributes, such as bird's-eye-view (BEV) locations. It has broad applications in traffic control, vehicle-vehicle communication, and vehicle-infrastructure cooperative perception. To address this task, we introduce Promptable 3D Detector (Pro3D), a novel detector design that leverages 2D detections as prompts. We build our Pro3D upon two key insights. First, compared to a typical 3D detector, a 2D detector is ``easier'' to train due to fewer loss terms and performs significantly better at localizing objects w.r.t. 2D metrics. Second, once 2D detections precisely locate objects in the image, a 3D detector can focus on lifting these detections into 3D BEV, especially when a fixed camera pose or scene geometry provides an informative prior. To encode and incorporate 2D detections, we explore three methods: (a) concatenating features from both 2D and 3D detectors, (b) attentively fusing 2D and 3D detector features, and (c) encoding properties of predicted 2D bounding boxes \{$x$, $y$, width, height, label\} and attentively fusing them with the 3D detector feature. Interestingly, the third method significantly outperforms the others, underscoring the effectiveness of 2D detections as prompts that offer precise object targets and allow the 3D detector to focus on lifting them into 3D. Pro3D is adaptable for use with a wide range of 2D and 3D detectors with minimal modifications. Comprehensive experiments demonstrate that our Pro3D significantly enhances existing methods, achieving state-of-the-art results on two contemporary benchmarks.

47

UniCalib: Targetless LiDAR-camera Calibration via Probabilistic Flow on Unified Depth Representations

Shu Han ⋅ Xubo Zhu ⋅ Ji Wu ⋅ Ximeng Cai ⋅ Wen Yang ⋅ Huai Yu ⋅ Gui-Song Xia

Online targetless extrinsic LiDAR-camera calibration is essential for robust perception in computer vision applications such as autonomous driving. However, existing methods struggle with the significant modality gap between heterogeneous sensors and fail to handle unreliable correspondences arising from real-world challenges like occlusions and dynamic objects. To address these issues, we introduce UniCalib, a novel method that performs calibration by estimating a probabilistic flow on unified depth representations. UniCalib first bridges the modality gap by converting both the camera images and the sparse LiDAR points into unified, dense depth maps, enabling a unified encoder to learn consistent features. Subsequently, it learns a probabilistic flow field that captures the correspondence uncertainty to improve robustness. This probabilistic approach is reinforced by a reliability map and a perceptually weighted sparse flow loss, which guide the model to suppress the influence of unreliable regions. Experimental results on three datasets validate the accuracy and generalization of UniCalib. In particular, it achieves a mean translation error of $0.550\mathrm{cm}$ and a rotation error of $0.044^\circ$ on the KITTI dataset.

48

Color Bind: Exploring Color Perception in Text-to-Image Models

Shay Shomer-Chai ⋅ Wenxuan Peng ⋅ Bharath Hariharan ⋅ Hadar Averbuch-Elor

Text-to-image generation has recently seen remarkable success, granting users the ability to create high-quality images through the use of text. However, contemporary methods face challenges in capturing the precise semantics conveyed by complex multi-object prompts. Consequently, many works have sought to mitigate such semantic misalignments, typically via inference-time schemes that modify the attention layers of the denoising networks. However, prior work has mostly utilized coarse metrics, such as the cosine similarity between text and image CLIP embeddings, or human evaluations, which are challenging to conduct at a larger scale. In this work, we perform a case study on colors, a fundamental attribute commonly associated with objects in text prompts, which offers a rich test bed for rigorous evaluation. Our analysis reveals that pretrained models struggle to generate images that faithfully reflect multiple color attributes, far more so than with single-color prompts, and that neither inference-time techniques nor existing editing methods reliably resolve these semantic misalignments. Accordingly, we introduce a dedicated image editing technique, mitigating the issue of multi-object semantic alignment for prompts containing multiple colors. We demonstrate that our approach significantly boosts performance over a wide range of metrics, considering images generated by various text-to-image diffusion-based techniques. We will make our code, benchmark and evaluation protocol publicly available.

49

ASC: Learning Augmentation Severity-Consistent Representations Improves Generalization via Augmentation Search

Amirhossein Alamdar ⋅ Hossein Jafarinia ⋅ Mahdi Nouri ⋅ Mohammad Rohban

Whole Slide Image (WSI) classification is hindered by limited data availability, resulting in weak generalization. Recent efforts leverage data augmentation to address this, but methods adapted from natural images often fail on WSIs—either degrading performance or offering marginal gains. A central challenge lies in tuning augmentation parameters to match WSI-specific characteristics, a task rendered impractical by the computational demands of current WSI pipelines, where feature extraction is frozen and prohibitively expensive. This work introduces two key contributions. First, it proposes DINOASC, an enhanced self-supervised learning framework that modifies DINO to produce embeddings with AugSev Consistency—a property ensuring that linear interpolations across augmentation severities yield semantically coherent representations. Second, it presents the first automatic augmentation search strategy for WSI classification, built on top of TrivialAugment, which efficiently discovers augmentation strength ranges suited to histopathology by exploiting the structured embedding space induced by DINOASC. Together, these components enable augmentation-based generalization improvements without incurring excessive computational overhead. The proposed method achieves state-of-the-art performance on CAMELYON16 and SICAP-MIL.

50

Detecting Out-of-Distribution Objects through Class-Conditioned Inpainting

Quang-Huy Nguyen ⋅ Jin Peng Zhou ⋅ Zhenzhen Liu ⋅ Khanh-Huyen Bui ⋅ Kilian Weinberger ⋅ Wei-Lun Chao ⋅ Dung Le

Recent object detectors have achieved impressive accuracy in identifying objects seen during training. However, real-world deployment often introduces novel and unexpected objects, referred to as out-of-distribution (OOD) objects, posing significant challenges to model trustworthiness. Modern object detectors are typically overconfident, making it unreliable to use their predictions alone for OOD detection. To address this, we propose leveraging an auxiliary model as a complementary solution. Specifically, we utilize an off-the-shelf text-to-image generative model, such as Stable Diffusion, which is trained with objective functions distinct from those of discriminative object detectors. We hypothesize that this fundamental difference enables the detection of OOD objects by measuring inconsistencies between the models. Concretely, for a given detected object bounding box and its predicted in-distribution class label, we perform class-conditioned inpainting on the image with the object removed. If the object is OOD, the inpainted image is likely to deviate significantly from the original, making the reconstruction error a robust indicator of OOD status. Extensive experiments demonstrate that our approach consistently surpasses existing zero-shot and non-zero-shot OOD detection methods, establishing a robust framework for enhancing object detection systems in dynamic environments.
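The scoring idea in this abstract, reconstruction error between the original object region and its class-conditioned inpainting, can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the crops are plain nested lists of pixel intensities, the inpainted crop is taken as given (the generative model is not shown), mean squared error stands in for whatever error measure the authors use, and `reconstruction_error`, `is_ood`, and the threshold are hypothetical names and values.

```python
def reconstruction_error(original, inpainted):
    """Mean squared error between two equally sized crops (nested lists of floats)."""
    diffs = [(a - b) ** 2
             for row_o, row_i in zip(original, inpainted)
             for a, b in zip(row_o, row_i)]
    return sum(diffs) / len(diffs)


def is_ood(original, inpainted, threshold):
    """Flag an object as OOD when its inpainted crop deviates strongly.

    `threshold` is a hypothetical tuning parameter, not a value from the paper.
    """
    return reconstruction_error(original, inpainted) > threshold


# Toy crops: the first inpainting reproduces the object, the second does not.
crop = [[0.2, 0.8], [0.5, 0.5]]
faithful = [[0.2, 0.8], [0.5, 0.5]]
deviating = [[1.0, 0.0], [0.5, 0.5]]

print(is_ood(crop, faithful, threshold=0.1))
print(is_ood(crop, deviating, threshold=0.1))
```

In practice the inpainted crop would come from a text-to-image model conditioned on the detector's predicted class label, and the threshold would be chosen on held-out data.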

51

Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars

Eric Chen ⋅ Di Liu ⋅ Sizhuo Ma ⋅ Michael Vasilkovsky ⋅ Bing Zhou ⋅ Qiang Gao ⋅ Wenzhou Wang ⋅ Jiahao Luo ⋅ Dimitri Metaxas ⋅ Vincent Sitzmann ⋅ Jian Wang

Despite the increasing popularity of avatar systems such as Snapchat Bitmojis, existing production avatar platforms face several limitations, such as a limited number of predefined assets, tedious customization processes, and inefficient rendering requirements. Addressing these shortcomings, we introduce Snapmoji, an avatar generation system that instantly creates 3D avatars, and enables customization in a process we call dual-stylization. Snapmoji first maps a selfie of a user to a primary avatar (e.g., Bitmoji style) using a new technique we name Gaussian Domain Adaptation (GDA), then applies a secondary style (e.g., skeleton, yarn, toy) to the primary avatar, all while preserving the user’s identity. The generated 3D avatars can then be rendered and animated on mobile devices at 30--40 FPS.

52

Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models

Utsav Panchal ⋅ Yuchen Liu ⋅ Luigi Palmieri ⋅ Ilche Georgievski ⋅ Marco Aiello

Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of human-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we perform fine-tuning of CAMP-VLM with synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to assess their generalization capabilities. Leveraging Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), CAMP-VLM outperforms the best-performing baseline by up to 66.9\% in prediction accuracy.

53

False Alarm Rectification for Early Smoke Segmentation

Hongjin Zhao ⋅ Weihao Li ⋅ Ge-Peng Ji ⋅ Nick Barnes

Early smoke segmentation plays a critical role in forest protection and industrial safety. With the increasing deployment of fixed cameras and drones, vision-based smoke detection has become widely adopted. However, in open environments, smoke is easily confused with visually similar phenomena such as clouds, fog, and water vapor, leading to frequent false positives. To address this challenge, we propose a method that suppresses false alarms in pixel-level smoke segmentation while preserving overall detection performance. The core idea is to leverage the confidence of an image-level smoke classifier as a prior to guide both training and inference of the segmentation model. High-confidence samples receive stronger supervision to enhance discriminative capability, whereas low-confidence samples are down-weighted to mitigate noise propagation. In addition, we design a multi-scale feature fusion module that integrates texture and semantic cues from different layers, improving robustness to thin plumes and complex backgrounds. We further introduce a contrastive loss that encourages intra-class compactness and inter-class separability in feature space. Overall, our method reduces the false positive rate without sacrificing segmentation quality. Experiments on the SmokeSeg dataset demonstrate the effectiveness of our approach, achieving an IoU of 61.83\% and an FPR of only 0.28\%. Our code will be released publicly.

54

Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space

Ren Nakagawa ⋅ Yang Yang ⋅ Risa Shinoda ⋅ Hiroaki Santo ⋅ Kenji Oyama ⋅ Fumio Okura ⋅ Takenao Ohkawa

This paper introduces a method and application for automatic detection of behavioral interaction between grazing cattle from a single image, which is essential for smart livestock management in the cattle industry, such as for detecting estrus. Although interaction detection for humans has been actively studied, a non-trivial challenge lies in cattle interaction detection, i.e., the lack of a comprehensive behavioral dataset including interaction, since the interactions of grazing cattle are rare events. We therefore propose CattleAct, a data-efficient method for interaction detection that decomposes interactions into combinations of actions by individual cattle. Specifically, we first learn an action latent space from a large-scale cattle action dataset, then embed rare interactions by fine-tuning the pre-trained latent space using contrastive learning, constructing a unified latent space of actions and interactions. On top of the proposed method, we develop a practical working system integrating video and GPS inputs. Experiments in a commercial-scale pasture show that our method detects interactions more accurately than the baselines.

55

Semi-Supervised Hierarchical Open-Set Classification

Erik Wallin ⋅ Fredrik Kahl ⋅ Lars Hammarstrand

Hierarchical open-set classification handles previously unseen classes by assigning them to the most appropriate high-level category in a class taxonomy. We extend this paradigm to the semi-supervised setting, enabling the use of large-scale, uncurated datasets containing a mixture of known and unknown classes to improve the hierarchical open-set performance. To this end, we propose a teacher-student framework based on pseudo-labeling. Two key components are introduced: 1) subtree pseudo-labels, which provide reliable supervision in the presence of unknown data, and 2) age-gating, a mechanism that mitigates overconfidence in pseudo-labels. Experiments show that our framework outperforms self-supervised pretraining followed by supervised adaptation, and even matches the fully supervised counterpart when using only 20 labeled samples per class on the iNaturalist19 benchmark. Our code is available as supplementary material.

56

ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora

Nikolaos Adaloglou ⋅ Diana Petrusheva ⋅ Mohamed Asker ⋅ Felix Michels ⋅ Markus Kollmann

Large-scale visual out-of-distribution (OOD) detection has witnessed remarkable progress by leveraging vision-language models such as CLIP. Currently, such OOD methods require access to the in-distribution ground truth label names (positives), while widely available text corpora are solely utilized for mining unrelated concepts (negatives) that are likely OOD. In this work, we present a general framework for mining positive and negative concepts from a text corpus. Additionally, we propose a novel label mining method, ClusterMine, which is the first method to achieve state-of-the-art OOD detection performance when ground-truth label names are inaccessible. ClusterMine extracts in-distribution-related concepts from a large text corpus by enforcing visual sample consistency along with zero-shot inference. Our extensive experimental study reveals that ClusterMine i) is scalable across a plethora of CLIP models, and ii) achieves state-of-the-art robustness to covariate in-distribution shifts, on average. The code will be released.

57

CURIO: Curvature-Aligned and Efficient OCR for Low-Resource Historical Manuscripts

Sai Madhusudan Gunda ⋅ Tathagata Ghosh ⋅ Simran Sandral ⋅ Ravi Kiran Sarvadevabhatla

We present CURIO, an OCR system for low-resource historical manuscripts. In many challenging cases, manuscripts feature curved text lines, unsegmented lines lacking spacing between words, and highly variable line lengths, conditions under which existing OCR methods fail. To tackle this challenge, we first extract lines and corresponding curvature profiles from manuscripts, then straighten them using a rectification procedure to reduce redundant background within each line. Because data is scarce, we complement real data with synthetic data. To bridge the synthetic–real gap, we generate line images by warping rendered straight text along the rectified profiles, ensuring both real and synthetic lines align in their curvature characteristics. Our recognizer is a lightweight CNN–Transformer with padding-aware null activations and sparse attention, optimized with CTC loss for efficient training. We evaluate our method on challenging manuscript collections written in Sharada, a rare and endangered Indic script. CURIO outperforms strong CNN+RNN and Transformer baselines, with the largest gains on high-curvature lines and long lines. CURIO further transfers zero-shot to printed Sharada text, indicating robustness beyond the manuscript domain.

58

DoTA: Latent Distribution Conditioned Data Attribution for Diffusion Models

Ninad Joshi ⋅ Vivek Srivastava ⋅ Shirish Karande

Diffusion models have emerged as the backbone of several modern generative AI models for effective visual content generation. However, their opaque nature raises fundamental questions about which training samples are responsible for specific generations, especially in applications involving bias detection, model auditing, and dataset curation. Data attribution seeks to identify the training samples that highly influence the output of generative models, a task that becomes especially challenging when targeting fine-scale attributes for attribution. Prior work has focused on broad concepts such as global features or entire images, often overlooking the nuances of fine-grained attributes and relying on group-based strategies that dilute individual influence. We propose a novel latent distribution conditioned method DoTA for data attribution. DoTA presents an effective search space pruning technique based on the latent distribution matching between the generated and training data for effective and controlled attribution. We demonstrate the attribution effectiveness through extensive quantitative and qualitative evaluations across challenging settings such as counterfactual evaluation and robustness to adversarial attack.

59

Learnable Query-Enhanced Pose Transformation

Yi-Zhen Wang ⋅ Hong-Han Shuai

Pose-Guided Person Image Synthesis (PGPIS) aims to transfer a person from a source image to a target pose (e.g., skeleton) while preserving their original appearance. Although existing methods can produce high-quality results at first glance, they often suffer from noticeable distortions in fine details. We identify the root cause of these issues as the heavy reliance on pre-trained encoders for extracting visual features from the source image. To address this, we propose a novel Query Enhancement Network composed of two key components: the Query-based Feature Fusion Transformer (QFFT) and Pose-Masked Attention (PMA). The QFFT uses learnable queries to fuse multi-scale features, from high to low resolution, extracted by the backbone encoder, thereby significantly enhancing the realism of texture details in the generated images. To better capture the relationship between pose information and visual features from the source image, we introduce PMA, which uses the pose skeleton as a mask to guide the attention mechanism to focus on the pose regions. Our method produces high-quality, visually coherent results and outperforms existing approaches on standard evaluation metrics, including FID, SSIM, and LPIPS, demonstrating its effectiveness on the DeepFashion dataset.

60

CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video

Xinyi Wang ⋅ Angeliki Katsenou ⋅ Junxiao Shen ⋅ David Bull

The prevalence of user-generated content (UGC) on platforms such as YouTube and TikTok has rendered no-reference (NR) perceptual video quality assessment (VQA) vital for optimizing video delivery. Nonetheless, the characteristics of non-professional acquisition and the subsequent transcoding of UGC video on sharing platforms present significant challenges for NR-VQA. Although NR-VQA models attempt to infer mean opinion scores (MOS), their modeling of subjective scores for compressed content remains limited due to the absence of fine-grained perceptual annotations of artifact types. To address these challenges, we propose CAMP-VQA, a novel NR-VQA framework that exploits the semantic understanding capabilities of large vision–language models. Our approach introduces a quality-aware prompting mechanism that integrates video metadata (e.g., resolution, frame rate, bitrate) with key fragments extracted from inter-frame variations to guide the BLIP-2 pretraining approach in generating fine-grained quality captions. A unified architecture has been designed to model perceptual quality across three dimensions: semantic alignment, temporal characteristics, and spatial characteristics. These multimodal features are extracted and fused, then regressed to video quality scores. Extensive experiments on a wide variety of UGC datasets demonstrate that our model consistently outperforms existing NR-VQA methods, achieving improved accuracy without the need for costly manual fine-grained annotations. Our method achieves the best performance in terms of average rank and linear correlation (SRCC: 0.928, PLCC: 0.938) compared to state-of-the-art methods. The source code and trained models, along with a user-friendly demo, will be made available at: https://github.com/.
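The two correlation figures quoted above are standard: SRCC is the Spearman rank-order correlation and PLCC is the Pearson linear correlation between predicted scores and mean opinion scores. A minimal pure-Python sketch of both follows; the score lists are made-up illustration data, not results from the paper, and VQA evaluations often fit a nonlinear mapping before computing PLCC, a step omitted here.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: predicted quality scores vs. mean opinion scores.
predicted = [1.2, 2.9, 3.1, 4.8, 4.0]
mos = [1.0, 2.5, 3.5, 5.0, 4.2]
print(f"SRCC={spearman(predicted, mos):.3f} PLCC={pearson(predicted, mos):.3f}")
```

SRCC rewards getting the ordering of videos right regardless of scale, while PLCC also penalizes departures from a linear relationship, which is why the two are usually reported together.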

61

Evaluating Text-to-Image and Text-to-Video Synthesis with a Conditional Frechet Distance

Jaywon Koo ⋅ Jefferson Hernandez ⋅ Moayed Haji-Ali ⋅ Ziyan Yang ⋅ Vicente Ordonez

Evaluating text-to-image and text-to-video models is challenging due to a fundamental disconnect: established metrics fail to jointly measure visual quality and semantic alignment with text, leading to a poor correlation with human judgments. To address this critical issue, we propose cFreD, a general metric based on a Conditional Fr\'echet Distance that unifies the assessment of visual fidelity and text-prompt consistency into a single score. Existing metrics such as Fr\'echet Inception Distance (FID) capture image quality but ignore text conditioning while alignment scores such as CLIPScore are insensitive to visual quality. Furthermore, learned preference models require constant retraining and are unlikely to generalize to novel architectures or out-of-distribution prompts. Through extensive experiments across multiple recently proposed text-to-image models and diverse prompt datasets, cFreD exhibits a higher correlation with human judgments compared to statistical metrics, including metrics trained with human preferences. Our findings validate cFreD as a robust, future-proof metric for the systematic evaluation of text conditioned models, standardizing benchmarking in this rapidly evolving field. We plan to release and include in the Appendix an evaluation toolkit and benchmark.
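For background, the unconditional Fréchet distance that underlies FID compares two Gaussians through their means and covariances: d² = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). The sketch below evaluates this for the special case of diagonal covariances, where the trace term reduces to a per-dimension sum. It is not the conditional metric proposed in the abstract; the diagonal assumption and the name `frechet_distance_diag` are simplifications introduced here for illustration.

```python
import math


def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Fréchet distance between two diagonal-covariance Gaussians.

    With diagonal covariances, Tr(S1 + S2 - 2 (S1 S2)^{1/2}) equals
    sum_i (sqrt(var1_i) - sqrt(var2_i))^2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2))
    return mean_term + cov_term


# Toy feature statistics (hypothetical, two dimensions).
real_mu, real_var = [0.0, 0.0], [1.0, 1.0]
gen_mu, gen_var = [3.0, 4.0], [1.0, 1.0]
print(frechet_distance_diag(real_mu, real_var, gen_mu, gen_var))
```

Full-covariance implementations need a matrix square root of Σ₁Σ₂ instead of the element-wise form used here.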

62

From Bands to Depth: Understanding Bathymetry Decisions on Sentinel-2

Satyaki Roy Chowdhury ⋅ Aswathnarayan Radhakrishnan ⋅ Hari Subramoni

Deploying Sentinel-2 satellite-derived bathymetry (SDB) robustly across sites remains challenging. We analyze a Swin-Transformer U-Net (Swin-BathyUNet) to understand how it infers depth and when its predictions are trustworthy. A leave-one-band-out study ranks spectral importance to the different bands consistent with shallow-water optics. We adapt ablation-based CAM to regression (A-CAM-R) and validate faithfulness via a performance–retention test: keeping only the top-p% salient pixels while neutralizing the rest causes large, monotonic RMSE increases, indicating explanations localize causal evidence. Attention ablations show decoder-conditioned cross-attention on skips is the most cost-effective upgrade, improving robustness to glint/foam. Cross-region inference (train one site, test another) reveals depth-dependent degradation: MAE rises nearly linearly with depth, and bimodal depth distributions exacerbate mid/deep errors. Practical guidance follows: maintain wide receptive fields, preserve radiometric fidelity in green/blue channels, pre-filter bright high-variance near shore, and pair light target-site fine-tuning with depth-aware calibration to transfer across regions.
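The masking step of a performance–retention test like the one described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the image and saliency map are single-channel nested lists, "neutralizing" is taken to mean replacing a pixel with a constant `fill` value, and `retain_top_p` and `rmse` are hypothetical helper names.

```python
import math


def retain_top_p(image, saliency, p, fill=0.0):
    """Keep the top-p% most salient pixels and set the rest to `fill`.

    Ties at the cut-off saliency are all kept, so slightly more than
    p% of the pixels may survive.
    """
    flat = sorted((s for row in saliency for s in row), reverse=True)
    k = max(1, round(len(flat) * p / 100))
    threshold = flat[k - 1]
    return [[px if s >= threshold else fill for px, s in zip(img_row, sal_row)]
            for img_row, sal_row in zip(image, saliency)]


def rmse(pred, target):
    """Root-mean-square error between two equally sized nested lists."""
    sq = [(a - b) ** 2
          for row_p, row_t in zip(pred, target)
          for a, b in zip(row_p, row_t)]
    return math.sqrt(sum(sq) / len(sq))


# Toy 2x2 example with a hypothetical saliency map.
image = [[1.0, 2.0], [3.0, 4.0]]
saliency = [[0.1, 0.9], [0.5, 0.3]]
masked = retain_top_p(image, saliency, p=50)
print(masked)
```

In the full test, each masked input would be passed through the depth model at several retention levels and the RMSE of the resulting predictions compared across those levels.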

63

Pyramidal Spectrum: Frequency-based Hierarchically Vector Quantized VAE for Videos

Tushar Prakash ⋅ Onkar Susladkar ⋅ Inderjit Dhillon ⋅ Sparsh Mittal

Variational Autoencoders (VAEs) form the foundation of modern video generation models. In particular, discrete latent VAEs with vector quantization have gained prominence for their superior perceptual sharpness, ability to model long-range dynamics, and efficient adaptation to downstream tasks. However, existing discrete VAEs face two key limitations: (i) a lack of frequency-domain modeling to enhance global spatiotemporal understanding, and (ii) fixed-resolution quantization schemes, preventing effective modeling of coarse-to-fine spatiotemporal hierarchies essential for video generation. To address these limitations, we propose a Pyramidal Vector Quantized Variational Autoencoder (PVQ-VAE) for videos. PVQ-VAE's encoder–decoder leverages Fast Fourier Transform and Discrete Wavelet Transform to capture global semantics and multi-scale local details jointly. We introduce Pyramidal Vector Quantization (PVQ), a hierarchical quantization scheme that discretizes features at multiple resolutions to better capture multi-scale information. To further boost fidelity, we introduce a cross-modal contrastive loss guided by a pretrained high-resolution image VAE. PVQ-VAE achieves state-of-the-art performance on WebVid-val, COCO-val, and MCL-JCV, reconstructing videos with high perceptual quality at up to 32× spatial and 16× temporal compression.

64

PRISM-CAFO: Prior-conditioned Remote-sensing Infrastructure Segmentation and Mapping for CAFOs

Oishee Bintey Hoque ⋅ Nibir Mandal ⋅ Kyle Luong ⋅ Mandy Wilson ⋅ Samarth Swarup ⋅ Madhav Marathe ⋅ Abhijin Adiga

Large-scale livestock operations pose significant risks to human health and the environment, while also being vulnerable to threats such as infectious diseases and extreme weather events. As the number of such operations continues to grow, accurate and scalable mapping has become increasingly important. In this work, we present an infrastructure-first, explainable pipeline for identifying and characterizing Concentrated Animal Feeding Operations (CAFOs) from aerial and satellite imagery. Our method (i) detects candidate infrastructure (e.g., barns, feedlots, manure lagoons, silos) with a domain-tuned YOLOv8 detector, then derives SAM2 masks from these boxes and filters them using geometric and component-specific criteria; (ii) extracts structured descriptors—counts, areas, orientations, and spatial relations—that are fused with deep visual features via a lightweight spatial cross attention-based classifier; and (iii) outputs both CAFO type predictions and mask-level attributions that link decisions to visible infrastructure. Through comprehensive evaluation, we show that our approach achieves state-of-the-art performance, with Swin-B+PRISM-CAFO surpassing the best performing baseline by up to 15%. Beyond strong predictive performance across diverse U.S. regions, we run systematic gradient–activation analyses that quantify the impact of domain priors and show how specific infrastructure (e.g., barns, lagoons) shapes classification decisions. We release code, infrastructure masks, and descriptors to facilitate transparent, scalable monitoring of livestock infrastructure. Our system enables stakeholders to model environmental risks (e.g., identifying manure ponds for water quality screening), monitor infrastructure changes, and prioritize regulatory interventions at regional and national scales.

65

Dressing the Imagination: A Dataset for AI-Powered Translation of Text into Fashion Outfits and A Novel NeRA Adapter for Enhanced Feature Adaptation

Gayatri Deshmukh ⋅ Somsubhra De ⋅ Chirag Sehgal ⋅ Jishu Gupta ⋅ Sparsh Mittal

Specialized datasets that capture the fashion industry's rich language and styling elements can boost progress in AI-driven fashion design. We present FLORA (Fashion Language Outfit Representation for Apparel Generation), the first comprehensive dataset containing 4,330 curated pairs of fashion outfits and corresponding textual descriptions. Each description utilizes industry-specific terminology and jargon commonly used by professional fashion designers, providing precise and detailed insights into the outfits. Hence, the dataset captures the delicate features and subtle stylistic elements necessary to create high-fidelity fashion designs. We demonstrate that fine-tuning generative models on the FLORA dataset significantly enhances their capability to generate accurate and stylistically rich images from textual descriptions of fashion sketches. FLORA will catalyze the creation of advanced AI models capable of comprehending and producing subtle, stylistically rich fashion designs. It will also help fashion designers and end-users to bring their ideas to life. As a second orthogonal contribution, we introduce NeRA (Nonlinear low-rank Expressive Representation Adapter), a novel adapter architecture based on Kolmogorov-Arnold Networks (KAN). NeRA replaces traditional MLP-based LoRA adapters with learnable spline-based nonlinear transformations, enabling superior modeling of complex semantic relationships, achieving strong fidelity, faster convergence and semantic alignment. Extensive experiments and ablation studies on our proposed FLORA and LAION-5B datasets validate the superiority of NeRA over LoRA adapters. We will open-source both the FLORA dataset and our implementation code.

66

EllipssianNet: Image-guided Sampling of 2D Gaussians for Gaussian Splatting

MyoungGon Kim ⋅ JeongHyeon Ahn ⋅ Seohyeon Park ⋅ Hyemi Kim ⋅ Seunghyun Park ⋅ Jung Hwang ⋅ JungHyun Han

In this paper, we present a neural sampling method, EllipssianNet, which predicts 2D Gaussians from an input RGB image. Trained with a Voronoi diagram-based synthetic dataset, EllipssianNet outputs a center map and a covariance map, which are combined with the colors sampled from the input image to generate 2D Gaussians. The Gaussians are anisotropic and aligned with local complexities of the input RGB image. The 2D Gaussians are converted into 3D ones that are then optimized and rasterized in the 3D Gaussian Splatting framework. EllipssianNet is tested in two applications. In Gaussian-based image representation, initialization with EllipssianNet enables faster convergence and higher rendering quality. EllipssianNet is also seamlessly integrated into a real-time SLAM system, producing high-quality reconstructions under online constraints.
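To illustrate what an anisotropic 2D Gaussian defined by a center and a covariance looks like in practice, the following is a minimal sketch that evaluates the unnormalized footprint of one such Gaussian at a pixel. The function name `gaussian_2d_weight` and the 2x2 covariance parameterization are assumptions made for illustration; the abstract does not specify how EllipssianNet encodes its covariance map.

```python
import math


def gaussian_2d_weight(point, center, cov):
    """Unnormalized weight exp(-0.5 * d^T cov^{-1} d) of a 2D Gaussian.

    `cov` is a 2x2 covariance given as ((a, b), (c, d)). This layout is an
    illustrative assumption, not the paper's parameterization.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    # Quadratic form with the closed-form 2x2 inverse: (1/det) * [[d, -b], [-c, a]].
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * maha)


# A Gaussian elongated along x: variance 4 along x, 1 along y.
cov = ((4.0, 0.0), (0.0, 1.0))
along_major = gaussian_2d_weight((2.0, 0.0), (0.0, 0.0), cov)
along_minor = gaussian_2d_weight((0.0, 2.0), (0.0, 0.0), cov)
```

With this covariance the weight decays more slowly along the x axis than along the y axis, which is the sense in which such a Gaussian can align with elongated image structure.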

67

Zero-Shot Coreset Selection via Iterative Subspace Sampling

Brent Griffin ⋅ Jacob Marks ⋅ Jason Corso

Deep learning increasingly relies on massive data with substantial storage, annotation, and training costs. To reduce costs, coreset selection finds a representative subset of data to train models while ideally performing on par with the full data training. To maximize performance, current state-of-the-art coreset methods select data using dataset-specific ground truth labels and training. However, these methodological requirements prevent selection at scale on real-world, unlabeled data. To that end, this paper addresses the selection of coresets that achieve state-of-the-art performance but without using any labels or training on candidate data. Instead, our solution, Zero-Shot Coreset Selection via Iterative Subspace Sampling (ZCore), uses previously-trained foundation models to generate zero-shot, high-dimensional embedding spaces to interpret unlabeled data. ZCore then iteratively quantifies the relative value of all candidate data based on coverage and redundancy in numerous subspace distributions. Finally, ZCore selects a coreset sized for any data budget to train downstream models. We evaluate ZCore on four datasets and outperform several state-of-the-art label-based methods, especially at low data rates that provide the most substantial cost reduction. On ImageNet, ZCore selections for 10% training data achieve a downstream validation accuracy of 53.99%, which outperforms prior label-based methods and removes annotation and training costs for 1.15 million images.

68

MorphXAI: An Explainable Framework for Morphological Analysis of Parasites in Blood Smear Images

Aqsa Yousaf ⋅ Sint Sint Win ⋅ Megan Coffee ⋅ Habeeb Olufowobi

Parasitic infections remain a pressing global health challenge, particularly in low-resource settings where diagnosis still depends on labor-intensive manual inspection of blood smears. While deep learning models have shown strong performance in automating parasite detection, their clinical usefulness is constrained by limited interpretability. Existing explainability methods are largely restricted to visual heatmaps or attention maps, which highlight regions of interest but fail to capture the morphological traits that clinicians rely on for diagnosis. In this work, we present MorphXAI, an explainable framework that unifies parasite detection with fine-grained morphological analysis. MorphXAI integrates morphological supervision directly into the prediction pipeline, enabling the model to localize parasites while simultaneously characterizing clinically relevant attributes such as shape, curvature, visible dot count, flagellum presence, and developmental stage. To support this task, we curate a clinician-annotated dataset of three parasite species (Leishmania, Trypanosoma brucei, and Trypanosoma cruzi) with detailed morphological labels, establishing a new benchmark for interpretable parasite analysis. Experimental results show that MorphXAI not only improves detection performance over the baseline but also provides structured, biologically meaningful explanations.

69

WWE-UIE: A Wavelet & White Balance Efficient Network for Underwater Image Enhancement

Ching-Heng Cheng ⋅ Jen-Wei Lee ⋅ Chia-Ming Lee ⋅ Chih-Chung Hsu

Underwater Image Enhancement (UIE) aims to restore visibility and correct color distortions caused by light absorption and scattering in aquatic environments. Existing deep learning approaches often suffer from high computational overhead or limited generalization due to the lack of explicit domain priors. In this work, we propose WWE-UIE, a compact and efficient enhancement network that integrates three interpretable priors: white balance correction to mitigate color attenuation, a wavelet-based enhancement block (WEB) for multi-scale frequency decomposition, and a gradient-aware module (SGFB) to preserve edge sharpness. We further incorporate the HVI color space to decouple chromatic and intensity information, enhancing color fidelity in challenging underwater scenes. Extensive experiments on benchmark datasets demonstrate that WWE-UIE achieves competitive performance with significantly fewer parameters and FLOPs, enabling real-time inference on resource-limited platforms. Ablation studies and visualizations confirm the contribution of each component. Moreover, our prior-guided framework generalizes well to other degradation domains, such as low-light and foggy scenes, highlighting its versatility for practical and time-sensitive image restoration applications.

70

SPAR-Det: Segmentation-guided and Prior-Aided Routing for Small Object Detection

Seungchan Kwon ⋅ Gyuil Lim ⋅ Youngjoon Han

Small Object Detection (SOD) is essential for real-world applications, including satellite image analysis and drone-based surveillance, where target objects typically exhibit limited spatial extent, high density, and visual similarity to complex backgrounds. These factors substantially hinder conventional object detection methods, resulting in weak feature representations and poor detection accuracy. To overcome these challenges, we introduce Segmentation-guided and Prior-Aided Routing for Small Object Detection (SPAR-Det), a unified framework integrating segmentation-guided attention, geometric prior supervision, and adaptive feature routing. At its core, SPAR-Det employs a Cross-Attention Heterogeneous Feature Fusion (CAHF) module that leverages pretrained segmentation backbones to enhance foreground object features while effectively suppressing background. Additionally, we propose a Geometric Prior Supervision Loss that combines Gaussian bounding box maps with segmentation feature maps, providing crucial geometric context and semantic cues to address the limited self-representation capability of small objects. Furthermore, our framework includes a Mixture-of-Experts (MoE) detection head, dynamically allocating specialized classifiers according to varying scene characteristics, thereby significantly improving generalization across diverse environments. Extensive evaluations conducted on two benchmark datasets, AI-TOD and VisDrone, demonstrate that SPAR-Det achieves state-of-the-art performance, verifying its robustness and applicability for challenging small object detection scenarios. The source code will be publicly released upon publication.

71

GeoHSAF: Geometric Hippocampus Shape Analysis Framework for Longitudinal Alzheimer's Disease Classification

MUBARAK OLAOLUWA ⋅ HENI LOUKIL ⋅ Arafet Sbei ⋅ Hassen Drira

Alzheimer’s disease (AD) is the most common form of dementia and a progressive, irreversible brain disorder that affects millions worldwide. The majority of existing research on AD classification relies on cross-sectional brain magnetic resonance imaging studies, which consider information from a single time point and fail to account for the progressive nature of AD. Longitudinal analysis, however, is crucial for capturing AD evolution and enabling more accurate diagnosis. To address this gap, we propose GeoHSAF, a novel hippocampus-based geometric learning framework for longitudinal AD classification. To overcome the challenge of missing or inconsistent hippocampal shapes across subjects and time points, our framework includes an interpolation module that predicts intermediate shapes, ensuring temporal continuity. We evaluate the effectiveness of GeoHSAF on three public longitudinal AD datasets: ADNI, OASIS, and AIBL, and benchmark its performance against existing approaches. GeoHSAF achieves new state-of-the-art results on binary classification tasks (AD vs. Normal Controls (NC)), while also demonstrating strong performance on more challenging triple-class classification tasks (AD vs. NC vs. Mild Cognitive Impairment (MCI)). Our work is fully reproducible, and all code is available at: https://github.com/anonymous252573/GeoHSAF

72

BAFIS: Dataset + Framework to assess occupational Bias and Human Preference in modern Text-to-image Models

Thomas Klassert ⋅ Adrian Ulges ⋅ Biying Fu

Generative artificial intelligence has the potential to improve productivity and transform the production of creative content. However, existing research indicates that image generation models are significantly influenced by biases. This work investigates the inherent biases and language-induced biases present in text-to-image models within the context of occupation-related image generation, complementing established metrics with human preference feedback. We present a comprehensive evaluation of five current text-to-image models: Midjourney v6.1, Stable Diffusion 3 Medium, DALL-E 3, Playground v2.5, and FLUX.1-dev, focusing on gender and ethnicity bias, image quality, and prompt alignment. To facilitate this evaluation, we developed the "Battle-Arena for Fair Image Synthesis" (BAFIS), a platform designed to collect human feedback on bias in generated images. Furthermore, we created a dataset comprising 21,140 synthetic images generated using multilingual prompts, which serves as a basis for our analysis. We further place our results within a broader social context by comparing them to official statistics from the German Federal Employment Agency. Our findings reveal systematic biases in text-to-image models, with established evaluation metrics correlating only partially with subjective user ratings. Thus, our research emphasizes the need to include human preferences when developing fairer and more inclusive text-to-image models.

73

Imitating the Functionality of Image-to-Image Models Using a Single Example

Nurit Spingarn ⋅ Tomer Michaeli

We study the possibility of imitating the functionality of an image-to-image translation model by observing input-output pairs. We focus on cases where training the model from scratch is impossible, either because training data are unavailable or because the model architecture is unknown. This is the case, for example, with commercial models for biological applications. Since the development of these models requires large investments, their owners commonly keep them confidential and reveal only a few input-output examples on the company's website or in an academic paper. Surprisingly, we find that even a single example typically suffices for learning to imitate the model's functionality, and that this can be achieved using a simple distillation approach. We present an extensive ablation study encompassing a wide variety of model architectures, datasets and tasks, to characterize the factors affecting vulnerability to functionality imitation, and provide a preliminary theoretical discussion on the reasons for this unwanted behavior.

74

Beyond the Highlights: Video Retrieval with Salient and Surrounding Contexts

Jaehun Bang ⋅ Moon Ye-Bin ⋅ Tae-Hyun Oh ⋅ Kyungdon Joo

When searching for videos, users often rely on surrounding context such as background elements or temporal details beyond salient content. However, existing video models struggle with fine-grained spatio-temporal understanding, particularly of surrounding contexts, and there are no datasets that effectively evaluate their performance. We introduce our SS Datasets, a collection of three video retrieval datasets that offer detailed salient and surrounding captions aligned with semantically segmented clips. To capture rich, temporally localized contexts aligned with meaningful scene changes, we segment videos based on scene transitions and generate captions with a vision-language model. Then, we analyze current video models, revealing their challenges in matching surrounding context queries and handling temporally complex videos. To overcome these challenges, we propose simple yet effective baselines that improve retrieval across various query types, enabling models to generalize robustly to real-world scenarios.

75

SCORE: Soft Label Compression-Centric Dataset Condensation via Coding Rate Optimization

Bowen Yuan ⋅ Yuxia Fu ⋅ Zijian Wang ⋅ Yadan Luo ⋅ Zi Huang

Dataset Condensation (DC) aims to obtain a condensed dataset that allows models trained on the condensed dataset to achieve performance comparable to those trained on the full dataset. Recent DC approaches increasingly focus on encoding knowledge into realistic images with soft labeling, for their scalability to ImageNet-scale datasets and strong capability of cross-domain generalization. However, this strong performance comes at a substantial storage cost which could significantly exceed the storage cost of the original dataset. We argue that the three key properties to alleviate this performance-storage dilemma are informativeness, discriminativeness, and compressibility of the condensed data. Towards this end, this paper proposes a Soft label compression-centric dataset condensation framework using COding RatE (SCORE). SCORE formulates dataset condensation as a min-max optimization problem, which aims to balance the three key properties from an information-theoretic perspective. In particular, we theoretically demonstrate that our coding rate-inspired objective function is submodular, and its optimization naturally enforces low-rank structure in the soft label set corresponding to each condensed data. Extensive experiments on large-scale datasets, including ImageNet-1K and Tiny-ImageNet, demonstrate that SCORE outperforms existing methods in most cases. Even with 30× compression of soft labels, performance decreases by only 5.5% and 2.7% for ImageNet-1K with IPC 10 and 50, respectively. Code will be released upon paper acceptance.

76

Sketch2Stitch: GANs for Abstract Sketch-Based Dress Synthesis

Faizan Khan ⋅ Faizan Khan ⋅ Davide Morelli ⋅ Marcella Cornia ⋅ Rita Cucchiara ⋅ Mohamed Elhoseiny

In the realm of creative expression, not everyone possesses the gift of effortlessly translating their imaginative visions into flawless sketches. More often than not, the outcome resembles an abstract, perhaps even slightly distorted representation. The art of producing impeccable sketches is not only challenging but also a time-consuming process. Our work is the first of its kind in transforming abstract, sometimes deformed garment sketches into photorealistic catalog images, to empower the everyday individual to become their own fashion designer. We create Sketch2Stitch, a dataset featuring over 65,000 abstract sketch images generated from garments of DressCode (Morelli et al., 2022) and VITON-HD (Choi et al., 2021), two benchmark datasets in the virtual try-on task. Sketch2Stitch is the first dataset in the literature to provide abstract sketches in the fashion domain. We propose a StyleGAN-based generative framework that bridges freehand sketching with photorealistic garment synthesis. We demonstrate that our framework allows users to sketch rough outlines and optionally provide color hints, producing realistic designs in seconds. Experimental results demonstrate, both quantitatively and qualitatively, that the proposed framework achieves superior performance against various baselines and existing methods on both subsets of our dataset. Our work highlights a pathway toward AI-assisted fashion design tools, democratizing garment ideation for students, independent designers, and casual creators.

77

AnyBald: Toward Realistic Diffusion-Based Hair Removal In-The-Wild

Yongjun Choi ⋅ Seungoh Han ⋅ Soomin Kim ⋅ Sumin Son ⋅ Mohsen Rohani ⋅ Edgar Maucourant ⋅ Dongbo Min ⋅ Kyungdon Joo

We present AnyBald, a novel framework for realistic hair removal from portrait images captured under diverse in-the-wild conditions. One of the key challenges in this task is the lack of high-quality paired data, as existing datasets are often low-quality, with limited viewpoint variation and overall diversity, making it difficult to handle real-world cases. To address this, we construct a scalable data augmentation pipeline that synthesizes high-quality hair and non-hair image pairs capturing diverse real-world scenarios, enabling effective generalization with the added benefit of scalable supervision. With this enriched dataset, we present a new hair removal framework that reformulates pretrained latent diffusion inpainting using learnable text prompts, removing the need for explicit masks at inference. In doing so, our model achieves natural hair removal with semantic preservation via implicit localization. To further improve spatial precision, we introduce a regularization loss that guides the model to focus attention specifically on hair regions. Extensive experiments demonstrate that AnyBald outperforms existing methods at removing hair while preserving identity and background semantics across various in-the-wild domains.

78

Understanding Generative AI Capabilities in Everyday Image Editing Tasks

Brandon Collins ⋅ Mohammad Reza Taesiri ⋅ Logan Bolton ⋅ Viet Lai ⋅ Franck Dernoncourt ⋅ Trung Bui ⋅ Anh Nguyen

Generative AI (GenAI) holds significant promise for automating everyday image editing tasks, especially following the recent release of GPT-4o on March 25, 2025. However, what subjects do people most often want edited? What kinds of editing actions do they want to perform (e.g., removing or stylizing the subject)? Do people prefer precise edits with predictable outcomes or highly creative ones? By understanding the characteristics of real-world requests and the corresponding edits made by freelance photo-editing wizards, can we draw lessons for improving AI-based editors and determine which types of requests can currently be handled successfully by AI editors? In this paper, we present a unique study addressing these questions by analyzing 83k requests from the past 12 years (2013–2025) on the Reddit community, which collected 305k PSR-wizard edits. According to human ratings, only about 33% of requests can be fulfilled by the best AI editors (including GPT-4o, Gemini-2.0-Flash, SeedEdit). Interestingly, AI editors perform worse on low-creativity requests that require precise editing than on more open-ended tasks. They often struggle to preserve the identity of people and animals, and they frequently make non-requested touch-ups. On the evaluation side, VLM judges (e.g., o1) perform differently from human judges and may prefer AI edits more than human edits.

79

Reconstructing Realistic and Relightable Eyes

Wesley Khademi ⋅ Jogendra Nath Kundu ⋅ Yatong An ⋅ Alexander Fix ⋅ David Colmenares

Accurately modeling the eye is a challenging task as it exhibits refraction and reflection at the cornea, complex iris texture, and self-occlusion and shadowing due to eyelids and eyelashes. To address these challenges, we present a system for learning a hybrid relightable eye model which can be relit under near-field point lights. Our hybrid model leverages an eyeball mesh for explicitly representing the cornea surface and the reflections on it while learning the geometry and light transport of the periocular region and eye interior implicitly. To account for refraction, we explicitly handle the refraction of camera rays using Snell's law and predict the refraction of incident light rays using a neural network. Furthermore, we propose an extension of our method which enables us to relight the eye using a fringe projector to simulate structured light. Through experiments, we demonstrate that our method results in higher fidelity rendering under novel viewpoint and lighting conditions, improves learned iris geometry, and more accurately simulates structured light fringe patterns on the eye.
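The explicit refraction of camera rays mentioned above follows the vector form of Snell's law. The sketch below is a minimal illustration, assuming unit-length direction and normal vectors. The function name `refract` is hypothetical, and the corneal refractive index of 1.376 used in the example is an illustrative value rather than one taken from the paper.

```python
import math


def refract(direction, normal, n1, n2):
    """Refract a unit ray `direction` at a surface with unit `normal`.

    `normal` points against the incoming ray (toward the medium with index n1).
    Returns the refracted unit direction, or None on total internal reflection.
    """
    eta = n1 / n2
    cos_i = -sum(d * n for d, n in zip(direction, normal))
    # Snell's law: sin(theta_t) = eta * sin(theta_i).
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None  # total internal reflection, no transmitted ray
    cos_t = math.sqrt(1.0 - sin2_t)
    k = eta * cos_i - cos_t
    return tuple(eta * d + k * n for d, n in zip(direction, normal))


# A ray hitting a flat interface at 45 degrees, travelling from air into a
# denser medium (index 1.376 is an assumed, illustrative value).
angle = math.radians(45.0)
incoming = (math.sin(angle), 0.0, -math.cos(angle))
surface_normal = (0.0, 0.0, 1.0)
bent = refract(incoming, surface_normal, 1.0, 1.376)
```

The refracted ray bends toward the normal when entering the denser medium, and the function returns `None` when no transmitted ray exists.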

80

1LoRA: Summation Compression for Very-Low Rank Adaptation

Alessio Quercia ⋅ Zhuo Cao ⋅ Arya Bangun ⋅ Richard Paul ⋅ Abigail Morrison ⋅ Ira Assent ⋅ Hanno Scharr

Parameter-Efficient Fine-Tuning (PEFT) methods have transformed the approach to fine-tuning large models for downstream tasks by enabling the adjustment of significantly fewer parameters than those in the original model matrices. In this work, we study the "very low rank regime", where we fine-tune the smallest number of parameters per linear layer for each considered PEFT method. We propose 1LoRA (Summation Low-Rank Adaptation), a compute-, parameter-, and memory-efficient fine-tuning method which uses the feature sum as fixed compression and a single trainable vector as decompression. Unlike state-of-the-art PEFT methods such as LoRA, VeRA, and the recent MoRA, 1LoRA uses fewer parameters per layer, reducing the memory footprint and the computational cost. We extensively evaluate our method against state-of-the-art PEFT methods on multiple fine-tuning tasks, and show that our method not only outperforms them, but is also more parameter-, memory-, and compute-efficient. Moreover, thanks to its memory efficiency, 1LoRA allows fine-tuning more evenly across layers, instead of focusing on specific ones (e.g., attention layers), improving performance further.
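One way to read "feature sum as fixed compression and a single trainable vector as decompression" is as a rank-1 update whose down-projection is fixed to summation. The sketch below is written under that assumption only: the function name `one_lora_forward` is hypothetical, and any scaling, normalization, or initialization used by the actual method is not stated in the abstract and is omitted here.

```python
def one_lora_forward(weight, v, x):
    """Frozen linear layer plus a summation-based rank-1 update.

    weight: frozen matrix as a list of rows (d_out x d_in)
    v:      trainable vector of length d_out (the "decompression")
    x:      input features of length d_in
    Computes weight @ x + v * sum(x). This reading is an assumption.
    """
    s = sum(x)  # fixed compression: no trainable parameters
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + v_i * s
        for row, v_i in zip(weight, v)
    ]


weight = [[1.0, 0.0, 2.0],
          [0.0, 1.0, 0.0]]
x = [1.0, 2.0, 3.0]

frozen_only = one_lora_forward(weight, [0.0, 0.0], x)
adapted = one_lora_forward(weight, [0.5, -1.0], x)

# Trainable parameters per layer under this reading, versus rank-1 LoRA.
d_out, d_in = len(weight), len(weight[0])
params_summation = d_out
params_rank1_lora = d_in + d_out
```

Under this reading the adapter adds only `d_out` trainable values per layer, compared with `d_in + d_out` for a rank-1 LoRA, which is consistent with the parameter savings the abstract claims.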

81

Illuminating Darkness: Learning to Enhance Low-light Images In-the-Wild

S Sharif ⋅ Abdur Rehman ⋅ Zain Abidin ⋅ Fayaz Ali ⋅ Radu Timofte ⋅ Rizwan Naqvi

Single-shot low-light image enhancement (SLLIE) remains challenging due to the limited availability of diverse, real-world paired datasets. To bridge this gap, we introduce the Low-Light Smartphone Dataset (LSD), a large-scale, high-resolution (4K+) dataset collected in the wild across a wide range of challenging lighting conditions (0.1–200 lux). LSD contains 6,425 precisely aligned low- and normal-light image pairs, selected from over 8,000 dynamic indoor and outdoor scenes through multi-frame acquisition and expert evaluation. To evaluate generalization and aesthetic quality, we collect 2,117 unpaired low-light images from previously unseen devices. To fully exploit LSD, we propose TFFormer, a hybrid model that encodes luminance and chrominance (LC) separately to reduce color-structure entanglement. We further propose a cross-attention-driven joint decoder for context-aware fusion of LC representations, along with LC refinement and LC-guided supervision to significantly enhance perceptual fidelity and structural consistency. TFFormer achieves state-of-the-art results on LSD (+2.45 dB PSNR) and substantially improves downstream vision tasks, such as low-light object detection (+6.80 mAP on ExDark). Data and code will be publicly released.

82

DOODLE: Diffusion-based Out-of-Distribution Learning for Open-set LiDAR Semantic Segmentation

Changgyoon Oh ⋅ Hyeonseong Kim ⋅ Daehyun We ⋅ Jongoh Jeong ⋅ Yujeong Chae ⋅ Kuk-Jin Yoon

Open-set driving in complex real-world environments requires reliable identification of out-of-distribution (OOD) objects to avoid overconfident predictions on unseen categories. However, the sparsity and limited semantic richness of LiDAR point clouds make separating known and unknown classes difficult. This work proposes DOODLE, a diffusion model–based OOD learning framework for open-set 3D semantic segmentation. DOODLE trains a diffusion model to reconstruct in-distribution semantic features; feature-level reconstruction discrepancies then serve as OOD evidence. The resulting OOD scores are used to enhance backbone semantic features, improving discrimination of unknown regions during segmentation. To further reduce false positives arising from nonuniform measurements, a density-aware post-processing (DAP) module incorporates spatial variation in LiDAR point density when refining OOD predictions. DOODLE integrates seamlessly with existing open-set models and does not constrain backbone design. Experiments on SemanticKITTI and nuScenes demonstrate state-of-the-art OOD performance. On SemanticKITTI, DOODLE improves area under the precision–recall curve (AUPR) by 1.85%p and area under the receiver operating characteristic (AUROC) by 1.29%p over prior methods. Ablation studies confirm complementary benefits from diffusion-based reconstruction and DAP. Code will be open for reproducibility.

83

RobustFormer: Noise-Robust Pre-training for Images and Videos

Ashish Bastola ⋅ Nishant Luitel ⋅ Hao Wang ⋅ Danda Pani Paudel ⋅ Roshni Poudel ⋅ Abolfazl Razi

While deep learning-based models like transformers have revolutionized time-series and vision tasks, they remain highly susceptible to noise and often overfit on noisy patterns rather than robust features. This issue is exacerbated in vision transformers, which rely on pixel-level details that can easily be corrupted. To address this, we leverage the discrete wavelet transform (DWT) for its ability to decompose inputs into multi-resolution layers, isolating noise primarily in the high-frequency domain while preserving essential low-frequency information for resilient feature learning. Conventional DWT-based methods, however, struggle with computational inefficiencies due to the requirement for a subsequent inverse discrete wavelet transform (IDWT) step. In this work, we introduce RobustFormer, a novel framework that enables noise-robust masked autoencoder (MAE) pre-training for both images and videos by using DWT for efficient downsampling, eliminating the need for expensive IDWT reconstruction and simplifying the attention mechanism to focus on noise-resilient multi-scale representations. To our knowledge, RobustFormer is the first DWT-based method fully compatible with video inputs and MAE-style pre-training. Extensive experiments on noisy image and video datasets demonstrate that our approach achieves up to an 8% increase in Top-1 classification accuracy under severe noise conditions on ImageNet-C and up to 2.7% on the ImageNet-P standard benchmark compared to the baseline, and up to 13% higher Top-1 accuracy on UCF-101 under severe custom noise perturbations, while maintaining similar accuracy on clean datasets. We also observe a reduction in computational complexity of up to 4.4% through IDWT removal compared to the VideoMAE baseline, without a significant performance drop.
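To make the idea of DWT-based downsampling concrete, here is a minimal sketch of a single-level 2D Haar transform that splits an image into one low-frequency band at half resolution and three high-frequency detail bands. The choice of the Haar wavelet and the function name `haar_dwt2` are assumptions for illustration; the abstract does not state which wavelet or which subbands RobustFormer uses.

```python
def haar_dwt2(image):
    """Single-level 2D orthonormal Haar transform of a list-of-lists image.

    Returns (ll, lh, hl, hh), each at half the input resolution. `ll` is the
    low-frequency approximation; the other three hold high-frequency detail.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for j in range(0, cols, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            ll_row.append((a + b + c + d) / 2)  # local average (scaled)
            lh_row.append((a - b + c - d) / 2)  # left-right differences
            hl_row.append((a + b - c - d) / 2)  # top-bottom differences
            hh_row.append((a - b - c + d) / 2)  # diagonal differences
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


# Pixel-level checkerboard noise lands entirely in a detail band,
# while a smooth (constant) patch lands entirely in the low-frequency band.
noise_ll, _, _, noise_hh = haar_dwt2([[1, -1], [-1, 1]])
flat_ll, flat_lh, flat_hl, flat_hh = haar_dwt2([[3, 3], [3, 3]])
```

Keeping only the low-frequency band gives a 2x downsampled input in which this kind of pixel-level noise is suppressed, without needing an inverse transform.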

84

CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning

Zeyuan Chen ⋅ Xiang Zhang ⋅ Haiyang Xu ⋅ Jianwen Xie ⋅ Zhuowen Tu

We present Deux3D, a simple yet effective framework for 3D scene understanding that draws inspiration from the two types of human visual fields -- central vision and peripheral vision. Existing approaches primarily rely on unstructured representations, such as point clouds, voxels, or patch features, and inject scene context implicitly via coordinate embeddings. However, this often results in limited spatial reasoning capabilities due to the lack of explicit, high-level structural understanding. To address this limitation, we introduce two complementary components into a Large Multimodal Model-based architecture: target-affinity token, analogous to central vision, that guides the model's attention toward query-relevant objects; and allocentric grid, akin to peripheral vision, that captures global scene context and spatial arrangements. These components work in tandem to enable structured, context-aware understanding of complex 3D environments. Experiments show that Deux3D achieves state-of-the-art performance across a range of 3D scene understanding benchmarks.

85

Trajectory Tactics: When Transformers Learn Exploration to Generate Online Signature

Anurag Pandey ⋅ Aditya Nigam ⋅ Arnav Bhavsar ⋅ Ashutosh Sharma ⋅ Basu Verma ⋅ Divya Acharya ⋅ Mohd Amir

The increasing need for robust digital signature verification systems has amplified interest in realistic online signature generation to counter digital forgeries. In this work, we propose a novel Decision Transformer-based framework that learns from the output of a reinforcement learning (RL) model to generate diverse online signatures. Departing from traditional RL approaches that rely on policy gradients or value function estimation, we formulate signature generation as a sequence-modeling problem. Our framework addresses varied free-form signature styles, demonstrating adaptability across linguistic and stylistic variations. Initially, an RL model generates signature trajectories, which are then fed to a Decision Transformer that employs an autoregressive sequence modeling approach. To further personalize the generated signatures, we introduce a Q-learning-based module that produces user-specific variations while mitigating noise. By operating in an offline reinforcement learning setting, the proposed method reduces the dependency on extensive online interactions, improving scalability. Experimental results on a publicly available online signature dataset in multiple linguistic script styles show that our approach significantly outperforms traditional generative methods in terms of realism, variability, and mimicry accuracy. These results highlight the potential of Decision Transformers for structured sequence generation tasks beyond their conventional domains.

86

BrandFusion: Aligning Image Generation with Brand Styles

Parul Gupta ⋅ Varun Khurana ⋅ Yaman Singla ⋅ Balaji Krishnamurthy ⋅ Abhinav Dhall

While recent text-to-image models excel at generating realistic content, they struggle to capture the nuanced visual characteristics that define a brand's distinctive style—such as lighting preferences, photography genres, color palettes, and compositional choices. This work introduces BrandFusion, a novel framework that automatically generates brand-aligned promotional images by decoupling brand style learning from image generation. Our approach consists of two components: a Brand-aware Vision-Language Model (BrandVLM) that predicts brand-relevant style characteristics and corresponding visual embeddings from marketer-provided contextual information, and a Brand-aware Diffusion Model (BrandDM) that generates images conditioned on these learned style representations. Unlike existing personalization methods that require separate fine-tuning for each brand, BrandFusion maintains scalability while preserving interpretability through textual style characteristics. Our method generalizes effectively to unseen brands by leveraging common industry sector-level visual patterns. Extensive evaluation demonstrates consistent improvements over existing approaches across multiple brand alignment metrics, with a 68.61% preference rate in human evaluation study. This work paves the way for AI-assisted on-brand content creation in marketing workflows.

87

Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models

Zhenxiang Lin ⋅ Maryam Haghighat ⋅ Will Browne ⋅ Dimity Miller

Vision-language models (VLMs), such as CLIP, have gained popularity for their strong open-vocabulary classification performance, but they are prone to assigning high confidence scores to misclassifications, limiting their reliability in safety-critical applications. We introduce a training-free, post-hoc uncertainty estimation method for contrastive VLMs that can be used to detect erroneous predictions. Our approach assesses visual feature consistency within a class, using feature projection combined with multivariate Gaussians to create class-specific probabilistic embeddings. At inference, test embeddings are scored using the log-probability under each class distribution, with a softmax-normalized density used as the new confidence score. Our method is VLM-agnostic, requires no fine-tuning, is robust to label shift, and works effectively with as few as 300 training images per class. Extensive experiments on ImageNet, Flowers102, and Food101 show state-of-the-art error detection performance, significantly outperforming both deterministic and probabilistic VLM baselines. Our code will be made publicly available upon acceptance.
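The scoring step described above (log-probability under each class Gaussian, followed by softmax normalization) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes diagonal covariances to stay dependency-free, skips the feature projection step, and uses hypothetical function names and toy 2D features.

```python
import math


def fit_diag_gaussians(features, labels, var_floor=1e-6):
    """Fit a per-class Gaussian (mean, per-dimension variance).

    Diagonal covariance is a simplifying assumption made for this sketch.
    """
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    params = {}
    for y, xs in by_class.items():
        n, dim = len(xs), len(xs[0])
        mean = [sum(x[k] for x in xs) / n for k in range(dim)]
        var = [sum((x[k] - mean[k]) ** 2 for x in xs) / n + var_floor
               for k in range(dim)]
        params[y] = (mean, var)
    return params


def log_density(x, mean, var):
    """Log-density of x under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def class_confidences(x, params):
    """Softmax over per-class log-densities, used as confidence scores."""
    logp = {y: log_density(x, m, v) for y, (m, v) in params.items()}
    top = max(logp.values())  # subtract the max for numerical stability
    unnorm = {y: math.exp(lp - top) for y, lp in logp.items()}
    total = sum(unnorm.values())
    return {y: u / total for y, u in unnorm.items()}


# Toy 2D "embeddings" for two classes (illustrative data only).
features = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0],
            [3.0, 3.1], [2.9, 3.0], [3.2, 2.8]]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
params = fit_diag_gaussians(features, labels)
conf = class_confidences([3.0, 3.0], params)
```

A test embedding that lies close to one class's feature distribution receives a confidence near 1 for that class, while an embedding that sits between class distributions receives a flatter score, which is the signal used to flag likely misclassifications.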

88

FARF-Net: Frequency-guided Adaptive Receptive Field Network for Edge-enhanced Polyp Segmentation

Xue Li ⋅ Aiwen Jiang ⋅ Hongqian Yu ⋅ Xiao Yang

Accurate segmentation of colorectal polyps plays a vital role in the early diagnosis and prevention of colorectal cancer (CRC). Despite notable progress, existing methods struggle with limited region adaptability due to fixed receptive fields, lack explicit boundary modeling, and are prone to interference from background noise, leading to suboptimal segmentation results. To address these issues, we propose FARF-Net, a novel edge-aware segmentation framework that leverages frequency-domain adaptive receptive fields. Built upon the Pyramid Vision Transformer v2 (PVTv2) backbone, FARF-Net introduces three tailored components. The EdgeKAN module applies Kolmogorov–Arnold Networks (KANs) for channel-wise nonlinear modeling, enhancing local edge semantics and boundary detail representation. The Adaptive Receptive Field (ARF) module adjusts spatial receptive fields based on localized frequency energy, boosting sensitivity to high-frequency boundaries. Additionally, the Frequency-Guided Dual-Supervision (FGDS) decoder integrates high-frequency structural features and boundary priors to refine edge predictions and suppress irrelevant high-frequency background noise. Extensive experiments on five public polyp segmentation benchmarks demonstrate that FARF-Net consistently surpasses state-of-the-art methods. Notably, it achieves superior boundary reconstruction and robustness in challenging cases such as blurred contours and small polyps.

89

Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery

Minh Tran ⋅ Maksim Siniukov ⋅ Zhangyu Jin ⋅ Mohammad Soleymani

Facial expression analysis is central to understanding human behavior, yet existing coding systems such as the Facial Action Coding System (FACS) are constrained by limited coverage and costly manual annotation. In this work, we introduce Discrete Facial Encoding (DFE), an unsupervised, data-driven alternative: a compact and interpretable dictionary of facial expressions learned from 3D mesh sequences through a Residual Vector Quantized Variational Autoencoder (RVQ-VAE). Our approach first extracts identity-invariant expression features from images using a 3D Morphable Model (3DMM), effectively disentangling factors such as head pose and facial geometry. We then encode these features using an RVQ-VAE, producing a sequence of discrete tokens from a shared codebook, where each token captures a specific, reusable facial deformation pattern that contributes to the overall expression. Through extensive experiments, we demonstrate that Discrete Facial Encoding captures more precise facial behaviors than FACS and other facial encoding alternatives. We evaluate the utility of our representation across three high-level psychological tasks: stress detection, personality prediction, and depression detection. Using a simple Bag-of-Words model built on top of the learned tokens, our system consistently outperforms both FACS-based pipelines and strong image and video representation learning models such as Masked Autoencoders. Further analysis reveals that our representation covers a wider variety of facial displays, highlighting its potential as a scalable and effective alternative to FACS for psychological and affective computing applications.

90

Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding

Jun Li ⋅ Che Liu ⋅ Wenjia Bai ⋅ Mingxuan Liu ⋅ Rossella Arcucci ⋅ Cosmin Bercea ⋅ Julia Schnabel

In this work, we address the problem of grounding abnormalities in medical images, where the goal is to localize clinical findings based on textual descriptions. While generalist Vision-Language Models (VLMs) excel in natural grounding tasks, they often struggle in the medical domain due to rare, compositional, and domain-specific terms that are poorly aligned with visual patterns. Specialized medical VLMs address this challenge via large-scale domain pretraining, but at the cost of substantial annotation and computational resources. To overcome these limitations, we propose Knowledge to Sight (K2Sight), a framework that introduces structured semantic supervision by decomposing clinical concepts into interpretable visual attributes, such as shape, density, and anatomical location. These attributes are distilled from domain ontologies and encoded into concise instruction-style prompts, which guide region-text alignment during training. Unlike conventional report-level supervision, our approach explicitly bridges domain knowledge and spatial structure, enabling data-efficient training of compact models. We train compact models with 0.23B and 2B parameters using only 1.5% of the data required by state-of-the-art medical VLMs. Despite their small size and limited training data, these models achieve performance on par with or better than 7B+ medical VLMs, with up to 9.82% improvement in $mAP_{50}$. Code and models: https://huggingface.co/spaces/Anonymous-AC/Demo.

91

DenseBEV: Transforming BEV Grid Cells into 3D Objects

Marius Dähling ⋅ Sebastian Krebs ⋅ J. Zöllner

In current research, Bird’s-eye-view (BEV)-based transformers are increasingly utilized for multi-camera 3D object detection. Traditional models often employ random queries as anchors, optimizing them successively. Recent advancements complement or replace these random queries with detections from auxiliary networks. We propose a more intuitive and efficient approach by using BEV feature cells directly as anchors. This end-to-end approach leverages the dense grid of BEV queries, considering each cell as a potential object for the final detection task. As a result, we introduce a novel two-stage anchor generation method specifically designed for multi-camera 3D object detection. To address the scaling issues of attention with a large number of queries, we apply BEV-based Non-Maximum Suppression (NMS), allowing gradients to flow only through non-suppressed objects. This ensures efficient training without the need for post-processing. By using BEV features from encoders such as BEVFormer directly as object queries, temporal BEV information is inherently embedded. Building on the temporal BEV information already embedded in our object queries, we introduce a hybrid temporal modeling approach by integrating prior detections to further enhance detection performance. Evaluating our method on the nuScenes dataset shows consistent and significant improvements in NDS and mAP over the baseline, even with sparser BEV grids and therefore fewer initial anchors. It is particularly effective for small objects, enhancing pedestrian detection with a 3.8% mAP increase on nuScenes and an 8% increase in LET-mAP on Waymo. Applying our method, named DenseBEV, to the challenging Waymo Open dataset yields state-of-the-art performance, achieving a LET-mAP of 60.7%, surpassing the previous best by 5.4%. The code will be released upon acceptance.

92

Human knowledge integrated multi-modal learning for single source domain generalization

Ayan Banerjee ⋅ Kuntal Thakur ⋅ Sandeep Gupta

Generalized performance of image classification across datasets from different domains has been elusive in applications of critical importance such as fundus-image-based grading of diabetic retinopathy (DR). Theoretically, if data from two domains differ in unknown causal factors, it is difficult to achieve generalized performance across the two domains. Traditionally, there has been no methodology to evaluate whether domains differ in causal factors without access to the data collection sources, which is often not feasible. This paper first proposes a novel theoretical framework of domain conformal bounds (DCB) to evaluate whether two domains differ in unknown causal factors. It then proposes GenEval, a multi-modal vision-language model (VLM) based technique that integrates foundational image classification models such as MedGemma-4B with human knowledge about specific classes through parameter-efficient LoRA adaptation, bridging the causal gap between domains and achieving better single source domain generalization (SDG) performance than the state of the art. Comprehensive SDG evaluation across four major DR datasets (APTOS, EyePACS, Messidor, Messidor-2) demonstrates GenEval's superiority: it achieves 76.0% average accuracy, surpassing the strongest baseline by 10.5% in the DR application under the SDG setting.

93

GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction

Patrick Kwon ⋅ Chen Chen ⋅ Hanbyul Joo

Recent generative models can synthesize high-quality images but often fail to generate humans interacting with objects using their hands. This arises mostly from the model's misunderstanding of such interactions and the difficulty of synthesizing intricate regions of the body. In this paper, we propose GraspDiffusion, a novel generative method that creates realistic scenes of human-object interaction. Given a 3D object mesh, GraspDiffusion first constructs life-like whole-body poses with control over the object's location relative to the human body. This is achieved by separately leveraging the generative priors for 3D body and hand poses, optimizing them into a joint grasping pose. The resulting pose guides the image synthesis to correctly reflect the intended interaction, allowing the creation of realistic and diverse human-object interaction scenes. We demonstrate that GraspDiffusion can successfully tackle the relatively uninvestigated problem of generating full-body human-object interactions while outperforming previous methods.

94

ScoliGaitX: A Deep Multi-Modal Fusion Network for Scoliosis Assessment via Gait Video Analysis

Kaushik Vishwakarma ⋅ Aditya Nigam

Scoliosis presents significant diagnostic challenges, especially in its early stages due to structural deformities of the spine. The most common type, Adolescent Idiopathic Scoliosis (AIS), typically appears during rapid growth periods between ages 10 and 15, accounting for about 80–85% of all scoliosis cases. Currently, diagnosis mainly involves repeated X-rays and clinical evaluations, which are costly and expose patients to frequent radiation. To address these challenges, we propose ScoliGaitX, a novel non-invasive system for assessing scoliosis through gait video analysis. Our approach uses a multi-stream deep learning model trained specifically for scoliosis classification using the Scoliosis1K dataset. We integrate three different gait modalities—(i) silhouettes to capture appearance, (ii) optical flow for motion analysis, and (iii) GEI-subtracted sequences to highlight deviations from normal gait—to effectively identify gait abnormalities associated with scoliosis. A key component of our proposal is the Align Gate Fuse (AGF) module, designed to efficiently learn relationships between different modalities. It achieves this by adaptively assigning importance to each modality through a lightweight global gating mechanism. Our experimental results demonstrate that ScoliGaitX significantly outperforms existing methods, achieving an accuracy of 89.05%, which is 7.05% higher than the previous best approach (ScolNet-MT), while maintaining excellent specificity. This highlights the promise of our method for providing early, radiation-free scoliosis assessment.

95

Non‑Contact Blood Pressure Estimation from Face Videos via Physiology‑Aware Contrastive Learning

JaeHyuk Son ⋅ Young-Seok Choi

Remote photoplethysmography (rPPG) has emerged as a promising foundation for camera-based blood pressure (BP) monitoring, but practical deployment remains limited by strong domain gaps across datasets, scarce and imbalanced labels, and the difficulty of preserving waveform morphology. We present a dual-branch framework that combines raw rPPG segments with handcrafted waveform features and introduces an augmentation-free contrastive pre-training strategy. The approach learns subject-invariant, domain-agnostic embeddings from unlabeled facial videos and aligns them with physiology-inspired descriptors in a shared latent space, while a distribution-aware loss reduces label imbalance. This design integrates the strengths of data-driven representations and handcrafted physiological cues, producing morphology-sensitive features that generalize across acquisition domains. Experiments on multiple datasets demonstrate that the proposed method achieves competitive accuracy under both controlled and in-the-wild conditions, improves cross-dataset transfer, and maintains robust performance when labeled data are limited. Beyond accuracy, the framework emphasizes interpretability by grounding learned embeddings in physiological features, and its reliance on unlabeled videos makes it highly scalable to larger populations without requiring extensive manual annotation. Taken together, these results suggest that bridging representation learning with physiological modeling offers a practical and scalable path toward reliable, non-contact BP monitoring in diverse real-world environments.

96

Ordinal-Aware Multimodal Engagement Recognition for Collaborative Learning

Nha Tran ⋅ Dat Ly ⋅ Phi Ta ⋅ Hung Nguyen ⋅ Hien Nguyen

Assessing student engagement is critical for collaborative learning but remains a challenging task. Existing approaches often rely on controlled laboratory or online settings, which fail to capture the complexity of real-world classrooms. Furthermore, current datasets are scarce and rarely provide both individual- and group-level annotations, limiting the development of robust and generalizable models. To address these gaps, we propose CORE-Net, a multimodal architecture that integrates context modeling to capture group-level dynamics and ordinal supervision to account for the ordinal nature of engagement levels. We also present COLER, a large-scale dataset collected in authentic classroom environments with rich annotations at multiple levels. Experiments demonstrate that CORE-Net achieves 89.63% accuracy and 94.80 QWK, significantly outperforming strong baselines such as BlockGCN and MoViNet. Ablation studies further highlight the critical role of both context modeling and ordinal supervision. Our work establishes a robust and scalable foundation for automated engagement assessment, supporting timely feedback and enhancing the effectiveness of collaborative learning.
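
QWK here presumably denotes quadratic weighted kappa, an agreement metric for ordinal labels that penalizes a prediction by the squared distance between predicted and true levels. The sketch below shows the standard formula; it is not taken from the paper. The function name and the toy labels are hypothetical, and the metric naturally lies in [-1, 1], so a reported value such as 94.80 would be on a scale multiplied by 100.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_levels):
    """Quadratic weighted kappa for integer labels in [0, num_levels)."""
    n = len(y_true)
    # Observed confusion matrix: rows are true levels, columns are predictions.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row = [sum(observed[i]) for i in range(num_levels)]
    col = [sum(observed[i][j] for i in range(num_levels))
           for j in range(num_levels)]

    disagreement = 0.0  # weighted observed disagreement
    expected = 0.0      # weighted disagreement expected by chance
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            disagreement += weight * observed[i][j]
            expected += weight * row[i] * col[j] / n
    if expected == 0.0:
        # Degenerate case: every label falls in a single level.
        return 1.0
    return 1.0 - disagreement / expected

# Toy example with three ordinal engagement levels (hypothetical data).
truth = [0, 1, 2, 2]
preds = [0, 2, 2, 1]
kappa = quadratic_weighted_kappa(truth, preds, num_levels=3)
```

Because the weights grow quadratically, confusing adjacent levels costs far less than confusing the lowest and highest levels, which is why the metric suits ordinal targets.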

97

DynaGSLAM: Real-Time Gaussian-Splatting SLAM for Online Rendering, Tracking, Motion Predictions of Moving Objects in Dynamic Scenes

Runfa Li ⋅ Mahdi Shaghaghi ⋅ Keito Suzuki ⋅ Xinshuang Liu ⋅ Varun Moparthi ⋅ Bang Du ⋅ Walker Curtis ⋅ Martin Renschler ⋅ Ki Myung Brian Lee ⋅ Nikolay Atanasov ⋅ Truong Nguyen

Simultaneous Localization and Mapping (SLAM) is one of the most important environment-perception and navigation algorithms for computer vision, robotics, and autonomous cars/drones. Hence, high-quality and fast mapping becomes a fundamental problem. With the advent of 3D Gaussian Splatting (3DGS) as an explicit representation with excellent rendering quality and speed, state-of-the-art (SOTA) works introduce GS to SLAM. Compared to classical point-cloud SLAM, GS-SLAM generates photometric information by learning from input camera views and synthesizes unseen views with high-quality textures. However, these GS-SLAM methods fail when moving objects occupy the scene, violating the static assumption of bundle adjustment. The failed updates of moving GS affect the static GS and contaminate the full map over long frames. Although concurrent works have made some efforts to consider moving objects for GS-SLAM, they simply detect and remove the moving regions from GS rendering ("anti" dynamic GS-SLAM), so only the static background benefits from GS. To this end, we propose the first real-time GS-SLAM, "DynaGSLAM", that achieves high-quality online GS rendering, tracking, and motion prediction of moving objects in dynamic scenes while jointly estimating accurate ego motion. Our DynaGSLAM outperforms SOTA static and "anti" dynamic GS-SLAM on three dynamic real datasets, while maintaining speed and memory efficiency in practice.

98

Test-Time Adaptation through Semantically-guided Feature Decomposition for Few-shot Chest X-ray Diagnosis

Jayant Mahawar ⋅ Angshuman Paul

Training a deep neural network with a small amount of labeled data is challenging. The challenge is even more severe for medical images because of the many possible variations in the images. We propose a novel framework for few-shot chest X-ray (CXR) diagnosis. For classification problems, training with limited data may be facilitated if class-specific features can be extracted and utilized. Semantic information about the abnormalities may also be helpful in this context. To that end, we design an autoencoder-based approach that extracts visual features and decomposes them into class-agnostic and class-specific features utilizing the semantic information of the abnormalities. The decomposition helps in efficient classification using the class-specific features. Additionally, we perform test-time adaptation to deal with possible variations in the test data compared to the training data. From this perspective, our method is one of the first of its kind. Extensive evaluations on publicly available chest X-ray datasets under few-shot settings show the effectiveness of our method, with a 3–5% improvement in AUROC scores.

99

FlowMorph: Revealing an Optimizable Flow Latent Space for Controlled Image Morphing

Yan Zheng ⋅ Yi Yang ⋅ Lanqing Guo ⋅ Zhangyang "Atlas" Wang

We present FlowMorph, a simple and training-free framework for geometry-preserving and semantics-aware image interpolation. The key idea is to separate two factors inside the flow model’s latent space: an offset that captures shape and geometry, and a one-step vector that carries semantic meaning. By keeping the flow model frozen and only optimizing these two variables, FlowMorph exposes a stable and interpretable neighborhood around each image. This leads to two complementary modes. Flow-Optimizer directly fits a source image toward a target image and naturally supports multi-objective combinations, producing stable reconstructions. Flow-Interpolation mixes the offset linearly and the semantic vector spherically, generating smooth and coherent transitions between images. Across a wide range of tasks including object morphing, pose changes, and scene transitions, FlowMorph outperforms prior interpolation-based methods. Quantitative experiments show that our method achieves lower perceptual error, better image fidelity, and smoother transitions. Landmark-based analysis further confirms that FlowMorph preserves geometry more effectively. We also ablate the effect of the backward step size, showing that longer steps increase semantic expressiveness and allow interpolations that move beyond trivial shape blending, enabling richer morphs across object positions and photo layouts. FlowMorph provides an interpretable and controllable tool for high-quality image morphing without the need for additional training.
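
The Flow-Interpolation rule described above (linear mixing of the offset, spherical mixing of the semantic vector) can be sketched as follows. This is only an illustration under assumptions: both latents are treated as plain lists of floats, the flow model itself is not included, and the names `lerp`, `slerp`, and `interpolate_latents` are hypothetical.

```python
import math

def lerp(a, b, t):
    """Linear interpolation between two vectors."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def slerp(a, b, t):
    """Spherical linear interpolation between two nonzero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_angle = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    angle = math.acos(cos_angle)
    sin_angle = math.sin(angle)
    if abs(sin_angle) < 1e-6:
        # Nearly parallel (or antiparallel) vectors: fall back to lerp.
        return lerp(a, b, t)
    weight_a = math.sin((1.0 - t) * angle) / sin_angle
    weight_b = math.sin(t * angle) / sin_angle
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

def interpolate_latents(offset_a, offset_b, semantic_a, semantic_b, t):
    """Mix offsets linearly and semantic vectors spherically at time t."""
    return lerp(offset_a, offset_b, t), slerp(semantic_a, semantic_b, t)

# Toy 2-D latents (hypothetical values).
offset_mid, semantic_mid = interpolate_latents(
    [0.0, 2.0], [4.0, 0.0], [1.0, 0.0], [0.0, 1.0], t=0.5
)
```

Spherical mixing keeps the norm of the interpolated vector between those of the endpoints, whereas linear mixing of two unit vectors would shrink it toward the midpoint.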

100

PVeRA: Probabilistic Vector-Based Random Matrix Adaptation

Leo Fillioux ⋅ Enzo Ferrante ⋅ Paul-Henry Cournède ⋅ Maria Vakalopoulou ⋅ Stergios Christodoulidis

Large foundation models have emerged in recent years and are pushing performance boundaries for a variety of tasks. Training or even finetuning such models demands vast datasets and computational resources, which are often scarce and costly. Adaptation methods provide a computationally efficient solution to address these limitations by allowing such models to be finetuned with small amounts of data and computing power. This is achieved by appending new trainable modules to frozen backbones, with only a fraction of the trainable parameters, and fitting only these modules on novel tasks. Recently, the VeRA adapter was shown to excel in making parameter-efficient adaptations by utilizing a pair of frozen random low-rank matrices shared across all layers. In this paper, we propose PVeRA, a probabilistic version of the VeRA adapter, which modifies the low-rank matrices of VeRA in a probabilistic manner. Such modification naturally allows handling inherent ambiguities in the input and allows for different sampling configurations during training and testing.

101

Harnessing Object Grounding for Time-Sensitive Video Understanding

Tz-Ying Wu ⋅ Sharath Nittur Sridhar ⋅ Subarna Tripathi

We propose to improve the time-sensitive video understanding (TSV) capability of video large language models (Video-LLMs) with grounded objects (GO). We hypothesize that TSV tasks can benefit from GO within frames, which is supported by our preliminary experiments on LITA, a state-of-the-art Video-LLM for reasoning temporal localization. While augmenting prompts with textual descriptions of these object annotations improves the performance of LITA, it also introduces extra token length and susceptibility to the noise in object-level information. To address this, we propose GO-Tokenizer, a lightweight add-on module for Video-LLMs leveraging off-the-shelf object detectors to encode compact object information on the fly. Experimental results demonstrate that pretraining with GO-Tokenizer outperforms the vanilla Video-LLM and its counterpart that utilizes textual descriptions of objects in the prompt. The gain generalizes across different models, datasets, and video understanding tasks such as reasoning, temporal localization, and dense captioning. The review-time code is available at https://anonymous.4open.science/r/GO-Video-DB16/README.md.

102

TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning

Maximilian von Klinski ⋅ Maximilian Schall

Traditional vision-language models struggle with fine-grained taxonomic reasoning, particularly distinguishing between visually similar species within the same genus or family. We propose a reinforcement learning approach using Group Relative Policy Optimization with intermediate rewards that decompose the reasoning process into hierarchical taxonomic predictions. Our method incentivizes models to explicitly reason about species-level, genus-level, and family-level features before making final classifications. This structured approach is designed not only to boost accuracy but also to yield a transparent, verifiable decision-making process. On the challenging Birds-to-Words dataset, our approach achieves 91.7% accuracy on same-species verification, exceeding human performance (77.3%) while generating interpretable reasoning traces. We demonstrate cross-domain generalization, showing substantial gains on primate verification while generating explainable traces. The intermediate reward mechanism shows that structured biological reasoning provides a powerful framework for fine-grained visual discrimination.

103

Revisiting Retentive Networks for Fast Range-View 3D LiDAR Semantic Segmentation

Simone Mosco ⋅ Daniel Fusaro ⋅ Wanmeng Li ⋅ Alberto Pretto

LiDAR semantic segmentation is a crucial task in autonomous driving and robotics, where real-time performance is essential for online decision-making. Recent trends exploit range images and Vision Transformers, using the self-attention mechanism. However, these approaches often lack explicit spatial priors and involve a large number of parameters. To tackle these limitations, we propose a novel method that adapts the Retentive Network architecture from the Natural Language Processing (NLP) field, chosen for its efficient sequence modeling capabilities, to operate directly on the range-view representation. Our approach incorporates a circular retention (CiR) mechanism that explicitly captures spatial relationships and the continual circular property of the range image while modeling long-range dependencies and preserving the receptive field. In addition, we introduce a new set of range-view augmentations, adapted from 3D techniques, to improve generalization and mitigate class imbalance. Extensive experiments on three large-scale datasets, namely SemanticKITTI, PandaSet, and SemanticPOSS, demonstrate that our method achieves state-of-the-art performance among range-view approaches on two out of three datasets, while satisfying real-time constraints. The code of our method is available at [REMOVED DUE TO ANONYMOUS SUBMISSION].

104

Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention

Junhao Xing ⋅ Ryohei Miyakawa ⋅ Yang Yang ⋅ Xinpeng Liu ⋅ Risa Shinoda ⋅ Hiroaki Santo ⋅ Yosuke Toda ⋅ Fumio Okura

Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals, each consisting of multiple overlapping leaves, remains challenging. Existing methods for this problem, referred to as a hierarchical segmentation task, typically require annotated training datasets, which are often species-specific and require considerable human labor. To address this problem, we introduce ZeroPlantSeg, a zero-shot segmentation method for rosette-shaped plant individuals from top-view crop images. Our method integrates a foundation segmentation model, which extracts leaf instances, and a vision-language model (VLM), which reasons about plants' structures to extract plant instances without additional training. Evaluations on real-world datasets with multiple plant species (i.e., sugar beets and cauliflowers), growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance compared to supervised methods. Our implementations will be publicly available upon acceptance.

105

Moiré Zero: An Efficient and High-Performance Neural Architecture for Moiré Removal

Seungryong Lee ⋅ Woojeong Baek ⋅ Younghyun Kim ⋅ Eunwoo Kim ⋅ Haru Moon ⋅ Donggon Yoo ⋅ Eunbyung Park

Moiré patterns, caused by frequency aliasing between fine repetitive structures and a camera sensor's sampling process, have been a significant obstacle in various real-world applications, such as consumer photography and industrial defect inspection. With the advancements in deep learning algorithms, numerous studies, predominantly based on convolutional neural networks, have suggested various solutions to address this issue. Despite these efforts, existing approaches still struggle to effectively eliminate artifacts due to the diverse scales, orientations, and color shifts of moiré patterns, primarily because the constrained receptive field of CNN-based architectures limits their ability to capture the complex characteristics of moiré patterns. In this paper, we propose MZNet, a U-shaped network designed to bring images closer to a 'Moiré-Zero' state by effectively removing moiré patterns. It integrates three specialized components: Multi-Scale Dual Attention Block (MSDAB) for extracting and refining multi-scale features, Multi-Shape Large Kernel Convolution Block (MSLKB) for capturing diverse moiré structures, and Feature Fusion-Based Skip Connection for enhancing information flow. Together, these components enhance local texture restoration and large-scale artifact suppression. Experiments on benchmark datasets demonstrate that MZNet achieves state-of-the-art performance on high-resolution datasets and delivers competitive results on lower-resolution data, while maintaining a low computational cost, suggesting that it is an efficient and practical solution for real-world applications.

106

LENVIZ: A High-Resolution Low-Exposure Night Vision Benchmark Dataset

Manjushree Aithal ⋅ Rosaura VidalMata ⋅ Manikandtan Kartha ⋅ Gong Chen ⋅ Eashan Adhikarla ⋅ Lucas Kirsten ⋅ Zhicheng Fu ⋅ Nikhil Madhusudhana ⋅ Joseph Nasti

Low-light image enhancement is crucial for a myriad of applications, from night vision and surveillance to autonomous driving. However, due to the inherent limitations that come hand in hand with capturing images in low-illumination environments, the task of enhancing such scenes still presents a formidable challenge. To advance research in this field, we introduce our Low Exposure Night Vision (LENVIZ) Dataset, a comprehensive multi-exposure benchmark dataset for low-light image enhancement comprising over 230K frames showcasing 24K real-world indoor and outdoor scenes, with and without humans. Captured using 3 different camera sensors, LENVIZ offers a wide range of lighting conditions, noise levels, and scene complexities, making it the largest publicly available benchmark in the field, with up to 4K resolution. LENVIZ includes high-quality human-generated ground truth, for which each multi-exposure low-light scene has been meticulously curated and edited by expert photographers to ensure optimal image quality. Furthermore, we conduct a comprehensive analysis of current state-of-the-art low-light image enhancement techniques on our dataset and highlight potential areas of improvement.

107

RobustGait: Robustness Analysis for Appearance Based Gait Recognition

Reeshoon Sayera ⋅ Akash Kumar ⋅ Sirshapan Mitra ⋅ Prudvi Kamtam ⋅ Yogesh Rawat

Appearance-based gait recognition has achieved strong performance on controlled datasets, yet systematic evaluation of its robustness to real-world corruptions and silhouette variability remains lacking. We present RobustGait, a framework for fine-grained robustness evaluation of appearance-based gait recognition systems. RobustGait evaluation spans four dimensions: the type of perturbation (digital, environmental, temporal, occlusion), the silhouette extraction method (segmentation and parsing networks), the architectural capacities of gait recognition models, and various deployment scenarios. The benchmark introduces 15 corruption types at 5 severity levels across CASIA-B, CCPG, and SUSTech1K, with in-the-wild validation on MEVID, and evaluates six state-of-the-art gait systems. Our analysis yields several insights. First, applying noise at the RGB level better reflects real-world degradation and reveals how distortions propagate through silhouette extraction to the downstream gait recognition systems. Second, gait accuracy is highly sensitive to silhouette extractor biases, revealing an overlooked source of benchmark bias. Third, robustness depends on both the type of perturbation and the architectural design. Finally, we explore robustness-enhancing strategies, showing that noise-aware training and knowledge distillation improve performance and move toward deployment-ready systems.

108

A-V Representation Learning via Audio Shift Prediction for Multimodal Deepfake Detection and Temporal Localization

Ashutosh Anshul ⋅ Eng Chng ⋅ Deepu Rajan

Recent multimodal deepfake detection methods typically rely on single-stage training, which can cause the model to focus on dataset-specific multimodal cues while missing important features that are helpful to detect unseen manipulations, thereby limiting generalization. While some approaches attempt to address this using self-supervised audio-visual pretraining, they may not fully exploit cross-modal temporal information. Also, they often assume that manipulations affect the entire video, ignoring more realistic cases where only short segments are altered. To overcome these limitations, we propose a two-stage training framework that first learns audio-visual temporal alignment in real videos and then uses this information to detect and localize potential deepfakes by identifying temporal inconsistencies. We propose a self-supervised shift-prediction pretraining objective to fully understand cross-modal temporal alignment across multiple temporal shifts applied to the audio input. The pretrained features enable the model to identify manipulations across entire videos as well as accurately localize deepfake segments in partially tampered content. Moreover, the pretrained components do not require task-specific fine-tuning, improving the model’s flexibility for both classification and localization. Experiments on benchmark datasets demonstrate strong within-dataset performance, robust generalization to new manipulations and datasets, and accurate temporal localization.

109

CraftSVG: Multi-Object Text-to-SVG Synthesis via Layout Guided Diffusion

Ayan Banerjee ⋅ Nityanand Mathur ⋅ Josep Llados ⋅ Umapada Pal ⋅ Anjan Dutta

Generating SVGs from text prompts is a challenging vision task, requiring diverse yet realistic depictions of seen as well as unseen entities. However, existing research has been mostly limited to generating single objects rather than comprehensive scenes comprising multiple elements. In response, CraftSVG introduces an end-to-end framework for creating SVGs depicting entire scenes from textual descriptions. Utilizing a pre-trained LLM for layout generation from text prompts via iterative in-context learning, CraftSVG introduces a technique for producing masked latents in specified bounding boxes for accurate object placement. It introduces a fusion mechanism for integrating attention maps and employs a diffusion U-Net for coherent composition, speeding up the stroke initialization. Recognizing the importance of abstract SVGs in communication, we incorporate an MLP-based mechanism to simplify the resulting SVGs, with alignment and perceptual losses, differentiable rendering, and opacity modulation to maximize the similarity. CraftSVG outperforms previous methods in abstraction, recognizability, and detail, as demonstrated by its performance metrics: CLIP-T: 0.5013, Cosine Similarity: 0.7091, and Aesthetic: 7.0779, among others.

110

Shift-Equivariant Complex-Valued Convolutional Neural Networks

Quentin Gabot ⋅ Teck-Yian Lim ⋅ Jeremy Fix ⋅ Joana Frontera-Pons ⋅ Chengfang Ren ⋅ Jean-Philippe Ovarlez

Convolutional neural networks have shown remarkable performance in recent years on various computer vision problems. However, the traditional convolutional neural network architecture lacks a critical property: shift equivariance and invariance, broken by downsampling and upsampling operations. Although data augmentation techniques can help the model learn the latter property empirically, a consistent and systemic way to achieve this goal is by designing downsampling and upsampling layers that theoretically guarantee these properties by construction. Adaptive Polyphase Sampling (APS) introduced the cornerstone for shift invariance, later extended to shift equivariance with Learnable Polyphase up/downsampling (LPS) applied to real-valued neural networks. In this paper, we extend the work on LPS to complex-valued neural networks both from a theoretical perspective and with a novel building block of a projection layer from $\mathbb{C}$ to $\mathbb{R}$ before the Gumbel Softmax. We finally evaluate this extension on several computer vision problems for either the invariance property with classification tasks or the equivariance with both reconstruction and semantic segmentation problems using polarimetric synthetic aperture images.
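To make the polyphase idea concrete, here is a minimal 1D sketch of adaptive polyphase downsampling on a complex-valued signal. It is not the learnable (LPS) layer of the paper: the selection rule here is the fixed "largest energy" criterion, and the squared modulus serves as the projection from $\mathbb{C}$ to $\mathbb{R}$; both choices, along with the helper names `aps_downsample` and `roll`, are assumptions for illustration.

```python
def aps_downsample(x):
    """Stride-2 downsampling that keeps the polyphase component with the
    larger energy, instead of always keeping the even-indexed samples."""
    even, odd = x[0::2], x[1::2]

    def energy(component):
        # Squared modulus maps complex samples to real numbers.
        return sum(abs(v) ** 2 for v in component)

    return even if energy(even) >= energy(odd) else odd


def roll(x, k):
    """Circularly shift a list to the right by k positions."""
    k %= len(x)
    return x[-k:] + x[:-k] if k else list(x)


if __name__ == "__main__":
    x = [1 + 2j, 0.5j, -3 + 1j, 0.1, 2 - 2j, -0.2j, 4, 0.3 + 0.3j]
    print(aps_downsample(x))
    print(aps_downsample(roll(x, 1)))  # same samples despite the input shift
```

With plain strided downsampling (`x[0::2]`), shifting the input by one sample would return the other polyphase component entirely; with the adaptive choice, the output changes by at most a circular shift.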

111

ObjectMeshDeform: Towards recovering precise 3D geometry of real objects via image-guided mesh deformation of 3D generative priors

Siddharth Katageri ⋅ SANJANA SINHA ⋅ Sourav Ghosh ⋅ Soumyadip Maity ⋅ Brojeshwar Bhowmick

3D generative models that synthesize high-fidelity 3D assets from single-view or multi-view images cannot recover the precise geometry and real-world object measurements needed for practical applications. On the other hand, multi-view 3D reconstruction methods based on structure from motion, implicit surfaces, or Gaussian splatting fail to recover high-fidelity object meshes with the dense geometry, shape regularity, and smooth surfaces present in real-world objects. In this paper, we propose a novel approach that leverages a 3D mesh prior synthesized by generative models pre-trained on large-scale 3D synthetic datasets. Our method refines the initial mesh geometry without using any additional training data, improving the accuracy of 3D geometry via multi-view consistency without degrading the mesh surface quality. Our method can automatically reconstruct meshes from images of objects in real-world scenes, without requiring any additional large-scale training data or manual inputs.

112

Memory-Augmented Representation for Efficient Event-based Visuomotor Policy Learning with Adaptive Perception and Control

Uday Kamal ⋅ Saibal Mukhopadhyay

Event-based cameras are well-suited for fast and agile autonomous navigation due to their ultra-fast, microsecond-level temporal resolution. However, fully leveraging this potential requires highly efficient processing algorithms capable of asynchronous, event-by-event representations and policy updates. Current methods employ synchronous dense representations or process events in fixed-rate time windows, leading to inefficiencies via redundant computation. We address this by proposing an end-to-end framework for event-to-control policy learning designed for reactive navigation tasks. Our method consists of a memory-augmented perception module that updates the representation asynchronously and adaptively selects the number of events to process. Using the memory representation, a lightweight policy module is jointly optimized with the perception module, and learns to predict control commands at rates that dynamically adjust to scene complexity in an event-based reinforcement learning setting. Evaluations on simulated drone navigation tasks demonstrate higher sample efficiency and robustness compared to dense frame-based methods. Moreover, our approach significantly reduces computational complexity by minimizing processing steps and event counts while maintaining competitive performance against state-of-the-art event-based methods.

113

PADM: A Physics-aware Diffusion Model for Attenuation Correction

Trung Pham ⋅ Hoang Vu ⋅ Anh Chu ⋅ Dac Thai Nguyen ⋅ Trung Thanh Nguyen ⋅ THAO TRUONG TRUONG ⋅ Mai Son ⋅ Thanh Nguyen ⋅ Phi Le Nguyen

Attenuation artifacts remain a significant challenge in cardiac Myocardial Perfusion Imaging (MPI) using Single-Photon Emission Computed Tomography (SPECT), often compromising diagnostic accuracy and reducing clinical interpretability. While hybrid SPECT/CT systems mitigate these artifacts through CT-derived attenuation maps, their high cost, limited accessibility, and added radiation exposure hinder widespread clinical adoption. In this study, we propose a novel CT-free solution to attenuation correction in cardiac SPECT. Specifically, we introduce Physics-aware Attenuation Correction Diffusion Model (PADM), a diffusion-based generative method that incorporates explicit physics priors via a teacher-student distillation mechanism. This approach enables attenuation artifact correction using only Non-Attenuation-Corrected (NAC) input, while still benefiting from physics-informed supervision during training. To support this work, we also present CardiAC, a comprehensive dataset comprising 424 patient studies with paired NAC and Attenuation-Corrected (AC) reconstructions, alongside high-resolution CT-based attenuation maps. Extensive experiments demonstrate that PADM outperforms state-of-the-art generative models, delivering superior reconstruction fidelity across both quantitative metrics and visual assessment. Both the CardiAC dataset and the PADM codebase are publicly released to advance research in physics-informed generative modeling for cardiac imaging.

114

NerVast: Compression-Efficient Scaling of Implicit Neural Video Representations via Scene-based Parameter-sharing

Yunheon Lee ⋅ Juncheol Ye ⋅ Jaehong Kim ⋅ Dongsu Han

Implicit neural representation (INR) has emerged as a new data representation for compressing videos and now performs on par with conventional codecs. The next quest in the field is to make INR scalable for practical use. Existing works realize this by utilizing small INR models to scale to long and high-resolution video, which achieves better encoding and decoding speeds. However, they fail to fully exploit the temporal nature of video data when encoding it into multiple separate INRs across time, which leads to sub-optimal compression efficiency. In this work, we propose NerVast, a new encoding scheme for video INR that improves compression efficiency while still enjoying the low computation and transfer costs of small INR models. When a video is represented in separate INR segments, NerVast effectively reduces the total volume required for representation by sharing the parameters between models during encoding. Without expensive training, NerVast selects the most significant parameters to share. Then it jointly trains both shared and non-shared parameters in a way that minimizes the quality drop imposed by sharing. While maintaining real-time decoding speed (> 30 fps), NerVast provides better compression (39.9% reduction in parameters on average) compared to the compute-efficient INR models. In other words, NerVast is better in encoding quality (1.57 dB higher in PSNR) with the same bitrate.

115

CineVerse: Consistent Keyframe Synthesis for Cinematic Scene Composition

Quynh Phunh ⋅ Long Mai ⋅ Fabian Caba Heilbron ⋅ Feng Liu ⋅ Jia-Bin Huang ⋅ Cusuh Ham

Multi-shot generation requires preserving the identity of characters and settings across frames. Cinematic scene composition goes beyond standard multi-shot generation, introducing additional challenges such as expressing complex interactions among multiple characters and visual effects to convey creative narratives; existing datasets cannot fully address these challenges. We present CineVerse, a large-scale dataset of diverse movie scenes labeled with shot-level annotations tailored for filmmaking. CineVerse includes refined scene descriptions, shot-type information, and newly extracted shot, character, and setting descriptions. We validate our dataset by developing a baseline framework that first generates a scene plan containing detailed information for the overall scene and each individual shot, then produces a set of coherent keyframes. Our results show significant improvements in controlling and synthesizing cinematic content through the added context provided by CineVerse.

116

MMCM: Multimodality-aware Metric using Clustering-based Modes for Probabilistic Human Motion Prediction

Kyotaro Tokoro ⋅ Hiromu Taketsugu ⋅ Norimichi Ukita

This paper proposes a novel metric for Human Motion Prediction (HMP). Since a single past sequence can lead to multiple possible futures, a probabilistic HMP method predicts such multiple motions. While a single motion predicted by a deterministic method is evaluated only with the difference from its ground truth motion, multiple predicted motions should also be evaluated based on their distribution. For this evaluation, this paper focuses on the following two criteria. (a) Coverage: motions should be distributed among multiple motion modes to cover diverse possibilities. (b) Validity: motions should be kinematically valid as future motions observable from a given past motion. However, existing metrics simply appreciate widely distributed motions even if these motions are observed in a single mode and kinematically invalid. To resolve these disadvantages, this paper proposes a Multimodality-aware Metric using Clustering-based Modes (MMCM). For (a) coverage, MMCM divides a motion space into several clusters, each of which is regarded as a mode. These modes are used to explicitly evaluate whether predicted motions are distributed among multiple modes. For (b) validity, MMCM identifies valid modes by collecting possible future motions from a motion dataset. Our experiments validate that our clustering yields sensible mode definitions and that MMCM accurately scores multimodal predictions. Code: \url{https://anonymous.4open.science/r/MMCM-074E}
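The coverage and validity criteria above can be illustrated with a small sketch. This is not the MMCM formula, which the abstract does not give; it only shows the general idea of scoring predictions by how many valid modes (clusters) they fall into. The names `nearest_mode` and `mode_coverage`, the nearest-center assignment, and the Euclidean distance on low-dimensional toy vectors are all assumptions for this example.

```python
import math


def nearest_mode(motion, centers):
    """Index of the cluster center closest to a motion vector."""
    return min(range(len(centers)), key=lambda i: math.dist(motion, centers[i]))


def mode_coverage(predictions, centers, valid_modes):
    """Fraction of valid modes that contain at least one predicted motion.

    Predictions landing in modes outside `valid_modes` earn no credit, so
    spreading widely over invalid regions does not raise the score.
    """
    valid = set(valid_modes)
    if not valid:
        return 0.0
    hit = {nearest_mode(p, centers) for p in predictions}
    return len(hit & valid) / len(valid)


if __name__ == "__main__":
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]  # toy mode centers
    valid = {0, 1}                                     # modes reachable from the past motion
    print(mode_coverage([(0.5, 0.2), (9.0, 1.0)], centers, valid))
    print(mode_coverage([(1.0, 1.0), (-1.0, -1.0)], centers, valid))
```

In the second call the predictions are spread apart but stay inside a single mode, so the score does not reward that spread.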

117

Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning

Ashutosh Chaubey ⋅ Xulang Guan ⋅ Mohammad Soleymani

The human face plays a central role in social communication, necessitating the use of performant computer vision tools for human-centered applications. We propose Face-LLaVA, a multimodal large language model for face-centered, in-context learning, including facial expression and attribute recognition. Additionally, Face-LLaVA is able to generate natural language descriptions that can be used for reasoning. Leveraging existing visual databases, we first developed FaceInstruct-1M, a face-centered database for instruction tuning MLLMs for face processing. We then developed a novel face-specific visual encoder powered by Face-Region Guided Cross-Attention that integrates face geometry with local visual features. We evaluated the proposed method across nine different datasets and five different face processing tasks, including facial expression recognition, action unit detection, facial attribute detection, age estimation and deepfake detection. Face-LLaVA achieves superior results compared to existing open-source MLLMs and competitive performance compared to commercial solutions. Our model output also receives a higher reasoning rating by GPT under a zero-shot setting across all the tasks. Both our dataset and model will be made public to support future advancements in social AI and foundational vision-language research.

118

ConsensusXAI: A framework to examine class-wise agreement in medical imaging

Abbas Haider ⋅ David Wright ⋅ Ruth Hogg ⋅ Hui Wang ⋅ Tunde Peto ⋅ Richard Gault

Explainable AI (XAI) is essential for trust and transparency in deep learning, especially in medical imaging. Existing local explanation methods provide per-instance insights but fail to show whether similar explanations hold across samples of the same class. This limits global interpretability and demands time-consuming manual review by clinicians to trust models in practice. We introduce the Consensus Alignment Score (CAS), a novel metric that quantifies consistency of explanations at the class level. We also present ConsensusXAI, an open-source, model- and method-agnostic framework that evaluates explanation agreement quantitatively (via CAS) and qualitatively (through consensus heatmaps) per class. Unlike prior benchmarks, ConsensusXAI uses a latent-space clustering approach, Latent Consensus, to identify dominant explanation patterns, exposing biases and inconsistencies towards certain classes. Evaluated across four benchmark datasets and two imaging modalities, our method consistently reveals meaningful class-level insights, outperforming traditional metrics like SSIM and IoU, and enabling faster, more confident clinical adoption of AI models.

119

Real-Time Tracking of Flexible Markers in Low-Contrast Fluoroscopy Using a Deep Neural Network Trained Solely on Synthetic Data

Tomoki Uchiyama ⋅ Yukinobu Sakata ⋅ Ryusuke Hirai ⋅ Hitoshi Ishikawa ⋅ Shinichiro Mori

In radiation therapy, fiducial markers implanted in a patient's body are tracked using X-ray fluoroscopy to estimate tumor positions. However, flexible markers, such as Gold Anchor$^{\textregistered}$ (Naslund Medical AB, Sweden), deform within the body, making conventional template matching challenging. While deep learning offers a promising solution, the extensive collection and annotation of clinical data required for training poses a significant barrier to adoption. To address this, we propose a tracking framework that utilizes a lightweight Siamese CNN trained exclusively on synthetic fluoroscopy images. Our method generates synthetic data simulating diverse marker deformations under low-contrast and high-noise conditions, employs dynamic programming for stable initial detection, and performs real-time tracking with a particle filter. In evaluations using clinical data, our method achieves a tracking accuracy of 0.42 ± 0.12 pixels for prostate cancer cases and 0.97 ± 0.53 pixels for pancreatic cancer cases. This significantly outperforms conventional methods, particularly in challenging low-contrast pancreatic cancer cases. With TensorRT optimization, the framework achieves a processing speed of 3.8 ms/frame. This work presents a practical solution for high-accuracy tracking, reducing data collection costs and facilitating the use of deep learning in clinical applications.

120

OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models

Miyamoto Ryoto ⋅ Xin Fan ⋅ Fuyuko Kido ⋅ Tsuneo Matsumoto ⋅ Hayato Yamana

OpenLVLM-MIA is a new benchmark that highlights fundamental challenges in evaluating membership inference attacks (MIA) against large vision-language models (LVLMs). While prior work has reported high attack success rates, our analysis suggests that these results often arise from detecting distributional bias introduced during dataset construction rather than from identifying true membership status. To address this issue, we introduce a controlled benchmark of 6,000 images where the distributions of member and non-member samples are carefully balanced, and ground-truth membership labels are provided across three distinct training stages: vision encoder, projector, and instruction tuning. Experiments using OpenLVLM-MIA demonstrated that the performance of state-of-the-art MIA methods converged to random chance under unbiased conditions. By offering a transparent and unbiased benchmark, OpenLVLM-MIA clarifies the current limitations of MIA research on LVLMs and provides a solid foundation for developing stronger privacy-preserving techniques.

121

DMAT: An End-to-End Framework for Joint Atmospheric Turbulence Mitigation and Object Detection

Paul Hill ⋅ Zhiming Liu ⋅ Alin Achim ⋅ David Bull ⋅ Nantheera Anantrasirichai

Atmospheric Turbulence (AT) degrades the clarity and accuracy of surveillance imagery, posing challenges not only for visualization quality but also for object classification and scene tracking. Deep learning-based methods have been proposed to improve visual quality, but spatio-temporal distortions remain a significant issue. Although deep learning-based object detection performs well under normal conditions, it struggles to operate effectively on sequences distorted by atmospheric turbulence. In this paper, we propose a novel framework that learns to compensate for distorted features while simultaneously improving visualization and object detection. This end-to-end training strategy leverages and exchanges knowledge of low-level distorted features in the AT mitigator with semantic features extracted in the object detector. Specifically, in the AT mitigator a 3D Mamba-based structure is used to handle the spatio-temporal displacements and blurring caused by turbulence. Optimization is achieved through back-propagation in both the AT mitigator and object detector. Our proposed DMAT outperforms state-of-the-art AT mitigation and object detection systems by up to 15% on datasets corrupted by generated turbulence.

122

Divide and Refine: Enhancing Multimodal Representation and Explainability for Emotion Recognition in Conversation

Tuan Mai ⋅ Cam-Van Thi Nguyen ⋅ Duc-Trong Le

Multimodal emotion recognition in conversation (MERC) requires representations that effectively integrate signals from multiple modalities. These signals include modality-specific cues, information shared across modalities, and interactions that emerge only when modalities are combined. In information-theoretic terms, these correspond to unique, redundant, and synergistic contributions. An ideal representation should leverage all three, yet achieving such balance remains challenging. Recent advances in contrastive learning and augmentation-based methods have made progress, but they often overlook the role of data preparation in preserving these components. In particular, applying augmentations directly to raw inputs or fused embeddings can blur the boundaries between modality-unique and cross-modal signals. To address this challenge, we propose a two-phase framework Divide and Refine (DnR). In the Divide phase, each modality is explicitly decomposed into uniqueness, pairwise redundancy, and synergy. In the Refine phase, tailored objectives enhance the informativeness of these components while maintaining their distinct roles. The refined representations are plug-and-play compatible with diverse multimodal pipelines. Extensive experiments on IEMOCAP and MELD demonstrate consistent improvements across multiple MERC backbones. These results highlight the effectiveness of explicitly dividing, refining, and recombining multimodal representations as a principled strategy for advancing emotion recognition.

123

iMotion-LLM: Instruction-Conditioned Trajectory Generation

Abdulwahab Felemban ⋅ Nussair Hroub ⋅ Jian Ding ⋅ Faizan Khan ⋅ Xiaoqian Shen ⋅ Abduallah Mohamed ⋅ Mohamed Elhoseiny

We introduce iMotion-LLM, a multimodal large language model (LLM) integrated with trajectory prediction modules for interactive motion generation. Unlike conventional approaches, it generates feasible, safety-aligned trajectories based on textual instructions, enabling adaptable and context-aware driving behavior. It combines an encoder-decoder trajectory prediction model with a pre-trained LLM fine-tuned using LoRA, projecting scene features into the LLM input space and mapping special tokens to a multimodal trajectory decoder for text-based interaction and interpretable justification of driving behavior. To support this framework, we introduce two datasets: (1) InstructWaymo, an extension of the Waymo Open Motion Dataset with direction-based motion instructions, and (2) Open-Vocabulary InstructNuPlan, which features safety-aligned instruction-caption pairs and corresponding safe trajectory scenarios. Our experiments validate that instruction conditioning enables trajectory generation that follows the intended condition. iMotion-LLM further demonstrates strong contextual comprehension, achieving 84\% average accuracy in direction feasibility detection and 96\% average accuracy in safety evaluation of open-vocabulary instructions. This work lays the foundation for text-guided motion generation in autonomous driving, supporting simulated data generation, model interpretability, and robust safety alignment testing for trajectory generation models. Code, datasets, and models will be made publicly available.

124

SasMamba: A Lightweight Structure-Aware Stride State Space Model for 3D Human Pose Estimation

Hu Cui ⋅ Wenqiang Hua ⋅ Renjing Huang ⋅ ShuRui Jia ⋅ Tessai Hayama

Recently, the Mamba architecture based on State Space Models (SSMs) has gained attention in 3D human pose estimation due to its linear complexity and strong global modeling capability. However, existing SSM-based methods typically apply manually designed scan operations to flatten detected 2D pose sequences into purely temporal sequences, either locally or globally. This approach disrupts the inherent spatial structure of human poses and entangles spatial and temporal features, making it difficult to capture complex pose dependencies. To address these limitations, we propose the Skeleton Structure-Aware Stride SSM (SAS-SSM), which first employs a structure-aware spatiotemporal convolution to dynamically capture essential local interactions between joints, and then applies a stride-based scan strategy to construct multi-scale global structural representations. This enables flexible modeling of both local and global pose information while maintaining linear computational complexity. Built upon SAS-SSM, our model SasMamba achieves competitive 3D pose estimation performance with significantly fewer parameters compared to existing hybrid models. The source code will be available at \url{https://github.com/}.

125

SUGAR: A Sweeter Spot for Generative Unlearning of Many Identities

Dung Nguyen ⋅ Quang Nguyen ⋅ Preston Robinette ⋅ Eli Jiang ⋅ Taylor Johnson ⋅ Kevin Leach

Recent advances in 3D-aware generative models have enabled high-fidelity image synthesis of human identities. However, this progress raises urgent questions around user consent and the ability to remove specific individuals from a model's output space. We address this by introducing SUGAR, a framework for scalable generative unlearning that enables the removal of many identities (simultaneously or sequentially) without retraining the entire model. Rather than projecting unwanted identities to unrealistic outputs or relying on static template faces, SUGAR learns a personalized surrogate latent for each identity, diverting reconstructions to visually coherent alternatives while preserving the model’s quality and diversity. We further introduce a continual utility preservation objective that guards against degradation as more identities are forgotten. SUGAR achieves state-of-the-art performance in identity removal, with up to 700\% improvement in retention-utility compared to existing baselines. Our code is publicly available at https://anonymous.4open.science/r/SUGAR-GenUnlearn.

126

Mitigating the Modality Gap: Few-Shot Out-of-Distribution Detection with Multi-modal Prototypes and Image Bias Estimation

Yimu Wang ⋅ Evelien Riddell ⋅ Adrian Chow ⋅ Sean Sedwards ⋅ Krzysztof Czarnecki

Existing vision-language model (VLM)-based methods for out-of-distribution (OOD) detection typically rely on similarity scores between input images and in-distribution (ID) text prototypes. However, the modality gap between image and text often results in high false positive rates, as OOD samples can exhibit high similarity to ID text prototypes. To mitigate the impact of this modality gap, we propose incorporating ID image prototypes along with ID text prototypes. We present theoretical and empirical evidence indicating that this approach enhances VLM-based OOD detection performance without any additional training. To further reduce the gap between image and text, we introduce a novel few-shot tuning framework, \textsc{suPreMe}, comprising biased prompt generation (BPG) and image-text consistency (ITC) modules. BPG enhances image-text fusion and improves generalization (prevents overfitting on training data) by conditioning ID text prototypes on the Gaussian-based estimated image domain bias; ITC reduces the modality gap by minimizing intra- and inter-modal distances. Moreover, inspired by our theoretical and empirical findings, we introduce a novel OOD score $S_{GMP}$, leveraging uni- and cross-modal similarities. Finally, we present extensive experiments to demonstrate that suPreMe consistently outperforms existing VLM-based OOD detection methods.

127

Matching Semantically Similar Non-Identical Objects

Yusuke Marumo ⋅ Kazuhiko Kawamoto ⋅ Satomi Tanaka ⋅ Shigenobu Hirano ⋅ Hiroshi Kera

Similar but non-identical objects are ubiquitous in our world, ranging from four-legged animals such as dogs and cats to cars of different models and flowers of various colors. This study addresses a novel task of matching such non-identical objects at the pixel level. We propose a descriptor weighting scheme that incorporates semantic information from object detectors into existing sparse feature matching methods, extending their targets from identical objects captured from different perspectives to semantically similar objects. The experiments show successful matching between non-identical objects in various cases, including in-class design variations, class discrepancy, and domain shifts (e.g., photo-drawing and image corruptions). The code will be publicly available soon.

128

UI-Styler: Ultrasound Image Style Transfer with Class-Aware Prompts for Cross-Device Diagnosis Using a Frozen Black-Box Inference Network

Nhat-Tuong Do-Tran ⋅ Ngoc-Hoang-Lam Le ⋅ Ching-Chun Huang

The appearance of ultrasound images varies across acquisition devices, causing domain shifts that degrade the performance of fixed black-box downstream inference models when reused. To mitigate this issue, it is practical to develop unpaired image translation (UIT) methods that effectively align the statistical distributions between source and target domains, particularly under the constraint of a reused black-box inference model. However, existing UIT approaches often overlook class-specific semantic alignment during domain adaptation, resulting in misaligned content-class mappings that can impair diagnostic accuracy. To address this limitation, we propose UI-Styler, a novel ultrasound-specific, class-aware image style transfer framework. UI-Styler leverages a pattern-matching mechanism to transfer texture patterns embedded in the target images onto source images while preserving the source structural content. In addition, we introduce a class-aware prompting strategy guided by pseudo labels of the target domain, which enforces accurate semantic alignment with diagnostic categories. Extensive experiments on ultrasound cross-device tasks demonstrate that UI-Styler consistently outperforms existing UIT methods, achieving state-of-the-art performance in distribution distance and downstream tasks, such as classification and segmentation.

129

Tables Decoded: DELTA for Structure, TARQA for Understanding

Jahanvi Rajput ⋅ Dhruv Kudale ⋅ Saikiran Kasturi ⋅ Utkarsh Verma ⋅ Ganesh Ramakrishnan

Table understanding is a core task in document intelligence, encompassing two key subtasks: table reconstruction and table visual question answering (TabVQA). While recent approaches predominantly rely on vision-language models (VLMs) operating on table images, we propose a more scalable and effective alternative based on structured textual representations. These representations are easier to process, align more naturally with large language models (LLMs), and eliminate the need for language-specific visual encoders, making them particularly suitable for multilingual documents. We present DELTA, a decoupled table reconstruction framework that separates structure recognition from OCR to extract both layout and content accurately. DELTA outputs tables in Optimized Table Structure Language (OTSL), a compact and unified format that encodes cell arrangements and textual content. DELTA achieves high-fidelity table-to-text conversion, outperforming prior methods on structure metrics with superior TEDS-Structure scores across FinTabNet, PubTabNet, and PubTables. On FinTabNet, it surpasses the best VLM baseline by an absolute 0.4% in overall TEDS score. Built on DELTA, we introduce TARQA (Table structure-Aware Representation for Question Answering), an LLM fine-tuned on OTSL-formatted tables for accurate and structure-aware TabVQA. TARQA outperforms baselines fine-tuned on HTML representations by 14.2%, and improves answer accuracy on WTQ by 6.8% and on FinTabNetQA by 9.2%. We release our code and models to advance research in multilingual, structure-aware table understanding.

130

What Happens When: Learning Temporal Orders of Events in Videos

Daechul Ahn ⋅ Yura Choi ⋅ Hyeonbeom Choi ⋅ Seongwon Cho ⋅ San Kim ⋅ Jonghyun Choi

Video Large Multimodal Models (VLMMs) have shown impressive performance in video understanding, yet their ability to accurately capture the temporal order of multiple events remains underexplored. Through comprehensive experiments, we observe that models perform very well on existing benchmarks even when video frames are scrambled. This implies that VLMMs may not necessarily rely on accurate sequential processing of visual events, but instead depend on prior knowledge of typical scenarios to answer the question. To benchmark temporal understanding capabilities in VLMMs, we propose VECTOR, designed to explicitly assess a model’s ability to identify the temporal order of events. On this benchmark, we observe that various VLMMs often fail to understand the order of events. To address this, we propose MECOT (Multi-Event instruction fine-tuning with Chain-of-Thought), which (1) trains models on detailed, event-by-event video descriptions and (2) uses chain-of-thought prompts at inference to enhance temporal awareness. MECOT outperforms prior art on VECTOR and improves performance on existing video benchmarks, implying the effectiveness of temporal understanding.

131

UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning

Huy Le ⋅ Nhat Chung ⋅ Tung Kieu ⋅ Jingkang Yang ⋅ Ngan Le

Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design. Our code will be fully released upon acceptance.

132

Personalized Image Privacy Advisors via Federated Daisy-Chaining

Sourasekhar Banerjee ⋅ Vengateswaran Subramaniam ⋅ Debaditya Roy ⋅ Vigneshwaran Subbaraju ⋅ Monowar Bhuyan

Image sharing on social media has become routine but poses serious privacy risks, as users may unknowingly expose sensitive information. This necessitates an image privacy advisor that assigns personalized privacy risk scores, helping users decide whether to share images publicly or not. However, centralized training of such models risks user data exposure and loss of ownership, as data must be uploaded to a central server. To safeguard user privacy, we adopt Federated Learning (FL), which enables collaborative model training without sharing raw data. Despite its advantages, FL faces challenges such as data heterogeneity from diverse user privacy preferences, limited annotations per user, and communication overhead. To address these issues, we propose CFedDC, a personalized FL algorithm combined with PIONet, a parameter-efficient model with 14.18× fewer trainable parameters and a 92.94% lower memory footprint than centralized baselines. CFedDC mitigates data heterogeneity through clustering and cluster-aware regularization with stability, and tackles data scarcity using a daisy-chaining knowledge transfer mechanism. Comprehensive experimental evaluations demonstrate that our proposed method achieves well-aligned personalized user privacy scores, outperforming existing centralized and FL-based image privacy models.

133

DiRe: Diversity-promoting Regularization for Dataset Condensation

Saumyaranjan Mohanty ⋅ Aravind Reddy ⋅ Konda Reddy Mopuri

In dataset condensation, given an original training dataset, the goal is to synthesize a small dataset that replicates the training utility of the original dataset, when used to train neural networks. Existing condensation methods synthesize datasets that contain significant redundancy, leading to their inefficiency. Thus, there is a dire need to ensure diversity in the synthesized datasets. In this work, we propose an intuitive Diversity Regularizer (DiRe) composed of cosine similarity and Euclidean distance. Most importantly, the proposed regularizer can be applied off-the-shelf to various state-of-the-art optimization-driven condensation methods. Through extensive experimentation, we demonstrate that our approach improves state-of-the-art condensation methods on various benchmark datasets from CIFAR-10 to ImageNet-1K regarding generalization and diversity metrics.
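As a rough illustration of how cosine similarity and Euclidean distance might be combined into a diversity term, the sketch below scores a set of synthetic samples. The abstract does not give the regularizer's exact form, so the function name `diversity_penalty`, the weights, and the sign convention (lower is more diverse) are all assumptions made for this example, not the authors' formulation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diversity_penalty(samples, cos_weight=1.0, dist_weight=1.0):
    """Hypothetical penalty averaged over all sample pairs.

    High pairwise cosine similarity raises the penalty and large pairwise
    Euclidean distance lowers it, so minimizing it pushes samples apart.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cos_weight * cosine_similarity(samples[i], samples[j])
            total -= dist_weight * euclidean_distance(samples[i], samples[j])
            pairs += 1
    return total / pairs


# Toy data: three identical samples versus three spread-out samples.
redundant = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
diverse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(diversity_penalty(redundant) > diversity_penalty(diverse))
```

In an optimization-driven condensation method, a term of this kind would presumably be added to the existing condensation loss with a tunable weight.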

134

Vision-informed Semantic Text Alignment for Open-set Recognition in Remote Sensing

Siddhant Gole ⋅ Akash Pal ⋅ Ankit Jha ⋅ Subhasis Chaudhuri ⋅ Biplab Banerjee

Existing Open-Set Recognition (OSR) methods struggle in remote sensing (RS) as their reliance on unimodal visual features fails to resolve the severe inter-class similarity inherent in overhead imagery. To address this, we propose ViSTA-RS, a novel multimodal framework that leverages semantic context from language to disambiguate visually similar scenes. Our approach first constructs semantically-rich class prototypes by jointly encoding images with generated text captions using a Vision-Language Model. We then introduce a reconstruction-based mechanism where an image's visual embedding is expressed as a weighted combination of these semantic prototypes. The magnitude of the reconstruction error serves as a robust novelty score, with a statistically principled threshold determined by Extreme Value Theory (EVT). This alignment of multimodal semantics with prototype reconstruction is uniquely suited for the fine-grained nature of RS data. On four challenging benchmarks, ViSTA-RS sets a new state-of-the-art, improving the AUROC for unknown detection by a significant 6.7% over leading baselines while maintaining high accuracy on known classes.
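The reconstruction-based novelty score can be sketched as follows. This is a simplified illustration under stated assumptions: the combination weights are taken to be a softmax over dot-product similarities, the `temperature` value is arbitrary, and a fixed `threshold` stands in for the EVT-derived threshold mentioned in the abstract. None of these choices are confirmed by the source.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruction_error(embedding, prototypes, temperature=0.1):
    """Reconstruct the embedding as a weighted sum of prototypes and
    return the Euclidean norm of the residual (assumed weighting scheme)."""
    sims = [
        sum(e * p for e, p in zip(embedding, proto)) / temperature
        for proto in prototypes
    ]
    weights = softmax(sims)
    recon = [
        sum(w * proto[d] for w, proto in zip(weights, prototypes))
        for d in range(len(embedding))
    ]
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(embedding, recon)))


def is_unknown(embedding, prototypes, threshold):
    """Flag a sample as novel when its reconstruction error is large.
    The fixed threshold is a stand-in for an EVT-based one."""
    return reconstruction_error(embedding, prototypes) > threshold


# Toy prototypes for two known classes in a 3-D embedding space.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
known_like = [0.98, 0.05, 0.0]   # close to the first prototype
novel_like = [0.0, 0.0, 1.0]     # orthogonal to both prototypes
print(is_unknown(known_like, prototypes, threshold=0.5))
print(is_unknown(novel_like, prototypes, threshold=0.5))
```

A sample that lies near a known prototype is reconstructed almost exactly, while one outside the span of the prototypes leaves a large residual.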

135

Anatomy-VLM: A Fine-grained Vision-Language Model for Medical Interpretation

Difei Gu ⋅ Yunhe Gao ⋅ Mu Zhou ⋅ Dimitri Metaxas

Accurate disease interpretation from radiology remains challenging due to imaging heterogeneity. Achieving expert-level diagnostic decisions requires integration of subtle image features with clinical knowledge. Yet major vision-language models (VLMs) treat images as holistic entities and overlook fine-grained image details that are vital for disease diagnosis. Clinicians analyze images by utilizing their prior medical knowledge and identifying anatomical structures as important regions of interest (ROIs). Inspired by this human-centric workflow, we introduce Anatomy-VLM, a fine-grained vision-language model that incorporates multi-scale information. First, we design a model encoder to localize key anatomical features from entire medical images. Second, these regions are enriched with structured knowledge for contextually aware interpretation. Finally, the model encoder aligns multi-scale medical information to generate clinically interpretable disease predictions. Anatomy-VLM achieves outstanding performance on both in- and out-of-distribution datasets. We also validate the performance of Anatomy-VLM on downstream image segmentation tasks, suggesting that its fine-grained alignment captures anatomical and pathology-related knowledge. Furthermore, Anatomy-VLM's encoder facilitates zero-shot anatomy-wise interpretation, demonstrating its strong expert-level clinical interpretation capabilities.

136

Temporal Object Captioning for Street Scene Videos from LiDAR Tracks

Vignesh Gopinathan ⋅ Urs Zimmermann ⋅ Michael Arnold ⋅ Matthias Rottmann

Video captioning models have seen notable advancements in recent years, especially with regard to their ability to capture temporal information. While many research efforts have focused on architectural advancements, such as temporal attention mechanisms, there remains a notable gap in understanding how models capture and utilize temporal semantics for effective temporal feature extraction, especially in the context of Advanced Driver Assistance Systems. We propose an automated LiDAR-based captioning procedure that focuses on the temporal dynamics of traffic participants. Our approach uses a rule-based system to extract essential details such as lane position and relative motion from object tracks, followed by a template-based caption generation. Our findings show that training SwinBERT, a video captioning model, using only front camera images and supervised with our template-based captions, specifically designed to encapsulate fine-grained temporal behavior, leads to improved temporal understanding consistently across three datasets. In conclusion, our results clearly demonstrate that integrating LiDAR-based caption supervision significantly enhances temporal understanding, effectively addressing and reducing the inherent visual/static biases prevalent in current state-of-the-art model architectures.
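To make the rule-based extraction and template step concrete, here is a minimal sketch. The track format (lateral offset and longitudinal distance in metres, in the ego vehicle's frame), the lane width, the tolerance, the left-positive sign convention, and the caption templates are all hypothetical choices for illustration. The abstract does not specify the authors' rules or templates.

```python
def lane_of(lateral_offset_m, lane_width_m=3.5):
    """Map a lateral offset to a coarse lane label (assumed: positive = left)."""
    half = lane_width_m / 2
    if abs(lateral_offset_m) <= half:
        return "ego lane"
    return "left lane" if lateral_offset_m > 0 else "right lane"


def relative_motion(distances_m, tolerance_m=1.0):
    """Classify motion from the change in longitudinal distance over the track."""
    change = distances_m[-1] - distances_m[0]
    if change < -tolerance_m:
        return "approaching"
    if change > tolerance_m:
        return "moving away"
    return "keeping its distance"


def caption_track(category, track):
    """Fill a caption template from a list of (lateral, distance) samples."""
    laterals = [point[0] for point in track]
    distances = [point[1] for point in track]
    start_lane = lane_of(laterals[0])
    end_lane = lane_of(laterals[-1])
    motion = relative_motion(distances)
    if start_lane != end_lane:
        return (
            f"A {category} moves from the {start_lane} "
            f"to the {end_lane} while {motion}."
        )
    return f"A {category} in the {start_lane} is {motion}."


# Hypothetical tracks: a car cutting in from the left, and a steady truck ahead.
cut_in = [(3.6, 40.0), (2.0, 30.0), (0.4, 22.0)]
steady = [(0.0, 20.0), (0.1, 20.5)]
print(caption_track("car", cut_in))
print(caption_track("truck", steady))
```

Captions produced this way describe how an object's state changes over the clip, which is the kind of temporal signal the supervision is meant to carry.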

137

STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences

Soroush Mehraban ⋅ Javad Rajabi ⋅ Andrea Iaboni ⋅ Babak Taati

Self-supervised pretraining methods with masked prediction demonstrate remarkable within-dataset performance in skeleton-based action recognition. However, we show that, unlike contrastive learning approaches, they do not produce well-separated clusters. Additionally, these methods struggle with generalization in few-shot settings. To address these issues, we propose Self-supervised Tuning for 3D Action Recognition in Skeleton sequences (STARS). Specifically, STARS first applies a masked prediction stage with an encoder-decoder architecture. It then employs nearest-neighbor contrastive learning to partially tune the weights of the encoder, enhancing the formation of semantic clusters for different actions. By tuning the encoder for a few epochs, and without using hand-crafted data augmentations, STARS achieves state-of-the-art self-supervised results on various benchmarks, including NTU-60, NTU-120, and PKU-MMD. In addition, STARS exhibits significantly better results than masked prediction models in few-shot settings, where the model has not seen the actions during pretraining.

138

RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph

Sameer Malik ⋅ Ayush Singh ⋅ Moyuru Yamada ⋅ Dishank Aggarwal

Comprehending long videos remains a significant challenge for Large Multi-modal Models (LMMs). Current LMMs struggle to process videos lasting minutes to hours due to their lack of explicit memory and retrieval mechanisms. To address this limitation, we propose RAVU (Retrieval Augmented Video Understanding), a novel framework for video understanding enhanced by retrieval with compositional reasoning over a spatio-temporal graph. We construct a graph representation of the video, capturing both spatial and temporal relationships between entities. This graph serves as a long-term memory, allowing us to track objects and their actions across time. To answer complex queries, we decompose the queries into a sequence of reasoning steps and execute these steps on the graph, retrieving relevant key information. Our approach enables more accurate understanding of long videos, particularly for queries that require multi-hop reasoning and tracking objects across frames. Our approach demonstrates superior performance with limited retrieved frames (5-10) compared with other SOTA methods and baselines on two major video QA datasets, NExT-QA and EgoSchema.

139

Conditional Text-to-Image Generation with Reference Guidance

Taewook Kim ⋅ Ze Wang ⋅ Zhengyuan Yang ⋅ Jiang Wang ⋅ Lijuan Wang ⋅ Zicheng Liu ⋅ Qiang Qiu

Text-to-image diffusion models have demonstrated tremendous success in synthesizing visually stunning images given textual instructions. Despite remarkable progress in creating high-fidelity visuals, text-to-image models can still struggle with precisely rendering subjects, such as text spelling. To address this challenge, this paper explores using an additional image condition that provides visual guidance on the particular subjects for diffusion models to generate. In addition, this reference condition empowers the model to be conditioned in ways that the vocabularies of the text tokenizer cannot adequately represent, and further extends the model's generalization to novel capabilities such as generating non-English text spellings. We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references. Each plugin is trained with auxiliary networks and loss functions customized for applications such as English scene-text generation, multi-lingual scene-text generation, and logo-image generation. Our expert plugins outperform existing methods on all tasks, with each plugin containing only 28.55M trainable parameters.

140

Improved Wildfire Spread Prediction with Time-Series Data and the WSTS+ Benchmark

Saad Lahrichi ⋅ Jake Bova ⋅ Jesse Johnson ⋅ Jordan Malof

Recent research has demonstrated the potential of deep neural networks (DNNs) to accurately predict wildfire spread on a given day based upon high-dimensional explanatory data from a single preceding day, or from a time series of $T$ preceding days. For the first time, we investigate a large number of existing data-driven wildfire modeling strategies under controlled conditions, revealing the best modeling strategies and resulting in models that achieve state-of-the-art (SOTA) accuracy for both single-day and multi-day input scenarios, as evaluated on a large public benchmark for next-day wildfire spread, termed the WildfireSpreadTS (WSTS) benchmark. Consistent with prior work, we found that models using time-series input obtained the best overall accuracy, suggesting this is an important future area of research. Furthermore, we create a new benchmark, WSTS+, by incorporating four additional years of historical wildfire data into the WSTS benchmark. Our benchmark doubles the number of unique years of historical data, expands its geographic scope, and, to our knowledge, represents the largest public benchmark for _time-series_-based wildfire spread prediction.

141

From Few-Shot to Zero-Shot Pallet Load Recognition: A Deployed Embedding-Based Vision System for Industrial Logistics

Juan Jesús Losada del Olmo ⋅ Emilio Ballesteros ⋅ Pedro Lopez-de-Teruel ⋅ Alberto Ruiz

Automated pallet load recognition is a critical task in industrial logistics, but the deployment of conventional deep learning systems is often unfeasible. Their reliance on large, manually annotated datasets creates a prohibitive bottleneck in terms of cost and time, especially in dynamic environments where product lines frequently change. To overcome this challenge, we introduce a highly flexible, dual-mode vision system built upon dense patch embeddings. Our primary, few-shot approach leverages features from the CAPI vision model to construct a compact memory bank from as little as a single labeled example per class. Classification is then performed via a simple yet highly effective $k$-nearest neighbor search. For annotation-free scenarios, we also propose a zero-shot mode that identifies the load by finding the rectangular region that minimizes intra-class feature variance. We demonstrate state-of-the-art performance on a new, challenging industrial dataset, where our few-shot method attains a $mAP_{50-95}$ over 90\% with only one support image per class. Additionally, the fully unsupervised approach achieves a notable $mAP_{50-95}$ of up to 75\%. The system's robustness and practical value were validated through its successful deployment in high-stakes, real-world scenarios. Our findings establish a basis for lightweight solutions that support the rapid, data-efficient integration of new vision systems into industrial workflows.
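The few-shot mode described above (a small memory bank queried by nearest-neighbor search) can be sketched as follows. This is a simplified illustration: each item is reduced to a single toy embedding vector rather than dense patch embeddings, cosine similarity is an assumed distance measure, and the class labels are made up for the example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def knn_classify(query, memory_bank, k=1):
    """Return the majority label among the k most similar memory entries.

    memory_bank is a list of (embedding, label) pairs; with one labeled
    example per class and k=1 this reduces to nearest-prototype matching.
    """
    ranked = sorted(
        memory_bank,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical memory bank built from a single support example per class.
memory_bank = [
    ([1.0, 0.1, 0.0], "boxes"),
    ([0.0, 1.0, 0.2], "bags"),
    ([0.1, 0.0, 1.0], "drums"),
]
query = [0.9, 0.2, 0.1]
print(knn_classify(query, memory_bank, k=1))
```

Adding a new product class in this setup only requires appending its embedding to the memory bank, with no retraining.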

142

Graph-Based Spectral Attention with Multi-Spectral Images for Illuminant Estimation

Dong-Hoon Kang ⋅ Seung-Yeop Baek ⋅ Jong-Ok Kim

Existing color constancy methods based on deep learning primarily rely on the RGB domain and often struggle with accurate illuminant estimation in scenes with minimal spatial information, such as monochromatic environments, leading to suboptimal performance. To address this issue, this paper introduces an approach that utilizes multi-spectral (MS) images estimated by a pretrained RGB-to-MS model, enabling more accurate illuminant estimation. Additionally, we propose a graph-based spectral attention mechanism designed to effectively extract spectral features within the multi-spectral domain, further enhancing the robustness and accuracy of color constancy. This approach demonstrates outstanding effectiveness on our custom dataset, significantly outperforming existing methods. Additionally, when evaluated on the widely recognized NUS-8 and Cube+ datasets, the proposed method shows a substantial relative improvement of 21.5% on NUS-8 and 9.9% on Cube+ compared to previous state-of-the-art methods.