Poster Session

Poster Session 5

Tue 10 Mar 10:45 a.m. PDT — 12:15 p.m. PDT


1
Motion-Aware Graph Fusion NetWork for 3D Human Pose Estimation

Yen Pham ⋅ Xiaohui Yuan ⋅ Chengyuan Zhuang

Recent state-of-the-art (SOTA) methods in 3D human pose estimation (HPE) typically prioritize lifting 2D pose coordinates to 3D but tend to underemphasize the importance of generalizing under real-world conditions with noisy 2D inputs from off-the-shelf 2D detectors. In this paper, we introduce Graph Attention Fusion Network (GAtFuN), a novel motion-aware framework that integrates our spatial and temporal graph attention mechanisms to explicitly model joint velocities and motion transformations, resulting in more stable and coherent 3D pose predictions despite being trained with the same dataset pipeline as other SOTA methods. GAtFuN achieves a 7.8% improvement in MPJPE over the current SOTA on the Human3.6M dataset and a 1.9% improvement on the MPI-INF-3DHP dataset, while demonstrating more robust performance on the 3DPW dataset in the wild.
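For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is commonly computed as the average Euclidean distance between predicted and ground-truth 3D joints. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `mpjpe` and the two-joint example pose are hypothetical.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: sequences of (x, y, z) joint positions in the same units
    (for example millimetres). Returns the average Euclidean distance.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Hypothetical two-joint pose: the first joint is exact, the second is 5 units off.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error = mpjpe(pred, gt)  # (0 + 5) / 2 = 2.5
```

Lower values are better, so an "improvement in MPJPE" means a reduction in this error.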


2
UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training

Jiawei Qin ⋅ Xucong Zhang ⋅ Yusuke Sugano

Despite decades of research on data collection and model architectures, current gaze estimation models encounter significant challenges in generalizing across diverse data domains. Recent advances in self-supervised pre-training have demonstrated remarkable generalization across various vision tasks. However, their effectiveness in gaze estimation remains unexplored. We propose UniGaze, which for the first time leverages large-scale in-the-wild facial datasets for gaze estimation through self-supervised pre-training. Through systematic investigation, we clarify critical factors that are essential for effective pre-training in gaze estimation. Our experiments reveal that self-supervised approaches designed for semantic tasks fail when applied to gaze estimation, while our carefully designed pre-training pipeline consistently improves cross-domain performance. Through comprehensive experiments on challenging cross-dataset evaluations and novel protocols, including leave-one-dataset-out and joint-dataset settings, we demonstrate that UniGaze significantly improves generalization across multiple data domains while minimizing reliance on costly labeled data. Source code and model will be available.


3
Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities

Fan Yang ⋅ Quanting Xie ⋅ Atsunori Moteki ⋅ Shoichi Masui ⋅ Shan Jiang ⋅ Kanji Uchino ⋅ Yonatan Bisk ⋅ Graham Neubig

Periodic human activities with implicit workflows are common in manufacturing, sports, and daily life. While short-term periodic activities, characterized by simple structures and high-contrast patterns, have been widely studied, long-term periodic workflows with low-contrast patterns remain largely underexplored. To bridge this gap, we introduce the first benchmark comprising 580 multimodal human activity sequences featuring long-term periodic workflows. The benchmark supports three evaluation tasks aligned with real-world applications: unsupervised periodic workflow detection, task completion tracking, and procedural anomaly detection. We also propose a lightweight, training-free baseline for modeling diverse periodic workflow patterns. Experiments show that: (i) our benchmark presents significant challenges to both unsupervised periodic detection methods and zero-shot approaches based on powerful large language models (LLMs); (ii) our baseline outperforms competing methods by a substantial margin in all evaluation tasks; and (iii) in real-world applications, our baseline demonstrates deployment advantages on par with traditional supervised workflow detection approaches, eliminating the need for annotation and retraining. Our code and dataset will be available on GitHub and HuggingFace.


4
VAST-ReID: A Low-Light Benchmark Dataset for Person Re-Identification with Visual and Attribute-Rich Semantic Tracking

Hammad Khan ⋅ Rakesh Giri ⋅ Kamalakar Thakare ⋅ Heeseung Choi ⋅ Hyungjoo Jung ⋅ Debi Dogra ⋅ Ig-Jae Kim

The person re-identification (ReID) task is important for designing intelligent surveillance systems. ReID can be highly challenging in low-light and low-resolution scenarios. Existing ReID datasets predominantly feature cropped pedestrian images captured in well-lit environments, often lacking semantic richness, frame-level temporal continuity, and robustness to adverse conditions. To address these limitations, we introduce VAST-ReID, a new benchmark dataset specifically designed for the low-light person ReID task in real-world surveillance contexts. VAST-ReID consists of 1,211 surveillance videos collected at 21 different locations, capturing 169 distinct pedestrians of various age groups. The dataset emphasizes naturally low-light and visually degraded scenarios. Each identity is annotated with dense bounding boxes and enriched with auxiliary semantic labels, including pedestrian attributes and LLM-generated descriptions. While these annotations are not used during supervised training, they provide valuable semantic context for advancing research in language-guided retrieval and attribute-aware modeling. Additionally, we release identity-aligned image crops under the BoxTrack-ReID subset, which has over 14.3K frames sampled at 1 fps from the raw videos, with standard training, gallery, and query splits compatible with the Market-1501 evaluation protocol, enabling straightforward benchmarking. The dataset has been benchmarked against SOTA methods, and experiments reveal that there is substantial scope for improvement in ReID research.


5
DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors

Kaustubh Kundu ⋅ Hrishav Barua ⋅ Lucy Robertson-Bell ⋅ Zhixi Cai ⋅ Kalin Stefanov

The trend in sign language generation is centered around data-driven generative methods. These methods require vast amounts of precise 2D and 3D human pose data to achieve a generation quality acceptable to the Deaf community. However, most current sign language datasets are video-based, limited to automatically reconstructed 2D human poses (i.e., keypoints), and lack accurate 3D information. Moreover, manual production of accurate 2D and 3D human pose information from videos is a labor-intensive process. Furthermore, the existing state of the art for automatic 3D human pose estimation from sign language videos is prone to occlusion, noise, and motion blur effects, resulting in poor reconstruction quality. In response to this, we introduce DexAvatar, a novel framework to reconstruct bio-mechanically accurate fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, guided by learned 3D hand and body priors. DexAvatar achieves strong performance on the SGNify motion capture dataset, the only benchmark available for this task, reaching an improvement of 35.11% in the estimation of body and hand poses compared to the state of the art.

Vision-language models (VLMs) exhibit impressive zero-shot transfer, but remain static and cannot adapt when exposed to new tasks. Meanwhile, conventional continual learning often overlooks preserving this zero-shot capability during adaptation. In this work, we present DREAM, a parameter-efficient framework that enables continual adaptation of VLMs while minimizing forgetting and preserving zero-shot performance. DREAM employs a dynamic prompt system with lightweight, task-specific parameters managed by two modules: a prompt composition module that dynamically generates prompts to adapt the VLM, and a query-key module that uses learned token weights to reliably activate the appropriate parameters at inference. To enhance robustness, we propose GuidedMix, which creates semantically meaningful mixed images and pairs them with mixture-aware text embeddings to strengthen representation learning through image-text alignment. We further leverage the GuidedMix samples to estimate task-specific query-key similarity thresholds that identify samples of unseen tasks and prevent spurious prompt usage on the VLM, thereby safeguarding its zero-shot behavior. Experiments show that our method adapts efficiently, mitigates forgetting, and maintains strong zero-shot transfer with substantially fewer trainable parameters, showing consistent gains even under partial supervision.


7
brat: Aligned Multi-View Embeddings for Brain MRI Analysis

Maxime Kayser ⋅ Maksim Gridnev ⋅ Wanting Wang ⋅ Max Bain ⋅ Aneesh Rangnekar ⋅ Avijit Chatterjee ⋅ Aleksandr Petrov ⋅ Harini Veeraraghavan ⋅ Nathaniel Swinburne

We present brat (brain report alignment transformer), a multi-view representation learning framework for brain magnetic resonance imaging (MRI) trained on MRIs paired with clinical reports. Brain MRIs present unique challenges due to the presence of numerous, highly varied, and often subtle abnormalities that are localized to a few slices within a 3D volume. To address these challenges, we introduce a brain MRI dataset $10\times$ larger than existing ones, containing approximately 80,000 3D scans with corresponding radiology reports, and propose a multi-view pre-training approach inspired by advances in document retrieval. We develop an implicit query-feature matching mechanism and adopt concepts from quality-diversity to obtain multi-view embeddings of MRIs that are aligned with the clinical features given by report sentences. We evaluate our approach across multiple vision-language and vision tasks, demonstrating substantial performance improvements. By publicly releasing our suite of model weights, we aim to facilitate further research in brain MRI analysis.


8
Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score

Eman Ali ⋅ Sathira Silva ⋅ Chetan Arora ⋅ Muhammad Haris Khan

Vision-language models (VLMs) like CLIP excel in zero-shot learning by aligning image and text representations through contrastive pretraining. Existing approaches to unsupervised adaptation (UA) for fine-grained classification with VLMs either rely on fixed alignment scores that may not capture evolving, subtle class distinctions or rely on computationally expensive pseudo-labeling strategies that limit scalability. In contrast, we show that modeling fine-grained cross-modal interactions during adaptation produces more accurate, class-discriminative pseudo-labels and substantially improves performance over state-of-the-art (SOTA) methods. We introduce Fine-grained Alignment and Interaction Refinement (FAIR), an innovative approach that dynamically aligns localized image features with descriptive language embeddings through a set of Class Description Anchors (CDA). This enables the definition of a Learned Alignment Score (LAS), which incorporates CDA as an adaptive classifier, facilitating cross-modal interactions to improve self-training in unsupervised adaptation. Furthermore, we propose a self-training weighting mechanism designed to refine pseudo-labels in the presence of inter-class ambiguities. Our approach, FAIR, delivers a substantial performance boost in fine-grained unsupervised adaptation, achieving a notable overall gain of 2.78% across 13 fine-grained datasets compared to SOTA methods.


9
Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation

Liu He ⋅ Xiao Zeng ⋅ Yizhi Song ⋅ Albert Chen ⋅ Lu Xia ⋅ Shashwat Verma ⋅ Sankalp Dayal ⋅ Min Sun ⋅ Cheng-Hao Kuo ⋅ Daniel Aliaga

Multimodal Large Language Models (MLLMs) struggle with accurately capturing camera-object relations, especially for object orientation, camera viewpoint, and camera shots. This stems from the fact that existing MLLMs are trained on images with a limited diversity of camera-object relations and corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. Our framework takes 3D assets as input and uses rendering and diffusion-based image generation models to create photorealistic images preserving precise camera-object relations. Additionally, large language models (LLMs) are used to generate text prompts for guiding visual instruction tuning and controlling image generation. We create Ultimate3D, a dataset of 240K VQAs with precise camera-object annotations, and a corresponding benchmark. MLLMs fine-tuned on our proposed dataset outperform commercial models by a large margin, achieving an average accuracy improvement of 33.4% on camera-object relation recognition tasks. Our code, dataset, and benchmark will contribute to broad MLLM applications.

Vision-Language Models (VLMs) demonstrate remarkable capabilities in visual understanding and reasoning, such as in Visual Question Answering (VQA), where the model is asked a question related to a visual input. Still, these models can make distinctly unnatural errors, for example, providing (wrong) answers to unanswerable VQA questions, such as questions asking about objects that do not appear in the image. To address this issue, we propose CLIP-UP: CLIP-based Unanswerable Problem detection, a novel lightweight method for equipping VLMs with the ability to withhold answers to unanswerable questions. CLIP-UP leverages CLIP-based similarity measures to extract question-image alignment information to detect unanswerability, requiring efficient training of only a few additional layers, while keeping the original VLMs' weights unchanged. Tested across several models, CLIP-UP achieves significant improvements on benchmarks assessing unanswerability in both multiple-choice and open-ended VQA, surpassing other methods, while preserving original performance on other tasks. We will release our code and training data to support future research.


11
Isolating the Role of Temporal Information in Video Saliency: A Controlled Experimental Analysis

Peter El-Jiz ⋅ Matthias Kuemmerer ⋅ Matthias Tangemann ⋅ Matthias Bethge ⋅ Andreas Bartels ⋅ Michael Bannert

The role of temporal information in predicting human gaze in dynamic scenes remains a critical open question, underscored by the paradoxical finding that strong static models can outperform complex video-based models. This suggests that the true contribution of temporal cues has been obscured by confounding architectural variables. To resolve this, we present a rigorous, controlled experiment centered on a minimal architectural pair: a spatio-temporal saliency model (UniformerSal-ST) and its identical spatial-only counterpart (UniformerSal-S), designed to unambiguously isolate the impact of temporal feature integration. Our results demonstrate that principled temporal fusion yields a substantial Information Gain (IG) of +0.20 bits on temporally coherent datasets like LEDOV. Crucially, our controlled comparison also uncovers a key failure mode: on datasets with frequent hard cuts like DIEM, the same mechanism degrades performance, incurring a 0.07 bits IG deficit. We provide a mechanistic explanation for this dichotomy, revealing how certain visual scenarios (scene discontinuities, rapid camera zooms) can disrupt current temporal fusion approaches. By precisely quantifying both the benefits and drawbacks of temporal processing, our work provides the community with clear, actionable insights into when and why temporal information should be modeled for more robust and accurate video saliency prediction.
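The abstract reports Information Gain (IG) in bits without defining it. In the saliency literature, IG is commonly taken to be the average log2-likelihood advantage of a model over a baseline density, evaluated at the fixated locations. The sketch below assumes that definition; the function name, the choice of baseline, and the probability values are hypothetical, and the authors' evaluation may differ.

```python
import math


def information_gain(model_probs, baseline_probs):
    """Average log2-likelihood gain (in bits per fixation) of a model over a baseline.

    model_probs[i] and baseline_probs[i] are the probabilities each density
    assigns to the i-th fixated location; all values must be positive.
    """
    if len(model_probs) != len(baseline_probs) or not model_probs:
        raise ValueError("inputs must be non-empty and the same length")
    gains = (math.log2(m) - math.log2(b) for m, b in zip(model_probs, baseline_probs))
    return sum(gains) / len(model_probs)


# Hypothetical values: the model is 2x and 4x more confident than the baseline.
model = [0.02, 0.04]
baseline = [0.01, 0.01]
ig = information_gain(model, baseline)  # (1 + 2) / 2 = 1.5 bits
```

Under this definition a positive IG means the model explains fixations better than the baseline, and a negative difference between two models corresponds to the kind of deficit the abstract reports on DIEM.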


12
Diffusion-Based Action Recognition Generalizes to Untrained Domains

Rogério Guimarães ⋅ Frank Xiao ⋅ Pietro Perona ⋅ Markus Marks

Humans can recognize the same actions despite large context and viewpoint variations, such as differences between species (walking in spiders vs. horses), viewpoints (egocentric vs. third-person), and contexts (real life vs. movies). Current deep learning models struggle with such generalization. We propose using features generated by a Vision Diffusion Model (VDM), aggregated via a transformer, to achieve human-like action recognition across these challenging conditions. We find that generalization is enhanced by the use of a model conditioned on earlier timesteps of the diffusion process to highlight semantic information over pixel-level details in the extracted features. We experimentally explore the generalization properties of our approach in classifying actions across animal species, across different viewing angles, and different recording contexts. Our model sets a new state-of-the-art across all three generalization benchmarks, bringing machine action recognition closer to human-like robustness.


13
CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation

Prantik Howlader ⋅ Hoang Nguyen-Canh ⋅ Srijan Das ⋅ Jingyi Xu ⋅ Hieu Le ⋅ Dimitris Samaras

Reasoning segmentation is a powerful tool, yet its generalization remains limited due to the high cost of acquiring diverse, high-quality visual and linguistic supervision. In this work, we present CORA, a semi-supervised framework that jointly learns from limited labeled data and a large corpus of unlabeled images. To improve supervision from limited labeled data, CORA introduces conditional visual instructions that encode spatial and contextual relationships between objects. To utilize unlabeled data, we propose a VLM-guided output consistency strategy that filters noisy pseudo-labels based on the stability of predictions across queries that are semantically equivalent. Additionally, we enforce token-level contrastive alignment between labeled and pseudo-labeled samples to enhance feature consistency. Together, these components enable CORA to perform robust reasoning segmentation with minimal supervision, outperforming existing baselines under constrained annotation settings. Our method achieves state-of-the-art results, requiring as few as 100 labeled images on Cityscapes, a benchmark dataset for urban scene understanding, surpassing the baseline by $+2.3\%$. Similarly, our approach improves performance by $+2.4\%$ with only 180 labeled images on PanNuke, a histopathology dataset.


14
Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information

Neelima Prasad ⋅ Jarek Reynolds ⋅ Neel Karsanbhai ⋅ Tanusree Sharma ⋅ Lotus Zhang ⋅ Abigale Stangl ⋅ Yang Wang ⋅ Leah Findlater ⋅ Danna Gurari

We propose a novel task, hierarchical instance tracking, which entails tracking all instances of predefined categories of objects and parts while maintaining their hierarchical relationships. We introduce the first benchmark dataset supporting this task, consisting of 2,765 unique entities that are tracked in 552 videos and belong to 40 categories (across objects and parts). Evaluation of seven variants of four models tailored to our novel task reveals the new dataset is challenging. To facilitate community-wide progress, we publicly release our dataset with an evaluation server at https://anonymous.com.


15
Causality-Driven Audits of Model Robustness

Nathan Drenkow ⋅ William Paul ⋅ Christopher Ribaudo ⋅ Mathias Unberath

Robustness audits of deep neural networks (DNNs) provide a means to uncover model sensitivities to the challenging real-world imaging conditions that significantly degrade DNN performance in the wild. Such conditions are often the result of multiple interacting factors inherent to the environment, sensor, or processing pipeline and may lead to complex image distortions that are not easily categorized. When robustness audits are limited to a set of isolated imaging effects or distortions, the results cannot be (easily) transferred to real-world conditions where image corruptions may be more complex or nuanced. To address this challenge, we present an alternative robustness auditing method that uses causal inference to measure DNN sensitivities to the factors of the imaging process that cause complex distortions. Our approach uses causal models to explicitly encode assumptions about the domain-relevant factors and their interactions. Then, through extensive experiments on natural and rendered images across multiple vision tasks, we show that our approach reliably estimates causal effects of each factor on DNN performance using observational domain data. These causal effects directly tie DNN sensitivities to observable properties of the imaging pipeline in the domain of interest, with the aim of reducing the risk of unexpected DNN failures when deployed in that domain.

Deep learning-based vision systems are increasingly deployed in high-stakes applications, yet remain vulnerable to imperceptible manipulations that exploit model blind spots. We present BAFLE-DCT, a frequency-domain steganography framework that achieves high-capacity, imperceptible data embedding while evading state-of-the-art deep steganalysis. Unlike traditional spatial-domain methods that alter pixel values and trigger visual or statistical artifacts, BAFLE-DCT operates in the DCT domain, selectively modifying mid-frequency coefficients in perceptually insignificant regions identified via saliency analysis. A lightweight feedforward network further refines block selection using entropy and DCT variance features to balance embedding capacity and visual fidelity. Stego images generated by BAFLE-DCT consistently bypass advanced detectors such as YeNet and SRNet, yielding near-random detection rates (~50%) across payload sizes. Importantly, embedded images maintain classification consistency under CLIP, demonstrating semantic preservation. We also release a large-scale, full-color steganographic dataset for frequency-domain research, addressing limitations of grayscale, spatial-domain benchmarks. Our results expose critical vulnerabilities in visual content authentication pipelines and motivate the development of frequency-aware detection strategies. Code and data will be released publicly upon acceptance.


17
Logit-Adjusted Test-Time Adaptation under Partial Class Imbalance

Thilina Weerasinghe ⋅ Ruwan Tennakoon ⋅ WeiQin Chuah ⋅ Alireza Bab-Hadiashar

Test-Time Adaptation (TTA) enables deep neural networks to handle distribution shifts without requiring labels at inference. However, existing methods commonly assume complete class overlap between source and target domains, which rarely holds in practice. We study the challenging setting of Partial Class Imbalance, where the target domain contains only a subset of source classes. We show that entropy minimization-based TTA methods degrade over long test sequences because batch normalization updates bias feature representations toward visible classes, resulting in skewed predictions. To address this, we propose Logit-Adjusted Entropy Minimization, a simple yet effective strategy that integrates target class priors into the adaptation objective. Our method is model-agnostic and can be seamlessly applied to a wide range of TTA algorithms. Extensive experiments on CIFAR-100-C, ImageNet-C under diverse corruptions and severity levels, and the large-scale DomainNet-126 dataset demonstrate that our method consistently improves adaptation stability and accuracy for both CNNs and Vision Transformers. Compared to strong baselines, our approach reduces overfitting to visible classes and mitigates performance degradation in long-sequence adaptation. Code is available at https://anonymous.4open.science/r/latte_2025.
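The abstract does not give the exact objective, so the following is only one plausible reading of "integrating target class priors into the adaptation objective": add the log of an estimated class prior to each logit before computing the prediction entropy. The additive form, the sign of the adjustment, the function names, and the prior values are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_adjusted_entropy(logits, class_prior, eps=1e-12):
    """Entropy (in nats) of softmax(logits + log(prior)).

    ASSUMPTION: the prior enters as an additive log term; the paper may use
    a different form. eps keeps log() finite for classes with zero prior.
    """
    adjusted = [z + math.log(p + eps) for z, p in zip(logits, class_prior)]
    probs = softmax(adjusted)
    return -sum(q * math.log(q + eps) for q in probs)


# Hypothetical example: two classes tie on raw logits.
logits = [2.0, 2.0, 0.0]
uniform_prior = [1 / 3, 1 / 3, 1 / 3]
skewed_prior = [0.98, 0.01, 0.01]  # target stream dominated by class 0

h_uniform = logit_adjusted_entropy(logits, uniform_prior)
h_skewed = logit_adjusted_entropy(logits, skewed_prior)
```

With a uniform prior the adjustment cancels out in the softmax, so the entropy is that of the raw prediction; with the skewed prior the tie is broken toward the dominant class and the entropy drops.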


18
Test Time Adaptation Using Adaptive Quantile Recalibration

Paria Mehrbod ⋅ Pedro Vianna ⋅ Geraldin Nanfack ⋅ Guy Wolf ⋅ Eugene Belilovsky

Domain adaptation is a key strategy for enhancing the generalizability of deep learning models in real-world scenarios, where test distributions often diverge significantly from the training domain. However, conventional approaches typically rely on prior knowledge of the target domain or require model retraining, limiting their practicality in dynamic or resource-constrained environments. Recent test-time adaptation methods based on batch normalization statistic updates allow for unsupervised adaptation, but they often fail to capture complex activation distributions and are constrained to specific normalization layers. We propose Adaptive Quantile Recalibration (AQR), a test-time adaptation technique that modifies pre-activation distributions by aligning quantiles on a channel-wise basis. AQR captures the full shape of activation distributions and generalizes across architectures employing BatchNorm, GroupNorm, or LayerNorm. To address the challenge of estimating distribution tails under varying batch sizes, AQR incorporates a robust tail calibration strategy that improves stability and precision. Our method leverages source-domain statistics computed at training time, enabling unsupervised adaptation without retraining models. Experiments on CIFAR-10-C, CIFAR-100-C, and ImageNet-C across multiple architectures demonstrate that AQR achieves robust adaptation across diverse settings, outperforming existing test-time adaptation baselines. These results highlight AQR’s potential for deployment in real-world scenarios with dynamic and unpredictable data distributions.
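To illustrate the general idea of channel-wise quantile alignment, the sketch below maps each test activation in one channel to the source-domain value at the same empirical quantile. It is a simplified rank-based version written for this listing: the function names and numbers are hypothetical, and the tail calibration strategy mentioned in the abstract is not modeled.

```python
def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an ascending list, with q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def recalibrate_channel(test_acts, source_quantiles):
    """Replace each test activation with the source value at its empirical quantile.

    test_acts: activations of one channel in the current test batch.
    source_quantiles: ascending source-domain quantile values for that channel,
    assumed to be stored at training time on an evenly spaced grid over [0, 1].
    """
    n = len(test_acts)
    order = sorted(range(n), key=lambda i: test_acts[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        q = rank / (n - 1) if n > 1 else 0.5
        out[i] = quantile(source_quantiles, q)
    return out


# Hypothetical channel: source quantiles at levels 0, 0.5 and 1,
# and a test batch whose activations have drifted upward by about 11.
source_q = [-1.0, 0.0, 1.0]
shifted = [12.0, 10.0, 11.0]
aligned = recalibrate_channel(shifted, source_q)  # [1.0, -1.0, 0.0]
```

Because the mapping depends only on ranks within the batch, it preserves the ordering of activations while matching the full shape of the source distribution rather than only its mean and variance.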

Pre-trained text-to-image (T2I) diffusion models produce high-quality images from random noise guided by textual prompts, but classifier-free guidance (CFG) often yields suboptimal samples. Retraining diffusion models can enhance performance, but the associated computational cost is often prohibitive. To address this, researchers have proposed modifying the CFG step through perturbation techniques for cost-effective sampling improvements. For instance, Smoothed Energy Guidance (SEG) enhances image quality by reducing the curvature of the energy landscape in self-attention layers. However, SEG’s indiscriminate curvature reduction can suppress fine details and cause prompt misalignment. We introduce Orthogonal Smoothed Energy Guidance (OSEG), which selectively smooths the orthogonal components of token embeddings to preserve critical details while maintaining prompt alignment. Theoretical analysis justifies OSEG’s effectiveness, supported by empirical comparisons with recent methods. By plugging into text-to-video (T2V) models, OSEG consistently improves visual and temporal coherence. Finally, we present extensive metric comparisons to demonstrate the efficiency of OSEG for both image and video domains.

Volumetric medical image segmentation remains a challenging and critical task in both clinical and research settings due to the inherent complexity of anatomical structures, modality-specific variability, and the need to capture both fine-grained local details and long-range spatial dependencies across 3D volumes. To address these challenges, we propose Hymavi, a novel hybrid architecture that combines Mamba-based sequence modeling with attention mechanisms in a parallel design. This dual-branch structure enables Hymavi to simultaneously leverage the high-resolution spatial reasoning capabilities of attention and the efficient global context modeling afforded by Mamba's recurrent-style architecture. In addition to its architectural innovations, Hymavi incorporates a multi-view learning strategy that leverages sagittal and coronal perspectives alongside the conventional axial view. This multi-view fusion enriches volumetric representation by integrating complementary anatomical information from different orientations, allowing the network to better capture inter-slice continuity and organ-specific variations. Extensive experiments on three widely used benchmark datasets including ACDC, BraTS2023, and AMOS22 demonstrate the effectiveness and strong generalization ability of our method across diverse segmentation tasks and imaging modalities. These results underscore the potential of Hymavi as a powerful tool for advancing automated medical image analysis. The code is publicly available at https://anonymous.4open.science/r/Hymavi_segmentation-C78E.


21
RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding

Xi Xiao ⋅ Yunbei Zhang ⋅ Janet Wang ⋅ Lin Zhao ⋅ Yuxiang Wei ⋅ Hengjia Li ⋅ Yanshu Li ⋅ Xiao Wang ⋅ Swalpa Roy ⋅ Hao Xu ⋅ Tianyang Wang

Accurate road damage detection is crucial for timely infrastructure maintenance and public safety, but existing vision-only datasets and models lack the rich contextual understanding that textual information can provide. To address this limitation, we introduce RoadBench, the first multimodal benchmark for comprehensive road damage understanding. This dataset pairs high-resolution images of road damages with detailed textual descriptions, providing a richer context for model training. We also present RoadCLIP, a novel vision-language model that builds upon CLIP by integrating domain-specific enhancements. It includes a disease-aware positional encoding that captures spatial patterns of road defects and a mechanism for injecting road-condition priors to refine the model’s understanding of road damages. We further employ a GPT-driven data generation pipeline to expand the image–text pairs in RoadBench, greatly increasing data diversity without exhaustive manual annotation. Experiments demonstrate that RoadCLIP achieves state-of-the-art performance on road damage recognition tasks, significantly outperforming existing vision-only models by 19.2%. These results highlight the advantages of integrating visual and textual information for enhanced road condition analysis, setting new benchmarks for the field and paving the way for more effective infrastructure monitoring through multimodal learning.


22
IMKD: Intensity-Aware Multi-Level Knowledge Distillation for Camera-Radar Fusion

Shashank Mishra ⋅ Karan Patil ⋅ Didier Stricker ⋅ Jason Rambach

High-performance Radar-Camera 3D object detection can be achieved by leveraging knowledge distillation without using LiDAR at inference time. However, existing distillation methods typically transfer modality-specific features directly to each sensor, which can distort their unique characteristics and degrade their individual strengths. To address this, we introduce IMKD, a radar-camera fusion framework based on multi-level knowledge distillation that preserves each sensor’s intrinsic characteristics while amplifying their complementary strengths. IMKD applies a three-stage, intensity-aware distillation strategy to enrich the fused representation across the architecture: (1) LiDAR-to-Radar intensity-aware feature distillation to enhance radar representations with fine-grained structural cues, (2) LiDAR-to-Fused feature intensity-guided distillation to selectively highlight useful geometry and depth information at the fusion level, fostering complementarity between the modalities rather than forcing them to align, and (3) a Camera-Radar intensity-guided fusion mechanism that facilitates effective feature alignment and calibration. Extensive experiments on the nuScenes benchmark show that IMKD reaches 67.0% NDS and 61.0% mAP, outperforming all prior distillation-based radar-camera fusion methods. Our code and models will be publicly released.


23
LogicCBMs: Logic-Enhanced Concept-Based Learning

Deepika Vemuri ⋅ Gautham Bellamkonda ⋅ Aditya Pola ⋅ Vineeth Balasubramanian

Concept Bottleneck Models (CBMs) provide a basis for semantic abstractions within a neural network architecture. Such models have primarily been seen through the lens of interpretability so far, wherein they offer transparency by inferring predictions as a linear combination of semantic concepts. However, a linear combination is inherently limiting. We therefore propose enhancing concept-based learning models through propositional logic. We introduce a logic module that is carefully designed to connect the learned concepts from CBMs through differentiable logic operations, such that our proposed LogicCBM can go beyond simple weighted combinations of concepts to leverage various logical operations to yield the final predictions, while maintaining end-to-end learnability. Composing concepts using a set of logic operators enables the model to capture inter-concept relations, while simultaneously improving the expressivity of the model in terms of logic operations. Our empirical studies on well-known benchmarks and synthetic datasets demonstrate that these models have better accuracy, perform effective interventions, and are highly interpretable.
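The abstract does not say which differentiable relaxation of the logic operators is used. As an illustration only, the sketch below uses the common product-based (probabilistic) relaxation, in which concept activations in [0, 1] are combined with smooth versions of NOT, AND, OR, and XOR that agree with Boolean logic at 0 and 1. The operator choice, the concept names, and the activation values are assumptions for this example.

```python
def soft_not(a):
    """Differentiable NOT for a in [0, 1]."""
    return 1.0 - a


def soft_and(a, b):
    """Product relaxation of AND."""
    return a * b


def soft_or(a, b):
    """Probabilistic-sum relaxation of OR."""
    return a + b - a * b


def soft_xor(a, b):
    """Relaxation of XOR that matches the truth table at the corners."""
    return a + b - 2.0 * a * b


# Hypothetical concept activations predicted by a concept bottleneck.
concepts = {"has_wings": 0.9, "has_beak": 0.8, "has_fur": 0.1}

# Class evidence as a logical composition rather than a weighted sum:
# (has_wings AND has_beak) AND NOT has_fur
bird_score = soft_and(
    soft_and(concepts["has_wings"], concepts["has_beak"]),
    soft_not(concepts["has_fur"]),
)  # roughly 0.648
```

Because every operator is a polynomial in its inputs, gradients flow through the composition, which is what allows a module of this kind to be trained end to end.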


24
Understanding the Visual Projection Space of Multimodal LLMs

SungHeon Jeong ⋅ Yoojeong Song ⋅ Yoojeong Song

What does a single vision token really do inside a multimodal large language model (MLLM)? Despite their success, today’s MLLMs rely on a surprisingly simple mechanism: a single projected visual feature $z=P(f_x)$ prepended to the text. But whether this vector merely adds context or actively steers generation remains an open question. In this work, we peel back the layers of this design and introduce a geometric probing framework that reveals how $z$ shapes the model’s output token space. Through latent–token alignment, subspace sensitivity, and controlled perturbations, we uncover distinct vision–language coupling patterns across popular MLLMs (LLaVA, BLIP-2, Kosmos-2). Our findings show that projected vision features lie in low-dimensional, anisotropic cones and can either dominate or defer to the language model depending on the architecture. Some models treat $z$ as a rigid prior; others allow it to guide flexibly. These geometric signatures offer a new lens on MLLM behavior—revealing not just what models say, but how vision tells them to say it.


25
NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining

Maksim Kuprashevich ⋅ Grigorii Alekseenko ⋅ Irina Tolstykh ⋅ Georgii Fedorov ⋅ Bulat Suleimanov ⋅ Vladimir Dokholyan ⋅ Aleksandr Gordeev

Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets $\langle$original image, instruction, edited image$\rangle$, yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by $\approx 2.6\times$, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release **NHR-Edit**, an open dataset of 720k high-quality triplets, curated at industrial scale via millions of guided generations and validator passes, and we analyze the pipeline’s stage-wise survival rates, providing a framework for estimating computational effort across different model stacks. In the largest cross-dataset evaluation, it **surpasses all public alternatives**. We also release **Bagel-NHR-Edit**, a fine-tuned Bagel model with state-of-the-art metrics. **Datasets and model are released under the Apache License, Version 2.0. URLs will be added after the review period.**


26
SSMT-Net: A Semi-Supervised Multitask Transformer-Based Network for Thyroid Nodule Segmentation in Ultrasound Images

Muhammad Umar Farooq ⋅ Abd Ur Rehman ⋅ Azka Rehman ⋅ Muhammad Usman ⋅ Dong-Kyu Chae

Accurate thyroid nodule segmentation in ultrasound images is essential for effective diagnosis and treatment planning. While multitask learning has shown promise in improving segmentation performance, several challenges remain unresolved: (a) scarcity of labeled data, (b) lack of integration of domain-specific prior knowledge, and (c) limited robustness in real-world clinical scenarios. To address these, we propose SSMT-Net, a Semi-Supervised Multi-Task Transformer-based Network that leverages unlabeled data to enhance feature extraction during an initial unsupervised phase. In the subsequent supervised phase, our model jointly optimizes thyroid nodule segmentation, thyroid gland segmentation, and nodule size estimation, effectively integrating both local and global contextual cues. This multitask formulation enables the model to generalize better and remain robust across variable clinical conditions. Evaluated on two public datasets, TN3K and DDTI, SSMT-Net sets a new benchmark in thyroid nodule segmentation, achieving up to 3.32% and 1.23% absolute improvements in IoU and DSC, respectively, compared to existing state-of-the-art methods.
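
For readers unfamiliar with the two reported metrics, the following minimal sketch computes IoU and the Dice similarity coefficient (DSC) for a pair of flattened binary masks. It uses the standard definitions only; the function name and the convention of returning 1.0 when both masks are empty are assumptions, and the abstract does not describe the authors' evaluation code.

```python
def iou_and_dice(pred, target):
    """Return (IoU, DSC) for two equal-length binary masks.

    pred, target: flat sequences of 0/1 values (one entry per pixel).
    """
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # ASSUMPTION: two empty masks count as a perfect match.
        return 1.0, 1.0
    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    ground_truth = [1, 0, 1, 0]
    print(iou_and_dice(predicted, ground_truth))
```

For any single pair of masks DSC is at least as large as IoU, so the same segmentation change usually moves the two scores by different amounts.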


27
Gaussian Splatting Map Registration with Orthographic Bird's-Eye-View Renderings

Hugo LEBLOND ⋅ Gilles SIMON ⋅ Renato Martins ⋅ Cedric Demonceaux ⋅ Marie-odile Berger

Gaussian Splatting (GS) is a promising scene representation for visual localization and SLAM. Recent works have explored loop closure detection via Gaussian registration, improving map consistency and accuracy. However, achieving reliable registration given two GS representations from different acquisitions remains challenging. In this paper, we propose a complete pipeline to perform the matching and registration given two GS maps. The proposed method is grounded in generating orthographic bird’s-eye views (BEVs) of optimized Gaussian models. The proposed approach leverages photometric and geometric information extracted directly from the GS to provide a trade-off between accuracy and invariance to different viewing changes (e.g., type of GS map, season, or illumination). Unlike 3D registration methods, which become inefficient as the number of Gaussians grows, our approach leverages 2D orthographic renders, thus considerably reducing the registration complexity. Experiments on two public datasets demonstrate that our method achieves higher accuracy than several existing baselines, while also maintaining better registration results when dealing with GS maps learned by different techniques (e.g., 3DGS to LightGaussian), or GS maps presenting viewing changes such as varying illumination conditions. Code and evaluation setup will be made publicly available.

In this work, we present MUSE (Model-based Uncertainty-aware Similarity Estimation), a training-free framework for model-based zero-shot 2D object detection and segmentation. First, MUSE incorporates 2D multi-view templates from 3D unseen objects and 2D object proposals from the input query image, respectively. In the embedding stage, we propose a new feature embedding scheme which integrates class and patch embeddings. Specifically, the patch embeddings are normalized using the generalized mean pooling (GeM). In the matching stage, a joint similarity score is introduced, which integrates an absolute score and a relative score. Finally, we update the similarity score using an uncertainty-aware object prior. MUSE achieves state-of-the-art performance on the BOP Challenge 2025, ranking first in the Classic Core, H3, and Industrial tracks—without any additional training or fine-tuning. Therefore, we believe that MUSE is a promising framework for zero-shot 2D object detection and segmentation.
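
The generalized mean (GeM) pooling mentioned above has a standard closed form, pooled_d = ((1/N) * sum_i x_{i,d}^p)^(1/p), which the sketch below implements for a list of patch vectors. How MUSE combines the pooled result with the class embedding is not specified in the abstract; the exponent `p`, the clamp `eps`, and the function name are assumptions.

```python
def gem_pool(patches, p=3.0, eps=1e-6):
    """Generalized mean pooling over patch embeddings.

    patches: list of N equal-length vectors (one per patch).
    p:       pooling exponent; p = 1 is average pooling and large p
             approaches max pooling. (ASSUMPTION: 3.0 is a common default,
             not a value taken from the abstract.)
    eps:     lower clamp so fractional powers stay well defined.
    """
    n = len(patches)
    dim = len(patches[0])
    pooled = []
    for d in range(dim):
        mean_pow = sum(max(vec[d], eps) ** p for vec in patches) / n
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


if __name__ == "__main__":
    toy_patches = [[1.0, 3.0], [3.0, 5.0]]
    print("p=1  :", gem_pool(toy_patches, p=1.0))
    print("p=50 :", gem_pool(toy_patches, p=50.0))
```

The single exponent interpolates between averaging all patches and keeping only the strongest response per channel.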


29
MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection

Saad Lahlali ⋅ Alexandre Montgieux ⋅ Nicolas Granger ⋅ Hervé Le Borgne ⋅ Quoc Cuong PHAM

Annotating 3D data remains a costly bottleneck for 3D object detection, motivating the development of weakly supervised annotation methods that rely on more accessible 2D box annotations. However, relying solely on 2D boxes introduces projection ambiguities since a single 2D box can correspond to multiple valid 3D poses. Furthermore, partial object visibility under a single viewpoint setting makes accurate 3D box estimation difficult. We propose MVAT, a novel framework that leverages the temporal multi-view information present in sequential data to address these challenges. Our approach aggregates object-centric point clouds across time to build 3D object representations as dense and complete as possible. A Teacher-Student distillation paradigm is employed: the Teacher network learns from single viewpoints, but its targets are derived from temporally aggregated static objects. The Teacher then generates high-quality pseudo-labels that the Student learns to predict from a single viewpoint for both static and moving objects. The whole framework incorporates a multi-view 2D projection loss to enforce consistency between predicted 3D boxes and all available 2D annotations. Experiments on the nuScenes and Waymo Open datasets demonstrate that MVAT achieves state-of-the-art performance for weakly supervised 3D object detection, significantly narrowing the gap with fully supervised methods without requiring any 3D box annotations.


30
ODEt(ODEl): Shortcutting the Time and the Length in Diffusion and Flow Models for Faster Sampling

Denis Gudovskiy ⋅ Wenzhao Zheng ⋅ Tomoyuki Okuno ⋅ Yohei Nakata ⋅ Kurt Keutzer

Continuous normalizing flows (CNFs) and diffusion models (DMs) generate high-quality data from a noise distribution. However, their sampling process demands multiple iterations to solve an ordinary differential equation (ODE) with high computational complexity. State-of-the-art methods focus on reducing the number of discrete time steps during sampling to improve efficiency. In this work, we explore a complementary direction in which the quality-complexity tradeoff can also be controlled in terms of the neural network length. We achieve this by rewiring the blocks in the transformer-based architecture to solve an inner discretized ODE w.r.t. its depth. Then, we apply a length consistency term during flow matching training, and as a result, the sampling can be performed with an arbitrary number of time steps and transformer blocks. Unlike others, our ODE$_t$(ODE$_l$) approach is solver-agnostic in time dimension and reduces both latency and, importantly, memory usage. CelebA-HQ and ImageNet generation experiments show a latency reduction of up to $2\times$ in the most efficient sampling mode, and FID improvement of up to $2.8$ points for high-quality sampling when applied to prior methods.
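
To make the time-step side of the quality-complexity tradeoff concrete, here is a minimal fixed-step Euler sampler for an ODE dx/dt = v(x, t) on t in [0, 1]. It is a generic textbook solver, not the ODE$_t$(ODE$_l$) method: the toy velocity field stands in for a trained network, and the depth-wise inner ODE described in the abstract is not modelled. All names are hypothetical.

```python
import math


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity(x, t)  # one network evaluation per step
    return x


def toy_velocity(x, t):
    # ASSUMPTION: stand-in for a learned velocity field; exact solution
    # of dx/dt = -x with x(0) = 1 is x(1) = exp(-1).
    return -x


if __name__ == "__main__":
    exact = math.exp(-1.0)
    for steps in (4, 32, 256):
        approx = euler_sample(toy_velocity, 1.0, steps)
        print(steps, "steps, abs error:", abs(approx - exact))
```

Each extra step costs one more evaluation of the velocity model, which is why fewer steps reduce latency at the price of discretization error.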


31
TM-Adapter: Temporal Merge Adapter for Efficient Global Temporal Modeling

WooJoo Hahm ⋅ Seungwoo Jang ⋅ Hyeon Kim ⋅ Daeun Lee ⋅ Kwangsu Kim

We propose the Temporal Merge Adapter (TM-Adapter), a novel framework for image-to-video parameter-efficient transfer learning (PETL), specifically designed for temporal representation learning in video understanding. PETL has emerged as a practical strategy for adapting large-scale vision models to video tasks under limited computational budgets. However, existing PETL approaches suffer from local redundancy caused by highly similar consecutive frames, which limits the modeling of diverse temporal dependencies. To address this limitation, we introduce a lightweight merge-unmerge mechanism that temporally aggregates and redistributes token embeddings, enabling the model to capture diverse temporal patterns by mitigating redundancy. Furthermore, to effectively handle diverse temporal dependencies across different time scales, TM-Adapter introduces a single adapter module with two parallel branches, local and global adapters, each specialized in capturing complementary patterns at different temporal ranges. We validate our approach through experiments on Kinetics-400, Something-Something V2, and HMDB-51, demonstrating competitive performance compared to existing methods while maintaining high parameter efficiency.

With the increasing demand for histopathological specimen examination and diagnostic reporting, Multiple Instance Learning (MIL) has received heightened research focus as a viable solution for AI-centric diagnostic aid. Recently, to improve its performance and make it work more like a pathologist, several MIL approaches based on the use of multiple-resolution images have been proposed, often delivering higher performance than those that use single-resolution images. However, both of these approaches focus only on improving performance, which varies depending on the nature of the data, thereby making it difficult to rely on MIL’s predictions in clinical settings, where patients can vary day to day. In this study, we propose Uncertainty-Focused Calibrated MIL (UFC-MIL), which more closely mimics pathologists’ examination behaviors while providing calibrated diagnostic decisions, using multiple images with different resolutions. UFC-MIL includes a novel patch-wise loss that learns the latent patterns of instances and expresses their uncertainty for classification. In addition, an attention-based architecture with a neighbor patch aggregation module collects features for the classifier. Moreover, aggregated predictions are calibrated through patch-level uncertainty without requiring multiple iterative inferences, which is a key practical advantage. On challenging public datasets, UFC-MIL shows superior performance in model calibration compared to baseline methods while achieving classification accuracy comparable to that of state-of-the-art architectures.


33
Align Video Diffusion Model with Online Video-Centric Preference Optimization

Jiacheng Zhang ⋅ Jie Wu ⋅ Weifeng Chen ⋅ Yatai Ji ⋅ Xuefeng Xiao ⋅ Weilin Huang ⋅ Kai Han

Video diffusion models (VDMs) have demonstrated remarkable capabilities in text-to-video (T2V) generation. Despite their success, VDMs still suffer from degraded image quality and flickering artifacts. To address these issues, some approaches have introduced preference learning to exploit human feedback to enhance video generation. However, these methods primarily adopt routines from the image domain without an in-depth investigation into video-specific preference optimization. In this paper, we reexamine the design of video preference learning from two key aspects: \textit{feedback source} and \textit{feedback tuning methodology}, and present OnlineVPO, a more efficient preference learning framework tailored specifically for VDMs. On the feedback source, we find that the image-level reward model commonly used in existing methods fails to provide a human-aligned video preference signal due to the modality gap. In contrast, video quality assessment (VQA) models show superior alignment with human perception of video quality. Building on this insight, we propose leveraging VQA models as a proxy for human judgment to provide more modality-aligned feedback for VDMs. For the preference tuning, we introduce an online DPO algorithm tailored for VDMs. Compared with existing methods, it scales better to optimizing videos with higher resolution and longer duration, and it mitigates the insufficient optimization caused by off-policy learning through its online preference generation and curriculum preference update designs. Extensive experiments on an open-source video diffusion model demonstrate that OnlineVPO is a simple yet effective and, more importantly, scalable preference learning algorithm for video diffusion models.

Grounding complex, compositional visual queries with multiple objects and relationships is a fundamental challenge for vision-language models. While standard phrase grounding methods excel at localizing single objects, they lack the structural inductive bias to parse intricate relational descriptions, often failing as queries become more descriptive. To address this structural deficit, we focus on scene-graph grounding, a powerful but less-explored formulation where the query is an explicit graph of objects and their relationships. However, existing methods for this task also struggle, paradoxically showing decreased performance as the query graph grows---failing to leverage the very information that should make grounding easier. We introduce SceneProp, a novel method that resolves this issue by reformulating scene-graph grounding as a Maximum a Posteriori (MAP) inference problem in a Markov Random Field (MRF). By performing global inference over the entire query graph, SceneProp finds the optimal assignment of image regions to nodes that jointly satisfies all constraints. This is achieved within an end-to-end framework via a differentiable implementation of the Belief Propagation algorithm. Experiments on four benchmarks show that our dedicated focus on the scene-graph grounding formulation allows SceneProp to significantly outperform prior work. Critically, its accuracy consistently improves with the size and complexity of the query graph, demonstrating for the first time that more relational context can, and should, lead to better grounding.
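
As a small illustration of MAP inference by message passing, the sketch below runs max-product belief propagation in log space on a chain-shaped MRF, where it is exact. This is a simplification made for brevity: SceneProp's query graphs are general graphs and its implementation is differentiable, neither of which is reproduced here. The score tables and names are hypothetical.

```python
def map_chain(unary, pairwise):
    """MAP assignment for a chain MRF via max-product message passing.

    unary[i][s]:       log-score of query node i being matched to region s.
    pairwise[i][s][t]: log-score of node i = s together with node i+1 = t.
    Returns one region index per node.
    """
    msg = list(unary[0])  # best score of any partial assignment ending in s
    back = []             # back[i-1][t] = best predecessor state for t
    for i in range(1, len(unary)):
        new_msg, ptr = [], []
        for t in range(len(unary[i])):
            cands = [msg[s] + pairwise[i - 1][s][t] for s in range(len(msg))]
            best = max(range(len(cands)), key=cands.__getitem__)
            new_msg.append(cands[best] + unary[i][t])
            ptr.append(best)
        msg = new_msg
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(range(len(msg)), key=msg.__getitem__)
    assignment = [state]
    for ptr in reversed(back):
        state = ptr[state]
        assignment.append(state)
    assignment.reverse()
    return assignment


if __name__ == "__main__":
    # Node 0 locally prefers region 0, but the relation term favours (1, 1).
    unary = [[1.0, 0.0], [0.0, 0.0]]
    pairwise = [[[0.0, -5.0], [-5.0, 3.0]]]
    print(map_chain(unary, pairwise))
```

In the toy example the relation term overrides the node's local preference, which is the sense in which global inference over the query graph can use relational context.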


35
GHOST: Getting to the Bottom of Hallucinations with A Multi-round Consistency Benchmark

Vibashan VS ⋅ Nadine Chang ⋅ Jenny Schmalfuss ⋅ Vishal Patel ⋅ Zhiding Yu ⋅ Jose M. Alvarez

Hallucinations remain a central challenge in multimodal large language models (MLLMs), where models generate incorrect or fabricated information not present in the input. Existing benchmarks assess hallucinations only at the image level and lack a holistic object-level evaluation of types, attributes, and relations for multiple individual objects within the image. To address this limitation, we introduce GHOST, a benchmark that focuses on evaluating hallucinations at the object level. GHOST offers a fine-grained assessment by evaluating compositional triplets -- combinations of object types, attributes, and relations tied together for objects within images. We also propose a multi-round consistency-based evaluation framework and introduce the GHOST Consistency Score, a novel metric based on consistency checks using both positive (true) and hard negative (false) statements about the same object. This approach better captures hallucination tendencies by penalizing inconsistent and contradicting responses. Our benchmark includes 765 images and 38,088 questions for comprehensive hallucination evaluation. We conduct extensive experiments on 20 state-of-the-art MLLMs, including GPT-4o and Gemini-1.5-Pro, and reveal significant gaps in object-level understanding and consistency. Our analysis emphasizes the necessity of object-centric evaluation and provides valuable insights into MLLM vision encoders and their sizes in applications where accuracy and reliability are critical. Code and dataset will be released.


36
GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts

Jenna Kang ⋅ Maria Silva ⋅ Patsorn Sangkloy ⋅ Kenneth Chen ⋅ Niall Williams ⋅ Qi Sun

Recent advances in probabilistic generative models have extended capabilities from static image synthesis to text-driven video generation. However, the inherent randomness of their generation process can lead to unpredictable artifacts, such as impossible physics and temporal inconsistency. Progress in addressing these challenges requires systematic benchmarks, yet existing datasets primarily focus on generative images due to the unique spatio-temporal complexities of videos. To bridge this gap, we introduce GeneVA, a large-scale artifact dataset with rich human annotations that focuses on spatio-temporal artifacts in videos generated from natural text prompts. We hope GeneVA can enable and assist critical applications, such as benchmarking model performance and improving generative video quality.


37
Zero‑Shot Domain Generalisation via Prompt-Driven Feature Refinement

Tingrui Qiao ⋅ Di Zhao ⋅ Caroline Walker ⋅ Chris Cunningham ⋅ Yun Sing Koh

Domain generalisation aims to develop models that generalise from source domains to unseen target domains. However, most existing methods assume access to source domain data and require additional training, which may not always be practical. We focus on a more flexible and broadly applicable setting, zero-shot domain generalisation, where models generalise without access to source data, target data, or any additional training. In this work, we propose Prefer (prompt-driven feature refinement), a simple and effective approach which enhances the zero-shot domain generalisation ability of vision-language foundation models. Prefer generates a diverse set of textual prompts for each class by imagining domain-specific variations (e.g., "a painting of a cat under a golden sunset with thick brush strokes"), and uses them to probe the model. We evaluate how reliably each feature channel represents a class across domains by measuring two quantities: (1) how strongly the channel aligns with the original class prompt (e.g., "a photo of a cat") across the generated domain-specific prompts, and (2) how stable the channel remains across those prompts, quantified by its variance. Channels that exhibit both high alignment and low variability are selected at inference time to improve class prediction under domain shift. Without any model updates or external data, Prefer achieves consistent improvements across domain generalisation benchmarks, outperforming existing state-of-the-art methods. The source code is available at https://anonymous.4open.science/r/WACV26-Prefer.
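
The two per-channel quantities described above can be sketched as follows. This is only one plausible reading of the abstract: alignment is taken to be the mean per-channel product between the class-prompt embedding and each generated-prompt embedding, stability is the population variance of the channel across generated prompts, and the two are combined as `alignment - penalty * variance`. The scoring rule, `penalty`, and `top_k` are all assumptions rather than details from the paper.

```python
from statistics import mean, pvariance


def select_channels(class_emb, prompt_embs, top_k, penalty=1.0):
    """Pick feature channels that are well aligned and stable across prompts.

    class_emb:   embedding of the original class prompt (length D).
    prompt_embs: embeddings of the generated domain-specific prompts (K x D).
    Returns the sorted indices of the top_k highest-scoring channels.
    """
    scored = []
    for d in range(len(class_emb)):
        values = [emb[d] for emb in prompt_embs]
        # (1) alignment: channel's average contribution to the dot product
        alignment = mean([class_emb[d] * v for v in values])
        # (2) stability: low variance across domain-specific prompts
        variability = pvariance(values)
        scored.append((alignment - penalty * variability, d))
    scored.sort(reverse=True)
    return sorted(d for _, d in scored[:top_k])


if __name__ == "__main__":
    class_prompt = [1.0, 1.0, 1.0]
    generated = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0]]
    print(select_channels(class_prompt, generated, top_k=2))
```

In the toy input, channel 0 is aligned and stable, channel 1 flips sign across prompts, and channel 2 carries no signal, so the unstable channel is the one dropped.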

We introduce Forensim, an attention-based state-space framework for image forgery detection that jointly localizes both manipulated (target) and source regions. Unlike traditional approaches that rely solely on artifact cues to detect spliced or forged areas, Forensim is designed to capture duplication patterns crucial for understanding context. In scenarios like protest imagery, detecting only the forged region—e.g., a duplicated act of violence inserted into a peaceful crowd—can mislead interpretation, highlighting the need for joint source-target localization. Forensim outputs three-class masks (pristine, source, target) and supports detection of both splicing and copy-move forgeries within a unified architecture. We propose a visual state-space model that leverages normalized attention maps to identify internal similarities, paired with a region-based block-attention module to distinguish manipulated regions. This design enables end-to-end training and precise localization. Forensim achieves state-of-the-art performance on standard benchmarks. We also release CMFD_Anything, a new dataset addressing limitations of existing copy-move forgery datasets. Source code and dataset will be released.


39
Distilling Offline Action Detection Models into Real-Time Streaming Models

Deep Patel ⋅ Yasunori Babazazki ⋅ YASUTO NAGASE ⋅ Iain Melvin ⋅ Martin Min

Vision Transformers (ViTs) have achieved state-of-the-art performance in offline video action detection, but their reliance on processing fixed-size clips with full spatio-temporal attention makes them computationally expensive and ill-suited for real-time streaming applications due to massive computational redundancy. This paper introduces a novel framework to adapt these powerful offline models into efficient, online student models through knowledge distillation. We propose two causal, streaming-friendly attention architectures that replace the full self-attention mechanism: (1) a lightweight Temporal Shift Attention that integrates past context with minimal overhead, and (2) a Decomposed Spatial-Temporal Attention that combines intra-frame spatial attention with an LSTM for temporal modeling. Both architectures utilize caching to eliminate redundant operations on a frame-by-frame basis. To maximize knowledge transfer, we introduce an uncertainty-guided distillation process, which formulates the training as a multi-task learning problem. Our resulting models demonstrate significant efficiency gains, achieving up to a 4x improvement in latency and throughput compared to the original offline teacher while ensuring state-of-the-art performance for online methods. Our work provides a practical and effective methodology for deploying high-accuracy transformer models in latency-sensitive, real-world video analysis systems.


40
AuthGuard: Generalizable Deepfake Detection via Language Guidance

Guangyu Shen ⋅ Zhihua Li ⋅ Xiang Xu ⋅ Tianchen Zhao ⋅ Zheng Zhang ⋅ DONGSHENG An ⋅ Zhuowen Tu ⋅ Yifan Xing ⋅ Qin ZHANG

Existing deepfake detection techniques struggle to keep up with ever-evolving, unseen forgery methods. This limitation stems from their reliance on statistical artifacts learned during training, which are often tied to specific generation processes that may not be representative of samples from new, unseen deepfake generation methods encountered at test time. We propose that incorporating language guidance can improve deepfake detection generalization by integrating human-like commonsense reasoning -- such as recognizing logical inconsistencies and perceptual anomalies -- alongside statistical cues. To achieve this, we train an expert deepfake vision encoder by combining discriminative classification with image-text contrastive learning, where the text is generated by generalist MLLMs using few-shot prompting. This allows the encoder to extract both language-describable, commonsense deepfake artifacts and statistical forgery artifacts from pixel-level distributions. To further enhance robustness, we integrate data uncertainty learning into vision-language contrastive learning, mitigating noise in image-text supervision. Our expert vision encoder seamlessly interfaces with an LLM, further enabling more generalized and interpretable deepfake detection while also boosting accuracy. The resulting framework, \textbf{AuthGuard}, achieves state-of-the-art deepfake detection accuracy in both in-distribution and out-of-distribution settings, achieving AUC gains of 6.15\% on the DFDC dataset and 16.68\% on the DF40 dataset. Additionally, \textbf{AuthGuard} significantly enhances deepfake reasoning, improving performance by 24.69\% on the DDVQA dataset.


41
GroupPortrait: Multi-ID Portrait Generation with High Identity Preservation and Fine-Grained Control

Meijia Huang ⋅ Ruida Li ⋅ Bing Ma ⋅ Liangwei Jiang ⋅ Shuo Fang ⋅ Chenguang Ma

Identity-preserving portrait generation has achieved tremendous advancements with the development of diffusion models. However, evolving from single-ID to multi-ID generation remains challenging due to reduced identity preservation and uncontrollable layouts, poses, and expressions of individuals. To address these challenges, we propose GroupPortrait, a novel approach for multi-ID portrait generation with three key innovations: (1) LatentID for high-fidelity identity preservation, (2) Facial Controller enabling layout guidance and fine-grained facial control, and (3) Mask-Attention Controller allocating identity embeddings to specific facial regions. First, the LatentID module improves identity preservation by adding a LatentID loss during training. It maps latent representations to identity features and uses an ID consistency loss for feedback training to improve identity retention. Since the LatentID loss is calculated in latent space, it is more efficient in terms of time and GPU usage than methods that calculate the ID loss in pixel space. Second, to enhance layout and facial controllability, the Facial Controller utilizes 3D Morphable Models (3DMM) to acquire facial shapes, poses, and expressions for each individual, imposing strong spatial conditions during the diffusion process. Finally, we propose a novel Mask-Attention Controller for multi-ID generation, which distributes ID embeddings into target facial regions by aligning the cross-attention map of LatentID with the given facial region masks. Extensive experiments demonstrate that GroupPortrait can generate human images with high fidelity, local harmony, and controllability.


42
GFT-GCN: Privacy-Preserving 3D Face Mesh Recognition with Spectral Diffusion

Hichem Felouat ⋅ Hanrui Wang ⋅ Isao Echizen

3D face recognition offers a promising biometric solution by capturing the geometric structure of facial surfaces, making it robust to lighting conditions, pose variations, and presentation attacks. Its strong resistance to spoofing makes it particularly attractive for deployment in high-security applications. However, as biometric systems store sensitive identity information, the protection of biometric templates becomes critical. In this paper, we present GFT-GCN, a privacy-preserving framework for 3D face recognition that addresses this challenge through a combination of spectral graph learning and diffusion-based template protection. Our method integrates the Graph Fourier Transform (GFT) and Graph Convolutional Networks (GCN) to extract discriminative and compact spectral features from 3D face meshes. To protect these features, we introduce a novel spectral diffusion mechanism that generates irreversible, renewable, and unlinkable templates. A lightweight client-server architecture ensures that raw biometric data never leaves the client device. We evaluate our system on the BU-3DFE and FaceScape datasets, demonstrating high recognition performance and strong resistance to reconstruction attacks. Our results show that GFT-GCN effectively balances privacy and accuracy, providing a practical solution for secure 3D face authentication.


43
Denoise, Divide, Distill, and Predict (D3P): Towards Forecasting Long-horizon Real-world Anomaly from Normalcy

Quentin Mérilleau ⋅ Snehashis Majhi ⋅ Antitza Dantcheva ⋅ Quan Kong ⋅ Lorenzo Garattoni ⋅ Gianpiero Francesca ⋅ Francois Bremond

Forecasting abnormal human behavior (AHB) in unconstrained real-world environments is critical for enabling proactive safety interventions. Unlike short-term anomaly detection, long-horizon forecasting offers a vital reaction window but remains underexplored due to three core challenges: (i) noisy, complex human–agent interactions; (ii) weak temporal coupling between normal observations and distant anomalies; and (iii) data scarcity limiting the scalability of autoregressive models. To address these, we propose $\mathcal{D}^3\mathcal{P}$ (Denoise, Divide, Distill, and Predict), a novel encoder–decoder framework that bridges denoised pasts with distilled autoregressive futures. Our Differential Past Encoder (DiPE) disentangles scene-level and object-level dynamics via differential attention, suppressing irrelevant interactions and enhancing discriminative cues. The Distilled Future Auto-Regressive Decoder (D-FAD) adopts a divide-and-conquer strategy, segmenting future queries into temporal chunks for sequential prediction, while leveraging distillation to balance robustness and latency. We validate our approach on the AHB-F benchmark, the only dataset dedicated to abnormal behavior forecasting, and further integrate D-FAD with several state-of-the-art methods. In all cases, our framework consistently outperforms prior work in both forecasting accuracy and computational efficiency.

System designers and developers need data-driven approaches for user-interface (UI) development and testing. They need trace-based workflows to support UI navigation agents and inputs for UI code generation. Given the high production costs of manually constructing such workflows, the UI agent research community has explored automated, fully-synthetic UI workflow generation. However, there is an open need to characterize the fidelity and effectiveness of these synthetic approaches with respect to the current manual approaches. In this work, we aim to provide larger-scale synthetic workflows based on past human usage. We focus in particular on the desktop application modality, since prior synthetic generation has mostly targeted mobile or web applications. Using video tutorials with permissive licenses, we derive associated UI behaviors to construct a synthetic dataset of UI workflows in desktop applications. We provide data from these videos to a large language model (LLM) to generate a set of "task list" instructions that replicate the videos' actions. We execute the task list instructions in a UI agent within an instrumented desktop operating system. Using the sensor data from that environment, we can enable simpler actuation scripts for replay of the tasks. Along with detailing our approach, we provide a dataset with over 5,000 workflows across a range of desktop applications with costs that are roughly 51% to 70% of manual construction.

Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.


46
FocalComm: Hard Instance-Aware Multi-Agent Perception

Dereje Shenkut ⋅ Vijayakumar Bhagavatula

Multi-agent collaborative perception (CP) is a promising paradigm for improving autonomous driving safety, particularly for vulnerable road users like pedestrians, through robust 3D perception. However, existing CP approaches often optimize for vehicle detection performance metrics, underperforming on smaller, safety-critical objects such as pedestrians, where detection failures can be catastrophic. Furthermore, previous CP methods rely on full feature exchange rather than communicating only salient features that help reduce false negatives. To this end, we present FocalComm, a novel collaborative perception framework that focuses on exchanging hard-instance-oriented features among connected collaborative agents. FocalComm consists of two key novel designs: (1) a learnable progressive hard instance mining (HIM) module to extract hard instance-oriented features per agent, and (2) a query-based feature-level (intermediate) fusion technique that dynamically weights these identified features during collaboration. We show that FocalComm outperforms state-of-the-art collaborative perception methods on two challenging real-world datasets (V2X-Real and DAIR-V2X) across both vehicle-centric and infrastructure-centric collaborative setups. FocalComm also shows strong performance gains in pedestrian detection in V2X-Real. Our code and model checkpoints will be made publicly available.

We propose SFMNet, a novel 3D sparse detector that combines the efficiency of sparse convolutions with the ability to model long-range dependencies. While traditional sparse convolution techniques efficiently capture local structures, they struggle to model the long-range relationships that are essential for 3D object detection. Transformers, in contrast, capture these long-range dependencies through attention mechanisms, but they incur high computational costs due to their quadratic query-key-value interactions. Furthermore, directly applying attention to non-empty voxels is inefficient due to the sparse nature of 3D scenes. Our SFMNet is built on a novel Sparse Focal Modulation (SFM) module, which integrates short- and long-range contexts with linear complexity by leveraging a new hierarchical sparse convolution design. This approach enables SFMNet to achieve high detection performance with improved efficiency, making it well-suited for large-scale LiDAR scenes. We show that our detector achieves state-of-the-art performance on autonomous driving datasets.

Generating expressive, fine-grained human motion from text remains a formidable challenge, particularly when aiming for high fidelity without incurring excessive computational cost. Existing methods often rely on complex, multi-stage pipelines with slow inference and large memory footprints, hindering real-time deployment. To address these limitations, we introduce MoSCo, a simple autoregressive text-to-motion framework that discretizes motion into part-level token sequences and models temporal dynamics via a $\textbf{Delta-based training}$ strategy (i.e., predicting the motion difference from the previous time step) before fusing these tokens with textual embeddings through our $\textbf{Part-Aware Coordinator (PAO)}$ and generating the full sequence with a single, lightweight transformer decoder. MoSCo sets a new milestone in text-to-motion inference speed, achieving an AITS of just $\textbf{0.002s}$ (vs. 0.03s), over an order of magnitude faster than all prior methods, while maintaining a compact model footprint and delivering highly realistic motions ($\textbf{FID 0.085}$), making real-time, high-quality generation practical.
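
As a rough illustration of the delta idea mentioned in this abstract (predicting the difference from the previous time step), the sketch below converts a pose sequence into per-step differences and reconstructs it by accumulating them. It is not MoSCo's implementation: the function names `to_deltas` and `from_deltas`, and the representation of a frame as a flat list of floats, are assumptions made only for this example.

```python
def to_deltas(frames):
    """Split a sequence into its first frame and the per-step differences."""
    first = list(frames[0])
    deltas = [
        [cur - prev for cur, prev in zip(cur_frame, prev_frame)]
        for prev_frame, cur_frame in zip(frames, frames[1:])
    ]
    return first, deltas


def from_deltas(first, deltas):
    """Rebuild the sequence by adding each delta to the previous frame."""
    frames = [list(first)]
    for delta in deltas:
        frames.append([prev + d for prev, d in zip(frames[-1], delta)])
    return frames


# Hypothetical 2-value "poses" over three time steps.
frames = [[0.0, 1.0], [0.5, 1.5], [1.5, 1.0]]
first, deltas = to_deltas(frames)
restored = from_deltas(first, deltas)
```

A model trained on such targets would output `deltas` one step at a time; the accumulation in `from_deltas` is what turns those outputs back into absolute poses.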


49
VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion

Samet Hicsonmez ⋅ Abd El Rahman Shabayek ⋅ Djamila Aouada

Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. We introduce VLMDiff, a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. Specifically, a pre-trained VLM with a simple prompt extracts detailed image descriptions, serving as additional conditioning for LDM training. Current diffusion-based methods rely on synthetic noise generation, limiting their generalization and requiring per-class model training, which hinders scalability. VLMDiff, however, leverages VLMs to obtain normal captions without manual annotations or additional training. These descriptions condition the diffusion model, learning a robust normal image feature representation for multi-class anomaly detection. Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset, outperforming state-of-the-art diffusion-based approaches.

Understanding how specific cognitive tasks activate different brain regions is crucial for neuroscience and clinical applications. Functional Magnetic Resonance Imaging (fMRI) provides valuable insights into these activations. However, acquiring task-based fMRI (tfMRI) is costly, time-consuming, and particularly challenging for individuals with cognitive or motor impairments, such as patients in a coma or those who have suffered a stroke and are unable to perform tasks. Resting-state fMRI (rsfMRI), which is more widely available and does not require task compliance, can be utilized to derive task-related activations. In this work, we propose **BrainSparseCNN**, a sparse 3D convolutional neural network that leverages the unique spatial structure of the brain to predict tfMRI contrasts from rsfMRI. Our approach efficiently processes high-dimensional neuroimaging data while preserving critical spatial relationships. We demonstrate that BrainSparseCNN achieves up to 7\% higher Pearson correlation than prior state-of-the-art, with statistically significant gains ($p < 0.01$) across all tasks. It further improves spatial alignment (Dice score), subject identification accuracy, and saliency interpretability, outperforming surface-based and volumetric baselines. BrainSparseCNN establishes a robust, interpretable, and scalable framework for inferring individual functional activation maps from resting-state data.


51
SceneEdited: A City-Scale Benchmark for 3D HD Map Updating via Image-Guided Change Detection

Chun-Jung Lin ⋅ Tat-Jun Chin ⋅ Sourav Garg ⋅ Feras Dayoub

Accurate, up-to-date High-Definition (HD) maps are critical for urban planning, infrastructure monitoring, and autonomous navigation. However, these maps quickly become outdated as environments evolve, creating a need for robust methods that not only detect changes but also incorporate them into updated 3D representations. While change detection techniques have advanced significantly, there remains a clear gap between detecting changes and actually updating 3D maps, particularly when relying on 2D image-based change detection. To address this gap, we introduce SceneEdited, the first city-scale dataset explicitly designed to support research on HD map maintenance through 3D point cloud updating. SceneEdited contains over 800 up-to-date scenes covering 73 km of driving and approximately 3 $\text{km}^2$ of urban area, with more than 23,000 synthesized object changes created both manually and automatically across 2000+ out-of-date versions, simulating realistic urban modifications such as missing roadside infrastructure, buildings, overpasses, and utility poles. Each scene includes calibrated RGB images, LiDAR scans, and detailed change masks for training and evaluation. We also provide baseline methods using a foundational image-based structure-from-motion pipeline for updating outdated scenes, as well as a comprehensive toolkit supporting scalability, trackability, and portability for future dataset expansion and unification of out-of-date object annotations. Both the dataset and source code will be made publicly available upon acceptance, establishing a standardized benchmark for 3D map updating research.


52
Overcoming Small Data Limitations in Video-Based Infant Respiration Estimation

Liyang Song ⋅ Hardik Bishnoi ⋅ Sai Manne ⋅ Sarah Ostadabbas ⋅ Briana Taylor ⋅ Michael Wan

The development of contactless respiration monitoring for infants could enable advances in the early detection and treatment of breathing irregularities, which are associated with neurodevelopmental impairments and conditions like sudden infant death syndrome (SIDS). But while respiration estimation for adults is supported by a robust ecosystem of computer vision algorithms and video datasets, only one small public video dataset with annotated respiration data for infant subjects exists, and there are no reproducible algorithms which are effective for infants. We introduce the annotated infant respiration dataset of 400 videos (AIR-400), contributing 275 new, carefully annotated videos from 10 recruited subjects to the public corpus. We develop the first reproducible pipelines for infant respiration estimation, based on infant-specific region-of-interest detection and spatiotemporal neural processing enhanced by optical flow inputs. We establish, through comprehensive experiments, the first reproducible benchmarks for the state-of-the-art in vision-based infant respiration estimation. Our dataset, code repository, and trained models will be published at press time.


53
Non-Aligned Reference Image Quality Assessment for Novel View Synthesis

Abhijay Ghildyal ⋅ Rajesh Sureddi ⋅ Nabajeet Barman ⋅ Saman Zadtootaghaj ⋅ Alan Bovik

Evaluating the perceptual quality of Novel View Synthesis (NVS) images remains a key challenge, particularly in the absence of pixel-aligned ground truth references. Full-Reference Image Quality Assessment (FR-IQA) methods fail under misalignment, while No-Reference (NR-IQA) methods struggle with generalization. In this work, we introduce a Non-Aligned Reference (NAR-IQA) framework tailored for NVS, where it is assumed that the reference view shares partial scene content but lacks pixel-level alignment. We constructed a large-scale image dataset containing synthetic distortions targeting Temporal Regions of Interest (TROI) to train our NAR-IQA model. Our model is built on a contrastive learning framework that incorporates LoRA-enhanced DINOv2 embeddings and is guided by supervision from existing IQA methods. We train exclusively on synthetically generated distortions, deliberately avoiding overfitting to specific real NVS samples and thereby enhancing the model’s generalization capability. Our model outperforms state-of-the-art FR-IQA, NR-IQA, and NAR-IQA methods, achieving robust performance on both aligned and non-aligned references. We also conducted a novel user study to gather data on human preferences when viewing non-aligned references in NVS. We find strong correlation between our proposed quality prediction model and the collected subjective ratings.


54
Pointmap-Conditioned Diffusion for Consistent Novel View Synthesis

Thang-Anh-Quan Nguyen ⋅ Laurent Caraffa ⋅ Jean-Philippe Tarel ⋅ Roland Brémond

Synthesizing extrapolated views remains a difficult task, especially in urban driving scenes, where the only reliable sources of data are forward-facing RGB captures and sparse LiDAR points. To address this problem, we present PointmapDiffusion, a novel framework for novel view synthesis that utilizes pre-trained 2D diffusion models. Our method leverages pointmaps (i.e. rasterized 3D scene coordinates) as a conditioning signal, capturing geometric and photometric priors from the reference images to guide the image generation process. With our proposed reference attention blocks and ControlNet for pointmap features, the model generates accurate and consistent results across varying viewpoints while respecting geometric information. Experiments on real-life driving data demonstrate that PointmapDiffusion achieves high-quality generation with flexible control over pointmap conditioning signals (e.g. dense depth map or even sparse LiDAR points) and can be used to distill to 3D representations such as 3D Gaussian Splatting for improving view extrapolation.


55
TED-4DGS: Temporally Activated and Embedding-based Deformation for 4DGS Compression

Cheng-Yuan Ho ⋅ He-Bi Yang ⋅ Jui-Chiu Chiang ⋅ Yu-Lun Liu ⋅ Wen-Hsiao Peng

Building on the success of 3D Gaussian Splatting (3DGS) in static 3D scene representation, its extension to dynamic scenes (commonly referred to as 4DGS or dynamic 3DGS) has attracted increasing attention. However, designing more compact, efficient deformation schemes together with rate-distortion-optimized compression strategies for dynamic 3DGS representations remains an underexplored area. Prior methods either rely on space-time 4DGS with overspecified, short-lived Gaussian primitives or on canonical 3DGS with deformation that lacks explicit temporal control. To address this, we present TED-4DGS, a temporally activated and embedding-based deformation scheme for rate-distortion-optimized 4DGS compression that unifies the strengths of both families. TED-4DGS is built on a sparse anchor-based 3DGS representation. Each canonical anchor is assigned learnable temporal-activation parameters that specify its appearance and disappearance transitions over time, while a lightweight per-anchor temporal embedding queries a shared deformation bank to produce anchor-specific deformation. For rate-distortion compression, we incorporate an implicit neural representation (INR)-based hyperprior to model anchor attribute distributions, along with a channel-wise autoregressive model to capture intra-anchor correlations. With these novel elements, our scheme achieves state-of-the-art rate-distortion performance on several commonly used real-world datasets. To the best of our knowledge, this work represents one of the first attempts to pursue a rate-distortion-optimized compression framework for dynamic 3DGS representations.


56
QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain

Wenfang Sun ⋅ Yingjun Du ⋅ Gaowen Liu ⋅ Yefeng Zheng ⋅ Cees Snoek

We tackle the problem of quantifying the number of objects by a generative text-to-image model. Rather than retraining such a model for each new image domain of interest, which leads to high computational costs and limited scalability, we are the first to consider this problem from a domain-agnostic perspective. We propose QUOTA, an optimization framework for text-to-image models that enables effective object quantification across unseen domains without retraining. It leverages a dual-loop meta-learning strategy to optimize a domain-invariant prompt. Further, by integrating prompt learning with learnable counting and domain tokens, our method captures stylistic variations and maintains accuracy, even for object classes not encountered during training. For evaluation, we adopt a new benchmark QUANT-Bench specifically designed for object quantification in domain generalization, enabling rigorous assessment of object quantification accuracy and adaptability across unseen domains in text-to-image generation. Extensive experiments demonstrate that QUOTA outperforms conventional models in both object quantification accuracy and semantic consistency, setting a new benchmark for efficient and scalable text-to-image generation for any domain.


57
Single-step Diffusion for Image Compression at Ultra-Low Bitrates

Chanung Park ⋅ Joo Chan Lee ⋅ Jong Hwan Ko

Although there have been significant advancements in image compression techniques, such as standard and learned codecs, these methods still suffer from severe quality degradation at extremely low bits per pixel. While recent diffusion-based models provide enhanced generative performance at low bitrates, they often yield limited perceptual quality and prohibitive decoding latency due to multiple denoising steps. In this paper, we propose the first single-step diffusion model for image compression that delivers high perceptual quality and fast decoding at ultra-low bitrates. Our approach incorporates two key innovations: (i) Vector-Quantized Residual (VQ-Residual) training, which factorizes a structural base code and a learned residual in latent space, capturing both global geometry and high-frequency details; and (ii) rate-aware noise modulation, which tunes denoising strength to match the desired bitrate. Extensive experiments show that our method achieves comparable compression performance to state-of-the-art methods while improving decoding speed by about 50× compared to prior diffusion-based methods, greatly enhancing the practicality of generative codecs.

The Herculaneum Papyri are a collection of rolled papyrus documents that were charred and buried by the famous eruption of Mount Vesuvius. They promise to contain a wealth of previously unseen Greek and Latin texts, but are extremely fragile and thus most cannot be unrolled physically. A solution to access these texts is virtual unrolling, where the papyrus surface is digitally traced out in a CT scan of the scroll, to create a flattened representation. This tracing is very laborious to do manually in gigavoxel-sized scans, so automated approaches are desirable. We present the first top-down method that automatically fits a surface model to a CT scan of a severely damaged scroll. We take a novel approach that globally fits an explicit parametric model of the deformed scroll to existing neural network predictions of where the rolled papyrus likely passes. Our method guarantees the resulting surface is a single continuous 2D sheet, even passing through regions where the surface is not detectable in the CT scan. We conduct comprehensive experiments on a high-resolution synchrotron CT scan of the scroll \phpf{}, showing that our approach successfully unrolls large regions, and exceeds the performance of the only existing method suitable for this data.


59
Diffusion Noise Optimization for Synthetic VLM Training

Ren Ohkubo ⋅ Rintaro Yanagi ⋅ Hirokatsu Kataoka ⋅ Yutaka Satoh

Recent advances in image generation models have enabled the production of high-quality images, making synthetic images a promising alternative to real images for dataset construction. However, a critical challenge remains: in conventional approaches, the performance of Vision–Language Models (VLMs) tends to degrade as the proportion of synthetic images in a dataset increases. To alleviate this challenge, we introduce a plug-and-play dataset construction framework that enhances text-to-image diffusion models by optimizing their initial noise. Our method treats the initial noise as a learnable parameter and iteratively updates it to maximize text–image alignment based on multiple embedding models without retraining the generator. Since the initial noise plays a crucial role in determining the quality of the synthetic image, its optimization enables the search for initial conditions that yield semantically faithful and realistic images. By improving FID and text–image alignment compared to conventional latent diffusion model (LDM)-based methods, our approach produces synthetic images better suited for training. When CLIP models were trained on such images, they achieved up to +5.09\% higher Average R@1 in zero-shot retrieval, +2.88\% higher Average top-1 accuracy in zero-shot classification, and +5.05\% higher performance in linear-probing. These results demonstrate that initial noise optimization is an effective and scalable strategy for enabling robust VLM training with synthetic images.


60
Histogram Assisted Quality Aware Generative Model for Resolution Invariant NIR Image Colorization

Abhinav Abhinav ⋅ Rajeev Ranjan Dwivedi ⋅ Samiran Das ⋅ Vinod Kurmi

We present HAQAGen, a unified generative model for resolution-invariant NIR-to-RGB colorization that balances chromatic realism with structural fidelity. The proposed model introduces (i) a combined loss term aligning the global color statistics through differentiable histogram matching, perceptual image quality measure, and feature-based similarity to preserve texture information, (ii) local hue–saturation priors injected via Spatially Adaptive Denormalization (SPADE) to stabilize chromatic reconstruction, and (iii) texture-aware supervision within a Mamba backbone to preserve fine details. We introduce an adaptive-resolution inference engine that further enables high-resolution translation without sacrificing quality. Our proposed NIR-to-RGB translation model simultaneously enforces global color statistics and local chromatic consistency, while scaling to native resolutions without compromising texture fidelity or generalization. Extensive evaluations on FANVID, OMSIV, VCIP2020, and RGB2NIR using different evaluation metrics demonstrate consistent improvements over state-of-the-art baseline methods. HAQAGen produces images with sharper textures and natural colors, attaining significant gains on perceptual metrics. These results position HAQAGen as a scalable and effective solution for NIR-to-RGB translation across diverse imaging scenarios.


61
Polymorph: Energy-Efficient Multi-Label Classification for Video Streams on Embedded Devices

Saeid Ghafouri ⋅ Mohsen Fayyaz ⋅ Xiangchen Li ⋅ Deepu John ⋅ Bo Ji ⋅ Dimitrios Nikolopoulos ⋅ Hans Vandierendonck

Real-time multi-label video classification on embedded devices is constrained by limited compute and energy budgets. Yet, video streams exhibit structural properties such as label sparsity, temporal continuity, and label co-occurrence that can be leveraged for more efficient inference. We introduce Polymorph, a context-aware framework that activates a minimal set of lightweight Low Rank Adapters (LoRA) per frame. Each adapter specializes in a subset of classes derived from co-occurrence patterns and is implemented as a LoRA weight over a shared backbone. At runtime, Polymorph dynamically selects and composes only the adapters needed to cover the active labels, avoiding full-model switching and weight merging. This modular strategy improves scalability while reducing latency and energy overhead. Polymorph achieves 40\% lower energy consumption and improves mAP by 9 points over strong baselines on the TAO dataset.
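
The abstract does not say how the adapters covering the active labels are chosen. One simple way to picture the selection step is a greedy set cover, sketched below. This is an assumption for illustration only, not Polymorph's algorithm; the function `select_adapters`, the adapter names, and the class groupings are all hypothetical.

```python
def select_adapters(adapters, active_labels):
    """Greedily pick adapters until every active label is covered.

    adapters: dict mapping an adapter name to the set of classes it handles.
    active_labels: iterable of labels currently present in the stream.
    """
    remaining = set(active_labels)
    chosen = []
    while remaining:
        # Pick the adapter covering the most still-uncovered labels;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(adapters), key=lambda name: len(adapters[name] & remaining))
        gained = adapters[best] & remaining
        if not gained:
            raise ValueError(f"no adapter covers: {sorted(remaining)}")
        chosen.append(best)
        remaining -= gained
    return chosen


# Hypothetical adapters grouped by co-occurring classes.
adapters = {
    "traffic": {"car", "bus", "person"},
    "pets": {"dog", "cat"},
    "sports": {"ball", "person"},
}
picked = select_adapters(adapters, {"car", "person", "dog"})
```

Greedy cover is not guaranteed to find the smallest possible set, but it is cheap enough to run per frame, which is the property that matters on an embedded device.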

Autonomous multi-agent trajectory prediction in open-world scenarios presents persistent challenges, including high behavioral uncertainty, long-horizon dependencies, and the lack of structured guidance during generation. Existing generative approaches often compromise behavioral fidelity in favor of accuracy or diversity, resulting in predictions that are either unrealistic or difficult to control. We propose M²Traj, a unified framework that couples a closed-loop conditional diffusion model with structured trajectory reasoning and behavior-driven constraints. M²Traj features a history-guided encoder that captures long-range cross-agent dependencies and scene semantics, and a dynamic closed-loop rollout mechanism that refines predictions through goal-conditioned denoising with iterative feedback. To enable fine-grained control, we introduce a learnable behavior guidance module that softly enforces constraints on velocity, collision risk, comfort, and traffic rule adherence. By jointly modeling agent interactions, future constraints, and uncertainty within a structured generative process, M²Traj delivers controllable and reliable predictions across diverse urban scenarios. Extensive experiments on three large-scale benchmarks (Waymo, HighD, and MoCAD) demonstrate that M²Traj achieves competitive or superior performance across standard accuracy, diversity, and behavior-sensitive metrics, highlighting its potential as a generalizable solution for controllable, structure-aware trajectory prediction in complex multi-agent environments.


63
ChartQA-X: Generating Explanations for Visual Chart Reasoning

Shamanthak Hegde ⋅ Pooyan Fazli ⋅ Hasti Seifi

The ability to explain complex information from chart images is vital for effective data-driven decision-making. In this work, we address the challenge of generating detailed explanations alongside answering questions about charts. We present ChartQA-X, a comprehensive dataset comprising 30,299 chart samples across four chart types, each paired with contextually relevant questions, answers, and explanations. Explanations are generated and selected based on metrics such as faithfulness, informativeness, coherence, and perplexity. Our human evaluation with 245 participants shows that model-generated explanations in ChartQA-X surpass human-written explanations in accuracy and logic and are comparable in terms of clarity and overall quality. Moreover, models fine-tuned on ChartQA-X show substantial improvements across various metrics, including absolute gains of up to 24.57 points in explanation quality, 18.96 percentage points in question-answering accuracy, and 14.75 percentage points on unseen benchmarks for the same task. By integrating explanatory narratives with answers, our approach enables agents to communicate complex visual information more effectively, improving comprehension and fostering greater trust in the generated responses.


64
Distribution Highlighted Reference-based Label Distribution Learning for Facial Age Estimation

Satoshi Suzuki ⋅ Shin'ya Yamaguchi ⋅ Shoichiro Takeda ⋅ Takuhiro Kaneko ⋅ Shota Orihashi ⋅ Ryo Masumura

Estimating age from a facial image is a fundamental task in computer vision. In this task, age labels have ambiguity because faces of the same individual across similar ages are often difficult to distinguish. To model this ambiguity, label distribution learning (LDL) trains a deep neural network (DNN) using a label distribution, which is the probability that an image belongs to each age, instead of a single age label. However, the heuristic constraints utilized for LDL often fail to accurately model the label ambiguity. We have therefore developed a novel LDL method called distribution highlighted reference-based LDL (DHRL), which introduces an input-dependent constraint by utilizing a reference DNN pre-trained with any LDL method and minimizing the gap between the reference and target DNNs' outputs. DHRL incorporates two techniques to highlight the label ambiguity hidden in the pre-trained reference DNN's output: noisy augmentation-based ensembling (NAE) and different scale multi-temperature (DSM). NAE inputs noisy images to the reference DNN and provides an ensemble effect by averaging all the outputs to highlight hidden information about the label ambiguity. DSM sets multiple temperatures simultaneously in the gap minimization between the two DNNs' outputs and highlights various information about the label ambiguity. Experimental results indicate that our method achieves state-of-the-art performance across various datasets and conditions.
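
To make the multi-temperature idea in this abstract more concrete, the sketch below measures the gap between a reference and a target output at several softmax temperatures and averages the results. The abstract does not give the loss, so the use of KL divergence, the averaging, the temperature values, and the name `multi_temperature_gap` are all assumptions for illustration rather than the DSM formulation itself.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperatures give flatter outputs."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def multi_temperature_gap(reference_logits, target_logits, temperatures=(1.0, 2.0, 4.0)):
    """Average the reference-to-target KL gap over several temperatures."""
    gaps = [
        kl_divergence(softmax(reference_logits, t), softmax(target_logits, t))
        for t in temperatures
    ]
    return sum(gaps) / len(gaps)


# Hypothetical logits over three age bins.
reference = [2.0, 0.0, -1.0]
target = [0.5, 1.5, -0.5]
gap = multi_temperature_gap(reference, target)
```

Low temperatures emphasise the most likely age, while high temperatures expose how probability is spread over neighbouring ages, which is one plausible reading of why several temperatures are combined.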


65
T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation

Yubin Chen ⋅ Xuyang Guo ⋅ Zhenmei Shi ⋅ Zhao Song ⋅ Jiahao Zhang

Text-to-video (T2V) models have demonstrated impressive capabilities in generating visually reasonable scenes, while their capability to leverage world knowledge for ensuring semantic consistency and factual accuracy remains largely unexplored. To address this challenge, we propose T2VWorldBench, the first systematic benchmark for evaluating the world knowledge generation capabilities of text-to-video models, covering 6 major categories, 60 subcategories, and 1,200 prompts across a wide range of domains, including physics, nature activity, culture, causality, and object. To address both human preference and scalable evaluation, our benchmark incorporates both human evaluation and automated evaluation using vision-language models (VLMs). We evaluated the 10 most advanced text-to-video models currently available, ranging from open source to commercial models, and found that most models are unable to understand world knowledge and generate truly correct videos. These findings highlight a critical gap in the ability of current text-to-video models to leverage world knowledge, providing valuable research opportunities and entry points for constructing models with robust capabilities for commonsense reasoning and factual generation.


66
UniTabBank: A Large Scale Multi-Lingual, Multi-Layout, Multi-Type, Multi-Format Dataset for Table Detection

Ajoy Mondal ⋅ Saumya Vijay Mundra ⋅ Avijit Dasgupta ⋅ Jawahar CV

Tables play a key role in conveying structured data across documents. Accurate table detection is crucial for downstream tasks like structure recognition and information extraction. However, current datasets lack diversity in format, language, and layout, limiting real-world generalization. This underscores the need for well-annotated datasets that are multi-lingual, layout-diverse, document-agnostic, and format-rich. To address these limitations, we introduce UniTabBank, a large-scale, diverse table detection dataset designed to reflect realistic use cases. UniTabBank is characterized by five key attributes: (i) Multi-Lingual: supporting 28 languages (including Arabic, English, and Hindi); (ii) Multi-Layout: encompassing both single-column and multi-column documents; (iii) Multi-Type: covering a wide range of document genres such as annual reports, books, newspapers, and magazines; (iv) Multi-Format: comprising scanned documents, photographed pages, and PDFs; and finally (v) Scale and Annotation Quality: consisting of 55,443 document page images with 81,179 accurately annotated table instances, offering scale and annotation precision. Additionally, we introduce UniTabDet, a YOLO-based model for table detection, which outperforms state-of-the-art methods on eight out of nine table detection benchmarks. Evaluated in a zero-shot setting on four layout analysis benchmarks, UniTabDet also shows strong generalization across diverse documents without additional fine-tuning. The UniTabBank dataset and the UniTabDet model will be released publicly for community use and research advancement.


67
R-MMA: Enhancing Vision-Language Models with Recurrent Adapters for Few-Shot and Cross-Domain Generalization

Md Fahim ⋅ Md Ishmam ⋅ Mir Sazzat Hossain ⋅ M Ashraful Amin ⋅ Amin Ali ⋅ A K M Mahbubur Rahman

Despite the strong generalization capabilities of pre-trained vision-language models (VLMs) like CLIP, adapting these models for few-shot generalization tasks still presents a fundamental challenge. This challenge, often termed the discrimination–generalization dilemma, highlights the need to fine-tune task-specific knowledge while simultaneously preserving the model's general, pre-trained knowledge. Prompt learning offers a partial solution but often struggles to capture rich visual-textual interactions. Adapter-based methods like MMA improve alignment by adding learnable modules, but their use of multiple independent adapters increases parameter overhead and can limit transferability due to naive fusion of adapted and frozen features. To address these limitations, we introduce Recurrent Multi-Modal Adapter (R-MMA), a lightweight and efficient extension of MMA that enhances both performance and generalization. R-MMA employs a recurrent adapter module with shared weights across multiple layers of the image and text encoders. This design substantially reduces the parameter count while maintaining high expressive capacity. Additionally, R-MMA integrates an attention-based alignment mechanism to harmonize the adapter outputs with the frozen encoder features before fusion. This ensures better preservation of the pre-trained representations and enhances cross-modal consistency. Extensive experiments across 15 datasets on diverse tasks, including few-shot learning, generalization to novel classes and domains, and dataset transfer scenarios, demonstrate that R-MMA consistently surpasses state-of-the-art baselines, achieving strong performance with improved efficiency and a better balance between adaptation and generalization. Our work achieves one of the highest forms of parameter efficiency with only three trainable weight matrices for the whole network, regardless of the network depth.


69
Training-Free Few-Shot Segmentation via Vision-Language Guided Prompting

Euihyun Yoon ⋅ Taejin Park ⋅ Jaekoo Lee

Object segmentation relies heavily on costly pixel-level annotations and struggles to generalize to unseen domains. The recent introduction of the Segment Anything Model (SAM), a foundation model for segmentation, offers a prompt-driven, zero-shot capability that has been applied in various domains (e.g., autonomous driving, satellite imagery, medical imaging) and extended to Few-Shot Segmentation (FSS) tasks. However, existing SAM-based FSS methods typically generate prompts by using a vision encoder to measure support–query image similarity, which often biases towards the support images and fails when there are significant support–query context shifts. To address this limitation, we propose a training-free FSS approach that combines visual and textual cues to generate effective prompts for the target class. By leveraging both vision and language information, our approach bridges the support–query gap and guides SAM to segment novel objects more reliably. Without any additional training, our method outperforms previous state-of-the-art FSS methods on established benchmarks ($COCO\text{-}20^i$, $Pascal\text{-}5^i$), demonstrating its effectiveness and robust generalization. Our code is publicly available on GitHub.

Object detection in dynamic construction environments presents significant challenges due to vast scale variations, occlusions, and clutter. Conventional deep learning models struggle to balance the semantic information needed for classification with the spatial detail required for localization. This paper introduces a novel framework that systematically fuses features from different network depths to resolve this trade-off. Our primary contribution is a Hierarchical Feature Adjustment architecture that employs a coarse-to-fine strategy, progressively adjusting detections. We enhance robustness with an Efficient RoI Aggregation module for contextual aggregation and improve localization with a Modified IoU loss. Furthermore, a proposed Overlap Discriminating Module aids non-maximum suppression in dense scenes. Extensive experiments on the SODA, COD, and Small Tools datasets show our integrated approach significantly outperforms state-of-the-art methods, establishing a new benchmark for this critical application.


71
Mean-Shift Distillation for Diffusion Mode Seeking

Vikas Thamizharasan ⋅ Nikitas Chatzis ⋅ Iliyan Georgiev ⋅ Matthew Fisher ⋅ Evangelos Kalogerakis ⋅ Difan Liu ⋅ Nanxuan Zhao ⋅ Michal Lukáč

We present mean-shift distillation, a novel diffusion distillation technique that provides a provably good proxy for the gradient of the diffusion output distribution. This is derived directly from mean-shift mode seeking on the distribution, and we show that its extrema are aligned with the modes. We further derive an efficient product distribution sampling procedure to evaluate the gradient. Our method is formulated as a drop-in replacement for score distillation sampling (SDS), requiring neither model retraining nor extensive modification of the sampling procedure. We show that it exhibits superior mode alignment as well as improved convergence in both synthetic and practical setups, yielding higher-fidelity results when applied to both text-to-image and text-to-3D applications with Stable Diffusion.
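
For readers unfamiliar with the mode-seeking procedure the abstract builds on, the following is a minimal sketch of classic Gaussian-kernel mean shift on 1D samples. It is not the paper's distillation method; the sample values, the bandwidth, and the function names `mean_shift_step` and `seek_mode` are illustrative assumptions.

```python
import math

def mean_shift_step(x, samples, bandwidth):
    """One mean-shift update: move x to the kernel-weighted mean of the samples."""
    weights = [math.exp(-0.5 * ((s - x) / bandwidth) ** 2) for s in samples]
    return sum(w * s for w, s in zip(weights, samples)) / sum(weights)

def seek_mode(x, samples, bandwidth=0.5, tol=1e-8, max_iter=1000):
    """Iterate mean-shift updates until the point stops moving."""
    for _ in range(max_iter):
        new_x = mean_shift_step(x, samples, bandwidth)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

# Hypothetical samples drawn around two modes, near -2 and near +2.
samples = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1, 2.0]
left_mode = seek_mode(-1.0, samples)   # starts closer to the left cluster
right_mode = seek_mode(1.0, samples)   # starts closer to the right cluster
```

Each update moves the point toward a local maximum of the kernel density estimate, which is why the two starting points settle near different clusters.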


72
TacticalCalib: End-to-End 6-DoF Camera Pose Regression for Tactical Camera Calibration

Liang Fan ⋅ Xiaoqian Liu ⋅ Zhi Chen ⋅ Lingkai Yang

Sports field calibration is critical for mapping image coordinates to standardized coordinates, enabling precise analysis of player trajectories and tactical formations. However, traditional methods designed for TV broadcast footage rely on sparse field features (corners, lines) that are susceptible to occlusion and viewpoint variations, limiting their effectiveness for tactical camera calibration. To address these limitations, we propose a novel pose-based calibration framework that directly regresses the 6-DoF camera pose from tactical view images. The proposed framework consists of three novel components: (1) a Lanczos-based spatial encoding module that preserves fine-grained geometric structures in the field representation, (2) an offset-based subpixel localization strategy that enhances occlusion robustness by refining keypoints' position to sub-pixel accuracy, and (3) a query-driven pose regression head with self-attention mechanisms that directly estimates camera pose without requiring additional calibration metadata. Extensive experiments on the SoccerNet-2023 and World Cup 2014 benchmarks demonstrate that our method achieves state-of-the-art performance in terms of Jaccard Index at threshold t = 5 (JaC@t), establishing superior accuracy and cross-dataset generalization capabilities.


73
F-INR: Functional Tensor Decomposition for Implicit Neural Representations

Sai Karthikeya Vemuri ⋅ Tim Büchner ⋅ Joachim Denzler

Implicit Neural Representations (INRs) model signals as continuous, differentiable functions. However, monolithic INRs scale poorly with data dimensionality, leading to excessive training costs. We propose F-INR, a framework that addresses this limitation by factorizing a high-dimensional INR into a set of compact, axis-specific sub-networks based on functional tensor decomposition. These sub-networks learn low-dimensional functional components that are then combined via tensor operations. This factorization reduces computational complexity while additionally improving representational capacity. F-INR is both architecture- and decomposition-agnostic. It integrates with various existing INR backbones (e.g., SIREN, WIRE, FINER, Factor Fields) and tensor formats (e.g., CP, TT, Tucker), offering fine-grained control over the speed-accuracy trade-off via the tensor rank and mode. Our experiments show F-INR accelerates training by up to $20\times$ and improves fidelity by over 6.0 dB PSNR compared to state-of-the-art INRs. We validate these gains on diverse tasks, including image representation, 3D geometry reconstruction, and neural radiance fields. We further show F-INR's applicability to scientific computing by modeling complex physics simulations. Thus, F-INR provides a scalable, flexible, and efficient framework for high-dimensional signal modeling.
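
The factorization idea can be illustrated with a small sketch of a rank-2 CP-style functional decomposition. It assumes closed-form axis functions standing in for the learned sub-networks; the names `u`, `v`, and `cp_combine` are hypothetical and not taken from the paper. The target signal sin(x + y) factorizes exactly as sin(x)cos(y) + cos(x)sin(y).

```python
import math

def u(x):
    """Rank-2 functional components along the x axis (stand-in for a sub-network)."""
    return [math.sin(x), math.cos(x)]

def v(y):
    """Rank-2 functional components along the y axis (stand-in for a sub-network)."""
    return [math.cos(y), math.sin(y)]

def cp_combine(ux, vy):
    """CP combination: sum over the rank index of the per-axis components."""
    return sum(a * b for a, b in zip(ux, vy))

xs = [0.1 * i for i in range(50)]
ys = [0.1 * j for j in range(40)]

# Each axis function is evaluated once per coordinate (50 + 40 evaluations),
# then combined to fill the full 50 x 40 grid.
U = [u(x) for x in xs]
V = [v(y) for y in ys]
grid = [[cp_combine(ux, vy) for vy in V] for ux in U]

max_err = max(
    abs(grid[i][j] - math.sin(xs[i] + ys[j]))
    for i in range(len(xs))
    for j in range(len(ys))
)
```

The point of the sketch is the cost structure: the per-axis components are evaluated on 90 coordinates in total, while a monolithic model would be queried at all 2000 grid points.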

Realistic scene simulation is a promising way to improve autonomous driving. While existing diffusion-based 2D augmentation and 3D asset libraries show potential for synthesizing diverse driving scenarios, they often struggle with multi-view photorealistic rendering and consistency. These issues are particularly challenging for vehicle-to-everything (V2X) collaborative perception, since its effectiveness relies on precise geometric alignment and visual coherence across multiple viewpoints. To address these challenges, we propose V2XScene, a 3D driving scene editing framework. This framework enhances V2X collaborative perception by enabling high-quality 3D vehicle asset generation and consistent multi-view insertion. V2XScene consists of three components: a visual question answering (VQA) guided generation module for query-driven 3D vehicle asset synthesis; a 3D object mapping module for vehicle placement optimization and occlusion reasoning; and a realistic insertion module for lighting estimation and virtual vehicle insertion. Extensive experiments demonstrate that V2XScene can generate multi-view consistent and realistic driving scenes, which significantly improves V2X perception accuracy.


75
WiSAR3D - Aerial LiDAR dataset for 3D object detection

Oren Shrout ⋅ Ori Nizan ⋅ Yizhak Ben-Shabat ⋅ Ayellet Tal

Wilderness Search and Rescue (WiSAR) operations aim to locate and rescue individuals in remote, rugged environments. We introduce WiSAR3D, the first aerial LiDAR 3D dataset specifically tailored for 3D object detection in rural scenes. WiSAR3D comprises 2633 strips (contiguous point clouds collected along the flight path), all fully annotated with 67.5K 3D bounding boxes for 22 categories. As the data is captured with multiple returns (echoes), it enables detection under foliage, which is impossible with RGB cameras. Additionally, as this is the first-of-its-kind dataset, we provide benchmark evaluations for state-of-the-art methods developed for autonomous cars on our aerial dataset. The results suggest that specialized models are needed in this domain, as finding individuals remains challenging for current 3D detectors. Some examples can be found at https://wisar3d.github.io/. The full dataset and code will be released upon acceptance.


76
A Deep Network for Object Detection on Inland Waters

Dennis Griesser ⋅ Bastian Goldluecke ⋅ Matthias Franz ⋅ Georg Umlauf

Collisions on inland waters frequently occur due to the absence of clearly defined navigation guidance. To prevent such accidents, optical sensors such as stereo cameras can be employed to detect obstacles at an early stage, enabling timely warnings to vessel operators or the planning of collision avoidance trajectories. In this context, the paper presents a neural network for object detection on inland waters, designed to address the specific challenges of waterborne vehicles, including strong ego-motion and contextual cues like the shoreline appearing in the background. The proposed network leverages a plane sweep approach to integrate multiple views of a scene and predict object locations in bird’s-eye view (BEV) coordinates. Its ability to incorporate more than two camera perspectives is demonstrated using the KITTI dataset. On real-world inland water data, object detection performance is evaluated against a traditional maritime stereo-based method, showing improved mean average precision.

The explosive growth of online video content has heightened the need for efficient text-to-video retrieval systems. These systems heavily rely on accurate video-text representations for effective retrieval. However, challenges such as the multi-thematic nature of videos and inconsistent caption quality hinder performance by making it difficult to capture comprehensive video-text relationships. To address these issues, we propose CoreCaption, a novel framework that extracts and leverages core captions—the most representative captions capturing essential video themes. Our approach includes a unique core caption extraction method based on similarity-based density estimation and clustering, and introduces the Core Caption Guided Attention (CCGA) mechanism to integrate video-specific semantic information into text queries while preserving their original intent. Furthermore, we employ a teacher-student architecture for efficient inference without reliance on core captions during deployment. Extensive experiments on benchmark datasets like MSR-VTT, VATEX, and MSVD demonstrate that CoreCaption outperforms state-of-the-art methods. These results validate the effectiveness and robustness of our framework across diverse video-text datasets.

Collecting and labeling large real-world wild animal datasets is impractical, costly, error-prone, and labor-intensive. For animal monitoring tasks, such as detection, tracking, and pose estimation, out-of-distribution viewpoints (e.g. aerial) are also typically needed but rarely found in publicly available datasets. To solve this, existing approaches synthesize data with simplistic techniques that then necessitate strategies to bridge the synthetic-to-real gap. Therefore, real images, style constraints, complex animal models, or pre-trained networks are often leveraged. In contrast, we generate a fully synthetic dataset using a 3D photorealistic simulator and demonstrate that it can eliminate such needs for detecting and estimating 2D poses of wild zebras. Moreover, existing top-down 2D pose estimation approaches using synthetic data assume reliable detection models. However, these often fail in out-of-distribution scenarios, e.g. those that include wildlife or aerial imagery. Our method overcomes this by enabling the training of both tasks using the same synthetic dataset. Through extensive benchmarks, we show that models trained from scratch exclusively on our synthetic data generalize well to real images. We perform these using multiple real-world and synthetic datasets, pre-trained and randomly initialized backbones, and different image resolutions. Code, results, models, and data can be found at https://zebrapose.is.tue.mpg.de/.

We introduce FAIR-SIGHT, an innovative post-hoc framework designed to ensure fairness in computer vision systems by combining conformal prediction with a dynamic output repair mechanism. Our approach calculates a fairness-aware non-conformity score that simultaneously assesses prediction errors and fairness violations. Using conformal prediction, we establish an adaptive threshold that provides rigorous finite-sample, distribution-free guarantees. When the non-conformity score for a new image exceeds the calibrated threshold, FAIR-SIGHT implements targeted corrective adjustments, such as logit shifts for classification and confidence recalibration for detection, to reduce both group and individual fairness disparities, all without the need for retraining or having access to internal model parameters. Comprehensive theoretical analysis validates our method's error control and convergence properties. At the same time, extensive empirical evaluations on benchmark datasets show that FAIR-SIGHT significantly reduces fairness disparities while preserving high predictive performance.
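
The calibrated threshold mentioned above can be sketched with standard split conformal prediction. This is a generic illustration, not FAIR-SIGHT's implementation: the fairness-aware score itself is not specified in the abstract, so the calibration scores below are made-up numbers, and `conformal_threshold` and `needs_repair` are hypothetical names.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score.

    Under exchangeability, a new score exceeds this threshold with
    probability at most alpha.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: no finite threshold exists.
        return math.inf
    return sorted(cal_scores)[k - 1]

def needs_repair(score, threshold):
    """Flag an input whose non-conformity score exceeds the calibrated threshold."""
    return score > threshold

# Hypothetical non-conformity scores from a held-out calibration set.
cal_scores = [i / 20 for i in range(1, 20)]
threshold = conformal_threshold(cal_scores, alpha=0.1)
```

With 19 calibration scores and alpha = 0.1, the rule selects the 18th smallest score; any new input scoring above it would be routed to the repair step.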


80
GDoFS: Gaussian DoF Separation for Plausible 3D Geometry in Sparse-View 3DGS

Yongsung Kim ⋅ Jooyoung Choi ⋅ Sungroh Yoon

Recent deep learning-based Multi-View Stereo (MVS) approaches, such as MASt3R and VGGT, have shown strong performance in sparse-view 3D reconstruction. However, refining these outputs with 3D Gaussian Splatting (3DGS) remains non-trivial. The excessive positional degrees of freedom (DoFs) in Gaussians often cause instability and geometric artifacts, sometimes distorting geometry to represent texture patterns. To address this issue, we propose GDoFS (Gaussian DoF Separation), a strategy that divides positional DoFs into two categories—image-plane-parallel and ray-aligned—based on their uncertainty. For each category, GDoFS introduces tailored optimization techniques, including bounded offsets for low-uncertainty DoFs and a visibility-guided loss for ray-aligned DoFs. Experiments on standard benchmarks demonstrate that GDoFS effectively mitigates geometric artifacts and produces reconstructions that are both visually coherent and structurally accurate.


81
Learning Beyond Labels: Self-Supervised Handwritten Text Recognition

Shree Mitra ⋅ Ajoy Mondal ⋅ Jawahar CV

This paper addresses a key challenge in Handwritten Text Recognition (HTR): the dependence on large volumes of labeled data. To overcome this, we propose a self-supervised learning (SSL) framework, LoGo-HTR, that minimizes labeling requirements while achieving strong recognition performance. We introduce a large-scale dataset, SSL-HWD, of 10 million word-level handwritten images from diverse scanned documents, partitioned into a small labeled subset and a much larger unlabeled subset. LoGo-HTR combines a local contrastive loss for spatial consistency and a global decorrelation loss to enhance feature diversity. This dual objective enables robust, invariant, and spatially discriminative feature learning. After self-supervised pretraining, we fine-tune a transformer-based decoder using limited labeled data. Extensive experiments on standard HTR benchmarks (IAM and GHNK) demonstrate that, after SSL pretraining on our unlabeled dataset, our method consistently outperforms state-of-the-art approaches, even when fine-tuned using only 80% and 20% of the available labeled training data from the respective benchmarks. Ablation studies highlight the effectiveness of our dual loss design and demonstrate the potential of scalable, label-efficient handwritten text recognition. SSL-HWD and LoGo-HTR will be released publicly for community use and research advancement.


82
Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment

Sangha Park ⋅ Eunji Kim ⋅ Yeongtak Oh ⋅ Jooyoung Choi ⋅ Sungroh Yoon

Despite substantial progress in text–to–image generation, precise text–image alignment remains challenging, especially for richly compositional prompts or imaginative scenes. To address this, we introduce Negative Prompting for Image Correction (NPC), an automated pipeline that improves alignment by discovering and leveraging negative prompts that suppress unintended content. We first use image–text attention to analyze why both targeted negatives—addressing prompt-related errors—and untargeted negatives—suppressing attributes unrelated to the prompt—improve alignment. To find effective negatives, NPC generates candidates through a verifier–captioner–proposer framework and prioritizes them with a salient text-space score, selecting effective negatives without additional image synthesis. Evaluated on GenEval++ and Imagine-Bench, NPC outperforms strong contemporary baselines: on GenEval++ it attains 0.571 (vs. 0.371 for the strongest baseline) and achieves the best overall performance on Imagine-Bench. By guiding what not to generate, NPC provides a principled, fully automated route to stronger text–image alignment in diffusion models.


83
Neural Geometry Image-Based Representations with Optimal Transport (OT)

Xiang Gao ⋅ Yuanpeng Liu ⋅ Jiazhi Li ⋅ Xinmu Wang ⋅ Minghao Guo ⋅ Yu Guo ⋅ Xiyun Song ⋅ Heather Yu ⋅ Zhiqiang Lao ⋅ David Gu

Neural representations for 3D meshes are emerging as an effective solution for compact storage and efficient processing. Existing methods often rely on neural overfitting, where a coarse mesh is stored and progressively refined through multiple decoder networks. While this can restore high-quality surfaces, it is computationally expensive due to successive decoding passes and the irregular structure of mesh data. In contrast, images have a regular structure that enables powerful super-resolution and restoration frameworks, but applying these advantages to meshes is difficult because their irregular connectivity demands complex encoder–decoder architectures. Our key insight is that a geometry image–based representation transforms irregular meshes into a regular image grid, making efficient image-based neural processing directly applicable. Building on this idea, we introduce our neural geometry image–based representation, which is decoder-free, storage-efficient, and naturally suited for neural processing. It stores a low-resolution geometry-image mipmap of the surface, from which high-quality meshes are restored in a single forward pass. To construct geometry images, we leverage Optimal Transport (OT), which resolves oversampling in flat regions and undersampling in feature-rich regions, and enables continuous levels of detail (LoD) through geometry-image mipmapping. Experimental results demonstrate state-of-the-art storage efficiency and restoration accuracy, measured by compression ratio (CR), Chamfer distance (CD), and Hausdorff distance (HD).
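
As a reference for the two accuracy metrics named above, here is a minimal brute-force sketch on small point sets. Conventions for Chamfer distance vary (squared or unsquared, summed or averaged); this sketch assumes the unsquared, averaged, symmetric form, which may differ from the paper's choice. The point sets are made-up examples.

```python
import math

def nearest_dists(src, dst):
    """Distance from each point in src to its nearest neighbour in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance in both directions."""
    d_ab = nearest_dists(a, b)
    d_ba = nearest_dists(b, a)
    return sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba)

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance: worst-case nearest-neighbour distance."""
    return max(max(nearest_dists(a, b)), max(nearest_dists(b, a)))

# Two hypothetical point sets that differ in a single displaced point.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]

cd = chamfer_distance(a, b)
hd = hausdorff_distance(a, b)
```

Chamfer distance averages the error over the surface samples, while Hausdorff distance reports only the single worst deviation, so the two are usually read together.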


84
RealDroneVision: Dataset and Architecture Advancements for Small-Object Drone Detection

Arun Kumar Sivapuram ⋅ Pranav Peddinti ⋅ Harish Puppala ⋅ Komuravelli Prashanth ⋅ Jaladi Sri Harsha ⋅ Gorthi Subrahmanyam

Drones are increasingly used in civilian and defense domains, but reliable detection remains challenging due to their small size, fast motion, and diverse environments. Existing datasets, such as synthetic benchmarks, fail to capture real-world variability. We introduce RealDroneVision, a unified contribution that advances both dataset and methodology. First, we curate a large-scale real-world drone detection dataset comprising 173,023 images, constructed via a semi-automatic pipeline inspired by self-annotated labeling from videos, enhanced with a human-in-the-loop to iteratively reduce false positives and false negatives. This approach yields high-quality annotations with reduced manual effort. Second, we propose the Nano Object Vision Attention (NOVA) module, a drop-in replacement for YOLOv8’s C2f block. By combining depthwise separable convolutions, scale-aware dilated branches, lightweight mixing, and coordinate-aware attention, our design improves small-object detection while remaining computationally efficient. Extensive benchmarks against YOLOv8m/l and YOLOv9c/e demonstrate that YOLOv8-NOVA dominates across precision (0.912), recall (0.870), and mAP@50 (0.920) while being significantly more lightweight (2.3M params, 5 MB weights). These results establish RealDroneVision as a strong foundation for advancing real-world drone detection research.

In recent years, personalization, which utilizes user-specific data to generate tailored responses, has been increasingly adopted in user-centric domains. However, while Large Language Models (LLMs) are actively researched, the exploration of the personalization capabilities of Large Vision-Language Models (LVLMs) remains limited. To systematically evaluate the personalization ability of LVLMs, we introduce PerVL-Bench, a synthetic benchmark specifically designed for this purpose. PerVL-Bench incorporates user-specific data, including multiple images and long text information, and provides two types of QA pairs. Furthermore, we use PerVL-Bench to comprehensively evaluate the essential capabilities for personalization in current state-of-the-art LVLMs. Through this evaluation, we reveal the limitations of current models in multimodal personalization and provide insights for the development of personalized LVLMs. We release PerVL-Bench and code to advance future research: {link}


86
TRACE: Confounder-free Adversarial Fine-tuning for Robust Object Detection

Wonho Lee ⋅ Jisu Lee ⋅ Hyunsik Na ⋅ Sohee Park ⋅ Daeseon Choi

Adversarial patch attacks critically endanger object detection systems by causing severe mispredictions with small, easily realizable perturbations in both digital and physical environments. Existing defenses such as certified methods or patch detection suffer from high latency, while conventional adversarial training often overfits to specific patches and lacks generalization, particularly in multi-object scenarios. To overcome high latency and poor generalization, we introduce TRACE (Tuning Robustness by Adversarial-patch Confounder Elimination), an adversarial fine-tuning framework that leverages Instrumental Variable Regression in the feature space. TRACE treats patch-related variations—including location, rotation, and brightness—as confounders, thereby eliminating spurious correlations and guiding the model toward causal features that sustain robust detection. Evaluations on YOLOv5 and YOLOv8 show that TRACE consistently outperforms conventional defense methods in both efficiency and robustness under adaptive and unseen patch attacks. Moreover, physical testbed experiments confirm its effectiveness beyond digital settings, highlighting TRACE as a practical solution for achieving generalized robustness in object detection.

Cross-modality medical image segmentation is critical for diagnosis and treatment planning, yet domain shifts and source data restrictions pose significant challenges. This paper introduces Zero-LEAD, the first unified framework for source-free universal domain adaptation (SF-UniDA) in segmentation, addressing all four UDA scenarios, closed-set, partial-set, open-set, and universal-set, without access to source data. Zero-LEAD integrates (1) Label-Efficient Adaptive Decomposition (LEAD) to decompose features into source-known and source-unknown components, and (2) a zero-shot segmentation module leveraging anatomical priors and semantic attributes to segment novel target classes. Extensive experiments across four datasets, Synapse, CHAOS, BTCV, and FLARE22, demonstrate strong performance across all adaptation settings. Zero-LEAD achieves 0.9159 Dice in closed-set (Synapse$\rightarrow$CHAOS), 0.8721 Dice in partial-set (BTCV$\rightarrow$CHAOS), 0.7801 Dice in open-set (Synapse$\rightarrow$BTCV), and 0.7716/0.6866 Dice in universal-set (BTCV$\leftrightarrow$FLARE22), significantly outperforming state-of-the-art baselines. Ablation studies confirm the complementary contributions of LEAD and zero-shot modules, and qualitative analysis highlights improved boundary precision and robustness under both domain and label shifts.


88
Correcting and Quantifying Systematic Errors in 3D Box Annotations for Autonomous Driving

Alexandre Justo Miro ⋅ Ludvig af Klinteberg ⋅ Bogdan Timus ⋅ Aron Asefaw ⋅ Ajinkya Khoche ⋅ Thomas Gustafsson ⋅ Sina Mansouri ⋅ Masoud DANESHTALAB

Accurate annotations are critical to providing ground truth to supervised learning and to evaluating the performance of autonomous vehicle systems. These vehicles are typically equipped with active sensors, such as LiDAR, which scan the environment in predefined patterns. 3D box annotation based on data from such sensors is challenging in dynamic scenarios, where objects are observed at different timestamps, hence different positions. Without proper handling of this phenomenon, systematic errors are prone to being introduced in the annotations. Our work is the first one to describe why such annotation errors occur and illustrates them using examples from widely used, publicly available datasets. Through our novel estimation method, we correct the annotations so that they follow physically feasible trajectories and achieve spatial and temporal consistency with the sensor data. For the first time, we define metrics for this problem, and we evaluate our method on the Argoverse 2 and MAN TruckScenes datasets, as well as on our proprietary dataset. Our approach demonstrates robust performance in increasing the quality of ground truth by more than 17% in these datasets. Finally, we quantify the annotation errors in them and find that the original annotations are misplaced by up to 2.5 m, with highly dynamic objects being the most affected. Our code is provided in the supplementary and will be published upon acceptance.

Recent transformer-based models for 3D Human Mesh Recovery (HMR) have achieved strong performance but often suffer from high computational cost and complexity due to deep transformer architectures and redundant tokens. In this paper, we introduce two HMR-specific merging strategies: Error-Constrained Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe). ECLM selectively merges transformer layers that have minimal impact on the Mean Per Joint Position Error (MPJPE), while Mask-ToMe focuses on merging background tokens that contribute little to the final prediction. To further address the potential performance drop caused by merging, we propose a diffusion-based decoder that incorporates temporal context and leverages pose priors learned from large-scale motion capture datasets. Experiments across multiple benchmarks demonstrate that our method achieves up to 2.3x speed-up while slightly improving performance over the baseline. Code and model weights will be released upon acceptance.

We propose a novel generative approach for 3D human pose estimation. 3D human pose estimation poses several key challenges due to the human body's complex geometry, self-occluding joints, and the requirement for large-scale real-world motion datasets. To address these challenges, we introduce Point2Pose, a novel generative framework that effectively models the distribution of human poses conditioned on sequential point cloud and pose data. Specifically, we employ a spatio-temporal point cloud encoder and a pose feature encoder to extract joint-wise features, followed by an attention-based generative regressor. Additionally, we present a large-scale indoor dataset MVPose3D, which contains multiple modalities, including IMU data of non-trivial human motions, dense multi-view point clouds, and RGB images. Experimental results show that the proposed method outperforms baseline models, demonstrating its superior performance across various datasets.

Large language models, trained on extensive corpora, successfully unify diverse linguistic tasks within a single generative framework. Inspired by this, recent works like Large Vision Model (LVM) extend this paradigm to vision by organizing tasks into sequential visual sentences, where visual prompts serve as the context to guide outputs. However, such modeling requires task-specific pre-training across modalities and sources, which is costly and limits scalability to unseen tasks. Given that pre-trained video generation models inherently capture temporal sequence dependencies, we explore a more unified and scalable alternative: can a pre-trained video generation model adapt to diverse image and video tasks? To answer this, we propose UniVid, a framework that fine-tunes a video diffusion transformer to handle various vision tasks without task-specific modifications. Tasks are represented as visual sentences, where the context sequence defines both the task and the expected output modality. We evaluate the generalization of UniVid from two perspectives: (1) cross-modal inference with contexts composed of both images and videos, extending beyond LVM’s uni-modal setting; (2) cross-source tasks from natural to annotated data, without multi-source pre-training. Despite being trained solely on natural video data, UniVid generalizes well in both settings. Notably, understanding and generation tasks can easily switch by simply reversing the visual sentence order in this paradigm. These findings highlight the potential of pre-trained video generation models to serve as a scalable and unified foundation for vision modeling.

Effective image inversion in rectified flow models (mapping real images to editable latent representations) is crucial for practical image editing applications; however, achieving an optimal balance between reconstruction fidelity and editing flexibility remains a fundamental challenge. In this work, we introduce the Optimal Transport Inversion Pipeline (OTIP), a zero-shot framework that leverages optimal transport theory to guide the inversion process in rectified flow models. Our underlying hypothesis is that incorporating transport-based guidance during the reverse diffusion process can effectively balance reconstruction accuracy and editing controllability through principled trajectory optimization. The method computes optimal transport paths between image and noise distributions while maintaining computational efficiency. Our approach achieves high-fidelity reconstruction with LPIPS scores of 0.001 and SSIM of 0.992 on face editing benchmarks, demonstrating superior preservation of fine-grained details compared to existing methods. We evaluate the framework across multiple editing tasks, observing 7.8\% and 12.9\% improvements in reconstruction loss over RF-Inversion on the LSUN-Bedroom and LSUN-Church datasets, respectively. For semantic face editing, our method achieves an 11.2\% improvement in identity preservation and a 1.6\% enhancement in perceptual quality, while maintaining computational efficiency comparable to baseline approaches. Qualitatively, our method produces visually compelling edits with superior semantic consistency and fine-grained detail preservation across diverse editing scenarios. Code is available at https://anonymous.4open.science/r/OT-Inversion.


93
Synthesizing Compositional Videos from Text Description

Prajwal Singh ⋅ Kuldeep Kulkarni ⋅ Shanmuganathan Raman ⋅ Harsh Rangwani

Existing pre-trained text-to-video diffusion models can generate high-quality videos, but often struggle with misalignment between the generated content and the input text, particularly while composing scenes with multiple objects. To tackle this issue, we propose a straightforward, training-free approach for compositional video generation from text. We introduce Video-ASTAR, for test-time aggregation and segregation of attention with a novel centroid loss to enhance alignment, which enables the generation of multiple objects in the scene, modeling the actions and interactions. Additionally, we extend our approach to the Multi-Action video generation setting, where only the specified action should vary across a sequence of prompts. To ensure coherent action transitions, we introduce a novel token-swapping and latent interpolation strategy. Extensive experiments and ablation studies show that our method significantly outperforms baseline methods, generating videos with improved semantic and compositional consistency alongside improved temporal coherence.


94
S2O: Static to Openable Enhancement for Articulated 3D Objects

Hanxiao Jiang ⋅ Yiming Zhang ⋅ Manolis Savva ⋅ Angel Chang

Despite much progress in large 3D datasets, there are currently few interactive 3D object datasets, and their scale is limited due to the manual effort required in their construction. We introduce the static to openable (S2O) task, which creates interactive articulated 3D objects from static counterparts through openable part detection, motion prediction, and interior geometry completion. We formulate a unified data generation framework to tackle this task, and curate a challenging dataset of openable 3D objects that serves as a test bed for systematic evaluation. Our experiments benchmark methods from prior work, extended and improved methods, and simple yet effective heuristics for the S2O task. We find that turning static 3D objects into interactively openable counterparts is possible, but that all methods struggle to generalize to realistic settings of the task on our dataset, and we highlight promising future work directions. Our work enables efficient creation of interactive 3D objects for robotic manipulation and embodied AI tasks.

Currently, prominent Transformer architectures applied to graphs and meshes for shape analysis tasks employ traditional attention layers that heavily utilize spectral features requiring costly eigenvalue decomposition-based methods. To encode the mesh structure, these methods derive positional embeddings that rely heavily on eigenvalue decomposition-based operations, e.g., on the Laplacian matrix or on heat-kernel signatures, which are then concatenated to the input features. This paper proposes a novel approach inspired by the explicit construction of the Hodge Laplacian operator in Discrete Exterior Calculus as a product of discrete Hodge operators and exterior derivatives, i.e., $(L := \star_0^{-1} d_0^T \star_1 d_0)$. We adapt the Transformer architecture into a novel deep learning layer that utilizes the multi-head attention mechanism to approximate Hodge matrices $\star_0$, $\star_1$ and $\star_2$ and learn families of discrete operators $L$ that act on mesh vertices, edges and faces. Our approach results in a computationally efficient architecture that achieves comparable performance in mesh segmentation and classification tasks, through a direct learning framework, while eliminating the need for costly eigenvalue decomposition operations or complex preprocessing operations.
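As a concrete reading of the operator product $L = \star_0^{-1} d_0^T \star_1 d_0$ quoted in the abstract above, the following minimal sketch assembles it for a single triangle in plain Python. The function name `hodge_laplacian_0`, the diagonal form of the Hodge stars, and the example weights are illustrative assumptions; in the abstract's setting the Hodge matrices are approximated by attention rather than fixed by hand.

```python
def hodge_laplacian_0(num_vertices, edges, star0, star1):
    """Return L = star0^{-1} d0^T star1 d0 as a dense list-of-lists matrix.

    edges: list of oriented (tail, head) vertex pairs.
    star0: diagonal of the vertex Hodge star (one positive value per vertex).
    star1: diagonal of the edge Hodge star (one value per edge).
    Both stars are assumed diagonal here purely for illustration.
    """
    # d0 is the edge-by-vertex signed incidence matrix (discrete exterior derivative).
    d0 = [[0.0] * num_vertices for _ in edges]
    for e, (tail, head) in enumerate(edges):
        d0[e][tail] = -1.0
        d0[e][head] = 1.0

    L = [[0.0] * num_vertices for _ in range(num_vertices)]
    for i in range(num_vertices):
        for j in range(num_vertices):
            # (d0^T star1 d0)[i][j] = sum over edges of d0[e][i] * star1[e] * d0[e][j]
            s = sum(d0[e][i] * star1[e] * d0[e][j] for e in range(len(edges)))
            L[i][j] = s / star0[i]  # left-multiply by the inverse of diagonal star0
    return L


# One triangle with unit Hodge stars: L reduces to the graph Laplacian.
edges = [(0, 1), (1, 2), (2, 0)]
L = hodge_laplacian_0(3, edges, star0=[1.0, 1.0, 1.0], star1=[1.0, 1.0, 1.0])
for row in L:
    print(row)
```

With unit stars every row sums to zero, so constant vertex functions lie in the kernel of $L$; no eigendecomposition is involved in building the operator.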


96
Exploring Automated Recognition of Instructional Activity and Discourse from Multimodal Classroom Data

Ivo Bueno ⋅ Ruikun Hou ⋅ Babette Bühler ⋅ Tim Fütterer ⋅ James Drimalla ⋅ Jonathan Foster ⋅ Peter Youngs ⋅ Peter Gerjets ⋅ Ulrich Trautwein ⋅ Enkelejda Kasneci

Observation of classroom interactions can provide concrete feedback to teachers, but current methods rely on manual annotation, which is resource-intensive and hard to scale. This work explores AI-driven analysis of classroom recordings, focusing on multimodal instructional activity and discourse recognition as a foundation for actionable feedback. Using a densely annotated dataset of 164 hours of video and 68 lesson transcripts, we design parallel, modality-specific pipelines. For video, we evaluate zero-shot multimodal LLMs, fine-tuned vision–language models, and self-supervised video transformers on 24 activity labels. For transcripts, we fine-tune a transformer-based classifier with contextualized inputs and compare it against prompting-based LLMs on 19 discourse labels. To handle class imbalance and multi-label complexity, we apply per-label thresholding, context windows, and imbalance-aware loss functions. Fine-tuned models consistently outperform prompting-based approaches, achieving macro-F1 scores of 0.577 for video and 0.46 for transcripts. These results demonstrate the feasibility of automated classroom analysis and establish a foundation for scalable teacher feedback systems.
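To illustrate what per-label thresholding and macro-F1 mean in a multi-label setting, here is a minimal sketch in plain Python. The helper names (`f1_score`, `pick_thresholds`, `macro_f1`), the threshold grid, and the toy data are assumptions made for illustration and are not taken from the paper.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for one label; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pick_thresholds(val_true, val_prob, grid):
    """For each label column, pick the grid threshold with the best validation F1."""
    thresholds = []
    for k in range(len(val_true[0])):
        truth = [row[k] for row in val_true]
        probs = [row[k] for row in val_prob]
        best = max(grid, key=lambda t: f1_score(truth, [int(p >= t) for p in probs]))
        thresholds.append(best)
    return thresholds


def macro_f1(y_true, y_prob, thresholds):
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    scores = []
    for k, t in enumerate(thresholds):
        truth = [row[k] for row in y_true]
        pred = [int(row[k] >= t) for row in y_prob]
        scores.append(f1_score(truth, pred))
    return sum(scores) / len(scores)


# Toy validation set: 4 samples, 2 labels; the second label is rare and low-scoring.
val_true = [[1, 0], [0, 0], [1, 1], [0, 0]]
val_prob = [[0.9, 0.2], [0.4, 0.1], [0.6, 0.3], [0.3, 0.05]]

thresholds = pick_thresholds(val_true, val_prob, grid=[0.25, 0.5, 0.75])
print(thresholds)                                  # [0.5, 0.25]
print(macro_f1(val_true, val_prob, [0.5, 0.5]))    # 0.5 with one global threshold
print(macro_f1(val_true, val_prob, thresholds))    # 1.0 with per-label thresholds
```

In practice the thresholds would be chosen on held-out validation data and then applied unchanged to the test set; the toy example scores on the same data only to keep the sketch short.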


97
From Prompt to Production: Automating Brand-Safe Marketing Imagery with Text-to-Image Models

Parmida Atighehchain ⋅ Henry Wang ⋅ Andrei Kapustin ⋅ Boris Lerner ⋅ Tiancheng Jiang ⋅ Taylor Jensen ⋅ Negin Sokhandan

Text-to-image models have made significant strides, producing impressive results in generating images from textual descriptions. However, creating a scalable pipeline for deploying these models in production remains a challenge. Achieving the right balance between automation and human feedback is critical to maintain both scale and quality. While automation can handle large volumes, human oversight is still an essential component to ensure that the generated images meet the desired standards and align with the creative vision. This paper presents a new agentic workflow that offers a fully automated, scalable solution for generating advertisement images using text-to-image models. The proposed system maintains the quality and fidelity of images, while also introducing sufficient creative variation to adhere to marketing guidelines. By streamlining this process, we ensure a seamless blend of efficiency and human oversight. On average, models integrated into our workflow achieve a 30.77\% increase in marketing object fidelity, measured using DINOv2, and a 52.00\% increase in human preference for the generated outcome.


98
Equivariant Sampling for Improving Diffusion Model-based Image Restoration

Chenxu Wu ⋅ Qingpeng Kong ⋅ Peiang Zhao ⋅ Wendi Yang ⋅ Wenxin Ma ⋅ Fenghe Tang ⋅ Zihang Jiang ⋅ S Kevin Zhou

Recent advances in generative models, especially diffusion models, have significantly improved image restoration (IR) performance. However, existing problem-agnostic diffusion model-based image restoration (DMIR) methods face challenges in fully leveraging diffusion priors, resulting in suboptimal performance. In this paper, we address the limitations of current problem-agnostic DMIR methods by analyzing their sampling process and providing effective solutions. We introduce EquS, a DMIR method that imposes equivariant information through dual sampling trajectories. To further boost EquS, we propose the Timestep-Aware Schedule (TAS) and introduce EquS$^+$. TAS prioritizes deterministic steps to enhance certainty and sampling efficiency. Extensive experiments on benchmarks demonstrate that our method is compatible with previous problem-agnostic DMIR methods and significantly boosts their performance without increasing computational costs. Our code is available in the Supplementary.

Human pose estimation models are typically retrained from scratch to handle new keypoint definitions, sensing modalities, or deployment domains—a process that is inefficient, compute-intensive, and misaligned with real-world constraints. We present \textbf{ContinualPose}, the first open-source framework and benchmark suite designed for \emph{sustainable pose model adaptation} via continual learning (CL). At its core is \textbf{PoseAdapt}, a suite of domain- and class-incremental benchmarks that simulate realistic adaptation scenarios involving density, lighting, and modality shifts. The framework supports two primary workflows: (i) \textbf{Strategy Benchmarking}, enabling researchers to implement CL methods as plugins and evaluate them under standardized protocols, and (ii) \textbf{Model Adaptation}, allowing practitioners to adapt strong pretrained models to new tasks with minimal supervision. All benchmarks enforce a fixed lightweight backbone, no access to old data, and constrained per-step budgets, isolating the effect of the adaptation strategy. Through extensive experiments, we evaluate popular regularization-based methods under both single-step and sequential adaptation settings, highlighting the challenges of sustaining performance under tight constraints. By bridging modern CL research with the demands of pose estimation, ContinualPose lays the groundwork for adaptable models that evolve over time without repeated full retraining.


100
SAVeD: Learning to Denoise Low-SNR Video for Improved Downstream Performance

Suzanne Stathatos ⋅ Michael Hobley ⋅ Pietro Perona ⋅ Markus Marks

Low signal-to-noise ratio (SNR) videos—such as those from underwater sonar, ultrasound, and microscopy—pose a significant challenge for computer vision models, especially in the absence of paired clean imagery for denoising. We present Spatiotemporal Augmentations and denoising in Video for Downstream Tasks (SAVeD), a novel self-supervised method that denoises low-SNR sensor videos and is trained using only the raw noisy data. By leveraging distinctions between foreground and background motion and exaggerating objects with stronger motion signal, SAVeD enhances foreground object visibility and reduces background and camera noise while not requiring any clean video. SAVeD also has a set of architectural optimizations that lead to much faster throughput, training, and inference than existing deep learning methods. We also introduce a new denoising metric, FBD, which indicates foreground-background divergence without requiring clean imagery. Our approach achieves state-of-the-art results for classification, detection, tracking, and counting tasks and it does so with fewer training resource requirements than existing deep-learning-based denoising methods.


101
ATM: Enhanced Alignment for Text-to-Motion Generation

Ke Han ⋅ Yueming Lyu ⋅ Weichen Yu ⋅ Nicu Sebe

Existing text-to-motion (T2M) generation methods primarily rely on regression-based objectives, such as minimizing positional errors. However, they lack effective semantic supervision and correction mechanisms, often leading to substantial misalignment between text and motion. To address this, we propose $\textbf{Aligned Text-to-Motion (ATM)}$, a semantics-aware generation framework that automatically identifies and corrects text-motion misalignment. ATM incorporates two key components: (1) $\textbf{Inter-motion alignment}$, which detects semantic contradictions across motions and applies adaptive corrections based on the degree of semantic discrepancy, flexibly handling diverse misalignments and ensuring global text-motion consistency; (2) $\textbf{Intra-motion alignment}$, which refines locally missing or inaccurate motion semantics in an unsupervised manner by inferring semantic proxies, effectively addressing the absence of localized textual annotations. ATM is model-agnostic and can be seamlessly integrated into various T2M methods as a plug-and-play module. Extensive experiments on HumanML3D and KIT demonstrate that ATM consistently improves both generation quality and text-motion alignment. Code will be released upon acceptance.


102
Memoire: Learning User Personas from Gallery Tags for Personalized Photo Curation

Praful Mathur ⋅ Mohsin Iftekhar ⋅ Aman Sharma ⋅ Sarvesh Tiwari ⋅ Meghali Deka ⋅ Sathish Cherukuri ⋅ Roopa Sheshadri ⋅ Rakesh Valusa

We introduce Memoire, a fully automatic, on-device system for personalized photo selection that learns a user’s persona directly from gallery tags (people & relations, locations, and events) and ranks images by personal impact rather than generic aesthetics. Memoire constructs a per-user tag graph and trains PERSONA-GAT to produce tag importance scores summarizing user preferences across the gallery images. These scores are projected to pixels via PAT, a deterministic grounding module that fuses tags and their importance into personal attention maps. To obtain scalable supervision without collecting user labels, we synthesize virtual-user galleries (diverse identities, events, and locations) and use a vision–language model to annotate image pairs as High vs. Low personal impact conditioned on their personal attention maps. An impact predictor is then trained with a pairwise ranking loss and coupled with a diversity-aware selector to deliver non-redundant top-k image selection. To maintain user privacy, persona learning and inference run entirely on device. Synthetic data and VLM supervision are used only for training the impact predictor. On a real 100-user gallery study, Memoire outperforms strong aesthetics, memorability, and multimodal baselines.
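The abstract does not state which pairwise ranking loss is used. As one common choice, a logistic pairwise loss on a High/Low image pair can be sketched as follows; the function name `pairwise_ranking_loss` and the use of scalar impact scores are assumptions made for illustration only.

```python
import math


def pairwise_ranking_loss(score_high, score_low):
    """Logistic pairwise loss, log(1 + exp(-(score_high - score_low))).

    The loss is small when the image annotated as High impact receives a
    larger predicted score than the image annotated as Low impact.
    """
    margin = score_high - score_low
    # Numerically stable form of log(1 + exp(-margin)) for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# A correctly ordered pair is penalized less than an inverted one.
print(pairwise_ranking_loss(2.0, 0.0) < pairwise_ranking_loss(0.0, 2.0))  # True
```

Because only the score difference enters the loss, the predictor is trained to order images rather than to match absolute impact values.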


103
CSGaussian: Progressive Rate-Distortion Compression and Segmentation for 3D Gaussian Splatting

Yu-Jen Tseng ⋅ Chia-Hao Kao ⋅ Jing-Zhong Chen ⋅ Alessandro Gnutti ⋅ Shao-Yuan Lo ⋅ Yen-Yu Lin ⋅ Wen-Hsiao Peng

We present the first unified framework for rate-distortion-optimized compression and segmentation of 3D Gaussian Splatting (3DGS). While 3DGS has proven effective for both real-time rendering and semantic scene understanding, prior works have largely treated these tasks independently, leaving their joint consideration unexplored. Inspired by recent advances in rate-distortion-optimized 3DGS compression, this work integrates semantic learning into the compression pipeline to support decoder-side applications--such as scene editing and manipulation--that extend beyond traditional scene reconstruction and view synthesis. Our scheme features a lightweight implicit neural representation-based hyperprior, enabling efficient entropy coding of both color and semantic attributes while avoiding the costly grid-based hyperpriors used in many prior works. To facilitate compression and segmentation, we further develop compression-guided segmentation learning, consisting of quantization-aware training to enhance feature separability and a quality-aware weighting mechanism to suppress unreliable Gaussian primitives. Extensive experiments on the LERF and 3D-OVS datasets demonstrate that our approach significantly reduces transmission cost while preserving high rendering quality and strong segmentation performance.

Existing video highlight detection methods, although advanced, struggle to generalize well to all test videos. These methods typically employ a generic highlight detection model for each test video, which is suboptimal as it fails to account for the unique characteristics and variations of individual test videos. Such fixed models do not adapt to the diverse content, styles, or audio and visual qualities present in new, unseen test videos, leading to reduced highlight detection performance. In this paper, we propose Highlight-TTA, a test-time adaptation framework for video highlight detection that addresses this limitation by dynamically adapting the model during testing to better align with the specific characteristics of each test video, thereby improving generalization and highlight detection performance. Highlight-TTA is jointly optimized with an auxiliary task, cross-modality hallucinations, alongside the primary highlight detection task. We utilize a meta-auxiliary training scheme to enable effective adaptation through the auxiliary task while enhancing the primary task. During testing, we adapt the trained model using the auxiliary task on the test video to further enhance its highlight detection performance. Extensive experiments with three state-of-the-art highlight detection models and three benchmark datasets show that the introduction of Highlight-TTA to these models improves their performance, yielding superior results.


105
SD-CSFL: A Synthetic Data-Driven Conformity Scoring Framework for Robust Federated Learning

Ebtisaam Alharbi ⋅ Abdulrahman Kerim ⋅ Leandro Soriano Marcolino ⋅ Qiang Ni

Federated Learning (FL) enables collaborative model training without sharing raw data, but remains highly vulnerable to gradient manipulation and backdoor attacks, particularly under heterogeneous client distributions. Most existing defenses either target a narrow class of attacks, rely on client data, or fail to adapt in heterogeneous settings. We propose SD-CSFL (Synthetic Data-Driven Conformity Scoring for Federated Learning), a unified and privacy-preserving defense algorithm. SD-CSFL leverages a synthetic calibration dataset, independent of client data, to compute entropy-based nonconformity scores that capture irregularities in client updates. An adaptive percentile thresholding mechanism with stratified calibration dynamically distinguishes benign from malicious updates across training rounds. We establish a conformal prediction–based guarantee showing that percentile thresholds bound false positives under arbitrary score distributions. Experiments on CIFAR-10 and Birds-525 demonstrate up to 35% higher detection of gradient manipulation and an 80% reduction in backdoor success rates, outperforming recent defenses in heterogeneous environments.
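The percentile-threshold guarantee mentioned above can be illustrated with the standard split-conformal quantile rule. The sketch below is not the SD-CSFL algorithm: nonconformity scores are taken as given numbers, and the names `conformal_threshold` and `flag_updates` are hypothetical. It relies only on the well-known fact that, for exchangeable benign scores, a new benign score exceeds the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score with probability at most $\alpha$, whatever the score distribution.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration scores to certify this alpha
    return sorted(calibration_scores)[k - 1]


def flag_updates(scores, threshold):
    """Mark an update as suspicious when its nonconformity score exceeds the threshold."""
    return [s > threshold for s in scores]


# Seven benign calibration scores and a target false-positive rate of 25%.
calibration = [3.0, 1.0, 4.0, 1.5, 9.0, 2.6, 5.0]
tau = conformal_threshold(calibration, alpha=0.25)
print(tau)                             # 5.0 (the 6th smallest of 7 scores)
print(flag_updates([2.0, 7.0], tau))   # [False, True]
```

Recomputing the threshold each training round from fresh calibration scores gives an adaptive rule in the spirit of the abstract, though the stratified calibration it describes is not modeled here.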

Human Action Recognition (HAR) in real-world scenarios is significantly challenged by unseen domain shifts, such as variations in the camera viewpoint, illumination, lighting, or background. Although recent advancements in video domain generalization have shown promise in HAR by introducing models that are robust to these shifts, existing methods often fall short. They typically depend on a single modality or employ static frame-level fusion approaches, which inherently limit the capture of multi-scale temporal dependencies and the alignment of the asynchronous modalities frequently present in video data. To address these limitations, we propose Multimodal Alignment and Distillation for Domain Generalization (MAD-DG), a novel framework that explicitly models multi-scale temporal relationships through a segment-wise temporal binding window contrastive alignment mechanism that effectively aligns asynchronous modalities. Furthermore, we integrate online self-distillation to extract robust domain-invariant representations. Extensive experiments conducted on widely recognized benchmarks demonstrate that MAD-DG achieves state-of-the-art performance and exhibits better generalization capabilities across both single-source and multi-source domain generalization settings.


107
Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport

Syed Mahmood ⋅ Ali Ali ⋅ Umer Ahmed ⋅ Fawad Fateh ⋅ Zeeshan Zia ⋅ Quoc-Huy Tran

We study the problem of self-supervised procedure learning, which discovers key steps and establishes their order from a set of unlabeled procedural videos. Previous procedure learning methods typically learn frame-to-frame correspondences between videos before determining key steps and their order. However, their performance often suffers from order variations, background/redundant frames, and repeated actions. To overcome these challenges, we propose a self-supervised procedure learning framework, which utilizes a fused Gromov-Wasserstein optimal transport formulation with a structural prior for computing frame-to-frame mapping between videos. However, optimizing exclusively for the above temporal alignment term may lead to degenerate solutions, where all frames are mapped to a small cluster in the embedding space and hence every video is associated with only one key step. To address that limitation, we further integrate a contrastive regularization term, which maps different frames to different points in the embedding space, avoiding the collapse to trivial solutions. Finally, we conduct extensive experiments on large-scale egocentric (i.e., EgoProceL) and third-person (i.e., ProceL and CrossTask) benchmarks to demonstrate superior performance by our approach against previous methods, including OPEL which relies on a traditional Kantorovich optimal transport formulation with an optimality prior.

The reliability of computer vision systems in the construction industry is critically undermined by motion blur, a complex and non-uniform degradation that conventional deblurring models with static architectures fail to address effectively. To overcome this challenge, we introduce the Spatial-Adaptive Channel Network (SAC-Net), a dynamic architecture designed specifically for this demanding environment. SAC-Net features three synergistic innovations. First, a Spatial-Adaptive Channel Module (SACM) generates content-aware spatial filters, allowing the network to adaptively focus on the most salient features for restoration. Second, our Hierarchical Feature Transfer with Wavelet (HFTW) method ensures robust propagation of core structural information by refining features in the wavelet domain, effectively suppressing noise. Finally, a Selective Feature Integration (SFI) module intelligently merges multi-scale features, combining semantic context with fine-grained detail. Evaluated on a large-scale, domain-specific construction dataset, SAC-Net significantly outperforms state-of-the-art methods, setting a new benchmark in both quantitative metrics and visual quality.


109
SegMo: Segment-aligned Text to 3D Human Motion Generation

Bowen Dang ⋅ Lin Wu ⋅ Xiaohang Yang ⋅ Zheng Yuan ⋅ Zhixiang Chen

Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align textual descriptions with human motion at the sequence level, neglecting the internal semantic structure of modalities. However, both motion descriptions and motion sequences can be naturally decomposed into smaller and semantically coherent segments, which can serve as the atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework to achieve fine-grained text–motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text–Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves the strong baseline on two widely used datasets, achieving an improved TOP1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.


110
From Darkness to Detail: Frequency-Aware SSMs for Low-Light Vision

Eashan Adhikarla ⋅ Kai Zhang ⋅ Gong Chen ⋅ John Nicholson ⋅ Brian Davison

Low-light image enhancement remains a persistent challenge in computer vision, where state-of-the-art models are often hampered by hardware constraints and computational inefficiency, particularly at high resolutions. While foundational architectures like transformers and diffusion models have advanced the field, their computational complexity limits their deployment on edge devices. We introduce ExpoMamba, a novel architecture that integrates a frequency-aware state-space model within a modified U-Net. ExpoMamba is uniquely designed to tackle mixed-exposure challenges by decoupling the modeling of amplitude (intensity) and phase (structure) in the frequency domain. This allows for targeted enhancement, making it highly effective for real-time applications, including downstream tasks like object detection and segmentation. Our experiments on six benchmark datasets show that ExpoMamba is up to 2-3x faster than competing models and achieves a PSNR improvement of 15-20%, establishing a new state-of-the-art in efficient, high-quality low-light enhancement. Source code will be released upon publication.
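To make the amplitude/phase decoupling concrete, the following sketch splits the discrete Fourier transform of a 1D row of pixel intensities into amplitude and phase, modifies only the amplitude, and recombines the two. It is a toy illustration in plain Python, not the ExpoMamba architecture; the function names and the uniform `gain` are assumptions.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft, including the 1/n normalization."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def scale_amplitude(signal, gain):
    """Scale the spectral amplitude (intensity) while leaving the phase (structure) untouched."""
    spectrum = dft(signal)
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    boosted = [cmath.rect(gain * a, p) for a, p in zip(amplitude, phase)]
    return [value.real for value in idft(boosted)]


# A dark row of pixels: doubling every amplitude doubles the brightness
# while the positions of edges, which are carried by the phase, stay fixed.
dark_row = [0.1, 0.2, 0.4, 0.1]
print([round(v, 6) for v in scale_amplitude(dark_row, gain=2.0)])
```

A learned model would replace the uniform gain with a frequency-dependent, content-dependent adjustment; the sketch only shows that the two components can be edited independently.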

Pre-trained vision-language (VL) models are highly vulnerable to adversarial attacks. However, existing defense methods primarily focus on image classification, overlooking two key aspects of VL tasks: multimodal attacks, where both image and text can be perturbed, and the one-to-many relationship of images and texts, where a single image can correspond to multiple textual descriptions and vice versa (1:N and N:1). This work is the first to explore defense strategies against multimodal attacks in VL tasks, whereas prior VL defense methods focus on vision robustness. We propose multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities during training, significantly outperforming existing unimodal defenses. Furthermore, we discover that MAT is limited by deterministic one-to-one (1:1) image-text pairs in VL training data. To address this, we conduct a comprehensive study on leveraging one-to-many relationships to enhance robustness, investigating diverse augmentation techniques. Our analysis shows that, for a more effective defense, augmented image-text pairs should be well-aligned, diverse, yet avoid distribution shift---conditions overlooked by prior research. Our experiments show that MAT can effectively be applied to different VL models and tasks to improve adversarial robustness, outperforming previous efforts. Our code is publicly available at \url{https://github.com//}.


112
ForestSplats: Deformable transient field for Gaussian Splatting in the Wild

Wongi Park ⋅ Myeongseok Nam ⋅ Siwon Kim ⋅ Sangwoo Jo ⋅ Soomok Lee

Recently, 3D Gaussian Splatting (3D-GS) has emerged, showing real-time rendering speeds and high-quality results in static scenes. Although 3D-GS is effective in static scenes, its performance significantly degrades in real-world environments due to transient objects, lighting variations, and diverse levels of occlusion. To tackle this, existing methods estimate occluders or transient elements by leveraging pre-trained models or integrating additional transient field pipelines. However, these methods still suffer from two defects: 1) Using semantic features from the Vision Foundation Model (VFM) limits generalization on unseen data due to reliance on prior knowledge. 2) The transient field requires significant memory to handle transient elements with per-view Gaussians and struggles to define clear boundaries for occluders, solely relying on photometric errors. To address these problems, we propose ForestSplats, a novel approach that leverages the deformable transient field and a superpixel-aware mask to efficiently represent transient elements in the 2D scene across unconstrained image collections and effectively decompose static scenes from transient distractors without VFM. We design the transient field to be deformable, capturing per-view transient elements. Furthermore, we introduce a superpixel-aware mask that clearly defines the boundaries of occluders by considering photometric errors and superpixels. Additionally, we propose uncertainty-aware densification to avoid generating Gaussians within the boundaries of occluders during densification. Through extensive experiments across several benchmark datasets, we demonstrate that ForestSplats outperforms existing methods without VFM and shows significant memory efficiency in representing transient elements.


113
Graph Query Networks for Object Detection with Automotive Radar

Loveneet Saini ⋅ Hasan Tercan ⋅ Tobias Meisen

Object detection with 3D radar is essential for 360$^{\circ}$ automotive perception, but radar's long wavelengths produce sparse and irregular reflections that challenge traditional grid and sequence-based convolutional and transformer detectors. This paper introduces Graph Query Networks (GQN), an attention-based framework that models objects sensed by radar as graphs, to extract individualized relational and contextual features. GQN employs a novel concept of graph queries to dynamically attend over the bird's-eye view (BEV) space, constructing object-specific graphs processed by two novel modules: EdgeFocus for relational reasoning and DeepContext Pooling for contextual aggregation. On the NuScenes dataset, GQN improves relative mAP by up to +53\%, including a +8.2\% gain over the strongest prior radar method, while reducing peak graph construction overhead by 80\% with moderate FLOPs cost.


114
FlowCLAS: Enhancing Normalizing Flow-Based Anomaly Segmentation Via Contrastive Learning

Chang Won (John) Lee ⋅ Selina Leveugle ⋅ Paul Grouchy ⋅ Chris Langley ⋅ Svetlana Stolpner ⋅ Jonathan Kelly ⋅ Steven Waslander

Anomaly segmentation is a critical capability for safety-critical robotics applications that must be aware of unexpected events. Normalizing flows (NFs), a class of generative models, are a promising approach for this task due to their ability to model the inlier data distribution efficiently. However, their performance falters in dynamic scenes, where complex, multi-modal data distributions cause them to struggle with out-of-distribution samples, leaving a performance gap to leading discriminative methods. To address this limitation, we introduce FlowCLAS, a hybrid framework that enhances the traditional maximum likelihood objective of NFs with a discriminative, contrastive loss. Leveraging Outlier Exposure, this objective explicitly enforces a separation between normal and anomalous features in the latent space, retaining the probabilistic foundation of NFs while embedding the discriminative power they lack. The strength of this approach is demonstrated by FlowCLAS establishing new state-of-the-art (SOTA) performance across multiple challenging anomaly segmentation benchmarks for robotics, including Fishyscapes Lost & Found, Road Anomaly, SegmentMeIfYouCan-ObstacleTrack, and ALLO. Our experiments also show that this contrastive approach is more effective than other outlier-based training strategies for NFs, successfully bridging the performance gap to leading discriminative methods.


115
ScoreNet: Netting Lightweight Quality Scores for Better Visual Assessment with Large Multi-Modality Models

Bahador Rashidi ⋅ Kiarash Aghakasiri ⋅ Shupei Zhang ⋅ Amirmohsen Sattarifard ⋅ Yue Zhang ⋅ Chao Gao

The advancement of general large multi-modal models (LMMs) has transformed many computer vision tasks, shifting image quality assessment (IQA) from specialized algorithms to models built on pre-trained LMM backbones. This evolution raises the question of whether dedicated IQA metrics remain relevant or are becoming obsolete in the age of LMMs. In this paper, we address this challenge by introducing ScoreNet, a novel framework that fuses the strengths of traditional metrics to elevate the IQA capabilities of LMMs. ScoreNet employs a soft prompting mechanism, learning prompts from a curated set of lightweight IQA scores and image embeddings. This context-driven learning strategy enhances the adaptability of LMMs for IQA tasks with a small additional computation cost. We show that ScoreNet serves as a general-purpose extension applicable to modern LMM-based IQA models. We integrate ScoreNet into two high-performing methods—CLIP-IQA and Q-Align—and observe consistent improvements. Experimental results show that ScoreNet not only boosts both models but also surpasses other state-of-the-art IQA approaches. Source code for ScoreNet will be released.


116
VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics

Daniel Cher ⋅ Brian Wei ⋅ Srikumar Sastry ⋅ Nathan Jacobs

We introduce VectorSynth, a diffusion-based framework for pixel-accurate satellite image synthesis conditioned on polygonal geographic annotations with semantic attributes. Unlike prior text- or layout-conditioned models, VectorSynth learns dense cross-modal correspondences that align imagery and semantic vector geometry, enabling fine-grained, spatially grounded edits. A vision language alignment module produces pixel-level embeddings from polygon semantics; these embeddings guide a conditional image generation framework to respect both spatial extents and semantic cues. VectorSynth supports interactive workflows that mix language prompts with geometry-aware conditioning, allowing rapid what-if simulations, spatial edits, and map-informed content generation. For training and evaluation, we assemble a collection of satellite scenes paired with pixel-registered polygon annotations spanning diverse urban scenes with both built and natural features. We observe strong improvements over prior methods in semantic fidelity and structural realism, and show that our trained vision language model demonstrates fine-grained spatial grounding. Code and data will be released.

In recent years, artificial intelligence (AI) has significantly impacted digital forensics, yet its broader deployment remains limited due to the difficulty of explaining AI decisions. Explainable AI (XAI) presents a potential solution to increasing transparency and trust, but its application in digital forensics is still underexplored. In this work, we present a practical and structured explainable digital forensics AI (xDFAI) approach tailored to the forensic task of video source camera identification (VSCI). Our method enables forensic examiners to interpret the behavior of AI models, assess whether decisions are driven by intended logic or arise from random or content-dependent artifacts, and establish the integrity and reliability of explanations. We implement and evaluate this approach on two state-of-the-art VSCI models, providing step-by-step analyses of explanation quality, spatial consistency of high-impact features, and content dependence. Our results reveal that while models achieve strong classification accuracy, their explanations lack spatial stability and are impacted by video content, raising concerns about forensic reliability. To support reproducibility and future research, we provide an open-source implementation. This work underscores the potential of XAI to improve transparency in digital forensics and highlights the challenges of interpretation and presentation of results. Our study takes an important step toward the operational deployment of xDFAI in multimedia forensics.


118
Modeling and Learning Multiple Hypotheses for Monocular 3D Object Detection

Hyeonjeong Park ⋅ Peixi Xiong ⋅ Pei Yu ⋅ Wei Tang

Detecting objects in 3D space using a monocular image is inherently a highly ill-posed problem: multiple plausible 3D bounding boxes can explain the same 2D observation of an object. Existing approaches typically follow a single-point prediction paradigm, failing to capture this multimodal nature and often regressing to an implausible mean solution. This paper introduces MonoMH, a novel multi-hypothesis framework for monocular 3D object detection. By explicitly modeling and learning the multimodal distribution of plausible 3D object configurations, MonoMH not only significantly improves detection performance but also provides richer information to support downstream decision-making. MonoMH introduces three key innovations: (1) a novel multi-hypothesis predictor that leverages spatially diverse features across different windows within an RoI to generate a rich variety of hypotheses without increasing model complexity; (2) a new multi-hypothesis learning approach that derives diverse and relevant hypotheses from single-modal ground truth by integrating uncertainty modeling with best-of-many learning; and (3) a novel adaptive hypothesis filtering mechanism that enhances detection capability by dynamically retaining a variable number of plausible hypotheses based on each object's uncertainty. Experimental results demonstrate the effectiveness of our approach. Notably, MonoMH achieves 29.12 (easy), 20.88 (mod.), and 17.93 (hard) Car AP$_{3D}$ on the KITTI test set, a significant boost over the previous state-of-the-art methods. We will make the code publicly available.
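The best-of-many learning mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each hypothesis is a flat tuple of box parameters and uses a plain L1 error; the function name `best_of_many_loss` is hypothetical, and MonoMH's uncertainty modeling and hypothesis filtering are not reproduced here.

```python
def best_of_many_loss(hypotheses, target):
    """Return (loss, index) of the hypothesis closest to the target.

    Only the best hypothesis is penalized, so the others remain free
    to cover different plausible 3D configurations.
    """
    # L1 error of every hypothesis against the single ground truth.
    errors = [
        sum(abs(h_i - t_i) for h_i, t_i in zip(h, target))
        for h in hypotheses
    ]
    best = min(range(len(errors)), key=errors.__getitem__)
    return errors[best], best


# Three hypothetical (depth, width) guesses for one object.
loss, winner = best_of_many_loss(
    [(10.0, 2.0), (12.0, 2.0), (20.0, 1.0)], target=(12.5, 2.0)
)
```

Because the minimum is taken over hypotheses, a predictor trained this way is not pushed toward the mean of all plausible boxes, which is the failure mode the abstract attributes to single-point prediction.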

Effective data augmentation for domain-specific image classification must balance three competing objectives: diversity, faithfulness, and label clarity. However, current methods, including state-of-the-art diffusion models, struggle to achieve this balance and are further limited by issues such as stochastic outputs under strong transformations. We propose SGD-Mix, a novel framework that systematically reconciles these objectives. Our approach employs saliency-guided mixing to preserve foreground semantics while introducing diverse backgrounds, followed by a domain-specific fine-tuned diffusion model that refines the output to ensure high fidelity and strict label consistency. Extensive experiments across fine-grained, long-tail, few-shot, and background robustness tasks demonstrate that SGD-Mix achieves state-of-the-art performance, surpassing existing diffusion-based and non-generative methods by notable margins.
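As a rough illustration of the saliency-guided mixing step described above, the sketch below blends a foreground image into a new background using a per-pixel saliency map in [0, 1]. The function name `saliency_mix` and the linear blend are assumptions for illustration; the diffusion-based refinement stage is not shown.

```python
def saliency_mix(foreground, background, saliency):
    """Blend two equally sized grayscale images pixel by pixel.

    Pixels with saliency near 1 keep the foreground (the labeled
    object); pixels near 0 take the new background.
    """
    return [
        [s * f + (1.0 - s) * b for f, b, s in zip(f_row, b_row, s_row)]
        for f_row, b_row, s_row in zip(foreground, background, saliency)
    ]


# One-row toy image: the first pixel is fully salient, the second mostly not.
mixed = saliency_mix([[1.0, 1.0]], [[0.0, 0.0]], [[1.0, 0.25]])
```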

Recent advances in customizing Text-to-Image models allow users to generate personalized images with just a few samples. As demand for multi-concept generation grows, methods using weight fusion and test-time optimization have emerged, integrating multiple concepts within a single image. However, these approaches inject concept knowledge into the parametric space, leading to high overhead in multi-concept generation. We introduce DreamCatcher, an efficient framework based on representation finetuning. Our key innovation embeds conceptual information into the feature space, achieving up to 5× faster multi-concept generation while reducing learnable storage per concept by 88\%, all without quality loss. Besides, our method is highly versatile, enabling personalized inpainting without training.


121
Enhancing Monocular 3D Hand Reconstruction with Learned Texture Priors

Giorgos Karvounas ⋅ Nikolaos Kyriazis ⋅ Iason Oikonomidis ⋅ Georgios Pavlakos ⋅ Antonis Argyros

We revisit the role of texture in monocular 3D hand reconstruction, not as an afterthought for photorealism, but as a dense, spatially grounded cue that can actively support pose and shape estimation. Our observation is simple: even in high-performing models, the overlay between predicted hand geometry and image appearance is often imperfect, suggesting that texture alignment may be an underused supervisory signal. We propose a lightweight texture module that embeds per-pixel observations into UV texture space and enables a novel dense alignment loss between predicted and observed hand appearances. Our approach assumes access to a differentiable rendering pipeline and a model that maps images to 3D hand meshes with known topology, allowing us to back-project a textured hand onto the image and perform pixel-based alignment. The module is self-contained and easily pluggable into existing reconstruction pipelines. To isolate and highlight the value of texture-guided supervision, we augment HaMeR, a high-performing yet unadorned transformer architecture for 3D hand pose estimation. The resulting system improves both accuracy and realism, demonstrating the value of appearance-guided alignment in hand reconstruction.


122
PALMS+: Modular Image-Based Floor Plan Localization Leveraging Depth Foundation Model

Yunqian Cheng ⋅ Benjamin Princen ⋅ Roberto Manduchi

Indoor localization in GPS-denied environments is crucial for applications like emergency response and assistive navigation. Vision-based methods such as PALMS enable infrastructure-free localization using only a floor plan and a stationary scan, but are limited by the short range of smartphone LiDAR and ambiguity in indoor layouts. We propose PALMS+, a modular, image-based system that addresses these challenges by reconstructing scale-aligned 3D point clouds from posed RGB images using a monocular depth model (Depth Pro), followed by geometric layout matching via convolution with the floor plan. PALMS+ outputs a posterior over the location and orientation, usable for direct or sequential localization. Evaluated on the Structured3D and a custom campus dataset consisting of 80 observations across four large campus buildings, PALMS+ outperforms PALMS and F$^3$Loc in stationary localization accuracy—without requiring any training—highlighting its potential for scalable, infrastructure-free deployment. Notably, PALMS+ achieves performance that is better or at least comparable to baselines, even when using just a single image as input, while baselines rely on full panoramic views. Our code and data will be released upon acceptance.


123
IPCD: Intrinsic Point-Cloud Decomposition

Shogo Sato ⋅ Takuhiro Kaneko ⋅ Shoichiro Takeda ⋅ Tomoyasu Shimada ⋅ Kazuhiko Murasaki ⋅ Taiga Yoshida ⋅ Ryuichi Tanida ⋅ Akisato Kimura

Point clouds are widely used in various fields, including augmented reality (AR) and robotics, where relighting and texture editing are crucial for realistic visualization. Achieving these tasks requires accurately separating albedo from shade. However, performing this separation on point clouds presents two key challenges: (1) the non-grid structure of point clouds makes conventional image-based decomposition models ineffective, and (2) point-cloud models designed for other tasks do not explicitly consider global-light direction, resulting in inaccurate shade. In this paper, we introduce Intrinsic Point-Cloud Decomposition (IPCD), a new task that decomposes colored point clouds into albedo and shade. To overcome challenge (1), we propose IPCD-Net, a novel model that extends image-based decomposition with point-wise feature aggregation for non-grid data processing. For challenge (2), we introduce Projection-based Luminance Distribution (PLD), capturing global-light feature via multi-view projection. For comprehensive evaluation, we create a synthetic outdoor-scene dataset. Experimental results demonstrate that IPCD-Net reduces cast shadows in albedo and enhances color accuracy in shade. Furthermore, we showcase its applications in texture editing, relighting, and point-cloud registration under varying illumination. Finally, we verify the real-world applicability of IPCD-Net.

Multimodal representation learning is becoming increasingly important due to the growing availability of diverse multimodal data across various domains. Particularly, the ability to adapt to arbitrary numbers or types of modalities is useful for improving flexibility. We propose CLARGA, a general-purpose multimodal representation learning architecture that builds a learned attention-weighted graph over modality features and uses Graph Attention Networks to fuse them. CLARGA is trained end-to-end with combined supervised and contrastive loss, which aligns modalities while maintaining each modality's own strength. We demonstrate CLARGA's effectiveness in diverse multimodal representation learning tasks across 7 datasets spanning finance, human-computer interaction, general multimedia classification, and complex affective computing. It consistently outperforms baselines, ablations, and recent state-of-the-art approaches. Particularly, we demonstrate the highest known performance on the DAIC-WoZ dataset for multimodal depression identification. Our results show that CLARGA is an accurate and robust general-purpose fusion framework suitable for a wide range of complex multimodal learning tasks.

We present FNOpt, a self-supervised cloth simulation framework that formulates time integration as an optimization problem and trains a resolution-agnostic neural optimizer parameterized by a Fourier neural operator (FNO). Prior neural simulators often rely on extensive ground-truth data or sacrifice fine-scale detail, and generalize poorly across resolutions and motion patterns. In contrast, FNOpt learns to simulate physically plausible cloth dynamics and achieves stable and accurate rollouts across diverse mesh resolutions and motion patterns without retraining. Trained only on a coarse grid with physics-based losses, FNOpt generalizes to finer resolutions, capturing fine-scale wrinkles and preserving rollout stability. Extensive evaluations on a benchmark cloth simulation dataset demonstrate that FNOpt outperforms prior learning-based approaches in out-of-distribution settings in both accuracy and robustness. These results position FNO‑based meta‑optimization as a compelling alternative to previous neural simulators for cloth, reducing the need for curated data and improving cross‑resolution reliability. The full codebase will be made available upon acceptance.

Underwater images suffer from color distortion, haziness, and low contrast due to light absorption and scattering. Despite deep learning advances in enhancement, challenges persist in efficiency, global context modeling, spatial-spectral consistency, and perceptually accurate detail recovery. To address these challenges, we design a novel underwater image enhancement framework, D2Mamba, which combines dual-domain (spatial and frequency) information with state space models (SSMs), enabling efficient global context modeling while preserving local details. Unlike conventional SSMs that rely on raster, bidirectional, cross, or diagonal scans, D2Mamba uses an A* search guided by a physics-based Geodesic Information-Field Heuristic (GIFH) scan for feature traversal based on input degradation characteristics. GIFH combines feature gradients, high-frequency heterogeneity, and low-frequency semantic distance to compute adaptive costs, enabling the capture of both spatial and spectral dependencies. Further, a Spectral Wasserstein Attenuation Loss (SWAL) is introduced to enforce distributional alignment in the spectral domain, enabling perceptually and physically consistent color restoration in enhanced underwater images. Extensive experiments on benchmark datasets demonstrate that D2Mamba achieves state-of-the-art performance with only 788K parameters and 7.06 GFLOPs.


127
Image-Guided Semantic Pseudo-LiDAR Point Generation for 3D Object Detection

MINSEUNG LEE ⋅ Seokha Moon ⋅ Seung Lee ⋅ Reza Mahjourian ⋅ Jinkyu Kim

In autonomous driving scenarios, accurate perception is becoming an even more critical task for safe navigation. While LiDAR provides precise spatial data, its inherent sparsity makes it difficult to detect small or distant objects. Existing methods try to address this by generating additional points within a Region of Interest (RoI), but relying on LiDAR alone often leads to false positives and a failure to recover meaningful structures. To address these limitations, we propose Image-Guided Semantic Pseudo-LiDAR Point Generation model, called ImagePG, a novel framework that leverages rich RGB image features to generate dense and semantically meaningful 3D points. Our framework includes an Image-Guided RoI Points Generation (IG-RPG) module, which creates pseudo-points guided by image features, and an Image-Aware Occupancy Prediction Network (I-OPN), which provides spatial priors to guide point placement. A multi-stage refinement (MR) module further enhances point quality and detection robustness. To the best of our knowledge, ImagePG is the first method to directly leverage image features for point generation. Extensive experiments on the KITTI and Waymo datasets demonstrate that ImagePG significantly improves the detection of small and distant objects like pedestrians and cyclists, reducing false positives by nearly 50\%. On the KITTI benchmark, our framework improves mAP by +1.38\%p (car), +7.91\%p (pedestrian), and +5.21\%p (cyclist) on the test set over the baseline, achieving state-of-the-art cyclist performance on the KITTI leaderboard.


128
HABIT: Human Action Benchmark for Interactive Traffic in CARLA

Mohan Ramesh ⋅ Mark Azer ⋅ Fabian Flohr

Current autonomous driving (AD) simulations are critically limited by their inadequate representation of realistic and diverse human behavior, which is essential for ensuring safety and reliability. Existing benchmarks often simplify pedestrian interactions, failing to capture complex, dynamic intentions and varied responses critical for robust system deployment. To overcome this, we introduce HABIT (Human Action Benchmark for Interactive Traffic), a high-fidelity simulation benchmark. HABIT integrates real-world human motion, sourced from mocap and videos, into CARLA via a modular, extensible, and physically consistent motion retargeting pipeline. From an initial pool of approximately 30,000 retargeted motions, we curate 4,730 traffic-compatible pedestrian motions, standardized in SMPL format for physically consistent trajectories. HABIT seamlessly integrates with CARLA's Leaderboard, enabling automated scenario generation and rigorous agent evaluation. Our safety metrics, including Abbreviated Injury Scale (AIS) and False Positive Braking Rate (FPBR), reveal critical failure modes in state-of-the-art AD agents missed by prior evaluations. Evaluating zero-shot performance on pose estimation, segmentation, and tracking underscores the visual realism inherent in our benchmark. While modern end-to-end planning methods like Interfuser achieve zero collisions per kilometer on the CARLA Leaderboard, they perform notably worse on HABIT, with $5.24$ collisions/km and a $10.96$\% AIS 3+ injury risk. In scenes with idle pedestrians, they brake unnecessarily in 63.9\% of cases. All components are publicly released to support reproducible, pedestrian-aware AI research.


129
EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models

Wenhui Zhu ⋅ Xiwen Chen ⋅ Zhipeng Wang ⋅ Shao Tang ⋅ Sayan Ghosh ⋅ XUANZHAO DONG ⋅ Rajat Koner ⋅ Yalin Wang

Instructed Visual Segmentation (IVS) tasks require segmenting objects in images or videos based on natural language instructions. While recent multimodal large language models (MLLMs) have achieved strong performance on IVS, their inference cost remains a major bottleneck, particularly in video. We empirically analyze visual token sampling in MLLMs and observe a strong correlation between subset token coverage and segmentation performance. This motivates our design of a simple and effective token pruning method that selects a compact yet spatially representative subset of tokens to accelerate inference. In this paper, we introduce a novel visual token pruning method for IVS, called EVTP-IVS, which builds upon the $k$-center method by integrating spatial information to ensure better coverage. We further provide an information-theoretic analysis to support our design. Experiments on standard IVS benchmarks show that our method achieves up to 5× speed-up on video tasks and 3.5× on image tasks, while maintaining comparable accuracy using only 20\% of the tokens. Our method also consistently outperforms state-of-the-art pruning baselines under varying pruning ratios.
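To make the idea of spatially aware $k$-center selection concrete, here is a minimal sketch of greedy farthest-point selection over tokens. It is not the authors' implementation: the function name `spatial_k_center`, the additive combination of feature and grid distance, and the `spatial_weight` parameter are all assumptions chosen for illustration.

```python
import math


def spatial_k_center(features, positions, k, spatial_weight=1.0):
    """Greedy k-center: repeatedly keep the token farthest from the kept set.

    Distance = feature distance + spatial_weight * grid distance
    (this particular combination is an illustrative assumption).
    """
    n = len(features)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of tokens")

    def dist(i, j):
        return (math.dist(features[i], features[j])
                + spatial_weight * math.dist(positions[i], positions[j]))

    selected = [0]  # arbitrary seed token
    # Distance from every token to its nearest selected token.
    nearest = [dist(i, 0) for i in range(n)]
    while len(selected) < k:
        nxt = max(range(n), key=nearest.__getitem__)
        selected.append(nxt)
        for i in range(n):
            nearest[i] = min(nearest[i], dist(i, nxt))
    return selected


# Five tokens with identical features laid out along one grid row.
kept = spatial_k_center([(0.0,)] * 5, [(0, c) for c in range(5)], k=3)
```

With identical features, the spatial term alone drives the choice, so the kept tokens spread across the row rather than clustering, which is the coverage property the abstract ties to segmentation quality.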

With the rapid advancement of sophisticated synthetic audio-visual content, e.g., for subtle malicious manipulations, ensuring the perceptual integrity of digital media has become paramount. This work presents a novel approach to temporal localization of deepfakes by leveraging Audio-Visual Speech Representation Reconstruction (AuViRe). Specifically, our approach reconstructs speech representations from one modality (e.g., visual lip movements) based on the other (e.g., audio waveform). In manipulated regions, this cross-modal reconstruction becomes significantly more challenging and the resulting discrepancies are amplified, providing robust discriminative cues for precise forgery localization. AuViRe outperforms the state of the art by +8.9 AP\@0.95 on LAV-DF, +9.6 AP\@0.5 on AV-Deepfake1M, and +5.1 AUC on an in-the-wild experiment. Code will be publicly available upon acceptance.


131
Exploring the Boundaries of Diffusion Models for Offline Writer Identification with Sparse and Intra-Variable Data

Aritra Dey ⋅ Chandranath Adak ⋅ Kumari Priya ⋅ Soumi Chattopadhyay ⋅ Sukalpa Chanda

Offline writer identification poses significant challenges when training data is scarce and handwriting styles exhibit high intra-writer variability. This scenario is common in practical applications such as forensic analysis and historical document authentication, where only a limited number of handwritten samples are available per writer. In this paper, we explore the viability of using diffusion models to capture writer-specific traits under such challenging conditions. Specifically, we investigate their performance in both text-dependent and text-independent setups, where lexical similarity varies across samples. We propose a novel diffusion-based writer identification framework that integrates a style encoder and handcrafted textural features in a joint training pipeline. Our approach is evaluated on a newly curated dataset with high intra-writer variability as well as two benchmark datasets (IAM and CERUG-EN). Experimental results demonstrate that while diffusion models excel in text-dependent scenarios (Top-1 accuracy: 90.77\%), their generalization capability diminishes in text-independent settings due to entanglement of content and style features. This study highlights both the promise and the current limitations of generative diffusion models for fine-grained handwriting style modeling. We identify avenues for improving generalization through disentangled representations, domain adaptation, and hybrid discriminative-generative architectures. The proposed framework contributes to the growing efforts toward scalable, style-aware writer identification in real-world, unconstrained handwriting scenarios.

Detection Transformers (DETRs) have advanced object detection but are resource-intensive, limiting deployment in embedded settings like self-driving cars. While knowledge distillation (KD) effectively compresses CNN detectors, it is underexplored for DETRs, and most KD methods fail to capture global context. Also, existing KD methods often blindly trust the teacher model, which can be misleading. To bridge these gaps, this paper proposes Consistent Location-and-Context-aware Knowledge Distillation (CLoCKDistill) for DETR detectors, which includes two components: (1) Feature distillation targets the context-rich transformer encoder output (memory) and enriches it with ground truth object cues, enabling the student to focus on relevant regions with balanced attention across object sizes. (2) Logit distillation uses ground truth to generate target-aware decoder queries, ensuring both teacher and student attend to consistent and accurate parts of encoder memory. Experiments on KITTI and COCO show that CLoCKDistill improves a wide range of DETRs (e.g., single-scale DAB-DETR, multi-scale deformable DETR, and denoising-based DINO) by 2.2\%–6.4\%.


133
MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding

Pengyi Li ⋅ Irina Abdullaeva ⋅ Alexander Gambashidze ⋅ Andrei Kuznetsov ⋅ Ivan Oseledets

Modern Video Large Language Models (VLLMs) often rely on uniform frame sampling for video understanding, but this approach frequently fails to capture critical information due to frame redundancy and variations in video content. We propose MaxInfo, the first training-free method based on the maximum volume principle, which selects and retains the most representative frames from a video and is available in Fast, Slow, and Chunk-based versions. By maximizing the geometric volume formed by selected embeddings, MaxInfo ensures that the chosen frames cover the most informative regions of the embedding space, effectively reducing redundancy while preserving diversity. This method enhances the quality of input representations and improves long video comprehension performance across benchmarks. For instance, MaxInfo achieves a 3.28% improvement on LongVideoBench and a 6.40% improvement on EgoSchema for LLaVA-Video-7B. Moreover, MaxInfo boosts LongVideoBench performance by 3.47% on LLaVA-Video-72B and 3.44% on MiniCPM4.5. The approach is simple to implement and works with existing VLLMs without additional training and with very low latency, making it a practical and effective alternative to traditional uniform sampling methods.
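The maximum volume principle can be sketched with a simple greedy procedure: at each step, keep the frame whose embedding adds the most volume to the parallelepiped spanned by the frames already kept, then remove that direction from all remaining embeddings. This is an illustrative stand-in, not MaxInfo's actual algorithm or any of its Fast, Slow, or Chunk-based variants; the function name `greedy_max_volume` and the tolerance value are assumptions.

```python
def greedy_max_volume(embeddings, k, tol=1e-12):
    """Pick up to k frame indices whose embeddings span a large volume.

    Equivalent to pivoted Gram-Schmidt: the squared residual norm of a
    frame is the factor by which it would grow the squared volume.
    """
    residual = [[float(x) for x in row] for row in embeddings]
    selected = []
    for _ in range(k):
        norms = [sum(x * x for x in r) for r in residual]
        best = max(range(len(residual)), key=norms.__getitem__)
        if norms[best] <= tol:
            break  # every remaining frame is redundant
        selected.append(best)
        pivot = residual[best][:]  # copy before rows are modified
        for r in residual:
            coef = sum(a * b for a, b in zip(r, pivot)) / norms[best]
            for d in range(len(r)):
                r[d] -= coef * pivot[d]
    return sorted(selected)  # restore temporal order


# Frames 0 and 3 are duplicates and frame 1 is nearly the same as frame 0.
frames = [(1, 0), (0.9, 0.1), (0, 1), (1, 0)]
keep = greedy_max_volume(frames, k=2)
```

In this toy case the near-duplicate frames contribute almost no additional volume once frame 0 is kept, so the selection moves to the one frame pointing in a new direction, which mirrors the redundancy reduction the abstract describes.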