

Poster Session

Poster Session 3

Mon 9 Mar 10:45 a.m. PDT — 12:30 p.m. PDT


1
Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing

Saarthak Kapse ⋅ Robin Betz ⋅ Srinivasan Sivanandan

State Space Models (SSMs) with selective scan (Mamba) have been adapted into efficient vision models. Mamba, unlike Vision Transformers, achieves linear complexity for token interactions through a recurrent hidden state process. This sequential processing is enhanced by a parallel scan algorithm, which reduces the computational time of recurrent steps from $L$ sequential steps to $\log(L)$ parallel steps with respect to the number of input tokens ($L$). In this work, we propose Fast Vision Mamba (FastVim), which further reduces the computational time of the SSM block by reducing the number of recurrent steps in Vision Mamba models while still retaining model performance. By alternately pooling tokens along image dimensions across Mamba blocks, we obtain a 2$\times$ reduction in the number of parallel steps in the SSM block. Our model offers up to a 72.5% speedup in inference compared to baseline Vision Mamba models on high-resolution (2048$\times$2048) images. Our experiments demonstrate state-of-the-art performance with dramatically improved throughput in a range of tasks such as image classification, cell perturbation prediction, segmentation, and object detection.
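The step counts in this abstract can be illustrated with a small calculation. The sketch below is one reading of the 2$\times$ claim, not the authors' code: it assumes a 16-pixel patch size and that pooling one image dimension leaves a single row (or column) of tokens to scan. The helper name `scan_depth` is hypothetical.

```python
import math

def scan_depth(num_tokens: int) -> int:
    """Depth of a log2-depth parallel scan over num_tokens items."""
    return math.ceil(math.log2(num_tokens))

# Assumed setup (not from the abstract): 2048x2048 image, 16x16 patches.
side = 2048 // 16              # 128 tokens per image side
full_tokens = side * side      # scanning every token in the grid
pooled_tokens = side           # scanning after pooling along one dimension

full_depth = scan_depth(full_tokens)
pooled_depth = scan_depth(pooled_tokens)
print(full_depth, pooled_depth)
```

Under these assumptions the scan over the pooled sequence needs half as many parallel steps, because $\log(\sqrt{L}) = \tfrac{1}{2}\log(L)$.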


2
Extreme Amodal Face Detection

Changlin Song ⋅ Yunzhong Hou ⋅ Michael Barnes ⋅ Rahul Shome ⋅ Dylan Campbell

Extreme amodal detection is the task of inferring the 2D location of objects that are not fully visible in the input image but are visible within an expanded field-of-view. This differs from amodal detection, where the object is partially visible within the input image, but is occluded. In this paper, we consider the sub-problem of face detection, since this class provides motivating applications involving safety and privacy, but do not tailor our method specifically to this class. Existing approaches rely on image sequences so that missing detections may be interpolated from surrounding frames or make use of generative models to sample possible completions. In contrast, we consider the single-image task and propose a more efficient, sample-free approach that makes use of the contextual cues from the image to infer the presence of unseen faces. We design a heatmap-based extreme amodal object detector that addresses the problem of efficiently predicting a lot (the out-of-frame region) from a little (the image) with a selective coarse-to-fine decoder. Our method establishes strong results for this new task, even outperforming less efficient generative approaches.


3
ENCORE: A Neural Collapse Perspective on Out-of-Distribution Detection in Deep Neural Networks

A. Q. M. Sazzad Sayyed ⋅ Nathaniel Bastian ⋅ Francesco Restuccia

Out-of-Distribution (OOD) detection is of paramount importance in guaranteeing safe and reliable deployment of a Deep Neural Network (DNN) model in real-world settings. However, most OOD detection approaches still lack motivation rooted in established properties of the DNNs. This disconnect between the proposed approach and theoretical underpinning to measurable DNN properties makes these approaches unreliable. To bridge this gap, we take a different perspective to using energy scoring for OOD detection. Specifically, we look at the energy score through the lens of the properties of neural collapse and observe that simple feature scaling can improve the separation between In-Distribution (ID) and OOD inputs. Based on this observation, we propose ENCORE, which scales features of each input adaptively and uses them to obtain modified logits based on insights from the theory of neural collapse. We show that ENCORE outperforms state-of-the-art approaches across a variety of benchmarks; for example, by 1.37% on CIFAR10 and by 1.07% on Imagenet benchmarks.
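For readers unfamiliar with energy scoring, the sketch below shows the standard energy score, $-T \log \sum_k e^{z_k / T}$, and how a uniform scaling of the logits changes the gap between a peaked and a flat logit vector. It is only an illustration: the example logits and the fixed factor of 2 are made up, and ENCORE's adaptive, per-input scaling rule is not reproduced here.

```python
import math

def energy_score(logits, temperature=1.0):
    """Standard energy score: -T * logsumexp(logits / T). Lower suggests in-distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return -temperature * lse

def scale_logits(logits, factor):
    """For a bias-free linear head, scaling features by `factor` scales logits equally."""
    return [factor * z for z in logits]

# Hypothetical logits: a peaked vector (ID-like) and a flat vector (OOD-like).
confident = [8.0, 0.5, -1.0]
flat = [0.4, 0.3, 0.2]

gap_before = energy_score(flat) - energy_score(confident)
gap_after = (energy_score(scale_logits(flat, 2.0))
             - energy_score(scale_logits(confident, 2.0)))
print(gap_before, gap_after)
```

In this toy case the energy gap between the two inputs widens after scaling, which is the kind of separation effect the abstract describes.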

Conformal prediction is a model-agnostic approach to generating prediction sets that cover the true class with a high probability. Although its prediction set size is expected to capture aleatoric uncertainty, there is a lack of evidence regarding its effectiveness. The literature presents that prediction set size can upper-bound aleatoric uncertainty or that prediction sets are larger for difficult instances and smaller for easy ones, but a validation of this attribute of conformal predictors is missing. This work investigates how effectively conformal predictors quantify aleatoric uncertainty, specifically the inherent ambiguity in datasets caused by overlapping classes. We perform this by measuring the correlation between prediction set sizes and the number of distinct labels assigned by human annotators per instance. We further assess the similarity between prediction sets and human-provided annotations. We use three conformal prediction approaches to generate prediction sets for eight deep learning models trained on four datasets. The datasets contain annotations from multiple human annotators (ranging from five to fifty participants) per instance, enabling the identification of class overlap. We show that the vast majority of the conformal prediction outputs show a very weak to weak correlation with human annotations, with only a few showing moderate correlation. These findings underscore the necessity of critically reassessing the prediction sets generated using conformal predictors. While they can provide a higher coverage of the true classes, their capability in capturing aleatoric uncertainty remains limited.
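As background for the set sizes discussed above, here is a minimal sketch of split conformal prediction with the common nonconformity score $1 - p(\text{true class})$. The abstract does not name the three conformal approaches evaluated, so this variant, the calibration data, and the function names are assumptions for illustration.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Finite-sample quantile of scores 1 - p(true class) on a calibration set."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # 1-indexed rank
    # If the rank exceeds n, every class must be included.
    return scores[rank - 1] if rank <= n else float("inf")

def prediction_set(probs, threshold):
    """All classes whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]

# Hypothetical calibration softmax outputs and true labels.
cal_probs = [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
cal_labels = [0, 1, 2, 0]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

easy_set = prediction_set([0.95, 0.03, 0.02], threshold)
ambiguous_set = prediction_set([0.45, 0.45, 0.10], threshold)
print(easy_set, ambiguous_set)
```

The study's question is whether set sizes like `len(ambiguous_set)` track the number of distinct labels that human annotators assign to the same instance.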


5
Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination

Ziqiang Shi ⋅ Rujie Liu ⋅ Shanshan Yu ⋅ Satoshi Munakata ⋅ Koichi Shirahata

Rapid progress in large vision-language models (LVLMs) has achieved unprecedented performance in vision-language tasks. However, due to the strong prior of large language models (LLMs) and misaligned attention across modalities, LVLMs often generate outputs inconsistent with visual content, termed hallucination. To address this, we propose Scalpel, a method that reduces hallucination by refining attention activation distributions toward more credible regions. Scalpel predicts trusted attention directions for each head in Transformer layers during inference and adjusts activations accordingly. It employs a Gaussian mixture model to capture multi-peak distributions of attention in trust and hallucination manifolds, and uses entropic optimal transport (equivalent to the Schrödinger bridge problem) to map Gaussian components precisely. During mitigation, Scalpel dynamically adjusts intervention strength and direction based on component membership and mapping relationships between hallucination and trust activations. Extensive experiments across multiple datasets and benchmarks demonstrate that Scalpel effectively mitigates hallucinations, outperforming previous methods and achieving state-of-the-art performance. Moreover, Scalpel is model- and data-agnostic, requiring no additional computation, only a single decoding step.


6
Unified Alignment Protocol: Making Sense of the Unlabeled Data in New Domains

Sabbir Ahmed ⋅ Mamshad Nayeem Rizve ⋅ Abdullah Al Arafat ⋅ Jacqueline Liu ⋅ Rahim Hossain ⋅ Mohaiminul Nahian ⋅ Adnan Siraj Rakin

Semi-Supervised Federated Learning (SSFL) is gaining popularity over conventional Federated Learning in many real-world applications. Because labeled data is scarce on the client side, SSFL considers that participating clients train with unlabeled data, and only the central server has the necessary resources to access limited labeled data, making it an ideal fit for real-world applications (e.g., healthcare). However, traditional SSFL assumes that the data distributions in the training phase and testing phase are the same. In practice, domain shifts frequently occur, making it essential for SSFL to incorporate generalization capabilities and enhance its practicality. The core challenge is improving model generalization to new, unseen domains while the clients participate in SSFL. However, the decentralized setup of SSFL and unsupervised client training necessitate innovation to achieve improved generalization across domains. To achieve this, we propose a novel framework called the Unified Alignment Protocol (UAP), which consists of an alternating two-stage training process. The first stage involves training the server model to learn and align the features with a parametric distribution, which is subsequently communicated to clients without additional communication overhead. The second stage proposes a novel training algorithm that utilizes the server feature distribution to align client features accordingly. Our extensive experiments on standard domain generalization benchmark datasets across multiple model architectures reveal that the proposed UAP successfully achieves SOTA generalization performance in the SSFL setting.


7
Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning

Arani Roy ⋅ Marco P. E. Apolinario ⋅ Shristi Biswas Biswas ⋅ Kaushik Roy

Training deep neural networks (DNNs) with backpropagation (BP) achieves state-of-the-art accuracy but requires global error propagation and full parameterization, leading to substantial memory and computational overhead. Direct Feedback Alignment (DFA) enables local, parallelizable updates with lower memory requirements but is limited by unstructured feedback and poor scalability in deeper architectures, especially convolutional neural networks. To address these limitations, we propose a structured local learning framework that operates directly on low-rank manifolds defined by the Singular Value Decomposition (SVD) of weight matrices. Each layer is trained in its decomposed form, with updates applied to the SVD components using a composite loss that integrates cross-entropy, subspace alignment, and orthogonality regularization. Feedback matrices are constructed to match the SVD structure, ensuring consistent alignment between forward and feedback pathways. Our method reduces the number of trainable parameters relative to the original DFA model, without relying on pruning or post hoc compression. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method achieves accuracy comparable to that of BP. Ablation studies confirm the importance of each loss term in the low-rank setting. These results establish local learning on low-rank manifolds as a principled and scalable alternative to full-rank gradient-based training.
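The parameter saving from training a layer in decomposed form can be seen with a quick count. The sketch below assumes a rank-$r$ factorization $W \approx U \,\mathrm{diag}(s)\, V^\top$ of an $m \times n$ weight matrix; the layer size and rank are hypothetical and not taken from the paper.

```python
def full_rank_params(m: int, n: int) -> int:
    """Trainable entries in a dense m x n weight matrix."""
    return m * n

def svd_params(m: int, n: int, r: int) -> int:
    """Trainable entries in U (m x r), s (r), and V (n x r)."""
    return r * (m + n + 1)

# Hypothetical layer: 512 x 512 weights kept at rank 64.
dense = full_rank_params(512, 512)
low_rank = svd_params(512, 512, 64)
print(dense, low_rank, low_rank / dense)
```

The low-rank form is smaller whenever $r(m + n + 1) < mn$, which holds for ranks well below $\min(m, n)$.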


8
Learning from Unknown for Open-Set Test-Time Adaptation

Taki Hasan Rafi ⋅ Amit Agarwal ⋅ Hitesh Patel ⋅ Dong-Kyu Chae

Deep learning models often struggle to maintain performance when the training and testing data come from different distributions. Test-time adaptation (TTA) addresses this by adapting a pre-trained model to an unlabeled target domain under distribution shifts. A more challenging setting is open-set TTA (OSTTA), where the target domain may contain unknown samples outside the source classes. Existing OSTTA methods primarily detect and discard such unknowns, relying only on known samples for adaptation. In this work, we argue that unknown samples can also provide valuable cues for improving adaptation. We propose LU-OSTTA (learning from unknown for OSTTA), a simple yet effective framework that leverages both in-distribution and semantically useful out-of-distribution samples. Our approach introduces: (i) a class-conditioned dynamic energy threshold to separate OOD samples more reliably, (ii) an optimal transport–based pseudo-label refinement to mitigate noise under distribution shifts, and (iii) an adaptive prototype weighting strategy that emphasizes semantically aligned target samples while down-weighting harmful ones. Extensive experiments on CIFAR-C and Tiny-ImageNet-C benchmarks demonstrate that LU-OSTTA consistently outperforms state-of-the-art TTA and OSTTA methods, highlighting the benefits of utilizing rather than discarding unknown samples.


9
StreetView-Waste: A Multi-Task Dataset for Urban Waste Management

Diogo J. Paulo ⋅ João Martins ⋅ Hugo Proenca ⋅ João Neves

Urban waste management remains a critical challenge for the development of smart cities. Despite the growing number of litter detection datasets, the problem of monitoring overflowing waste containers, particularly from images captured by garbage trucks, has received little attention. While existing datasets are valuable, they often lack annotations for specific container tracking or are captured in static, decontextualized environments, limiting their utility for real-world logistics. To address this gap, we present StreetView-Waste, a comprehensive dataset of urban scenes featuring litter and waste containers. The dataset supports three key evaluation tasks: (1) waste container detection, (2) waste container tracking, and (3) waste overflow segmentation. Alongside the dataset, we provide baselines for each task by benchmarking state-of-the-art models in object detection, tracking, and segmentation. Additionally, we enhance baseline performance by proposing two complementary strategies: a heuristic-based method for improved waste container tracking and a model-agnostic framework that leverages geometric priors to refine litter segmentation. Our experimental results show that while fine-tuned object detectors achieve reasonable performance in detecting waste containers, baseline tracking methods struggle to accurately estimate their number; however, our proposed heuristics reduce the mean absolute counting error by 79.6%. Similarly, while segmenting amorphous litter is challenging, our geometric-aware strategy improves segmentation mAP by 27%, demonstrating the value of multimodal inputs for this task. Ultimately, StreetView-Waste provides a challenging benchmark to encourage research into real-world perception systems for urban waste management.


10
AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM

Sunghyun Ahn ⋅ Youngwan Jo ⋅ Kijung Lee ⋅ Sein Kwon ⋅ Inpyo Hong ⋅ Sanghyun Park

Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision. However, existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments. Consequently, users must retrain models or develop separate AI models for new environments, which requires expertise in machine learning, high-performance hardware, and extensive data collection, limiting the practical usability of VAD. To address these challenges, this study proposes the customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model. C-VAD considers user-defined text as an abnormal event and detects frames containing a specified event in a video. We effectively implemented AnyAnomaly using context-aware visual question answering without fine-tuning the large vision language model. To validate the effectiveness of the proposed model, we constructed C-VAD datasets and demonstrated the superiority of AnyAnomaly. Furthermore, our approach showed competitive performance on VAD benchmark datasets, achieving state-of-the-art results on the UBnormal dataset and outperforming other methods in generalization across all datasets.


11
SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering

Laura Bragagnolo ⋅ Leonardo Barcellona ⋅ Stefano Ghidoni

Accurate 3D human pose estimation is fundamental for applications such as autonomous driving, augmented reality, and human-robot interaction. State-of-the-art multi-view methods learn to fuse predictions across views by training on large annotated datasets, leading to poor generalization when the test scenario differs. To overcome these limitations, we propose SkelSplat, a novel framework for multi-view 3D human pose estimation based on differentiable Gaussian rendering. SkelSplat models human pose as a skeleton of 3D Gaussians, one for each joint, leveraging Gaussian Splatting to obtain a volumetric pose representation. The Gaussians are then optimized by minimizing a differentiable rendering loss, allowing seamless integration of information from arbitrary numbers and configurations of cameras without requiring any training on 3D ground truth. Since Gaussian Splatting was originally designed for dense scene reconstruction, we propose a novel one-hot encoding scheme that enables independent optimization of individual human joints. SkelSplat outperforms prior approaches, achieving 20.3 mm error on Human3.6M and 20.9 mm on CMU Panoptic Studio, without relying on any 3D ground-truth supervision. Experiments on Human3.6M-Occ and Occlusion-Person demonstrate its robustness to occlusions, without scenario-specific fine-tuning. Our code is available here: [removed for blind review].


12
Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation

Huaying Zhang ⋅ Atsushi Hashimoto ⋅ Tosho Hirasawa

Skilled human interviewers can extract valuable information from experts. This raises a fundamental question: what makes some questions more effective than others? To address this, a quantitative evaluation of question-generation models is essential. Video question generation (VQG) is a topic for video question answering (VideoQA), where questions are generated for given answers. Their evaluation typically focuses on the ability to answer questions, rather than the quality of generated questions. In contrast, we focus on the question quality in eliciting unseen knowledge from human experts. For a continuous improvement of VQG models, we propose a protocol that evaluates the ability by simulating question-answering communication with experts using a question-to-answer retrieval. We obtain the retriever by constructing a novel dataset, EgoExoAsk, which comprises 27,666 QA pairs generated from Ego-Exo4D's expert commentary annotation. The EgoExoAsk training set is used to obtain the retriever, and the benchmark is constructed on the validation set with Ego-Exo4D video segments. Experimental results demonstrate our metric reasonably aligns with question generation settings: models accessing richer context are evaluated better, supporting that our protocol works as intended. The EgoExoAsk dataset is included in the supplementary materials and will be publicly available upon publication.


13
Dragonite: Single-Step Drag-based Image Editing with Geometric-Semantic Guidance

Meng-Ting Jhong ⋅ Tai-Ming Huang ⋅ Shang-Fu Chen ⋅ Wen-Huang Cheng ⋅ Kailung Hua

Precision and efficiency are crucial in image editing, while existing methods face certain trade-offs. Drag-based image editing techniques enable precise pixel-level manipulation but often suffer from semantic ambiguity and require iterative optimization, which is time-consuming. Conversely, text-based editing methods provide global semantic guidance but lack spatial precision. To address this fundamental trade-off, we introduce Dragonite, a unified single-step image editing framework that seamlessly integrates geometric and semantic control. Our key innovation is a Dual Guidance Module that computes geometric guidance vectors through latent deformation mapping and projects semantic guidance from CLIP losses into the same vector space. An angle-aware fusion strategy then combines these guidance vectors, yielding a unified representation that preserves both semantic cues and geometric constraints. Meanwhile, we propose a Latent Optimization Module that performs single-step latent relocation followed by mean-adjusted interpolation, enhancing editing quality while minimizing distortions. Furthermore, we employ a Latent Stability Control mechanism to ensure image consistency throughout the diffusion inversion process. Comprehensive evaluations on the DragBench benchmark demonstrate that Dragonite successfully resolves the conventional trade-off between semantic accuracy and geometric precision, providing an intuitive, real-time solution for image editing.


14
Augmenting with NeRFs: Fast Relocalization on Densified Datasets

Michael Tomadakis ⋅ Rebecca Borissova ⋅ Yuxuan Zhang ⋅ Sanjeev Koppal

We reinterpret NeRFs as a resource for extreme data augmentation to improve the field of camera relocalization. Our approach enables us to automatically render a massive, densified dataset of novel views, even if given only sparse ground-truth viewpoints. Compared to other realtime relocalization methods, training a lightweight off-the-shelf vision backbone as a pose regressor on our expanded datasets significantly improves accuracy, uniquely enables relocalization of spatially-novel views, and performs well on portable-scale hardware.


15
GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving

William Ljungbergh ⋅ Adam Lilja ⋅ Adam Tonderski ⋅ Arvid Ling ⋅ Carl Lindström ⋅ Willem Verbeke ⋅ Junsheng Fu ⋅ Christoffer Petersson ⋅ Lars Hammarstrand ⋅ Michael Felsberg

Self-supervised pre-training based on next-token prediction has enabled large language models to capture the underlying structure of text, and has led to unprecedented performance on a large array of tasks when applied at scale. Similarly, autonomous driving generates vast amounts of spatiotemporal data, alluding to the possibility of harnessing scale to learn the underlying geometric and semantic structure of the environment and its evolution over time. In this direction, we propose a geometric and semantic self-supervised pre-training method, GASP, that learns a unified representation by predicting, at any queried future point in spacetime, (1) general occupancy, capturing the evolving structure of the 3D scene; (2) ego occupancy, modeling the ego vehicle path through the environment; and (3) distilled high-level features from a vision foundation model. By modeling geometric and semantic 4D occupancy fields instead of raw sensor measurements, the model learns a structured, generalizable representation of the environment and its evolution through time. We validate GASP on multiple autonomous driving benchmarks, demonstrating significant improvements in semantic occupancy forecasting, online mapping, and ego trajectory prediction. Our results demonstrate that continuous 4D geometric and semantic occupancy prediction provides a scalable and effective pre-training paradigm for autonomous driving. Code will be released.


16
Towards Reliable Test-Time Adaptation: Style Invariance as a Correctness Likelihood

Gilhyun Nam ⋅ Taewon Kim ⋅ Joonhyun Jeong ⋅ Eunho Yang

Test-time adaptation (TTA) enables efficient adaptation of deployed models, yet it often leads to poorly calibrated predictive uncertainty, a critical issue in high-stakes domains such as autonomous driving, finance, and healthcare. Existing calibration methods typically assume fixed models or static distributions, resulting in degraded performance under real-world, dynamic test conditions. To address these challenges, we introduce Style Invariance as a Correctness Likelihood (SICL), a framework that leverages style-invariance for robust uncertainty estimation. SICL estimates instance-wise correctness likelihood by measuring prediction consistency across style-altered variants, requiring only the model’s forward pass. This makes it a plug-and-play, backpropagation-free calibration module compatible with any TTA method. Comprehensive evaluations across four baselines, five TTA methods, and two realistic scenarios with three model architectures demonstrate that SICL reduces calibration error by an average of 13%p compared to conventional calibration approaches.


17
TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model

Alireza Javanmardi ⋅ Pragati Jaiswal ⋅ Tewodros Habtegebrial ⋅ Christen Millerdurai ⋅ Shaoxiang Wang ⋅ Alain Pagani ⋅ Didier Stricker

Recent advancements in diffusion models have significantly improved the realism and generalizability of character-driven animation, enabling the synthesis of high-quality motion from just a single RGB image and a set of driving poses. Nevertheless, generating temporally coherent long-form content remains challenging. Existing approaches are constrained by computational and memory limitations, as they are typically trained on short video segments, thus performing effectively only over limited frame lengths and hindering their potential for extended coherent generation. To address these constraints, we propose TalkingPose, a novel diffusion-based framework specifically designed for producing long-form, temporally consistent human upper-body animations. TalkingPose leverages driving frames to precisely capture expressive facial and hand movements, transferring these seamlessly to a target actor through a stable diffusion backbone. To ensure continuous motion and enhance temporal coherence, we introduce a feedback-driven mechanism built upon image-based diffusion models. Notably, this mechanism does not incur additional computational costs or require secondary training stages, enabling the generation of animations with unlimited duration. Additionally, we introduce a comprehensive, large-scale dataset to serve as a new benchmark for human upper-body animation. Our code and dataset will be publicly available upon publication.


18
Large Sign Language Models: Toward 3D American Sign Language Translation

Sen Zhang ⋅ Sen Zhang ⋅ Di Liu ⋅ Zhaoyang Xia ⋅ Mingyu Zhao ⋅ Chaowei Tan ⋅ Vivian Li ⋅ Bo Liu ⋅ Dimitri Metaxas ⋅ Mubbasir Kapadia

We present Large Sign Language Models (LSLM), a novel framework for translating 3D American Sign Language (ASL) by leveraging Large Language Models (LLMs) as the backbone, which can benefit hearing-impaired individuals' virtual communication. Unlike existing sign language recognition methods that rely on 2D video, our approach directly utilizes 3D sign language data to capture rich spatial, gestural, and depth information in 3D scenes. This enables more accurate and resilient translation, enhancing digital communication accessibility for the hearing-impaired community. Beyond the task of ASL translation, our work explores the integration of complex, embodied multimodal languages into the processing capabilities of LLMs, moving beyond purely text-based inputs to broaden their understanding of human communication. We investigate both direct translation from 3D gesture features to text and an instruction-guided setting where translations can be modulated by external prompts, offering greater flexibility. This work provides a foundational step toward inclusive, multimodal intelligent systems capable of understanding diverse forms of language.

We present GC-KBVQA, a zero-shot framework for knowledge-based visual question answering (KB-VQA) that requires no additional training. GC-KBVQA leverages pre-trained models together with carefully designed, context-aware descriptive information. The framework integrates three modules—(i) question-guided visual grounding, (ii) semantics-based caption filtering, and (iii) inter-stage feedback—that work together to generate concise, relevant prompts while reducing hallucinations and noisy auxiliary text. Despite its lightweight design, GC-KBVQA outperforms strong zero-shot baselines by up to +10.97% on OK-VQA, A-OKVQA, and VQAv2, and approaches the performance of few-shot systems without labeled data. The framework is model-agnostic, maintaining effectiveness across LLMs from TinyLLaMA-1B to Llama3-8B with minimal degradation. Ablation studies confirm that grounding, dual-caption generation, and both intra-stage filtering and inter-stage feedback each contribute to accuracy improvements. By combining efficiency, robustness, and modularity, GC-KBVQA provides a practical and scalable direction for zero-shot KB-VQA.


20
Learning Action Hierarchies via Hybrid Geometric Diffusion

Arjun Kaushik Kaushik ⋅ Nalini Ratha ⋅ Venu Govindaraju

Temporal action segmentation is a critical task in video understanding, where the goal is to assign action labels to each frame in a video. While recent advances leverage iterative refinement-based strategies, they fail to explicitly utilize the hierarchical nature of human actions. In this work, we propose HybridTAS - a novel framework that incorporates a hybrid of Euclidean and hyperbolic geometries into the denoising process of diffusion models to exploit the hierarchical structure of actions. Hyperbolic geometry naturally provides tree-like relationships between embeddings, enabling us to guide the action label denoising process in a coarse-to-fine manner: higher diffusion timesteps are influenced by abstract, high-level action categories (root nodes), while lower timesteps are refined using fine-grained action classes (leaf nodes). Extensive experiments on three benchmark datasets—GTEA, 50Salads, and Breakfast—demonstrate that our method achieves state-of-the-art performance, validating the effectiveness of hyperbolic-guided denoising for the temporal action segmentation task.


21
Learning Group Actions In Disentangled Latent Image Representations

Farhana Hossain Swarnali ⋅ Miaomiao Zhang ⋅ TONMOY HOSSAIN

Modeling group actions on latent representations enables controllable transformations of high-dimensional image data. Prior works applying group-theoretic priors or modeling transformations typically operate in the high-dimensional data space, where group actions apply uniformly across the entire input, making it difficult to disentangle the subspace that varies under transformations. While latent-space methods offer greater flexibility, they still require manual partitioning of latent variables into equivariant and invariant subspaces, limiting the ability to robustly learn and operate group actions within the latent space. To address this, we introduce a novel end-to-end framework that for the first time learns group actions on latent image manifolds, automatically discovering transformation-relevant structures without manual intervention. Our method uses learnable binary masks with straight-through estimation to dynamically partition latent representations into transformation-sensitive and invariant components. We formulate this within a unified optimization framework that jointly learns latent disentanglement and group transformation mappings. The framework can be seamlessly integrated with any standard encoder-decoder architecture. We validate our approach on five 2D/3D image datasets, demonstrating its ability to automatically learn disentangled latent factors for group actions, while downstream classification tasks confirm the effectiveness of the learned representations.


22
GAITGen: Disentangled Motion-Pathology Impaired Gait Generative Model -- Bringing Motion Generation to the Clinical Domain

Vida Adeli ⋅ Soroush Mehraban ⋅ Majid Mirmehdi ⋅ Alan Whone ⋅ Benjamin Filtjens ⋅ Amirhossein Dadashzadeh ⋅ Alfonso Fasano ⋅ Andrea Iaboni ⋅ Babak Taati

Gait analysis is crucial for the diagnosis and monitoring of movement disorders like Parkinson's Disease. While computer vision models have shown potential for objectively evaluating parkinsonian gait, their effectiveness is limited by scarce clinical datasets and the challenge of collecting large and well-labelled data, impacting model accuracy and risk of bias. To address these gaps, we propose GAITGen, a novel framework that generates realistic gait sequences conditioned on specified pathology severity levels. GAITGen employs a Conditional Residual Vector Quantized Variational Autoencoder to learn disentangled representations of motion dynamics and pathology-specific factors, coupled with Mask and Residual Transformers for conditioned sequence generation. GAITGen generates realistic, diverse gait sequences across severity levels, enriching datasets and enabling large-scale model training in parkinsonian gait analysis. Experiments on our new PD-GaM (real) dataset demonstrate that GAITGen outperforms adapted state-of-the-art models in both reconstruction fidelity and generation quality, accurately capturing critical pathology-specific gait features. A clinical user study confirms the realism and clinical relevance of our generated sequences. Moreover, incorporating GAITGen-generated data into downstream tasks improves parkinsonian gait severity estimation, highlighting its potential for advancing clinical gait analysis. Code, models, PD-GaM dataset, and synthetic samples will be publicly available.


23
Gradient-Free Classifier Guidance for Diffusion Model Sampling

Rahul Shenoy ⋅ Zhihong Pan ⋅ Kaushik Balakrishnan ⋅ Qisen Cheng ⋅ Yongmoon Jeon ⋅ Heejune Yang ⋅ Jaewon Kim

Unguided sampling in diffusion models is known to generate images of wide variations, albeit with a trade-off in image fidelity. Guided sampling methods, such as classifier guidance (CG) and classifier-free guidance (CFG), focus on sampling in well-learned high-probability regions to generate images of high fidelity, but each has its limitations. CG is computationally expensive due to the costly classifier gradient descent process, while CFG, being gradient-free, is more efficient but compromises class label alignment compared to CG. In this work, we introduce Gradient-Free Classifier Guidance (GFCG), a novel method utilizing a pre-trained classifier solely in inference mode, entirely avoiding gradient calculations. GFCG introduces two innovative mechanisms: (1) adaptive reference class selection, dynamically determining an appropriate undesired reference class; and (2) adaptive guidance scaling, dynamically adjusting the guidance strength based on classifier confidence during sampling. Experiments on both class-conditioned and text-to-image generation demonstrate that the proposed GFCG method consistently improves class prediction accuracy while preserving diversity. We also show that GFCG is complementary to other guided sampling methods. When combined with Autoguidance (ATG), GFCG attains record performance on ImageNet 512 $\times$ 512, with a $FD_{DINOv2}$ of 23.39, while maintaining a high Precision of 94.0% compared to ATG's 89.9%.


24
Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

Kai-Po Chang ⋅ Wei-Yuan Cheng ⋅ Chi-Pin Huang ⋅ Fu-En Yang ⋅ Frank Wang

Recent advancement in multimodal LLMs (MLLMs) has demonstrated their remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework for enabling object and action faithfulness by exempting the spurious correlations and enforcing the emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations that lie in the MLLM and transform the original captions to the contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match the regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on the hallucination examination benchmarks. Our code will be available upon acceptance.


25
Guided Model Merging for Hybrid Data Learning: Leveraging Centralized Data to Refine Decentralized Models

Junyi Zhu ⋅ Ruicong Yao ⋅ Taha Ceritli ⋅ Savas Ozkan ⋅ Matthew Blaschko ⋅ Eunchung Noh ⋅ Jeongwon Min ⋅ Cho Min ⋅ Mete Ozay

Current network training paradigms primarily focus on either centralized or decentralized data regimes. However, in practice, data availability often exhibits a hybrid nature, where both regimes coexist. This hybrid setting presents new opportunities for model training, as the two regimes offer complementary trade-offs: decentralized data is abundant but subject to heterogeneity and communication constraints, while centralized data—though limited in volume and potentially unrepresentative—enables better curation and high-throughput access. Despite its potential, effectively combining these paradigms remains challenging, and few frameworks are tailored to hybrid data regimes. To address this, we propose a novel framework that constructs a model atlas from decentralized models and leverages centralized data to refine a global model within this structured space. The refined model is then used to reinitialize the decentralized models. Our method synergizes federated learning (to exploit decentralized data) and model merging (to utilize centralized data), enabling effective training under hybrid data availability. Theoretically, we show that our approach achieves faster convergence than methods relying solely on decentralized data, due to variance reduction in the merging process. Extensive experiments demonstrate that our framework consistently outperforms purely centralized, purely decentralized, and existing hybrid-adaptable methods. Notably, our method remains robust even when the centralized and decentralized data domains differ or when decentralized data contains noise, significantly broadening its applicability.

In most multi-modal weakly supervised video anomaly detection (WSVAD) methods, cross-modal alignment relies on a cosine similarity measure, which captures only directional consistency and neglects feature magnitude information. Moreover, existing multi-instance learning frameworks usually adopt a Top-$k$ selection strategy, which adapts poorly to the differing proportions of anomalies across videos, resulting in missed anomalous segments and the introduction of label noise. To address these problems, an alignment and selection enhanced WSVAD (ASE-WSVAD) method is proposed. ASE-WSVAD combines a cross-modal alignment method based on a fused similarity measure (CA-FSM) with dual-scale adaptive selection (DSAS) to improve semantic consistency and detection performance. Specifically, visual-textual alignment is implemented by a fusion of complementary similarity measures (i.e., cosine and Euclidean distance similarity measures) so that the alignment objective jointly leverages both direction and magnitude. DSAS combines local instance-level selection (LILS) with global batch-aware selection (GBAS) to efficiently detect anomalous segments and handle videos with varying proportions of anomalies. Experimental results demonstrate that ASE-WSVAD achieves state-of-the-art performance, with an AUC of 89.00% and an AP of 85.44% on UCF-Crime and XD-Violence, respectively.
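To make the contrast between direction-only and magnitude-aware alignment concrete, here is a minimal sketch of one possible fused similarity in plain Python. The weighted-sum fusion, the weight `alpha`, and the `1 / (1 + distance)` mapping from Euclidean distance to a similarity are assumptions for illustration; the abstract does not state how the two measures are combined.

```python
import math

def cosine_similarity(u, v):
    """Directional agreement only; invariant to the scale of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_similarity(u, v):
    """Maps Euclidean distance to (0, 1]; sensitive to magnitude (assumed form)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + dist)

def fused_similarity(u, v, alpha=0.5):
    """Hypothetical fusion: convex combination of the two measures."""
    return alpha * cosine_similarity(u, v) + (1.0 - alpha) * euclidean_similarity(u, v)

# Two features with the same direction but different magnitude.
u = [1.0, 2.0]
v = [2.0, 4.0]

cos_uv = cosine_similarity(u, v)      # close to 1: cosine cannot tell them apart
fused_uv = fused_similarity(u, v)     # lower than the score of an exact match
fused_uu = fused_similarity(u, u)
```

Cosine similarity alone scores `u` and `v` as a perfect match, whereas the fused score penalizes the difference in magnitude.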

Understanding dynamic 4D environments (3D space evolving over time) is critical for robotic and interactive systems. These applications demand systems that can process streaming point cloud video in real-time, often under resource constraints, while also benefiting from past and present observations when available. However, current 4D backbone networks rely heavily on spatiotemporal convolutions and Transformers, which are often computationally intensive and poorly suited to real-time applications. We propose PointNet4D, a lightweight 4D backbone optimized for both online and offline settings. At its core is a Hybrid Mamba-Transformer temporal fusion block, which integrates the efficient state-space modeling of Mamba and the bidirectional modeling power of Transformers. This enables PointNet4D to handle variable-length online sequences efficiently across different deployment scenarios. To enhance temporal understanding, we introduce 4DMAP, a frame-wise masked auto-regressive pretraining strategy that captures motion cues across frames. Our extensive evaluations across 9 tasks on 7 datasets demonstrate consistent improvements across diverse domains. We further demonstrate PointNet4D's utility by building two robotic application systems: 4D Diffusion Policy (DP4) and 4D Imitation Learning (4DIL), achieving substantial gains on the RoboTwin and HandoverSim benchmarks.


28
Cross-Modal Event Encoder: Bridging Image–Text Knowledge to Event Streams

SungHeon Jeong ⋅ Hanning Chen ⋅ Sanggeon Yun ⋅ Suhyeon Cho ⋅ Wenjun Huang ⋅ Xiangjian Liu ⋅ Mohsen Imani

We introduce a robust event-centric encoder that expands the practical reach of event-based data across diverse tasks. To overcome the scarcity of large-scale event datasets, our approach adapts CLIP’s representation space to the event domain, preserving zero-shot learning and text alignment while mitigating catastrophic forgetting. By explicitly aligning event and image embeddings, the proposed encoder retains CLIP’s core strengths and delivers competitive performance on object recognition as well as zero-shot and few-shot benchmarks. Moreover, it generalizes effectively to event streams derived from video datasets without additional training. Finally, we show that the encoder integrates seamlessly into cross-modal architectures, enabling accurate event–image retrieval and unlocking new applications for the event modality.

Intrinsic image decomposition aims at separating an image into its underlying albedo and shading components, isolating the base color from lighting effects to enable downstream applications such as virtual relighting and scene editing. Despite the rise and success of learning-based approaches, intrinsic image decomposition from real-world images remains a significantly challenging task due to the scarcity of labeled ground-truth data. Most existing solutions rely on synthetic data in supervised setups, limiting their ability to generalize to real-world scenes. Self-supervised methods, on the other hand, often produce albedo-like maps that contain reflections and lack consistency under different lighting conditions. To address this, we propose SAIL, an approach designed to estimate illumination-invariant representations from single-view real-world images to specifically target plausible relighting. We repurpose the prior knowledge of a latent diffusion model for unconditioned scene relighting as a surrogate objective for learning light-invariant estimates. To achieve this, we introduce a novel intrinsic image decomposition fully formulated in the latent space. To guide the training of our latent diffusion model, we introduce regularization terms that constrain both the lighting-dependent and -independent components of our latent image decomposition. Through our experiments, we demonstrate that SAIL produces stable albedo-like representations under varying lighting conditions and generalizes to multiple scenes, using only unlabeled multi-illumination data available online.


30
PSDiffusion: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment

Dingbang Huang ⋅ Wenbo Li ⋅ Yifei Zhao ⋅ Xinyu Pan ⋅ Yanhong Zeng ⋅ Bo Dai

Transparent image layer generation plays a significant role in digital art and design workflows. Existing methods typically decompose transparent layers from a single RGB image using a set of tools or generate multiple transparent layers sequentially. Despite some promising results, these methods are often limited in their ability to model global layout, physically plausible interactions, and visual effects such as shadows and reflections with high alpha quality, because little global context is shared among layers. To address this issue, we propose PSDiffusion, a unified diffusion framework that leverages image composition priors from a pre-trained image diffusion model for simultaneous multi-layer text-to-image generation. Specifically, our method introduces a global layer interaction mechanism to generate layered images collaboratively, ensuring both individual layer quality and coherent spatial and visual relationships across layers. We include extensive experiments on benchmark datasets to demonstrate that PSDiffusion outperforms existing methods in generating multi-layer images with plausible structure and enhanced visual fidelity.

Despite recent advances in Open-Vocabulary Semantic Segmentation (OVSS), existing training-free methods face several limitations: computationally expensive affinity refinement strategies, ineffective fusion of transformer attention maps due to equal weighting, and reliance on fixed-size Gaussian kernels to reinforce local spatial smoothness, which enforces isotropic neighborhoods. We propose a strong baseline for training-free OVSS termed NERVE (Neighbourhood & Entropy-guided Random-walk for open-Vocabulary sEgmentation), which uniquely integrates global and fine-grained local information, exploiting the neighbourhood structure from the self-attention layer of a stable diffusion model. We also introduce a stochastic random walk for refining the affinity rather than relying on fixed-size Gaussian kernels for local context. This spatial diffusion process encourages propagation across connected and semantically related areas, enabling it to effectively delineate objects with arbitrary shapes. Whereas most existing approaches treat self-attention maps from different transformer heads or layers equally, our method uses entropy-based uncertainty to select the most relevant maps. Notably, our method does not require any conventional post-processing techniques like Conditional Random Fields (CRF) or Pixel-Adaptive Mask Refinement (PAMR). Experiments are performed on 7 popular semantic segmentation benchmarks, yielding an overall state-of-the-art zero-shot segmentation performance, providing an effective approach to open-vocabulary semantic segmentation.
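The entropy-based selection of attention maps mentioned above can be sketched as follows. This is only an illustration under stated assumptions: each map is treated as a flat list of non-negative weights, Shannon entropy is used as the uncertainty score, and lower entropy is assumed to mean a more relevant map. The abstract does not specify the scoring rule or its direction, and `select_maps` is a hypothetical helper.

```python
import math

def entropy(weights):
    """Shannon entropy (in nats) of a non-negative weight list, after normalization."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

def select_maps(maps, k):
    """Return the indices of the k maps with the lowest entropy (assumed criterion)."""
    ranked = sorted(range(len(maps)), key=lambda i: entropy(maps[i]))
    return ranked[:k]

# Three toy attention maps over four positions.
maps = [
    [0.25, 0.25, 0.25, 0.25],  # uniform: maximally uncertain
    [0.90, 0.05, 0.03, 0.02],  # sharply peaked
    [0.60, 0.20, 0.10, 0.10],  # moderately peaked
]

selected = select_maps(maps, 2)
```

With these toy maps, the uniform map has the highest entropy (`log 4`) and is the one left out.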


32
CLIP’s Visual Embedding Projector is a Few-shot Cornucopia

Mohammad Fahes ⋅ Tuan-Hung VU ⋅ Andrei Bursuc ⋅ Patrick Perez ⋅ Raoul de Charette

We introduce ProLIP, a simple and architecture-agnostic method for adapting contrastively pretrained vision-language models, such as CLIP, to few-shot classification. ProLIP fine-tunes the vision encoder’s projection matrix with Frobenius norm regularization on its deviation from the pretrained weights. It achieves state-of-the-art performance on 11 few-shot classification benchmarks under both "few-shot validation" and "validation-free" settings. Moreover, by rethinking the non-linear CLIP-Adapter through ProLIP’s lens, we design a regularized linear adapter that performs better, requires no hyperparameter tuning, and is less sensitive to learning rate values. Beyond few-shot classification, ProLIP excels in cross-dataset transfer, domain generalization, base-to-new class generalization, and test-time adaptation, where it outperforms prompt tuning while being an order of magnitude faster to train. Code will be made publicly available.
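The regularizer described above can be written as a penalty on the distance between the fine-tuned projection matrix and its pretrained value. The sketch below is an illustration only: the squared form of the Frobenius norm, the weight `lam`, and the helper names are assumptions, and the task loss is a placeholder number rather than a real few-shot loss.

```python
def frobenius_sq(w, w0):
    """Squared Frobenius norm of (w - w0) for matrices given as lists of rows."""
    return sum((x - y) ** 2 for row_w, row_w0 in zip(w, w0) for x, y in zip(row_w, row_w0))

def regularized_loss(task_loss, w, w0, lam):
    """Task loss plus a penalty that keeps w close to the pretrained w0."""
    return task_loss + lam * frobenius_sq(w, w0)

# Hypothetical 2x2 projection matrices, for illustration only.
w_pretrained = [[1.0, 0.0], [0.0, 1.0]]
w_finetuned = [[1.5, 0.0], [0.0, 0.5]]

penalty = frobenius_sq(w_finetuned, w_pretrained)
loss = regularized_loss(2.0, w_finetuned, w_pretrained, lam=0.1)
```

The penalty is zero when the weights are unchanged and grows quadratically with the deviation, so a larger `lam` keeps the fine-tuned matrix closer to the pretrained one.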

Establishing object correspondence between egocentric (ego) and exocentric (exo) views is a critical capability for robotic learning and human-robot interaction. The core task involves segmenting an object in one view given a query mask from the opposing view. This is notoriously difficult due to cluttered scenes with many task-irrelevant objects and drastic appearance changes across perspectives. To address this, we introduce RegionAligner, a unified text-visual framework that strategically focuses learning on task-relevant regions. Our method first uses a large vision-language model to identify and name salient objects, effectively filtering out visual distractors. These object phrases are then fused with visual features from both views. We introduce a novel region-guided supervision strategy that promotes focus, enforces spatial alignment, and minimizes appearance disparity between the ego-exo views. Furthermore, our framework seamlessly adapts to unsupervised settings by automatically generating pseudo-labels from matched mask proposals, drastically reducing annotation costs. Extensive experiments on the challenging Ego-Exo4D dataset show RegionAligner significantly outperforms existing baselines, improving IoU by 10.16% (ego-to-exo) and 6.04% (exo-to-ego).


34
Towards Streaming LiDAR Object Detection with Point Clouds as Egocentric Sequences

Mellon Zhang ⋅ Glen Chou ⋅ Saibal Mukhopadhyay

Accurate and low-latency 3D object detection is essential for autonomous driving, where safety hinges on both rapid response and reliable perception. While rotating LiDAR sensors are widely adopted for their robustness and fidelity, current detectors face a trade-off: streaming methods process partial polar sectors on the fly for fast updates but suffer from limited visibility, cross-sector dependencies, and distortions from retrofitted Cartesian designs, whereas full-scan methods achieve higher accuracy but are bottlenecked by the inherent latency of a LiDAR revolution. We propose Polar-Fast-Cartesian-Full (PFCF), a hybrid detector that combines fast polar processing for intra-sector feature extraction with accurate Cartesian reasoning for full-scene understanding. Central to PFCF is a custom Mamba SSM-based streaming backbone with dimensionally-decomposed convolutions that avoids distortion-heavy planes, enabling parameter-efficient, translation-invariant, and distortion-robust polar representation learning. Local sector features are extracted via this backbone, then accumulated into a sector feature buffer to enable efficient inter-sector communication through a full-scan backbone. PFCF establishes a new Pareto frontier on the Waymo Open dataset, surpassing prior streaming baselines by 10% mAP and matching full-scan accuracy at twice the update rate.


35
Show Me: Unifying Instructional Image and Video Generation with Diffusion Models

Yujiang Pu ⋅ Zhanbo Huang ⋅ Vishnu Boddeti ⋅ Yu Kong

Generating visual instructions in a given context is essential for developing interactive world simulators. While prior works address this problem through either text-guided image manipulation or video prediction, these tasks are typically treated in isolation. This separation reveals a fundamental issue: image manipulation methods overlook how actions unfold over time, while video prediction models often ignore the intended outcomes. To this end, we propose ShowMe, a unified framework that enables both tasks by selectively activating the spatial and temporal components of video diffusion models. In addition, we introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Notably, this unification brings dual benefits: the spatial knowledge gained through video pretraining enhances contextual consistency and realism in non-rigid image edits, while the instruction-guided manipulation stage equips the model with stronger goal-oriented reasoning for video prediction. Experiments on diverse benchmarks demonstrate that our method outperforms expert models in both instructional image and video generation, highlighting the strength of video diffusion models as a unified action-object state transformer.


36
VOCAL: Visual Odometry via ContrAstive Learning

Chi-Yao Huang ⋅ Zeel Bhatt ⋅ “YZ” Yezhou Yang

Breakthroughs in visual odometry (VO) have fundamentally reshaped the landscape of robotics, enabling ultra-precise camera state estimation that is crucial for modern autonomous systems. Despite these advances, many learning-based VO techniques rely on rigid geometric assumptions, which often fall short in interpretability and lack a solid theoretical basis within fully data-driven frameworks. To overcome these limitations, we introduce VOCAL (Visual Odometry via ContrAstive Learning), a novel framework that reimagines VO as a label ranking challenge. By integrating Bayesian inference with a representation learning framework, VOCAL organizes visual features to mirror camera states. The ranking mechanism compels similar camera states to converge into consistent and spatially coherent representations within the latent space. This strategic alignment not only bolsters the interpretability of the learned features but also ensures compatibility with multimodal data sources. Extensive evaluations on the KITTI dataset highlight VOCAL’s enhanced interpretability and flexibility, pushing VO toward more general and explainable spatial intelligence.


37
Pose-Diverse Multi-View Virtual Try-on from a Single Frontal Image via Diffusion Transformer

Seonghee Han ⋅ Minchang Chung ⋅ Gyeongsu Cho ⋅ Kyungdon Joo ⋅ Taehwan Kim

This study addresses the challenge of creating a virtual try-on system where a user provides a single frontal image of themselves and a single frontal image of a garment. While most existing approaches focus on single-view synthesis, their reliance on a single, fixed viewpoint limits their application in immersive environments that require diverse poses and viewpoints. The ability to generate a multi-view virtual try-on result is crucial for a comprehensive user experience, as it allows the user to inspect the garment from all sides, including the back and sides, providing a similar experience to a real fitting room. In this paper, we propose a novel framework for pose-controllable, multi-view virtual try-on from a single image. Unlike conventional methods that require multiple images of the user or the garment from various angles, our model eliminates this burden by synthesizing multi-view results from a single input image pair. Our method not only generates realistic try-on images but also enables users to virtually inspect the fit and arrangement of the garment from multiple angles without the need for additional data. Our extensive experiments demonstrate that our framework achieves superior image quality and pose diversity.


38
SimForce: Force and Surface Electromyography from Full Body Video with Graph Neural Nets

Esha Dasgupta ⋅ Boeun Kim ⋅ Sang-Hoon Yeo ⋅ Hyung Jin Chang

We propose a novel framework, named SimForce, for simultaneously estimating skeletal pose, ground reaction force and surface electromyography from an input video. SimForce predicts a more biomechanically accurate 3D human pose and shape for a given subject, along with the estimated muscle activations and the resultant ground reaction force that produce the input motion. Previous research has focused on estimating these attributes singly and has not treated them as related tasks, ignoring the inherent shared motion between the three. In contrast, SimForce is designed to take advantage of the shared biological structure of the human body and its intrinsic connections to infer these attributes jointly using past, current, and future frames. SimForce features a newly introduced temporal and attention aware GCN-based architecture. To learn the subtle links between the body parts and how they affect the distribution of weight on the muscles over time, we introduce the Spatially Aware Attention Module.

LiDAR-based tracking-by-attention (TBA) frameworks inherently suffer from high false negative errors, leading to a significant performance gap compared to traditional LiDAR tracking-by-detection (TBD) methods. This paper introduces SCATR, a novel LiDAR-based TBA model designed to address this fundamental challenge systematically. SCATR leverages recent progress in vision-based tracking and incorporates targeted training strategies specifically adapted for LiDAR. Our work's core innovations are two architecture-agnostic training strategies for TBA methods: Second Chance Assignment and Track Query Dropout. Second Chance Assignment is a novel ground truth assignment that concatenates unassigned track queries to the proposal queries before bipartite matching, giving these track queries a second chance to be assigned to a ground truth object and effectively mitigating the conflict between detection and tracking tasks inherent in tracking-by-attention. Track Query Dropout is a training method that diversifies supervised object query configurations to efficiently train the decoder to handle different track query sets, enhancing robustness to missing or newborn tracks. Experiments on the nuScenes tracking benchmark demonstrate that SCATR achieves state-of-the-art performance among LiDAR TBA methods, outperforming previous works by 7.6% AMOTA and successfully bridging the long-standing performance gap between LiDAR-based TBA and TBD methods. Ablation studies further validate the individual and combined effectiveness of Second Chance Assignment and Track Query Dropout, highlighting their combined impact on improving tracking performance. Anonymized code can be found at the following link: https://anonymous.4open.science/r/scatr-anon-C54A/

This paper introduces AirLock+, an end-to-end vision system for scalable UAV-to-satellite image registration, enabling two key downstream tasks: (i) precise target geolocalization in geodetic coordinates and (ii) geospatial augmented reality to elevate situation awareness. AirLock+ comprises three modules: A predictive tracker first localizes targets in UAV image frames, while a cross-view image matcher generates robust UAV-to-satellite homographies that withstand severe domain gaps, outdated satellite imagery, and generalize to unseen environments without finetuning. The resulting pixel-to-world correspondences enable target pixel coordinates to be mapped into geodetic space, yielding continuous trajectory estimates and supporting geospatial augmentation of UAV video feeds. Our system achieves an average target localization error of 20.23 m across 7.8 km real-world trajectories, demonstrating robustness in high-altitude, oblique-view conditions where existing methods typically fail. AirLock+ addresses the operational demands of search-and-rescue missions and is actively deployed in the semifinal stage of a global wildfire response competition (competition name omitted for double-blind review).


41
Blur2Sharp: Human Novel Pose and View Synthesis with Generative Prior Refinement

Chia Lai ⋅ I-Hsuan Lo ⋅ Yen-Ku Yeh ⋅ Thanh-Nguyen Truong ⋅ Ching-Chun Huang

The creation of lifelike human avatars capable of realistic pose variation and viewpoint flexibility remains a fundamental challenge in computer vision and graphics. Current approaches typically yield either geometrically inconsistent multi-view images or sacrifice photorealism, resulting in blurry outputs under diverse viewing angles and complex motions. To address these issues, we propose Blur2Sharp, a novel framework integrating 3D-aware neural rendering and diffusion models to generate sharp, geometrically consistent novel-view images from only a single reference view. Our method employs a dual-conditioning architecture: initially, a Human NeRF model generates geometrically coherent multi-view renderings for target poses, explicitly encoding 3D structural guidance. Subsequently, a diffusion model conditioned on these renderings refines the generated images, preserving fine-grained details and structural fidelity. We further enhance visual quality through hierarchical feature fusion, incorporating texture, normal, and semantic priors extracted from parametric SMPL models to simultaneously improve global coherence and local detail accuracy. Extensive experiments demonstrate that Blur2Sharp consistently surpasses state-of-the-art techniques in both novel pose and view generation tasks, particularly excelling under challenging scenarios involving loose clothing and occlusions.


42
Power of Boundary and Reflection: Semantic Transparent Object Segmentation using Pyramid Vision Transformer with Transparent Cues

Tuan-Anh Vu ⋅ Nguyen Hai ⋅ Ziqiang Zheng ⋅ Binh-Son Hua ⋅ Qing Guo ⋅ Ivor Tsang ⋅ Sai-Kit Yeung

Glass is a prevalent material among solid objects in everyday life, but segmentation methods struggle to distinguish it from opaque materials due to its transparency and reflection. While it is known that human perception relies on boundary and reflective object features to recognize glass objects, the existing literature has yet to sufficiently capture both properties in handling transparent objects. Hence, we propose to incorporate both of these powerful visual cues via Boundary Feature Enhancement and Reflection Feature Enhancement modules in a mutually beneficial way. Our proposed framework, TransCues, is a pyramidal transformer encoder-decoder architecture to segment transparent objects. We empirically show that these two modules can be used together effectively, improving overall performance on various benchmark datasets, including semantic segmentation of glass object datasets, mirror object datasets, and generic segmentation datasets of both. Our method outperforms the state-of-the-art by a large margin, achieving +4.2% mIoU on Trans10K-v2, +5.6% mIoU on MSD, +10.1% mIoU on RGBD-Mirror, +13.1% mIoU on TROSD, and +8.3% mIoU on Stanford2D3D, showing the effectiveness of our method against glass objects.

Personalized Federated Learning (PFL) aims to deliver effective client-specific models under heterogeneous data distributions while preserving privacy. Despite recent progress, existing approaches face two key limitations: (i) shallow, one-way prototype alignment that underutilizes hierarchical semantics and risks suppressing client-specific cues, and (ii) brittle server-side knowledge transfer that propagates teacher bias and destabilizes global updates. We propose HEART-PFL, a dual-sided framework that addresses these challenges through Hierarchical Directional Alignment (HDA) and Adversarial Knowledge Transfer (AKT). On the client side, HDA performs depth-aware alignment by enforcing cosine similarity in early layers for directional consistency and mean-squared matching in deeper layers for semantic precision, thereby leveraging hierarchical features without erasing personalization. On the server side, AKT strengthens ensemble knowledge transfer with bidirectional, symmetric-KL distillation on both clean and adversarial proxy samples, mitigating personalization bias and enhancing the stability of global model updates. Implemented with lightweight adapters requiring only 1.46M trainable parameters, HEART-PFL achieves state-of-the-art personalized accuracy on CIFAR-100, Flowers-102, and Caltech-101 (63.42%, 84.23%, and 95.67%, respectively) under Dirichlet non-IID partitions, while remaining robust to out-of-domain proxy data. Ablations corroborate that HDA yields hierarchy-aware gains, AKT improves robustness and server-side stability, and their combination delivers the strongest, and most stable, personalization.
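For readers unfamiliar with the symmetric-KL term named above, the following minimal sketch computes it for two discrete distributions in plain Python. Summing the two directions (rather than averaging them) and the toy teacher and student distributions are assumptions for illustration; the abstract does not give the exact form used in AKT.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Bidirectional divergence: KL(p || q) + KL(q || p) (assumed unweighted sum)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Hypothetical softmax outputs of a teacher ensemble and a student model.
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]

forward = kl_divergence(teacher, student)
backward = kl_divergence(student, teacher)
sym = symmetric_kl(teacher, student)
```

Plain KL divergence depends on the argument order, so `forward` and `backward` generally differ, while the symmetric form treats both models alike and is zero only when the two distributions coincide.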


44
Detecting Social Engagement of Elderly From Lifelog Image-streams to Identify Effective Cues for Autobiographic Recall

Vengateswaran Subramaniam ⋅ Vigneshwaran Subbaraju ⋅ Debaditya Roy ⋅ Pramath Krishna ⋅ Thivya Kandappu ⋅ Qianli Xu

Lifelog images captured automatically by wearable cameras serve as effective cues that induce Autobiographic Memory Recall (AMR) during personalized memory interventions. However, manual selection of images for such therapy imposes a significant load on caregivers. To reduce this load, automated tools that identify moments involving significant engagement of the camera wearer in social interactions are needed. To achieve this, we re-annotate images extracted from public lifelog datasets for the presence of non-verbal social signals and the perceived engagement of the life-logger during interactions. We use this data to develop deep learning models and explore how social signals and the detected intensity of social engagement influence the predictions of AMR from lifelogs. We show that understanding visual social engagement can enhance AMR prediction, demonstrating the potential of the models in reducing caregivers' effort.

Foundation models (FM) in digital pathology have revolutionized the field of whole slide image (WSI) analysis, with models such as UNI, Virchow, Prov-GigaPath, and many more outperforming the previously established benchmarks set by the ImageNet-based backbones. However, despite several benchmarking studies, there has been no clear consensus on the choice of a single FM that is best suited for a variety of histopathology datasets and/or tasks. With more than 25 pathology FMs in the literature so far, the challenge of model selection is a growing concern. Although an ensemble of FMs can circumvent this issue, given the bulky nature of individual FMs, the inference time and computational cost drastically increase with the addition of each FM into the ensemble. To this end, we propose HistoMILKD, the first multi-teacher knowledge distillation (MKD) framework for WSI classification. To handle the gigapixel resolution of WSIs, we use multiple instance learning (MIL), making this also the first work to integrate MIL and MKD frameworks into a single model. Our approach leverages the complementary representations of different FMs to distill collective task-specific knowledge into a single trainable MIL adapter on top of the student FM, which is utilized during inference. Benchmarked on three public datasets, the proposed approach significantly ($p<0.05$) outperforms the individual FMs, their ensemble, and previous MKD approaches in WSI classification.

Vectorized glyphs are widely used in poster design, network animation, art display, and various other fields due to their scalability and flexibility. In typography, they are often seen as special sequences composed of ordered strokes. This concept extends to the token sequence prediction abilities of large language models (LLMs), enabling vectorized character generation through stroke modeling. In this paper, we propose a novel Large Vectorized Glyph Model (LVGM) designed to generate vectorized Chinese glyphs by predicting the next stroke. Initially, we encode strokes into discrete latent variables called stroke embeddings. Subsequently, we train our LVGM via fine-tuning DeepSeek LLM by predicting the next stroke embedding. With limited strokes given, it can generate complete characters, semantically elegant words, and even unseen verses in vectorized form. Moreover, we release a new large-scale Chinese SVG dataset containing 907,267 samples based on strokes for dynamically vectorized glyph generation. Experimental results show that our model has scaling behaviors on data scales. Our generated vectorized glyphs have been validated by experts and relevant individuals.

This work investigates text-to-texture synthesis using diffusion models to generate physically-based texture maps. We aim to achieve realistic model appearances under varying lighting conditions. A prominent solution for the task is score distillation sampling. It allows recovering a complex texture using gradient guidance given a differentiable rasterization and shading pipeline. However, in practice, the aforementioned solution in conjunction with the widespread latent diffusion models produces severe visual artifacts and requires additional regularization such as implicit texture parameterization. As a more direct alternative, we propose an approach using cascaded diffusion models for texture synthesis (CasTex). In our setup, score distillation sampling yields high-quality textures out of the box. In particular, we were able to omit implicit texture parameterization in favor of an explicit parameterization to improve the procedure. In the experiments, we show that our approach significantly outperforms state-of-the-art optimization-based solutions on public texture synthesis benchmarks.


48
DATTA: Domain-Adversarial Test-Time Adaptation for Cross-Domain WiFi-Based Human Activity Recognition

Julian Strohmayer ⋅ Rafael Sterzinger ⋅ Matthias Wödlinger ⋅ Martin Kampel

WiFi-based human activity recognition (HAR) faces significant challenges in cross-domain generalization due to dynamic environmental variations, device heterogeneity, and subtle changes in human behavior. In this paper, we introduce DATTA – Domain-Adversarial Test-Time Adaptation – a novel framework that combines domain-adversarial training (DAT) with test-time adaptation (TTA) and a random weight-resetting mechanism. Unlike previous approaches that apply these techniques in isolation, DATTA is specifically tailored for WiFi-based HAR: it leverages DAT to learn robust, domain-invariant features while TTA continuously refines the model on streaming data. To mitigate catastrophic forgetting during adaptation, we incorporate a weight-resetting mechanism, ensuring sustained performance over prolonged domain shifts. Our extensive experiments on the Widar3.0-G6D dataset demonstrate that DATTA not only outperforms state-of-the-art methods by up to 8.1% in F1-Score but also achieves real-time inference with a lightweight architecture, making it a compelling solution for practical WiFi sensing applications. The PyTorch implementation of DATTA is publicly available at: https://github.com/redactedForDoubleBlindReview.


49
MixER: From Cross-Modal to Mixed-Modal Visible-Infrared Re-Identification

Alehdaghi ⋅ Rajarshi Bhattacharya ⋅ Dai Yannick ⋅ Pourya Shamsolmoali ⋅ Rafael M. O. Cruz ⋅ Eric Granger

Visible-infrared person re-identification (VI-ReID) aims to match individuals across different camera modalities, a critical task in modern surveillance systems. While current VI-ReID methods focus on cross-modality matching, real-world applications often involve mixed galleries containing both V and I images, where state-of-the-art methods show significant performance limitations due to large domain shifts and low discrimination across mixed modalities. This is because gallery images from the same modality may have lower domain gaps but correspond to different identities. This paper introduces a novel mixed-modal ReID setting, where galleries contain data from both modalities. To address the domain shift among inter-modal and low discrimination capacity in intra-modal matching, we propose the Mixed Modality-Erased and -Related (MixER) method. The MixER learning approach disentangles modality-specific and modality-shared identity information through orthogonal decomposition, modality-confusion, and ID-modality-related objectives. MixER enhances feature robustness across modalities, improving cross-modal and mixed-modal settings performance. Our extensive experiments on the SYSU-MM01, RegDB and LLMC datasets indicate that our approach can provide state-of-the-art results using a single backbone, and showcase the flexibility of our approach in mixed gallery applications.


50
LighthouseGS: Indoor Structure-aware 3D Gaussian Splatting for Panorama-Style Mobile Captures

Seungoh Han ⋅ Jaehoon Jang ⋅ Hyunsu Kim ⋅ Jaeheung Surh ⋅ Junhyung Kwak ⋅ Hyowon Ha ⋅ Kyungdon Joo

We introduce LighthouseGS, a practical novel view synthesis framework based on 3D Gaussian Splatting that utilizes simple panorama-style captures from a single mobile device. While convenient, this rotation-dominant motion and narrow baseline make accurate camera pose and 3D point estimation challenging, especially in textureless indoor scenes. To address these challenges, LighthouseGS leverages rough geometric priors, such as mobile device camera poses and monocular depth estimation, and utilizes indoor planar structures. Specifically, we propose a new initialization method called plane scaffold assembly to generate consistent 3D points on these structures, followed by a stable pruning strategy to enhance geometry and optimization stability. Additionally, we present geometric and photometric corrections to resolve inconsistencies from motion drift and auto-exposure in mobile devices. Tested on real and synthetic indoor scenes, LighthouseGS delivers photorealistic rendering, outperforming state-of-the-art methods and enabling applications like panoramic view synthesis and object placement.

3D scene understanding is a critical yet challenging task in autonomous driving due to the irregularity and sparsity of LiDAR data, as well as the computational demands of processing large-scale point clouds. Recent methods leverage range-view representations to enhance efficiency, but they often adopt higher azimuth resolutions to mitigate information loss during spherical projection, where only the closest point is retained for each 2D grid. However, processing wide panoramic range-view images remains inefficient and may introduce additional distortions. Our empirical analysis shows that training with multiple range images, obtained from splitting the full point cloud, improves both segmentation accuracy and computational efficiency. However, this approach also poses new challenges of exacerbated class imbalance and increase in projection artifacts. To address these, we introduce FLARES, a novel training paradigm that incorporates two tailored data augmentation techniques and a specialized post-processing method designed for multi-range settings. Extensive experiments demonstrate that FLARES is highly generalizable across different architectures, yielding 2.1%–7.9% mIoU improvements on SemanticKITTI and 1.8%–3.9% mIoU on nuScenes, while delivering over 40% speed-up in inference. The code will be released based on the acceptance.
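
The spherical projection mentioned above, in which only the closest point is kept for each 2D grid cell, can be sketched as follows. This is a minimal illustration of the commonly used range-view projection, not the FLARES implementation; the function name, the field-of-view defaults, and the clamping of out-of-view points to the border rows are assumptions made for the example.

```python
import math

def spherical_projection(points, height, width, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (x, y, z) points to a height x width range image.

    Each cell stores the range of the closest point that falls into it,
    or None if no point does. The field-of-view values are illustrative.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[None] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # the sensor origin has no defined direction
        yaw = math.atan2(y, x)      # azimuth angle
        pitch = math.asin(z / r)    # elevation angle
        u = 0.5 * (1.0 - yaw / math.pi) * width
        v = (1.0 - (pitch - fov_down) / fov) * height
        # Simplification: points outside the view are clamped to the border.
        col = min(max(int(u), 0), width - 1)
        row = min(max(int(v), 0), height - 1)
        # Keep only the closest point per cell; farther points are discarded.
        if image[row][col] is None or r < image[row][col]:
            image[row][col] = r
    return image
```

Two points along the same ray land in the same cell, and only the nearer one survives, which is the information loss that a higher azimuth resolution (a larger `width`) is meant to reduce.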


52
Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression

Roy Jennings ⋅ Genady Paikin ⋅ Roy Shaul ⋅ Evgeny Soloveichik

Multimodal Large Language Models (MLLMs) show promise for image-based regression tasks, but current approaches face key limitations. Recent methods fine-tune MLLMs using preset output vocabularies and generic task-level prompts (e.g., "How would you rate this image?"), assuming this mimics human rating behavior. Our analysis reveals these approaches provide no benefit over image-only training. Models using preset vocabularies and generic prompts perform equivalently to image-only models, failing to leverage semantic understanding from textual input. We propose Regression via Transformer-Based Classification (RvTC), which replaces vocabulary-constrained classification with a flexible bin-based approach. Unlike approaches that address discretization errors through complex distributional modeling, RvTC eliminates manual vocabulary crafting through straightforward bin increase, achieving state-of-the-art performance on four image assessment datasets using only images. More importantly, we demonstrate that data-specific prompts dramatically improve performance. Unlike generic task descriptions, prompts containing semantic information about specific images enable MLLMs to leverage cross-modal understanding. On the AVA dataset, adding challenge titles to prompts improves correlations from 0.83 to 0.90, a new state-of-the-art. We demonstrate through empirical evidence from the AVA and AGIQA-3k datasets that MLLMs benefit from semantic prompt information surpassing mere statistical biases. This underscores the importance of incorporating meaningful textual context in multimodal regression tasks.
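
As a rough illustration of bin-based regression via classification, the sketch below discretizes a continuous score into equal-width bins for training and decodes classifier logits back to a score. The abstract does not specify how RvTC decodes its predictions; the expected-value decoding and all function names here are assumptions made for the example.

```python
import math

def score_to_bin(score, lo, hi, num_bins):
    """Map a continuous score in [lo, hi] to a class index in [0, num_bins - 1]."""
    width = (hi - lo) / num_bins
    idx = int((score - lo) / width)
    return min(max(idx, 0), num_bins - 1)

def bin_centers(lo, hi, num_bins):
    """Return the midpoint of each equal-width bin."""
    width = (hi - lo) / num_bins
    return [lo + (i + 0.5) * width for i in range(num_bins)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_to_score(logits, lo, hi):
    """Decode logits to a score as the probability-weighted mean of bin centers
    (an assumed decoding rule, chosen for illustration)."""
    probs = softmax(logits)
    centers = bin_centers(lo, hi, len(logits))
    return sum(p * c for p, c in zip(probs, centers))
```

With this scheme, increasing `num_bins` shrinks the bin width and therefore the worst-case discretization error, without requiring a hand-crafted output vocabulary.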


53
MANTA: Physics-Informed Generalized Underwater Object Tracking

Suhas Srinath ⋅ Hemang Jamadagni ⋅ Aditya Chandrasekar ⋅ Prathosh AP

Underwater object tracking is challenging due to wavelength-dependent attenuation and scattering, which severely distort appearance across depths and water conditions. Existing trackers, typically trained on terrestrial data, fail to generalize to these physics-driven degradations. We present MANTA, a physics-informed framework that involves both representation learning and tracking design for underwater scenarios. We propose a dual-positive contrastive learning strategy that couples temporal consistency with Beer–Lambert augmentations, yielding generalizable features robust to temporal and underwater distortions. We further introduce a multi-stage tracking pipeline where a motion-based primary tracker is augmented with a physics-informed secondary association algorithm that integrates geometric consistency and appearance similarity for efficient re-identification under occlusion, disappearance, and drift. To complement standard IoU metrics, we propose Center–Scale Consistency (CSC) and Geometric Alignment Score (GAS) to assess geometric fidelity. Experiments on four underwater benchmarks (WebUOT-1M, UOT32, UTB180, UWCOT220) show that MANTA achieves state-of-the-art performance, improving Success AUC by up to 6%, while ensuring stable long-term generalized underwater tracking and efficient runtime.
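
The Beer–Lambert law models light attenuation as an exponential decay with path length, with a different decay rate per wavelength. A minimal sketch of an augmentation built on that law is shown below; the abstract does not describe MANTA's actual augmentation, so the function name and the per-channel coefficients are hypothetical values chosen only to show red light fading fastest.

```python
import math

def beer_lambert_attenuate(pixel, depth_m, beta=(0.60, 0.10, 0.05)):
    """Attenuate an (R, G, B) pixel with values in [0, 1] over a water path.

    Applies I_c = J_c * exp(-beta_c * d) per channel. The beta values
    (per metre) are illustrative assumptions, not measured coefficients.
    """
    return tuple(value * math.exp(-b * depth_m) for value, b in zip(pixel, beta))
```

For example, `beer_lambert_attenuate((1.0, 1.0, 1.0), 5.0)` leaves the blue channel much brighter than the red one, which is the colour cast such an augmentation is intended to simulate.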


54
KD360-VoxelBEV: LiDAR and 360-degree Camera Cross Modality Knowledge Distillation for Bird’s-Eye-View Segmentation

Wenke E ⋅ Yixin Sun ⋅ Jiaxu Liu ⋅ Hubert P. H. Shum ⋅ Amir Atapour-Abarghouei ⋅ Toby Breckon

We introduce KD360-VoxelBEV, the first architecture that applies cross-modality knowledge distillation to the Bird's-Eye-View (BEV) segmentation task. Our approach leverages a novel LiDAR image representation fused from range, intensity and ambient channels, together with a voxel-aligned view transformer that preserves spatial fidelity while enabling efficient BEV processing. During training, a high-capacity LiDAR and camera fusion teacher network extracts both rich spatial and semantic features for cross-modality knowledge distillation into a lightweight student network that relies solely on a single 360-degree panoramic camera image. Extensive experiments on the Dur360BEV dataset demonstrate that our teacher model significantly outperforms existing camera-based BEV segmentation methods, achieving a 25.6% IoU improvement. Meanwhile, the distilled student network attains competitive performance with an 8.5% IoU gain and state-of-the-art inference speed of 31.2 FPS. Moreover, evaluations on KITTI-360 (two fisheye cameras) confirm that our distillation framework generalises to diverse camera setups, underscoring its feasibility and robustness. This approach reduces sensor complexity and deployment costs while providing a practical solution for efficient, low-cost BEV segmentation in real-world autonomous driving.


55
MAFM³: Modular Adaptation of Foundation Models for Multi-Modal Medical AI

Mohammad Qazi ⋅ Munachiso Nwadike ⋅ Ibrahim Almakky ⋅ Mohammad Yaqub ⋅ Numan Saeed

Foundational models are trained on extensive datasets to capture the general trends of a domain. However, in medical imaging, the scarcity of data makes pre-training for every domain, modality, or task challenging. Instead of building separate models, we propose MAFM³ (Modular Adaptation of Foundation Models for Multi-Modal Medical AI), a framework that enables a single foundation model to expand into diverse domains, tasks, and modalities through lightweight modular components. These components serve as specialized skill sets that allow the system to flexibly activate the appropriate capability at inference time, depending on the input type or clinical objective. Unlike conventional adaptation methods that treat each new task or modality in isolation, MAFM³ provides a unified and expandable framework for efficient multitask and multimodality adaptation. Empirically, we validate our approach by adapting a chest CT foundation model initially trained for classification into prognosis and segmentation modules. Our results show improved performance on both tasks. Furthermore, by incorporating PET scans, MAFM³ achieved a 5% improvement in the Dice score compared to the respective baselines. These findings establish that foundation models, when equipped with modular components, are not inherently constrained to their initial training scope but can evolve into multitask, multimodality systems for medical imaging. The code implementation of this work will be made available upon acceptance.


56
SPOC: Spatially-Progressing Object State Change Segmentation in Video

Priyanka Mandikal ⋅ Tushar Nagarajan ⋅ Alex Stoken ⋅ Zihui Xue ⋅ Kristen Grauman

Object state changes in video reveal critical information about human and agent activity. However, existing methods are limited to temporal localization of when the object is in its initial state (e.g., cheese block) versus when it has completed a state change (e.g., grated cheese), which limits applicability for any task requiring detailed information about the progress of the action and its spatial localization. We propose to deepen the problem by introducing the spatially-progressing object state change segmentation task. The goal is to segment at the pixel-level those regions of an object that are actionable and those that are transformed. We introduce the first model to address this task, designing a VLM-based pseudo-labeling approach, state-change dynamics constraints, and a novel WhereToChange benchmark built on in-the-wild Internet videos. Experiments on two datasets validate both the challenge of the new task as well as the promise of our model for localizing exactly where and how fast objects are changing in video. We further demonstrate useful implications for tracking activity progress to benefit robotic agents.


57
Self-Supervised Visual Prompting for Cross-Domain Road Damage Detection

Xi Xiao ⋅ Zhuxuanzi Wang ⋅ Mingqiao Mo ⋅ Chen Liu ⋅ Chenrui Ma ⋅ Yanshu Li ⋅ Smita Krishnaswamy ⋅ Xiao Wang ⋅ Tianyang Wang

The deployment of automated pavement defect detection is often hindered by poor cross-domain generalization. Supervised detectors achieve strong in-domain accuracy but require costly re-annotation for new environments, while standard self-supervised methods capture generic features and remain vulnerable to domain shift. We propose a self-supervised framework that visually probes target domains without labels. It introduces a Self-supervised Prompt Enhancement Module (SPEM), which derives defect-aware prompts from unlabeled target data to guide a frozen ViT backbone, and a Domain-Aware Prompt Alignment (DAPA) objective, which aligns prompt-conditioned source and target representations. Experiments on four challenging benchmarks show that our framework consistently outperforms strong supervised, self-supervised, and adaptation baselines, achieving robust zero-shot transfer, improved resilience to domain variations, and high data efficiency in few-shot adaptation. These results highlight self-supervised prompting as a practical direction for building scalable and adaptive visual inspection systems.


58
Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image

Sapir Esther Yiflach ⋅ Yuval Atzmon ⋅ Gal Chechik

Text-to-image diffusion models can generate stunning visuals, yet they often fail at tasks children find trivial, like placing a dog to the right of a teddy bear rather than to the left. When combinations get more unusual (a giraffe above an airplane), these failures become even more pronounced. Existing methods attempt to fix these spatial reasoning failures through model fine-tuning or test-time optimization with handcrafted losses that are suboptimal. Rather than imposing our assumptions about spatial encoding, we propose learning these objectives directly from the model's internal representations. We introduce Learn-to-Steer, a novel framework that learns data-driven objectives for test-time optimization rather than handcrafting them. Our key insight is to train a lightweight classifier that decodes spatial relationships from the diffusion model's cross-attention maps, then deploy this classifier as a learned loss function during inference. Training such classifiers poses a surprising challenge: they can take shortcuts by detecting linguistic traces rather than learning true spatial patterns. We solve this with a dual-inversion strategy that enforces geometric understanding. Our method dramatically improves spatial accuracy: from 20% to 61% on FLUX.1-dev and from 7% to 54% on SD2.1 across standard benchmarks.


59
Multi-Modal Soccer Scene Analysis with Masked Pre-Training

Marc Peral ⋅ Guillem Capellera ⋅ Luis Ferraz ⋅ Antonio Romano ⋅ Antonio Agudo

In this work we propose a multi-modal architecture for analyzing soccer scenes from tactical camera footage, with a focus on three core tasks: ball trajectory inference, game state classification, and ball possessor identification. To this end, our solution integrates three distinct input modalities (player trajectories, player types and image crops of individual players) into a unified framework that processes spatial and temporal dynamics using a cascade of sociotemporal transformer blocks. Unlike prior methods, which rely heavily on accurate ball tracking or handcrafted heuristics, our approach infers the ball trajectory without direct access to its past or future positions, and robustly identifies the game state and ball possessor under noisy or occluded conditions from real top league matches. We also introduce CropDrop, a modality-specific masking pretraining strategy that prevents over-reliance on image features and encourages the model to rely on cross-modal patterns during pretraining. We demonstrate the effectiveness of our approach on a large-scale dataset showing substantial improvements over state-of-the-art baselines in all tasks. Our results highlight the benefits of combining structured and visual cues in a transformer-based architecture, and the importance of realistic masking strategies in multi-modal learning.


60
Visual Detector Compression via Location-Aware Discriminant Analysis

Qizhen Lan ⋅ Jung Choi Choi ⋅ Qing Tian

Deep neural networks are powerful, yet their high complexity greatly limits their potential to be deployed on billions of resource-constrained edge devices. Pruning is a crucial network compression technique, yet most existing methods focus on classification models, with limited attention to detection. Even among those addressing detection, there is a lack of utilization of essential localization information. Also, many pruning methods passively rely on pre-trained models, in which useful and useless components are intertwined, making it difficult to remove the latter without harming the former at the neuron/filter level. To address the above issues, in this paper, we propose a proactive detection-discriminants-based network compression approach for deep visual detectors, which alternates between two steps: (1) maximizing and compressing detection-related discriminants and aligning them with a subset of FPN neurons/filters across different scales, and (2) tracing the detection-related discriminating power across the layers and discarding features of lower importance. Object location information is exploited in both steps. Extensive experiments, employing four advanced detection models and four state-of-the-art competing methods on the KITTI and COCO datasets, highlight the superiority of our approach. Remarkably, our compressed models can even beat the original base models with a substantial reduction in complexity.


61
Latent Uncertainty-Aware Multi-View SDF Scan Completion

Faezeh Zakeri ⋅ Lukas Ruppert ⋅ Raphael Braun ⋅ Hendrik Lensch

Imperfect reconstructions arising from occlusions, shadows, reflections, and other factors during 3D scanning often result in incomplete sections of the scanned object, with missing parts scattered randomly across its surface. We introduce an uncertainty-aware signed distance field (SDF) latent transformer that leverages uncertainty to identify and reconstruct missing parts based on the global shape of the incomplete scanned object and the immediate neighborhood of the affected regions. To our knowledge, we are the first to utilize uncertainties for SDF shape completion in the latent space. Our model has been trained on the entire Objaverse 1.0 dataset and demonstrates that our uncertainty-aware SDF completion method significantly outperforms previous works both numerically and visually.


62
Comp4D: Compositional 4D Scene Generation

Hanwen Liang ⋅ Dejia Xu ⋅ Neel Bhatt ⋅ Hezhen Hu ⋅ Hanxue Liang ⋅ Konstantinos Plataniotis

The advancements in diffusion models for 2D and 3D content creation have sparked a surge of interest in generating 4D content. However, the scarcity of 3D scene datasets constrains current methodologies to primarily object-centric generation. To overcome this limitation, we present Comp4D, a novel framework for text-to-compositional 4D scene generation. Unlike conventional methods that generate a singular 4D representation of the entire scene, Comp4D innovatively employs a decompose-then-recompose strategy, constructing each 4D component within the scene separately. The framework first decomposes a textual input prompt into multiple object components and delineates their moving trajectories. After initializing the static 3D objects, we construct the compositional 4D scene by accurately positioning these objects along their designated paths. To refine the scene and motion, our method proposes a novel compositional score distillation technique involving trajectory-guided and object-centric sampling, utilizing pre-trained diffusion models across text-to-image, text-to-video, and text-to-3D domains for optimization. Extensive experiments demonstrate our superior 4D content creation capability compared to prior arts, showcasing superior visual quality, motion fidelity, and enhanced object interactions.


63
SENCA-st: Integrating Spatial Transcriptomics and Histopathology with Cross Attention Shared Encoder for Region Identification in Cancer Pathology

Shanaka Liyanaarachchi ⋅ Chathurya Wijethunga ⋅ Shihab Ahamed ⋅ Akthas Absar ⋅ Ranga Rodrigo

Spatial transcriptomics is an emerging field that enables the identification of functional regions based on the spatial distribution of gene expression. Integrating this functional information present in transcriptomic data with structural data from histopathology images is an active research area with applications in identifying tumor substructures associated with cancer drug resistance. Current histopathology-spatial-transcriptomic region segmentation methods fall into one of two extremes: they either make spatial transcriptomics prominent by using histopathology features only to assist the processing of spatial transcriptomics data, or use vanilla contrastive learning, which makes histopathology images prominent because it promotes only common features and loses functional information. In both extremes, the model either gets lost in the noise of spatial transcriptomics or is overly smoothed, losing essential information. Thus, we propose our novel architecture SENCA-st (Shared Encoder with Neighborhood Cross Attention) that preserves the features of both modalities. More importantly, it emphasizes regions that are structurally similar in histopathology but functionally different on spatial transcriptomics using cross-attention. We demonstrate the superior performance of our model that surpasses state-of-the-art methods in detecting tumor heterogeneity and tumor micro-environment regions, a clinically crucial aspect.

We propose a framework for real-time virtual makeup transfer that achieves high-fidelity, identity-preserving results with strong temporal consistency. Existing methods often struggle to disentangle semi-transparent makeup from skin tones and other identity features, leading to identity shifts and fairness concerns. Furthermore, they also lack real-time capabilities and fail to maintain temporal consistency, limiting adoption in practical virtual try-on applications. To address these challenges, we decouple makeup transfer into two stages: transparent makeup mask extraction and graphics-based real-time makeup rendering. Once extracted, makeup masks can be applied in real time, enabling live video try-on. We generate pseudo-ground-truth data via a hybrid graphics-based rendering pipeline and an unsupervised clustering method, enabling robust training without real paired before-and-after makeup data. To further enhance transparency estimation and color fidelity, we propose transparency-aware reconstruction and lip color objectives. Our method consistently transfers fine-grained makeup details across diverse skin tones and expressions while maintaining temporal smoothness. Experiments demonstrate superior accuracy, stability, and efficiency over state-of-the-art baselines, making our approach practical for live virtual try-on applications. Video demonstrations are available in supplementary material.


65
Feature Inversion as a Lens on Vision Encoders

Eduard Allakhverdov ⋅ Dmitrii Tarasov ⋅ Elizaveta Goncharova ⋅ Andrei Kuznetsov

Vision encoders power modern vision-only and vision-language systems, yet the geometry of their internal features remains opaque. In this work, we introduce a simple, general approach for vision latent analysis: reconstruct images from frozen encoder features and treat reconstructability as a proxy for retained information and feature organization. Concretely, we train a lightweight reconstructor to invert feature tensors and use it to compare various vision encoders — CLIP-based ViT, SigLIP, SAM, and InternViT. We rank models by the informativeness of their features and observe consistent gains with image-centric objectives and higher spatial resolution. Beyond measurement, controlled manipulations in feature space produce predictable pixel-level edits: orthogonal rotations (rather than spatial transformations) implement channel permutations and drive systematic color changes; linear contractions implement channel suppression; and a learned linear map enables plausible colorization of grayscale inputs. VLM-based experiments confirm that feature-space color swaps translate into semantic color changes in reconstructions. Our approach is encoder-agnostic in principle (demonstrated on ViT-based models), requires only access to features, and offers a practical diagnostic of what encoders remember, how that information is organized, and how it can be manipulated.
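
The statement that orthogonal maps in feature space can implement channel permutations can be checked on a toy example: a permutation matrix is orthogonal, and applying it to a feature vector only reorders its channels. The sketch below is a generic linear-algebra illustration with hypothetical helper names, not the authors' code.

```python
def permutation_matrix(perm):
    """Build P such that (P @ v)[i] == v[perm[i]]."""
    n = len(perm)
    return [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]

def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def is_orthogonal(matrix, tol=1e-12):
    """Check that the transpose of the matrix times the matrix is the identity."""
    n = len(matrix)
    for i in range(n):
        for j in range(n):
            dot = sum(matrix[k][i] * matrix[k][j] for k in range(n))
            expected = 1.0 if i == j else 0.0
            if abs(dot - expected) > tol:
                return False
    return True
```

Swapping the first and last channels of a three-channel feature with `permutation_matrix([2, 1, 0])` preserves the feature norm while exchanging the two values, the kind of manipulation the abstract associates with systematic color changes.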


66
MuSACo: Multimodal Subject-Specific Selection and Adaptation for Expression Recognition with Co-Training

Muhammad Osama Zeeshan ⋅ Natacha Gillet ⋅ Alessandro Lameiras Koerich ⋅ Marco Pedersoli ⋅ Francois Bremond ⋅ Eric Granger

Personalized expression recognition (ER) involves adapting a machine learning model to subject-specific data for improved recognition of expressions with considerable interpersonal variability. Subject-specific ER can benefit significantly from multi-source domain adaptation (MSDA) methods -- where each domain corresponds to a specific subject -- to improve model accuracy and robustness. Despite promising results, state-of-the-art MSDA approaches often overlook multimodal information or blend sources into a single domain, limiting subject diversity and failing to explicitly capture unique subject-specific characteristics. To address these limitations, we introduce MuSACo, a multimodal subject-specific selection and adaptation method for ER based on co-training. It leverages complementary information across multiple modalities and multiple source domains for subject-specific adaptation. This makes MuSACo particularly relevant for affective computing applications in digital health, such as patient-specific assessment for stress or pain, where subject-level nuances are crucial. MuSACo selects source subjects relevant to the target and generates pseudo-labels using the dominant modality for class-aware learning, in conjunction with a class-agnostic loss to learn from less confident target samples. Finally, source features from each modality are aligned, while only confident target features are combined. Our experimental results on challenging multimodal ER datasets -- BioVid and StressID -- show that MuSACo can outperform UDA (blending) and state-of-the-art MSDA methods.


67
SCALEX: Scalable Concept and Latent Exploration for Diffusion Models

Emily Zhixuan Zeng ⋅ Yuhao Chen ⋅ Alexander Wong

Image generation models frequently encode social biases, including stereotypes tied to gender, race, and profession. Existing methods for analyzing these biases in diffusion models either focus narrowly on predefined categories or depend on manual interpretation of latent directions. These constraints limit scalability and hinder the discovery of subtle or unanticipated patterns. We introduce SCALEX, a framework for scalable and automated exploration of diffusion model latent spaces. SCALEX extracts semantically meaningful directions from H-space using only natural language prompts, enabling zero-shot interpretation without retraining or labelling. This allows systematic comparison across arbitrary concepts and large-scale discovery of internal model associations. We show that SCALEX detects gender bias in profession prompts, ranks semantic alignment across identity descriptors, and reveals clustered conceptual structure without supervision. By linking prompts to latent directions directly, SCALEX makes bias analysis in diffusion models more scalable, interpretable, and extensible than prior approaches.


68
DCSHARP: 3D Gaussian Splatting with Direction Cosine Spherical Harmonics and Shape-Aware Pruning

Ahmed Hasssan ⋅ Jian Meng ⋅ Yuanbo Xiangli ⋅ Jae-sun Seo

3D Gaussian Splatting (3DGS) shows outstanding rendering quality for novel view synthesis. Despite its performance, the massive amount of Gaussian blobs leads to expensive run-time sorting and irregular memory access during rendering. Although pruning algorithms for 3DGS have been widely explored, most current research has focused on designing a proper pruning metric, and the root cause behind the inevitable quality degradation remains underexplored for highly-sparse 3DGS. In particular, our investigation shows that the Spherical Harmonics (SH) of 3DGS are insufficient to capture high-frequency anisotropic reflections and specular highlights during rendering, especially with sparsified Gaussians. Motivated by that, this work proposes Direction Cosine Spherical Harmonics with Shape-Aware Pruning (DCSHARP). Specifically, the proposed Direction Cosine Spherical Harmonics (DCSH) replaces the vanilla spherical harmonics by facilitating the expressiveness of 3DGS on high-frequency and highly reflective scenes. Unlike recent works that rely on trainable masks or pseudo-rendering scores, the proposed Shape-aware Pruning method enables "pruning on-the-fly" while achieving high quality rendering. As a combined scheme, the proposed DCSHARP reduces the number of active Gaussians by up to 3.9$\times$ and improves rendering throughput by 1.9$\times$ with ZERO quality degradation compared to the vanilla 3DGS. Furthermore, the proposed DCSH scheme outperforms the vanilla 3DGS on all the mainstream benchmarks by simply replacing the vanilla SH with the DCSH. The source code of the proposed method will be open-sourced.

Few-shot semantic segmentation aims to build robust models that segment unseen objects using only a few labeled examples. Existing FSS approaches, which rely on semantic feature matching, often suffer from Background Bias, Pose-Scale Discrepancy Bias, and the inability to capture fine object details. These limitations hinder their ability to generalize to novel categories, especially in scenarios with high intra-class variability and fine-grained object structures. To overcome these challenges, we propose DOT-Graph, a novel framework that designs CLIP-driven feature Disentanglement and Optimal Transport-based Graph learning for robust few-shot segmentation. We evaluate DOTGraph on PASCAL-5$^i$ and COCO-20$^i$, achieving state-of-the-art performance with improvements in various few-shot settings. Our results demonstrate that DOTGraph effectively mitigates background bias, improves feature alignment, and enhances fine-grained segmentation. The code will be released soon.

Multimodal Large Language Models (MLLMs) combine visual and textual representations to enable rich reasoning capabilities. However, the high computational cost of processing dense visual tokens remains a major bottleneck. A critical component in this pipeline is the visual projector, which bridges the vision encoder and the language model. Standard designs often employ a simple multi-layer perceptron for direct token mapping, but this approach scales poorly with high-resolution inputs, introducing significant redundancy. We present Delta-LLaVA, a token-efficient projector that employs a low-rank DeltaProjection to align multi-level vision features into a compact subspace before further interaction. On top of this base alignment, lightweight Transformer blocks act as specialization layers, capturing both global and local structure under constrained token budgets. Extensive experiments and ablations demonstrate that this base-then-specialize design yields consistent gains across diverse benchmarks with only 144 tokens, highlighting the importance of token formation prior to scaling interaction capacity. With Delta-LLaVA, inference throughput improves by up to 55%, while end-to-end training accelerates by nearly 4-5x in pretraining and over 1.5x in finetuning, highlighting the dual benefits of our design in both efficiency and scalability.


71
Sketch-guided Cage-based 3D Gaussian Splatting Deformation

Tianhao Xie ⋅ Noam Aigerman ⋅ Eugene Belilovsky ⋅ Tiberiu Popa

3D Gaussian Splatting (GS) is one of the most promising novel 3D representations that has received great interest in computer graphics and computer vision. While various systems have introduced editing capabilities for 3D GS, such as those guided by text prompts, fine-grained control over deformation remains an open challenge. In this work, we present a novel sketch-guided 3D GS deformation system that allows users to intuitively modify the geometry of a 3D GS model by drawing a silhouette sketch from a single viewpoint. Our approach introduces a new deformation method that combines cage-based deformations with a variant of Neural Jacobian Fields, enabling precise, fine-grained control. Additionally, it leverages large-scale 2D diffusion priors and ControlNet to ensure the generated deformations are semantically plausible. Through a series of experiments, we demonstrate the effectiveness of our method and showcase its ability to animate static 3D GS models as one of its applications.


72
Autoregressive Styled Text Image Generation, but Make it Reliable

Carmine Zaccagnino ⋅ Fabio Quattrini ⋅ Vittorio Pippi ⋅ Silvia Cascianelli ⋅ Alessio Tonioni ⋅ Rita Cucchiara

Generating faithful and readable styled text images (especially for Styled Handwritten Text generation - HTG) is an open problem with several possible applications across graphic design, document understanding, and image editing. A lot of research effort in this task is dedicated to developing strategies that reproduce the stylistic characteristics of a given writer, with promising results in terms of style fidelity and generalization achieved by the recently proposed Autoregressive Transformer paradigm for HTG. However, this method requires additional inputs, lacks a proper stop mechanism, and might end up in repetition loops, generating visual artifacts. In this work, we rethink the autoregressive formulation by framing HTG as a multimodal prompt-conditioned generation task, and tackle the content controllability issues by introducing special textual input tokens for better alignment with the visual ones. Moreover, we devise a Classifier-Free-Guidance-based strategy for our autoregressive model. Through extensive experimental validation, we demonstrate that our approach, dubbed Eruku, compared to previous solutions requires fewer inputs, generalizes better to unseen styles, and follows more faithfully the textual prompt, improving content adherence.
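
The abstract does not say how guidance is applied inside the autoregressive model. As a rough illustration only, the sketch below applies the generic classifier-free guidance rule to next-token logits: the unconditional logits are pushed toward the conditional ones by a guidance scale. The function name `cfg_logits` and the scale values are hypothetical and are not taken from the paper.

```python
def cfg_logits(cond, uncond, scale):
    """Generic classifier-free guidance on a logit vector.

    cond   -- logits computed with the conditioning prompt
    uncond -- logits computed with the conditioning dropped
    scale  -- guidance strength (assumed hyperparameter);
              0 returns uncond, 1 returns cond, >1 extrapolates past cond
    """
    if len(cond) != len(uncond):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Toy two-token vocabulary: guidance sharpens the preference for token 0.
    print(cfg_logits([2.0, 0.0], [1.0, 1.0], scale=2.0))
```

A scale above 1 amplifies whatever the conditioning changed in the prediction, which is the usual motivation for using guidance to improve adherence to a prompt.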

We introduce Universal Neural Architecture Space (UniNAS), a generic search space for neural architecture search (NAS) which unifies convolutional networks, transformers, and their hybrid architectures under a single, flexible framework. Our approach enables discovery of novel architectures as well as analyzing existing architectures in a common framework. We also propose a new search algorithm that allows traversing the proposed search space, and demonstrate that the space contains interesting architectures, which, when using identical training setup, outperform state-of-the-art hand-crafted architectures. Finally, a unified toolkit including a standardized training and evaluation protocol is introduced to foster reproducibility and enable fair comparison in NAS research. Overall, this work opens a pathway towards systematically exploring the full spectrum of neural architectures with a unified graph-based NAS perspective.


74
Sea-CLIP: Mining Semantic-Aware Representations for Few-Shot Anomaly Detection with CLIP

Xiao Guo ⋅ Zhimin Chen ⋅ Carlos Castillo ⋅ Hongcheng Wang ⋅ Xiaoming Liu

Few-shot Anomaly Detection (FSAD) is a classic computer vision task, and recent FSAD methods utilize the pre-trained Vision-Language model, i.e., CLIP, to achieve remarkable performance. However, existing CLIP-based methods overlook object semantics, a crucial element that could guide comparisons between semantically consistent patches and enhance FSAD performance. To address this limitation, we propose a novel method, Sea-CLIP, that incorporates semantic-aware representations from DINOv2 to improve FSAD representation learning. Specifically, Sea-CLIP uses semantic-aware representations obtained from DINOv2 in a patch-matching module for segmenting anomalies. In addition, a lightweight anomaly matching decoder is introduced to convert CLIP and DINOv2 features into the anomaly mask, formulating FSAD as a feature matching task. Stable Diffusion is leveraged for data augmentation, enabling Sea-CLIP to capture diverse anomaly patterns. Our Sea-CLIP achieves state-of-the-art FSAD performance on the MVTec and VisA AD datasets. The source code will be released upon acceptance.


75
CLIP-IT: CLIP-based Pairing of Histology Images with Privileged Textual Information

Banafsheh Karimian ⋅ Giulia Avanzato ⋅ Soufiane Belharbi ⋅ Alexis Guichemerre ⋅ Luke McCaffrey ⋅ Mohammadhadi Shateri ⋅ Eric Granger

Multimodal learning has shown promise in medical image analysis, combining complementary modalities like histology images and text. Vision-language models (VLMs) capture rich diagnostic cues but often require large paired datasets and prompt- or text-based inference, limiting their practicality due to annotation cost, privacy, and compute demands. Crucially, freely available unpaired external text, like pathology reports, can still provide complementary diagnostic cues if semantically relevant content is retrievable per image. To address this, we introduce CLIP-IT, a novel framework that relies on rich unpaired text reports, eliminating the paired-data requirement. Specifically, CLIP-IT uses a CLIP model pre-trained on histology image–text pairs from a separate dataset to retrieve the most relevant unpaired textual report for each image in the target unimodal dataset. These reports, sourced from the same disease domain and tissue type, form pseudo-pairs that reflect shared clinical semantics rather than exact alignment. Knowledge from these texts is distilled into the vision model during training, while LoRA-based adaptation mitigates the semantic gap between unaligned modalities. At inference time, only the improved vision model is used, with minimal computational overhead, enabling efficient pairing-free multimodal deployment. Experiments on histology image datasets confirm that CLIP-IT consistently improves classification accuracy over both unimodal and multimodal CLIP-based baselines in most cases, without the burden of paired data training or inference-time complexity. Our code is provided in suppl. materials.


76
LightGazeNet: A Lightweight GNN-based Architecture for Gaze Estimation

Heena Patel ⋅ Anirban Chowdhury ⋅ Pooja Choksy ⋅ Samiksha Pachade ⋅ Ajinkya Puar

Gaze estimation remains a fundamental yet challenging task, requiring a careful balance between accuracy and efficiency for real-world deployment. We introduce LightGazeNet, a lightweight Graph Neural Network (GNN) framework designed for appearance-based gaze estimation. LightGazeNet effectively integrates multi-modal inputs—including facial features, eye cues, 3D eye centers, head pose, and calibration data—within a compact graph-based architecture. To enhance feature fusion across heterogeneous inputs, the framework leverages multi-head attention to model complex spatial dependencies. Extensive evaluations on multiple benchmark datasets show that LightGazeNet achieves competitive or superior accuracy with significantly fewer parameters than existing methods. Furthermore, it demonstrates strong cross-dataset generalization, with calibration-based adaptation improving robustness under domain shift. By combining accuracy, efficiency, and adaptability, LightGazeNet offers a practical solution for gaze estimation in real-world settings while advancing graph-based modeling in computer vision.


77
Codebook Knowledge with Mamba-Transformer For Low-Light Image Enhancement

Runhua Deng ⋅ Aiwen Jiang ⋅ Long Peng ⋅ Qiuhai Yan

Low-light image enhancement is a critical task that aims to improve the quality of images captured in poor lighting conditions and to contribute to more robust and reliable computer vision systems. Existing methods fail to account for the multifaceted and intertwined degradations typically encountered in low-light scenarios. In this paper, we reconsider the application of a vector-quantized codebook to the low-light image enhancement task as a domain adaptation paradigm and propose an effective method called CodeMTNet to solve the aforementioned issues. Specifically, we leverage codebook learning from collections of normal-light images to provide unified high-quality knowledge guidance. We further develop two learning schemes, namely a Domain Adaptation Encoder with implicit neural representation regularization across multiple scales, and hybrid Mamba-Transformer blocks for nearest neighbor matching, to tackle the distribution mismatch between features of low-quality low-light images and high-quality normal-light images. Additionally, to address structural information loss during codebook retrieval, we introduce a controllable feature fusion module for better preservation of texture details. Experiments conducted on public datasets demonstrate that CodeMTNet consistently outperforms many state-of-the-art methods and restores images that are better in line with human perception. Related source code and pretrained parameters will be publicly available on GitHub.


78
TiCLS: Tightly Coupled Language Text Spotter

Leeje Jang ⋅ Yijun Lin ⋅ Yao-Yi Chiang ⋅ Jerod Weinman

Scene text spotting aims to detect and recognize text in real-world images, where instances are often short, fragmented, or visually ambiguous. Existing methods primarily rely on visual cues and implicitly capture local character dependencies, but they overlook the benefits of external linguistic knowledge. Prior attempts to integrate language models either adapt language modeling objectives without external knowledge or apply pretrained models that are misaligned with the word-level granularity of scene text. We propose TiCLS, an end-to-end text spotter that explicitly incorporates external linguistic knowledge from a character-level pretrained language model. TiCLS introduces a linguistic decoder that fuses visual and linguistic features, yet can be initialized by a pretrained language model, enabling robust recognition of ambiguous or fragmented text. Experiments on ICDAR 2015 and Total-Text demonstrate that TiCLS achieves state-of-the-art performance, validating the effectiveness of PLM-guided linguistic integration for scene text spotting.


79
Perception-Inspired Color Space Design for Photo White Balance Editing

Yang Cheng ⋅ Ziteng Cui ⋅ Lin Gu ⋅ Shenghan Su ⋅ Zenghui Zhang

White Balance (WB) is a critical component of the image signal processor (ISP) pipeline, designed to mitigate color casts introduced by diverse illumination conditions and restore the scene's true colors. Currently, sRGB-based WB editing has been widely adopted in cases where color correction errors occur in the absence of an ISP or when the original camera RAW is unavailable. However, its inherent limitations—such as fixed nonlinear transformations and entangled color channels—often hinder its ability to generalize under complex lighting scenarios. To address these challenges, we propose a novel framework for WB editing that leverages a learnable HSI (LHSI) color space. By disentangling luminance from chromatic components, the LHSI representation facilitates more effective modeling of illumination changes. The proposed framework incorporates a specifically designed neural network tailored for the LHSI color space, which optimizes a learnable illumination axis within this adaptive representation, enabling precise and flexible illumination correction. Experimental results on multiple benchmark datasets show that our method achieves remarkable performance, highlighting the importance of adaptive color space design in computational photography and pointing to a promising direction for learning-based WB methods.


80
CLUE: Bringing Machine Unlearning to Mobile Devices

A. Q. M. Sazzad Sayyed ⋅ Nathaniel Bastian ⋅ Michael Lucia ⋅ Ananthram Swami ⋅ Francesco Restuccia

Class-level machine unlearning has been proposed to address security and privacy issues of deep neural networks (DNNs). However, existing approaches either exhibit low performance or have excessive computation/storage requirements. This makes them inapplicable in mobile computing scenarios, where computation and memory are severely constrained yet unlearning has to be performed frequently and effectively. This limitation is mainly due to the usage of a retain dataset, i.e., a sub-dataset containing the knowledge that the DNN should maintain after the unlearning. In this paper, we propose CLUE, an unlearning algorithm that does not require a retain dataset. Our key idea is to treat inputs coming from the forget class as out-of-distribution data and to use knowledge distillation to impose this constraint on the updated DNN. We have experimentally evaluated CLUE on ResNet-20, ViT-Base, and ViT-Large DNNs trained on CIFAR10, CIFAR100, and VGGFace2 datasets. We have also implemented CLUE on Raspberry Pi and compared the power consumption and latency of CLUE with respect to several existing baselines. We show that CLUE improves power consumption by 68% and latency by 90% while improving the unlearning performance by up to 4.74%.

Semantic Scene Completion (SSC) aims to predict the semantic occupancy of each voxel within a 3D scene using sensor data, a critical task for autonomous driving and robotics. Despite recent progress, camera-based SSC remains challenging due to various difficulties, including voxel class imbalance, occlusion, and depth ambiguity. This paper introduces FairScene, a novel approach that learns class-disentangled 2D/3D representations to improve SSC. By ensuring balanced representations across classes, FairScene mitigates the dominance of majority classes and promotes fairer voxel categorization. Additionally, FairScene explicitly models spatial dependencies between different classes through a novel inter-class occupancy reasoning mechanism. Such explicit modeling helps alleviate occlusion and depth ambiguities in SSC. To address the scarcity of SSC training data, we propose OccMix, a novel augmentation strategy that generalizes MixUp from 2D to 2.5D and 3D metric spaces while maintaining geometric consistency. Extensive quantitative and qualitative experiments demonstrate that FairScene outperforms prior methods on both the SemanticKITTI and SSCBench-KITTI-360 benchmarks. We will make the code publicly available.

Vision Language Models (VLMs) can be trained more efficiently if training sets can be reduced in size. Recent work has shown the benefits of masking text during VLM training using a variety of strategies (truncation, random masking, block masking and syntax masking) and has reported syntax masking as the top performer. In this paper, we analyze the impact of different text masking strategies on the word frequency distribution of the training data, and show that this impact is connected to model success. Motivated by this finding, we propose a new frequency-based text masking approach, Contrastive Language-Image Pre-training with Word Frequency Masking (CLIPF). Extensive experiments demonstrate the advantages of CLIPF over syntax masking and other existing approaches, particularly when the number of input tokens decreases. We show that not only CLIPF, but also other existing masking strategies, outperform syntax masking when enough epochs are used during training, a finding of practical importance for selecting a text masking method for VLM training. Our code is available online.


83
Event-based Graph Representation with Spatial and Motion Vectors for Asynchronous Object Detection

Aayush Verma ⋅ Arpitsinh Vaghela ⋅ Bharatesh Chakravarthi ⋅ Kaustav Chanda ⋅ “YZ” Yezhou Yang

Event-based sensors offer high temporal resolution and low latency by generating sparse, asynchronous data. However, converting this irregular data into dense tensors for use in standard neural networks diminishes these inherent advantages, motivating research into graph representations. While such methods preserve sparsity and support asynchronous inference, their performance on downstream tasks remains limited due to suboptimal modeling of spatiotemporal dynamics. In this work, we propose a novel spatiotemporal multigraph representation to better capture spatial structure and temporal changes. Our approach constructs two decoupled graphs: a spatial graph leveraging B-spline basis functions to model global structure, and a temporal graph utilizing motion vector-based attention for local dynamic changes. This design enables the use of efficient 2D kernels in place of computationally expensive 3D kernels. We evaluate our method on the Gen1 automotive and eTraM datasets for event-based object detection, achieving over a 6% improvement in detection accuracy compared to previous graph-based works, with a $5\times$ speedup, reduced parameter count, and no increase in computational cost. These results highlight the effectiveness of structured graph modeling for asynchronous vision. The project is available at https://github.com/maskedforreview.

Unsupervised anomaly detection aims to identify anomalies without pixel-level annotations. Synthetic anomaly-based methods exhibit a unique capacity to introduce controllable irregularities with known masks, enabling explicit supervision during training. However, existing methods often produce synthetic anomalies that are visually distinct from real pathological patterns and ignore anatomical structure. This paper presents a novel Anatomy-aware Realistic Texture-based Anomaly Synthesis framework (ART-ASyn) for chest X-rays that generates realistic and anatomically consistent abnormalities using texture-based augmentation guided by our proposed Progressive Binary Thresholding Segmentation method (PBTSeg) for lung segmentation. The generated paired samples of synthetic anomalies and their corresponding precise pixel-level anomaly mask for each normal sample enable both explicit segmentation supervision and localized training. In contrast to prior work limited to one-class classification, ART-ASyn is further evaluated for zero-shot anomaly segmentation, demonstrating generalizability on an unseen dataset without requiring target-domain annotations.


85
High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization

Masih Aminbeidokhti ⋅ Heitor Medeiros ⋅ Srikanth Muralidharan ⋅ Eric Granger ⋅ Marco Pedersoli

Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks—PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet—using ResNet and ViT architectures show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.
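
To make the weight-swapping idea concrete, here is a minimal sketch on flat lists of weights. Each fine-tuned weight is replaced by its pre-trained counterpart with probability `p`; the surviving weights are rescaled so that the expected result equals the fine-tuned weight. The rescaling step is an assumption based on how Mixout is commonly formulated, since the abstract describes only the swap, and the function name `mixout` and the list-based interface are hypothetical.

```python
import random


def mixout(finetuned, pretrained, p, rng):
    """Stochastically mix fine-tuned weights with pre-trained ones.

    With probability p a weight is swapped for its pre-trained value u.
    Otherwise it becomes u + (w - u) / (1 - p), which keeps the
    expectation of the output equal to the fine-tuned weight w
    (assumed rescaling, not stated in the abstract).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    mixed = []
    for w, u in zip(finetuned, pretrained):
        if rng.random() < p:
            mixed.append(u)                        # swapped back to pre-trained
        else:
            mixed.append(u + (w - u) / (1.0 - p))  # kept and rescaled
    return mixed


if __name__ == "__main__":
    rng = random.Random(0)
    # High masking rate, in the range the abstract reports for ViTs.
    print(mixout([2.0] * 8, [1.0] * 8, p=0.9, rng=rng))
```

At a high rate such as 0.9 most entries equal the pre-trained value in any given step, which is consistent with the abstract's point that only a small fraction of weights needs gradient computation.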


86
MuseDance: A Diffusion-based Music-Driven Image Animation System

Zhikang Dong ⋅ Weituo Hao ⋅ Ju-Chiang Wang ⋅ Peng Zhang ⋅ Pawel Polak

Image animation is a rapidly developing area in multimodal research, with a focus on generating videos from reference images. While much of the work has emphasized generic video generation guided by text, music-driven dance image animation remains underexplored. In this paper, we introduce MuseDance, an end-to-end model that animates reference images using both music and text inputs. By integrating music as a conditioning modality, MuseDance generates personalized videos that not only adhere to textual descriptions but also synchronize character movements with the rhythm and dynamics of the music. Unlike existing methods, MuseDance eliminates the need for explicit motion guidance, such as pose sequences or depth maps, reducing the complexity of video generation while enhancing accessibility and flexibility. To support further research in this field, we present a new multimodal dataset comprising 2,904 dance videos, each paired with the corresponding background music and text descriptions. Our approach leverages diffusion-based methods to achieve robust generalization, precise control, and temporal consistency, setting a new benchmark for the music-driven image animation task.

Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally, we simulate explicit feedback using ground-truth captions as an upper-bound baseline. Experiments on Flickr30k and COCO with three VLM backbones show that GRF, AFS, and explicit feedback improve retrieval performance by 3–5% in MRR@5 for smaller VLMs, and 1–3% for larger ones, compared to retrieval with no feedback. Moreover, AFS, similarly to explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval settings. Our findings demonstrate that relevance feedback can consistently enhance retrieval across VLMs and open up opportunities for interactive and adaptive visual search.
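
Pseudo-relevance feedback, as described above, refines the query embedding using the top-ranked results. The sketch below shows one common way to do this, a Rocchio-style interpolation between the query and the centroid of the top-k items. The interpolation weight `alpha`, the value of `k`, and the function names are assumptions for illustration; the abstract does not give the authors' exact update rule.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def prf_refine(query, items, k=2, alpha=0.5):
    """Return a unit-length query moved toward the centroid of its top-k items."""
    if not items:
        raise ValueError("need at least one item embedding")
    ranked = sorted(items, key=lambda v: cosine(query, v), reverse=True)
    top = ranked[:k]
    centroid = [sum(col) / len(top) for col in zip(*top)]
    mixed = [alpha * q + (1.0 - alpha) * c for q, c in zip(query, centroid)]
    norm = math.sqrt(sum(x * x for x in mixed))
    if norm == 0.0:
        raise ValueError("refined query collapsed to the zero vector")
    return [x / norm for x in mixed]


if __name__ == "__main__":
    items = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(prf_refine([1.0, 0.0], items, k=2))
```

Because the top-ranked items are treated as relevant without checking, an early ranking mistake pulls the query in the wrong direction, which is the query-drift limitation the abstract mentions.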


88
Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings

Jinyung Hong ⋅ Yearim Kim ⋅ Keun Hee Park ⋅ Sangyu Han ⋅ Nojun Kwak ⋅ Theodore Pavlic

Inner interpretability is a promising field focused on uncovering the internal mechanisms of AI systems and developing scalable automated methods to understand these systems at a mechanistic level. While significant research has been conducted on large language models, limited attention has been paid to applying inner interpretability to large-scale image tasks, focusing primarily on architectural and functional levels to visualize learned concepts. In this paper, we first present a conceptual framework that supports inner interpretability and multilevel analysis for large-scale image classification tasks. Specifically, we introduce the Bi-directional Interaction between Concept and Input Embeddings (Bi-ICE) module, which facilitates interpretability across the computational, algorithmic, and implementation levels. This module enhances transparency by generating predictions based on human-understandable concepts, quantifying their contributions, and localizing them within the inputs. Finally, we showcase enhanced transparency in image classification, measuring concept contributions, and pinpointing their locations within the inputs. Our approach highlights algorithmic interpretability by demonstrating the process of concept learning and its convergence.


89
DTMIR-Pro: Domain Translation with Prompt-based Latent-Space Generalization for Multi-Weather Image Restoration

Ashutosh Kulkarni ⋅ Prashant Patil ⋅ SANTOSH VIPPARTHI ⋅ Subrahmanyam Murala ⋅ Balasubramanian Raman

Multi-weather image restoration seeks to recover scene visibility under rainy, snowy, and hazy conditions, thereby enhancing high-level vision tasks. Existing methods typically train on combined datasets with single-type weather degradations, limiting their generalization to real-world scenarios involving mixed degradations. Domain translation has emerged as a viable solution by generating diverse weather-degraded variants of the same scene. However, current approaches require separate models for each degradation type, resulting in increased system complexity. To address this, we propose DTMIR-Pro, a prompt-based domain translation framework with latent space generalization for multi-weather image restoration. A single trainable network performs multi-domain translation using domain-adaptive prompts and dynamic kernel selection via a proposed Dynamic Multi-Head Attention block to learn diverse degradation patterns. The restoration network takes translated outputs and employs a Multi-Weather Fusion Block with global-local feature streams to capture complex degradations. Furthermore, we introduce a Similarity-Based Encoder Routing mechanism to transfer domain-specific features from the translation encoder to the restoration stage. Extensive experiments on both synthetic and real-world weather-degraded datasets demonstrate the effectiveness and generalizability of the proposed method. Testing code is provided as a part of the supplementary material and will be publicly released upon acceptance of paper.

Anomaly Detection is an important problem in industrial processes. Two new subfields have recently emerged: logical anomaly detection and few-shot anomaly detection. The combined task, few-shot logical anomaly detection, has proven exceptionally difficult and highly important for industrial processes. Previous few-shot methods do not capture the composition information necessary for detecting logical anomalies, and previous full-shot methods require a large training set. To solve both problems, we propose ObjectCore, a few-shot logical anomaly detection model that captures the composition information from only a few images without any category-specific information. The composition information of an image is modelled as a collection of object representations. Logical anomalies are detected using bipartite matching between object representations in the test image and object representations in the most similar support image. ObjectCore significantly improves over state-of-the-art methods on two standard benchmarks for few-shot logical anomaly detection, MVTec LOCO and CAD-SD, attaining an image-level AUROC of 80.8% and 96.5%, respectively, in the 4-shot setting. Code: to be released upon acceptance.
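
The bipartite matching step can be illustrated with a small sketch. It pairs each object vector from a test image with a distinct object vector from a support image so that the total squared distance is minimal; a large matching cost suggests that some object has no good counterpart. The exhaustive search, the squared Euclidean cost, and the name `match_objects` are assumptions chosen for a self-contained example. The abstract does not state the cost function or solver used by ObjectCore, and a practical system would use a polynomial-time assignment solver.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_objects(test_objs, support_objs):
    """Minimum-cost one-to-one matching by exhaustive search (small sets only).

    Returns (pairs, cost), where pairs is a list of
    (test_index, support_index) tuples.
    """
    n = len(test_objs)
    if n > len(support_objs):
        raise ValueError("sketch assumes no more test objects than support objects")
    best_cost, best_perm = None, None
    for perm in permutations(range(len(support_objs)), n):
        cost = sum(sq_dist(test_objs[i], support_objs[j])
                   for i, j in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm)), best_cost


if __name__ == "__main__":
    support = [[5.0, 5.0], [0.0, 0.0]]
    normal = [[0.0, 0.0], [5.0, 5.0]]
    odd = [[0.0, 0.0], [9.0, 9.0]]
    print(match_objects(normal, support))  # zero cost: every object has a twin
    print(match_objects(odd, support))     # positive cost: one object deviates
```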


91
A Novel Metric for Detecting Memorization in Generative Models for Brain MRI Synthesis

Antonio Scardace ⋅ Lemuel Puglisi ⋅ Francesco Guarnera ⋅ Sebastiano Battiato ⋅ Daniele Ravi

Deep generative models have emerged as a transformative tool in medical imaging, offering substantial potential for synthetic data generation. However, recent empirical studies highlight a critical vulnerability: these models can memorize sensitive training data, posing significant risks of unauthorized patient information disclosure. Detecting memorization in generative models remains particularly challenging, necessitating scalable methods capable of identifying training data leakage across large sets of generated samples. In this work, we propose DeepSSIM, a novel self-supervised metric for quantifying memorization in generative models. DeepSSIM is trained to: i) project images into a learned embedding space and ii) force the cosine similarity between embeddings to match the ground-truth SSIM (Structural Similarity Index) scores computed in the image space. To capture domain-specific anatomical features, training incorporates structure-preserving augmentations, allowing DeepSSIM to estimate similarity reliably without requiring precise spatial alignment. We evaluate DeepSSIM in a case study involving synthetic brain MRI data generated by a Latent Diffusion Model (LDM) trained under memorization-prone conditions, using 2,195 MRI scans from two publicly available datasets (IXI and CoRR). Compared to state-of-the-art memorization metrics, DeepSSIM achieves superior performance, improving F1-scores by an average of +52.86% over the best existing method. Additionally, our results show that existing metrics often overestimate memorization, challenging prior assumptions about the severity of this risk. Code and data of our approach are publicly available at the following anonymized link: https://github.com/AnonymAuthorr/DeepSSIM.
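
The training objective described in (i) and (ii) can be sketched for a single image pair as follows. Given the two embeddings and a precomputed SSIM score, the loss penalizes the gap between their cosine similarity and that score. The squared-error form and the name `pair_loss` are assumptions; the abstract says only that the cosine similarity is forced to match SSIM, and the embedding network and SSIM computation are outside this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (na * nb)


def pair_loss(emb_a, emb_b, ssim_target):
    """Squared gap between embedding similarity and the image-space SSIM.

    ssim_target is assumed to be computed beforehand on the two images.
    """
    return (cosine(emb_a, emb_b) - ssim_target) ** 2


if __name__ == "__main__":
    # Identical embeddings and SSIM of 1.0: nothing to correct.
    print(pair_loss([3.0, 4.0], [3.0, 4.0], ssim_target=1.0))
    # Orthogonal embeddings but a high SSIM: a large penalty.
    print(pair_loss([1.0, 0.0], [0.0, 1.0], ssim_target=1.0))
```

Once trained this way, similarity between a generated sample and each training image can be estimated from embeddings alone, which is what makes screening large sets of generated samples tractable.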


92
Unsupervised Modular Adaptive Region Growing and RegionMix Classification for Wind Turbine Segmentation

Raül Pérez-Gonzalo ⋅ Riccardo Magro ⋅ Andreas Espersen ⋅ Antonio Agudo

Reliable operation of wind turbines requires frequent inspections, as even minor surface damage can degrade aerodynamic performance, reduce energy output, and accelerate blade wear. Central to automating these inspections is the accurate segmentation of turbine blades from visual data. This task is traditionally addressed through dense, pixel-wise deep learning models. However, such methods demand extensive annotated datasets, posing scalability challenges. In this work, we introduce an annotation-efficient segmentation approach that reframes the pixel-level task into a binary region classification problem. Image regions are generated using a fully unsupervised, interpretable Modular Adaptive Region Growing technique, guided by image-specific Adaptive Thresholding and enhanced by a Region Merging process that consolidates fragmented areas into coherent segments. To improve generalization and classification robustness, we introduce RegionMix, an augmentation strategy that synthesizes new training samples by combining distinct regions. Our framework demonstrates state-of-the-art segmentation accuracy and strong cross-site generalization by consistently segmenting turbine blades across distinct wind farms.


93
Gen-AFFECT: Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy

Hao Yu ⋅ Rupayan Mallick ⋅ Margrit Betke ⋅ Sarah Bargal

Different forms of customized 2D avatars are widely used in gaming applications, virtual communication, education, and content creation. However, existing approaches often fail to capture fine-grained facial expressions and struggle to preserve identity across different expressions. We propose Gen-AFFECT, a novel framework for personalized avatar generation that generates expressive and identity-consistent avatars with a diverse set of facial expressions. Our framework proposes conditioning a multimodal diffusion transformer on an extracted identity-expression representation. This enables identity preservation and representation of a wide range of facial expressions. Gen-AFFECT additionally employs consistent attention at inference for information sharing across the set of generated expressions, enabling the generation process to maintain identity consistency over the array of generated fine-grained expressions. Gen-AFFECT demonstrates superior performance compared to previous state-of-the-art methods on the basis of the accuracy of the generated expressions, the preservation of the identity and the consistency of the target identity across an array of fine-grained facial expressions.

The increasing availability of Earth observation data offers unprecedented opportunities for large-scale environmental monitoring and analysis. However, these datasets are inherently heterogeneous, stemming from diverse sensors, geographical regions, acquisition times, and atmospheric conditions. Distribution shifts between training and deployment domains severely limit the generalization of pretrained remote sensing models, making unsupervised domain adaptation (UDA) crucial for real-world applications. We introduce FlowEO, a novel framework that leverages generative models for image-space UDA in Earth observation. We leverage flow matching to learn a semantically preserving mapping that transports from the source to the target image distribution. This allows us to tackle challenging domain adaptation configurations for classification and semantic segmentation of Earth observation images. We conduct extensive experiments across four datasets covering adaptation scenarios such as SAR to optical translation and temporal and semantic shifts caused by natural disasters. Experimental results demonstrate that FlowEO outperforms existing image translation approaches for domain adaptation while achieving on-par or better perceptual image quality, highlighting the potential of flow-matching-based UDA for remote sensing.


95
Efficient Vision Transformers via Token Merging with Head-wise Attention Correction

Yuki Ichikawa ⋅ Masato Motomura ⋅ Thiem Chu ⋅ Daichi Fujiki

Vision Transformers (ViTs) offer strong performance by modeling global relationships across image patches, but their scalability is limited by the quadratic cost of self-attention. To mitigate this, Token Merging (ToMe) reduces computation by merging similar tokens. This approach relies on proportional attention to preserve the original balance of attention weights after merging. Yet, proportional attention does not fully resolve the attention distortion. It only compensates for a merged token's influence on other tokens, while ignoring the fundamental distortion of the token's own self-attention score. This change creates unaddressed distortions that vary across attention heads. In this work, we conduct a detailed analysis of these attention distortions and reveal their dependence on the query–key projection weights of each head. Based on this finding, we propose Head-wise Attention Correction (HAC), a method that adjusts attention scores after token merging by accounting for head-specific characteristics. HAC effectively mitigates the distortions overlooked by proportional attention, maintaining model accuracy while significantly reducing computation. Experiments on ImageNet demonstrate that our method effectively improves the trade-off between efficiency and performance, advancing the development of efficient Vision Transformers via token merging.
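For readers unfamiliar with proportional attention, the sketch below shows the idea for a single query: each key's attention logit is biased by the logarithm of the number of original tokens it represents. The toy vectors and function names are assumptions, and the code illustrates only the ToMe-style baseline mentioned above, not the proposed HAC correction.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proportional_attention(query, keys, sizes):
    """Attention weights of one query over keys, with a log-size bias per key."""
    scale = math.sqrt(len(query))
    logits = [
        sum(q * k for q, k in zip(query, key)) / scale + math.log(size)
        for key, size in zip(keys, sizes)
    ]
    return softmax(logits)


# Toy example: two identical keys (key_a) are merged into one token of size 2.
query = [1.0, 0.5]
key_a, key_b = [0.2, 0.1], [0.9, -0.3]
unmerged = proportional_attention(query, [key_a, key_a, key_b], [1, 1, 1])
merged = proportional_attention(query, [key_a, key_b], [2, 1])
```

In this toy case the merged token receives exactly the attention that its two constituents received together, which is the balance proportional attention is meant to preserve. Real merged tokens are averages of similar but not identical tokens, so the match is only approximate.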

Contrast-enhanced Computed Tomography (CT) plays a vital role in modern medical diagnostics, particularly in cardiovascular assessment. However, the use of intravenous contrast agents can pose potential health risks for vulnerable patient populations. To address this limitation, we present 2S-CEDiff, a clinically-inspired image translation framework that synthesizes high-fidelity contrast-enhanced 3D CT volumes from non-contrast inputs. Our method adopts a two-stage architecture. The first stage employs a 2.5D diffusion model to generate anatomically accurate slice-wise predictions, guided by cross-attention-based positional conditioning and structural priors derived from segmentation masks produced by a pre-trained TotalSegmentator model. The second stage involves applying a 3D UNet model trained with a multi-objective loss that jointly optimizes pixel-level fidelity and volumetric coherence. We evaluate 2S-CEDiff on an in-house paired 3D cardiac CT dataset, where it achieves state-of-the-art performance in PSNR (29.87), SSIM (0.89), and inter-slice coherence. Moreover, the synthesized contrast-enhanced images significantly enhance downstream anatomical segmentation accuracy, improving the overall Dice score to 0.9447 and yielding a substantial +0.0798 increase in myocardium segmentation performance with TotalSegmentator, underscoring their clinical utility and translational potential.

The quality of captured images significantly impacts the performance of downstream perception tasks. Recent works that co-design camera systems with perception tasks have demonstrated improved task performance. However, prior approaches primarily focus on optimising fixed camera parameters determined at manufacturing, whereas many parameters, such as exposure settings, require adaptive control at runtime. This paper presents a method that jointly optimises camera hardware and adaptive camera control algorithms alongside downstream vision tasks. We propose a unified optimisation framework that combines gradient-based and derivative-free methods to support continuous and discrete parameters, non-differentiable image formation processes, and a neural network-based adaptive camera control algorithm. To handle non-differentiable rendering of some image effects, such as motion blur, we propose DF-Grad, a hybrid optimisation method that supervises the neural adaptive control algorithm using signals from the derivative-free optimiser, in addition to unsupervised task-driven learning. Experiments show that the proposed method outperforms baselines that optimise camera hardware and camera control algorithm separately, particularly under challenging conditions such as low light and fast motion. We demonstrate that joint optimisation of both static camera hardware parameters and adaptive control algorithms leads to improved perception task performance, offering a unified approach to task-driven camera system design.


98
CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding

Fevziye Irem Eyiokur ⋅ Dogucan Yaman ⋅ Hazım Ekenel ⋅ Alexander Waibel

We address the problem of Embodied Reference Understanding, which involves predicting the object that a person in the scene is referring to through both a pointing gesture and language. Accurately identifying the referent requires multimodal understanding: integrating textual instructions, visual pointing, and scene context. However, existing methods often struggle to effectively leverage visual clues for disambiguation. We also observe that, while the referent is often aligned with the head-to-fingertip line, it occasionally aligns more closely with the wrist-to-fingertip line. Therefore, relying on a single line assumption can be overly simplistic and may lead to suboptimal performance. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We further introduce a Gaussian ray heatmap representation of these lines and use them as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To combine the strengths of both models, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble based on CLIP features. Additionally, we propose an object center prediction head as an auxiliary task to further enhance referent localization. We validate our approach through extensive experiments and analysis on the benchmark YouRefIt dataset, achieving an improvement of approximately 4 mAP at the 0.25 IoU threshold.


99
Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Zaira Manigrasso ⋅ Matteo Dunnhofer ⋅ Antonino Furnari ⋅ Moritz Nottebaum ⋅ Antonio Finocchiaro ⋅ Marana Davide ⋅ Rosario Forte ⋅ Giovanni Farinella ⋅ Christian Micheloni

Episodic memory retrieval enables wearable cameras to recall objects or events previously observed in video. However, existing formulations assume an "offline" setting with full video access at query time, limiting their applicability in real-world scenarios with power- and storage-constrained wearable devices. Towards more application-ready episodic memory systems, we introduce Online Visual Query 2D (OVQ2D), a task where models process video streams online, observing each frame only once, and retrieve object localizations using a compact memory instead of full video history. We address OVQ2D with ESOM (Egocentric Streaming Object Memory), a novel framework integrating an object discovery module, an object tracking module, and a memory module that find, track, and store spatio-temporal object information for efficient querying. Experiments on Ego4D demonstrate ESOM's superiority over other online approaches, though OVQ2D remains challenging, with top performance at only ~4% success. ESOM's accuracy increases markedly with perfect object tracking (31.91%), discovery (40.55%), or both (81.92%), underscoring the need for applied research on these components.


100
Mobile-Oriented Video Diffusion: Enabling Text-to-Video Generation on Mobile Devices Without Retraining, Compression, or Pruning

Bosung Kim ⋅ Kyuhwan Lee ⋅ Isu Jeong ⋅ Jungmin Cheon ⋅ Yeojin Lee ⋅ Seulki Lee

We present the Mobile-Oriented Video Diffusion (MOVD) framework, the first diffusion-based text-to-video generation framework designed for efficient on-device execution on smartphone-grade hardware, without requiring retraining, compression, or pruning. To address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices, MOVD applies two novel techniques to pre-trained video generative models. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Furthermore, by integrating Concurrent Inference with Dynamic Loading (CI-DL), which enables large models to be split into smaller segments for execution in limited-memory environments, into the MOVD framework, we enable a text-to-video diffusion model to run on an iPhone 15 Pro, achieving performance comparable to that of high-end GPUs. These results show that MOVD enables efficient and high-quality video generation on resource-constrained mobile devices. We envision the proposed MOVD as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on mobile and embedded devices without resource-intensive optimization procedures (e.g., retraining, compression, or pruning).


101
Workzone3D: A Multimodal Dataset for 3D Work Zone Perception in Autonomous Driving

Shounak Sural ⋅ Nishad Sahu ⋅ Ragunathan Rajkumar

Work zones are essential to maintain, repair and upgrade our roadways. However, they introduce complex, dynamic and challenging environments for autonomous vehicles to navigate safely. To help address this challenge, we introduce the first publicly available, large-scale, multimodal work zone dataset collected with an autonomous vehicle consisting of multiple synchronized lidars and high-resolution cameras. The dataset covers various work zone elements like cones, barrels and other channelizers. Our dataset, referred to as WorkZone3D, consists of 3D annotation boxes for these objects. We also propose a detailed auto-annotation pipeline that can produce high-quality 3D labels, even for rare classes which do not have pre-trained 3D object detection models to start with. We evaluate a camera+lidar-based deep learning model on this dataset, highlighting the critical role of sensor fusion in accurate 3D localization of such small objects at a distance, often having very few lidar points on them. Our results demonstrate the usefulness of our dataset for generalization to real-world scenarios. We will release both our code and data.


102
Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models

Prin Phunyaphibarn ⋅ Phillip Lee ⋅ Jaihoon Kim ⋅ Minhyuk Sung

Classifier-Free Guidance (CFG) is a fundamental technique in training conditional diffusion models. The common practice for CFG-based training is to use a single network to learn both conditional and unconditional noise prediction, with a small dropout rate for conditioning. However, we observe that the joint learning of unconditional noise with limited bandwidth in training results in poor priors for the unconditional case. More importantly, these poor unconditional noise predictions seriously degrade the quality of conditional generation. Inspired by the fact that most CFG-based conditional models are trained by fine-tuning a base model with better unconditional generation, we first show that simply replacing the unconditional noise in CFG with that predicted by the base model can significantly improve conditional generation. Furthermore, we show that a diffusion model other than the one the fine-tuned model was trained on can be used for unconditional noise replacement. We experimentally verify our claim with a range of CFG-based conditional models for both image and video generation, including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix.
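The replacement described above can be sketched with the usual CFG combination rule, here written as the unconditional prediction plus a scaled difference between the conditional and unconditional predictions (one common convention; others shift the scale by one). The noise vectors below are made-up numbers standing in for network outputs, and `cfg_noise` is a hypothetical helper, not code from the paper.

```python
def cfg_noise(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Stand-ins for noise predictions at one denoising step (illustrative values).
eps_cond_finetuned = [1.0, 2.0]    # fine-tuned model, with conditioning
eps_uncond_finetuned = [0.5, 0.5]  # fine-tuned model, conditioning dropped
eps_uncond_base = [0.0, 1.0]       # base model, unconditional

# Standard CFG uses the fine-tuned model for both terms.
standard = cfg_noise(eps_cond_finetuned, eps_uncond_finetuned, scale=2.0)

# The variant described above swaps in the base model's unconditional noise.
replaced = cfg_noise(eps_cond_finetuned, eps_uncond_base, scale=2.0)
```

Only the unconditional term changes, so the swap needs one extra forward pass through the base model per step and no retraining of the fine-tuned model.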


103
Training-free Detection of Text-to-video Generations via Over-coherence

Jonathan Brokman ⋅ Oren Rachmil ⋅ Omer Hofman ⋅ Roy Betser ⋅ Amit Giloni ⋅ Roman Vainshtein ⋅ Hisashi Kojima

Text-to-video generative models have emerged as powerful tools in content creation, capable of synthesizing highly realistic videos from textual prompts. However, the rapid advancement of these models introduces significant security and trust concerns, as traditional detection methods struggle to generalize to unseen generative techniques. Existing approaches commonly rely on supervised learning, requiring continuous dataset curation and retraining, which is impractical given the fast-paced evolution of generative models. In this work, we introduce the first training-free detection method for AI-generated videos, eliminating the need for labeled training data or prior exposure to generation techniques. Our approach exploits a fundamental weakness in text-to-video models: unnatural temporal over-coherence in frame transitions. By leveraging a novel time-coherence detection criterion, our method identifies distinct artifacts in video embeddings, which are absent in real videos. We extensively evaluate our approach, demonstrating that it significantly outperforms existing baselines while maintaining robustness to unseen generative models. This work establishes a new direction for training-free detection of text-to-video generated content, providing a scalable and time-resilient solution.


104
ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion

Junming Liu ⋅ Yifei Sun ⋅ Weihua Cheng ⋅ Yujin Kang ⋅ Yirong Chen ⋅ Ding Wang ⋅ Guosun Zeng

Magnetic Resonance Imaging (MRI) plays a crucial role in brain disease diagnosis, but it is not always feasible for certain patients due to physical or clinical constraints. Recent studies attempt to synthesize MRI from Computed Tomography (CT) scans; however, low-dose protocols often result in highly sparse CT volumes, making accurate reconstruction of the full brain MRI volume particularly challenging. To address this, we propose ReBrain, a retrieval-augmented diffusion framework for cross-modal brain MRI reconstruction. Given any 3D CT scan with sparse slices, we first employ a Brownian Bridge Diffusion Model (BBDM) to synthesize MRI slices from the available CT data. Simultaneously, we retrieve structurally and pathologically similar CT slices from a comprehensive prior database via a fine-tuned retrieval model. These retrieved slices are used as references, incorporated through a ControlNet branch to guide the generation of intermediate MRI slices and ensure structural continuity. We further account for rare retrieval failures when the database lacks suitable references and apply spherical linear interpolation to provide supplementary guidance. Extensive experiments on SynthRAD2023 and BraTS demonstrate that ReBrain achieves state-of-the-art performance in cross-modal reconstruction under sparse conditions. The code will be released upon acceptance.


105
Efficient Text-Guided Convolutional Adapter for the Diffusion Model

Aryan Das ⋅ Koushik Biswas ⋅ Swalpa Roy ⋅ Badri Patro ⋅ Vinay Verma

We propose Nexus Adapters, a novel family of text-guided and efficient adapters designed for diffusion-based Structure Preserving Conditional Generation (SPCG). Existing structure-preserving methods typically rely on using a base diffusion model conditioned on a prompt and a separate adapter for structural inputs like sketches or depth maps. However, such approaches are computationally expensive, often requiring adapter sizes comparable to the base model, making training impractical due to the already high cost of diffusion models. Moreover, traditional adapters operate independently of the input prompt, making them suboptimal for capturing the semantic alignment between textual and structural cues. To address these limitations, we introduce Nexus Prime and Nexus Slim, two prompt-aware adapters that jointly leverage both text and structural inputs. These adapters are composed of modular Nexus Blocks that utilize cross-attention mechanisms to enable effective multimodal conditioning. As a result, the adapters can maintain structural fidelity while aligning more closely with the intended textual prompt. Experimental results demonstrate that Nexus Prime significantly improves performance with only 8M additional parameters over the minimal T2I baseline. Additionally, Nexus Slim, a lightweight variant with 18M fewer parameters than T2I, still achieves competitive, state-of-the-art results—validating the efficiency and effectiveness of our proposed design.

Medical image segmentation (MIS) is challenged by anatomical variability, ambiguous boundaries, and subtle textures, demanding an efficient balance between fine local details and global context. Existing architectures often suffer from suboptimal fusion of spatial and frequency-domain features, limiting their ability to capture richer structural and textural representations. To specifically address these challenges, we introduce FLoMo-Net, a MIS architecture designed to simultaneously achieve superior performance and efficiency by adaptively routing multi-scale local and global context information. The proposed Local-Global Mixture of Experts encoder dynamically integrates specialized convolutional branches to selectively capture relevant spatial scales, while the Dual-Attention Selective Aggregator module further refines deep encoder and decoder features by integrating frequency-guided channel attention and adaptive spatial attention. Additionally, the Frequency-Aware Multi-Scale Refinement module enhances structural precision by explicitly modeling frequency-domain features at the bridge of the architecture. Our False Positive/Negative Corrective Attention Module leverages uncertainty measures, derived from entropy and cosine dissimilarity, to specifically address semantic drift and improve boundary delineation in the decoder stages. Extensive experiments across four benchmark MIS datasets demonstrate that FLoMo-NetB2 achieves superior performance with significantly fewer parameters, outperforming state-of-the-art counterparts in both dice score and inference speed. Moreover, our architecture scales effectively, with FLoMo-NetB0 (2.06M) and FLoMo-NetB1 (7.99M) delivering competitive results, underscoring the practical viability of our design for real-time clinical applications.

Understanding human skill performance is essential for intelligent assistive systems, with struggle recognition offering a natural cue for identifying user difficulties. While prior work focuses on offline struggle classification and localization, real-time applications require models capable of detecting and anticipating struggle online. We reformulate struggle localization as an online detection task and further extend it to anticipation, that is, predicting struggle moments before they occur. We adapt two off-the-shelf models as baselines for online struggle detection and anticipation. Online struggle detection achieves 70–80% per-frame mAP, while struggle anticipation up to 2 seconds ahead yields comparable performance with slight drops. We further examine generalization across tasks and activities and analyse the impact of skill evolution. Despite larger domain gaps in activity-level generalization, models still outperform random baselines by 4–20%. Our feature-based models run at up to 143 FPS, and the whole pipeline, including feature extraction, operates at around 3 FPS, which is sufficient for real-time assistive applications.

Autonomous vehicles must excel in safety-critical perception tasks, especially in adverse weather conditions. In addition, transitional weather shifts in nature, such as sunny to rainy, rainy to cloudy, etc., pose abrupt illumination changes that can distort object boundaries and degrade segmentation performance. Existing research focuses mainly on segmentation in clear and discrete weather conditions, leaving a gap in addressing the issues of transitional weather scenarios. Hence, we propose a novel method called causal road and rest segmentation (CaRS) that utilizes causal intervention to mitigate the confounding bias due to transitional weather changes. We use dual complementary attention modules, one for causal and another for confounding feature extraction. These modules complement each other and are fine-tuned via an adversarial min-max approach to reduce confounding bias and enhance segmentation performance. Also, our CaRS method concurrently performs road semantic segmentation and instance segmentation of vehicles and pedestrians. Further, we introduce a transitional weather-driving dataset for segmentation (TWDS16) using a spurious correlation generator that leverages data interpolation to produce 16 weather transitions. We evaluate the performance of CaRS on TWDS16, along with three other benchmark datasets, namely, Foggy Cityscapes, RainCityscapes, and BDD100K. The experimental results validate the efficacy of the proposed method in mitigating confounding influences, leading to improved mIoU for semantic segmentation and mAP for instance segmentation across diverse datasets.


109
Domain Generalizing DINO for Visual Regression via Latent Distractor Subspace Consistency

Nikhil Kumar Jangamreddy ⋅ Chetan Arora ⋅ Mahsa Baktashmotlagh

Vision Foundation Models, such as DINOv2, have demonstrated remarkable generalization in classification; however, their application to out-of-domain visual regression tasks remains a significant and underexplored challenge. Unlike classification, domain generalization in regression poses distinct challenges: regression produces continuous outputs and is particularly sensitive to high-variance, label-irrelevant factors (e.g., illumination, blur, or contrast). These factors can entangle with task-relevant features and induce spurious correlations. While recent regression methods have shown promise, they often rely on CNN backbones and require the pre-specification of known distractors. This demands significant domain expertise and fails to address spurious correlations that emerge during training. To address these challenges, we propose a Latent Distractor Subspace Consistency (LDSC) framework that disentangles intermediate feature representation into task-relevant and latent distractor subspaces, and regularizes the latter under photometric perturbations to suppress spurious correlations while preserving discriminative features during training. Our proposed method is the first to effectively adapt the powerful DINOv2 backbone for domain generalized visual regression. It achieves state-of-the-art results on seven benchmark regression datasets, demonstrating its strong performance in domain generalization for visual regression with percentage improvements of (41.75%, 20.12%, 52.05%, 8.27%, 22.21%, 3.55%) over state-of-the-art DG regression methods, respectively. Source code is provided in the supplementary.


110
WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion

Sajjad Pakdamansavoji ⋅ Yintao Ma ⋅ Amir Rasouli ⋅ TONGTONG CAO

Accurate 6D object pose estimation is vital for robotics, augmented reality, and scene understanding. For seen objects, high accuracy is often attainable via per-object fine-tuning, but generalizing to unseen objects remains a challenge. To address this problem, prior work assumes access to CAD models at test time and typically follows a multi-stage pipeline to estimate poses: detect and segment the object, propose an initial pose, and then refine it. Under occlusion, however, the early stages of such pipelines are prone to errors, which can propagate through the sequential processing and consequently degrade performance. To remedy this shortcoming, we propose four novel extensions to model-based 6D pose estimation methods: (i) a dynamic non-uniform dense sampling strategy that focuses computation on visible regions, reducing occlusion-induced errors; (ii) a multi-hypothesis inference mechanism that retains several confidence-ranked pose candidates, mitigating brittle single-path failures; (iii) iterative refinement to progressively improve pose accuracy; and (iv) a series of occlusion-focused training augmentations that strengthen robustness and generalization. Furthermore, we propose a new visibility-weighted metric for evaluation under occlusion to minimize the bias in the existing protocols. Via extensive empirical evaluations, we show that our proposed approach achieves more than 5% improvement in accuracy on ICBIN and more than 2% on BOP dataset benchmarks, while achieving approximately 3 times faster inference.


111
Decoupling Shape and Texture in SAM-2 via Controlled Texture Replacement

Inbal Cohen ⋅ Boaz Meivar ⋅ Peihan Tu ⋅ Shai Avidan ⋅ Gal Oren

Segment Anything Models (SAM) have demonstrated strong generalization in object segmentation across diverse datasets. However, their training on large-scale semantic segmentation data induces a shape bias, leading to over-segmentation in texture-dominant scenes and severely limiting performance. This limitation is particularly pronounced in domains such as remote sensing and metallographic imaging, where meaningful boundaries are defined by texture variations rather than semantic structure. In this study, we investigate SAM's shape bias and show that a simple fine-tuning strategy, based on incremental texture augmentations of semantically labeled data, can effectively calibrate this bias and guide the model toward texture-aware segmentation. By interpolating and replacing textures within semantically labeled regions, we generate texture-diverse instances of the same semantic category, enabling effective fine-tuning without requiring additional manual annotations. We release both the texture-oriented variant of SAM ("TextureSAM") and the texture-augmented dataset used in our experiments to support reproducibility and facilitate further research on shape–texture bias in foundation models. We show that this fine-tuning approach mitigates SAM-2's shape bias, improving segmentation performance on both real-world (RWTD, +0.20 mIoU) and synthetic (STMD, +0.18 mIoU) texture segmentation benchmarks.

Despite rapid advances, text-to-image (T2I) models still falter in generating anatomically coherent and semantically grounded humans. We introduce HumanBench, a large-scale (35K-image), privacy-friendly benchmark that rigorously evaluates T2I models across four axes: template consistency, spatial reasoning, action understanding, and texture recognition. To quantify alignment, we propose two novel metrics, Agreement and Distinction, capturing both fidelity to prompts and semantic contrast with counterfactuals and negations. Evaluating six leading models, we uncover persistent failures including disfigurements, species leakage, texture-object mismatches, and counting errors, especially under compound prompts. A complementary human study reveals that image realism and correctness degrade with prompt complexity, validating our automated assessments. HumanBench offers the first comprehensive audit of human-centric T2I generation, setting a new standard for benchmarking anatomical accuracy, compositional reasoning, and trustworthiness in generative models.


113
SOAF: Scene Occlusion-aware Neural Acoustic Field

Huiyu Gao ⋅ Jiahao Ma ⋅ David Ahmedt-Aristizabal ⋅ Chuong Nguyen ⋅ Miaomiao Liu

This paper tackles the problem of novel view acoustic synthesis along an arbitrary trajectory in an indoor scene, given the audio-video recordings from other known trajectories of the scene. Existing methods often overlook the effect of room geometry, particularly wall occlusions on sound propagation, making them less accurate in multi-room environments. In this work, we propose a new approach called Scene Occlusion-aware Acoustic Field (SOAF) for accurate sound generation. Our approach derives a global prior for the sound field learning through distance-aware parametric sound propagation modeling and then transforms it based on the scene structure learned from the input video. We extract features from the local acoustic field centered at the receiver using a Fibonacci Sphere to generate binaural audio for novel views with a direction-aware attention mechanism. Extensive experiments on the real dataset RWAVS and the synthetic dataset SoundSpaces demonstrate that our method achieves superior performance in spatial audio generation.


114
FedSCAl: Leveraging Server and Client Alignment for Unsupervised Federated Source-Free Domain Adaptation

M Yashwanth ⋅ Sampath Koti ⋅ Arunabh Singh ⋅ Shyam Marjit ⋅ Anirban Chakraborty

We address the Federated source-Free Domain Adaptation (FFreeDA) problem, in which clients hold unlabelled data with significant inter-client domain gaps. The FFreeDA setup restricts access to the source dataset during the training rounds, so FL frameworks can employ only a pre-trained server model. Often, the source domain dataset differs in distribution from the clients' domains. When Source-Free Domain Adaptation (SFDA) methods are adapted to FL to address these challenges, they struggle with client-drift in real-world scenarios because of the extreme data heterogeneity caused by the aforementioned domain gaps, resulting in unreliable pseudo-labels. In this paper, we introduce FedSCAl, an FL framework leveraging our proposed Server-Client Alignment (SCAl) mechanism to regularize client updates by aligning the clients' and server model's predictions. We observe an improvement in the clients' pseudo-labeling accuracy post alignment, as the SCAl mechanism helps to mitigate the client-drift. Further, we present extensive experiments on benchmark vision datasets showcasing how FedSCAl consistently outperforms state-of-the-art FL methods in the FFreeDA setup for classification tasks.

Open-vocabulary semantic segmentation (OVSS) aims to segment images using arbitrary text queries without retraining. Recent approaches leverage vision-language models like CLIP to enable training-free segmentation. However, CLIP is primarily trained for global image-text alignment, which can lead to challenges in capturing fine-grained regional semantics and result in inconsistent predictions across image regions. This work introduces a new feature rectification strategy that incorporates localised structural priors via spherical linear interpolation on a supersphere. Specifically, we construct a regional adjacency graph guided by a combination of low-level image features (such as colour differences, gradients, and textures) to encode localised appearance cues as priors. This encourages more region-aware feature alignment, complementing CLIP’s global alignment bias. Extensive experimental analysis shows the effectiveness of the proposed method in reducing segmentation noise and improving the preservation of fine-grained structures. Further generalisation analysis confirms that our approach maintains strong performance across diverse training-free open-vocabulary semantic segmentation benchmarks.
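For readers unfamiliar with spherical linear interpolation, the following is a minimal sketch of the standard slerp formula for two unit-norm feature vectors. It shows only the generic operation; how the abstract's method chooses which features to interpolate and with what weight is not stated there, and the function name `slerp` and the toy vectors are assumptions.

```python
import math


def slerp(u, v, t):
    """Interpolate along the great circle from unit vector u to unit vector v.

    t = 0 returns u, t = 1 returns v; intermediate results stay unit-norm.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)  # angle between u and v
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        # Vectors are (nearly) identical or antipodal; the formula is
        # numerically unstable here, so fall back to u.
        return list(u)
    w_u = math.sin((1.0 - t) * omega) / sin_omega
    w_v = math.sin(t * omega) / sin_omega
    return [w_u * a + w_v * b for a, b in zip(u, v)]


# Toy example: halfway between two orthogonal unit vectors.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike plain linear averaging followed by renormalisation, slerp moves at constant angular speed, which is why it is a natural choice for features that are compared by cosine similarity.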


116
One Model, Many Behaviors: Training-Induced Effects on Out-of-Distribution Detection

Gerhard Krumpl ⋅ Henning Avenhaus ⋅ Horst Possegger

Out-of-distribution (OOD) detection is crucial for deploying robust and reliable machine-learning systems in open-world settings. Despite steady advances in OOD detectors, their interplay with modern training pipelines that maximize in-distribution (ID) accuracy and generalization remains under-explored. We investigate this link through a comprehensive empirical study. Fixing the architecture to the widely adopted ResNet-50, we benchmark 21 post-hoc, state-of-the-art OOD detection methods across 54 ImageNet-trained models obtained via diverse training strategies and evaluate them on eight OOD test sets. Contrary to the common assumption that higher ID accuracy implies better OOD detection performance, we uncover a non-monotonic relationship: OOD performance initially improves with accuracy but declines once advanced training recipes push accuracy beyond the baseline. Moreover, we observe a strong interdependence between training strategy, detector choice, and resulting OOD performance, indicating that no single method is universally optimal.


117
TalkingHeadBench: A Multi-Modal Benchmark & Analysis of Talking-Head DeepFake Detection

Xinqi Xiong ⋅ Prakrut Patel ⋅ Qingyuan Fan ⋅ Amisha Wadhwa ⋅ Sarathy Selvam ⋅ Xiao Guo ⋅ Luchao Qi ⋅ Xiaoming Liu ⋅ Roni Sengupta

The rapid advancement of talking-head deepfake generation fueled by advanced generative models has elevated the realism of synthetic videos to a level that poses substantial risks in domains such as media, politics, and finance. However, current benchmarks for deepfake talking-head detection fail to reflect this progress, relying on outdated generators and offering limited insight into model robustness and generalization. We introduce TalkingHeadBench, a new benchmark designed to address this gap, featuring talking-head videos from six modern generators, with an additional two emerging generators used exclusively for testing generalization. The dataset is built on an expert-led curation process that filters over 60\% of samples to remove videos with noticeable artifacts, presenting a more difficult challenge for detectors. Our evaluation protocols are designed to measure generalization across identity and generator shifts. Benchmarking seven state-of-the-art detectors reveals that models with high accuracy on older datasets like FaceForensics++ show a significant performance drop on our curated data, particularly at strict false positive rates (e.g., TPR@FPR=0.1\%). In addition, we identify a trend where detectors focus on background cues instead of facial features using Grad-CAM visualization. The dataset will be made available in open access with all data splits and protocols. Our benchmark aims to accelerate research towards more robust and generalizable detection models in the face of rapidly evolving generative techniques.
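The strict operating point quoted above (TPR@FPR=0.1\%) can be made concrete with a small sketch: pick the lowest threshold whose false positive rate on real videos stays within the budget, then report the fraction of fakes detected. The function name `tpr_at_fpr`, the label convention (1 = fake, 0 = real), and the toy scores are assumptions for illustration; the benchmark's own evaluation code may differ.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """True positive rate at the loosest threshold with FPR <= max_fpr.

    scores: detector outputs, higher means "more likely fake".
    labels: 1 for fake (positive), 0 for real (negative).
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one real and one fake sample")
    # Number of real samples we are allowed to misclassify as fake.
    allowed = int(max_fpr * len(negatives))
    # Flag only scores strictly above the (allowed+1)-th highest real score,
    # so at most `allowed` real samples are flagged.
    threshold = negatives[allowed] if allowed < len(negatives) else float("-inf")
    detected = sum(1 for s in positives if s > threshold)
    return detected / len(positives)


# Toy data: four fakes and four reals.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
strict = tpr_at_fpr(toy_scores, toy_labels, 0.001)
loose = tpr_at_fpr(toy_scores, toy_labels, 0.25)
```

With only four real samples, a 0.1% budget permits no false positives at all, which illustrates why such operating points need large real-video test sets to be measured meaningfully.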


118
Rethinking Real Image Editing: Unleashing Diverse Editing Operators via Multi-Objective Optimization

Siyuan Wang ⋅ Xi Yang ⋅ Zihao Zhou ⋅ Huiru Shao ⋅ Rui Zhang ⋅ Qiufeng Wang ⋅ Guangliang Cheng ⋅ Kaizhu Huang

Text-conditioned diffusion models have revolutionized the field of controllable real image editing, enabling high-fidelity and precise image manipulation. Recent methods target specific editing tasks, using internal representations from reconstruction to ensure consistency. Although effective for single tasks, they fail to balance precision and consistency across diverse image editing tasks. In this work, we propose a novel inference-time real-image editing framework that enables executing multiple editing tasks by tuning editing operators. Our key insight is to treat real image editing as a multi-objective optimization problem, optimizing editing operators for a Pareto optimal solution that balances editing accuracy and consistency at each denoising iteration. Additionally, we design a benchmark for operator-guided real-image editing that covers various local and global editing tasks. Extensive experimental evaluations demonstrate the method's effectiveness in executing precise edits while preserving image fidelity across all tasks, thereby establishing it as the new state-of-the-art.


119
Decomposition Sampling for Efficient Region Annotations in Active Learning

Jingna Qiu ⋅ Frauke Wilm ⋅ Mathias Oettl ⋅ Jonas Utz ⋅ Maja Schlereth ⋅ Moritz Schillinger ⋅ Marc Aubreville ⋅ Katharina Breininger

Active learning improves annotation efficiency by selecting the most informative samples for annotation and model training. While most prior work has focused on selecting informative images for classification tasks, we investigate the more challenging setting of dense prediction, where annotations are more costly and time-intensive, especially in medical imaging. Region-level annotation has been shown to be more efficient than image-level annotation for these tasks. However, existing methods for representative annotation region selection suffer from high computational and memory costs, irrelevant region choices, and heavy reliance on uncertainty sampling. We propose decomposition sampling (DECOMP), a new active learning sampling strategy that addresses these limitations. It enhances annotation diversity by decomposing images into class-specific components using pseudo-labels and sampling regions from each class. Class-wise predictive confidence further guides the sampling process, ensuring that difficult classes receive additional annotations. Across ROI classification, 2-D segmentation, and 3-D segmentation, DECOMP consistently surpasses baseline methods by better sampling minority-class regions and boosting performance on these challenging classes. Code is in supplementary materials.


120
DNA: Dual-branch Network with Adaptation for Open-Set Online Handwriting Generation

Tsai-Ling Huang ⋅ Nhat-Tuong Do-Tran ⋅ Ngoc-Hoang-Lam Le ⋅ Hong-Han Shuai ⋅ Ching-Chun Huang

Online handwriting generation (OHG) enhances handwriting recognition models by synthesizing diverse, human-like samples. However, existing OHG methods struggle to generate unseen characters, particularly in glyph-based languages like Chinese, limiting their real-world applicability. In this paper, we introduce Open-set Online Handwriting Generation (OOHG). In this new task, the writer’s style and the characters generated during testing are unseen during training. To tackle this challenge, we propose a Dual-branch Network with Adaptation (DNA), which comprises an adaptive style branch and an adaptive content branch. The style branch learns stroke attributes such as writing direction, spacing, placement, and flow to generate realistic handwriting. Meanwhile, the content branch is designed to generalize effectively to unseen characters by decomposing character content into structural information and texture details, extracted via local and global encoders, respectively. Extensive experiments demonstrate that our DNA model is well-suited for the OOHG setting, achieving state-of-the-art performance.


121
STEG-AIW: Spatio-Temporal Gating and Adaptive-Timestep Inference for Efficient Spiking Neural Networks

Gulfam A Saju ⋅ Anton Spirkin ⋅ Felipe Marcelino ⋅ Yuchou Chang

Spiking neural networks (SNNs) are efficient, yet modern systems still waste compute by propagating redundant activations within a timestep and by using a fixed temporal horizon regardless of input difficulty. We present STEG-AIW, a training and inference framework that addresses both issues. The Spatio-Temporal Efficient Gate (STEG) is a lightweight gating module placed at residual stages. It suppresses non-salient activations while preserving temporal dynamics. The Adaptive Inference Window (AIW) module accumulates per-timestep evidence and converts it to halting probabilities for sample-wise early termination. We train the model end-to-end with a loss that balances task accuracy, an efficiency term proportional to the expected number of timesteps, and a sparsity term on gate activations. A simple complexity analysis links these choices to fewer synaptic operations. On static image benchmarks, STEG-AIW attains state-of-the-art accuracy with 34-88% fewer timesteps than the strongest baselines. On neuromorphic datasets, it matches or exceeds the best accuracy with 43-73% fewer timesteps and reduces synaptic operations accordingly. Overall, STEG-AIW provides a backbone-agnostic path to accurate, low-power inference. This moves SNNs closer to practical deployment. Code will be released upon acceptance.


122
Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction

Ce Zhang ⋅ Yale Song ⋅ Ruta Desai ⋅ Michael Iuzzolino ⋅ Joseph Tighe ⋅ Gedas Bertasius ⋅ Satwik Kottur

Visual Planning for Assistance (VPA) aims to predict a sequence of user actions required to achieve a specified goal based on a video showing the user’s progress. Although recent advances in multimodal large language models (MLLMs) have shown promising results in video understanding, long-horizon visual planning remains a challenging problem. We identify two challenges in training large MLLMs for video-based planning tasks: (1) the scarcity of procedural annotations, which limits the model’s ability to learn procedural task dynamics effectively, and (2) the inefficiency of the next-token prediction objective at explicitly capturing the structured action space of visual planning, as compared to free-form natural language. To tackle data scarcity, we introduce Auxiliary Task Augmentation. We design and train our model on auxiliary tasks relevant to long-horizon video-based planning (e.g., goal prediction) to augment the model’s planning ability. To more explicitly model the structured action space unique to visual planning tasks, we leverage Multi-token Prediction, extending traditional next-token prediction by using multiple heads to predict multiple future tokens during training. Our approach, VideoPlan, achieves state-of-the-art VPA performance on the COIN and CrossTask datasets, surpassing prior methods by 7.3% and 3.4%, respectively, when predicting 3 future actions. We further extend our method to the challenging Ego4D Long-term Action Anticipation task, and show that it is on par with the state-of-the-art approaches despite not using specialized egocentric features. We will open-source data, model checkpoints, and training code.


123
Self-Supervised Compression and Artifact Correction for Streaming Underwater Imaging Sonar

Rongsheng Qian ⋅ Chi Xu ⋅ Xiaoqiang Ma ⋅ Hao Fang ⋅ Yili Jin ⋅ William Atlas ⋅ Jiangchuan Liu

The growing affordability of sonar devices, along with expanded 5G and satellite access, has accelerated imaging sonar deployment in remote, real-time scenarios. Common applications include offshore rescue, disaster warnings, and in-season fisheries. Real-time sonar streaming and analytics in such wild environments face challenges from limited infrastructure, network dynamics, and signal distortion. Existing methods also struggle with low data quality and complex, sonar-specific artifacts. To address these issues, we develop SCOPE, a self-supervised framework for joint compression and artifact correction of sonar streams. It integrates (1) Adaptive Codebook Compression (ACC) for stable latent representations of sonar data, (2) Frequency-Aware Multiscale Segmentation (FAMS) to decompose signals into high-frequency temporal components and low-frequency structural signals while suppressing artifacts, and (3) a hedging training strategy that improves frequency sensitivity and further reduces artifacts. SCOPE requires no clean ground truth and adapts to streaming conditions. Evaluated on real-world Adaptive Resolution Imaging Sonar (ARIS) data, it achieved 0.77 SSIM, 40\% higher than prior work, and compressed to $\leq0.0118$ bits per pixel. Experiments showed SCOPE reduced bandwidth by over 50\% while improving downstream tasks. With 3.1 ms encoding and 97 ms decoding, SCOPE enables real-time processing and has been deployed in three Pacific Northwest rivers for salmon and environmental monitoring, enabling practical, quality sonar streaming in the wild.


124
Perceptually Guided 3DGS Streaming and Rendering for Mixed Reality

Yunxiang Zhang ⋅ Sai Mupparaju ⋅ Kenneth Chen ⋅ Jenna Kang ⋅ Xinyu Zhang ⋅ Maito Omori ⋅ Kazuyuki Arimatsu ⋅ Qi Sun

Recent breakthroughs in radiance fields, particularly 3D Gaussian Splatting (3DGS), have unlocked real-time, high-fidelity rendering of complex environments, boosting broad applications. However, the stringent requirements of mixed reality (MR), including high refresh rates, high-resolution stereo rendering, and limited computing, remain beyond the reach of current 3DGS methods. Meanwhile, the wide field-of-view design of MR displays, which mimics natural human vision, offers a unique opportunity to exploit the limitations of the human visual system to reduce computation overhead without compromising perceived rendering quality. To this end, we propose a perception-guided, continuous level-of-detail (LOD) framework for 3DGS that maximizes perceived quality under given compute resources. We distill a visual quality metric, which encodes the spatial, temporal, and peripheral characteristics of human visual perception, into a lightweight, gaze-contingent model that predicts and adaptively modulates the LOD across the user's visual field based on each region's contributions to perceptual quality. This resource-optimized modulation, guided by both scene content and user gaze behavior, enables significant runtime acceleration with minimal loss in perceived visual quality. To support low-power, untethered MR setups, we introduce an edge-cloud rendering framework that partially offloads computation to the cloud, further reducing overhead on the edge device. Objective metrics and an MR user study show that, compared to vanilla and foveated LOD baselines, our method achieves superior trade-offs between computational efficiency and perceptual quality.


125
Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting

Shu Zou ⋅ Xinyu Tian ⋅ Lukas Wesemann ⋅ Fabian Waschkowski ⋅ Zhaoyuan Yang ⋅ Jing Zhang

Prompting has emerged as a practical way to adapt frozen vision-language models (VLMs) for video anomaly detection (VAD). Yet, existing prompts are often overly abstract, overlooking the fine-grained human–object interactions or action semantics that define complex anomalies in surveillance videos. We propose ASK-HINT, a structured prompting framework that leverages action-centric knowledge to elicit more accurate and interpretable reasoning from frozen VLMs. Our approach organizes prompts into semantically coherent groups (e.g. violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. Extensive experiments on UCF-Crime and XD-Violence show that ASK-HINT consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods. Beyond accuracy, our framework provides interpretable reasoning traces towards anomaly and demonstrates strong generalization across datasets and VLM backbones. These results highlight the critical role of prompt granularity and establish ASK-HINT as a new training-free and generalizable solution for explainable video anomaly detection.


126
Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

Sree Bhattacharyya ⋅ Yaman Singla ⋅ Sudhir Yarram ⋅ Somesh Singh ⋅ Harini S I ⋅ James Wang

Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that our unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. We also employ a contrastive training strategy to create the first model capable of performing multimodal ToT retrieval. Our dataset and models present a novel direction, facilitating progress in visual content memorability research.


127
Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models

Yuxiao Chen ⋅ Jue Wang ⋅ Zhikang Zhang ⋅ Jingru Yi ⋅ Xu Zhang ⋅ Yang Zou ⋅ Zhaowei Cai ⋅ Jianbo Yuan ⋅ Xinyu Li ⋅ Hao Yang ⋅ Davide Modolo

With recent advancements in video backbone architectures and the remarkable success of large language models (LLMs), long-form video understanding (analyzing videos that span tens of minutes) has become both feasible and increasingly popular. However, the inherently redundant nature of video sequences presents significant challenges for current state-of-the-art models. These challenges arise from two key aspects: 1) efficiently incorporating a larger number of frames within the memory budget, and 2) extracting discriminative information from the vast volume of input data. In this paper, we present a novel, end-to-end schema for long-form video understanding, featuring an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC) integrated with a multimodal large language model (MLLM). Our proposed system offers two significant advantages: it adaptively and effectively captures essential information from video sequences of varying duration, and it achieves high compression rates while preserving crucial discriminative information. The proposed framework achieves promising performance across a range of benchmarks, excelling in both long-form video understanding tasks and standard video understanding benchmarks. These results demonstrate the versatility and effectiveness of our approach, particularly in handling the complexities of long video sequences.


128
Guided Texture Segmentation via Coordinate-Aware Class-Ratio Mapping

Bishal Swain ⋅ Kyung Cheoi ⋅ Jaepil Ko

Segmenting texture-rich images is challenging due to the presence of various local ambiguities. These are common in domains like metallography, where scanning electron microscopy (SEM) images must be segmented to quantify different elements that determine material properties. Conventional encoder–decoder models often struggle in these scenarios because they rely solely on local features and lack global distributional awareness. We propose a guided segmentation framework that introduces coordinate-aware class-ratio mapping, a mechanism that transforms expected class proportions into spatial maps and integrates them with encoder features via an adaptive gate fusion module. This enforces consistency between global class ratios and local pixel predictions, allowing the model to resolve ambiguous textures more effectively. Unlike traditional methods, we condition the model on image-specific global ratios, which can be obtained from experts or estimated in an auto-regressive manner. Extensive experiments on metallographic SEM benchmarks demonstrate that our framework consistently improves performance across diverse backbones, achieving up to +6.1\% Dice score improvements with minimal parameter overhead ($<$2\%). Ablation studies confirm that both coordinate-aware class-ratio mapping and the adaptive gate fusion module contribute complementary benefits. In addition, we demonstrate that segmentation performance improves provided that the accuracy of the input class ratios is at least 60\%. The private data and code will be made available upon publication.


129
Empowering Source-Free Domain Adaptation via MLLM-Guided Reliability-Based Curriculum Learning

Dongjie Chen ⋅ Kartik Patwari ⋅ Zhengfeng Lai ⋅ Xiaoguang Zhu ⋅ Sen-ching Cheung ⋅ Chen-Nee Chuah

Existing SFDA methods struggle to fully use pre-trained knowledge and often rely on a single model’s predictions or handcrafted prompts, limiting robustness under domain shift. Multimodal Large Language Models (MLLMs) offer a promising alternative: they encode rich visual-semantic knowledge and generalize well without task-specific tuning. However, their use in SFDA is hindered by instruction-following failures, inconsistent outputs, and high inference costs. We propose Reliability-based Curriculum Learning (RCL), a novel framework that distills robust supervision from multiple frozen MLLMs into a compact target model. RCL organizes adaptation as a three-stage curriculum that progressively incorporates pseudo-labels based on inter-model agreement and model confidence, enabling stable and noise-aware training. Our approach achieves state-of-the-art performance on the standard SFDA datasets Office-Home, DomainNet-126, and VisDA-C, outperforming zero-shot MLLMs and their ensembles, all without accessing source data or tuning foundation models.


130
ExDDV: A New Dataset for Explainable Deepfake Detection in Video

Vlad Hondru ⋅ Eduard Hogea ⋅ Darian Onchis ⋅ Radu Ionescu

The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, so they need to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://anonymous.4open.science/r/ExDDV/.

Medical image segmentation requires large annotated datasets, creating a significant bottleneck for clinical applications. While one-shot segmentation methods can learn from minimal examples, existing approaches struggle with precise boundary delineation in medical images, particularly when anatomically similar regions appear without sufficient spatial context. We propose GENet (GeoEdgeNet), a novel framework that incorporates spatial relationships through edge-aware geodesic distance learning. Our key insight is that medical structures follow predictable geometric patterns that can guide prototype extraction even with limited training data. Unlike methods relying on complex architectural components or heavy neural networks, our approach leverages computationally lightweight geometric modeling. The framework combines three main components: (1) a multi-component edge-aware geodesic distance learning module that respects anatomical boundaries, (2) adaptive prototype extraction that captures both global structure and local boundary details, and (3) a dual-mode optimization strategy that adapts to different organ types. Extensive experiments on AMOS and ACDC datasets demonstrate substantial improvements over state-of-the-art methods, achieving 82.60\% and 76.33\% mean Dice scores on ABD-MRI and ABD-CT respectively. Notably, our method reduces boundary errors by 44.5\% in terms of Hausdorff Distance compared to existing approaches, making it highly suitable for clinical applications requiring precise segmentation with limited annotated data.


132
KMOPS: Keypoint-Driven Method for Multi-Object Pose and Metric Size Estimation from Stereo Images

Ying-Kun Wu ⋅ Yi Shen ⋅ Tzuhsuan Huang ⋅ I-Sheng Fang ⋅ Jun-Cheng Chen

The six-degree-of-freedom (6-DoF) pose and metric size estimation of multiple objects from RGB images only remains a challenging task, particularly due to significant variations in object shape, appearance, and frequent occlusions in complex scenes. To address these challenges, we introduce KMOPS, a Keypoint-driven method tailored specifically for occlusion-robust Multi-Object Pose and metric Size estimation from a single calibrated stereo image pair. Leveraging the stereo input, our approach first extracts the 2D keypoints of the enclosing bounding boxes of the objects in each view, followed by triangulating them for accurate 3D positions. Then, a pose fitting module is employed to accurately obtain each object’s rotation, translation, and dimensions by registering the triangulated 3D keypoints with the canonical ones using a closed-form weighted Procrustes alignment. Our formulation eliminates the need for predefined 3D search spaces or volumetric anchors, which are often required by other methods to constrain the vast 3D solution space. With extensive experiments on the challenging Transparent Object Dataset (TOD) and the large-scale StereOBJ-1M benchmark, the proposed method consistently achieves state-of-the-art results, outperforming other monocular and stereo methods with a simple and effective architecture.


133
OPFormer: Object Pose Estimation leveraging foundation model with geometric encoding

Artem Moroz ⋅ Vít Zeman ⋅ Martin Mikšík ⋅ Elizaveta Isianova ⋅ Miroslav David ⋅ Pavel Burget ⋅ Varun Burde

We introduce a unified, end-to-end framework that seamlessly integrates object detection and pose estimation with a versatile onboarding process. Our pipeline begins with an onboarding stage that generates object representations from either traditional 3D CAD models or, in their absence, by rapidly reconstructing a high-fidelity neural representation (NeRF) from multi-view images. Given a test image, our system first employs the CNOS detector to localize target objects. For each detection, our novel pose estimation module, OPFormer, infers the precise 6D pose. The core of OPFormer is a transformer-based architecture that leverages a foundation model for robust feature extraction. It uniquely learns a comprehensive object representation by jointly encoding multiple template views and enriches these features with explicit 3D geometric priors using Normalized Object Coordinate Space (NOCS). A decoder then establishes robust 2D-3D correspondences to determine the final pose. Evaluated on the challenging BOP benchmarks, our integrated system demonstrates a strong balance between accuracy and efficiency, showcasing its practical applicability in both model-based and model-free scenarios.


134
Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling

Alexander Prutsch ⋅ David Schinagl ⋅ Horst Possegger

Future trajectories of neighboring traffic agents have a significant influence on the path planning and decision-making of autonomous vehicles. While trajectory forecasting is a well-studied field, research mainly focuses on snapshot-based prediction, where each scenario is treated independently of its global temporal context. However, real-world autonomous driving systems need to operate in a continuous setting, requiring real-time processing of data streams with low latency and consistent predictions over successive timesteps. We leverage this continuous setting to propose a lightweight yet highly accurate streaming-based trajectory forecasting approach. We integrate valuable information from previous predictions with a novel endpoint-aware modeling scheme. Our temporal context propagation uses the trajectory endpoints of the previous forecasts as anchors to extract targeted scenario context encodings. Our approach efficiently guides its scene encoder to extract highly relevant context information without needing refinement iterations or segment-wise decoding. Our experiments highlight that our approach effectively relays information across consecutive timesteps. Unlike methods using multi-stage refinement processing, our approach significantly reduces inference latency, making it well-suited for real-world deployment. We achieve state-of-the-art streaming trajectory prediction results on the Argoverse 2 multi-agent and single-agent benchmarks, while requiring substantially fewer resources.


135
Hybrid State Representation for Video Procedure Planning

Woo Suk Choi ⋅ Youwon Jang ⋅ Minsu Lee ⋅ Byoung-Tak Zhang

Accurate state representation is critical for effective procedure planning from visual inputs. Existing methods typically rely on sentence-level natural language descriptions to represent states. However, such representations are often ambiguous and fail to capture fine-grained object interactions, leading to errors in complex scenarios. To overcome these limitations, we propose a Hybrid State Representation (HSR), inspired by human-like reasoning, that models procedural states with both structural precision and contextual clarity. HSR integrates two complementary modalities: (1) Semantic State Graphs (SSGs), which explicitly encode objects, attributes, and their relations, and (2) contextual Question-Answer (QA) pairs, which act as semantic probes to disambiguate critical state transitions. We further design a heterogeneous encoder to fuse these components and introduce a visual-state alignment objective to ground the hybrid representation in the visual context. Extensive experiments on the COIN, CrossTask, and NIV benchmarks demonstrate that our method establishes a new state-of-the-art, achieving significant gains on the strict Success Rate (SR) metric. Ablation studies confirm that both the structural (SSG) and contextual (QA) components of HSR are essential for the observed performance gains.