Track: Poster Session 4 + Reception

1

CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization

Anh-Duy Le ⋅ Van Pham ⋅ Thanh Vo ⋅ Mai Toan ⋅ Tuan-Anh Tran

One-shot styled handwriting image generation, despite impressive recent results, remains challenging because a single reference image makes it difficult to capture the intricate and diverse characteristics of human handwriting. Existing methods still struggle to generate visually appealing, realistic handwritten images, to adapt to complex, unseen writer styles, and to isolate invariant style features (e.g., slant, stroke width, curvature) while ignoring irrelevant noise. To tackle this problem, we introduce Patch \textbf{Con}trastive Enhancement and \textbf{St}yle-\textbf{A}ware Qua\textbf{nt}ization via Denoising Diffusion (\textbf{CONSTANT}), a novel diffusion-based method for one-shot handwriting generation. CONSTANT leverages three key innovations: 1) a Style-Aware Quantization (SAQ) module that models style as discrete visual tokens capturing distinct concepts; 2) a contrastive objective that ensures these tokens are well-separated and meaningful in the style embedding space; 3) a latent patch-based contrastive ($L_{LatentPCE}$) objective that helps improve quality and local structures by aligning multiscale spatial patches of generated and real features in latent space. Extensive experiments and analysis on benchmark datasets from multiple languages, including English, Chinese, and our proposed ViHTGen dataset for Vietnamese, demonstrate that our method adapts to new reference styles and produces highly detailed images better than state-of-the-art approaches. Code is available at \url{https://github.com/anonymous6399/CONSTANT}.

2

DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy

Jaewoo Song ⋅ Jooyoung Choi ⋅ Kanghyun Baek ⋅ Sangyub Lee ⋅ Daemin Park ⋅ Sungroh Yoon

Although recent text-to-image models achieve high-fidelity text rendering, they still struggle to generate long or multiple texts due to diluted global attention. To address this, we propose DCText, a training-free method for visual text generation that employs a divide-and-conquer strategy, inspired by the reliable short-text generation of Multi-Modal Diffusion Transformer models. To effectively render long or multiple texts, our method first decomposes a global prompt by extracting and dividing the target text, then assigns each decomposed text to a designated region for generation. To ensure these text segments are accurately rendered within their regions while preserving overall image coherence, we introduce two attention masks, the Text-Focus Attention Mask and the Context-Expansion Attention Mask, that are sequentially applied during denoising. In addition, our Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive evaluations across diverse datasets, covering both single-sentence and multi-sentence cases, demonstrate that DCText consistently achieves the highest text accuracy without compromising image quality, while also delivering the lowest generation latency.

3

VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping

Sanoojan Baliah ⋅ Yohan Abeysinghe ⋅ Rusiru Thushara ⋅ Khan Muhammad ⋅ Abhinav Dhall ⋅ Karthik Nandakumar ⋅ Muhammad Haris Khan

We present a training-free, plug-and-play method, namely VFace, for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique to facilitate generation while keeping key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection to better align the structural features from the target frame to the generation. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that enforces spatiotemporal coherence without modifying the underlying diffusion model, reducing the temporal inconsistencies typically encountered in frame-wise generation. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that our method significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code will be released upon acceptance.

4

VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework

Donglin Huang ⋅ Yongyuan Li ⋅ Tianhang Liu ⋅ Junming Huang ⋅ Xiaoda Yang ⋅ Chi Wang ⋅ Weiwei Xu

Existing audio- and pose-driven human animation methods often struggle with stiff head movements and blurry hands, primarily due to the weak correlation between audio and head movements and the structural complexity of hands. To address these issues, we propose \textbf{VividAnimator}, an end-to-end framework for generating high-quality, half-body human animations driven by audio and sparse hand pose conditions. Our framework introduces three key innovations. First, to overcome the instability and high cost of online codebook training, we pre-train a Hand Clarity Codebook (HCC) that encodes rich, high-fidelity hand texture priors, significantly mitigating hand degradation. Second, we design a Dual-Stream Audio-Aware Module (DSAA) to model lip synchronization and natural head pose dynamics separately while enabling interaction. Third, we introduce a Pose Calibration Trick (PCT) that refines and aligns pose conditions by relaxing rigid constraints, ensuring smooth and natural gesture transitions. Extensive experiments demonstrate that VividAnimator achieves state-of-the-art performance, producing videos with superior hand detail, gesture realism, and identity consistency, validated by both quantitative metrics and qualitative evaluations.

5

Fine-grained Defocus Blur Control for Generative Image Models

Ayush Shrivastava ⋅ Connelly Barnes ⋅ Cecilia Zhang ⋅ Lingzhi Zhang ⋅ Andrew Owens ⋅ Sohrab Amirghodsi ⋅ Eli Shechtman

Current text-to-image diffusion models excel at generating diverse, high-quality images, yet they struggle to incorporate fine-grained camera metadata such as precise aperture settings. In this work, we introduce a novel text-to-image diffusion framework that leverages camera metadata, or EXIF data, which is often embedded in image files, with an emphasis on generating controllable lens blur. Our method mimics the physical image formation process by first generating an all-in-focus image, estimating its monocular depth, predicting a plausible focus distance with a novel focus distance transformer, and then forming a defocused image with an existing differentiable lens blur model. Gradients flow backwards through this whole process, allowing us to learn without explicit supervision to generate defocus effects based on content elements and the provided EXIF data. At inference time, this enables precise control over defocus effects while preserving scene contents, which is not achievable with existing diffusion models. Experimental results demonstrate that our model enables superior fine-grained control without altering the depicted scene.
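
The image-formation step described above (depth, focus distance, and aperture determining blur) can be illustrated with the classical thin-lens circle-of-confusion formula. This is only a sketch under that assumption: the abstract does not state which lens blur model is used, and the function name and parameters below are hypothetical. It shows how an f-number and a focus distance, the kind of values found in EXIF metadata, translate into a per-depth blur diameter.

```python
def circle_of_confusion_mm(focal_length_mm, f_number, focus_dist_mm, subject_dist_mm):
    """Blur-circle diameter on the sensor under the thin-lens model (assumed here).

    c = A * m * |S2 - S1| / S2, with aperture diameter A = f / N and
    magnification m = f / (S1 - f), where S1 is the focus distance and
    S2 is the distance to the point being imaged.
    """
    if focus_dist_mm <= focal_length_mm:
        raise ValueError("focus distance must exceed the focal length")
    if subject_dist_mm <= 0 or f_number <= 0:
        raise ValueError("subject distance and f-number must be positive")
    aperture_mm = focal_length_mm / f_number
    magnification = focal_length_mm / (focus_dist_mm - focal_length_mm)
    defocus_ratio = abs(subject_dist_mm - focus_dist_mm) / subject_dist_mm
    return aperture_mm * magnification * defocus_ratio


# A 50 mm lens focused at 2 m, looking at a point 4 m away.
blur_f2 = circle_of_confusion_mm(50.0, 2.0, 2000.0, 4000.0)
blur_f4 = circle_of_confusion_mm(50.0, 4.0, 2000.0, 4000.0)
```

In this model the blur diameter scales inversely with the f-number, so `blur_f2` is twice `blur_f4`, and a point exactly at the focus distance has zero blur.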

6

CalibBEV: LiDAR-Camera Calibration via BEV Alignment

Filippo D'Addeo ⋅ Lorenzo Cipelli ⋅ Adriano Cardace ⋅ Emanuele Ghelfi ⋅ Andrea Zinelli ⋅ Massimo Bertozzi

We present $\textbf{CalibBEV}$, a novel Bird's Eye View (BEV) alignment approach for LiDAR-camera calibration. Our method unifies LiDAR and camera data into a shared 3D spatial representation, enabling accurate and robust cross-modal calibration. CalibBEV extracts sensor-wise BEV features from each modality using domain-specific architectures and estimates the calibration matrix through a two-step alignment process. First, we perform an implicit alignment by regressing a coarse calibration matrix directly from the BEV features. To ease this alignment, we enforce semantic consistency between BEV representations across modalities using a contrastive loss inspired by CLIP, guiding both networks toward a unified feature space. In the second step, we leverage our BEV formulation to explicitly align the features of one modality with the other, refining the initial coarse estimate into a final, more accurate calibration matrix. CalibBEV significantly outperforms prior point-to-pixel matching methods, achieving state-of-the-art calibration accuracy. On the KITTI and nuScenes benchmarks, our method reduces the Relative Rotation Error (RRE) by 51% and 68%, and the Relative Translation Error (RTE) by 80% and 91%, respectively, compared to previous methods.
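
A CLIP-inspired contrastive loss of the kind mentioned above is commonly written as a symmetric cross-entropy over a similarity matrix between paired embeddings. The sketch below is a generic pure-Python version of that idea, not the authors' implementation: the function names, the pooling of BEV features into one vector per sample, and the temperature value are all assumptions.

```python
import math


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean of -log softmax(row)[i] over rows, i.e. row i should match column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(lidar_feats, camera_feats, temperature=0.07):
    """CLIP-style loss: paired LiDAR/camera embeddings attract, unpaired ones repel."""
    lidar = [_normalize(v) for v in lidar_feats]
    camera = [_normalize(v) for v in camera_feats]
    logits = [
        [sum(a * b for a, b in zip(u, v)) / temperature for v in camera]
        for u in lidar
    ]
    logits_t = [list(col) for col in zip(*logits)]  # camera-to-LiDAR direction
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))
```

When the i-th LiDAR embedding is most similar to the i-th camera embedding the loss approaches zero, and it grows when pairs are mismatched, which is what pushes the two encoders toward a shared feature space.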

7

X-JEPA: A Novel Joint Learning Cross-Modal Predictive Alignment Framework for Remote Sensing Image Retrieval

Shabnam Choudhury ⋅ Yash Salunkhe ⋅ Vaibhav Rajan ⋅ Subhasis Chaudhuri ⋅ Biplab Banerjee

The growing scale and heterogeneity of remote sensing (RS) imagery demand robust, scalable frameworks for content-based image retrieval across sensor modalities. We introduce X-JEPA, a novel predictive self-supervised architecture explicitly designed for cross-modal remote sensing image retrieval (RS-CMIR), and the first to extend joint embedding predictive paradigms beyond unimodal domains. Unlike prior contrastive or reconstruction-based methods, X-JEPA formulates representation learning as a latent forecasting task: predicting the semantic embedding of a target modality given context from another. To enforce modality-invariant alignment, we propose a geometry-aware Prediction Space Alignment (PSA) loss, which captures the structure of the latent space without requiring pixel-level reconstruction or modality pairing. We evaluate X-JEPA on two large-scale benchmarks, BEN-14K (Sentinel-1/Sentinel-2) and fMoW (RGB/Sentinel), across both unimodal and cross-modal retrieval tasks. X-JEPA consistently outperforms state-of-the-art self-supervised baselines, including MAE, SatMAE, CrossMAE, CSMAE-SESD, CROMA, SkySense, DeCUR and REJEPA, achieving up to 11.0% F1-score improvement in cross-modal retrieval and 9.8% in unimodal settings. Despite its high retrieval accuracy, the model remains lightweight, requiring fewer parameters and yielding 8–10% F1-score gains on average, establishing a new state-of-the-art for scalable, sensor-agnostic RS-CMIR.

8

SSMRadNet: A Sample-wise State-Space Framework for Efficient and Ultra-Light Radar Segmentation and Object Detection

Anuvab Sen ⋅ Mir Sayeed Mohammad ⋅ Saibal Mukhopadhyay

We introduce SSMRadNet, the first multi-scale State Space Model (SSM) based detector for Frequency Modulated Continuous Wave (FMCW) radar that sequentially processes raw ADC samples through two SSMs. One SSM learns a chirp-wise feature by sequentially processing samples from all receiver channels within one chirp, and a second SSM learns a representation of a frame by sequentially processing chirp-wise features. The latent representations of a radar frame are decoded to perform segmentation and detection tasks. Comprehensive evaluations on the RADIal dataset show SSMRadNet has 10–33× fewer parameters and 60–88× less computation (GFLOPs) while being 3.7× faster than state-of-the-art transformer and convolution-based radar detectors at competitive performance for segmentation tasks.
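
The two-level processing described above (one recurrence over samples within a chirp, a second over chirp-wise features within a frame) can be illustrated with a toy linear state-space recurrence. This is a deliberately simplified sketch: it uses a scalar state, fixed hand-picked weights, and hypothetical function names, whereas a real SSM-based model would use learned, higher-dimensional state matrices.

```python
def run_ssm(inputs, decay, input_weights):
    """Toy linear recurrence h_t = decay * h_{t-1} + <input_weights, x_t>.

    Returns the final state, used here as the summary feature of the sequence.
    """
    state = 0.0
    for x in inputs:
        state = decay * state + sum(w * xi for w, xi in zip(input_weights, x))
    return state


def encode_frame(frame, chirp_decay=0.5, frame_decay=0.5, channel_weights=(1.0, 1.0)):
    """frame: list of chirps; chirp: list of samples; sample: per-channel ADC values."""
    # First level: one feature per chirp from its raw samples.
    chirp_features = [run_ssm(chirp, chirp_decay, channel_weights) for chirp in frame]
    # Second level: one frame representation from the chirp-wise features.
    return run_ssm([(f,) for f in chirp_features], frame_decay, (1.0,))


# Two chirps, two samples per chirp, two receiver channels per sample.
toy_frame = [
    [(1.0, 0.0), (0.0, 1.0)],
    [(2.0, 0.0), (0.0, 0.0)],
]
frame_feature = encode_frame(toy_frame)
```

Because each level only carries a running state forward, memory does not grow with the number of samples or chirps, which is the property that makes sample-wise sequential processing cheap.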

9

Rank-based Geographical Regularization: Revisiting Contrastive Self-Supervised Learning for Multispectral Remote Sensing Imagery

Tom Burgert ⋅ Leonard Hackel ⋅ Paolo Rota ⋅ Begüm Demir

Self-supervised learning (SSL) has become a powerful paradigm for learning from large, unlabeled datasets, particularly in computer vision (CV). However, applying SSL to multispectral remote sensing (RS) images presents unique challenges and opportunities due to the geographical and temporal variability of the data. In this paper, we introduce GeoRank, a novel regularization method for contrastive SSL that improves upon prior techniques by directly optimizing spherical distances to embed geographical relationships into the learned feature space. GeoRank outperforms or matches prior methods that integrate geographical metadata and consistently improves diverse contrastive SSL algorithms (e.g., BYOL, DINO). Beyond this, we present a systematic investigation of key adaptations of contrastive SSL for multispectral RS images, including the effectiveness of data augmentations, the impact of dataset cardinality and image size on performance, and the task dependency of temporal views. All code and models will be made publicly available upon acceptance.
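
The spherical distance that such a regularizer builds on is the great-circle distance between image locations. The sketch below computes it with the haversine formula and ranks candidate locations by distance to an anchor. It illustrates only the geographic ranking signal, not the GeoRank loss itself, and the function names and the mean Earth radius of 6371 km are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def rank_by_distance(anchor, candidates):
    """Indices of candidates sorted from geographically nearest to farthest."""
    distances = [haversine_km(anchor[0], anchor[1], lat, lon) for lat, lon in candidates]
    return sorted(range(len(candidates)), key=lambda i: distances[i])


berlin = (52.52, 13.405)
tokyo, potsdam, paris = (35.6762, 139.6503), (52.39, 13.06), (48.8566, 2.3522)
order = rank_by_distance(berlin, [tokyo, potsdam, paris])
```

Here `order` lists Potsdam first, then Paris, then Tokyo. A rank-based regularizer can then encourage distances in the learned feature space to follow the same ordering.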

10

OMeGa: Joint Optimization of Explicit Meshes and Gaussian Splats for Robust Scene-Level Surface Reconstruction

Yuhang Cao ⋅ Haojun Yan ⋅ Danya Yao

Neural rendering algorithms for Novel View Synthesis and Scene Reconstruction tasks have recently received much attention with the advancement of 3D Gaussian Splatting. For mesh reconstruction, most existing works over-fit a Gaussian Splatting model with multi-view images and obtain the triangle mesh from the model using post-optimization extraction strategies. However, such methods exhibit the following limitations: 1) Gaussian Splats often yield inaccurate geometry in indoor scene reconstructions, particularly in texture-less regions, leading to suboptimal triangle mesh quality; 2) the mesh extraction is entirely decoupled from the optimization process, neglecting the potential of using mesh geometry as constraints to guide the optimization of Gaussian Splats. To address these challenges, this paper introduces a novel end-to-end differentiable framework for both rendering and geometry reconstruction tasks. Our key contribution involves jointly optimizing 2D splats and an explicit 3D mesh representation through a flexible binding strategy during the training process. This allows our approach to effectively leverage mesh geometry constraints to guide the optimization of 2D splats while preserving sufficient flexibility, resulting in both accurate alignment with scene surfaces and expressive texture representation. Furthermore, as another core component of our method, we design an iterative mesh refinement technique, including a novel gradient-based subdivision strategy and a mesh face removal strategy, to further improve the detail and accuracy of the reconstructed mesh. Extensive experiments show that our joint-representation framework achieves overall state-of-the-art performance on challenging benchmarks, effectively addressing prior limitations associated with indoor scene reconstruction.

11

Confidence Through Parallel Attention for Depth and Uncertainty Estimation in Dynamic Environments

Onkar Susladkar ⋅ Rohit Pawar ⋅ Chirag Sehgal ⋅ Samaksh Ujjawal ⋅ Sparsh Mittal

Monocular depth estimation is crucial for robotics, offering a lightweight and scalable alternative to stereo or LiDAR-based systems. While recent methods have achieved high accuracy, their efficacy degrades under real-world conditions such as occlusion, texture ambiguity, and domain shifts. We introduce ConFiDeNet, a unified framework that jointly predicts metric depth and associated uncertainty, enabling risk-aware robotic perception. ConFiDeNet employs a lightweight parallel attention module that efficiently fuses semantic cues from DINOv2 dense descriptors and SAM2-based segmentation for densely occluded objects, enhancing structural understanding without sacrificing real-time performance. Furthermore, we explicitly condition the model on environment type, improving generalization across diverse indoor and outdoor scenes without retraining. Our method achieves state-of-the-art results across six datasets under both supervised and zero-shot settings, outperforming nine prior techniques, including Marigold, ZoeDepth, PatchFusion, and MonoProb. With significantly faster inference and high prediction confidence, ConFiDeNet is readily deployable for embodied AI, self-driving applications, and robotic manipulation tasks.

12

BiNAR: A Bi-Modal Framework for Non-Aligned RGB-IR 3D Reconstruction via Gaussian Splatting

Zhongwen Wang ⋅ Han Ling ⋅ Weihao Zhang ⋅ Yinghui Sun ⋅ Quansen Sun

Existing RGB-IR (infrared) bi-modal 3D reconstruction methods generally have difficulty in simultaneously processing non-aligned multi-modal data with significant differences in resolution and spectral characteristics and achieving high-precision pixel-level reconstruction. Non-aligned RGB-IR 3D reconstruction and rendering represents a new domain. To this end, we propose BiNAR, a bi-modal framework that can directly process non-aligned data collected by conventional RGB and IR cameras and generate high-resolution, pixel-level aligned renderings. BiNAR first uses cross-modal multi-camera joint calibration to accurately estimate the internal and external parameters of the RGB-IR camera and unify the coordinate system; then, it fuses the features of different modalities in the Unified Gaussian Field and jointly optimizes the Gaussians to achieve cross-modal consistent 3D scene expression. Experimental results show that BiNAR significantly outperforms traditional single-modal and bi-modal Gaussian splatting methods in rendering quality, achieving a sub-pixel average reprojection error of 0.242 px and improving IR PSNR by 12.22 dB. We also build a pixel-level aligned RGB-IR dataset covering a variety of indoor and outdoor scenes and including real temperature data, providing a reliable benchmark for subsequent multi-modal research. The code and dataset will be available.

13

Spec-Gloss Surfels and Normal–Diffuse Priors for Relightable Glossy Objects

Georgios Kouros ⋅ Minye Wu ⋅ Tinne Tuytelaars

Accurate reconstruction and relighting of glossy objects remain a longstanding challenge, as object shape, material properties, and illumination are inherently difficult to disentangle. Existing neural rendering approaches often rely on simplified BRDF models or parameterizations that couple diffuse and specular components, which restricts faithful material recovery and limits relighting fidelity. We propose a relightable framework that integrates a microfacet BRDF with the specular–glossiness parameterization into 2D Gaussian Splatting with deferred shading. This formulation enables more physically consistent material decomposition, while diffusion-based priors for surface normals and diffuse color guide early-stage optimization and mitigate ambiguity. Furthermore, a coarse-to-fine optimization of the environment map accelerates convergence and preserves high-dynamic-range specular reflections. Extensive experiments on complex glossy scenes demonstrate that our method achieves high-quality geometry and material reconstruction and delivers substantially more realistic and consistent relighting under novel illumination compared to existing Gaussian splatting methods. The code will be released upon acceptance.

14

Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning

Lintao XU ⋅ Yinghao WANG ⋅ Chaohui Wang

Occlusion Boundary Estimation (OBE) identifies boundaries arising from both inter-object occlusions and self-occlusion within individual objects, distinguishing them from ordinary edges and semantic contours to support more accurate scene understanding. This task is closely related to Monocular Depth Estimation (MDE), which infers depth from a single image, as Occlusion Boundaries (OBs) provide critical geometric cues for resolving depth ambiguities, while depth can conversely refine occlusion reasoning. In this paper, we propose MoDOT, a novel method that jointly estimates depth and OBs from a single image for the first time. MoDOT incorporates a new module, CASM, which combines cross-attention and multi-scale strip convolutions to leverage mid-level OB features for improved depth prediction. It also includes an occlusion-aware loss, OBDCL, which encourages more accurate boundaries in the predicted depth map. Extensive experiments demonstrate the mutual benefits of jointly estimating depth and OBs, and validate the effectiveness of MoDOT's design. Our method achieves state-of-the-art (SOTA) performance on two synthetic datasets and the widely used NYUD-v2 real-world dataset, significantly outperforming multi-task baselines. Furthermore, when trained solely on our synthetic dataset, MoDOT yields promising cross-domain results on real-world depth prediction, preserving sharp OBs in the predicted depth maps and demonstrating improved geometric fidelity compared to competitors. We will release our code, pre-trained models, and dataset at [link].

15

Ego-EXTRA: video-language Egocentric Dataset for EXpert-TRAinee assistance

Francesco Ragusa ⋅ Michele Mazzamuto ⋅ Rosario Forte ⋅ Irene D'Ambra ⋅ James Fort ⋅ Jakob Engel ⋅ Antonino Furnari ⋅ Giovanni Farinella

We present Ego-EXTRA, a video-language Egocentric Dataset for EXpert-TRAinee assistance. Ego-EXTRA features 50 hours of unscripted egocentric videos of subjects performing procedural activities (the trainees) while guided by real-world experts who provide guidance and answer specific questions using natural language. Following a "Wizard of Oz" data collection paradigm, the expert enacts a wearable intelligent assistant, looking at the activities performed by the trainee exclusively from their egocentric point of view, answering questions when asked by the trainee, or proactively interacting with suggestions during the procedures. This unique data collection protocol enables Ego-EXTRA to capture a high-quality dialogue in which expert-level feedback is provided to the trainee. Two-way dialogues between experts and trainees are recorded, transcribed, and used to create a novel benchmark comprising more than 45k high-quality Visual Question Answer sets, which we use to evaluate Multimodal Large Language Models. The results show that Ego-EXTRA is challenging and highlights the limitations of current models when used to provide expert-level assistance to the user. The Ego-EXTRA dataset will be publicly shared. We believe that Ego-EXTRA will support the benchmarking of egocentric video-language assistants.

16

Similarity-aware Probabilistic Embeddings Modeling for Video-Text Retrieval

Yuliang Huang ⋅ Pengxu Wei ⋅ Zhicheng Dong ⋅ Liang Lin

Video-text retrieval is a fundamental task in multi-modal learning, aiming to accurately retrieve videos that match given textual descriptions. While recent contrastive methods have made significant progress by embedding videos and texts into a joint space, they often suffer from semantic over-clustering—a phenomenon where semantically distinct videos are mapped to overly similar embeddings due to dominant but uninformative visual patterns (e.g., recurring backgrounds or common objects). This effect becomes particularly problematic under short or ambiguous queries, where it suppresses fine-grained semantics and degrades retrieval precision. To address this, we propose Similarity-aware Probabilistic Embeddings Modeling (SPEM), a novel framework that refines video representations by modeling them as adaptive probability distributions rather than static vectors. SPEM incorporates cross-modal attention to highlight text-relevant visual content and suppress irrelevant patterns, and leverages multi-level similarity features to dynamically adjust the embedding variance, thereby preserving subtle but critical semantic cues. To further improve alignment, we employ a Semantic-Distribution Contrastive Loss to optimize the alignment structure in the probabilistic space, encouraging more discriminative separation across hard negatives. Extensive experiments on five widely-used video-text retrieval benchmarks—MSRVTT, DiDeMo, VATEX, MSVD, and Charades—demonstrate that SPEM consistently outperforms strong CLIP-based baselines.

17

PromptGAR: Flexible Promptive Group Activity Recognition

Zhangyu Jin ⋅ Andrew Feng ⋅ Ankur Chemburkar ⋅ Celso de Melo

We present PromptGAR, a novel framework for Group Activity Recognition (GAR) that offers both input flexibility and high recognition accuracy. Existing approaches suffer from limited real-world applicability due to their reliance on full prompt annotations, a fixed number of frames and instances, and a lack of actor consistency. To bridge this gap, we propose PromptGAR, the first GAR model to provide input flexibility across prompts, frames, and instances without the need for retraining. We leverage diverse visual prompts, such as bounding boxes, skeletal keypoints, and instance identities, by unifying them as point prompts. A recognition decoder then cross-updates class and prompt tokens for enhanced performance. To ensure actor consistency over extended activity durations, we also introduce a relative instance attention mechanism that directly encodes instance identities. Comprehensive evaluations demonstrate that PromptGAR achieves competitive performance on both full and partial prompt inputs, establishing its input flexibility and generalization ability for real-world applications.

18

Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

Zitian Tang ⋅ Rohan Krishnan ⋅ Zhiqiu Yu ⋅ Chen Sun

Learning from (procedural) videos has increasingly served as a pathway for embodied agents to acquire skills from human demonstrations. To do this, video understanding models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize the understandings to novel environments, tasks, and problem domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) video question answering, over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to: (1) generalize to novel domains; (2) utilize long temporal context and multimodal (e.g., visual and speech) information. Our extensive experimental analysis highlights the challenges of Spacewalk-18, but also suggests best practices for domain generalization and long-form understanding. Notably, we discover a promising adaptation-via-summarization technique that leads to significant performance improvement without model fine-tuning.

19

Broadcast2Pitch: Game State Reconstruction from Unconstrained Soccer Videos

Yin May Oo ⋅ Yewon Hwang ⋅ Muhammad Robbani ⋅ VANYI CHAO ⋅ Ankhzaya Jamsrandorj ⋅ Hoang Nguyen ⋅ Kyung-Ryoul Mun ⋅ Jinwook Kim

Game State Reconstruction (GSR) aims to reconstruct the 2D positions and identities of all athletes from broadcast soccer videos, requiring robust tracking, localization, and identity association under dynamic and unconstrained camera motions. We propose a modular GSR framework that integrates a multi-task keypoint and line detection model with an optimization-based homography estimation module. This approach leverages dense geometric cues from lines, circles, and keypoints to achieve robust spatial localization on a frame-by-frame basis, providing reliable alignment in diverse broadcast scenarios. To address identity consistency, we use appearance-based re-identification and a vision-language-guided tracklet refinement strategy to reduce ID switches and enforce temporal coherence. Comprehensive ablation studies validate the contribution of each component, and our framework achieves state-of-the-art performance on the SoccerNet-GSR benchmark, outperforming existing baselines by a significant margin. The proposed framework demonstrates strong robustness, generalization across scenes, and practical utility for structured game understanding in real-world broadcast sports analytics.
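
Once a homography between the broadcast image and the pitch plane has been estimated, localizing an athlete reduces to projecting an image point (for example the foot position) through that 3×3 matrix. The sketch below shows only this projection step, with a hypothetical function name and made-up matrices; it does not cover how the homography is estimated from lines, circles, and keypoints.

```python
def project_to_pitch(homography, pixel):
    """Map an image point (u, v) to pitch-plane coordinates via a 3x3 homography."""
    u, v = pixel
    x = homography[0][0] * u + homography[0][1] * v + homography[0][2]
    y = homography[1][0] * u + homography[1][1] * v + homography[1][2]
    w = homography[2][0] * u + homography[2][1] * v + homography[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to 2D.
    return (x / w, y / w)


# Illustrative matrix only: scales pixels to metres and shifts the origin.
H_example = [
    [0.1, 0.0, -5.0],
    [0.0, 0.1, -3.0],
    [0.0, 0.0, 1.0],
]
pitch_xy = project_to_pitch(H_example, (100.0, 50.0))
```

Because the camera moves freely in broadcast footage, a matrix like `H_example` would have to be re-estimated for every frame, which is why per-frame robustness of the estimation matters.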

20

VLMs Guided Interpretable Decision Making in Autonomous Driving

Xin Hu ⋅ TAOTAO JING ⋅ Renran Tian ⋅ Zhengming Ding

Recent advancements in autonomous driving have explored the use of vision-language models (VLMs) within visual question answering (VQA) frameworks for direct driving decision-making. However, these approaches often depend on handcrafted prompts and suffer from inconsistent performance, limiting their robustness and generalization in real-world scenarios. In this work, we evaluate state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs and identify critical limitations in their ability to deliver reliable, context-aware decisions. Motivated by these observations, we propose a novel approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Specifically, we leverage their strong general scene understanding to enrich existing vision-based benchmarks with structured, linguistically rich scene descriptions. Building on this enriched representation, we introduce a multi-modal interactive architecture that fuses visual and linguistic features for more accurate decision-making and interpretable textual explanations. Furthermore, we design a post-hoc refinement module that utilizes VLMs to enhance prediction reliability. Extensive experiments on two autonomous driving benchmarks demonstrate that our approach achieves state-of-the-art performance, offering a promising direction for integrating VLMs into reliable and interpretable autonomous driving systems.

21

DARB-Splatting: Generalizing Splatting with Decaying Anisotropic Radial Basis Functions

Hashiru Pramuditha ⋅ Vinasirajan Viruthshaan ⋅ Vishagar Arunan ⋅ Saeedha Nazar ⋅ Sameera Ramasinghe ⋅ Simon Lucey ⋅ Ranga Rodrigo

Splatting-based 3D reconstruction methods have gained popularity with the advent of 3D Gaussian Splatting, efficiently synthesizing high-quality novel views. These methods commonly resort to using exponential family functions, such as the Gaussian function, as reconstruction kernels due to their anisotropic nature, ease of projection, and differentiability in rasterization. However, the field remains restricted to variations within the exponential family, leaving generalized reconstruction kernels largely underexplored, partly due to the lack of easy integrability in 3D to 2D projections. In this light, we show that a class of decaying anisotropic radial basis functions (DARBFs), which are non-negative functions of the Mahalanobis distance, supports splatting by approximating the Gaussian function's closed-form integration advantage. With this fresh perspective, we demonstrate varying performances across selected DARB reconstruction kernels, achieving comparable training convergence and memory footprints, with on-par PSNR, SSIM, and LPIPS results.
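
The defining property above, a non-negative kernel that depends on position only through the Mahalanobis distance, can be made concrete in 2D. In the sketch below the Gaussian is the familiar baseline, while the inverse-quadric and raised-cosine kernels are illustrative choices made for this example and are not claimed to be the kernels evaluated in the paper; all function names are hypothetical.

```python
import math


def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance in 2D for a 2x2 covariance ((a, b), (c, d))."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form with the closed-form 2x2 inverse: (1/det) * [[d, -b], [-c, a]].
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def gaussian_kernel(d2):
    return math.exp(-0.5 * d2)


def inverse_quadric_kernel(d2):
    return 1.0 / (1.0 + d2)


def raised_cosine_kernel(d2, support=3.0):
    """Compactly supported: decays to exactly zero at Mahalanobis distance `support`."""
    dist = math.sqrt(d2)
    if dist >= support:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * dist / support))


# Anisotropic footprint: standard deviation 2 along x, 1 along y.
cov = ((4.0, 0.0), (0.0, 1.0))
d2_along_x = mahalanobis_sq((2.0, 0.0), (0.0, 0.0), cov)
d2_along_y = mahalanobis_sq((0.0, 1.0), (0.0, 0.0), cov)
```

Both sample points lie one standard deviation from the mean, so they receive the same weight under any of the three kernels; the anisotropy lives entirely in the covariance, and the kernel only controls how the weight decays.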

22

Surgical Gaussian Surfels: Highly Accurate Real-time Surgical Scene Rendering using Gaussian Surfels

Idris Sunmola ⋅ Zhenjun Zhao ⋅ Samuel Schmidgall ⋅ Yumeng Wang ⋅ Paul Maria Scheikl ⋅ Viet Pham ⋅ Axel Krieger

Accurate geometric reconstruction of deformable tissues in monocular endoscopic video remains a fundamental challenge in robot-assisted minimally invasive surgery. Although recent volumetric and point primitive methods based on neural radiance fields (NeRF) and 3D Gaussian primitives have efficiently rendered surgical scenes, they still struggle with handling artifact-free tool occlusions and preserving fine anatomical details. These limitations stem from unrestricted Gaussian scaling and insufficient surface alignment constraints during reconstruction. To address these issues, we introduce Surgical Gaussian Surfels (SGS), which transform anisotropic point primitives into surface-aligned elliptical splats by constraining the scale component of the Gaussian covariance matrix along the view-aligned axis. We also introduce the Fully Fused Deformation Multilayer Perceptron (FFD-MLP), a lightweight Multi-Layer Perceptron (MLP) that predicts accurate surfel motion fields up to 5× faster than a standard MLP. This is coupled with locality constraints to handle complex tissue deformations. We use homodirectional view-space positional gradients to capture fine image details by splitting Gaussian Surfels in over-reconstructed regions. In addition, we define surface normals as the direction of the steepest density change within each Gaussian surfel primitive, enabling accurate normal estimation without requiring monocular normal priors. We evaluate our method on two in-vivo surgical datasets, where it outperforms current state-of-the-art methods in surface geometry, normal map quality, and rendering efficiency, while remaining competitive in real-time rendering performance.

23

Gated Temporal Fusion Transformers for Robust Multi-Object Tracking

Jinho Kim ⋅ Kuk-Jin Yoon

Multiple Object Tracking (MOT) in dynamic and densely populated scenes presents significant challenges due to frequent occlusions, erratic object motion, and identity switches. While recent Transformer-based approaches have successfully leveraged global attention for object detection, most rely on temporal reasoning at the decoder level, leaving encoder-stage modeling underexplored. In this work, we propose an encoder-level temporal reasoning Transformer framework that embeds historical object trajectory information into the encoder stage via a tracklet memory. The encoder module, enhanced by Attention-by-Tracking, enriches visual features with temporal priors, while the decoder leverages Tracking-by-Attention to guide identity association using learned tracklet representations. To further improve temporal consistency and object localization, we introduce a gating-based temporal feature fusion mechanism that adaptively integrates multi-frame features based on cosine similarity. Additionally, we refine reference points in the encoder using temporal cues and incorporate learnable positional embeddings to enhance detection accuracy in cluttered environments. Our method is model-agnostic and can be applied to various Transformer-based MOT frameworks. When integrated into existing models such as TransTrack, MeMOTR, and MOTIP, it yields consistent performance improvements. Extensive experiments on DanceTrack and SportsMOT benchmarks demonstrate that our approach achieves superior tracking performance, including a HOTA score of 76.4 on SportsMOT. These results validate the effectiveness of encoder-level temporal integration and adaptive feature fusion for robust multi-object tracking in real-world scenarios.
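
The gating idea in the abstract above (adaptively integrating multi-frame features based on cosine similarity) can be illustrated with a minimal sketch. The abstract does not give the gate's form, so the sigmoid gate, the `sharpness` parameter, and the function names here are assumptions made for illustration only, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as carrying no similarity evidence
    return dot / (norm_a * norm_b)


def gated_fusion(current, previous, sharpness=4.0):
    """Blend a previous-frame feature into the current-frame feature.

    Assumed gate (hypothetical): sigmoid(sharpness * cosine similarity).
    A similar previous frame gets a gate near 1, a dissimilar one near 0.
    """
    sim = cosine_similarity(current, previous)
    gate = 1.0 / (1.0 + math.exp(-sharpness * sim))
    # Convex combination: gate = 0 keeps the current feature unchanged.
    fused = [c + gate * (p - c) for c, p in zip(current, previous)]
    return fused, gate
```

With this choice, a previous-frame feature pointing in the opposite direction (for example after an occlusion) is almost ignored, while a consistent one is averaged in.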

24

SIAM: Synchronous Interaction Attention for Human Mesh Recovery

Niaz Ahmad ⋅ Saif Ullah ⋅ Youngmoon Lee ⋅ Guanghui Wang

Conventional 3D body mesh reconstruction methods often use decoupling strategies that isolate individual features for separate representation, lacking relational cues among entities. In this paper, we propose SIAM, a novel Synchronous Interaction Attention for Human Mesh Recovery. Our framework builds upon a high-resolution multi-branch backbone (HRNet) and introduces two key components. The first is Synchronous Interaction Attention (SIA), which explicitly models spatial relational cues among multiple human instances in live scenes. The second is Feature Decomposition (FD), which extracts enriched instance-specific features by leveraging the attributes captured by the SIA module. This integrated approach significantly enhances spatial reasoning, mitigates error accumulation, and results in more accurate 3D human mesh reconstruction. SIAM achieves state-of-the-art performance on several benchmarks, including 3DPW, 3DPW-OCC, AGORA, and CMU-Panoptic for 3D human mesh reconstruction. Notably, our model runs at 25 frames per second on video streams, highlighting its potential for real-time applications. The source code will be released publicly.

25

SynchroRaMa: Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding

Phyo Thet Yee ⋅ Dimitrios Kollias ⋅ Sudeepta Mishra ⋅ Abhinav Dhall

Audio-driven talking face generation has received growing interest, particularly for applications requiring expressive and natural human-avatar interaction. However, most existing emotion-aware methods rely on a single modality (either audio or image) for emotion embedding, limiting their ability to capture nuanced affective cues. Additionally, most methods condition on a single reference image, restricting the model’s ability to represent dynamic changes in actions or attributes across time. To address these issues, we introduce SynchroRaMa, a novel framework that integrates a multi-modal emotion embedding by combining emotional signals from text (via sentiment analysis) and audio (via speech-based emotion recognition and audio-derived valence-arousal features), enabling the generation of talking face videos with richer and more authentic emotional expressiveness and fidelity. To ensure natural head motion and accurate lip synchronization, SynchroRaMa includes an audio-to-motion (A2M) module that generates motion frames aligned with the input audio. Finally, SynchroRaMa incorporates scene descriptions generated by Large Language Model (LLM) as additional textual input, enabling it to capture dynamic actions and high-level semantic attributes. Conditioning the model on both visual and textual cues enhances temporal consistency and visual realism. Quantitative and qualitative experiments on benchmark datasets demonstrate that SynchroRaMa outperforms the state-of-the-art, achieving improvements in image quality, expression preservation, and motion realism. A user study further confirms that SynchroRaMa achieves higher subjective ratings than competing methods in overall naturalness, motion diversity, and video smoothness. Code and model weights will be released upon acceptance.

26

Are All Marine Species Created Equal? Performance Disparities in Underwater Object Detection

Melanie Wille ⋅ Tobias Fischer ⋅ Scarlett Raine

Underwater object detection is critical for monitoring marine ecosystems but poses unique challenges, including degraded image quality, imbalanced class distribution, and distinct visual characteristics. Not every species is detected equally well, yet underlying causes remain unclear. We address two key research questions: 1) What factors beyond data quantity drive class-specific performance disparities? 2) How can we systematically improve detection of under-performing marine species? We manipulate the DUO dataset to separate the object detection task into localization and classification and investigate the under-performance of the scallop class. Localization analysis using YOLO11 and TIDE finds that foreground-background discrimination is the most problematic stage regardless of data quantity. Classification experiments reveal persistent precision gaps even with balanced data, indicating intrinsic feature-based challenges beyond data scarcity and inter-class dependencies. We recommend imbalanced distributions when prioritizing precision, and balanced distributions when prioritizing recall. Improving under-performing classes should focus on algorithmic advances, especially within localization modules. We publicly release our code and datasets.

27

LVM-Lite: Training Large Vision Models with Efficient Sequential Modeling

Xianhang Li ⋅ Hongru Zhu ⋅ Sucheng Ren ⋅ Linjie Yang ⋅ Peng Wang ⋅ Heng Wang ⋅ Xiaohui Shen ⋅ Qing Liu ⋅ Cihang Xie

Generative pre-training has significantly advanced natural language understanding. Building upon this success, recent research has begun to develop Large Vision Models (LVMs) by leveraging large-scale pre-training on visual sequences, where jointly modeling image token sequences within single images and across a set of images is of key importance. This paper shows that sequential modeling on single images and across multiple images can be efficiently and effectively decoupled. We introduce a two-stage learning pipeline, starting with single-image pre-training, followed by fine-tuning on long image/video sequences. We term this method Large Vision Model Lite (LVM-Lite). Extensive experiments showcase the impressive performance of LVM-Lite across various generative and discriminative benchmarks, comparable to specifically trained models without the need for task-specific training. Importantly, LVM-Lite accelerates training by up to $2.7\times$ and demonstrates strong scalability.

28

HiGlassRM: Learning to Remove High-prescription Glasses via Synthetic Dataset Generation

Sebin Lee ⋅ Heewon Kim

Existing eyeglass removal methods can handle frames and shadows but fail to correct lens-induced geometric distortions, as public datasets lack the necessary supervision. To address this, we introduce the HiGlass Dataset, the first large-scale synthetic dataset providing explicit flow-based supervision for refractive warping. We also propose HiGlassRM, a novel pipeline whose core is a network that explicitly estimates a displacement flowmap to de-warp distorted facial geometry. Experiments on both synthetic and real images show that this flowmap-centric approach, trained on our data, significantly improves identity preservation and perceptual quality over existing methods. Our work demonstrates that explicitly modeling and correcting geometric distortion via flowmap estimation, enabled by targeted supervision, is key to faithful eyeglass removal.

29

Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups

Leif V Holland ⋅ Domenic Zingsheim ⋅ Mana Takhsha ⋅ Hannah Dröge ⋅ Patrick Stotko ⋅ Markus Plack ⋅ Reinhard Klein

High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views – often due to real-time constraints – leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for the hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures using a novel, application-targeted inpainting method independent of the underlying representation as an image-based post-processing step after the novel view rendering. The method is designed as a standalone module compatible with any calibrated multi-camera system. For this we introduce a multi-view aware, transformer-based network architecture using spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, allowing real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image and video-based metrics.

30

SeaClips: A Video Dataset for Maritime Object Detection

Franziska Denk ⋅ Christian Rankl ⋅ Shaban ALMOUAHED ⋅ David Moser ⋅ Robert Sablatnig

Maritime computer vision is a requirement for autonomous surface vehicles and can improve maritime safety if a high level of robustness is achieved. As deep learning dominates the computer vision community, domain-specific datasets are required to obtain well-generalizing and reliable models. However, maritime datasets, especially those containing videos and temporally dense annotations, are still small compared to other domains, such as autonomous driving or generic computer vision datasets. This paper introduces MaritimeClips, a new maritime video dataset containing 74 videos with an average duration of 14 seconds and with 31k frames in total. Videos were recorded under varying conditions, with three cameras mounted on shore and on boats. MaritimeClips provides frame-by-frame annotations, encompassing 129k bounding boxes of seven categories, containing vessel and non-vessel classes. MaritimeClips contributes to a broader coverage of maritime scenarios and, ultimately, more robust computer vision models. Baseline results on the dataset are established by evaluating six image-based models and three models using temporal context, ranging from lightweight YOLO-based to heavy transformer architectures. It is found that the different scales and shapes at which objects appear in MaritimeClips pose a challenge to state-of-the-art detectors. MaritimeClips is made accessible for research on maritime obstacle detection upon paper acceptance.

31

UniDiff: Parameter-Efficient Adaptation of Diffusion Models for Land Cover Classification with Multi-Modal Remotely Sensed Imagery and Sparse Annotations

Yuzhen Hu ⋅ Saurabh Prasad

Sparse annotations fundamentally constrain multimodal remote sensing: even recent state-of-the-art supervised methods such as MSFMamba are limited by the availability of labeled data, restricting their practical deployment despite architectural advances. ImageNet-pretrained models provide rich visual representations, but adapting them to heterogeneous modalities such as hyperspectral imaging (HSI) and synthetic aperture radar (SAR) without large labeled datasets remains challenging. We propose UniDiff, a parameter-efficient framework that adapts a single ImageNet-pretrained diffusion model to multiple sensing modalities using only target-domain data. UniDiff combines FiLM-based timestep-modality conditioning, parameter-efficient adaptation of approximately 5\% of parameters, and pseudo-RGB anchoring to preserve pre-trained representations and prevent catastrophic forgetting. This design enables effective feature extraction from remote sensing data under sparse annotations. Our results with two established multi-modal benchmarking datasets demonstrate that unsupervised adaptation of a pre-trained diffusion model effectively mitigates annotation constraints and achieves effective fusion of multi-modal remotely sensed data.

32

CropAT: Leveraging Diffusion-Generated Target-Like Cropped Objects for Pseudo-Label Refinement in Domain-Adaptive Object Detection

Chen-Che Huang ⋅ Tzuhsuan Huang ⋅ Jun-Cheng Chen

Unsupervised domain adaptation for object detection (UDAOD) aims to adapt the source detector to the target domain using labeled source data and unlabeled target data. To mitigate the gap between the two domains, existing methods employ the Mean Teacher (MT) framework for domain adaptation, selecting high-quality pseudo-labels generated by the teacher model to supervise the training of the student model. However, the pseudo-labels generated by the teacher model often contain a high proportion of false positive labels, which can mislead the student model and result in a decline in overall performance. In this paper, we propose a novel data augmentation strategy for domain adaptive object detection, Crop Adaptive Teacher (CropAT), to address this problem and improve model performance. We leverage prompt tuning on an off-the-shelf image editing model to generate target-like images from source data, aiming to reduce the domain gap. Additionally, we insert object crops from these target-like images into the unlabeled target data to increase the correct labels within pseudo-labels, consequently decreasing the proportion of false positive pseudo-labels. Our method outperforms existing approaches across multiple benchmarks. For Cityscapes (source) to Foggy Cityscapes (target) adaptation, CropAT achieves 53.2\% mAP on the target domain, surpassing the baseline method and the previous state-of-the-art (SOTA) by 3.9\% and 0.7\%. For PASCAL VOC (source) to Clipart1k (target) adaptation, CropAT achieves 52.2\% mAP, surpassing the baseline method and the previous SOTA by 6.5\% and 3.1\%.

33

Beyond Faces: A Multimodal Person Clustering for Unconstrained Environments

Sahngmin Yoo ⋅ Sangwon Lee ⋅ Seongin Jo

The increasing demand for on-device AI, driven by privacy concerns and the need for real-time processing, poses new challenges for fundamental computer vision tasks. This paper addresses one such task, person clustering in photo galleries, which has traditionally relied on server-side computation or simplistic on-device models. We introduce the Multimodal Person Clustering Architecture (MPCA), a research framework designed to explore the feasibility of a high-performance, multimodal clustering pipeline operating entirely under mobile constraints. Our framework makes three principal contributions: (1) Multimodal Appearance-Assisted Identity Recovery (MAIR), a late-fusion strategy that leverages temporal consistency to recover identities when facial data is unreliable; (2) Language-Guided Appearance Extractor (LGAE), which adapts a vision-language paradigm to construct robust appearance representations efficiently; and (3) Sequential Graph-Density Clustering (SGDC), a novel algorithm that synergistically combines graph-based and density-based methods to handle the high variance of appearance data. We demonstrate through extensive experiments that our on-device framework achieves an unprecedented 87.97\% average recall, significantly outperforming leading cloud-based commercial systems like Google Photos (77.74\%) and on-device systems like Apple Photos (67.84\%) and Samsung Gallery (83.39\%). This work provides a blueprint for future research in privacy-preserving, efficient, and robust person clustering, highlighting a viable path for deploying next-generation computer vision applications directly on mobile devices.

34

Eye-for-an-eye: Appearance Transfer with Dense Semantic Correspondence in Diffusion Models

Sooyeon Go ⋅ Kyungmook Choi ⋅ Minjung Shin ⋅ Youngjung Uh

As pre-trained text-to-image diffusion models have become a useful tool for image synthesis, people want to specify the results in various ways. This paper tackles training-free appearance transfer, which produces an image with the structure of a target image from the appearance of a reference image. Existing methods usually do not reflect semantic correspondence, as they rely on query-key similarity within the self-attention layer to establish correspondences between images. To this end, we propose explicitly rearranging the features according to the dense semantic correspondences. Extensive experiments show the superiority of our method in various aspects: preserving the structure of the target and reflecting the correct color from the reference, even when the two images are not aligned.

35

Towards Photorealistic Style Transfer with Multimodal Guidance and Robustness to Content Images in Arbitrary Styles

Ruikai Zhou ⋅ Yating Liu ⋅ Yi Xu

Existing photorealistic style transfer methods are broadly categorized into two groups: image-guided and text-guided approaches. The image-guided paradigm requires a reference style image, which is superior when the target style is difficult to define precisely with text. Unfortunately, such references are not always available in practical scenarios. In contrast, the text-guided paradigm offers greater flexibility. However, existing text-guided methods often fail to preserve the details of content images or perform poorly when content images deviate from normal style. In this paper, we present a novel multimodal-guided photorealistic style transfer framework, supporting flexible switching and fusion of both modalities while ensuring robust performance across content images in arbitrary styles. Specifically, we adopt a two-stage pipeline. First, the Style Removal Module removes the original style from the content image. Then, the Style Injection Module applies stylization based on the style guidance (image, text, or their fusion). To make the text-guided branch compatible with this pipeline, we propose the Image-Assisted Textual Style Injection (IATSI) strategy. Additionally, we design a Dual-Residual Adaptive MLP (DRA-MLP), which exhibits strong color mapping capability and avoids spatial distortions. Extensive experiments show that our method achieves state-of-the-art (SOTA) performance in both image-guided and text-guided settings. Moreover, we innovatively implement multimodal fusion-guided photorealistic style transfer, achieving promising results.

36

Hierarchical Adaptive networks with Task vectors for Test-Time Adaptation

Sameer Ambekar ⋅ Marta Hasny ⋅ Laura Daza ⋅ Daniel Lang ⋅ Julia Schnabel

Test-time adaptation allows pretrained models to adjust to incoming data streams, addressing distribution shifts between source and target domains. However, standard methods rely on single-dimensional linear classification layers, which often fail to handle diverse and complex shifts. We propose Hierarchical Adaptive Networks with Task Vectors (Hi-Vec), which leverages multiple layers of increasing size for dynamic test-time adaptation. By decomposing the encoder's representation space into such hierarchically organized layers, Hi-Vec, in a plug-and-play manner, allows existing methods to adapt to shifts of varying complexity. Our contributions are threefold: First, we propose dynamic layer selection for automatic identification of the optimal layer for adaptation to each test batch. Second, we propose a mechanism that merges weights from the dynamic layer to other layers, ensuring all layers receive target information. Third, we propose linear layer agreement that acts as a gating function, preventing erroneous fine-tuning by adaptation on noisy batches. We rigorously evaluate the performance of Hi-Vec in challenging scenarios and on multiple target datasets, proving its strong capability to advance state-of-the-art methods. Our results show that Hi-Vec improves robustness, addresses uncertainty, and handles limited batch sizes and increased outlier rates.

37

Automated Pore Detection from In-Situ FDM 3D Printing Video: A Comparative Evaluation of Modern Segmentation Models

Abdullah Al Ahad Khan ⋅ Md Islam ⋅ Lin Li ⋅ Lai Jiang ⋅ Noushin Ghaffari

In extrusion-based fused deposition modeling (FDM) 3D printing, porosity weakens layer adhesion and compromises the mechanical reliability of printed parts, making it one of the most critical defects. While porosity in additive manufacturing has been studied extensively, the pixel-level segmentation of pores has received little attention. To address this gap, we collected a new dataset based on in-situ video of FDM 3D printing with biofiber-reinforced thermoplastic biopolymers. After manually annotating the frames with polygon-level pore masks, we used the dataset to benchmark four widely used segmentation models (YOLOv8-seg, YOLOv11-seg, Mask R-CNN, and DeepLabV3+) under a consistent training and evaluation protocol. Our results show that YOLOv11-seg achieves the highest segmentation accuracy with a mask mAP@50 of 92.9%, while YOLOv8-seg delivers a comparable accuracy of 92.6% but with the fastest throughput at nearly 60 FPS, making it particularly suited for real-time monitoring. DeepLabV3+ and Mask R-CNN provide useful baselines but lag in either efficiency or stability. This work introduces the first annotated dataset and baseline comparison for segmentation of small, irregular defects in FDM, establishing a benchmark relevant both for additive manufacturing and for broader computer vision research on challenging low-contrast defect segmentation.

38

Safe Vision-Language Models via Unsafe Weights Manipulation

Moreno D'Incà ⋅ Elia Peruzzo ⋅ Xingqian Xu ⋅ Humphrey Shi ⋅ Nicu Sebe ⋅ Massimiliano Mancini

Vision-language models (VLMs) often inherit the biases and unsafe associations present within their large-scale training datasets. While recent approaches mitigate unsafe behaviors, their evaluation focuses on how safe the model is on unsafe inputs, ignoring potential shortcomings on safe ones. In this paper, we first revise safety evaluation by introducing SafeGround, a new set of metrics that evaluate safety at different levels of granularity. With these metrics, we uncover a surprising issue of training-based methods: they make the model less safe on safe inputs. From this finding, we take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM). UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter. Their values are then manipulated via negation. Experiments show that UWM achieves the best tradeoff between safety and knowledge preservation, consistently improving VLMs on unsafe queries while outperforming even training-based state-of-the-art methods on safe ones.
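
The training-free procedure described above (compare activations on safe and unsafe calibration inputs, pick the parameters that matter most for unsafe content, negate them) can be sketched in a few lines. This is a toy illustration under strong assumptions: one scalar weight per unit, an importance score equal to the gap in mean absolute activation, and a hypothetical `top_k` budget. The abstract does not specify the actual scoring rule, so none of these names or choices should be read as the authors' method.

```python
def mean_abs_activation(samples):
    """Per-unit mean of absolute activations over a calibration set."""
    n = len(samples)
    width = len(samples[0])
    return [sum(abs(s[j]) for s in samples) / n for j in range(width)]


def negate_unsafe_weights(weights, safe_acts, unsafe_acts, top_k=1):
    """Negate the weights of the units most specific to unsafe inputs.

    Assumed importance score (hypothetical): mean |activation| on unsafe
    inputs minus mean |activation| on safe inputs.
    """
    safe_mean = mean_abs_activation(safe_acts)
    unsafe_mean = mean_abs_activation(unsafe_acts)
    scores = [u - s for u, s in zip(unsafe_mean, safe_mean)]
    ranked = sorted(range(len(weights)), key=lambda j: scores[j], reverse=True)
    selected = sorted(ranked[:top_k])
    edited = list(weights)  # leave the caller's list untouched
    for j in selected:
        edited[j] = -edited[j]
    return edited, selected
```

Units that respond equally to safe and unsafe inputs receive a score of zero and are left alone, which is the intuition behind preserving behavior on safe inputs.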

39

SOPHY: Generating Simulation-Ready Objects with Physical Materials

Junyi Cao ⋅ Evangelos Kalogerakis

We present SOPHY, a generative model for 3D physics-aware shape synthesis. Unlike existing 3D generative models that focus solely on static geometry or 4D models that produce physics-agnostic animations, our method jointly synthesizes shape, texture, and material properties related to physics-grounded dynamics, making the generated objects ready for simulations and interactive, dynamic environments. To train our model, we introduce a dataset of 3D objects annotated with detailed physical material attributes, along with an efficient pipeline for material annotation. Our method enables applications such as text-driven generation of interactive, physics-aware 3D objects and single-image reconstruction of physically plausible shapes. Furthermore, our experiments show that jointly modeling shape and material properties enhances the realism and fidelity of the generated shapes, improving performance on both generative geometry and physical plausibility.

40

Generalization of Real World Video Deblurring By Image-to-Image Translation

Kassymzhomart Aitbek ⋅ Seungjoon Yang

We address the challenge of generalizing video deblurring models to real-world scenarios, where traditional methods often fail due to a significant domain gap between synthetic and real blur. This work extends the image-to-image translation framework to the more complex domain of video deblurring, introducing a training procedure that effectively bridges this gap. Our method integrates a robust video deblurring backbone with realistic motion priors captured from gimbal-mounted cameras, enabling the model to generalize well across both synthetic and real-world datasets—without requiring paired real-world training data or dataset-specific tuning. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art methods on multiple real-world benchmarks, including datasets never seen during training. Importantly, the generalization capability stems not from the specific architecture, but from the modular training procedure itself, which can be readily applied to other deblurring backbones. This positions our method as a scalable and transferable framework for real-world video deblurring.

41

A Dataset and Framework for Learning State-invariant Object Representations

Rohan Sarkar ⋅ Avinash Kak

We introduce state invariance alongside other common invariances to learn object representations for recognition and retrieval tasks. State invariance refers to robustness against changes in an object’s structural form, such as when an umbrella is folded or a clothing item is tossed on the floor. Humans recognize objects despite such changes, motivating the question of whether neural architectures can achieve similar robustness. To that end, we present ObjectsWithStateChange, a novel dataset designed to facilitate research in fine-grained 3D object recognition and retrieval of objects capable of state changes. The dataset captures variations in state and pose from arbitrary viewpoints to support learning discriminative embeddings that are invariant not only to state changes but also to variations in viewpoint, pose, and illumination. A key challenge is that different objects (within and across categories) may appear visually similar under certain state changes, causing their embeddings to be close and making discrimination difficult. To address this, we propose a curriculum learning strategy that leverages the learned similarity relationships after each epoch to guide training. Following curriculum learning principles, the approach progressively selects object pairs with smaller inter-object distances, gradually sampling harder-to-distinguish examples of visually similar objects from within and across categories during training. Our ablation study shows that this curriculum learning strategy improves object recognition accuracy by 7.9% and retrieval mAP by 9.2% compared to state-of-the-art methods. We believe that this approach enhances the model’s ability to learn discriminative features for fine-grained tasks involving objects with state changes, leading to improved performance not only on the new dataset we present, but also on three other multi-view datasets, namely ModelNet40, ObjectPI, and FG3D.
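
The curriculum described above, which progressively selects object pairs with smaller inter-object distances, can be sketched as a simple schedule. The linear schedule, the `min_fraction` floor, and the function name are illustrative assumptions. The abstract states only that harder (closer) pairs are sampled gradually, using similarities re-estimated after each epoch.

```python
def select_pairs(distances, epoch, total_epochs, min_fraction=0.25):
    """Return the object pairs to sample from at a given epoch.

    distances: dict mapping (object_a, object_b) -> embedding distance,
               assumed to be recomputed after each epoch.
    Early epochs keep every pair; later epochs keep only the closest
    (hardest to distinguish) fraction, shrinking linearly to min_fraction.
    """
    progress = epoch / max(total_epochs - 1, 1)
    fraction = 1.0 - progress * (1.0 - min_fraction)
    keep = max(1, round(fraction * len(distances)))
    ranked = sorted(distances, key=distances.get)  # closest pairs first
    return ranked[:keep]
```

Because the distances are passed in fresh on every call, the ranking follows the embedding as it changes during training.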

42

Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition

Dasol Choi ⋅ Seunghyun Lee ⋅ Youngsook Song

Vision-Language Models (VLMs) have shown capabilities in interpreting visual content, but their reliability in safety-critical everyday life scenarios remains insufficiently explored. We introduce VERI (Visual Emergency Recognition Dataset), a diagnostic benchmark comprising 200 images organized into 100 contrastive pairs. Each emergency scene is paired with a visually similar but safe counterpart through human verification and refinement. Using a two-stage evaluation protocol (risk identification and emergency response), we assess 17 VLMs (from open source to commercial APIs) across medical emergencies, accidents, and natural disasters. Our analysis reveals an "overreaction problem," where models achieve high recall in detecting genuine emergencies (70-100%) but suffer from low precision, misclassifying 31-96% of safe situations as dangerous. Seven safe scenarios were universally misclassified by all models, regardless of scale. This "better-safe-than-sorry" bias primarily results from contextual overinterpretation (88-98% of errors), challenging VLM reliability in safety-critical applications. As VLMs increasingly power real-world applications from smart home monitoring to autonomous systems, understanding and addressing these systematic biases becomes critical for safe deployment. Our results demonstrate a need for strategies specifically improving contextual reasoning in ambiguous visual situations.
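
To see how high recall and low precision coexist on a balanced set of contrastive pairs, a short worked calculation helps. The set sizes (100 emergency and 100 safe images) follow from the 100 pairs described above. The recall and false-alarm rates passed in are illustrative values chosen inside the reported ranges, not results for any particular model.

```python
def precision_from_rates(n_emergency, n_safe, recall, false_alarm_rate):
    """Precision of the 'emergency' class given recall and false-alarm rate.

    recall:           fraction of true emergencies flagged as emergencies
    false_alarm_rate: fraction of safe scenes flagged as emergencies
    """
    true_positives = recall * n_emergency
    false_positives = false_alarm_rate * n_safe
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged > 0 else 0.0


# Illustrative only: 90% recall with 60% of safe scenes misclassified.
example_precision = precision_from_rates(100, 100, recall=0.9, false_alarm_rate=0.6)
```

In this illustrative case 90 true emergencies and 60 safe scenes are flagged, so only 90 of 150 alarms are genuine, a precision of 0.6 despite 90% recall.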

43

Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation

Chengzhi Yu ⋅ Yifan Xu ⋅ Yifan Chen ⋅ Wenyi Zhang

Recently, large vision-language models (LVLMs) have risen to be a promising approach for multimodal tasks. However, principled hallucination mitigation remains a critical challenge. In this work, we first analyze the data generation process in LVLM hallucination mitigation and affirm that on-policy data significantly outperforms off-policy data, which thus calls for efficient and reliable preference annotation of on-policy data. We then point out that existing annotation methods introduce additional hallucination in training samples, which may reinforce the model's hallucination patterns. To address this problem, we propose training a hallucination classifier that gives binary annotations, which guarantees clean chosen samples for the subsequent alignment. To further harness the power of on-policy data, we design a robust iterative direct preference optimization (DPO) algorithm adopting a dynamic sample reweighting scheme. We conduct comprehensive experiments on three benchmarks with comparison to 8 state-of-the-art baselines. In particular, our approach reduces the hallucination rate of LLaVA-1.5-7B on MMHalBench by 50.8\% and the average hallucination rate on Object HalBench by 79.5\%; more significantly, our method fully taps into the potential of open-source models, enabling LLaVA-1.5-13B to even surpass the performance of GPT-4V.
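
For readers unfamiliar with DPO, the per-sample objective can be sketched as follows. The loss is the standard DPO form, the negative log-sigmoid of a scaled log-probability margin. The `weight` argument is only a placeholder for the dynamic sample reweighting mentioned above, whose actual rule the abstract does not describe, so it is an assumption here.

```python
import math


def weighted_dpo_loss(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected,
                      beta=0.1, weight=1.0):
    """Standard DPO loss for one preference pair, scaled by a sample weight.

    logp_*:     sequence log-probabilities under the policy being trained
    ref_logp_*: the same quantities under the frozen reference model
    weight:     hypothetical per-sample weight (the reweighting rule is
                not specified in the abstract)
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin > 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return weight * loss
```

When the policy and the reference agree, the margin is zero and the unweighted loss equals log 2. The loss falls as the policy raises the chosen response relative to the rejected one.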

44

Countering Multi-modal Representation Collapse through Rank-targeted Fusion

Seulgi Kim ⋅ Kiran Kokilepersaud ⋅ Mohit Prabhushankar ⋅ Ghassan AlRegib

Multi-modal fusion methods often suffer from two types of representation collapse: feature collapse where individual dimensions lose their discriminative power (as measured by eigenspectra), and modality collapse where one dominant modality overwhelms the other. Applications like human action anticipation that require fusing multifarious sensor data are hindered by both feature and modality collapse. However, existing methods attempt to counter feature collapse and modality collapse separately. This is because there is no unifying framework that efficiently addresses feature and modality collapse in conjunction. In this paper, we posit the utility of effective rank as an informative measure that can be utilized to quantify and counter both types of representation collapse. We propose \textit{Rank-enhancing Token Fuser}, a theoretically grounded fusion framework that selectively blends less informative features from one modality with complementary features from another modality. We show that our method increases the effective rank of the fused representation. To address modality collapse, we evaluate modality combinations that mutually increase each other's effective rank. We show that depth maintains representational balance when fused with RGB, avoiding modality collapse. We validate our method on action anticipation, where we present \texttt{R3D}, a depth-informed fusion framework. Extensive experiments on NTURGBD, UTKinect, and DARai demonstrate that our approach significantly outperforms prior state-of-the-art methods by up to 3.74\%. Upon acceptance, we will release our codebase to facilitate further research.
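
The abstract does not define effective rank. One common definition, assumed here, is the exponential of the Shannon entropy of the normalized singular values of the feature matrix. The sketch below takes the singular values as given (computing them is left to a linear algebra library) and the function name is illustrative.

```python
import math


def effective_rank(singular_values):
    """Effective rank as exp(entropy) of the normalized singular values.

    Assumed definition: p_i = s_i / sum(s), erank = exp(-sum p_i * log p_i).
    A flat spectrum gives the full dimension; a single dominant value
    gives a value close to 1, which signals collapsed features.
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Under this definition, a representation whose energy is spread evenly over four directions has effective rank 4, while one dominated by a single direction has effective rank near 1.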

45

Learning Unified Spatio-temporal Representations for Efficient Compressed Video Understanding

Shristi Biswas Biswas ⋅ Efstathia Soufleri ⋅ Arani Roy ⋅ Kaushik Roy

Training robust deep video representations has proven to be computationally challenging due to substantial decoding overheads, the enormous size of raw video streams, and their inherent high temporal redundancy. Different from existing schemes, operating exclusively in the compressed video domain and exploiting all freely available modalities, i.e., I-frames and P-frames (motion vectors and residuals), offers a compute-efficient alternative. Existing methods approach this task as a naive multi-modality problem, ignoring the temporal correlation and implicit sparsity across P-frames for modeling stronger shared representations for videos of the same action, making training and generalization easier. By revisiting the high-level design of dominant video understanding backbones, we increase inference speed by a factor of $56$ while retaining similar performance. For this, we propose a hybrid end-to-end framework that factorizes learning across three key concepts to reduce inference cost by $330\times$ versus prior art: First, a specially designed dual-encoder scheme with efficient Spiking Temporal Modulators to minimize latency while retaining cross-domain feature aggregation. Second, a unified transformer model to capture inter-modal dependencies using global self-attention to enhance I-frame -- P-frame contextual interactions. Third, a Multi-Modal Mixer Block to model rich representations from the joint spatiotemporal token embeddings. Experiments show that our method results in a lightweight architecture achieving state-of-the-art video recognition performance on UCF-101, HMDB-51, K-400, K-600 and SS-v2 datasets with favorable costs ($0.73$J/V) and fast inference ($16$V/s). Further, our observations bring new insights into practical design choices for efficient next-generation spatiotemporal learners.

46

More Than Memory Savings: Zeroth-Order Optimization Mitigates Forgetting in Continual Learning

Wanhao Yu ⋅ Zheng Wang ⋅ Shuteng Niu ⋅ Sen Lin ⋅ Li Yang

Zeroth-order (ZO) optimization has gained attention as a memory-efficient alternative to first-order (FO) methods, particularly in settings where gradient computation is expensive or even impractical. Beyond its memory efficiency, in this work, we investigate ZO optimization for continual learning (CL) as a novel approach to address the plasticity-stability-efficiency trilemma. Through theoretical analysis and empirical evidence, we show that ZO optimization naturally leads to flatter loss landscapes, which in turn helps reduce forgetting in continual learning. However, this stability comes at a cost to plasticity: ZO optimization converges more slowly than first-order methods, which limits its ability to acquire new task-specific knowledge, particularly under constrained training budgets. To better understand this trade-off, we conduct a holistic evaluation of ZO optimization applied to various existing CL methods. Our findings reveal that ZO optimization enhances stability but often undermines plasticity, particularly when applied to learnable classifiers. Motivated by this insight, we propose ZO-FC, a simple but effective approach that applies ZO optimization to a single adapter-based PEFT module with an FO-optimized classifier. This design leverages the stability benefits of ZO while preserving the adaptability of FO updates with negligible memory overhead. Experiments demonstrate that ZO-FC achieves an effective balance between stability and plasticity, offering a practical and memory-efficient solution for
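
For readers unfamiliar with ZO optimization, the sketch below shows a generic two-point random-direction gradient estimator, which needs only forward evaluations of the loss. It is a textbook-style estimator, not necessarily the one used in ZO-FC. The function name, the smoothing radius `mu`, and the sample count are assumptions.

```python
import numpy as np


def zo_gradient(loss_fn, params, mu=1e-3, num_samples=8, rng=None):
    """Two-point zeroth-order gradient estimate from loss evaluations only.

    `params` is expected to be a float array; no backpropagation is used.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    grad = np.zeros_like(params)
    for _ in range(num_samples):
        u = rng.standard_normal(params.shape)  # random probing direction
        # Central finite difference of the loss along direction u.
        delta = loss_fn(params + mu * u) - loss_fn(params - mu * u)
        grad += (delta / (2.0 * mu)) * u
    return grad / num_samples
```

Only the loss values at perturbed parameters are needed, so no activations have to be stored for a backward pass, which is where the memory saving comes from. The estimate is noisy, which is consistent with the slower convergence noted in the abstract.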

47

Global Focal and Radial Distortion Averaging from Radial Fundamental Matrices for Robust Self-Calibration

Sergei Solonets ⋅ Daniil Sinitsyn ⋅ Daniel Cremers

Classical self-calibration techniques either perform computationally expensive bundle adjustment to estimate all camera parameters or initialize the focal lengths alone by globally averaging fundamental matrices, typically ignoring radial distortion. Both strategies can degrade accuracy in large-scale Structure from Motion pipelines. We present a calibration method that overcomes these limitations by averaging focal lengths and radial distortion parameters using radial fundamental matrices. This method avoids costly point-wise optimization. Our algorithm minimizes the geometric distance between an observed fundamental matrix and the essential-matrix manifold. This provides a mathematically consistent and highly scalable framework for calibrating the camera's intrinsic parameters. Experiments on diverse real-world datasets demonstrate that our joint estimator provides more precise focal length and distortion parameter estimates than existing methods. Furthermore, we demonstrate that naive, independent distortion averaging is suboptimal, which reinforces the importance of joint focal-radial estimation. These results underscore the importance of incorporating radial distortion averaging into modern self-calibration methods to improve reconstruction accuracy and stability.

48

UCDSC: Open Set UnCertainty aware Deep Simplex Classifier for Medical Image Datasets

Arnav Aditya ⋅ Nitin Kumar ⋅ Saurabh Shigwan

Driven by advancements in deep learning, computer-aided diagnoses have made remarkable progress. However, outside controlled laboratory settings, algorithms may encounter several challenges. In the medical domain, these difficulties often stem from limited data availability due to ethical and legal restrictions, as well as the high cost and time required for expert annotations, especially in the face of emerging or rare diseases. In this context, open-set recognition plays a vital role by identifying whether a sample belongs to one of the known classes seen during training or should be rejected as unknown. Recent studies have shown that features learned in the later stages of deep neural networks cluster around their class means, which themselves are arranged as individual vertices of a regular simplex \cite{papyan2020prevalence}. The proposed method introduces a loss function designed to reject samples of unknown classes effectively by penalizing open space regions using auxiliary datasets. This approach achieves significant performance gains across four MedMNIST datasets (BloodMNIST, OCTMNIST, DermaMNIST, and TissueMNIST), outperforming state-of-the-art techniques. The source code can be accessed at https://anonymous.4open.science/r/UCDSC/

49

DODA: Adapting Object Detectors to Dynamic Agricultural Environments in Real-Time with Diffusion

Shuai Xiang ⋅ Pieter Blok ⋅ James Burridge ⋅ Haozhou Wang ⋅ Wei Guo

Object detection has wide applications in agriculture, but domain shifts of diverse environments limit the broader use of the trained models. Existing domain adaptation methods usually require retraining the model for new domains, which is impractical for agricultural applications due to constantly changing environments. In this paper, we propose DODA (Diffusion for Object-detection Domain Adaptation in Agriculture), a diffusion-based framework that can adapt the detector to a new domain in just 2 minutes. DODA incorporates external domain embeddings and an improved layout-to-image approach, allowing it to generate high-quality detection data for new domains without additional training. We demonstrate DODA's effectiveness on the Global Wheat Head Detection dataset, where fine-tuning detectors on DODA-generated data yields significant improvements across multiple domains. DODA provides a simple yet powerful solution for agricultural domain adaptation, reducing the barriers for growers to use detection in personalised environments.

50

Structured Context Learning for Generic Event Boundary Detection

Xin Gu ⋅ Congcong Li ⋅ Xinyao Wang ⋅ Dexiang Hong ⋅ Libo Zhang ⋅ Tiejian Luo ⋅ Longyin Wen ⋅ Heng Fan

Generic Event Boundary Detection (GEBD) aims to identify moments in videos that humans perceive as event boundaries. This paper proposes a novel method for addressing this task, called Structured Context Learning, which introduces the Structured Partition of Sequence (SPoS) to provide structured context for temporal information learning. Our approach is end-to-end trainable and flexible, not being restricted to specific temporal models like GRU, LSTM, and Transformers. This flexibility enables our method to achieve a better speed-accuracy trade-off. Specifically, we apply SPoS to partition the input frame sequence and provide a structured context for the subsequent temporal model. Notably, SPoS's overall computational complexity is linear with respect to the video length. We next calculate group similarities to capture differences between frames, and a lightweight fully convolutional network is utilized to determine the event boundaries based on the grouped similarity maps. To remedy the ambiguities of boundary annotations, we adapt the Gaussian kernel to preprocess the ground-truth event boundaries. Our proposed method is extensively evaluated on the challenging Kinetics-GEBD, TAPOS, and shot transition detection datasets, demonstrating its superiority over existing state-of-the-art methods.
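
The Gaussian preprocessing of ground-truth boundaries can be pictured with the following minimal sketch, which turns hard boundary indices into soft per-frame targets. The abstract does not give the exact kernel or how overlapping boundaries are combined, so the `sigma` value, the max-combination, and the function name are assumptions.

```python
import numpy as np


def soften_boundaries(num_frames, boundary_indices, sigma=1.0):
    """Turn hard boundary frame indices into soft per-frame targets in [0, 1].

    Each annotated boundary contributes a Gaussian bump centered on its frame;
    overlapping bumps are combined with a maximum (an assumed choice).
    """
    t = np.arange(num_frames, dtype=float)
    target = np.zeros(num_frames)
    for b in boundary_indices:
        bump = np.exp(-0.5 * ((t - b) / sigma) ** 2)
        target = np.maximum(target, bump)
    return target
```

Frames next to an annotated boundary then receive partial credit instead of a hard zero, which is one way to tolerate annotators disagreeing by a frame or two.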

51

Sun-E: Dataset and Benchmark for Event-Based Sun Sensing

Sydney Dolan ⋅ Alessandro Golkar

Event cameras are increasingly being explored for space applications due to their high dynamic range and increased spatiotemporal resolution. Existing datasets in this application have focused on capturing low-light, sub-pixel space objects and Earth observation scenarios. There remains a notable gap in datasets tailored to high-illumination conditions, particularly those involving direct solar imaging. This work introduces a dataset of solar event recordings captured with an event camera in a controlled sun-simulator environment. The dataset is specifically designed to support research in sun sensing and stray light analysis for spacecraft attitude estimation applications. It includes raw event data, annotated sun centroid locations, object motion profiles, and secondary optical aberration artifacts. In addition to the dataset, we present a systematic methodology for estimating the sun vector, intended to serve as a benchmark for evaluating sun sensing approaches in this application. All data and code are open source to facilitate further study.

52

FCC: Fully Connected Correlation for One-Shot Segmentation

Seonghyeon Moon ⋅ Haein Kong ⋅ Muhammad Haris Khan ⋅ Mubbasir Kapadia ⋅ Yuewei Lin

One-shot segmentation (OSS) aims to segment the target object in a query image using only a single support image and mask pair. Therefore, having strong prior information for the target object using the support set is essential to guide the initial training of OSS, which leads to the success of one-shot segmentation in challenging cases, such as when the target object shows considerable variation in appearance, texture, or scale across the support and query images. To enrich this prior knowledge, we introduce FCC (Fully Connected Correlation), which integrates pixel-level correlations between support and query features, capturing associations that reveal target-specific patterns and correspondences both within the same layer and across layers. FCC captures previously inaccessible target information, effectively addressing the limitations of the support mask. Our approach consistently demonstrates state-of-the-art performance in the PASCAL, COCO, and domain shift tests, while also notably accelerating model convergence. We conducted an ablation study and cross-layer correlation analysis to validate FCC's core methodology. These findings reveal the effectiveness of FCC in enhancing prior information and overall model performance for OSS.

53

MoRe: Monocular Geometry Refinement via Graph Optimization for Cross-View Consistency

Dongki Jung ⋅ Jaehoon Choi ⋅ Yonghan Lee ⋅ Sungmin Eum ⋅ Heesung Kwon ⋅ Dinesh Manocha

Monocular 3D foundation models offer an extensible solution for perception tasks, making them attractive for broader 3D vision applications. In this paper, we propose MoRe, a training-free Monocular Geometry Refinement method designed to improve cross-view consistency and achieve scale alignment. To induce inter-frame relationships, our method employs feature matching between frames to establish correspondences. Rather than applying simple least squares optimization on these matched points, we formulate a graph-based optimization framework that performs local planar approximation using the 3D points and surface normals estimated by monocular foundation models. This formulation addresses the scale ambiguity inherent in monocular geometric priors while preserving the underlying 3D structure. We further demonstrate that MoRe not only enhances 3D reconstruction but also improves novel view synthesis, particularly in sparse-view rendering scenarios.

54

ProSkill: Segment-Level Skill Assessment in Procedural Videos

Michele Mazzamuto ⋅ Daniele Di Mauro ⋅ Gianpiero Francesca ⋅ Giovanni Farinella ⋅ Antonino Furnari

Skill assessment in procedural videos is crucial for the objective evaluation of human performance in settings such as manufacturing and procedural daily tasks. Current research on skill assessment has predominantly focused on sports and lacks large-scale datasets for complex procedural activities. Existing studies typically involve only a limited number of actions and focus either on pairwise assessments (e.g., A is better than B) or on binary labels (e.g., good execution vs. needs improvement). In response to these shortcomings, we introduce ProSKILL, the first benchmark dataset for action-level skill assessment in procedural tasks. ProSKILL provides absolute skill assessment annotations, along with pairwise ones. This is enabled by a novel and scalable annotation protocol that allows for the creation of an absolute skill assessment ranking starting from pairwise assessments. This protocol leverages a Swiss Tournament scheme for efficient pairwise comparisons, which are then aggregated into consistent, continuous global scores using an Elo-based rating system. We use our dataset to benchmark the main state-of-the-art skill assessment algorithms, including both ranking-based and pairwise paradigms. The suboptimal results achieved by the current state-of-the-art highlight the challenges and thus the value of ProSKILL in the context of skill assessment for procedural videos.
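
To illustrate how pairwise judgments can be aggregated into continuous scores, the sketch below applies the classic Elo update to a single comparison. The abstract says only that the rating system is Elo-based, so the K-factor, the 400-point scale, and the function names are the textbook defaults rather than the authors' settings. The Swiss Tournament pairing step is not shown.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is judged better than B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Update a dict of ratings in place after one pairwise comparison."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain  # the winner gains exactly what the loser loses
    ratings[loser] -= gain
    return ratings


# Usage: two clips start equal; clip "a" is judged more skilled than "b".
scores = update_elo({"a": 1500.0, "b": 1500.0}, winner="a", loser="b")
```

After many such comparisons, the ratings form a continuous global ranking even though annotators only ever answered "which of these two is better?".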

55

MergeSlide: Continual Model Merging and Task-to-Class Prompt-Aligned Inference for Lifelong Learning on Whole Slide Images

Bui Cao Doanh ⋅ Ba Ngo ⋅ Pham Luan ⋅ Khang Nguyen ⋅ Mai Nguyen ⋅ Yasuhiko Nakashima

Lifelong learning on Whole Slide Images (WSIs) aims to train or fine-tune a unified model sequentially on cancer-related tasks, reducing the resources and effort required for data transfer and processing, especially given the gigabyte-scale size of WSIs. In this paper, we introduce MergeSlide, a simple yet effective framework that treats lifelong learning as a model merging problem by leveraging a vision-language pathology foundation model. When a new task arrives, it is: 1) defined with class-aware prompts, 2) fine-tuned for a few epochs using an MLP-free backbone, and 3) merged into a unified model using an orthogonal continual merging strategy that preserves performance and mitigates catastrophic forgetting. For inference under the class-incremental learning (CLASS-IL) setting, where task identity is unknown, we introduce Task-to-Class Prompt-aligned (TCP) inference. Specifically, TCP first identifies the most relevant task using task-level prompts and then applies the corresponding class-aware prompts to generate predictions. To evaluate MergeSlide, we conduct experiments on a stream of six TCGA datasets. The results show that MergeSlide outperforms both rehearsal-based continual learning and vision-language zero-shot baselines. Furthermore, MergeSlide demonstrates robustness to task and dataset order, whereas other methods often show significant performance variation. These findings confirm its consistency, scalability, and flexibility. Code and data will be made publicly available.

56

CoL2A: Convolution-free Local Linear Attention for SpatioTemporal Event Processing

Yusuke Sekikawa ⋅ Itsumi Araki ⋅ Jun Nagata ⋅ Andreu Girbau

Linear attention is $\textit{sparse}$, $\textit{recurrent}$, and $\textit{GPU-parallel}$; these are essential features for processing sparse data from event-based cameras. We argue that $\textit{locality}$ is missing to efficiently model event-to-event relationships for continuous spatiotemporal perception. We propose $\textit{CoL}^2\mkern-3mu\textit{A}$, which introduces locality into linear attention without using a computationally demanding convolution operation. The key idea of the convolution-free formulation is to restrict the local convolutional kernel for positional embedding to a special class that can be decomposed into two global positional embeddings, which can be absorbed into the query and key; this replaces convolution with a local sum. To the best of our knowledge, $\textit{CoL}^2\mkern-3mu\textit{A}$ is the first to offer $\textit{sparsity}$, $\textit{recurrence}$, GPU $\textit{parallelism}$, and $\textit{locality}$ simultaneously. We demonstrate $\textit{CoL}^2\mkern-3mu\textit{A}$'s effectiveness on a dense, high-temporal-resolution ($>$ 1000 fps) prediction task from events, showing real-time capability while maintaining competitive results against the conventional method.
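
As background for the recurrence and parallelism properties mentioned above, the sketch below shows plain causal linear attention in both its recurrent form (a running state updated per token) and its parallel form, which give the same output. It does not implement the locality mechanism of CoL2A. The `elu(x) + 1` feature map and all function names are assumptions taken from common practice.

```python
import numpy as np


def feature_map(x):
    """Positive feature map elu(x) + 1, a common choice for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_attention_recurrent(q, k, v):
    """Process tokens one at a time with a constant-size running state."""
    q, k = feature_map(q), feature_map(k)
    state = np.zeros((k.shape[1], v.shape[1]))  # running sum of k_t v_t^T
    norm = np.zeros(k.shape[1])                 # running sum of k_t
    out = np.zeros((q.shape[0], v.shape[1]))
    for t in range(q.shape[0]):
        state += np.outer(k[t], v[t])
        norm += k[t]
        out[t] = (q[t] @ state) / (q[t] @ norm)
    return out


def linear_attention_parallel(q, k, v):
    """Same computation for all tokens at once using a causal mask."""
    q, k = feature_map(q), feature_map(k)
    scores = np.tril(q @ k.T)  # keep only past and current tokens
    return (scores @ v) / scores.sum(axis=1, keepdims=True)
```

The recurrent form suits streaming event data because each new event touches only the fixed-size state, while the parallel form suits batched GPU training.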

57

Dronaquatics: Real-time Swimming Analytics Using Drone Captured Imagery

Thu Tran ⋅ Harold Abraham Joseph ⋅ Kichang Lee ⋅ Kenny Choo ⋅ Dong Ma ⋅ Shaohui Foong ⋅ Thivya Kandappu ⋅ Jeonggil Ko ⋅ Rajesh Balan

Accurate swimming performance monitoring has traditionally relied on wearable sensors, which can disrupt natural technique and are often impractical in competitive settings. In this paper, we present a fully vision-based system for automatic swimmer analysis using overhead drone footage, removing the need for any body-mounted device or underwater equipment. By fine-tuning pose estimation models for aerial aquatic conditions, our approach robustly extracts full-body swimmer skeletons even under challenging scenarios such as splashes and partial occlusions. From these poses, we classify swimming strokes, compute instantaneous speed, estimate lap times, and count individual strokes. Unlike existing methods, our system provides scalable, unobtrusive, and infrastructure-free tracking. Evaluated on real-world drone-captured swimming competition data, our method achieves a median speed estimation error below 4\% (under 0.05 m/s), a median lap time error of just 0.03s, and stroke count errors typically under one stroke per lap.

58

Histopath-C: Towards Realistic Domain Shifts for Histopathology Vision-Language Adaptation

Mehrdad Noori ⋅ Gustavo Vargas Hakim ⋅ David OSOWIECHI ⋅ Fereshteh Shakeri ⋅ Ali Bahri ⋅ Moslem Yazdanpanah ⋅ Sahar Dastani ⋅ Ismail Ayed ⋅ Christian Desrosiers

Medical vision-language models (VLMs) have shown remarkable performance in various medical imaging domains such as histopathology by leveraging pre-trained, contrastive models that exploit visual and textual information. However, histopathology images may exhibit severe domain shifts, such as staining, contamination, blurring, and noise, which may severely degrade the VLM's downstream performance. In this work, we introduce Histopath-C, a new benchmark with realistic synthetic corruptions designed to mimic real-world distribution shifts observed in digital histopathology. Our framework dynamically applies corruptions to any available dataset and evaluates Test-Time Adaptation (TTA) mechanisms on the fly. We then propose LATTE, a transductive, low-rank adaptation strategy that exploits multiple text templates, mitigating the sensitivity of histopathology VLMs to diverse text inputs. Our approach outperforms state-of-the-art TTA methods originally designed for natural images across a breadth of histopathology datasets, demonstrating the effectiveness of our proposed design for robust adaptation in histopathology images. An anonymized code repository is available at https://anonymous.4open.science/r/Histopath-C_LATTE, and a full project website will be available upon publication.

59

Robust Multimodal Emotion Recognition from Incomplete Modalities via Query-Based Unimodal and Cross-Modal Learning

Ryo Miyoshi ⋅ Mayu Otani ⋅ Yuki Okafuji

Multimodal emotion recognition (MER) aims to identify human emotions from inputs such as text, vision, and audio. However, existing methods often assume complete modality availability during training and inference, which is unrealistic in real-world scenarios due to sensor failures or privacy constraints. We propose Dual-Query Fusion (DQF), a framework that enables robust MER using only incomplete modality inputs, without relying on reconstruction or knowledge distillation. DQF introduces two types of learnable queries: Q-UA for extracting informative unimodal features, and Q-CA for adaptive cross-modal integration. These modules are designed to operate effectively even when some modalities are missing. Experiments on two public datasets demonstrate that DQF achieves superior performance and robustness compared to existing methods, even when trained exclusively on incomplete inputs. These results highlight the effectiveness and practicality of DQF for real-world MER tasks.

60

WiSE-OD: Benchmarking Robustness in Infrared Object Detection

Heitor Medeiros ⋅ ATIF BELAL ⋅ Masih Aminbeidokhti ⋅ Eric Granger ⋅ Marco Pedersoli

Object detection (OD) in infrared (IR) imagery is critical for low-light and nighttime applications. However, the scarcity of large-scale IR datasets forces models to rely on weights pre-trained on RGB images. While fine-tuning on IR improves accuracy, it often compromises robustness under distribution shifts due to the inherent modality gap between RGB and IR. To address this, we introduce LLVIP-C and FLIR-C, two cross-modality out-of-distribution (OOD) benchmarks built by applying corruption to standard IR datasets. Additionally, to fully leverage the complementary knowledge from RGB and infrared trained models, we propose WiSE-OD, a weight-space ensembling method with two variants: WiSE-OD$_{ZS}$, which combines RGB zero-shot and IR fine-tuned weights, and WiSE-OD$_{LP}$, which blends zero-shot and linear probing. Evaluated across three RGB-pretrained detectors and two robust baselines, WiSE-OD improves both cross-modality and corruption robustness without any additional training or inference cost.
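
Weight-space ensembling can be sketched as a per-parameter linear interpolation between two checkpoints that share an architecture. The abstract does not state the mixing rule or coefficient, so the single coefficient `alpha`, the dictionary-of-arrays layout, and the function name below are assumptions.

```python
import numpy as np


def interpolate_weights(zero_shot, fine_tuned, alpha=0.5):
    """Blend two checkpoints parameter by parameter.

    alpha = 0 returns the zero-shot weights, alpha = 1 the fine-tuned ones.
    Both inputs map parameter names to arrays of identical shapes.
    """
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: (1.0 - alpha) * zero_shot[name] + alpha * fine_tuned[name]
        for name in zero_shot
    }
```

Because the result is a single set of weights, inference costs the same as running either original model. This matches the abstract's claim of no additional inference cost.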

61

Patch-wise Retrieval: A Bag of Practical Techniques for Instance-level Matching

Wonseok Choi ⋅ Sohwi Lim ⋅ Nam Hyeon-Woo ⋅ Moon Ye-Bin ⋅ Dong-ju Jeong ⋅ Jinyoung Hwang ⋅ Tae-Hyun Oh

Instance-level image retrieval aims to find images containing the same object as a given query, despite variations in size, position, or appearance. To address this challenging task, we propose Patchify, a simple yet effective patch-wise retrieval framework that offers high performance, scalability, and interpretability without requiring fine-tuning. Patchify divides each database image into a small number of structured patches and performs retrieval by comparing these local features with a global query descriptor, enabling accurate and spatially grounded matching. To assess not just retrieval accuracy but also spatial correctness, we introduce LocScore, a localization-aware metric that quantifies whether the retrieved region aligns with the target object. This makes LocScore a valuable diagnostic tool for understanding and improving retrieval behavior. We conduct extensive experiments across multiple benchmarks, backbones, and region selection strategies, showing that Patchify outperforms global methods and complements state-of-the-art reranking pipelines. Furthermore, we apply Product Quantization for efficient large-scale retrieval and highlight the importance of using informative features during compression, which significantly boosts performance.
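
The comparison between local patch features and a global query descriptor might look like the sketch below, which scores one database image by its best-matching patch under cosine similarity. The abstract does not specify the similarity or aggregation, so the max-over-patches rule and the function name are assumptions.

```python
import numpy as np


def score_image(query_descriptor, patch_descriptors):
    """Score a database image by its best-matching patch.

    Returns (similarity, patch_index). The index indicates which region of
    the image matched, which is what makes the result spatially grounded.
    """
    q = query_descriptor / np.linalg.norm(query_descriptor)
    p = patch_descriptors / np.linalg.norm(patch_descriptors, axis=1, keepdims=True)
    similarities = p @ q  # cosine similarity of each patch to the query
    best = int(np.argmax(similarities))
    return float(similarities[best]), best
```

Ranking database images by this score and reporting the winning patch index gives both a retrieval result and a coarse localization, the kind of output a localization-aware metric such as LocScore is meant to assess.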

62

Gaussian Swaying: Surface-Based Framework for Aerodynamic Simulation with 3D Gaussians

Hongru Yan ⋅ Xiang Zhang ⋅ Zeyuan Chen ⋅ Fangyin Wei ⋅ Zhuowen Tu

Branches swaying in the breeze, flags rippling in the wind, and boats rocking on the water all show how aerodynamics shape natural motion -- an effect crucial for realism in vision and graphics. In this paper, we present Gaussian Swaying, a surface-based framework for aerodynamic simulation using 3D Gaussians. Unlike mesh-based methods that require costly meshing, or particle-based approaches that rely on discrete positional data, Gaussian Swaying models surfaces continuously with 3D Gaussians, enabling efficient and fine-grained aerodynamic interaction. Our framework unifies simulation and rendering on the same representation: Gaussian patches, which support force computation for dynamics while simultaneously providing normals for lightweight shading. Comprehensive experiments on both synthetic and real-world datasets across multiple metrics demonstrate that Gaussian Swaying achieves state-of-the-art performance and efficiency, offering a scalable approach for realistic aerodynamic scene simulation.

63

Restora-Flow: Mask-Guided Image Restoration with Flow Matching

Arnela Hadzic ⋅ Franz Thaler ⋅ Lea Bogensperger ⋅ Simon Johannes Joham ⋅ Martin Urschler

Flow matching has emerged as a promising generative approach that addresses the lengthy sampling times associated with state-of-the-art diffusion models and enables a more flexible trajectory design, while maintaining high-quality image generation. This capability makes it suitable as a generative prior for image restoration tasks. Although current methods leveraging flow models have shown promising results in restoration, some still suffer from long processing times or produce over-smoothed results. To address these challenges, we introduce Restora-Flow, a training-free method that guides flow matching sampling by a degradation mask and incorporates a trajectory correction mechanism to enforce consistency with degraded inputs. We evaluate our approach on both natural and medical datasets across several image restoration tasks involving a mask-based degradation, i.e., inpainting, super-resolution and denoising. We show superior perceptual quality and processing time compared to diffusion and flow matching-based reference methods. Our code will be made available to the public.

64

PrevMatch: Revisiting and Maximizing Temporal Knowledge in Semi-Supervised Semantic Segmentation

Wooseok Shin ⋅ Hyun Joon Park ⋅ Jin Sob Kim ⋅ Juan Yun ⋅ Se Park ⋅ Sung Han

In semi-supervised semantic segmentation, the Mean Teacher- and co-training-based approaches are employed to mitigate confirmation bias and coupling problems. However, despite their high performance, these approaches frequently involve complex training pipelines and a substantial computational burden, limiting their scalability and compatibility. In this paper, we propose the PrevMatch framework, which effectively mitigates these limitations by maximizing the utilization of the temporal knowledge obtained during the training process. The PrevMatch framework relies on two core strategies: (1) we reconsider the use of temporal knowledge and directly utilize previous models obtained during training to generate additional pseudo-label guidance, referred to as previous guidance; and (2) we design a highly randomized ensemble strategy to maximize the effectiveness of the previous guidance. PrevMatch, a simple yet effective plug-in method, can be seamlessly integrated into existing semi-supervised learning frameworks with minimal computational overhead. Experimental results on three benchmark semantic segmentation datasets show that incorporating PrevMatch into existing methods significantly improves their performance. Furthermore, our analysis indicates that PrevMatch facilitates stable optimization during training, resulting in improved generalization performance.
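
One way to picture a randomized ensemble over previous models is sketched below: pick a random subset of saved checkpoints and average their class probabilities with random convex weights. The abstract does not describe the actual randomization, so the subset size `k`, the Dirichlet weights, and the function name are hypothetical.

```python
import numpy as np


def random_ensemble(prev_probs, k=2, rng=None):
    """Average predictions of k randomly chosen previous checkpoints.

    prev_probs has shape (checkpoints, pixels, classes); each row along the
    class axis is a probability distribution. Returns (pixels, classes).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    prev_probs = np.asarray(prev_probs, dtype=float)
    k = min(k, prev_probs.shape[0])
    chosen = rng.choice(prev_probs.shape[0], size=k, replace=False)
    weights = rng.dirichlet(np.ones(k))  # random convex combination weights
    return np.tensordot(weights, prev_probs[chosen], axes=1)
```

The output remains a valid probability distribution per pixel, so it can serve as an extra pseudo-label target alongside the guidance from the current model.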

65

One-shot Portrait Stylization via Geometric Alignment

Xinrui Wang ⋅ Zilin Guo ⋅ Zhuoru Li ⋅ Jinze Yu ⋅ Heng Zhang ⋅ Yusuke Iwasawa ⋅ Yutaka Matsuo ⋅ Jiaxian Guo

Portrait stylization aims to cast vivid artistic style drawn from style examples onto portrait photos. This task has recently been extensively studied with machine learning algorithms, but it is still difficult for existing methods to stylize portraits from a single style reference, severely limiting these methods for real-world applications. In this paper, we propose a portrait stylization method that learns style from a single artistic portrait image. Unlike previous StyleGAN-based methods that heavily rely on the quality of GAN inversion, or diffusion-based methods that introduce computationally expensive operations and fall short of precise control, our method achieves high-quality stylization with a small computation and parameter budget. Specifically, we employ geometric alignment to build spatial correlation between content images and the style reference. A content LoRA and a style LoRA are then jointly optimized on a pre-trained diffusion backbone, with orthogonal adaptation used to disentangle the content and style information. During inference, the style LoRA is integrated into the diffusion backbone, and ControlNet is further combined to facilitate better spatial and identity control. We illustrate abundant stylized portraits with multiple styles. Qualitative comparison, quantitative validation, and a user study all show that our method outperforms existing methods, and an ablation study demonstrates the effectiveness of each component. Code and pre-trained model will be made publicly available upon paper acceptance.

66

Zero-Shot Table Extraction in Business Documents: A Unified Benchmark with Error Taxonomy and Ecological Analysis

Eliott THOMAS ⋅ Mickael Coustaty ⋅ Aurélie JOSEPH ⋅ Tri-Cong Pham ⋅ Gaspar DELOIN ⋅ Elodie CAREL ⋅ Vincent d'Andecy ⋅ Jean-marc Ogier

Tables in business documents power analytics and compliance, yet task-specific datasets are costly to build. Practitioners therefore turn to zero-shot vision--language models (VLMs). We study zero-shot realism for table detection (TD) and table structure recognition (TSR) under a unified protocol on DocILE-QUEST and a private STM154 corpus. We report TD with GIoU, Purity, and Completeness, and TSR with TEDS and TEDS-S, evaluating commercial VLMs (GPT-4o, GPT-5-mini), compact detectors, and supervised YOLO/DETR baselines. Zero-shot VLMs are strong for TSR and competitive for TD, while fine-tuned or from-scratch detectors lead when box quality and robustness to clutter matter. We add an automated error taxonomy that isolates actionable failures (missed, merged/split tables, header--body confusions, cell topology). Finally, we quantify emissions, finding a $10^4$ gap between the lightest and heaviest systems.

67

SegMango: Early Deep Mango Yield Prediction based on Flower Segmentation and Weather Data

Janaksinh Ven ⋅ Charu Sharma ⋅ Azeemuddin Syed

Early-stage fruit yield prediction plays a key role in supporting timely agronomic decisions, enhancing market planning, and empowering farmers with data-driven insights. Over the years, most approaches to yield estimation have focused on fruit counting techniques, typically performed just before harvest. While these methods have proven useful, they often come into play late in the cultivation cycle, limiting their impact on early planning and resource optimization. In this work, we introduce a comprehensive baseline framework for predicting mango yield at an earlier stage - during flowering - using image-based learning. Our contributions are twofold. (i) Our approach combines a SegFormer-based segmentation model with a regression pipeline to estimate yield from images, while also exploring the role of contextual features such as weather and scale. (ii) This work introduces a novel benchmark and an enriched dataset, paving the way for scalable, automated tools that can assist farmers and stakeholders in making proactive decisions throughout the mango growing season. Our work demonstrates that for multi-modal yield prediction, integrating features that complement visual representations (like scale) can be more impactful than using features with a stronger standalone linear correlation (like weather). Our single-image model, based on the SegFormer-B1 encoder, achieved a mean absolute error (MAE) of 7.68, R² of 0.76, and mean squared error (MSE) of 115.48. These results highlight the promise of vision-based models for yield estimation from early-stage flowering cues. To the best of our knowledge, this is the first work to address the prediction of mango yield using images from the flowering stage and weather data.
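
For reference, the three reported regression metrics can be computed as in the sketch below, using their standard definitions. The function name and the toy numbers in the usage line are illustrative and unrelated to the mango dataset.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R^2) using the standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    mae = float(np.mean(np.abs(residual)))
    mse = float(np.mean(residual ** 2))
    # R^2 compares the model against always predicting the mean of y_true.
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return mae, mse, 1.0 - ss_res / ss_tot


# Usage on toy values (not from the paper).
mae, mse, r2 = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

MAE is in the same units as the yield, whereas MSE is in squared units, so the two reported error values are not directly comparable in magnitude.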

68

Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery

Mai Tsujimoto ⋅ Junjue Wang ⋅ Weihao Xuan ⋅ Naoto Yokoya

Three-dimensional geospatial analysis underpins critical applications in urban planning, climate adaptation, and environmental assessment. Current methods rely on expensive specialized sensors (e.g., LiDAR and multispectral) that limit global accessibility. Existing sensor-based and rule-driven methods further struggle with tasks that require the integration of multiple 3D cues, handling uncertainty, and providing interpretable reasoning. We introduce Geo3DVQA, the first comprehensive benchmark for evaluating vision–language models (VLMs) in height-aware 3D geospatial reasoning from RGB-only remote sensing imagery. Unlike traditional sensor-based frameworks, Geo3DVQA emphasizes holistic scenarios that combine elevation, sky view factors, and land cover patterns. The benchmark includes 110k curated question–answer pairs spanning 16 task categories across three complexity levels: single-feature inference, multi-feature reasoning, and application-level spatial analysis. The evaluation of ten state-of-the-art VLMs highlights the difficulty of RGB-to-3D reasoning. GPT-4o and Gemini-2.5-Flash achieved only 28.6% and 33.0% accuracy respectively, while domain-specific fine-tuning of Qwen2.5-VL-7B achieved 49.6% (+24.8 points). These results reveal the current VLM limitations and establish a new challenge frontier for scalable and accessible 3D geospatial analyses with holistic reasoning capabilities.

69

PoseGaussian: Pose-Driven Novel View Synthesis for Robust 3D Human Reconstruction

Ju Shen ⋅ Chen Chen ⋅ Tam Nguyen ⋅ Vijayan Asari

We propose PoseGaussian, a pose-guided Gaussian Splatting framework for high-fidelity human novel view synthesis. Human body pose serves a dual purpose in our design: as a structural prior, it is fused with a color encoder to refine depth estimation; as a temporal cue, it is processed by a dedicated pose encoder to enhance temporal consistency across frames. These components are integrated into a fully differentiable, end-to-end trainable pipeline. Unlike prior works that use pose only as a condition or for warping, PoseGaussian embeds pose signals into both geometric and temporal stages to improve robustness and generalization. It is specifically designed to address challenges inherent in dynamic human scenes, such as articulated motion and severe self-occlusion. Notably, our framework achieves real-time rendering at 100 FPS, maintaining the efficiency of standard Gaussian Splatting pipelines. We validate our approach on ZJU-MoCap, THuman2.0, and in-house datasets, demonstrating state-of-the-art performance in perceptual quality and structural accuracy (PSNR 30.86, SSIM 0.979, LPIPS 0.028).

70

Timestamp Query Transformer for Temporal Action Segmentation

Tieqiao Wang ⋅ Sinisa Todorovic

This work addresses action segmentation in videos under sparse timestamp supervision, where only a single frame per action segment, referred to as a timestamp, is labeled during training. We propose the Timestamp Query Transformer (TQT) that treats timestamps as learnable class query tokens. While existing approaches rely on iterative, multi-step generation of framewise pseudo-labels, TQT directly predicts temporal segmentation masks by leveraging query-feature cross-attention. This design enables fully end-to-end learning and maximizes the utility of sparse labels from the entire training dataset, rather than relying on only a few local timestamps within each training video as in prior work. Experiments on the GTEA, 50Salads, and Breakfast datasets demonstrate that TQT outperforms SOTA methods by up to 5.8% in accuracy and 7.7% in F1@50. The model and code will be released.

71

QC-SF: Improving Computer Vision for Airborne LiDAR Point Clouds of Boreal Forests with Quebec Simulated Forest Dataset

Olivier Stocker ⋅ Reza Mahmoudi Kouhi ⋅ Omid Reisi Gahrouei ⋅ Thierry Badard ⋅ Eric Guilbert

Boreal forest ecosystems are under immense pressure, and while airborne LiDAR has emerged as a powerful monitoring tool, leveraging its large data volumes requires automated analysis. Deep learning methods offer a solution but are hindered by the scarcity of large-scale, labeled datasets in forestry, in contrast to data-rich urban environments. To address this gap, we introduce the Quebec Boreal Sim (QC-BS), a large-scale, synthetic airborne LiDAR dataset fully labeled for semantic segmentation. QC-BS contains 60,000 forest plots, each composed of a controlled mixture of two dominant species in Quebec's boreal forest: Black Spruce and Balsam Fir. Using this benchmark, we evaluate the performance of four state-of-the-art point cloud networks: KPConv, MinkUNet, DGCNN, and Point Transformer V3. Our results identify Point Transformer V3 as the most effective architecture, achieving 91.66% mIoU. Furthermore, we validate the sim-to-real transferability of our dataset, demonstrating that augmenting a small number of real-world scans with our synthetic data improves segmentation performance by 6% in mIoU score. [Our dataset will be made publicly available upon acceptance].

72

RemEdit: Efficient Diffusion Editing with Riemannian Geometry

Eashan Adhikarla ⋅ Brian Davison

Controllable image generation is fundamental to the success of modern generative AI, yet it faces a critical trade-off between semantic fidelity and inference speed. Existing methods compromise between geometric precision and inference speed; RemEdit, a diffusion-based framework, addresses this trade-off with two synergistic innovations. First, for editing fidelity, we navigate the latent space as a Riemannian manifold. A Mamba-based module efficiently learns the manifold's structure via Christoffel symbols, enabling direct and accurate geodesic path computation for smooth semantic edits. This control is further refined by a dual-SLERP blending technique and a goal-aware prompt enrichment pass from a Vision-Language Model. Second, for additional acceleration, we introduce a novel task-specific attention pruning mechanism. A lightweight pruning head learns to identify and retain only tokens essential to the edit, enabling effective optimization without the semantic degradation common in content-agnostic approaches. RemEdit surpasses prior SOTA editing frameworks while maintaining real-time performance under 50% pruning. Consequently, RemEdit establishes a new benchmark for practical and powerful image editing. RemEdit source code will be released upon publication.
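
For readers unfamiliar with SLERP, the sketch below shows ordinary spherical linear interpolation between two vectors. It is only the textbook building block: how RemEdit composes it into a "dual-SLERP" blend is not described in the abstract, and the function name `slerp` and the `eps` fallback threshold are assumptions made for illustration.

```python
import math

def slerp(v0, v1, t, eps=1e-7):
    """Textbook spherical linear interpolation between vectors v0 and v1.

    Illustrative only: this is plain SLERP, not the paper's dual-SLERP blend.
    """
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    # Cosine of the angle between the two directions, clamped for safety.
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)

    if abs(sin_omega) < eps:
        # Nearly (anti)parallel directions: fall back to linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]

    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]
```

For unit-norm inputs the interpolated point stays on the unit sphere, which is the usual reason SLERP is preferred over linear blending for latent codes.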

73

GrowTAS: Progressive Expansion from Small to Large Subnets for Efficient ViT Architecture Search

Hyunju Lee ⋅ Youngmin Oh ⋅ Jeimin Jeon ⋅ Donghyeon Baek ⋅ Bumsub Ham

Transformer architecture search (TAS) aims to automatically discover efficient vision transformers (ViTs), reducing the need for manual design. Existing TAS methods typically train an over-parameterized network (i.e., a supernet) that encompasses all candidate architectures (i.e., subnets). However, all subnets share the same set of weights, which leads to interference that severely degrades the smaller subnets. We have found that well-trained small subnets can serve as a good foundation for training larger ones. Motivated by this, we propose a progressive training framework, dubbed GrowTAS, that begins by training small subnets and gradually incorporates larger ones. This reduces the interference and stabilizes the training process. We also introduce GrowTAS+, which fine-tunes only a subset of weights to further enhance the performance of large subnets. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate the effectiveness of our approach over current TAS methods.

74

Subspace-Guided Knowledge Distillation for Efficient Model Transfer

Zeeshan Hayder ⋅ Ali Cheraghian ⋅ Lars Petersson ⋅ Mehrtash Harandi

Compact models can be effectively trained via Knowledge Distillation (KD), where a lightweight student model learns to replicate the behavior of a larger, high-performing teacher. A persistent challenge in KD lies in the misalignment between the representational spaces of teacher and student networks, especially when they differ in architecture or capacity. To address this, we propose Subspace-Driven Knowledge Distillation (SDMD), a novel framework that mitigates representational disparity by projecting features into an indefinite inner product space. This relaxation from traditional Hilbert spaces enables more flexible geometric alignment, capturing transformations such as rotations and reflections that are often necessary for accurate knowledge transfer. By learning a subspace that bridges the semantic gap between teacher and student, SDMD facilitates more effective distillation without increasing model complexity. We validate SDMD through extensive experiments on large-scale image classification (ImageNet-1K) and object detection (COCO), where it consistently outperforms existing distillation methods. Notably, SDMD-trained models not only achieve state-of-the-art results in distilled settings but also surpass the performance of equivalent models trained from scratch, highlighting the strength of our subspace-based alignment strategy.

75

TimeRefine: Temporal Grounding with Time Refining Video LLM

Xizi Wang ⋅ Feng Cheng ⋅ Ziyang Wang ⋅ Huiyu Wang ⋅ Md Mohaiminul Islam ⋅ Lorenzo Torresani ⋅ Mohit Bansal ⋅ Gedas Bertasius ⋅ David Crandall

Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt. Recent work has focused on enabling Video LLMs to perform video temporal grounding via next-token prediction of temporal timestamps. However, accurately localizing timestamps in videos remains challenging for Video LLMs when relying solely on temporal token prediction. Our proposed TimeRefine addresses this challenge in two ways. First, instead of directly predicting the start and end timestamps, we reformulate the temporal grounding task as a temporal refining task: the model first makes rough predictions and then refines them by predicting offsets to the target segment. This refining process is repeated multiple times, through which the model progressively self-improves its temporal localization accuracy. Second, to enhance the model’s temporal perception capabilities, we incorporate an auxiliary prediction head that penalizes the model more if a predicted segment deviates further from the ground truth, thus encouraging the model to make closer and more accurate predictions. Our plug-and-play method can be integrated into most LLM-based temporal grounding approaches. The experimental results demonstrate that TimeRefine achieves 3.6% and 5.0% mIoU improvements on the ActivityNet and Charades-STA datasets, respectively. Code and pretrained models will be available.

76

Curve Skeletonization in Continuous domain for Meshes and Point Clouds

Jai Bardhan ⋅ Ramya Hebbalaguppe ⋅ Aravind Udupa

Advancements in 3D curve skeletonization are accelerating progress across a wide range of applications. However, developing robust skeletonization algorithms that capture intricate object details remains challenging. Skeletonization via Local Separators (LS) offers an efficient graph-based approach but suffers from representation inaccuracies due to its discrete nature. To address this, we introduce CSCD, a novel framework for Curve Skeletonization in the Continuous Domain, generalizing LS to manifolds. Specifically, we present two realizations: CSCD-M for meshes and CSCD-PC for point clouds. CSCD-M leverages the intrinsic triangulation of a mesh for resilience to noise and improved topological preservation, while CSCD-PC employs tufted Laplacians for enhanced robustness. To our knowledge, CSCD-M is the first intrinsic method for curve skeletonization. Our results show CSCD-M matches LS performance across diverse meshes and outperforms LS (TOG'21) on benchmarks like the Thingi10k dataset. CSCD-PC qualitatively outperforms CoverageAxis++ (Eurographics'24) and EPCS (CAG'23). Finally, we demonstrate the efficacy of CSCD in a few downstream tasks: object classification, shape segmentation, and identifying handles, tunnels, and constrictions in objects. Website: https://cscd-skel.pages.dev

77

Gene-DML: Dual-Pathway Multi-Level Discrimination for Gene Expression Prediction from Histopathology Images

Yaxuan Song ⋅ Jianan Fan ⋅ Hang Chang ⋅ Weidong Cai

Accurately predicting gene expression from histopathology images offers a scalable and non-invasive approach to molecular profiling, with significant implications for precision medicine and computational pathology. However, existing methods often underutilize the cross-modal representation alignment between histopathology images and gene expression profiles across multiple representational levels, thereby limiting their prediction performance. To address this, we propose Gene-DML, a unified framework that structures latent space through Dual-pathway Multi-Level discrimination to enhance correspondence between morphological and transcriptional modalities. The multi-scale instance-level discrimination pathway aligns hierarchical histopathology representations extracted at local, neighbor, and global levels with gene expression profiles, capturing scale-aware morphological-transcriptional relationships. In parallel, the cross-level instance-group discrimination pathway enforces structural consistency between individual (image/gene) instances and modality-crossed (gene/image, respectively) groups, strengthening the alignment across modalities. By jointly modelling fine-grained and structural-level discrimination, Gene-DML is able to learn robust cross-modal representations, enhancing both predictive accuracy and generalization across diverse biological contexts. Extensive experiments on public spatial transcriptomics datasets demonstrate that Gene-DML achieves state-of-the-art performance in gene expression prediction. The code will be released upon publication.

78

Color Preserving CMOS-SPAD Fusion for Multi-Frame HDR

Aleksi Suonsivu ⋅ Lauri Salmela ⋅ Lassi Helin ⋅ Leevi Uosukainen ⋅ Giacomo Boracchi

High dynamic range (HDR) imaging aims to simultaneously capture a large range of illuminance levels, often observed in real-world scenes. An HDR image is conventionally captured on a CMOS sensor by combining multiple frames with different exposure values. Ultimately, the dynamic range is limited by the read noise of the CMOS at low light levels and by the full-well capacity at high light levels. Single-photon avalanche diode (SPAD) imagers offer strong HDR capabilities due to sensitivity from individual photons up to high photon fluxes. In this work, we demonstrate that fusing multi-frame raw-CMOS RGB and monochrome SPAD images can enhance the dynamic range and color accuracy in real-world settings. To fully leverage the SPAD data, we propose a Linearisation-Upsample processing block that linearises the intrinsically nonlinear response of SPAD-QIS and accounts for the low spatial resolution of current commercial SPAD technology. We demonstrate a clear advantage of including a SPAD image in a multi-frame HDR pipeline with comprehensive evaluation on HDR datasets and real-world data. We also address the lack of public SPAD datasets by providing two raw CMOS-SPAD datasets for multi-frame HDR. Both datasets are available for download here: https://tinyurl.com/rawcmosspadhdr

79

Unsupervised Segmentation by Diffusing, Walking and Cutting

Daniela Ivanova ⋅ Marco Aversa ⋅ Paul Henderson ⋅ John Williamson

We propose a zero-shot unsupervised image segmentation method by utilising self-attention activations extracted from Stable Diffusion. We demonstrate that self-attention can directly be interpreted as transition probabilities in a Markov random walk between image patches. This property enables us to modulate multi-hop relationships through matrix exponentiation, which captures k-step transitions between patches. We then construct a graph representation based on self-attention feature similarity and apply Normalised Cuts to cluster the patches. We quantitatively analyse the effects of incorporating multi-node paths when constructing the NCuts adjacency matrix, showing that higher-order transitions enhance hierarchical relationships in the proposed segmentations. Finally, we describe an approach to automatically determine the NCut threshold criterion, avoiding the need to manually tune it. Our approach surpasses all existing methods for zero-shot unsupervised segmentation based on pre-trained diffusion model features, achieving state-of-the-art results on COCO-Stuff-27, Cityscapes and ADE20K.
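
The random-walk reading of self-attention can be sketched as follows. This is a minimal illustration, assuming a square, row-stochastic attention matrix over patches; the toy matrix, the helper name `k_step_transitions`, and the use of NumPy are assumptions for this example and do not reproduce the authors' pipeline (graph construction and Normalised Cuts are omitted).

```python
import numpy as np

def k_step_transitions(attn, k):
    """Return the k-step transition matrix of the random walk defined by attn.

    attn: (N, N) non-negative matrix; row i holds attention from patch i
    to every patch. Rows are re-normalised so each is a probability
    distribution. Hypothetical helper, not the paper's code.
    """
    attn = np.asarray(attn, dtype=np.float64)
    transition = attn / attn.sum(axis=1, keepdims=True)
    # Raising the one-step matrix to the k-th power gives k-step probabilities.
    return np.linalg.matrix_power(transition, k)

# Toy attention map over 6 patches: softmax of random logits.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 6))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

P3 = k_step_transitions(attn, 3)  # probability of reaching patch j from i in 3 hops
```

Each row of the result is still a probability distribution, so larger `k` spreads a patch's affinity over patches that are only indirectly attended to.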

80

Learning spatio-temporal feature representations for video-based gaze estimation

Alexandre Personnic ⋅ Mihai Bace

Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames. However, since models must capture both spatial and temporal relationships, performance is limited by the feature representations not only within a frame but also across multiple frames. We propose Spatio-Temporal Gaze Network (ST-Gaze), a model that combines an existing CNN backbone with dedicated channel attention and self-attention modules to fuse eye and face features optimally. The fused features are then treated as a spatial sequence, allowing for the capture of an intra-frame context, which is then propagated through time to model inter-frame dynamics. We evaluated our method on the EVE dataset and show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation. Additionally, our ablation study provides further insights into the model performance, showing that preserving and modelling intra-frame spatial context with our spatio-temporal recurrence is fundamentally superior to premature spatial pooling. As such, our results pave the way towards robust video-based gaze estimation.

81

ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models

Danae Sanchez Villegas ⋅ Ingo Ziegler ⋅ Desmond Elliott

Reasoning over sequences of images remains a challenge for multimodal large language models (MLLMs). While recent models incorporate multi-image data during pre-training, they still struggle to recognize sequential structures, often treating images independently. This work introduces ImageChain, a framework that enhances MLLMs with sequential reasoning capabilities over image data by modeling visual sequences as a multi-turn conversation. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Our method optimizes for the task of next-scene description, where the model generates a context-aware description of an upcoming scene based on preceding visual and textual cues. We demonstrate that our approach improves performance on the next-scene description task, achieving an average improvement from 3.7% to 19% in SimRate, a metric that quantifies semantic similarity to human-annotated ground truths. Moreover, ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics. Extensive experiments validate that instruction-tuning in a multimodal, multi-turn conversation design is key to bridging the gap between static image understanding and temporally-aware reasoning.

82

Edge-Aware Image Manipulation via Diffusion Models with a Novel Structure-Preservation Loss

Minsu Gong ⋅ Nuri Ryu ⋅ Jungseul Ok ⋅ Sunghyun Cho

Recent advances in image editing leverage latent diffusion models (LDMs) for versatile, text-prompt-driven edits across diverse tasks. Yet, maintaining pixel-level edge structures, which is crucial for tasks such as photorealistic style transfer or image tone adjustment, remains a challenge for latent-diffusion-based editing. To overcome this limitation, we propose a novel Structure Preservation Loss (SPL) that leverages local linear models to quantify structural differences between input and edited images. Our training-free approach integrates SPL directly into the diffusion model's generative process to ensure structural fidelity. This core mechanism is complemented by a post-processing step to mitigate LDM decoding distortions, a masking strategy for precise edit localization, and a color preservation loss to preserve hues in unedited areas. Experiments confirm SPL enhances structural fidelity, delivering state-of-the-art performance in latent-diffusion-based image editing.

83

3D Superquadric Splatting

Daniel MacSwayne ⋅ Ales Leonardis ⋅ Jianbo Jiao

Gaussian Splatting has proven to be an effective algorithm for novel view synthesis and 3D reconstruction from multi-view images. However, the underlying volumetric primitive, the ellipsoidal Gaussian, has limited expressive capabilities, leading to difficulties in 3D modelling (especially geometry such as edges, corners, and high curvature). To address this limitation, in this paper, we introduce Superquadric Splats (SQS), an extended class of volumetric primitives, as a super-set of Gaussian splats, to model more detailed geometry. We treat superquadrics as volumetric distance functions rather than level-set surfaces. A non-trivial differentiable rendering pipeline is developed to support this. Extensive experimental analysis on multiple datasets validates the effectiveness of the proposed SQS approach, showing both enhanced visual and geometric performance compared to Gaussian-based splatting (with more than 1 dB in PSNR and prominent geometric improvement).

84

Controllable Long-term Motion Generation with Extended Joint Targets

Eunjong Lee ⋅ Eunhee Kim ⋅ Sanghoon Hong ⋅ Eunho Jung ⋅ Jihoon Kim

Generating stable and controllable character motion in real-time is a key challenge in computer animation. Existing methods often fail to provide fine-grained control or suffer from motion degradation over long sequences, limiting their use in interactive applications. We propose COMET, an autoregressive framework that runs in real time, enabling versatile character control and robust long-horizon synthesis. Our efficient Transformer-based conditional VAE allows for precise, interactive control over arbitrary user-specified joints for tasks like goal-reaching and in-betweening from a single model. To ensure long-term temporal stability, we introduce a novel reference-guided feedback mechanism that prevents error accumulation. This mechanism also serves as a plug-and-play stylization module, enabling real-time style transfer. Extensive evaluations demonstrate that COMET robustly generates high-quality motion at real-time speeds, significantly outperforming state-of-the-art approaches in complex sequential goal-reaching tasks and confirming its readiness for demanding interactive applications. Video results and code are available at https://comet-proj.github.io/.

85

ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos

Peiran Wu ⋅ Yunze Liu ⋅ Miao Liu ⋅ Junxiao Shen

Humans excel at spatial-temporal reasoning, effortlessly interpreting dynamic visual events from an egocentric viewpoint. However, whether multimodal large language models (MLLMs) can similarly understand the 4D world remains uncertain. This paper explores multimodal spatial-temporal reasoning from an egocentric perspective, aiming to equip MLLMs with human-like reasoning capabilities. To support this objective, we introduce Ego-ST Bench, a novel benchmark containing over 5,000 question-answer pairs across four categories, systematically evaluating spatial, temporal, and integrated spatial-temporal reasoning. Additionally, we propose the ST-R1 training paradigm, a video-based reasoning model that incorporates reverse thinking into its reinforcement learning process, significantly enhancing performance. We combine long-chain-of-thought (long-CoT) supervised fine-tuning with Group Relative Policy Optimization (GRPO) reinforcement learning, achieving notable improvements with limited high-quality data. Ego-ST Bench and ST-R1 provide valuable insights and resources for advancing video-based spatial-temporal reasoning research.

86

RPT-SR: Regional Prior attention Transformer for infrared image Super-Resolution

Youngwan Jin ⋅ Incheol Park ⋅ Yagiz Nalcakan ⋅ Hyeongjin Ju ⋅ Sang Yeo ⋅ Shiho Kim

General-purpose super-resolution models, particularly Vision Transformers, have achieved remarkable success but exhibit fundamental inefficiencies in common infrared imaging scenarios like surveillance and autonomous driving, which operate from fixed or nearly-static viewpoints. These models fail to exploit the strong, persistent spatial priors inherent in such scenes, leading to redundant learning and suboptimal performance. To address this, we propose the Regional Prior attention Transformer for infrared image Super-Resolution (RPT-SR), a novel architecture that explicitly encodes scene layout information into the attention mechanism. Our core contribution is a dual-token framework that fuses (1) learnable, regional prior tokens, which act as a persistent memory for the scene's global structure, with (2) local tokens that capture the frame-specific content of the current input. By feeding both kinds of tokens into the attention mechanism, our model allows the priors to dynamically modulate the local reconstruction process. Extensive experiments validate our approach. While most prior works focus on a single infrared band, we demonstrate the broad applicability and versatility of RPT-SR by establishing new state-of-the-art performance across diverse datasets covering both Long-Wave (LWIR) and Short-Wave (SWIR) spectra. Our code will be publicly available.

87

LASOR: Towards Clinically Transparent and Explainable Ophthalmic Report Generation via Lesion-Aware Segmentation

Jian Park ⋅ Hyunseon Won ⋅ JeeEun Kim ⋅ JOON HWANG ⋅ Jeong Han ⋅ Ji Park ⋅ Daniel Hwang ⋅ Jinyoung Han

Automated ophthalmic report generation aims to reduce the diagnostic burden on retinal specialists by producing clinically accurate and standardized descriptions from medical imaging. However, current research predominantly remains fundus-centric and rarely exploits OCT-derived spatial evidence, limiting clinical transparency by obscuring which anatomical regions drive diagnostic decisions. To address these limitations, we propose LASOR (Lesion-Aware Segmentation-Guided Ophthalmic Report Generation), which extracts multi-scale features to robustly capture both small focal abnormalities and broader anatomical structures, generating reliable segmentation masks as spatial priors for report generation. Specifically, we utilize a lesion-aware patch weighting module to emphasize abnormal regions and leverage a curated instruction dataset incorporating spatial mask information to enhance the diagnostic capabilities of the proposed model. In addition, we introduce a mask-guided cross-modal consistency loss that strengthens vision–language alignment between pathological regions and their diagnostic descriptions. Extensive experiments on a retinal OCT dataset that includes twenty pathological conditions demonstrate state-of-the-art performance, underscoring LASOR's potential to advance clinically transparent ophthalmic report generation systems.

88

Quantifying the Limits of Segmentation Foundation Models: Modeling Challenges in Segmenting Tree-Like and Low-Contrast Objects

Yixin Zhang ⋅ Nicholas Konz ⋅ Kevin Kramer ⋅ Maciej Mazurowski

Image segmentation foundation models (SFMs) like Segment Anything Model (SAM) have achieved impressive zero-shot and interactive segmentation across diverse domains. However, they struggle to segment objects with certain structures, particularly those with dense, tree-like morphology and low textural contrast from their surroundings. These failure modes are crucial for understanding the limitations of SFMs in real-world applications. To systematically study this issue, we introduce interpretable metrics quantifying object tree-likeness and textural separability. On carefully controlled synthetic experiments and real-world datasets, we show that SFM performance (e.g., SAM, SAM 2, HQ-SAM) noticeably correlates with these factors. We attribute these failures to SFMs misinterpreting local structure as global texture, resulting in over-segmentation or difficulty distinguishing objects from similar backgrounds. Notably, targeted fine-tuning fails to resolve this issue, indicating a fundamental limitation. Our study provides the first quantitative framework for modeling the behavior of SFMs on challenging structures, offering interpretable insights into their segmentation capabilities.

89

Lorentz Entailment Cone for Semantic Segmentation

Zahid Hasan ⋅ Masud Ahmed ⋅ Nirmalya Roy

Semantic segmentation in hyperbolic space can capture hierarchical structure in low dimensions with uncertainty quantification. Existing approaches choose the Poincaré ball model for hyperbolic geometry, which suffers from numerical instabilities as well as optimization and computational challenges. We propose a novel, tractable architecture-agnostic semantic segmentation framework in the hyperbolic Lorentz model. We employ text embeddings with semantic and visual cues to guide hierarchical pixel-level representations in Lorentz space. This enables stable and efficient optimization without requiring a Riemannian optimizer, and easily integrates with existing Euclidean architectures. Beyond segmentation, our approach yields free uncertainty estimation, confidence maps, boundary delineation, hierarchical and text-based retrieval, and zero-shot performance, reaching generalized flatter minima. We further introduce a novel uncertainty and confidence indicator in Lorentz cone embeddings. Extensive experiments on ADE20K, COCO-Stuff-164k, Pascal-VOC, and Cityscapes with state-of-the-art models (DeepLabV3 and SegFormer) validate the effectiveness and generality of our approach. Our results demonstrate the potential of hyperbolic Lorentz embeddings for robust and uncertainty-aware semantic segmentation, and we will release our code to foster further research.
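
As background on the Lorentz (hyperboloid) model mentioned above, the sketch below shows its standard ingredients for curvature -1: the Lorentzian inner product, a lift of a Euclidean vector onto the hyperboloid, and the geodesic distance. The helper names are assumptions for illustration, and the entailment-cone construction and the authors' segmentation framework are not reproduced here.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi (index 0 is time-like)."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift(v):
    """Lift a Euclidean vector v onto the unit hyperboloid <x, x>_L = -1.

    The time-like coordinate is determined by the spatial part, which is why
    an ordinary Euclidean network output can be mapped into the model.
    Hypothetical helper name, used only for this illustration.
    """
    return [math.sqrt(1.0 + sum(a * a for a in v))] + list(v)

def lorentz_distance(x, y):
    """Geodesic distance between two points on the unit hyperboloid."""
    # Clamp at 1 to guard against rounding pushing the argument below acosh's domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))
```

For example, the origin `lift([0.0, 0.0])` and the point `lift([math.sinh(1.0), 0.0])` lie at hyperbolic distance 1 from each other.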

90

WarpRF: Multi-View Consistency for Training-Free Uncertainty Quantification and Applications in Radiance Fields

Sadra Safadoust ⋅ Fabio Tosi ⋅ Fatma Güney ⋅ Matteo Poggi

We introduce WarpRF, a training-free general-purpose framework for quantifying the uncertainty of radiance fields. Built upon the assumption that photometric and geometric consistency should hold among images rendered by an accurate model, WarpRF quantifies its underlying uncertainty from an unseen point of view by leveraging backward warping across viewpoints, projecting reliable renderings to the unseen viewpoint and measuring the consistency with images rendered there. WarpRF is simple and inexpensive, does not require any training, and can be applied to any radiance field implementation for free. WarpRF excels at both uncertainty quantification and downstream tasks, e.g., active view selection and active mapping, outperforming any existing method tailored to specific frameworks.

91

GAEA: A Geolocation Aware Conversational Assistant

Ron Campos ⋅ Ashmal Vayani ⋅ Parth Parag Kulkarni ⋅ Rohit Gupta ⋅ Aizan Zafar ⋅ Aritra Dutta ⋅ Mubarak Shah

Image geolocalization, in which an AI model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot utilize the model to further their knowledge beyond the GPS coordinates; the model lacks an understanding of the location and the conversational ability to communicate with the user. Recently, with the tremendous progress of large multimodal models (LMMs), both proprietary and open-source, researchers have attempted to geolocalize images via LMMs. However, these issues remain unaddressed: beyond general tasks, LMMs struggle with more specialized downstream tasks such as geolocalization. In this work, we propose solving this problem by introducing a conversational model, GAEA, that provides information regarding the location of an image as the user requires. No large-scale dataset enabling the training of such a model exists. Thus, we propose GAEA-1.4M, a comprehensive dataset comprising over 800k images and approximately 1.4M question-answer pairs, constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose a diverse benchmark, GAEA-Bench, comprising 3.5k image-text pairs to evaluate conversational capabilities equipped with diverse question types. We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 18.2% and the best proprietary model, GPT-4o, by 7.2%.

92

WSSSP-Net: Weakly Supervised Semantic Segmentation Plugin Network for Face Anti-Spoofing

Krzysztof Galus ⋅ Piotr Syga ⋅ Piotr Kawa

Face anti-spoofing (FAS) is essential for protecting facial-biometric systems from presentation attacks. We propose WSSSP-Net, a Weakly Supervised Semantic Segmentation Plugin Network that integrates a lightweight, attention-based segmentation decoder at multiple depths of any CNN or transformer encoder. Serving only as an auxiliary training-time module, the decoder guides feature learning without increasing inference runtime. Pixel-wise spoof masks are automatically generated via a face-parsing pipeline, removing the need for manual annotations and enabling multiscale spoof-aware feature refinement. In leave-one-out evaluations on leading FAS benchmarks, WSSSP-Net reduces HTER by up to 24.9% and increases AUC by up to 3.2% over state-of-the-art methods. In out-of-distribution tests on a separate dataset, it lowers HTER by up to 18.4%. Across attack classes, it reduces average APCER by up to 12.9% and BPCER by up to 12.4%, achieving all improvements without added inference cost.
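
For readers less familiar with the metrics quoted above, here is a minimal sketch of how HTER is conventionally obtained as the mean of APCER and BPCER at a fixed decision threshold. The function name, the label convention (1 = bona fide, 0 = attack), the threshold, and the toy scores are illustrative assumptions and are not taken from the paper; pooling all attack types into a single class is also a simplification.

```python
def fas_error_rates(scores, labels, threshold=0.5):
    """Return (APCER, BPCER, HTER) for a list of liveness scores.

    Assumed convention: label 1 = bona fide, label 0 = attack, and a
    score >= threshold is classified as bona fide.
    """
    attacks = [s for s, y in zip(scores, labels) if y == 0]
    bona_fide = [s for s, y in zip(scores, labels) if y == 1]
    if not attacks or not bona_fide:
        raise ValueError("need at least one attack and one bona fide sample")
    # APCER: fraction of attack samples wrongly accepted as bona fide.
    apcer = sum(s >= threshold for s in attacks) / len(attacks)
    # BPCER: fraction of bona fide samples wrongly rejected as attacks.
    bpcer = sum(s < threshold for s in bona_fide) / len(bona_fide)
    # HTER: half total error rate, the mean of the two error rates.
    hter = (apcer + bpcer) / 2
    return apcer, bpcer, hter


# Toy data (hypothetical): four bona fide samples followed by four attacks.
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1, 0.6, 0.3]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
apcer, bpcer, hter = fas_error_rates(scores, labels)
```

Because HTER averages the two error types, a relative reduction in HTER can come from either fewer accepted attacks, fewer rejected genuine faces, or both.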

93

CONCORD: Concept-Informed Diffusion for Dataset Distillation

Jianyang Gu ⋅ Haonan Wang ⋅ Ruoxi Jia ⋅ Saeed Vahidian ⋅ Vyacheslav Kungurtsev ⋅ Wei Jiang ⋅ Yiran Chen

Dataset distillation (DD) has witnessed significant progress in creating small datasets that encapsulate rich information from large original ones. Particularly, methods based on generative priors show promising performance while maintaining computational efficiency and cross-architecture generalization. However, the generation process lacks explicit controllability for each sample. Previous distillation methods primarily match the real distribution from the perspective of the entire dataset, while overlooking concept completeness at the instance level. Missing or incorrectly represented object details cannot be efficiently compensated for, given the small sample budgets typical of DD settings. To this end, we propose incorporating the concept understanding of large language models (LLMs) to perform Concept-Informed Diffusion (Concord) for dataset distillation. Specifically, distinguishable and fine-grained concepts are retrieved based on category labels to inform the denoising process and refine essential object details. By integrating these concepts, the proposed method significantly enhances both the controllability and interpretability of the distilled image generation, without relying on pre-trained classifiers. We demonstrate the efficacy of Concord by achieving state-of-the-art performance on ImageNet-1K and its subsets. The code implementation is in the supplementary material.

94

Improving Out-of-Distribution Detection Using Segmented Images and Cross-View Attention Fusion

Alexander Politowicz ⋅ Sahisnu Mazumder ⋅ Bing Liu

Although out-of-distribution (OOD) detection has been extensively studied, it continues to face challenges in handling OOD data semantically similar to In-Distribution (ID) data. Part of the difficulty arises from the model's inability to learn superior ID class discriminative features. We propose to improve this by segmenting input images into foreground and background views and combining them with the original input image (original view) in a multi-view learning approach. We present a novel method, called CASOD (Cross-view Attention of Segmented views for OOD Detection), that learns better discriminative information from all three views and subsequently fuses them through a novel stacked cross-view attention mechanism to produce the final predictive feature representation. A feature-based method is then applied to the final fused feature for OOD detection, giving major improvements over a range of strong baselines on various near- and far-OOD datasets. CASOD achieves state-of-the-art performance in various experimental settings with challenging ID and OOD datasets. The CASOD codebase is submitted in the supplementary materials.

95

An improved architecture for part-based animal re-identification through semantic segmentation distillation

Eugênio Dias Ribeiro Neto ⋅ Marc Chaumont ⋅ Gérard Subsol ⋅ Michel Garine-Wichatitsky ⋅ Hélène Guis

Wildlife re-identification (Re-ID) is critical for non-invasive monitoring. Yet, animal Re-ID performance remains far behind that of person Re-ID due to limited datasets and greater fine-grained appearance variability between individuals. One strategy is to adopt part-based methods in order to more precisely attend to distinct anatomical regions. To adapt to animal Re-ID, we propose PAW-ViT (Part-AWare animal re-identification Vision Transformer), a ViT that replaces the standard classification token with $K$ learnable part tokens, each specialized to a specific anatomical region of the animal. Spatial specialization is achieved via feature-based knowledge distillation by training each token’s attention to image patches to produce a semantic segmentation mask. An additional aggregation token fuses the part embeddings into a single part-aware descriptor. Trained with a multi-task loss, PAW-ViT outperforms state-of-the-art methods in animal Re-ID on ATRW (Amur tigers) and YakREID-103 (yaks), particularly in scenarios of strong viewpoint variations like the cross-camera setting.

96

FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks

Jinwei Li ⋅ Huan-ang Gao ⋅ Wenyi Li ⋅ Haohan Chi ⋅ Chenyu Liu ⋅ Chenxi Du ⋅ Yiqian Liu ⋅ Mingju Gao ⋅ Guiyu Zhang ⋅ Zongzheng Zhang ⋅ Li Yi ⋅ Yao Yao ⋅ Jingwei Zhao ⋅ Hongyang Li ⋅ Yikai Wang ⋅ Hao Zhao

With the rapid advancements in diffusion models and 3D generation techniques, dynamic 3D content generation has become a crucial research area. However, achieving high-fidelity 4D (dynamic 3D) generation with strong spatial-temporal consistency remains a challenging task. Inspired by recent findings that pretrained diffusion features capture rich correspondences, we propose FB-4D, a novel 4D generation framework that integrates a Feature Bank mechanism to enhance both spatial and temporal consistency in generated frames. In FB-4D, we store features extracted from previous frames and fuse them into the process of generating subsequent frames, ensuring consistent characteristics across both time and multiple views. To ensure a compact representation, the Feature Bank is updated by a proposed dynamic merging mechanism. Leveraging this Feature Bank, we demonstrate for the first time that generating additional reference sequences through multiple autoregressive iterations can continuously improve generation performance. Experimental results show that FB-4D significantly outperforms existing methods in terms of rendering quality, spatial-temporal consistency, and robustness. It surpasses all multi-view generation tuning-free approaches by a large margin and achieves performance on par with training-based methods. Our code and data will be publicly available to support future research.

97

Hestia: Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction

Cheng-You Lu ⋅ Zhuoli Zhuang ⋅ Nguyen Le ⋅ da xiao ⋅ Yu-Cheng Chang ⋅ Thomas Do ⋅ Srinath Sridhar ⋅ Chin-teng Lin

Advances in 3D reconstruction and novel view synthesis have enabled efficient and photorealistic rendering. However, image acquisition for reconstruction is still either largely manual or constrained by simple preplanned trajectories. To address this issue, recent works propose generalizable next-best-view planners that do not require online learning. Nevertheless, robustness and performance remain limited across various shapes. Hence, this study introduces Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction (Hestia), which addresses the shortcomings of the reinforcement learning-based generalizable approaches for five-degree-of-freedom viewpoint prediction. Hestia systematically improves the planners through four components: a more diverse dataset to promote robustness, a hierarchical structure to manage the high-dimensional continuous action search space, a close-greedy strategy to mitigate spurious correlations, and a face-aware design to avoid overlooking geometry. Experimental results show that Hestia achieves non-marginal improvements, with at least a 4% gain in coverage ratio, while reducing Chamfer Distance by 50% and maintaining real-time inference. In addition, Hestia outperforms prior methods by at least 12% in coverage ratio with a 5-image budget and remains robust to object placement variations. Finally, we demonstrate that Hestia, as a next-best-view planner, is feasible for real-world applications. The code and data processing script for Hestia are provided in the supplementary materials and will be released after publication.

98

R3: Reconstruction, Raw, and Rain: Deraining Directly in the Bayer Domain

Nate Rothschild ⋅ Moshe Kimhi ⋅ Avi Mendelson ⋅ Chaim Baskin

Image reconstruction from corrupted (rain) images is crucial across many domains. Most deraining networks are trained on post‑ISP RGB images, even though the image‑signal‑processing pipeline irreversibly mixes colors, clips dynamic range and blurs fine detail. This paper indicates that these losses are avoidable and shows that learning directly on raw Bayer mosaics yields superior reconstructions from a single camera. To substantiate the claim we (i) curate Raw‑Rain, the first public benchmark of real rainy scenes captured in both 12‑bit Bayer and bit‑depth‑matched sRGB, (ii) design a lightweight U‑Net that ingests the single‑channel Bayer tensor, and (iii) introduce Information Conservation Score (ICS), a color‑invariant metric that aligns more closely with human opinion than PSNR or SSIM. On the test split our raw‑domain model improves on RGB results by up to +0.99 dB PSNR and +1.2\% ICS, while running faster with half of the GFLOPs. The results advocate an \emph{ISP‑last} paradigm for low‑level vision and open the door to end‑to‑end learnable camera pipelines.

99

An Efficient Multi-Rater Setup Towards Personalized and Diversified Medical Image Segmentation

Sajed Almorsy ⋅ Ayman Khalafallah ⋅ Marwan Torki

Multi-rater medical image segmentation addresses annotation ambiguities but typically requires costly multiple expert annotations per scan. We propose P-Diverse, a novel two-stage framework that minimizes the annotation needs while achieving state-of-the-art performance. Stage-I trains a modified nnU-Net with expert-specific embeddings throughout the network stages, generating personalized segmentations using as few as one annotation per scan. Stage-II freezes that network to synthesize the missing annotations and trains a diversification model that captures multi-rater variability using the available and synthesized annotations. On the public NPC dataset and the QUBIQ2021 dataset (where the current SOTA method fails), P-Diverse establishes new SOTA performance using synthetic annotations in the diversification stage, significantly reducing the clinical annotation burden. Code: https://github.com/XXX/XXX.

100

HiMix: Hierarchical Visual-Textual Mixing Network for Lesion Segmentation

Soojin Hwang ⋅ Jaeyoon Sim ⋅ Won Hwa Kim

Lesion segmentation is an essential task in medical imaging to support diagnosis and assessment of pulmonary diseases. While deep learning models have shown success in various domains, their reliance on large-scale annotated datasets limits applicability in the medical domain due to labeling cost. To address this issue, recent studies in medical image segmentation have utilized clinical texts as complementary semantic cues without additional annotations. However, most existing methods utilize a single textual embedding and fail to capture hierarchical interactions between language and visual features, which limits their ability to leverage fine-grained cues essential for precise and detailed segmentation. To this end, we propose Hierarchical Visual-Textual Mixing Network (HiMix), a novel multi-modal segmentation framework that mixes multi-scale image and text representations throughout the mask decoding process. HiMix progressively injects hierarchical text embeddings, from high-level semantics to fine-grained spatial details, into corresponding image decoder layers to bridge the modality gap and enhance visual feature refinement at multiple levels of abstraction. Experiments on the QaTa-COV19 and MosMedData+ datasets demonstrate that HiMix consistently outperforms uni-modal and multi-modal methods. Furthermore, HiMix exhibits strong generalization to unstructured textual formats, highlighting its practical applicability in real-world clinical scenarios.

101

FSP-DETR: Few-Shot Prototypical Parasitic Ova Detection

Shubham Trehan ⋅ Udhav Ramachandran ⋅ Akash Rao ⋅ Ruth Scimeca ⋅ Sathya Aakur

Object detection in biomedical settings is fundamentally constrained by the scarcity of labeled data and the frequent emergence of novel or rare categories. We present FSP-DETR, a unified detection framework that enables robust few-shot detection, open-set recognition, and generalization to unseen biomedical tasks within a single model. Built upon a class-agnostic DETR backbone, our approach constructs class prototypes from original support images and learns an embedding space using augmented views and a lightweight transformer decoder. Training jointly optimizes a prototype matching loss, an alignment-based separation loss, and a KL divergence regularization to improve discriminative feature learning and calibration under scarce supervision. Unlike prior work that tackles these tasks in isolation, FSP-DETR enables inference-time flexibility to support unseen class recognition, background rejection, and cross-task adaptation without retraining. We also introduce a new ova species detection benchmark with 20 parasite classes and establish standardized evaluation protocols. Extensive experiments across ova, blood cell, and malaria detection tasks demonstrate that FSP-DETR significantly outperforms prior few-shot and prototype-based detectors, especially in low-shot and open-set scenarios.

102

DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions

Yifan Zhou ⋅ Takehiko Ohkawa ⋅ Guwenxiao Zhou ⋅ Kanoko Goto ⋅ Takumi Hirose ⋅ Yusuke Sekikawa ⋅ Nakamasa Inoue

Reconstructing daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). To handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, or the scene). However, most current HPE methods still rely on ResNet for feature extraction, and the inductive bias of such CNNs may not be optimal for 3D HPE due to their limited capability to model the global context. To address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE using recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba's selective state modeling and the proposed deformable state scanning. Specifically, for local features after convolution, our deformable scanning aggregates these features within an image while selectively preserving useful cues that represent the global context. This approach significantly improves the accuracy of structured prediction tasks like 3D HPE, with improved inference speed over ResNet50. Our experiments involve extensive evaluations on five datasets that cover diverse scenarios, including single-hand and two-hands estimation, hand-only and hand-object interactions, as well as RGB and depth modalities. We demonstrate that DF-Mamba outperforms the latest image backbones, including VMamba and Spatial Mamba, on all datasets and achieves state-of-the-art performance.

103

Context-Preserving Dermoscopic Editing: Mask-Guided Lesion-Aware Diffusion for Attribute Modification

Tao Sun ⋅ Yun Jiang ⋅ Yarong Jin ⋅ Huanting Guo ⋅ Zequn Zhang

Controllable dermoscopic image editing has the potential to enable clinically meaningful data augmentation and to support diagnostic decision-making. However, existing diffusion-based approaches are not tailored to the unique constraints of lesion-level attribute modification. Meanwhile, generic editing methods commonly produce global changes or fail to preserve surrounding tissue context, risking alteration of diagnostic cues. To address these shortcomings, we propose CPDE, a context-preserving dermoscopic editing framework that utilizes mask-guided lesion-aware diffusion for precise attribute modification. CPDE employs a three-stage denoising pipeline with a dual-branch design that separates lesion editing from background reconstruction. The framework incorporates a Spatial-channel Transformer that predicts semantic residuals in $h$-space via sequential spatial–channel attention. Additionally, a lesion-aware mask-guided training strategy enforces semantic directionality while restricting optimization to pathology regions. Extensive experiments on dermoscopic benchmarks demonstrate that CPDE produces spatially localized, clinically coherent edits while preserving diagnostic context and background fidelity. Our method achieves superior performance with FID of 0.274, $S_{dir}$ of 0.486, and NS-LPIPS of 0.012, outperforming existing generative editing approaches.

104

How I Met Your Bias: Investigating Bias Amplification in Diffusion Models

Nathan Roos ⋅ Ekaterina Iakovleva ⋅ Ani Gjergji ⋅ Vito Paolo Pastore ⋅ Enzo Tartaglione

Diffusion-based generative models demonstrate state-of-the-art performance across various image synthesis tasks, yet their tendency to replicate and amplify dataset biases remains poorly understood. Although previous research has viewed bias amplification as an inherent characteristic of diffusion models, this work provides the first analysis of how sampling algorithms and their hyperparameters influence bias amplification. We empirically demonstrate that samplers for diffusion models -- commonly optimized for sample quality and speed -- have a significant and measurable effect on bias amplification. Through controlled studies with models trained on Biased MNIST and BFFHQ, and with Stable Diffusion, we show that sampling hyperparameters can induce both bias reduction and amplification, even when the trained model is fixed. We commit to releasing the code open-source upon acceptance of the article.

105

BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries

Tianle Li ⋅ Yongming Rao ⋅ Winston Hu ⋅ Yu Cheng

Encoder-free multimodal large language models (MLLMs) eliminate the need for a well-trained vision encoder by directly processing image tokens before the language model. While this approach reduces computational overhead and model complexity, it often requires large amounts of training data to effectively capture the visual knowledge typically encoded by vision models like CLIP. The absence of a vision encoder implies that the model is likely to rely on substantial data to learn the necessary visual-semantic alignments. In this work, we present BREEN, a data-efficient encoder-free multimodal architecture that mitigates this issue. BREEN leverages a learnable query and image experts to achieve comparable performance with significantly less training data. The learnable query, positioned between image and text tokens, is supervised by the output of a pretrained CLIP model to distill visual knowledge, bridging the gap between visual and textual modalities. Additionally, the image expert processes image tokens and learnable queries independently, improving efficiency and reducing interference with the LLM’s textual capabilities. BREEN achieves comparable performance to prior encoder-free state-of-the-art models like Mono-InternVL, using only 13 million text-image pairs in training—about one percent of the data required by existing methods. Our work highlights a promising direction for data-efficient encoder-free multimodal learning, offering an alternative to traditional encoder-based approaches.

106

ProtoGMVAE: A Variational Auto-Encoder with True Gaussian Mixture Prior for Prototypical-based Self-Explainability

Martin Blanchard ⋅ Christophe Ducottet ⋅ Damien Muselet ⋅ Olivier Delézay

Recently, significant efforts were made towards Variational Autoencoder (VAE)-based prototypical Self Explainable Models (SEM) for image classification. The principle is to learn class-specific prototypes that can be projected back into the image space thanks to the decoding branch of a VAE. However, existing VAE-based SEM fail to represent properly the distribution of training samples in the embedding space, requiring to define specific additional constraints as diversity or orthogonality. In this work, we propose to define the prototypes as the components of a Gaussian Mixture VAE (GM-VAE) that is an approximation of the distribution of training samples. We show that this definition allows to produce relevant and diverse prototypes providing a probabilistic explanation of the model without assigning prototypes to a specific class. We support our definition with extensive experimentation and comparison with previous self-explainable approaches.
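
To illustrate what a probabilistic explanation over mixture components can look like, the sketch below computes the posterior responsibility of each Gaussian component for a given latent value using Bayes' rule. It is a generic textbook computation on a one-dimensional latent, not the paper's model: the function names, the 1-D simplification, and the toy mixture parameters are all assumptions made for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def responsibilities(x, weights, means, stds):
    """Posterior probability of each mixture component given x.

    p(k | x) = w_k N(x; mu_k, sigma_k) / sum_j w_j N(x; mu_j, sigma_j)
    """
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-prototype mixture with equal weights and unit variance.
weights, means, stds = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
resp_mid = responsibilities(0.0, weights, means, stds)    # equidistant point
resp_right = responsibilities(1.0, weights, means, stds)  # sits on prototype 2
```

A sample halfway between the two prototypes is attributed equally to both, whereas a sample located at the second prototype is attributed mostly to it; the responsibilities always sum to one.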

107

AEON: Adaptive Embedding Optimized Noise for Robust Watermarking in Diffusion Models

Muhammad Muneer ⋅ Simon Woo

The widespread use of synthetic image generation models and the challenges associated with authenticity preservation have fueled the demand for robust watermarking methods to safeguard authenticity and protect the copyright of synthetic images. Existing watermarking methods that embed invisible signatures in synthetic images often compromise image quality and remain susceptible to multiple watermark removal attacks, including reconstruction and forgery methods. To overcome this issue, we propose a novel watermarking approach, AEON, which seamlessly integrates the watermark into the latent diffusion process and ensures the watermark aligns with scene semantics in the final image. Unlike existing invisible in-diffusion watermarking and traditional hash-based methods, our approach adapts the neural synthesized hash-based watermark to the semantics of the generated image during the intermediate diffusion process instead of embedding traditional hashes with the initial noise. This facilitates visual coherence in the generated image while enhancing adversarial robustness and resilience against single or multiple adversarial and traditional watermark removal attacks. Our proposed approach a) modulates the noise sampling in each diffusion denoising iteration through a learnable watermark embedding, b) optimizes consistency, reconstruction, and similarity loss, enforcing local and global alignment between the watermark structure and the underlying image content, and c) generates a strong watermark by allowing late embedding of the watermark in the diffusion process. Empirical results demonstrate the effectiveness of the proposed approach in retaining quality and its robustness against cumulative adversarial attacks. For the review process, our anonymous version of the code is available at https://anonymous.4open.science/r/aeon-144C/.

108

Revisiting Vision–Language Foundations for No-Reference Image Quality Assessment

ANKIT YADAV ⋅ Ta Duc Huy ⋅ Lingqiao Liu

Large-scale vision–language pre-training has recently shown promise for no-reference image-quality assessment (NR-IQA), yet the relative merits of modern Vision Transformer foundations remain poorly understood. In this work, we present the first systematic evaluation of six prominent pretrained vision-language backbones (CLIP, SigLIP2, DINOv2, DINOv3, Perception, and ResNet) for NR-IQA, each finetuned using an identical lightweight MLP head. Our study uncovers two previously overlooked factors: (1) SigLIP2 consistently achieves strong performance; and (2) the choice of activation function plays a surprisingly crucial role, particularly for enhancing the generalization ability of image quality assessment models. Notably, we find that simple sigmoid activations outperform commonly used ReLU and GELU on several benchmarks. Motivated by this finding, we introduce a learnable activation selection mechanism that adaptively determines the nonlinearity for each channel, eliminating the need for manual activation design and achieving new state-of-the-art SRCC on CLIVE, KADID, and AGIQA-3K. Extensive ablations confirm the benefits across architectures and regimes, establishing strong, resource-efficient NR-IQA baselines.

109

Semi-supervised Domain Adaptation via Mutual Alignment through Joint Error

Dexuan Zhang ⋅ Thomas Westfechtel ⋅ Tatsuya Harada

Most existing methods for unsupervised domain adaptation focus on learning domain-invariant representations. However, recent works have shown that the generalization on the target domain can fail due to the trade-off between marginal distribution alignment and joint error under a large domain shift. A few labeled target data points can enhance adaptation quality, but the distribution shift between labeled and unlabeled target data is often overlooked. Therefore, we propose a novel learning theory to address the joint error in semi-supervised domain adaptation that can reduce the mutual distribution shift between pairs from labeled and unlabeled domains. Furthermore, we introduce a discrepancy measurement between hypotheses to tackle the inconsistency of the loss functions in the algorithm and theory. Extensive experiments demonstrate that our method consistently outperforms baseline approaches, particularly in scenarios with large domain shifts and scarce labeled target data.

110

Unified Control for Inference-Time Guidance of Denoising Diffusion Models

Maurya Goyal ⋅ Anuj Singh ⋅ Hadi Rad

Aligning diffusion model outputs with downstream objectives is essential for improving task-specific performance. Broadly, inference-time training-free approaches for aligning diffusion models can be categorized into two main strategies: sampling-based methods, which explore multiple candidate outputs and select those with higher reward signals, and gradient-guided methods, which use differentiable reward approximations to directly steer the generation process. In this work, we propose a universal algorithm, UniCoDe, which brings together the strengths of sampling and gradient-based guidance into a unified framework. UniCoDe integrates local gradient signals during sampling, thereby addressing the sampling inefficiency inherent in complex reward-based sampling approaches. Concurrently, it overcomes the limited applicability of traditional gradient-guided methods, which often struggle with non-differentiable rewards. By cohesively combining these two paradigms, UniCoDe enables more efficient sampling while offering better trade-offs between reward alignment and divergence from the diffusion unconditional prior. Empirical results demonstrate that UniCoDe remains competitive with state-of-the-art baselines across a range of tasks.

111

Learning Subglacial Bed Topography from Sparse Radar with Physics-Guided Residuals

Bayu Tama ⋅ Jianwu Wang ⋅ Vandana Janeja ⋅ Mostafa Cham

Accurate subglacial bed topography is essential for ice-sheet modeling, yet radar observations are sparse and uneven. We propose a physics-guided residual learning framework that predicts bed \emph{thickness residuals} over a BedMachine prior and reconstructs bed from the observed surface. A DeepLabV3$+$ decoder over a standard encoder (e.g., ResNet-50) is trained with lightweight physics and data terms: multi-scale mass conservation, flow-aligned total variation, Laplacian damping, non-negativity of thickness, a ramped prior-consistency term, and a masked Huber fit to radar picks modulated by a confidence map. To measure real-world generalization, we adopt leakage-safe \emph{block-wise} hold-outs (vertical/horizontal) with safety buffers and report metrics only on held-out cores. Across two Greenland sub-regions, our approach achieves strong test-core accuracy (RMSE $3.05$–$10.54$\,m; $R^2=0.993$–$0.999$) and high structural fidelity (SSIM $\ge 0.998$, PSNR up to $52.9$\,dB), outperforming U-Net, Attention U-Net, FPN, and a plain CNN. The residual-over-prior design, combined with physics, yields spatially coherent, physically plausible beds suitable for operational mapping under domain shift.
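
As a rough illustration of the data term mentioned above, the sketch below evaluates a Huber loss only at pixels that have a radar pick and weights each residual by a per-pixel confidence. The exact form used in the paper is not given in the abstract, so the function names, the normalization by the summed confidence, the value of `delta`, and the toy arrays are all assumptions.

```python
def huber(residual, delta=1.0):
    """Huber penalty: quadratic near zero, linear for large residuals."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def masked_huber(pred, target, mask, confidence, delta=1.0):
    """Confidence-weighted Huber loss over observed (masked-in) pixels only.

    Assumed form: sum_i m_i c_i huber(p_i - t_i) / sum_i m_i c_i.
    Returns 0.0 when no observed pixel carries positive confidence.
    """
    num = 0.0
    den = 0.0
    for p, t, m, c in zip(pred, target, mask, confidence):
        if m:  # skip pixels without a radar pick
            num += c * huber(p - t, delta)
            den += c
    return num / den if den > 0 else 0.0


# Hypothetical flattened tile: only pixels 0 and 2 have radar picks.
pred = [10.0, 20.0, 30.0, 40.0]
target = [10.5, 0.0, 33.0, 0.0]
mask = [1, 0, 1, 0]
confidence = [1.0, 0.0, 0.5, 0.0]
loss = masked_huber(pred, target, mask, confidence)
```

The mask keeps unobserved pixels from contributing gradient, and the confidence weights let uncertain picks pull the prediction less strongly than reliable ones.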

112

4D Multimodal Co-attention Fusion Network with Latent Contrastive Alignment for Alzheimer's Diagnosis

YUXIANG WEI ⋅ Yanteng Zhang ⋅ Xi Xiao ⋅ Tianyang Wang ⋅ Xiao Wang ⋅ Vince Calhoun

Multimodal neuroimaging provides complementary structural and functional insights into both human brain organization and disease-related dynamics. Recent studies demonstrate enhanced diagnostic sensitivity for Alzheimer's disease (AD) through synergistic integration of neuroimaging data (e.g., sMRI, fMRI) with tabular data (e.g., behavioral and cognitive tests). However, the intrinsic heterogeneity across modalities (e.g., 4D spatiotemporal fMRI dynamics vs. 3D anatomical sMRI structure) presents critical challenges for discriminative feature fusion, often leading to information loss or biased fusion. To bridge this gap, we propose M2M-AlignNet: a multimodal co-attention network with latent alignment for early AD diagnosis using sMRI and fMRI. At the core of our approach is a multi-patch-to-multi-patch (M2M) contrastive loss function that quantifies and reduces representational discrepancies via weighted patch correspondence, explicitly aligning fMRI components across brain regions with their sMRI structural substrates without one-to-one constraints. Additionally, we propose a latent-as-query co-attention module to autonomously discover fusion patterns, circumventing modality prioritization biases while minimizing feature redundancy. We conduct extensive experiments to confirm the effectiveness of our method and highlight the correspondence between fMRI and sMRI as AD biomarkers.

113

One-Cycle Structured Pruning via Stability-Driven Subnetwork Search

Deepak Ghimire ⋅ Dayoung Kil ⋅ Sunghwan Jeong ⋅ Jaesik Park ⋅ Seong-heum Kim

Existing structured pruning typically involves multi-stage training procedures that often demand heavy computation. Pruning at initialization, which aims to address this limitation, reduces training costs but struggles with performance. To address these challenges, we propose an efficient framework for one-cycle structured pruning without compromising model performance. In this approach, we integrate pre-training, pruning, and fine-tuning into a single training cycle, referred to as the `one cycle approach'. The core idea is to search for the optimal sub-network during the early stages of network training, guided by norm-based group saliency criteria and structured sparsity regularization. We introduce a novel pruning indicator that determines the stable pruning epoch by assessing the similarity between evolving pruning sub-networks across consecutive training epochs. In addition, group sparsity regularization accelerates pruning and thereby shortens the overall training cycle. Extensive experiments on datasets, including CIFAR-10/100 and ImageNet, using VGGNet, ResNet, MobileNet, and ViT architectures, demonstrate that our method achieves state-of-the-art accuracy while being one of the most efficient pruning frameworks in terms of training time. The source code will be made publicly available.
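
One way to picture a stability-driven pruning indicator is sketched below: at every epoch a candidate sub-network is formed by keeping the filters with the largest norms, and pruning is triggered once consecutive candidates stop changing. The abstract does not specify the similarity measure or the stopping rule, so the Jaccard similarity, the `threshold` and `patience` parameters, and all function names here are hypothetical.

```python
def keep_mask(norms, keep_ratio):
    """Indices of the filters with the largest norms (the candidate sub-network)."""
    k = max(1, round(len(norms) * keep_ratio))
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard similarity between two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stable_epoch(norm_history, keep_ratio=0.5, threshold=0.9, patience=2):
    """First epoch at which the candidate mask has matched the previous
    epoch's mask (similarity >= threshold) for `patience` epochs in a row.
    Returns None if the masks never stabilize."""
    streak = 0
    prev = None
    for epoch, norms in enumerate(norm_history):
        mask = keep_mask(norms, keep_ratio)
        if prev is not None and jaccard(mask, prev) >= threshold:
            streak += 1
            if streak >= patience:
                return epoch
        else:
            streak = 0
        prev = mask
    return None


# Hypothetical per-epoch filter norms for a layer with four filters.
history = [
    [0.9, 0.1, 0.5, 0.4],  # keeps filters {0, 2}
    [0.8, 0.2, 0.3, 0.6],  # keeps filters {0, 3}: mask changed
    [0.9, 0.1, 0.2, 0.7],  # keeps filters {0, 3}: unchanged once
    [1.0, 0.1, 0.2, 0.8],  # keeps filters {0, 3}: unchanged twice
]
epoch = stable_epoch(history)
```

In this toy run the candidate sub-network stops changing after epoch 1, so the indicator fires at epoch 3, after which the remaining training budget would be spent fine-tuning the pruned network.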

114

PredMapNet: Future and Historical Reasoning for Consistent Online HD Vectorized Map Construction

Bo Lang ⋅ Nirav Savaliya ⋅ Zhihao Zheng ⋅ Jinglun Feng ⋅ Zheng-Hang Yeh ⋅ Mooi Choo Chuah

High-definition (HD) maps are crucial to autonomous driving, providing structured representations of road elements to support navigation and planning. However, existing query-based methods often employ random query initialization and depend on implicit temporal modeling, which lead to temporal inconsistencies and instabilities during the construction of a global map. To overcome these challenges, we introduce a novel end-to-end framework for consistent online HD vectorized map construction, which jointly performs map instance tracking and short-term prediction. First, we propose a Semantic-Aware Query Generator that initializes queries with spatially aligned semantic masks to capture scene-level context globally. Next, we design a History Rasterized Map Memory to store fine-grained instance-level maps for each tracked instance, enabling explicit historical priors. A History-Map Guidance Module then integrates rasterized map information into track queries, improving temporal continuity. Finally, we propose a Short-Term Future Guidance module to forecast the immediate motion of map instances based on the stored history trajectories. These predicted future locations serve as hints for tracked instances to further avoid implausible predictions and keep temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed method outperforms state-of-the-art (SOTA) methods with good efficiency.

115

Multi-view stereo with multiple projectors for oneshot entire shape scan based on Neural SDF and DSSS demultiplexing

Kota Nishihara ⋅ Ryo Furukawa ⋅ Ryusuke Sagawa ⋅ Hiroshi Kawasaki

3D reconstruction has been widely studied and applied in various fields. Multi-view stereo (MVS) methods can recover dense geometry from multiple views but often fail for texture-less objects due to unreliable feature matching. Active stereo with structured light (SL) addresses this limitation. However, when multiple cameras and projectors are used for entire shape acquisition, overlapping SL patterns interfere with one another, leading to decoding failures. We propose a novel MVS framework based on neural signed distance fields (Neural SDF) with multiple projectors that employs Direct Sequence Spread Spectrum (DSSS) to separate multiplexed patterns. This approach enables robust and accurate 3D shape reconstruction through Neural SDF optimization with a photometric loss that accounts for both the positions and the patterns of the projectors. We built real scanning devices, and experiments on several objects demonstrated the effectiveness of the proposed method.
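The general idea behind DSSS demultiplexing can be sketched in a few lines: each projector's signal is multiplied by its own spreading code, the camera observes the sum, and correlating the observation with one code recovers that projector's contribution. The example below is a simplified sketch under stated assumptions: it uses short orthogonal codes with values of +1 and -1, a single pixel, and signed intensities, none of which are taken from the paper; the function names are hypothetical.

```python
def multiplex(intensities, codes):
    """Observed chip sequence at one pixel: sum of code-modulated signals."""
    n_chips = len(codes[0])
    return [
        sum(value * code[i] for value, code in zip(intensities, codes))
        for i in range(n_chips)
    ]


def despread(observed, code):
    """Correlate the observation with one code to isolate that projector."""
    return sum(o * c for o, c in zip(observed, code)) / len(code)


# Two projectors with mutually orthogonal codes (assumed for illustration).
codes = [[1, 1, 1, 1], [1, -1, 1, -1]]
intensities = [0.8, 0.3]  # per-projector pattern value at this pixel

observed = multiplex(intensities, codes)
recovered = [despread(observed, code) for code in codes]
print(recovered)
```

Because the two codes are orthogonal, the cross term cancels in the correlation, so each recovered value depends only on the matching projector.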

116

FreeCond: Free Lunch in the Input Conditions of Text-Guided Inpainting

Teng-Fang Hsiao ⋅ Bo-Kai Ruan ⋅ Sung-Lin Tsai ⋅ Yi-Lun Wu ⋅ Hong-Han Shuai

Text-to-image inpainting models often exhibit an unpredictable balance between image coherence and prompt adherence. This rigidity limits their adaptability across diverse scenarios, including coarse masks, non-object prompts, and interaction prompts. Recognizing this instability as an indicator of learned generation diversity, we aim to control model behavior for a given objective. We propose Empirical Feature Intervention (EFI), a metric-agnostic framework that precomputes how feature interventions influence evaluation metrics such as CLIP, Human Preference Score (HPS), and Image Reward (IR). Building on EFI, we introduce FreeCond, a free-of-cost framework that applies two simple input interventions (Image Frequency and Mask Value Modulation). These interventions can be further optimized via Surrogate Intervention Optimization (SIO), based on a surrogate model regressed on precomputed EFI data. FreeCond enables real-time, user-interactive control of pre-trained models without retraining or architectural modifications. To benchmark performance in challenging settings, we also present FCIBench. Experiments on EditBench, BrushBench, and FCIBench demonstrate that FreeCond substantially improves CLIP, HPS, and IR metrics by up to 22%, 8%, and 54%, respectively.

117

ControlEvents: Controllable Synthesis of Event Camera Data with Foundational Prior from Image Diffusion Models

Yixuan Hu ⋅ Yuxuan Xue ⋅ Simon Klenk ⋅ Daniel Cremers ⋅ Gerard Pons-Moll

In recent years, event cameras have gained significant attention due to their bio-inspired properties, such as high temporal resolution and high dynamic range. However, obtaining large-scale labeled ground-truth data for event-based vision tasks remains challenging and costly. In this paper, we present ControlEvents, a diffusion-based generative model designed to synthesize high-quality event data guided by diverse control signals such as class text labels, 2D skeletons, and 3D body poses. Our key insight is to leverage the diffusion prior from foundation models, such as Stable Diffusion, enabling high-quality event data generation with minimal fine-tuning and limited labeled data. Our method streamlines the data generation process and significantly reduces the cost of producing labeled event datasets. We demonstrate the effectiveness of our approach by synthesizing event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation. Our experiments show that the synthesized labeled event data enhances model performance in all tasks. Additionally, our approach can generate events based on unseen text labels during training, illustrating the powerful text-based generation capabilities inherited from foundation models. Our models and generated datasets will be publicly available for future research.

118

DPBridge: Latent Diffusion Bridge for Dense Prediction

Haorui Ji ⋅ Tao Jun Lin ⋅ Hongdong Li

Diffusion models demonstrate remarkable capabilities in capturing complex data distributions and have achieved compelling results in many generative tasks. While they have recently been extended to dense prediction tasks such as depth estimation and surface normal prediction, their full potential in this area remains underexplored. As target signal maps and input images are pixel-wise aligned, the conventional noise-to-data generation paradigm is inefficient, and input images can serve as a more informative prior compared to pure noise. Diffusion bridge models, which support data-to-data generation between two general data distributions, offer a promising alternative, but they typically fail to exploit the rich visual priors embedded in large pretrained foundation models. To address these limitations, we integrate the diffusion bridge formulation with structured visual priors and introduce DPBridge, the first latent diffusion bridge framework for dense prediction tasks. To resolve the incompatibility between diffusion bridge models and pretrained diffusion backbones, we propose (1) a tractable reverse transition kernel for the diffusion bridge process, enabling a maximum likelihood training scheme; and (2) finetuning strategies including distribution-aligned normalization and an image consistency loss. Experiments across extensive benchmarks validate that our method consistently achieves superior performance, demonstrating its effectiveness and generalization capability under different scenarios.

119

PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection

Po-Han Huang ⋅ Jeng-Lin Li ⋅ Po-Hsuan Huang ⋅ Ming-Ching Chang ⋅ Wei-Chao Chen

Industrial anomaly detection is increasingly relying on foundation models, aiming for strong out-of-distribution generalization and rapid adaptation in real-world deployments. Notably, past studies have primarily focused on textual prompt tuning, leaving the intrinsic visual counterpart fragmented into processing steps specific to each foundation model. We aim to address this limitation by proposing a unified patch-focused framework, Patch-Exclusive Anomaly Detection (PatchEAD), enabling training-free anomaly detection that is compatible with diverse foundation models. The framework constructs visual prompting techniques, including an alignment module and foreground masking. Our experiments show superior few-shot and batch zero-shot performance compared to prior work, despite the absence of textual features. Our study further examines how backbone structure and pretrained characteristics affect patch-similarity robustness, providing actionable guidance for selecting and configuring foundation models for real-world visual inspection. These results confirm that a well-unified patch-only framework can enable quick, calibration-light deployment without the need for carefully engineered textual prompts.

120

SurfDist: Interpretable Three-Dimensional Instance Segmentation Using Curved Surface Patches

Jackson Borchardt ⋅ Saul Kato

We present SurfDist, a convolutional neural network architecture for three-dimensional volumetric instance segmentation. SurfDist is a modification of the popular model architecture StarDist-3D which enables learning instance boundaries as closed piecewise compositions of smooth parametric surfaces. This parameterization breaks StarDist-3D's coupling of instance dimension and instance voxel resolution, and it produces predictions which may be upsampled to arbitrarily high resolutions without introduction of voxelization artifacts. For datasets with blob-shaped instances, common in biomedical imaging, SurfDist can achieve higher segmentation accuracy than StarDist-3D with more compact instance parameterizations.

121

CRISP: Cylindrical Rendering for In-Stream Point Clouds

Hyungwoo Kang ⋅ Seonyoung Jang ⋅ YeoJun Yoon ⋅ Byungtae Oh

Conventional point cloud rendering methods are based on the assumption that the complete point cloud is used as input data. However, these methods exhibit limitations in real-world applications, particularly in in-stream environments. In real-world scenarios, point cloud acquisition and transmission typically require compression, resulting in degraded point clouds that should be rendered effectively. To address these challenges, we propose a novel point cloud rendering method through cylindrical projection, optimized for in-stream environments. This approach minimizes distortion by projecting the entire point cloud radially onto a 2D cylindrical plane based on its central axis, enabling grid-based 2D processing. This method provides stable and consistent performance even with sparse or incomplete point clouds. Extensive benchmark experiments show that the proposed framework effectively renders various scenes and objects without requiring additional fine-tuning and demonstrates superior performance and generalization ability in in-stream environments compared with conventional methods.
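A radial projection onto a cylindrical grid, as described in this abstract, can be sketched as follows. This is only a minimal illustration under stated assumptions: the central axis is taken to be the vertical line through the centroid, each grid cell stores the largest radius that falls into it, and the function name and grid layout are hypothetical rather than the authors' implementation.

```python
import math


def cylindrical_project(points, width, height):
    """Map 3D points to a (row, col) -> radius grid on an unrolled cylinder.

    Columns index the azimuth angle around the assumed central axis
    (vertical, through the centroid); rows index height along that axis.
    """
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    zs = [p[2] for p in points]
    z_min = min(zs)
    z_span = (max(zs) - z_min) or 1.0  # avoid division by zero for flat clouds

    grid = {}
    for x, y, z in points:
        theta = math.atan2(y - cy, x - cx)    # azimuth in [-pi, pi]
        radius = math.hypot(x - cx, y - cy)   # distance from the axis
        col = min(width - 1, int((theta + math.pi) / (2 * math.pi) * width))
        row = min(height - 1, int((z - z_min) / z_span * height))
        # Keep the outermost point per cell (an illustrative choice).
        if (row, col) not in grid or radius > grid[(row, col)]:
            grid[(row, col)] = radius
    return grid


cloud = [(1.0, 0.0, 0.0), (-1.0, 0.0, 1.0)]
grid = cylindrical_project(cloud, width=4, height=2)
print(sorted(grid.items()))
```

Once the points live on a regular 2D grid, sparse or missing cells can be handled with ordinary image-style processing, which is the property the abstract highlights.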

122

INRetouch: Context Aware Implicit Neural Representation for Photography Retouching

Omar Elezabi ⋅ Marcos Conde ⋅ Zongwei Wu ⋅ Radu Timofte

Professional photo editing remains challenging, requiring extensive knowledge of imaging pipelines and significant expertise. While recent deep learning approaches, particularly style transfer methods, have attempted to automate this process, they often struggle with output fidelity, editing control, and complex retouching capabilities. We propose a novel retouch transfer approach that learns from professional edits through before-after image pairs, enabling precise replication of complex editing operations. We develop a context-aware Implicit Neural Representation that learns to apply edits adaptively based on image content and context, and is capable of learning from a single example. Our method extracts implicit transformations from reference edits and adaptively applies them to new images. To facilitate this research direction, we introduce a comprehensive Photo Retouching Dataset comprising 100,000 high-quality images edited using over 170 professional Adobe Lightroom presets. Through extensive evaluation, we demonstrate that our approach not only surpasses existing methods in photo retouching but also enhances performance in related image reconstruction tasks like Gamut Mapping and Raw Reconstruction. By bridging the gap between professional editing capabilities and automated solutions, our work presents a significant step toward making sophisticated photo editing more accessible while maintaining high-fidelity results.

123

Line Art Colorization with Offset Prior-based Diffusion Model

Xuan Zhu ⋅ Miao Cao ⋅ Fang-Lue Zhang ⋅ Yu-Kun Lai ⋅ Paul Rosin

Reference-based line art video colorization colorizes the target line art according to reference images, which is an essential stage for the cartoon production workflow. However, the manual colorization process is time-consuming and repetitive, making automatic video colorization highly desirable. Existing cartoon colorization methods struggle with domain misalignment between the reference and line art images and the loss of details caused by compression into a low-dimensional space in the existing video diffusion models, reducing colorization quality. In this paper, we propose an Offset Prior-based Diffusion Model (OPDM) for cartoon video colorization, which utilizes the powerful generation capability of the diffusion model and cross-domain matching priors to generate high-quality colorization results. Specifically, we design a simple and effective Offset-Adapter that leverages the idea of sampling offsets in deformable convolution to estimate the cross-domain spatial offset features between the target line arts and reference images. We further introduce a new training strategy that combines forward diffusion and reverse denoising in the training stage to ensure content consistency. Experiments on a public cartoon dataset and our newly constructed long cartoon video dataset demonstrate that our proposed method outperforms the existing state-of-the-art line art coloring methods. The code will be available upon publication.

124

Food Image Generation on Multi-Noun Categories

Xinyue Pan ⋅ Yuhao Chen ⋅ Jiangpeng He ⋅ Fengqing Zhu

Generating realistic food images for categories with multiple nouns is surprisingly challenging. For instance, the prompt “egg noodle” may result in images that incorrectly contain both eggs and noodles as separate entities. Multi-noun food categories are common in real-world datasets and account for a large portion of entries in benchmarks such as UEC-256. These compound names often cause generative models to misinterpret the semantics, producing unintended ingredients or objects. This is due to insufficient multi-noun category related knowledge in the text encoder and misinterpretation of multi-noun relationships, leading to incorrect spatial layouts. To overcome these challenges, we propose FoCULR (Food Category Understanding and Layout Refinement) which incorporates food domain knowledge and introduces core concepts early in the generation process. Experimental results demonstrate that the integration of these techniques improves image generation performance in the food domain.

125

RobuMTL: Enhancing Multi-Task Learning Robustness Against Weather Conditions

Tasneem Shaffee ⋅ Sherief Reda

The Multi-Task Learning (MTL) paradigm has recently emerged as a promising approach to tackle complex problems across various domains, such as computer vision, reinforcement learning, and natural language processing, and has been successfully deployed in numerous applications, including on edge devices. Consequently, addressing the robustness of MTL has become a necessity to withstand diverse types of perturbations, including noise and environmental conditions such as weather. In this paper, we introduce RobuMTL, a novel architecture designed to adaptively address the degradation of visual input based on its characteristics by dynamically selecting task-specific hierarchical Low-Rank Adaptation (LoRA) modules and a LoRA squad based on input perturbations in a mixture-of-experts manner. Our framework enables adaptive specialization based on input perturbations, enhancing robustness across diverse conditions. To validate our approach, we evaluated it on the PASCAL and NYUD-v2 datasets and compared it against single-task models, traditional MTL approaches, and state-of-the-art methods. Our approach demonstrated superior performance, achieving a 2.8% increase under single perturbations and up to a 44.4% relative improvement under mixed weather conditions on PASCAL, as well as a 9.7% improvement on NYUD-v2, while maintaining a 3.6× parameter reduction and 3.52× lower computational cost with only a +3.6 ms latency overhead per image over the baseline, effectively enhancing robustness and task efficiency under adverse conditions. The code will be publicly available on GitHub.

126

EndoPBR: Photorealistic Synthetic Data for Surgical 3D Vision via Physically-based Rendering

John Han ⋅ Jie Ying Wu

Synthetic data has played a pivotal role in developing large-scale 3D vision models due to its high-quality annotations and ease of curation. In domains where labeled data collection is difficult, synthetic data holds promise as a means to generate the large‑scale annotated datasets required to train modern neural networks. As a result, the ability to generate photorealistic synthetic data with 3D labels would be immensely helpful for domains like endoscopy, where conventional 3D reconstruction algorithms struggle and labeled data is scarce. In this work, we address a core question for data-scarce applications in 3D vision: how can we generate synthetic labeled data, and how useful would it be for training downstream vision models? To this end, we first introduce a novel data generation module that takes images with known geometry and camera poses as input and estimates the material and lighting conditions of the scene. To disambiguate the training process, we leverage domain-specific properties like non-stationary lighting and anatomical material priors. We model the material properties as a bidirectional reflectance distribution function, parameterized by a neural network. Via the rendering equation, we can generate photorealistic images at arbitrary camera poses. We demonstrate that this method produces competitive novel view synthesis results compared to previous work. Secondly, we use our synthetic data to train models on various downstream 3D vision tasks and find that models trained solely on our synthetic data outperform those trained on real data across all metrics and tasks. Our experiments show that synthetic data is a promising avenue towards robust 3D vision solutions in surgical scenes.

127

Inpainting of Sparse Depth Maps from Monocular Depth-from-Focus on Pixel Processor Arrays

Maciej Lewandowski ⋅ Piotr Dudek

Depth estimation is essential for robotics and effective navigation. While many recent methods attempt to estimate dense depth maps from a single RGB image or a combination of an RGB image and sparse depth measurements, our work leverages the in-pixel computing capabilities of a pixel processor array (PPA), combined with an electrically tunable liquid lens, to capture semi-sparse depth maps via a depth-from-focus approach. We consider the problem of reconstructing dense depth maps from such measurements. We simulate a PPA-based depth-from-focus algorithm on a synthetic focal stack derived from a monocular RGB-D dataset, demonstrating competitive dense depth map reconstruction from depth frames containing as few as 10% non-zero pixels, with a 5-bit resolution. Furthermore, we enhance the performance of semi-sparse depth completion by fusing these PPA-captured depth cues with concurrently acquired RGB images. We also use belief propagation, allowing for highly localized and parallel computation without access to global memory, offering a promising solution from the perspective of PPAs. We show the performance of these algorithms on semi-sparse image depth reconstruction tasks.

128

DMS2F-HAD: A Dual-branch Mamba-based Spatial–Spectral Fusion Network for Hyperspectral Anomaly Detection

Aayushma Pant ⋅ Lakpa Tamang ⋅ Tsz-Kwan Lee ⋅ Sunil Aryal

Hyperspectral anomaly detection (HAD) aims to identify rare and irregular targets in high-dimensional hyperspectral images (HSIs), which are often noisy and unlabeled. Existing deep learning methods either fail to capture long-range spectral dependencies (e.g., convolutional neural networks) or suffer from high computational cost (e.g., Transformers). To address these challenges, we propose DMS2F-HAD, a novel dual-branch Mamba-based model. Our architecture utilizes Mamba’s linear-time modeling to efficiently learn distinct spatial and spectral features in specialized branches, which are then integrated by a dynamic gated fusion mechanism to enhance anomaly localization. Across fourteen benchmark HSI datasets, our proposed DMS2F-HAD not only achieves a state-of-the-art average AUC of 98.78% but also demonstrates superior efficiency with an inference speed 4.6x faster than comparable deep learning methods. The results highlight DMS2F-HAD's strong generalization and scalability, positioning it as a strong candidate for practical HAD applications. The source code is available at https://anonymous.4open.science/r/DMS2F-HAD-45CC.

129

F-ViTA: Foundation Model Guided Visible to Infrared Translation

Jay Paranjape ⋅ Celso de Melo ⋅ Vishal Patel

Thermal imaging is crucial for scene understanding, particularly in low-light and nighttime conditions. However, collecting large thermal datasets is costly and labor-intensive due to the specialized equipment required for infrared image capture. To address this challenge, researchers have explored visible-to-thermal image translation. Most existing methods rely on Generative Adversarial Networks (GANs) or Diffusion Models (DMs), treating the task as a style transfer problem. As a result, these approaches attempt to learn both the modality distribution shift and underlying physical principles from limited training data. In this paper, we propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation. Specifically, we condition an InstructPix2Pix Diffusion Model with zero-shot masks and labels from foundation models such as SAM and Grounded DINO. This allows the model to learn meaningful correlations between scene objects and their thermal signatures in infrared imagery. Extensive experiments on five public datasets demonstrate that F-ViTA outperforms state-of-the-art (SOTA) methods. Furthermore, our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image. Code: Released post-review.

130

KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding

Zongyao Li ⋅ Kengo Ishida ⋅ Satoshi Yamazaki ⋅ XIAOTONG JI ⋅ Jianquan Liu

We propose KFS-Bench, the first benchmark for key frame sampling in long video question answering (QA), featuring multi-scene annotations to enable direct and robust evaluation of sampling strategies. Key frame sampling is crucial for efficient long-form video understanding. In long video QA, selecting informative frames enables multimodal large language models (MLLMs) to improve both accuracy and efficiency. KFS-Bench addresses the limitation of prior works that only indirectly assess frame selection quality via QA accuracy. By providing ground-truth annotations of multiple disjoint scenes required per question, KFS-Bench allows us to directly analyze how different sampling approaches capture essential content across an entire long video. Using KFS-Bench, we conduct a comprehensive study of key frame sampling methods and find that not only sampling precision but also scene coverage and sampling balance are key factors influencing QA performance. Taking all of these factors into account, we design a novel sampling quality metric that correlates with QA accuracy. Furthermore, we develop a novel key frame sampling method that leverages question–video relevance to balance sampling diversity against question–frame similarity, thereby improving coverage of relevant scenes. Our adaptively balanced sampling approach achieves superior performance in both key frame sampling and QA performance. The benchmark will be released publicly.

131

FujiView: Multimodal Late-Fusion for Predicting Scenic Visibility

Bryce Bible ⋅ Shah Hasnaeen ⋅ Hairong Qi

Visibility of natural landmarks such as Mount Fuji is a defining factor in both tourism planning and visitor experience, yet it remains difficult to predict due to rapidly changing atmospheric conditions. We present **FujiView**, a multimodal learning framework and dataset for predicting scenic visibility by fusing webcam imagery with structured meteorological data. Our late-fusion approach combines image-derived class probabilities with numerical weather features to classify visibility into five categories. The dataset currently comprises over 100,000 webcam images paired with concurrent and forecasted weather conditions from more than 40 cameras around Mount Fuji, and continues to expand; it will be released to support further research in environmental forecasting. Experiments show that YOLO-based vision features dominate short-term horizons such as "nowcasting" and "samedaycasting", while weather-driven forecasts increasingly take over as the primary predictive signal beyond $+1$d. Late fusion consistently yields the highest overall accuracy, achieving $\mathrm{ACC} \approx 0.89$ for same-day prediction and up to **84%** for next-day forecasts. These results position **Scenic Visibility Forecasting (SVF)** as a new benchmark task for multimodal learning.

132

Improving Animal Pose Estimation through Species Similarity Measures and Rigorous Label Definition

Medhashree Parhy ⋅ Shaan Chanchani ⋅ Claire Kim ⋅ Joshua Mansky ⋅ Parth Thakre ⋅ Zian Pan ⋅ Haoyu Chen ⋅ Amy Reibman

Effective image-based analysis of animals, their phenotypes, and their behavior requires accurate localization of key body parts. Keypoint detection algorithms can be either generalized, for a large set of animal species, or specialized and targeted to a specific small set of species. In this paper, we explore specialized models, and address two critical aspects of the associated data-label quality: selection of training data and definition of keypoints. Using antelope species as an example, we introduce a variety of species-similarity measures that we apply for selecting relevant training samples, and we demonstrate that training with the automatically selected species leads to improved pose estimation performance while reducing the required number of images. Then, we demonstrate that labeling keypoints with more precise locations leads to improved localization performance that would be valuable for downstream tasks.

133

Grounding Descriptions in Images informs Zero-Shot Visual Recognition

Shaunak Halbe ⋅ Junjiao Tian ⋅ Joseph J ⋅ James Smith ⋅ Katherine Stevo ⋅ Vineeth Balasubramanian ⋅ Zsolt Kira

Vision-language models (VLMs) like CLIP have been cherished for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation bears the highest similarity with the query image. While successful in some domains, this method struggles with identifying fine-grained entities as well as generalizing to unseen concepts that are not captured by the training distribution. Recent works attempt to mitigate these challenges by integrating category descriptions at test time, albeit yielding modest improvements. We attribute these limited gains to a fundamental misalignment between image and description representations, which is rooted in the pretraining structure of CLIP. In this paper, we propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously. Our approach learns to jointly ground textual descriptions in image regions along with aligning overarching captions with global image representations. To drive this pre-training, we leverage frozen Multimodal Large Language Models (MLLMs) to derive large-scale synthetic annotations. We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods across 11 diverse image classification datasets. Additionally, we introduce Products-2023, a newly curated, manually labeled dataset featuring novel concepts, and showcase our model's ability to recognize these concepts by benchmarking on it. Significant improvements achieved by our model on other downstream tasks like retrieval further highlight the superior quality of representations learned by our approach.

134

Semi-supervised Key-Point Estimation for Echocardiography Video

Seok-Hwan Oh ⋅ hyeonjik lee ⋅ Guil Jung ⋅ Myeong-Gee Kim ⋅ Young-Min Kim ⋅ Hyuksool Kwon ⋅ Hyeon-min Bae

Echocardiography, a widely used imaging modality, offers real-time assessments of cardiac morphology and function, with a particular emphasis on left ventricular dynamics. Despite its clinical importance, existing automated methods for echocardiographic analysis struggle to ensure temporal consistency in left ventricular key-point trajectories, largely due to their reliance on static frame annotations. To overcome these challenges, we propose a semi-supervised trajectory refinement framework that employs inter-frame correlations to enhance key-point estimation across echocardiography videos. A semi-supervised trajectory learning scheme is presented to improve the efficacy of key-point trajectory analysis using unannotated echocardiography videos. The experiments present considerable improvements in both spatial accuracy and temporal stability of the left ventricle key-point trajectories, outperforming state-of-the-art baselines and demonstrating the clinical applicability for robust echocardiography analysis.

135

Anatomically-guided masked autoencoder pre-training for aneurysm detection

Alberto Mario Ceballos Arroyo ⋅ Jisoo Kim ⋅ Chu-Hsuan Lin ⋅ Lei Qin ⋅ Geoffrey Young ⋅ Huaizu Jiang

Intracranial aneurysms are a major cause of morbidity and mortality worldwide, and detecting them manually is a complex, time-consuming task. Although automated solutions are desirable, the limited availability of training data makes it difficult to develop such solutions using typical supervised learning frameworks. In this work, we propose a novel pre-training strategy using more widely available unannotated head CT scan data to pre-train a 3D Vision Transformer model prior to fine-tuning for the aneurysm detection task. Specifically, we modify masked auto-encoder (MAE) pre-training in the following ways: we use a factorized self-attention mechanism to make 3D attention computationally viable, we restrict the masked patches to areas near arteries to focus on areas where aneurysms are likely to occur, and we reconstruct not only CT scan intensity values but also artery distance maps, which describe the distance between each voxel and the closest artery, thereby enhancing the backbone's learned representations. Compared with SOTA aneurysm detection models, our approach gains +2-7% absolute Sensitivity at a false positive rate of 0.5. Code and weights will be released.

136

Style-Friendly SNR Sampler for Style-Driven Generation

Jooyoung Choi ⋅ Chaehun Shin ⋅ Yeongtak Oh ⋅ Heeseung Kim ⋅ Jungbeom Lee ⋅ Sungroh Yoon

Recent text-to-image diffusion models generate high-quality images but struggle to learn new styles, which limits personalized content creation. In response, style-driven generation has become a popular task, wherein users supply reference images capturing the target style, complemented by text prompts that specify stylistic cues. Fine-tuning is a common approach, yet it often blindly reuses pre-training configurations without modification, especially for noise schedules defined in terms of signal-to-noise ratio (SNR), which determines the amount of image information available at each denoising step. We discover that stylistic features predominantly emerge in the low-SNR range, leading current fine-tuning methods that use regular noise schedules to exhibit suboptimal style alignment. We propose the Style-friendly SNR sampler, which focuses fine-tuning on the low-SNR range where stylistic features emerge. We demonstrate improved generation of novel styles that cannot be described solely with a text prompt, enabling high-fidelity personalized content creation.
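One way to bias fine-tuning toward low SNR is to draw the log-SNR from a distribution centered at a low value and convert it to a noise level. The sketch below assumes a linear interpolation schedule x_t = (1 - t) x_0 + t ε, for which SNR(t) = ((1 - t) / t)^2 and therefore t = 1 / (1 + exp(log SNR / 2)). The normal distribution, the mean and standard deviation values, and the function names are illustrative assumptions, not settings taken from this abstract.

```python
import math
import random


def log_snr_to_t(log_snr):
    """Noise level t for x_t = (1 - t) * x0 + t * eps (assumed schedule)."""
    return 1.0 / (1.0 + math.exp(log_snr / 2.0))


def sample_t(mean_log_snr, std_log_snr, rng):
    """Draw log-SNR from a normal distribution, then convert it to t."""
    return log_snr_to_t(rng.gauss(mean_log_snr, std_log_snr))


rng = random.Random(0)
# Hypothetical settings: a default sampler centered at log-SNR 0 versus a
# low-SNR sampler centered at -6; both use an illustrative std of 2.
default_ts = [sample_t(0.0, 2.0, rng) for _ in range(1000)]
low_snr_ts = [sample_t(-6.0, 2.0, rng) for _ in range(1000)]

print(sum(default_ts) / len(default_ts))
print(sum(low_snr_ts) / len(low_snr_ts))
```

Shifting the mean toward negative log-SNR concentrates training samples at large t, that is, at heavily noised inputs, which is the regime the abstract associates with stylistic features.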

137

Lose Your Self (LoYS): an adversarial entropy-based unsupervised approach for model debiasing

Vito Paolo Pastore ⋅ Massimiliano Ciranni ⋅ Vittorio Murino

When spurious correlations between targets and data are present in training samples, deep neural networks may struggle to generalize, typically learning shortcuts corresponding to such undesired correlations (i.e., the bias), rather than fundamental target attributes. If bias attributes are assumed to be known, several strategies can be applied to mitigate a model's dependency on bias, including upsampling or upweighting samples with no bias. However, this is hardly the case in real-world scenarios, and alternative unsupervised debiasing approaches have been proposed in recent years, assuming no bias information is available. In this work, we propose Lose Your Self (LoYS), a novel bias-unsupervised approach for model debiasing. The main design strategy in LoYS aims to force a model to learn semantic features, discouraging it from learning bias-related descriptors, specifically focusing on the target classifier confidence. LoYS exploits an adversarial scheme based on entropy loss, against a shallow auxiliary classifier trained to match the predictions of a pre-trained biased model, together with entropy regularizations. Our experiments show that LoYS is competitive with or outperforms state-of-the-art methods on several common benchmarks.
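
The abstract does not give the exact form of the LoYS objective, so the sketch below shows only the basic quantity that entropy-based losses build on: the Shannon entropy of a classifier's softmax output. The function names are assumptions for this example. High entropy means an unconfident (near-uniform) prediction, and low entropy means a confident one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over classes."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

An adversarial scheme could, for instance, push one network to raise this value for an auxiliary classifier while that classifier tries to lower it; whether LoYS does precisely this is not stated in the abstract.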

138

MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes

Ruiyuan Gao ⋅ Kai Chen ⋅ Zhihao Li ⋅ Lanqing HONG ⋅ Zhenguo Li ⋅ Qiang Xu

Controllable generative models for images and videos have seen significant success, yet 3D scene generation, especially in unbounded scenarios like autonomous driving, remains underdeveloped. Existing methods lack flexible controllability and often rely on dense view data collection in controlled environments, limiting their generalizability across common datasets (e.g., nuScenes). In this paper, we introduce MagicDrive3D, a novel framework for controllable 3D street scene generation that combines video-based view synthesis with 3D representation (3DGS) generation. It supports multi-condition control, including road maps, 3D objects, and text descriptions. Unlike previous approaches that require 3D representation before training, MagicDrive3D first trains a multi-view video generation model to synthesize diverse street views. This method utilizes routinely collected autonomous driving data, reducing data acquisition challenges and enriching 3D scene generation. In the 3DGS generation step, we introduce Fault-Tolerant Gaussian Splatting to address minor errors and use monocular depth for better initialization, alongside appearance modeling to manage exposure discrepancies across viewpoints. Experiments show that MagicDrive3D generates diverse, high-quality 3D driving scenes, supports any-view rendering, and enhances downstream tasks like BEV segmentation, demonstrating its potential for autonomous driving simulation and beyond. Project Page: https://magicdrive3d.github.io/.

139

Grounding Degradations in Natural Language for All-In-One Video Restoration

Muhammad Kamran Janjua ⋅ Amirhosein Ghasemabadi ⋅ Kunlin Zhang ⋅ Mohammad Salameh ⋅ Chao Gao ⋅ Di Niu

In this work, we propose an all-in-one video restoration framework that grounds degradation-aware semantic context of video frames in natural language via foundation models, offering interpretable and flexible guidance. Unlike prior art, our method assumes no degradation knowledge at train or test time and learns an approximation to the grounded knowledge such that the foundation model can be safely disentangled during inference, adding no extra cost. Further, we call for standardization of benchmarks in all-in-one video restoration, and propose two benchmarks in the multi-degradation setting, three-task (3D) and four-task (4D), and two time-varying composite degradation benchmarks, one of the latter being our proposed dataset with varying snow intensity, simulating how weather degradations affect videos naturally. We compare our method with prior works and report state-of-the-art performance on all benchmarks.

140

ControlVP: Interactive Geometric Refinement of AI-Generated Images with Consistent Vanishing Points

Ryota Okumura ⋅ Kaede Shiohara ⋅ Toshihiko Yamasaki

Recent text-to-image models, such as Stable Diffusion, have achieved impressive visual quality, yet they often suffer from geometric inconsistencies that undermine the structural realism of generated scenes. One prominent issue is vanishing point inconsistency, where projections of parallel lines fail to converge correctly in 2D space. This leads to structurally implausible geometry that degrades spatial realism, especially in architectural scenes. We propose ControlVP, a user-guided framework for correcting vanishing point inconsistencies in generated images. Our approach extends a pre-trained diffusion model by incorporating structural guidance derived from building contours. We also introduce geometric constraints that explicitly encourage alignment between image edges and perspective cues. Our method enhances global geometric consistency while maintaining visual fidelity comparable to the baselines. This capability is particularly valuable for applications that require accurate spatial structure, such as image-to-3D reconstruction. The dataset and source code will be publicly available upon acceptance.

141

Enhancing Vision Language Corruption Robustness using Cross Distribution & Prompted Denoisers

Sameer Shafayet Latif ⋅ Sadab Shiper ⋅ K. Kiran ⋅ Md Ishmam ⋅ MD HOSSAIN ⋅ Abu Kamal ⋅ Md. Ashmafee

The current generation of Vision Language Models (VLMs) has excelled in ideal conditions, but their performance drops significantly when exposed to realistic multimodal corruptions, e.g., blurry images or grammatically incorrect text. Our work addresses this by establishing a novel multimodal corruption and denoising benchmark, with a rich suite of 18 visual and 18 textual corruption functions, to evaluate the system robustness of VLMs. To enhance robustness, we employ: (i) cross-distribution visual denoisers, inspired by the Mixture of Experts (MoE) architecture, and (ii) a prompted zero-shot textual denoiser. Experimental results show up to a 5.5% overall accuracy gain and up to 9% improvement in certain VL tasks. Our experiments reveal the vulnerability of models against specific corruptions and the over-reliance on the textual modality. We envision that the detailed behavioral insights from our benchmark will help in developing robust VLM systems.

142

BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining

Ajinkya Khoche ⋅ Gergő Nagy ⋅ Maciej Wozniak ⋅ Thomas Gustafsson ⋅ Patric Jensfelt

Zero-shot 3D object classification is crucial for real-world applications like autonomous driving; however, it is often hindered by a significant domain gap between the synthetic data used for training and the sparse, noisy LiDAR scans encountered in the real world. Current methods trained solely on synthetic data fail to generalize to outdoor scenes, while those trained only on real data lack the semantic diversity to recognize rare or unseen objects. We introduce BlendCLIP, a multimodal pretraining framework that bridges this synthetic-to-real gap by strategically combining the strengths of both domains. We first propose a pipeline to generate a large-scale dataset of object-level triplets (each consisting of a point cloud, image, and text description) mined directly from real-world driving data and human-annotated 3D boxes. Our core contribution is a curriculum-based data mixing strategy that first grounds the model in the semantically rich synthetic CAD data before progressively adapting it to the specific characteristics of the autonomous driving setting. Our experiments show that our approach is highly label-efficient: introducing as few as 1.5% real-world samples into training boosts zero-shot accuracy on the nuScenes benchmark by 27%. Consequently, our final model achieves state-of-the-art performance on challenging outdoor datasets like nuScenes and TruckScenes, while maintaining strong generalization on diverse synthetic benchmarks. Our findings demonstrate that effective domain adaptation, not full-scale real-world annotation, is the key to unlocking robust open-vocabulary 3D perception. Our code and dataset will be released upon acceptance.

143

Towards Egocentric 3D Hand Pose Estimation in Unseen Domains

Wiktor Mucha ⋅ Michael Wray ⋅ Martin Kampel

We present V-HPOT, a novel approach for improving the zero-shot performance of 3D hand pose estimation from egocentric images across diverse, unseen domains. State-of-the-art methods demonstrate strong performance when trained and tested within the same domain. However, they struggle to generalise to new environments due to limited training data and depth perception that overfits to specific camera intrinsics. Our method addresses this by estimating keypoint z-coordinates in a virtual camera space, normalised by focal length and image size, enabling camera-agnostic depth prediction. We further leverage this invariance to camera intrinsics to propose a self-supervised test-time optimisation strategy that refines the model's depth perception during inference. This is achieved by applying a 3D consistency loss between predicted and in-space scale-transformed hand poses, allowing the model to adapt to target domain characteristics without requiring ground truth annotations. V-HPOT significantly improves 3D hand pose estimation performance in zero-shot scenarios, achieving a 71% reduction in mean pose error on the H2O dataset and a 41% reduction on the AssemblyHands dataset. Compared to state-of-the-art methods, V-HPOT outperforms all single-stage approaches across all datasets and competes closely with two-stage methods, despite needing 3.5x to 14x less data. The code is planned to be released under https://github.com/tobereleased.

144

Direct Visual Grounding by Directing Attention of Visual Tokens

Parsa Esmaeilkhani ⋅ Longin Jan Latecki

Vision Language Models (VLMs) mix visual tokens and text tokens. A puzzling issue is that the visual tokens most related to the query receive little to no attention from the answer tokens in the final layers of the LLM module of VLMs, where all tokens, in particular visual and language tokens, are treated equally in the attention layers. This fact may result in wrong answers to visual questions, as our experimental results confirm. It appears that the standard next-token prediction (NTP) loss provides an insufficient signal for directing attention to visual tokens. We hypothesize that a more direct supervision of the attention of visual tokens to corresponding language tokens in the LLM module of VLMs will lead to improved performance on visual tasks. To demonstrate that this is indeed the case, we propose a novel loss function that directly supervises the attention of visual tokens. It directly grounds the answer language tokens in images by directing their attention to the relevant visual tokens. This is achieved by aligning the attention distribution of visual tokens to ground truth attention maps with KL divergence. The ground truth attention maps are task-specific but are generated automatically. The obtained KL attention loss (KLAL), when combined with NTP, encourages VLMs to attend to relevant visual tokens while generating answer tokens. This results in notable enhancements in performance across several geometric tasks, as shown by our experimental findings. We also introduce a new dataset to evaluate the line tracing abilities of VLMs. Surprisingly, even commercial VLMs do not perform well on this task.
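
The alignment step described in this abstract can be sketched as a KL divergence between a ground-truth attention map and the attention an answer token places on the visual tokens, added to the NTP loss. This is a minimal sketch under stated assumptions: the KL direction (target relative to predicted), the renormalisation over visual tokens, the `weight` factor, and all function names are choices made for this example and are not confirmed by the abstract.

```python
import math


def normalize(weights, eps=1e-12):
    """Rescale non-negative weights so they sum to (approximately) one."""
    total = sum(weights)
    return [w / (total + eps) for w in weights]


def kl_attention_loss(attn_to_visual, target_map, eps=1e-12):
    """KL(target || attention) over the visual tokens of one answer token.

    attn_to_visual -- attention weights from an answer token to each visual token
    target_map     -- ground-truth relevance of each visual token (assumed given)
    """
    p = normalize(target_map)
    q = normalize(attn_to_visual)
    # Terms with p_i == 0 contribute nothing to the KL sum, so they are skipped.
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def total_loss(ntp_loss, attn_to_visual, target_map, weight=1.0):
    """Combine the NTP loss with the attention term; `weight` is hypothetical."""
    return ntp_loss + weight * kl_attention_loss(attn_to_visual, target_map)
```

The attention term is zero when the attention already matches the target map and grows as attention mass moves to irrelevant visual tokens.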

145

CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow

Ruisheng Han ⋅ Kanglei Zhou ⋅ Shuang Chen ⋅ Amir Atapour-Abarghouei ⋅ Hubert P. H. Shum

Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation. Long-term AQA, as in figure skating or rhythmic gymnastics, is especially challenging since it requires modeling extended temporal dynamics while remaining robust to contextual confounders. Existing approaches either depend on costly annotations or rely on unidirectional temporal modeling, making them vulnerable to spurious correlations and unstable long-term representations. To this end, we propose CaFlow, a unified framework that integrates counterfactual de-confounding with bidirectional time-conditioned flow. To address the issue of spurious correlations, the Causal Counterfactual Regularization (CCR) module disentangles causal and confounding features in a self-supervised manner and enforces causal robustness through counterfactual interventions. To tackle unstable temporal refinement in long sequences, the BiT-Flow module explicitly models forward and backward dynamics with a cycle-consistency constraint, producing smoother and more coherent representations. Extensive experiments on multiple long-term AQA benchmarks demonstrate that CaFlow achieves state-of-the-art performance, validating its effectiveness.