

Poster Session

Poster Session 1

Sun 8 Mar 11:15 a.m. PDT — 1 p.m. PDT


1
DreamAnywhere: Object-Centric Panoramic 3D Scene Generation

Edoardo Dominici ⋅ Jozef Hladký ⋅ Floor Verhoeven ⋅ Lukas Radl ⋅ Thomas Deixelberger ⋅ Stefan Ainetter ⋅ Philipp Drescher ⋅ Stefan Hauswiesner ⋅ Arno Coomans ⋅ Giacomo Nazzaro ⋅ Konstantinos Vardis ⋅ Markus Steinberger

Recent advances in text-to-3D scene generation have demonstrated significant potential to transform content creation across multiple industries. Although the research community has made impressive progress in addressing the challenges of this complex task, existing methods often generate environments that are only front-facing, lack visual fidelity, exhibit limited scene understanding, and are typically fine-tuned for either indoor or outdoor settings. In this work, we address these issues and propose DreamAnywhere, a modular system for the fast generation and prototyping of 3D scenes. Our system synthesizes a 360° panoramic image from text, decomposes it into background and objects, constructs a complete 3D representation through hybrid inpainting, and lifts object masks to detailed 3D objects that are placed in the virtual environment. DreamAnywhere supports immersive navigation and intuitive object-level editing, making it ideal for scene exploration, visual mock-ups, and rapid prototyping -- all with minimal manual modeling. These features make our system particularly suitable for low-budget movie production, enabling quick iteration on scene layout and visual tone without the overhead of traditional 3D workflows. Our modular pipeline is highly customizable as it allows components to be replaced independently. Compared to current state-of-the-art text and image-based 3D scene generation approaches, DreamAnywhere shows significant improvements in coherence in novel view synthesis and achieves competitive image quality, demonstrating its effectiveness across diverse and challenging scenarios. A comprehensive user study demonstrates a clear preference for our method over existing approaches, validating both its technical robustness and practical usefulness.


2
ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models

Sibo Dong ⋅ Ismail Shaheen ⋅ Maggie Shen ⋅ Rupayan Mallick ⋅ Sarah Bargal

Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key challenge is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past image-text pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose ViSTA, a multi-modal history adapter for text-to-image diffusion models. It consists of (1) a multi-modal history fusion module to extract relevant history features and (2) a history adapter to condition the generation on the extracted relevant features. We also introduce a salient history selection strategy during inference, where the most salient history text-image pair is selected, improving the quality of the conditioning. Furthermore, we propose to employ TIFA, a Visual Question Answering-based metric, to assess text-image alignment in visual storytelling, providing a more targeted and interpretable assessment of generated images. Evaluated on the StorySalon and FlintStonesSV datasets, our proposed ViSTA model is not only consistent across different frames, but also well-aligned with the narrative text descriptions.


3
Odo: Depth-Guided Diffusion for Identity-Preserving Body Reshaping

Siddharth Khandelwal ⋅ Sridhar Kamath ⋅ Arjun Jain

Human shape editing enables controllable transformation of a person's body shape, such as thin, muscular, or overweight, while preserving pose, identity, clothing, and background. Unlike human pose editing, which has advanced rapidly, shape editing remains relatively underexplored. Current approaches typically rely on 3D morphable models or image warping, often introducing unrealistic body proportions, texture distortions, and background inconsistencies due to alignment errors and deformations. A key limitation is the lack of large-scale, publicly available datasets for training and evaluating body shape manipulation methods. In this work, we introduce the first large-scale dataset of 18,573 images across 1,523 subjects, specifically designed for controlled human shape editing. It features diverse variations in body shape, including fat, muscular and thin, captured under consistent identity, clothing, and background conditions. Using this dataset, we propose Odo, an end-to-end diffusion-based method that enables realistic and intuitive body reshaping guided by simple semantic attributes. Our approach combines a frozen UNet that preserves fine-grained appearance and background details from the input image with a ControlNet that guides shape transformation using target SMPL depth maps. Extensive experiments demonstrate that our method outperforms prior approaches, achieving per-vertex reconstruction errors as low as 7.5 mm, significantly lower than the 13.6 mm observed in baseline methods, while producing realistic results that accurately match the desired target shapes.


4
BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis

Seong-Eun Hong ⋅ SooBin Lim ⋅ JuYeong Hwang ⋅ Minwook Chang ⋅ Hyeongyeop Kang

Text-to-motion generation allows language-driven animation, yet current models struggle to deliver long-range coherence and fine-grained limb coordination. A competitive system must (i) preserve temporal consistency across hundreds of frames, (ii) synchronize limb motions, and (iii) align nuanced sentences with a spectrum of plausible trajectories. We introduce BiPO, the first part-based bidirectional autoregressive network trained with a lightweight Partial Occlusion regulariser. Each limb attends to both past and future frames for anticipatory coordination, while stochastic masking weakens spurious cross-part dependencies and encourages varied solutions. On HumanML3D and KIT-ML, BiPO lowers FID by 15–30% relative to MoMask and BAMM, secures the highest human-perceived realism scores, and sets new state-of-the-art results on motion-editing tasks requiring infill from partial sequences. These findings demonstrate that bidirectional reasoning coupled with Partial Occlusion yields a length-agnostic, high-fidelity framework for expressive, language-conditioned motion synthesis.

Recent advances in reinforcement learning (RL) have enabled effective reward-based finetuning of text-to-image diffusion models, improving their alignment with user preferences. However, existing RL methods typically optimize only the denoising UNet while relying on fixed generation strategies, limiting their flexibility and controllability. In this work, we propose ADOPT, an adaptive diffusion policy training framework that unifies the optimization of Classifier-Free Guidance (CFG) scaling and timestep embedding modulation within a single RL paradigm. Specifically, ADOPT learns a prompt-conditioned policy to adjust the CFG strength dynamically and to modulate timestep embeddings via learnable curve-based scaling, enhancing both semantic guidance and temporal understanding of the diffusion process. Extensive experiments demonstrate that ADOPT consistently improves semantic alignment, aesthetic quality, and human preference scores across diverse prompt datasets, while maintaining efficient inference cost. Our results highlight the potential of jointly optimizing adaptive control strategies to unlock greater flexibility and performance for reward-driven diffusion generation.
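
For context on the CFG scaling mentioned above: standard classifier-free guidance blends an unconditional and a conditional noise prediction using a scalar guidance scale. The sketch below shows only that blending step. The function name and the plain-list representation are illustrative assumptions, and the fixed `scale` argument merely stands in for the value that, per the abstract, a prompt-conditioned policy would supply; it is not ADOPT's implementation.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance blend of two noise predictions.

    eps_uncond, eps_cond: equal-length sequences of floats (hypothetical
    stand-ins for the denoiser's unconditional / conditional outputs).
    scale: guidance strength; 0 gives the unconditional prediction,
    1 gives the conditional one, larger values extrapolate past it.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# A larger scale pushes the result further along the (cond - uncond) direction.
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], scale=2.0)
```

Making `scale` a per-prompt quantity rather than a global constant is the kind of adaptive control the abstract describes.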

Recent point cloud frame interpolation methods predict an interpolated frame by merging two intermediate frames constructed through scene flow estimation. However, because these methods merge the frames with a generative approach, generation errors may accumulate on top of the scene flow estimation errors, degrading the interpolation performance. In this paper, we propose a point cloud frame interpolation method with time-aware point cloud sampling and a self-supervised learning strategy, termed TS-PCI. The proposed method introduces a time-aware learning-based point cloud sampling model to merge the two frames into a single frame in a non-generative approach. The proposed method also introduces an attention-based geometry refinement model to improve the geometric quality of the sampled point clouds. Furthermore, the proposed method adopts a self-supervised strategy that dynamically creates ground truth labels for point cloud sampling, allowing the models to be trained in an end-to-end manner. Experimental results on three large-scale datasets show that the proposed method achieves superior performance compared to state-of-the-art methods.

We propose two algorithms for 3D symmetry detection based on enhanced back-projection of vision features extracted from foundation vision models such as DINOv2. Our method enhances back-projection by rendering multiple views of 3D objects, extracting features, and projecting them onto the geometry with two key improvements—Fibonacci view sampling and view rotations—that increase robustness and accuracy. Using these features, we detect symmetry planes and axes through two dedicated algorithms. Experiments on ShapeNet show that our plane detection approach outperforms both traditional geometric and learning-based methods by a wide margin. The method is also efficient, running in seconds on a single 8GB GPU, making it practical for large-scale or real-world applications. Overall, our results demonstrate that enhanced back-projection of vision features offers a simple yet effective framework for solving fundamental 3D geometric problems such as symmetry detection.
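
The Fibonacci view sampling mentioned above presumably refers to the well-known Fibonacci (golden-angle) lattice on the sphere, which spreads viewpoints roughly evenly around an object. The following is a minimal sketch of that general technique, not the authors' code; the function name and the choice of returning unit direction vectors are assumptions.

```python
import math


def fibonacci_sphere(n):
    """Return n roughly evenly spaced unit vectors on the sphere.

    Each vector can be read as a camera direction around an object
    centred at the origin (an illustrative interpretation).
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n      # heights evenly spaced in (-1, 1)
        r = math.sqrt(1.0 - z * z)         # radius of the circle at height z
        theta = golden_angle * i           # rotate by the golden angle each step
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points


views = fibonacci_sphere(32)
```

Compared with a latitude-longitude grid, this layout avoids clustering views near the poles, which is one plausible reason it would help multi-view feature back-projection.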


8
OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting

Atakan Topaloğlu ⋅ Kunyi Li ⋅ Michael Niemeyer ⋅ Nassir Navab ⋅ Ahmet Tekalp ⋅ Federico Tombari

Sparse-view novel view synthesis is fundamentally ill-posed due to severe geometric ambiguity. Current methods are caught in a trade-off: regressive models are geometrically faithful but incomplete, whereas generative models can complete scenes but often introduce structural inconsistencies. We propose OracleGS, a novel framework that reconciles generative completeness with regressive fidelity for sparse view Gaussian Splatting. Instead of using generative models to patch incomplete reconstructions, our "propose-and-validate" framework first leverages a pre-trained 3D-aware diffusion model to synthesize novel views to propose a complete scene. We then repurpose a multi-view stereo (MVS) model as a 3D-aware oracle to validate the 3D uncertainties of generated views, using its attention maps to reveal regions where the generated views are well-supported by multi-view evidence versus where they fall into regions of high uncertainty due to occlusion, lack of texture, or direct inconsistency. This uncertainty signal directly guides the optimization of a 3D Gaussian Splatting model via an uncertainty-weighted loss. Our approach conditions the powerful generative prior on multi-view geometric evidence, filtering hallucinatory artifacts while preserving plausible completions in under-constrained regions, outperforming state-of-the-art methods on datasets including Mip-NeRF 360 and NeRF Synthetic.
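
The abstract does not give the exact form of the uncertainty-weighted loss, so the sketch below shows only one common way such a loss can be written: a per-pixel L1 error in which pixels the oracle considers uncertain are down-weighted. The weighting rule `1 - uncertainty`, the normalisation, and all names are assumptions made for illustration.

```python
def uncertainty_weighted_l1(rendered, generated, uncertainty):
    """Weighted mean absolute error between two flattened images.

    rendered:    pixel values rendered from the 3D model
    generated:   pixel values of the diffusion-proposed view
    uncertainty: per-pixel values in [0, 1]; 1 means "do not trust"
    (All three are equal-length sequences of floats; hypothetical inputs.)
    """
    weights = [1.0 - u for u in uncertainty]
    total = sum(w * abs(r - g) for w, r, g in zip(weights, rendered, generated))
    norm = sum(weights)
    # If every pixel is fully uncertain there is no supervision signal.
    return total / norm if norm > 0 else 0.0


# Only the first pixel is trusted, so only its error contributes.
loss = uncertainty_weighted_l1([0.0, 0.0], [1.0, 5.0], [0.0, 1.0])
```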


9
UnderWater SLAM with Laser-light sectioning method using ST-GAT

Heyang Gao ⋅ Kazuto Ichimaru ⋅ Takafumi Iwaguchi ⋅ Hiroshi Kawasaki

Multi-line laser ID assignment is crucial for underwater 3D reconstruction but fails when lines fragment. We reformulate this as a graph-based sequence labeling task and propose a novel two-stage hierarchical framework using Spatio-Temporal Graph Attention Networks (ST-GAT). Our method first reasons over a spatio-temporal graph of laser endpoints and intersections to handle local fragmentation, then elevates this to a global segment-level optimization with trajectory-constrained Viterbi decoding to ensure temporal consistency. This GNN-based approach eliminates the reliance on complete epipolar geometry. Experiments on real underwater datasets demonstrate superior reconstruction completeness and temporal stability, especially in challenging environments where traditional methods fail.
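
To illustrate why Viterbi decoding helps temporal consistency, here is a minimal sketch of plain Viterbi decoding over log-scores. It is the textbook algorithm, not the trajectory-constrained variant the abstract describes; the score tables and names are hypothetical.

```python
def viterbi(emission, transition):
    """Return the highest-scoring label sequence.

    emission[t][s]:   log-score of label s at step t
    transition[p][s]: log-score of moving from label p to label s
    """
    n_states = len(emission[0])
    score = list(emission[0])
    back = []                                   # back-pointers per step
    for t in range(1, len(emission)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + transition[p][s])
            new_score.append(score[best_prev] + transition[best_prev][s]
                             + emission[t][s])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# The middle frame slightly prefers label 1 on its own, but the switching
# penalty keeps the decoded sequence on label 0 throughout.
emission = [[0.0, -5.0], [-1.0, -0.9], [0.0, -5.0]]
transition = [[0.0, -3.0], [-3.0, 0.0]]
labels = viterbi(emission, transition)
```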


10
Leveraging Pretrained Representations for Cross-Modal Point Cloud Completion

Kshitij Kale ⋅ Hrishikesh U ⋅ V Sreenidhe ⋅ Shylaja S

The utility of 3D point clouds in critical applications like robotics is often hindered by their inherent incompleteness, a result of real-world occlusions and limited sensor viewpoints. To overcome this, image-guided 3D point cloud completion aims to reconstruct complete shapes by leveraging a corresponding 2D image. However, current methods typically train a cross-modal network from scratch, often failing to capture the high-level semantic context and complex structural information required for robust reconstruction. This paper challenges that paradigm by demonstrating that preexisting knowledge from large-scale, pretrained vision models can be effectively leveraged to guide the completion process. We introduce a novel Dual Branch Image Encoder, a dedicated module designed to extract and fuse rich semantic priors from a pretrained Vision Transformer with geometric depth cues. This fused representation provides a powerful, multifaceted guide that is integrated into EGIInet, a state-of-the-art point cloud completion network. Our experiments show that by conditioning the completion on these strong, pretrained priors, our method outperforms existing state-of-the-art techniques by 7% without changing the rest of the architecture, producing more semantically coherent and structurally accurate 3D shapes.


11
Referring Change Detection in Remote Sensing Imagery

Yilmaz Korkmaz ⋅ Jay Paranjape ⋅ Celso de Melo ⋅ Vishal Patel

Change detection in remote sensing imagery is essential for applications such as urban planning, environmental monitoring, and disaster management. Traditional change detection methods typically identify all changes between two temporal images without distinguishing the types of transitions, which can lead to results that may not align with specific user needs. Although semantic change detection methods have attempted to address this by categorizing changes into predefined classes, these methods rely on rigid class definitions and fixed model architectures, making it difficult to mix datasets with different label sets or reuse models across tasks, as the output channels are tightly coupled with the number and type of semantic classes. To overcome these limitations, we introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. By integrating language understanding with visual analysis, our approach allows users to specify the exact type of change they are interested in. However, training models for RCD is challenging due to the limited availability of annotated data and severe class imbalance in existing datasets. To address this, we propose a two-stage framework consisting of (I) RCDNet, a cross-modal fusion network designed for referring change detection, and (II) RCDGen, a diffusion-based synthetic data generation pipeline that produces realistic post-change images and change maps for a specified category using only the pre-change image, without relying on semantic segmentation masks and thereby significantly lowering the barrier to scalable data creation. Experiments across multiple datasets show that our framework enables scalable and targeted change detection. Code and synthetic data will be made publicly available after publication.


12
Inpaint360GS: Efficient Object-Aware 3D Inpainting via Gaussian Splatting for 360° Scenes

Shaoxiang Wang ⋅ Shihong Zhang ⋅ Christen Millerdurai ⋅ Rüdiger Westermann ⋅ Didier Stricker ⋅ Alain Pagani

Despite recent advances in single-object front-facing inpainting using NeRF and 3D Gaussian Splatting (3DGS), inpainting in complex 360° scenes remains largely underexplored. This is primarily due to three key challenges: (i) identifying target objects in the 3D field of 360° environments, (ii) dealing with severe occlusions in multi-object scenes, which makes it hard to define regions to inpaint, and (iii) maintaining consistent and high-quality appearance across views effectively. To tackle these challenges, we propose Inpaint360GS, a flexible 360° editing framework based on 3DGS that supports multi-object removal and high-fidelity inpainting in 3D space. By distilling 2D segmentation into 3D and leveraging virtual camera views for contextual guidance, our method enables accurate object-level editing and consistent scene completion. We further introduce a new dataset tailored for 360° inpainting, addressing the lack of ground truth object-free scenes. Experiments demonstrate that Inpaint360GS outperforms existing baselines and achieves state-of-the-art performance. The dataset and code will be released to facilitate future research.


13
A Multi-Agent Diffusion Approach for MRI Anomaly Segmentation via Modality-Specific LoRA Specialization

Wafa Ghallabi ⋅ Muhammad Zaigham Zaheer ⋅ Ritesh Thawkar ⋅ Omkar Thawakar ⋅ Salman Khan ⋅ Fahad Khan

Unsupervised anomaly segmentation in multi-sequence MRI is a promising way to scale lesion screening, but existing reconstruction-based methods face three persistent issues: they fail to generalize across modalities, they depend on hand-crafted masking or paired translations, and they often require separate models with high inference cost. In this work, we take a stepwise approach to address these limitations. In the first stage, we fully fine-tune a diffusion model on healthy brain MRI slices pooled across T1, T2, and FLAIR, which produces anatomically consistent reconstructions. To improve results further, we introduce a lightweight second stage where modality-specific LoRA adapters are trained on top of the pretrained diffusion backbone. A simple router automatically selects the right adapter for each input, effectively turning the system into a modality-aware multi-agent framework. To further stabilize reconstructions, we incorporate a learnable latent-frequency mask that suppresses non-informative spectral components and preserves structural detail. This design allows the model to emphasize healthy anatomy while efficiently capturing modality-dependent contrasts. This two-stage strategy boosts Dice to 88% on BraTS2021 (FLAIR), achieving state-of-the-art performance. Experiments on BraTS2021, ISLES, and ATLAS datasets confirm that the approach consistently improves Dice and SSIM across all modalities, outperforming diffusion, masking, and cycle-based baselines, and offering a practical balance of accuracy, robustness, and efficiency for clinical MRI anomaly detection. Our code and trained model will be publicly released.
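
For readers unfamiliar with the Dice metric reported above, it measures the overlap between a predicted and a reference binary mask as twice the intersection divided by the sum of the two mask sizes. The sketch below is a generic illustration on flattened 0/1 masks; the function name and the convention of returning 1.0 for two empty masks are assumptions, not details from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks (0/1 values)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Prediction marks two pixels, reference marks one of them:
# intersection = 1, sizes = 2 + 1, so Dice = 2/3.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```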


14
GenHSI: Controllable Generation of Human-Scene Interaction Videos

Zekun Li ⋅ Rui Zhou ⋅ Rahul Sajnani ⋅ Xiaoyan Cong ⋅ Daniel Ritchie ⋅ Srinath Sridhar

Large-scale pre-trained video diffusion models have exhibited remarkable capabilities in diverse video generation. However, existing solutions face several challenges in generating long videos with rich human–scene interactions (HSI), including unrealistic dynamics and affordance, lack of subject identity preservation, and the need for expensive training. To this end, we propose GenHSI, a training-free method for controllable generation of long HSI videos with 3D awareness. Taking inspiration from movie animation, we subdivide the video synthesis into three stages: (1) script writing, (2) pre-visualization, and (3) animation. Given an image of a scene and a character with a user description, we use these three stages to generate long videos that preserve human identity and provide rich and plausible HSI. Script writing converts a complex text prompt involving a chain of HSI into simple atomic actions that are used in the pre-visualization stage to generate 3D keyframes. To synthesize plausible human interaction poses in 3D keyframes, we utilize pre-trained 2D inpainting diffusion models to generate plausible 2D human interactions based on view canonicalization, which eliminates the need for multi-view fitting in previous works. We then extend these interactions to 3D using robust iterative optimization, informed by contact cues and reasoning from VLMs. Prompted by these 3D keyframes, the pretrained video diffusion models can better generate consistent long videos with plausible dynamics and affordance in a 3D-aware manner. We are the first to synthesize a long video sequence with a chain of HSI actions without training based on the image references of the scene and character. Experiments demonstrate that our method can generate HSI videos that effectively preserve scene content and character identity with plausible human-scene interaction from a single image scene.

Accurately predicting directional radio maps is essential for wireless applications, yet prior approaches primarily focus on omnidirectional signals and typically treat transmitter localization and signal map reconstruction as separate tasks. In omnidirectional settings, predicting the maximum signal location often coincides with the transmitter position, which limits the need for explicit joint modeling. However, in directional propagation—where angular effects, reflections, and building occlusions play critical roles—this assumption no longer holds. To address this gap, we propose SymNet, a unified framework that jointly predicts directional radio maps and transmitter locations from sparse signal measurements. SymNet incorporates a prediction head for transmitter localization alongside radio map reconstruction, enabling simultaneous learning of both tasks. This joint formulation leverages their complementary information and leads to consistent improvements over treating them separately. Experiments on challenging directional scenarios demonstrate that SymNet outperforms state-of-the-art baselines, achieving superior accuracy in both radio map reconstruction and transmitter localization.

Vector quantization (VQ) is a prevalent and fundamental technique that discretizes continuous feature vectors by approximating them using a codebook. As the diversity and complexity of data and models continue to increase, there is an urgent need for high-capacity, yet more compact VQ methods. This paper aims to reconcile this conflict by presenting a new approach called LooC, which utilizes an effective Low-dimensional codebook for Compositional vector quantization. Firstly, LooC introduces a parameter-efficient codebook by reframing the relationship between codevectors and feature vectors, significantly expanding its solution space. Instead of individually matching codevectors with feature vectors, LooC treats them as lower-dimensional compositional units within feature vectors and combines them, resulting in a more compact codebook with improved performance. Secondly, LooC incorporates a parameter-free extrapolation-by-interpolation mechanism to enhance and smooth features during the VQ process, which allows for better preservation of details and fidelity in feature approximation. The design of LooC leads to full codebook usage, effectively utilizing the compact codebook while avoiding the problem of collapse. Thirdly, LooC can serve as a plug-and-play module for existing methods for different downstream tasks based on VQ. Finally, extensive evaluations on different tasks, datasets, and architectures demonstrate that LooC outperforms existing VQ methods, achieving state-of-the-art performance with a significantly smaller codebook.
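
The idea of treating codevectors as lower-dimensional units that are combined within a feature vector can be sketched roughly as follows: split each feature vector into chunks of the codebook's dimension and quantize every chunk against one shared codebook. This is a simplified illustration under that assumption, with hypothetical names; it omits LooC's extrapolation-by-interpolation step and is not the authors' implementation.

```python
def nearest(codebook, vec):
    """Index of the codevector closest to vec (squared Euclidean distance)."""
    def sqdist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def quantize_compositional(codebook, feature):
    """Quantize a feature vector chunk by chunk with one shared codebook.

    codebook: list of K codevectors, each of low dimension d
    feature:  vector whose length is a multiple of d
    Returns the chosen indices and the reconstructed vector.
    """
    d = len(codebook[0])
    if len(feature) % d != 0:
        raise ValueError("feature length must be a multiple of the code dimension")
    indices, recon = [], []
    for start in range(0, len(feature), d):
        idx = nearest(codebook, feature[start:start + d])
        indices.append(idx)
        recon.extend(codebook[idx])
    return indices, recon


# Two 2-D codevectors can already represent 2**2 = 4 distinct 4-D vectors.
codebook = [[0.0, 0.0], [1.0, 1.0]]
indices, recon = quantize_compositional(codebook, [0.1, -0.1, 0.9, 1.2])
```

With K codevectors and m chunks per feature, the number of representable combinations grows as K to the power m, which is how a small codebook can cover a large solution space.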


17
End-to-End Fine-Tuning of 3D Texture Generation using Differentiable Rewards

Amirhossein Zamani ⋅ Tianhao Xie ⋅ Amir Aghdam ⋅ Tiberiu Popa ⋅ Eugene Belilovsky

While recent 3D generative models can produce high-quality texture images, they often fail to capture human preferences or meet task-specific requirements. Moreover, a core challenge in the 3D texture generation domain is that most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To alleviate these issues, we propose an end-to-end differentiable, reinforcement-learning-free framework that embeds human feedback, expressed as differentiable reward functions, directly into the 3D texture synthesis pipeline. By back-propagating preference signals through both geometric and appearance modules of the proposed framework, our method generates textures that respect the 3D geometry structure and align with desired criteria. To demonstrate its versatility, we introduce three novel geometry-aware reward functions, which offer a more controllable and interpretable pathway for creating high-quality 3D content from natural language. By conducting qualitative, quantitative, and user-preference evaluations against state-of-the-art methods, we demonstrate that our proposed strategy consistently outperforms existing approaches. We will make our implementation code publicly available upon acceptance of the paper.


18
IDEAL-M3D: Instance Diversity-Enriched Active Learning for Monocular 3D Detection

Johannes Meier ⋅ Florian Günther ⋅ Riccardo Marin ⋅ Oussema Dhaouadi ⋅ Jacques Kaiser ⋅ Daniel Cremers

Monocular 3D detection relies on just a single camera and is therefore easy to deploy. Yet, achieving reliable 3D understanding from monocular images requires substantial annotation, and 3D labels are especially costly. To maximize performance under constrained labeling budgets, it is essential to prioritize annotating samples expected to deliver the largest performance gains. This prioritization is the focus of active learning. Curiously, we observed two significant limitations in active learning algorithms for 3D monocular object detection. First, previous approaches select entire images, which is inefficient, as non-informative instances contained in the same image also need to be labeled. Second, existing methods rely on uncertainty-based selection, which in monocular 3D object detection creates a bias toward depth ambiguity. Consequently, distant objects are selected, while nearby objects are overlooked. To address these limitations, we propose IDEAL-M3D, the first instance-level pipeline for monocular 3D detection. For the first time, we demonstrate that an explicitly diverse, fast-to-train ensemble improves diversity-driven active learning for monocular 3D. We induce diversity with heterogeneous backbones and task-agnostic features, loss weight perturbation, and time-dependent bagging. IDEAL-M3D shows superior performance and significant resource saving: with just 60% of the annotations, we achieve similar or better AP$_{3D}$ on the KITTI validation and test sets compared to training the same detector on the whole dataset.


19
FAE-Net: Fashion Attribute Editing via Disentangled Latent Conditioning in Diffusion Models

Parvatam Rajith Bhargav ⋅ Gaurab Bhattacharya ⋅ Vivek B S ⋅ Jayavardhana Gubbi

Image editing using generative models has recently advanced through GAN- and diffusion-based techniques. While current image manipulation methods show considerable performance in general image editing, their effectiveness drops when extended to fashion attribute editing. This is due to multiple challenges, such as category-specific and overlapping attributes and the inherent entanglement in real-world datasets, which lead to degraded editing quality and unintended attribute shifts. To address these challenges, we propose FAE-Net (Fashion Attribute Editing Network), a latent diffusion framework that leverages disentangled latent projections for precise and reliable attribute manipulation. Our method first disentangles the latent projections to mitigate the inherent entanglement in the data and then conditions the diffusion model with those projections to improve the manipulation control in the presence of overlapping attributes. The attribute presence detector in FAE-Net handles category-specific attributes and prevents invalid attribute manipulations during inference. Extensive experiments on three large-scale datasets demonstrate that our proposed method achieves more controllable, disentangled, and faithful attribute editing compared to state-of-the-art methods.

The rapid growth of e-commerce has intensified the demand for Virtual Try-On (VTO) technologies, enabling customers to realistically visualize products overlaid on their own images. Despite recent advances, existing VTO models face challenges with fine-grained detail preservation, robustness to real-world imagery, efficient sampling, image editing capabilities, and generalization across diverse product categories. In this paper, we present DiT-VTON, a novel VTO framework that leverages an architecture based on a Diffusion Transformer (DiT), renowned for its performance on text-conditioned image generation (text-to-image), adapted here for the image-conditioned VTO task. We systematically explore multiple DiT configurations, including in-context token concatenation, channel concatenation, and ControlNet integration, to determine the best setup for VTO image conditioning. Our findings indicate that token concatenation combined with pose stitching yields the best performance. To enhance robustness, we train the model on an expanded dataset encompassing varied backgrounds, unstructured references, and non-garment categories, demonstrating the benefits of data scaling for VTO adaptability. DiT-VTON also redefines the VTO task beyond garment try-on, offering a versatile Virtual Try-All (VTA) solution capable of handling a wide range of product categories and supporting advanced image editing functionalities, such as pose preservation, precise localized region editing and refinement, texture transfer and object-level customization. Experimental results show that our model surpasses state-of-the-art methods on public datasets VITON-HD and DressCode on the VTO task, achieving superior detail preservation and robustness without reliance on additional image condition encoders. It also surpasses state-of-the-art models that have VTA and image editing capabilities on a varied dataset composed of thousands of product categories. 
As a result, DiT-VTON significantly advances VTO applicability in diverse real-world scenarios, enhancing both the realism and personalization of online shopping experiences.


21
MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data

Antoine Labatie ⋅ Michael Vaccaro ⋅ Nina Lardiere ⋅ Anatol Garioud ⋅ Nicolas Gonthier

Self-supervised learning (SSL) holds great promise for Earth observation (EO), but standard SSL methods must be adapted to the unique characteristics of EO data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and reconstruction target normalizations for multimodal, multitemporal, and multispectral data. Based on these findings, we propose MAESTRO, a novel adaptation of the Masked Autoencoder (MAE) that combines optimized fusion strategies with a target normalization scheme, introducing an effective multispectral prior as a self-supervisory signal to learn better deep representations. Evaluated on four diverse EO datasets, MAESTRO sets a new state of the art on tasks with strong multitemporal components, while remaining highly competitive on tasks dominated by a single mono-temporal modality. Code and pretrained models will be released publicly upon publication.


22
TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors

Wei-Yuan Cheng ⋅ Kai-Po Chang ⋅ Chi-Pin Huang ⋅ Fu-En Yang ⋅ Frank Wang

Dense video captioning aims to interpret and describe all temporally localized events throughout an input video. Recent state-of-the-art methods leverage large language models (LLMs) to provide detailed moment descriptions for video data. However, existing VideoLLMs still struggle to identify precise event boundaries in untrimmed videos, so the generated captions are not properly grounded. In this paper, we propose TA-Prompting, which enhances VideoLLMs via Temporal Anchors that learn to precisely localize events and prompt the VideoLLMs to perform temporal-aware video event understanding. During inference, in order to properly determine the output caption sequence from an arbitrary number of events presented within a video, we introduce an event-coherent sampling strategy to select event captions with sufficient coherence across temporal events and cross-modal similarity with the given video. Through extensive experiments on benchmark datasets, we show that TA-Prompting compares favorably against state-of-the-art VideoLLMs, yielding superior performance on dense video captioning and temporal understanding tasks including moment retrieval and temporal QA.


23
Towards Fast and Scalable Normal Integration using Continuous Components

Francesco Milano ⋅ Jen Jen Chung ⋅ Lionel Ott ⋅ Roland Siegwart

Surface normal integration is a fundamental problem in computer vision, dealing with the objective of reconstructing a surface from its corresponding normal map. Existing approaches require an iterative global optimization to jointly estimate the depth of each pixel, which scales poorly to larger normal maps. In this paper, we address this problem by recasting normal integration as the estimation of relative scales of continuous components. By constraining pixels belonging to the same component to jointly vary their scale, we drastically reduce the number of optimization variables. Our framework includes a heuristic to accurately estimate continuous components from the start, a strategy to rebalance optimization terms, and a technique to iteratively merge components to further reduce the size of the problem. Our method achieves state-of-the-art results on the standard normal integration benchmark in as little as a few seconds and achieves one-order-of-magnitude speedup over pixel-level approaches on large-resolution normal maps.


24
A framework for real-time Surgical Phase Recognition with application to Robot-Assisted Partial Nephrectomy

Marco Mezzina ⋅ Tom Vercauteren ⋅ Tinne Tuytelaars ⋅ Matthew Blaschko

Surgical practice has increasingly integrated advanced technologies to improve procedural outcomes, efficiency, and safety in modern operating rooms. Within this evolving landscape, Automated Surgical Phase Recognition (SPR) leverages Artificial Intelligence to temporally segment surgical workflows into key events, thereby supporting both real-time decision-making and off-line analysis. Despite the potential of SPR, previous research focused on short and linear surgeries, giving limited attention to the development, assessment, and deployment of real-time systems for complex surgical workflows. This work addresses these gaps by targeting the highly complex and non-linear workflow of Robot-Assisted Partial Nephrectomy (RAPN). We develop a real-time SPR system trained on 143 annotated RAPN surgical videos covering 15 distinct phases. The system incorporates a trainable canonical calibration error estimator combined with Viterbi decoding for more reliable outcomes. Additionally, we introduce a novel assessment framework designed to simultaneously evaluate off-line, real-time, and averaged SPR performance, synthesizing historical phase predictions over time. To facilitate practical deployment, we implement the SPR pipeline as an end-to-end application using the NVIDIA Holoscan platform, specifically tailored for real-time inference scenarios. The system was successfully tested during three live RAPN procedures on human patients performed in a collaborating hospital, achieving an average inference latency of 16.65 ms and an accuracy of 68.2%. Results highlight improvement in performance through the integration of Viterbi decoding in this complex surgical scenario, while canonical calibration, despite yielding only marginal gains in overall performance, enhances classification reliability. We show the feasibility of deploying a real-time SPR pipeline for RAPN, which holds promise for optimizing OR planning. The application will be available upon acceptance at https://github.com/nvidia-holoscan/holohub
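
As an illustration of the Viterbi decoding step mentioned in the abstract above, the following is a minimal sketch, not the authors' implementation. It assumes per-frame phase probabilities from some classifier and a hand-picked "sticky" transition matrix; the function name `viterbi_decode` and all numbers are hypothetical. It shows the offline form of the algorithm, whereas a real-time system would have to decode on the frames seen so far.

```python
import math


def viterbi_decode(frame_probs, transition, prior):
    """Return the most likely phase sequence for per-frame probabilities.

    frame_probs[t][s]: classifier probability of phase s at frame t (assumed input).
    transition[i][j]:  probability of moving from phase i to phase j (assumed).
    prior[s]:          probability of starting in phase s (assumed).
    """
    n_states = len(prior)
    # Work in log space to avoid numerical underflow on long videos.
    score = [math.log(prior[s]) + math.log(frame_probs[0][s]) for s in range(n_states)]
    backpointers = []
    for t in range(1, len(frame_probs)):
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: score[i] + math.log(transition[i][j]))
            pointers.append(best_i)
            new_score.append(score[best_i]
                             + math.log(transition[best_i][j])
                             + math.log(frame_probs[t][j]))
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical two-phase example: frame 2 is a noisy outlier.
probs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1],
         [0.2, 0.8], [0.1, 0.9], [0.2, 0.8]]
sticky = [[0.9, 0.1], [0.1, 0.9]]  # phases tend to persist between frames
raw = [max(range(2), key=lambda s: p[s]) for p in probs]
smoothed = viterbi_decode(probs, sticky, [0.5, 0.5])
print(raw)       # per-frame argmax
print(smoothed)  # temporally consistent sequence
```

In this toy input the per-frame argmax flips phase for a single frame, while the decoded sequence keeps one phase change, which is the kind of temporal consistency such decoding is meant to provide.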

When we train models on biased datasets, they not only reproduce data biases, but can worsen them at test time --- a phenomenon called bias amplification. Many of the current bias amplification metrics (e.g., $BA_{\rightarrow}$, DPA) measure bias amplification only in classification datasets. These metrics are ineffective for image captioning datasets, as they cannot capture the language semantics of a caption. Recent work introduced Leakage in Captioning (LIC), a language-aware bias amplification metric that understands caption semantics. However, LIC has a crucial limitation: it cannot identify the source of bias amplification in captioning models. We propose Directional Bias Amplification in Captioning (DBAC), a language-aware and directional metric that can identify when captioning models amplify biases. DBAC has two more improvements over LIC: (1) it is less sensitive to sentence encoders (a hyperparameter in language-aware metrics), and (2) it provides a more accurate estimate of bias amplification in captions. Our experiments on gender and race attributes in the COCO captions dataset show that DBAC is the only reliable metric to measure bias amplification in captions.


26
Forget Less by Learning Together through Concept Consolidation

Arjun Kaushik Kaushik ⋅ Naresh Kumar Devulapally ⋅ Vishnu Lokhande ⋅ Nalini Ratha ⋅ Venu Govindaraju

Custom Diffusion Models (CDMs) have gained significant attention due to their remarkable ability to personalize generative processes. However, existing CDMs suffer from catastrophic forgetting when continuously learning new concepts. Most prior works attempt to mitigate this issue under the sequential learning setting with a fixed order of concept inflow and neglect inter-concept interactions. In this work, we propose a novel framework, Forget Less by Learning Together (FL2T), that enables concurrent and order-agnostic concept learning while addressing catastrophic forgetting. Specifically, we introduce a set-invariant inter-concept learning module where proxies guide feature selection across concepts, facilitating improved knowledge retention and transfer. By leveraging inter-concept guidance, our approach preserves old concepts while efficiently incorporating new ones. Extensive experiments across three datasets demonstrate that our method significantly improves concept retention and mitigates catastrophic forgetting, highlighting the effectiveness of inter-concept catalytic behavior in incremental concept learning of ten tasks, with at least a 2% gain on average Image Alignment scores.


27
Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?

Manuel Benavent-Lledo ⋅ Konstantinos Bacharidis ⋅ Victoria Manousaki ⋅ Konstantinos Papoutsakis ⋅ Antonis Argyros ⋅ José García-Rodríguez

Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, as humans we can often predict upcoming actions by observing a single moment from a scene, when given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, based on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained either through textual summaries from Vision-Language Models, or from predictions generated by a single-frame action recognizer. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively compared to both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.


28
VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction

Stephane Da Silva Martins ⋅ Emanuel Aldea ⋅ Sylvie Le Hégarat-Mascle

Multi-agent trajectory prediction is a key task in computer vision for autonomous systems, particularly in dense and socially interactive environments. Existing methods often struggle to jointly model goal-driven behavior and complex social dynamics, which leads to unrealistic predictions. In this paper, we introduce VISTA, a recursive goal-conditioned transformer architecture that features (1) a cross-attention fusion mechanism to integrate long-term goals with past motion, (2) a social-token attention module enabling fine-grained interaction modeling across agents, and (3) pairwise attention maps to show social influence patterns during inference. Our model enhances the single-agent goal-conditioned approach into a cohesive multi-agent forecasting framework. In addition to the standard evaluation metrics, we also consider trajectory collision rates, which capture the realism of the joint predictions. Evaluated on the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy with dramatically improved interaction modeling. On MADRAS, our approach reduces the average collision rate of strong baselines from 2.14% to 0.03%, and on SDD, it achieves a 0% collision rate while outperforming SOTA models in terms of ADE/FDE and minFDE. These results highlight the model’s ability to generate socially compliant, goal-aware, and interpretable trajectory predictions, making it well-suited for deployment in safety-critical autonomous systems.
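
To make the collision-rate metric in the abstract above concrete, here is a minimal sketch of one common way such a rate can be computed. The abstract does not state the exact definition used, so this is an assumption: a pair of agents counts as colliding if their predicted positions come closer than a threshold at the same timestep, and the rate is the fraction of colliding pairs. The name `collision_rate` and the threshold are hypothetical.

```python
import math
from itertools import combinations


def collision_rate(trajectories, min_distance):
    """Fraction of agent pairs whose predicted paths collide.

    trajectories: one list of (x, y) positions per agent, time-aligned.
    min_distance: distance below which two agents are considered colliding
                  (an assumed threshold, not taken from the abstract).
    """
    pairs = list(combinations(range(len(trajectories)), 2))
    if not pairs:
        return 0.0
    colliding = 0
    for a, b in pairs:
        # Compare the two agents at matching timesteps only.
        if any(math.dist(p, q) < min_distance
               for p, q in zip(trajectories[a], trajectories[b])):
            colliding += 1
    return colliding / len(pairs)


# Hypothetical scene: agents 0 and 1 nearly touch at the second timestep.
scene = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 1.0), (1.0, 0.05), (2.0, 1.0)],
    [(5.0, 5.0), (5.0, 6.0), (5.0, 7.0)],
]
rate = collision_rate(scene, min_distance=0.1)
print(f"{rate:.3f}")
```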


29
ChameleonTuner: Automatic ISP Color Tuning in Subjective Scenarios

Zijie Tan ⋅ Yuxin Yue ⋅ Bahador Rashidi

Tuning parameters in camera image signal processing (ISP) modules, such as 3D lookup tables (3D LUTs), is essential for generating high-quality images. In subjective scenarios, variations in field-of-view (FoV) and point-of-view (PoV) between source and target images introduce geometric misalignments, limiting the effectiveness of existing calibration methods that rely on pixel-wise alignment. We propose ChameleonTuner, a novel framework incorporating region-level color correspondences to handle such FoV/PoV variations. Our method leverages multi-objective evolutionary search for 3D LUT optimization, offering a controllable and interpretable alternative to neural network-based approaches. Extensive experiments demonstrate that our proposed framework is effective and efficient in challenging subjective scenarios, while remaining competitive on standard benchmarks without FoV/PoV variations. Compared to the state-of-the-art 3D LUT optimization baseline, ChameleonTuner achieves a 26.7% PSNR gain and a 49.7% reduction in $\Delta E$ on DPED, a subjective cross-device color calibration dataset with mild FoV/PoV variations.
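
For readers unfamiliar with the $\Delta E$ color-difference measure cited above, the sketch below computes the simplest variant, CIE76, which is the Euclidean distance between two colors in CIELAB space. The abstract does not say which $\Delta E$ formula the authors use, so CIE76 is an assumption here, and the function names and sample values are hypothetical.

```python
import math


def delta_e_cie76(lab_a, lab_b):
    """CIE76 color difference: Euclidean distance between two (L*, a*, b*) colors."""
    return math.dist(lab_a, lab_b)


def mean_delta_e(source_colors, target_colors):
    """Average CIE76 difference over corresponding color samples."""
    if len(source_colors) != len(target_colors) or not source_colors:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(delta_e_cie76(s, t) for s, t in zip(source_colors, target_colors))
    return total / len(source_colors)


# Hypothetical region-level color pairs in CIELAB.
source = [(50.0, 10.0, 10.0), (70.0, -5.0, 20.0)]
target = [(53.0, 14.0, 10.0), (70.0, -5.0, 20.0)]
single = delta_e_cie76(source[0], target[0])
average = mean_delta_e(source, target)
print(single, average)
```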

This paper addresses the challenge of detecting complex-shaped action tubes in videos. Existing methods assume that an actor's position changes only slightly in short video clips. They therefore oversimplify the shape of action tubes by representing them either as cuboids or as learnable positional patterns. However, these solutions may produce an action tube that loses the corresponding actor when the actor's trajectory becomes complex. This is because they rely solely on position information to determine action tubes, lacking the ability to trace the same actor when their movement patterns are intricate. To address this issue, we propose Actor-related Tubelet (ART), which incorporates actor-specific information when generating action tubes. Regardless of the complexity of an actor's trajectory, ART ensures that an action tube consistently tracks the same actor, relying on actor-specific cues rather than solely on positional information. To assess ART’s effectiveness, we introduce a metric for quantifying tube shape complexity and evaluate ART on three mainstream datasets, MultiSports, UCF101-24 and JHMDB51-21, achieving substantial improvements.


31
Training-free Multi-view 4D Human Motion Reconstruction Virtual Reality System

Yijie Li ⋅ Ce Zheng ⋅ Yijie He ⋅ Joel Julin ⋅ Ryosuke Ichikari ⋅ Satoki Ogiso ⋅ Satoshi Nakae ⋅ Akihiro Sato ⋅ Takeshi Kurata ⋅ Laszlo Jeni

Human mesh recovery offers substantial potential for detailed behavior analysis and understanding of complex human-environment interactions. In this paper, we propose a novel 4D Human Motion Reconstruction Virtual Reality System that integrates advanced 4D multi-view human mesh recovery and high-quality 3D environment reconstruction using 3D Gaussian Splatting (3DGS). Our system seamlessly combines detailed 4D human behavior capture with accurate 3D environment reconstruction, significantly extending traditional visual monitoring approaches. Visualization through an interactive Virtual Reality (VR) platform enables dynamic interaction representation using accurately reconstructed virtual environments and computer-generated (CG) avatars. Experimental results from realistic scenarios validate the effectiveness of our framework in providing immersive experiences and precise human-environment modeling, demonstrating a significant advancement in a practical human-centered representation approach. Our approach consistently outperforms existing state-of-the-art methods, achieving reductions in mesh errors of 24% in PVE and 32% in MPJPE on the CHI3D dataset, and 17% in MPJPE and 64% in translation error on the Hi4D dataset compared to other multi-view methods.
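
The MPJPE and PVE figures quoted above are both mean point-to-point distances, over body joints and mesh vertices respectively. The following is a minimal sketch of that computation and of a relative error reduction. It omits any alignment step an evaluation protocol might apply beforehand, and the function names and coordinates are hypothetical.

```python
import math


def mean_point_error(predicted, ground_truth):
    """Mean Euclidean distance between corresponding 3D points.

    With joints as input this is MPJPE; with mesh vertices it is PVE.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


def relative_reduction(baseline_error, new_error):
    """Error reduction as a fraction of the baseline error."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical two-joint example (units are arbitrary).
pred_joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
true_joints = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
mpjpe = mean_point_error(pred_joints, true_joints)
reduction = relative_reduction(50.0, 34.0)
print(mpjpe, reduction)
```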


32
EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation

Liangwei Jiang ⋅ Ruida Li ⋅ Zhifeng Zhang ⋅ Shuo Fang ⋅ Chenguang Ma

This paper aims to bring fine-grained expression control while maintaining high-fidelity identity in portrait generation. This is challenging due to the mutual interference between expression and identity. On one hand, fine expression control signals inevitably introduce appearance-related semantics (e.g., facial contours and ratio), which impact the identity of the generated portrait. On the other hand, even coarse-grained expression control can cause facial changes that compromise identity, since they all act on the face. These limitations remain unaddressed by previous generation methods, which primarily rely on coarse control signals or two-stage inference that integrates portrait animation. Here, we introduce EmojiDiff, the first end-to-end solution that enables simultaneous control of extremely detailed expression (RGB-level) and high-fidelity identity in portrait generation. To address the above challenges, EmojiDiff adopts a two-stage scheme involving decoupled training and fine-tuning. For decoupled training, we introduce ID-irrelevant Data Iteration (IDI) to synthesize high-quality cross-identity expression pairs by separating and optimizing the processes of maintaining expression and altering identity. Training the model with this data, we effectively disentangle fine expression features in the expression template from other extraneous information (e.g., identity, skin). Subsequently, we present ID-enhanced Contrast Alignment (ICA) for further fine-tuning. ICA achieves rapid reconstruction and joint supervision of identity and expression information, thus aligning identity representations of images with and without expression control. Experimental results demonstrate that our method significantly outperforms its counterparts, achieving precise expression control with highly maintained identity, and generalizing well to various diffusion models.


33
OW-Rep: Open World Object Detection with Instance Representation Learning

SUNOH LEE ⋅ Minsik Jeon ⋅ Jihong Min ⋅ Junwon Seo

Open World Object Detection (OWOD) addresses realistic scenarios where unseen object classes emerge, enabling detectors trained on known classes to detect unknown objects and incrementally incorporate the knowledge they provide. While existing OWOD methods primarily focus on detecting unknown objects, they often overlook the rich semantic relationships between detected objects, which are essential for scene understanding and applications in open-world environments (e.g., open-world tracking and novel class discovery). In this paper, we extend the OWOD framework to jointly detect unknown objects and learn semantically rich instance embeddings, enabling the detector to capture fine-grained semantic relationships between instances. To this end, we propose modules that leverage the rich and generalizable knowledge of Vision Foundation Models (VFMs) and can be integrated into open-world object detectors. First, the Unknown Box Refine Module uses semantic masks from the Segment Anything Model to accurately localize unknown objects. The Embedding Transfer Module then distills instance-wise semantic similarities from VFM features to the detector's embeddings via a relaxed contrastive loss, enabling the detector to learn a semantically meaningful and generalizable instance feature. Extensive experiments show that our method significantly improves both unknown object detection and instance embedding quality, while also enhancing performance in downstream tasks such as open-world tracking.


34
Cluster-Guided Adversarial Perturbations for Robust Contrastive Learning

Seongyun Seo ⋅ Sungmin Han ⋅ Jeonghyun Lee ⋅ Sangkyun Lee

Adversarial contrastive learning aims to learn robust representations from unlabeled data by integrating adversarial training and contrastive learning. Accordingly, existing methods typically generate adversarial perturbations that maximize the contrastive loss during adversarial training. However, these approaches frequently produce ineffective perturbations, as their effectiveness heavily depends on the semantic similarity among samples within each mini-batch, which is not explicitly controlled. As a result, the improvement in robustness remains limited. To address this, we propose a novel approach that leverages the well-structured representation space learned via contrastive learning, where semantically similar samples cluster well while dissimilar ones are positioned farther apart. Exploiting this clustering structure, we construct adversarial perturbations that move samples away from a group of similar samples and toward a group of dissimilar ones, thereby inducing stronger adversarial effects. Compared to existing approaches, our method achieves significant improvements in robust accuracy by up to 4.75% against the PGD attack and 7.59% against Auto-Attack.


35
A Universal Self-Attention Enhancement for Bridging Low-bit Quantization and Vision Transformers

Jiahe Qian ⋅ Peisong Wang ⋅ Zhengyang Zhuge ⋅ Qinghao Hu ⋅ Jian Cheng

Low-bit quantization of Vision Transformers presents significant challenges due to the intrinsic properties of their Multi-head Self-attention modules. In this work, we investigate the quantization issues specific to this mechanism and identify two critical challenges. First, quantization amplifies discrepancies among attention heads, thereby impairing the model’s capability to focus on the most informative regions. Second, high-magnitude values in the attention maps, which encode essential relational information, exhibit an extremely sharp distribution that renders them especially prone to substantial information loss during quantization. To address these challenges, we propose Quantized-aware Multi-head Self-attention (Q-MHSA), a universal self-attention enhancement module that integrates two lightweight components within Multi-head Self-attention (MHSA). The Cross-head Concordance Module (CCM) enforces adaptive consistency across attention heads, while the Learnable Smoothness Controller (LSC) replaces the fixed normalization factor with an adaptive mechanism that selectively smooths the distribution of high-magnitude attention values while disregarding less informative low values. Designed for seamless integration with any Vision Transformer architecture and quantization-aware training method, Q-MHSA incurs minimal overhead while consistently improving model accuracy. For instance, under the VVTQ quantization framework applied to a 4-bit Swin-S model, the incorporation of Q-MHSA yields a top-1 accuracy of 83.5%, representing a 0.9% improvement over the baseline, while incurring a marginal overhead of 0.01% in both parameters and FLOPs.


36
Overcoming Fine-Grained Visual Challenges in Animal Re-Identification via Semantic Feature Alignment

Yihao Wu ⋅ Di Zhao ⋅ Yuzhuo Li ⋅ Matthew Alajas ⋅ Alistair Glen ⋅ Jingfeng Zhang ⋅ Gillian Dobbie ⋅ Daniel Wilson ⋅ Yun Sing Koh

Identifying individual animals at different points in space and time is vital for effective wildlife monitoring and biodiversity conservation. While existing computer vision methods have shown promise in re-identifying animals, their capability in Animal Re-Identification (Animal ReID) remains restricted by inherent visual variations, specifically high intra-identity and low inter-identity variations. High intra-identity variations refer to high visual diversity within the same individual due to pose or form changes and occlusions, and low inter-identity variations refer to subtle visual differences between distinct individuals due to fine-grained appearances. To address these challenges, we propose the Clip-based Animal RE-identification (CARE) framework, which leverages image-conditioned textual description generation and individual-level semantic feature alignment, mitigating the negative impacts of visual variations in Animal ReID. Crucially, we have packaged CARE into a stand-alone toolkit and piloted it with stakeholders, facilitating real-world wildlife monitoring for biodiversity conservation. Extensive experiments on benchmark and in-the-wild datasets further demonstrate that CARE consistently outperforms state-of-the-art methods, validating its effectiveness in Animal ReID. The code is available at https://anonymous.4open.science/r/CARE-WACV.


37
M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

Hongyu Wang ⋅ Jiayu Xu ⋅ Senwei Xie ⋅ Ruiping Wang ⋅ Jialin Li ⋅ Zhaojie Xie ⋅ Bin Zhang ⋅ Chuyan Xiong ⋅ Xilin CHEN

Multilingual capability is an essential aspect for large multimodal models, since they are usually deployed across various countries and languages. However, most existing benchmarks for multilingual multimodal reasoning struggle to differentiate between models of varying performance; even language models without visual capabilities can easily achieve high scores. This leaves a comprehensive evaluation of leading multilingual multimodal models largely unexplored. In this work, we introduce M4U, a novel and challenging benchmark for assessing the capability of multi-discipline multilingual multimodal understanding and reasoning. M4U contains 10k samples covering 64 disciplines across 16 subfields in Science, Engineering, and Healthcare in six languages. Using M4U, we conduct extensive evaluations of leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) with external tools. The evaluation results demonstrate that the state-of-the-art model, GPT-4o, achieves only 47.6% average accuracy on M4U. Additionally, we observe that the leading LMMs exhibit significant language preferences. Our in-depth analysis indicates that leading LMMs, including GPT-4o, struggle to perform reasoning using multilingual information present in both visual and textual context. Specifically, they suffer performance degradation when prompted with cross-lingual multimodal questions.


38
RAT4D: Rig and Animate Objects without Surface Templates in 4D

Mosam Dabhi ⋅ Simon Lucey ⋅ Laszlo Jeni

We present a surface-template-free method for reconstructing dense, rigged (re-animatable) 3D models from monocular videos. By combining robust pose optimization with differentiable Gaussian splatting, this work bridges the gap between flexible template-free approaches and the visual quality of template-based methods. Starting from noisy 2D keypoints, we refine 3D poses through kinematic and temporal constraints, then attach Gaussian primitives to the optimized skeleton for differentiable supervision and rendering. We demonstrate dense, rigged (re-animatable) 3D models without surface templates across humans, animals, insects, and everyday articulated objects, and we empirically show that closing the rendering–pose loop improves 3D lifting from noisy landmarks. A key enabler of these template-free reconstructions is our kinematic optimization, which reduces 3D pose error by 20–25% relative to template-free baselines; at the same time, our results approach template-based visual metrics (PSNR/SSIM within 5%). We also demonstrate the method’s practical utility by detecting and correcting geometric inconsistencies in AI-generated videos. While limited to articulated subjects with detectable keypoints, the approach provides a practical pipeline that serves as a drop-in refinement to improve 3D lifting in existing pipelines and enables the creation of rigged 3D assets from casual captures when expensive surface templates (e.g., MoCap-derived) are unavailable.


39
AFL-PRF: Adaptive Federated Learning for Low-Quality Data: Enhancing Performance, Robustness, and Fairness

Pinrui Yu ⋅ Yiming Xie ⋅ Longtian Ye ⋅ Geng Yuan ⋅ Ningfang Mi ⋅ Xue Lin

Federated learning (FL) enables collaborative model training across distributed clients while preserving privacy, yet its decentralized nature makes it vulnerable to poisoned updates and performance degradation under highly skewed data. Prior studies typically treat accuracy, robustness, and fairness separately, leaving open the challenge of a unified solution. We propose AFL-PRF, an adaptive federated learning framework that simultaneously enhances accuracy, robustness, and fairness in adversarial and heterogeneous environments. AFL-PRF integrates three key techniques. First, an exponential adaptive weighting mechanism dynamically scales client updates, suppressing poisoned or unreliable contributions while retaining meaningful signals from benign but low-quality clients. Second, a client prioritization strategy guided by the novel Weight Update Divergence (WUD) score promotes reliable updates and their benign neighbors, preventing malicious gradients from dominating aggregation. Third, sensitivity profiling identifies fully connected (FC) layers as highly vulnerable due to large weight variance, motivating a selective clipping strategy that filters extreme updates in these layers while preserving normal learning dynamics. Extensive experiments on benchmark datasets demonstrate that AFL-PRF consistently outperforms state-of-the-art baselines, achieving over 30% improvement in robustness and 20% enhancement in fairness, while maintaining superior predictive accuracy. By unifying adaptive weighting, client prioritization, and targeted clipping, AFL-PRF establishes a new benchmark for federated learning under poisoned and highly non-IID conditions.


40
Eff-GRot: Efficient and Generalizable Rotation Estimation with Transformers

Fanis Mathioulakis ⋅ Gorjan Radevski ⋅ Tinne Tuytelaars

We introduce Eff-GPose, an approach for efficient and generalizable 3D pose estimation from RGB images. Given a query image and a set of posed reference images, our method directly predicts the object’s pose in a single forward pass, without requiring object- or category-specific training. At the core of our framework is a transformer that performs a pose-aware comparison in the latent space, jointly processing enriched global representations from multiple posed references alongside a query. This design enables a favorable balance between accuracy and computational efficiency while remaining simple, scalable, and fully end-to-end. Our results demonstrate that Eff-GPose offers a promising direction toward more efficient pose estimation, particularly for latency-sensitive applications.


41
DreamMakeup: Face Makeup Customization using Latent Diffusion Models

Geon Yeong Park ⋅ Inhwa Han ⋅ Serin Yang ⋅ Yeobin Hong ⋅ Seongmin Jeong ⋅ Heechan Jeon ⋅ Myeongjin Goh ⋅ Sung Yi ⋅ Jin Nam ⋅ Jong Ye

The exponential growth of the global makeup market has paralleled advancements in virtual makeup simulation technology. Despite the progress led by GANs, their application still encounters significant challenges, including training instability and limited customization capabilities. Addressing these challenges, we introduce DreamMakeup, a novel training-free, diffusion-model-based makeup customization method that leverages the inherent advantages of diffusion models for superior controllability and precise real-image editing. DreamMakeup employs early-stopped DDIM inversion to preserve the facial structure and identity while enabling extensive customization through various conditioning inputs such as reference images, specific RGB colors, and textual descriptions. Our model demonstrates notable improvements over existing GAN-based and recent diffusion-based frameworks: improved customization, color-matching capabilities, identity preservation, and compatibility with textual descriptions or LLMs, all at affordable computational cost.

Remote sensing change detection is often complicated by spatial misalignment between image pairs, especially when observations are separated by long temporal gaps such as seasonal or multi-year intervals. Conventional CNN- and transformer-based methods perform well on aligned data, but their reliance on perfect co-registration limits their applicability in practice. Existing approaches that integrate registration and change detection generally demand task-specific training and transfer poorly across domains. We present a lightweight, modular pipeline that strengthens robustness without retraining the underlying change detection models. The framework combines rapid per-image LoRA adaptation with a compact flow refinement module trained under supervision. To mitigate large appearance differences, we generate intermediate morphing frames via a diffusion-based semantic interpolator. Consecutive frames are aligned using a registration backbone (e.g., RoMa), and the composed flows are further corrected through a residual refinement network. The refined flow is then applied to co-register the original image pairs, enabling more reliable downstream change detection. Extensive experiments on LEVIR-CD, DSIFN-CD, and WHU-CD demonstrate that the proposed pipeline significantly improves both registration accuracy and change detection performance, especially in scenarios with substantial spatial and temporal variations.


43
AdaptViG: Adaptive Vision GNN with Exponential Decay Gating

Mustafa Munir ⋅ Mostafijur Rahman ⋅ Radu Marculescu

Recent advancements in vision models have been dominated by Transformers and, more recently, Vision Graph Neural Networks (ViGs). While powerful, ViGs often face substantial computational challenges stemming from their graph construction phase, which can hinder their efficiency. To address this issue, we propose AdaptViG, an efficient and powerful hybrid Vision GNN that introduces a novel graph construction mechanism called Adaptive Graph Convolution. This mechanism builds upon a highly efficient static axial scaffold and a dynamic, content-aware pruning strategy called Exponential Decay Gating. This gating mechanism uses a division-free, numerically stable function to selectively activate long-range connections based on feature similarity. Furthermore, AdaptViG employs a hybrid strategy, utilizing our efficient gating mechanism in the early stages and a full Global Attention block in the final stage for maximum feature aggregation. Our method achieves a new state-of-the-art trade-off between accuracy and efficiency among Vision GNNs. For instance, our AdaptViG-M achieves 82.7% top-1 accuracy, outperforming ViG-B by 0.4% while using 80% fewer parameters and 84% fewer GMACs. On downstream tasks, AdaptViG-M obtains 45.8 mIoU, 44.8 APbox, and 41.1 APmask, surpassing the much larger EfficientFormer-L7 by 0.7 mIoU, 2.2 APbox, and 2.1 APmask, respectively, with 78% fewer parameters.


44
MooTrack360: A Novel Fisheye Camera Dataset for Robust Multi Dairy Cow Detection and Tracking

Rasmus Christiansen ⋅ Toan Nguyen ⋅ Lasse Malskær ⋅ Leon Bodenhagen ⋅ Dirk Kraft

MooTrack360 is a novel top-down fisheye dataset designed to support the development of robust camera surveillance and monitoring systems in large-scale, real-world environments. While centered around continuous livestock monitoring of Holstein dairy cows, the dataset addresses general challenges in computer vision such as fisheye distortion, overlapping fields of view, variable lighting, and occlusions. It includes 102,747 annotated cow instances across 1,500 images, each labeled as "standing" or "lying", along with a 1-hour annotated video sequence for tracking evaluation. In addition, several unannotated sample sequences are included to support development and qualitative analysis. The dataset enables detection and tracking under conditions ranging from daylight to infrared-assisted nighttime imaging, and facilitates both application-specific and generalized model evaluation. A detailed calibration pipeline based on the Double Sphere Camera model is provided to support distortion correction and precise spatial localization. An accompanying end-to-end training framework further addresses challenges such as illumination changes and occlusions. Benchmarks using state-of-the-art detection and tracking methods demonstrate the dataset’s potential to advance research in non-invasive, camera-based monitoring across domains. The complete dataset (including annotated images, tracking sequences, unannotated sample videos, calibration footage, and all supporting files), along with the full codebase, will be available after the double-blind peer review at: zenodo.org

An ordinary digital camera typically captures an image of a directly visible scene by measuring the light intensity (and color) arriving at each pixel of its image sensor from each small patch making up the scene. Conventional photography considers the measured light to be solely informative of the directly visible scene. However, recent research efforts have shown that subtle variations in the measured light intensity can also enable the imaging of scenes outside the direct line of sight. These methods exploit preexisting obstructions, which cast barely perceptible, but highly informative, soft shadows onto the observed planar surface. Whereas most prior works assume that exploitable occluders are partly or wholly known, or almost planar, a recent work incorporated trained diffusion-based sampling to reconstruct the hidden occluding structures in 3D jointly with a transverse 2D radiosity map of all other hidden non-occluding structures. This work proposes a novel, fully trained multipath decoding UNet (MDUNet) architecture, wherein the multimodal, multipath decoder parallels recent physics-based methods whose successes come from explicitly separating the representations and reconstructions of the occluding and non-occluding hidden scene structures. However, by sharing a latent feature representation among them, MDUNet still tightly couples the occluding and non-occluding reconstruction pathways. As such, MDUNet improves inference time by over $100\times$ relative to a state-of-the-art diffusion-based method, and by $1000\times$ relative to traditional optimization-based methods, while also improving reconstruction quality. In addition, MDUNet is trained solely in simulation but generalizes to real experimental data, while maintaining accuracy and stability even as ambient illumination increases.


46
Interleaved Vision-and-Language Generation via Generative Voken

Kaizhi Zheng ⋅ Xuehai He ⋅ Xin Wang

The effectiveness of Multimodal Large Language Models (MLLMs) demonstrates a profound capability in multimodal understanding. However, the simultaneous generation of images with coherent texts is still underdeveloped. Addressing this, we introduce a novel interleaved vision-and-language generation method, centered around the concept of "generative vokens". These vokens serve as pivotal elements contributing to coherent image-text outputs. Our method is marked by a unique two-stage training strategy for description-free multimodal generation, which does not necessitate extensive descriptions of images. We integrate classifier-free guidance to enhance the alignment of generated images and texts, ensuring more seamless and contextually relevant multimodal interactions. Our model, ViLGen, exhibits substantial improvement over the baseline models on multimodal generation datasets, including MMDialog and VIST. The human evaluation shows that ViLGen is better than the baseline model in more than 56% of cases for multimodal generation, highlighting its efficacy across diverse benchmarks.
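
For readers unfamiliar with classifier-free guidance, the standard combination rule can be sketched as follows. This shows only the widely used general formulation; how ViLGen conditions on vokens is not described in the abstract, and the function name and the list-based noise predictions are illustrative assumptions.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions.

    guided = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    A scale of 1.0 returns the conditional prediction unchanged;
    larger scales push the result further toward the condition.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical two-element noise predictions
guided = classifier_free_guidance([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
```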

An intraoral scan (IoS) provides high-resolution data on the tooth crown, but does not contain information on the tooth root and thus has limitations in applications requiring 3D models of the whole tooth, e.g., virtual dental simulators. In this paper, we consider a diffusion-based model for root completion from IoS crowns. A key challenge is the lack of ground truth, i.e., the scan data of roots are typically unavailable. To train our model, we instead use Cone-Beam CT (CBCT) data matched to IoS images, taking the CBCT crown as input and the CBCT root as the pseudo-ground truth. The difference in input data between training (CBCT crown) and inference (IoS crown) introduces a domain shift. To address this issue, we take a coarse-to-fine approach: we make a coarse prediction of roots using a Transformer encoder; introduce a Perturbed Patch Generator (PPG), which generates patches from coarse points and perturbs them with noise for a prediction that is robust to the domain shift; and use a Transformer denoiser for refined reconstruction. We also propose loss functions designed to facilitate the training of the denoiser with perturbed patches. Experiments show that our method outperforms prior techniques in various benchmark evaluations, demonstrating its robust performance in generating high-quality root data. The source code will be publicly released upon acceptance.

The rapid advancements of generative text-to-image (T2I) models and their open accessibility have enabled users to generate high-quality, photorealistic images of humans. Ethical challenges, particularly the deliberate generation of child sexual abuse material (CSAM), have been widely recognized. By contrast, the unintentional creation of such content has received little scholarly attention. The legal risks associated with this phenomenon nevertheless pose a significant threat to the increasing number of users of generative models. To investigate this issue, we conduct a comprehensive systematic evaluation of the potential of state-of-the-art T2I models to generate CSAM against users' intentions. We systematically generate datasets with prompts specifying adult subjects. Using age estimation models, we analyze the datasets regarding age compliance across different visual demographic properties and prompt variations. Our findings show that the six examined prominent T2I models generate images depicting underage individuals despite explicit adult-oriented prompts. Across various dataset settings, Stable Diffusion 3.5 Large and Qwen-Image generate the highest proportion of persons classified as underage in our experiments. We share insights and strategies to mitigate the risk of generating CSAM.


49
SGPMIL: Sparse Gaussian Process Multiple Instance Learning

Andreas Lolos ⋅ Stergios Christodoulidis ⋅ Aris Moustakas ⋅ Jose Dolz ⋅ Maria Vakalopoulou

Multiple Instance Learning (MIL) offers a natural solution for settings where only coarse, bag-level labels are available, without having access to instance-level annotations. This is usually the case in digital pathology, which consists of gigapixel-sized images. While deterministic attention-based MIL approaches achieve strong bag-level performance, they often overlook the uncertainty inherent in instance relevance. In this paper, we address the lack of uncertainty quantification in instance-level attention scores by introducing SGPMIL, a new probabilistic attention-based MIL framework grounded in Sparse Gaussian Processes (SGP). By learning a posterior distribution over attention scores, SGPMIL enables principled uncertainty estimation, resulting in more reliable and calibrated instance relevance maps. Our approach not only preserves competitive bag-level performance but also significantly improves the quality and interpretability of instance-level predictions under uncertainty. SGPMIL extends prior work by introducing feature scaling in the SGP predictive mean function, leading to faster training, improved efficiency, and enhanced instance-level performance. Extensive experiments on multiple well-established digital pathology datasets highlight the effectiveness of our approach across both bag- and instance-level evaluations. Our code is available at: https://anonymous.4open.science/r/SGPMIL_anonymous-EB81/README.md.

Vision-Language Models (VLMs) effectively integrate visual and textual information, often relying on shared embedding spaces to align modalities. However, the extent to which these spaces capture complex, subjective human judgments, such as perceived facial trustworthiness and attractiveness, and whether they replicate associated human social biases, remains underexplored. This paper investigates the representation of subjective face attributes within Multimodal LLM and VLM embedding spaces, examining whether these representations encode human-like biases and assessing their interpretability. Using probing techniques on face datasets annotated with human judgments, we analyze the structure of VLM embeddings (e.g., from CLIP-like models). Our findings demonstrate that similarity scores between face image and textual description in the VLM embedding space align with human ratings of subjective attributes like trustworthiness and attractiveness, and crucially, these representations exhibit correlations and demographic disparities mirroring known biases in human social perception. Furthermore, we show that the use of variable context via face and attribute-specific captions can significantly improve the alignment of the VLM embedding space with human impressions. Interpreting the embedded social biases highlights the need for critical evaluation and bias-aware development of VLMs to mitigate the risk of perpetuating harmful stereotypes in downstream applications that involve Human-AI interaction.


51
M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models

Ju-Hsuan Weng ⋅ Jia-Wei Liao ⋅ Cheng-Fu Chou ⋅ Jun-Cheng Chen

Text-to-image diffusion models may generate harmful or copyrighted content, motivating research on concept erasure. However, existing approaches primarily focus on erasing concepts from text prompts, overlooking other input modalities that are increasingly critical in real-world applications such as image editing and personalized generation. These modalities can themselves become attack surfaces, where erased concepts re-emerge despite defenses. To bridge this gap, we introduce M-ErasureBench, a novel multimodal evaluation framework that systematically benchmarks concept erasure methods across three input modalities: (1) text prompts, (2) learned embeddings, and (3) inverted latents. For the latter two, we evaluate both white-box and black-box access, yielding a comprehensive set of five evaluation scenarios. Our analysis shows that existing methods achieve strong erasure performance against text prompts but largely fail under learned embeddings and latent inversion, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. To address these vulnerabilities, we propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments demonstrate that IRECE consistently restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion scenario, while preserving visual quality. To the best of our knowledge, M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together with IRECE, it delivers practical safeguards for building more reliable protective generative models.


52
Revisiting Layer Normalization for Point Cloud Test Time Adaptation

Moslem Yazdanpanah ⋅ Ali Bahri ⋅ Mehrdad Noori ⋅ Sahar Dastani ⋅ Samuel Barbeau ⋅ David OSOWIECHI ⋅ Gustavo Vargas Hakim ⋅ Ismail Ayed ⋅ Christian Desrosiers

We analyze Layer Normalization (LN) from a domain (batch) perspective and explain why BatchNorm-style test-time fixes often fail on Transformer backbones. As feature dimension and batch size grow, the per-feature batch marginals after LN's pre-affine step concentrate at mean $\approx 0$ and variance $\approx 1$, making cross-batch re-standardization unnecessary and often harmful. This yields a simple rule: keep the pre-affine LN intact and adjust only the post-affine mean and gain. We instantiate this with LN-TTA, a backpropagation-free and source-free test-time adaptation method that performs a single forward pass and uniformly reparameterizes each LN layer. On three corrupted 3D point-cloud suites (ScanObjectNN-C, ModelNet40-C, ShapeNet-C), LN-TTA improves over Source-Only by $+12.35$, $+15.58$, and $+3.03$ points, surpasses backpropagation baselines (e.g., TENT), and sustains up to $93$ samples/s, on average $39\times$ faster and $5\times$ more memory-efficient than the next-best backprop-free method. The implementation will be publicly available.
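
The split between LN's pre-affine and post-affine steps that the rule above relies on can be sketched as follows. This is only the textbook Layer Normalization decomposition for a single feature vector; the function names are illustrative, and the way LN-TTA actually chooses the adjusted gain and bias is not given in the abstract and is not reproduced here.

```python
import math


def pre_affine_layer_norm(x, eps=1e-5):
    """Standardize one sample's feature vector (the parameter-free part of LN)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]


def layer_norm(x, gain, bias, eps=1e-5):
    """Full LN: pre-affine standardization followed by the post-affine gain and bias."""
    x_hat = pre_affine_layer_norm(x, eps)
    return [g * h + b for g, h, b in zip(gain, x_hat, bias)]


features = [1.0, 2.0, 3.0, 4.0]
x_hat = pre_affine_layer_norm(features)

# Changing only gain and bias (the post-affine step) leaves x_hat untouched.
source_out = layer_norm(features, gain=[1.0] * 4, bias=[0.0] * 4)
adapted_out = layer_norm(features, gain=[2.0] * 4, bias=[1.0] * 4)
```

An adaptation scheme in this spirit would modify only `gain` and `bias` at test time, so no statistics need to be re-estimated across the batch.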


53
Learning to Animate Images from A Few Videos to Portray Delicate Human Actions

Haoxin Li ⋅ Yingchen Yu ⋅ Qilong Wu ⋅ Hanwang Zhang ⋅ Song Bai ⋅ Boyang Li

Despite recent progress, video generative models still struggle to generate delicate human actions (e.g., gymnastics), particularly when they are required to start from a user-provided reference image. In this paper, we explore the task of learning to animate images into videos that portray delicate human actions using a small number of videos (16 or fewer), which reduces the need for extensive data collection and enhances practicality for real-world applications. Learning generalizable motion patterns that smoothly transition from user-provided reference images in such a few-shot setting is highly challenging. We propose FLASH (Few-shot Learning to Animate and Steer Humans), which enhances generalization of motion by training the model to reconstruct a video using the motion features and cross-frame correspondences extracted from another video with the same motion but different appearance. This encourages the learning of transferable motion and mitigates overfitting to the appearance in limited training data. Additionally, FLASH extends the decoder with additional layers to propagate details from the reference image to generated frames, improving transition smoothness. Human judges favor FLASH, with 65.78% of 488 responses preferring FLASH over baselines. We strongly recommend watching the videos on the webpage or in the Supplemental Material, as motion artifacts are hard to notice from images.


54
One-Shot Fine-Grained Re-Identification of Paint Marked Honey Bees using Vision Foundation Models

Luke Meyers ⋅ Josué A. Rodríguez-Cordero ⋅ Remi Megret

Accurate re-identification (ReID) of individual insects is crucial for quantitative studies of pollinator behavior, with key applications in biological research and ecological monitoring. This work leverages vision foundation models to enable the fine-grained ReID of honey bees from video with a single paint marking and a single reference track. Such marking avoids the disruption required for gluing tags or painting color codes. We present a new challenging dataset of 9495 images and 45 identities, obtained at outdoor bee feeders with significant pose and illumination changes. Our one-shot capable approach first pre-processes video footage to extract pose-normalized bee crops and remove the background using a segmentation foundation model (e.g., SAM2). It then uses a self-supervised visual foundation model (e.g., DINOv3) for image and patch embeddings, coupled with contrastive metric learning and track information to generate robust ID embeddings for ReID. Compared to existing methods, our approach significantly reduces training data requirements. Our extensive studies show how the choices made at the different steps of the pipeline impact performance, offering practical insights for future animal re-identification.

Cross-view image retrieval, particularly street-to-satellite matching, is a critical task for applications such as autonomous navigation, urban planning, and localization in GPS-denied environments. However, existing approaches often require supervised training on curated datasets and rely on panoramic or UAV-based images, which limits real-world deployment. In this paper, we present a simple yet effective cross-view image retrieval framework that leverages a pretrained vision encoder and a large language model (LLM), requiring no additional training. Given a monocular street-view image, our method extracts geographic cues through web-based image search and LLM-based location inference, generates a satellite query via a geocoding API, and retrieves matching tiles using a pretrained vision encoder (e.g., DINOv2) with PCA-based whitening feature refinement. Despite using no ground-truth supervision or finetuning, our proposed method outperforms prior learning-based approaches on the benchmark dataset under zero-shot settings. Moreover, our pipeline enables automatic construction of semantically aligned street-to-satellite datasets, offering a scalable and cost-efficient alternative to manual annotation. All source code will be made publicly available at https://street2orbit.github.io.


56
Mitigating Backdoor Attacks via Trigger Reconstruction and Model Hardening

Guanhong Tao ⋅ Siyuan Cheng ⋅ Guangyu Shen ⋅ Yingqi Liu ⋅ Shengwei An ⋅ ZHUO ZHANG ⋅ Zhenting Wang ⋅ Hanxi Guo ⋅ Xiangyu Zhang

Backdoor attacks are among the most prominent security threats to deep learning models. Traditional backdoors rely on fixed trigger patterns (e.g., a red square) that existing defenses can often effectively remove. However, recent attacks embed semantic triggers that vary with the input and blend with meaningful features, rendering prior defenses ineffective. We propose MARTINI, a novel backdoor mitigation framework that addresses both traditional and semantic backdoors. MARTINI reconstructs backdoor samples via a dedicated trigger reconstruction procedure, producing malicious inputs that replicate the injected attack effect across a spectrum of attacks. Using these reconstructed samples paired with their correct labels, MARTINI then hardens the model through retraining to neutralize the targeted misclassification. Our evaluation on 14 types of backdoor attacks in image classification shows that MARTINI can reduce the attack success rate (ASR) from 96.56% to 5.17% on average, outperforming 12 state-of-the-art backdoor removal approaches, which at best reduce the ASR to 26.56%. It can also mitigate backdoors in self-supervised learning, object detection and NLP sentiment analysis.
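
The attack success rate (ASR) reported above can be illustrated with a small sketch. This assumes the common definition (the fraction of trigger-stamped inputs, excluding those whose true class already equals the attacker's target, that the model assigns to the target label); the exact evaluation protocol used for MARTINI is not stated in the abstract, and the function name is hypothetical.

```python
def attack_success_rate(predictions, true_labels, target_label):
    """Fraction of triggered, non-target-class inputs classified as the target label."""
    eligible = [p for p, t in zip(predictions, true_labels) if t != target_label]
    if not eligible:
        return 0.0
    hits = sum(1 for p in eligible if p == target_label)
    return hits / len(eligible)


# Hypothetical predictions on five trigger-stamped inputs, target label 0
asr = attack_success_rate(
    predictions=[0, 0, 1, 0, 2],
    true_labels=[1, 2, 1, 0, 2],
    target_label=0,
)
```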


57
Network-agnostic distortion-robust projections for wide-angle image understanding

Akshaya Athwale ⋅ Ola Ahmad ⋅ Jean-Francois Lalonde

Due to their increased field of view, wide-angle lenses are increasingly used in applications such as VR, security, and autonomous driving. Typically, existing models either ignore wide-angle distortions or "undistort" images to a perspective projection, often resulting in severe stretching. More recent distortion-aware architectures address these issues, yet they impose substantial computational burdens and limit the use of powerful pre-trained vision backbones. In this work, we revisit the undistortion strategy by exploring alternative projection functions beyond the conventional perspective model. Specifically, we investigate square-to-disc mapping functions, most notably the elliptical grid map (EGM) projection, which minimizes stretching. We show how EGM projection can be combined with the known lens distortion curves to achieve distortion invariance directly in image space. This network-agnostic approach seamlessly integrates with existing deep learning architectures, allowing fine-tuning of models pre-trained on large perspective datasets, while adapting to both seen and unseen wide-angle lenses without re-training each time the lens changes during evaluation. We perform experiments on the semantic segmentation task, comparing methods on zero-shot adaptation to unseen wide-angle lenses. Our extensive experiments show that using the EGM projection with existing segmentation models significantly outperforms baselines when trained on bounded distortion levels and tested across both seen and out-of-distribution distortions. Furthermore, EGM projection achieves improved performance on real-world datasets, highlighting the robustness and practicality of our approach.
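
As background on the square-to-disc mapping mentioned above, the commonly cited closed form of the elliptical grid mapping can be sketched as follows. This is the generic geometric mapping only; how the paper combines it with lens distortion curves is not described in the abstract, and the function name is illustrative.

```python
import math


def square_to_disc(x, y):
    """Elliptical grid mapping from the square [-1, 1]^2 to the unit disc.

    u = x * sqrt(1 - y^2 / 2),  v = y * sqrt(1 - x^2 / 2)
    Points on the square's boundary land on the unit circle.
    """
    u = x * math.sqrt(1.0 - 0.5 * y * y)
    v = y * math.sqrt(1.0 - 0.5 * x * x)
    return u, v


corner = square_to_disc(1.0, 1.0)   # a corner of the square
edge = square_to_disc(1.0, 0.5)     # a point on the right edge
center = square_to_disc(0.0, 0.0)   # the center stays fixed
```

Along the edge x = 1 the squared radius is (1 - y²/2) + y²/2 = 1, which is why the whole square boundary maps onto the circle without the corner stretching of a radial rescale.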


58
4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos

Shanshan Zhong ⋅ Jiawei Peng ⋅ Zehan Zheng ⋅ Zhongzhan Huang ⋅ Wufei Ma ⋅ Guofeng Zhang ⋅ Qihao Liu ⋅ Alan Yuille ⋅ Jieneng Chen

Reconstructing animatable 3D animals from videos traditionally depends on sparse semantic keypoints to fit parametric models. Acquiring these keypoints is labor-intensive, and detectors trained on limited animal datasets are often unreliable. We propose 4D-Animal, a keypoint-free framework that reconstructs animatable 3D animals directly from videos. Our method employs a dense feature network to map 2D image representations to SMAL parameters, improving both efficiency and stability. Additionally, we introduce a hierarchical alignment strategy that leverages silhouette, part-level, pixel-level, and temporal cues from pretrained 2D models, ensuring accurate and temporally coherent reconstructions. Extensive experiments demonstrate that 4D-Animal outperforms both model-based and model-free baselines on a dog dataset. Moreover, the high-quality 3D assets generated by our method can benefit other 3D tasks, underscoring its potential for large-scale applications. The code will be released online.

Recent advances in diffusion models have significantly improved image and video synthesis. In addition, several concept control methods have been proposed to enable fine-grained, continuous, and flexible control over free-form text prompts. However, these methods not only require intensive training time and GPU memory usage to learn the sliders or embeddings but also need to be retrained for different diffusion backbones, limiting their scalability and adaptability. To address these limitations, we introduce Text Slider, a lightweight, efficient and plug-and-play framework that identifies low-rank directions within a pre-trained text encoder, enabling continuous control of visual concepts while significantly reducing training time, GPU memory consumption, and the number of trainable parameters. Furthermore, Text Slider supports multi-concept composition and continuous control, enabling fine-grained and flexible manipulation in both image and video synthesis. We show that Text Slider enables smooth and continuous modulation of specific attributes while preserving the original spatial layout and structure of the input. Text Slider achieves significantly better efficiency: 5$\times$ faster training than Concept Slider and 47$\times$ faster than Attribute Control, while reducing GPU memory usage by nearly 2$\times$ and 4$\times$, respectively.


60
Cluster-based Pseudo-labeling for Semi-Supervised LiDAR Semantic Segmentation

Qingju Guo ⋅ Shuang Li ⋅ Jing Geng ⋅ Binhui Xie ⋅ Jiawei Shan ⋅ Wei Li

The costly annotation process has driven the development of semi-supervised learning (SSL) approaches. Existing semi-supervised LiDAR segmentation methods typically process entire point clouds directly, aiming to assign labels to all points at the scene scale. However, the large number of points, combined with their sparse and irregular nature, makes it challenging to learn scene-level optimization objectives, especially in SSL settings where labeled data are insufficient. This paper presents a Cluster-based pseudo-LAbeling Semi-Supervised technique, called CLASS. CLASS is designed to divide point clouds into several small, pure clusters, thereby decomposing the challenging scene-scale segmentation task into more manageable cluster-scale classification and segmentation tasks and enabling the generation of high-quality pseudo-labels for unlabeled data. CLASS possesses three key properties. i) Task simplicity: our pseudo-labeling process is based on simpler cluster-scale classification and segmentation tasks, resulting in ease of learning. ii) Labeling effectiveness: CLASS can generate pseudo-labels comparable to ground truth using only approximately 10% labeled data. iii) Universal versatility: CLASS exhibits flexibility regarding LiDAR representations (e.g., BEV, voxel, and range view). Comprehensive experiments on popular LiDAR segmentation benchmarks demonstrate its superiority.


61
Human Pose Aggregation for Multi-View Temporal Video Alignment

Fabien Delattre ⋅ Tsung-Wei Huang ⋅ Guan-Ming Su ⋅ Erik Learned-Miller

When multiple videos of a scene are taken from differing viewpoints without precise synchronization, it can be difficult to temporally align them after the fact. Often the metadata or audio needed to do so is missing or inaccurate. But human motion in such videos can provide a strong signal for identifying matching time points across videos, through analysis of pose and movement. In this work, we leverage view-invariant human pose features to synchronize videos. Unlike previous human pose-based alignment techniques, our method can align videos containing multiple people without performing tracking or re-identification across views. We achieve this by aggregating pose information from multiple people into a single frame descriptor. This also enables fast $\mathcal{O}(n\log n)$ search for the optimal alignment. This simple but effective strategy leads to major and consistent improvements over existing human-based and visual feature temporal alignment techniques.


62
Marshaled Learning: Bridging Large Neural Networks with Memory-Constrained Trusted Execution Environments in Federated Learning

Shiwei Ding ⋅ Xiaoyong Yuan ⋅ Zhenlin Wang ⋅ Lan Zhang ⋅ Giuseppe Ateniese

Despite its privacy-oriented design, federated learning (FL) remains vulnerable to privacy breaches due to the exposure of model update snapshots throughout training. To achieve robust privacy-preserving FL that protects both data and model privacy, Trusted Execution Environments (TEEs) offer a promising solution by isolating code and data within a secure memory enclave. However, the memory capacity of commonly used TEEs still constrains the training of large-scale neural networks, such as GPT, creating significant challenges within these secure enclaves and thereby limiting the full potential of TEEs in federated learning. To address this limitation, we propose Marshaled Learning, a solution to protect FL privacy for both data and model owners while enabling large neural network training within memory-constrained TEEs. To overcome memory constraints, we first partition the neural network and distribute subnetworks to clients in alignment with their TEE memory capacities. We also facilitate end-to-end training for global optimization by enabling knowledge propagation across client-side TEEs. Given the distributed nature of subnetworks, we introduce a dynamic knowledge propagation mechanism to enhance knowledge transfer in FL. This diversified propagation accelerates FL on heterogeneous data and mitigates critical straggler effects common in distributed training. We also analyze the convergence of Marshaled Learning under conditions of data heterogeneity. Our theoretical and empirical results demonstrate the effectiveness and efficiency of Marshaled Learning over existing FL algorithms in memory-constrained scenarios. Marshaled Learning outperforms the baseline methods by around 2% to 5% in accuracy with much faster convergence. Furthermore, we implement Marshaled Learning in a real-world TEE and show that it incurs only around 1× to 3× computational overhead compared with non-TEE environments while ensuring strong privacy preservation.

Remote photoplethysmography (rPPG) is a crucial technique for non-contact heart rate (HR) estimation using facial videos, gaining significance in driver monitoring systems where contact-based measurements are impractical. Existing rPPG methods often rely on either RGB or NIR data, each susceptible to limitations under motion artifacts and varying illumination in real-world driving scenarios. To address these challenges, we introduce a novel RGB-NIR fusion model tailored for robust rPPG and HR estimation in dynamic vehicle environments. Our approach features two main contributions: an NIR-specific decoder that facilitates effective cross-modal knowledge transfer from RGB to NIR, enhancing model adaptability; and a dual autoencoder architecture for efficient feature disentanglement and reconstruction, mitigating noise from driver motion and changing lighting conditions. Comprehensive evaluations, including inter- and cross-dataset testing and ablation studies across various driving and garage conditions, demonstrate that our model achieves superior performance on the MR-NIRP car dataset, showcasing significant robustness in complex vehicular environments.


64
Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation

Xiwen Chen ⋅ Wenhui Zhu ⋅ Peijie Qiu ⋅ Hao Wang ⋅ Huayu Li ⋅ Haiyu Wu ⋅ XUANZHAO DONG ⋅ Aris Sotiras ⋅ Yalin Wang ⋅ Abolfazl Razi

Vision-language models (VLMs) such as CLIP demonstrate strong performance but struggle when adapted to downstream tasks. Prompt learning has emerged as an efficient and effective strategy to adapt VLMs while preserving their pre-trained knowledge. However, existing methods still lead to overfitting and degrade zero-shot generalization. To address this challenge, we propose an optimal transport (OT)-guided prompt learning framework that mitigates forgetting by preserving the structural consistency of feature distributions between pre-trained and fine-tuned models. Unlike conventional point-wise constraints, OT naturally captures cross-instance relationships and expands the feasible parameter space for prompt tuning, allowing a better trade-off between adaptation and generalization. Our approach enforces joint constraints on both vision and text representations, ensuring a holistic feature alignment. Extensive experiments on benchmark datasets demonstrate that our simple yet effective method can outperform existing prompt learning strategies in base-to-novel generalization, cross-dataset evaluation, and domain generalization without additional augmentation or ensemble techniques.


65
Zero-Shot Video Deraining with Video Diffusion Models

Tuomas Varanka ⋅ Juan Bello Bello ⋅ Hyeongwoo Kim ⋅ Pablo Garrido ⋅ Xu YAO

Existing video deraining methods are often trained on paired datasets, either synthetic, which limits their ability to generalize to real-world rain, or captured by static cameras, which restricts their effectiveness in dynamic scenes with background and camera motion. Furthermore, recent works on fine-tuning diffusion models have shown promising results, but the fine-tuning tends to weaken the generative prior, limiting generalization to unseen cases. In this paper, we introduce the first zero-shot video deraining method for complex dynamic scenes that requires neither synthetic data nor model fine-tuning, by leveraging a pretrained text-to-video diffusion model that demonstrates strong generalization capabilities. By inverting an input video into the latent space of the diffusion model, we can intervene in its reconstruction process and push it away from the model's concept of rain using negative prompting. At the core of our approach is an attention switching mechanism that we found to be crucial for maintaining dynamic backgrounds as well as structural consistency between the input and the derained video, mitigating artifacts introduced by naive negative prompting. Our approach is validated through extensive experiments on real-world rain datasets, demonstrating substantial improvements over prior methods and showcasing robust generalization without the need for supervised training.


66
Joint Modeling of Corruption-Driven and Information-Limited Uncertainty for Robust 3D Gaussian Splatting

Zeji Hui ⋅ Amirali Khodadadian Gostar ⋅ WeiQin Chuah ⋅ Alireza Bab-Hadiashar ⋅ Ruwan Tennakoon

Real-time 3D Gaussian Splatting (3DGS) has emerged as an efficient, high-fidelity alternative to neural radiance fields for novel view synthesis, enabling second-scale training and rendering via GPU rasterization. However, when input image collections contain transient disturbances (e.g., dynamic objects, exposure variations, motion blur) or suffer from sparse view coverage at scene boundaries, 3DGS performance degrades significantly due to reconstruction artifacts such as ghosting, floating points, and blurred surfaces. In this work, we present a unified framework that jointly addresses two types of artifacts: (1) corruption-driven artifacts, caused by transient or occluded content; and (2) information-limited artifacts, caused by insufficient multi-view observations. Our method leverages the training gradient signal, as well as the shape and spatial distribution of Gaussians, to adaptively suppress unreliable splats through a soft-masking strategy, without relying on any pretrained segmentation or feature networks. Extensive experiments on two real-world datasets with dynamic scenes and sparse camera trajectories demonstrate that our approach outperforms state-of-the-art robust 3DGS and uncertainty-pruning techniques in artifact suppression and reconstruction fidelity, while preserving real-time performance.


67
Data-Driven Lipschitz Continuity: A Cost-Effective Approach to Improve Adversarial Robustness

Erh-Chung Chen ⋅ Pin-Yu Chen ⋅ I-Hsin Chung ⋅ Che-Rung Lee

The security and robustness of deep neural networks (DNNs) have become increasingly critical as these systems are deployed in sensitive applications. While introducing adversarial examples during training has proven effective for improving robustness, this approach imposes substantial computational burdens that many users cannot afford, and no certified models have been deployed commercially. More concerning, state-of-the-art methods that further enhance robustness by incorporating additional examples from external datasets or generative models increase training costs by orders of magnitude. In this paper, we propose a cost-efficient approach that achieves comparable or superior robustness by leveraging the theorem of Lipschitz continuity. Our technique remaps the input domain into a constrained range, effectively reducing the Lipschitz constant and enhancing model resilience against adversarial perturbations. Unlike conventional adversarial training, our method requires only a single scan of the dataset without gradient estimation, making it remarkably efficient. Our approach integrates seamlessly with existing adversarially trained models to further boost their robustness. Experiments demonstrate its generalizability across various model architectures and datasets. When combined with models trained without additional generative data, our method achieves robustness comparable to or exceeding that of models using extensive supplementary data. These results open a promising direction for significantly reducing computational costs while maintaining or improving defensive capabilities of robust neural networks.
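
The effect of input remapping on the Lipschitz constant can be seen in a simplified case. Assume, purely for illustration, that the remapping is an affine map $g(x) = a x + b$ with $0 < a < 1$ (the abstract does not specify the form of the remapping) and that the network $f$ is $L$-Lipschitz. Then for any two inputs $x$ and $y$:

$$\|f(g(x)) - f(g(y))\| \le L\,\|g(x) - g(y)\| = aL\,\|x - y\|$$

Under these assumptions the composed model $f \circ g$ has a Lipschitz constant of at most $aL < L$, so a perturbation of a given size in the original input domain can change the output less than it could without the remapping.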


68
CADE: Continual Weakly-supervised Video Anomaly Detection with Ensembles

Satoshi HASHIMOTO ⋅ Tatsuya Konishi ⋅ Tomoya Kaichi ⋅ Kazunori Matsumoto ⋅ Mori Kurokawa

Video anomaly detection (VAD) has long been studied as a crucial problem in public security and crime prevention. In recent years, weakly-supervised VAD (WVAD) methods have attracted considerable attention due to their easy annotation process and promising research results. While existing WVAD methods mainly tackle static datasets, the possibility that the domain of the data can vary has been neglected. To adapt to such domain shift, a continual learning (CL) perspective is required, because otherwise additional training only on newly arriving data could easily cause performance degradation on previous data, i.e., forgetting. Therefore, we propose a brand-new approach called Continual Anomaly Detection with Ensembles (CADE), which is the first work combining the CL and WVAD viewpoints. Specifically, CADE uses the Dual-Generator (DG) to address data imbalance and label uncertainty in WVAD. We also found that forgetting exacerbates "incompleteness", where the model becomes biased towards certain anomaly modes, leading to missed detections of various anomalies. To address this, we propose to ensemble the Multi-Discriminator (MD), which uses multiple models to capture anomalies missed in past scenes due to forgetting. Extensive experiments show that CADE significantly outperforms existing VAD methods on common multi-scene VAD datasets, such as the ShanghaiTech and Charlotte Anomaly datasets.

Survival prediction from medical imaging is a critical challenge in computational oncology, with high clinical relevance for patient stratification and treatment planning. However, current Deep Learning methods suffer from three core limitations: they assume complete modality availability, overlook local-to-global cross-modal interactions, and disregard modality-specific signal reliability during optimization. To address these issues, we introduce PaRaChute, a novel Deep Learning framework for robust multimodal survival prediction from heterogeneous and partially missing imaging data. PaRaChute integrates modality-specific pretrained encoders with adapter networks that align radiology and histopathology features into a shared latent space. A Dynamic Contextual Embedding mechanism captures biologically grounded local correlations between pathology and radiology and channels them through a multi-head cross-attention fusion module to guide global survival prediction, while adaptively handling missing modality scenarios. Furthermore, a Gradient Curvature Steering module improves convergence in incomplete data regimes by regularizing gradients via local curvature alignment. Experiments on three CPTAC and TCGA derived cancer cohorts show that PaRaChute achieves a C-index of 0.8367 with full modality input, and it retains strong performance under missing modality conditions (0.7488) while producing clinically meaningful risk stratifications, as confirmed by Kaplan–Meier analysis.


70
How to Design and Train Your Implicit Neural Representation for Video Compression

Matthew Gwilliam ⋅ Roy Zhang ⋅ Namitha Padmanabhan ⋅ Hongyang Du ⋅ Abhinav Shrivastava

Implicit neural representation (INR) methods for video compression have recently achieved visual quality and compression ratios that are competitive with traditional pipelines. However, due to the need for per-sample network training, the encoding speeds of these methods are too slow for practical adoption. We develop a library to allow us to disentangle and review the components of methods from the NeRV family, reframing their performance in terms of not only size-quality trade-offs, but also impacts on training time. We uncover principles for effective video INR design and propose a state-of-the-art configuration of these components, Rabbit NeRV (RNeRV). When all methods are given equal training time (equivalent to 300 NeRV epochs) for 7 different UVG videos at 1080p, RNeRV achieves +1.27% PSNR on average compared to the best-performing alternative for each video in our NeRV library. We then tackle the encoding speed issue head-on by investigating the viability of hyper-networks, which predict INR weights from video inputs, to disentangle training from encoding to allow for real-time encoding. We propose masking the weights of the predicted INR during training to allow for variable, higher quality compression, resulting in 1.7% improvements to both PSNR and MS-SSIM at 0.037 bpp on the UCF-101 dataset, and we increase hyper-network parameters by 0.4% for 2.5%/2.7% improvements to PSNR/MS-SSIM with equal bpp and similar speeds.


71
Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?

Annika Mütze ⋅ Sadia Ilyas ⋅ Christian Dörpelkus ⋅ Matthias Rottmann

Open-vocabulary object detectors such as Grounding DINO are trained on vast and diverse data, achieving remarkable performance on challenging datasets. As a result, it is unclear where their limitations lie, which is a major concern when using them in safety-critical applications. Real-world data does not provide the control required for a rigorous evaluation of model generalization. In contrast, synthetically generated data allows systematic exploration of the boundaries of model competence and generalization. In this work, we address two research questions: 1) Can we challenge open-vocabulary object detectors with generated image content? 2) Can we find systematic failure modes of those models? To address these questions, we design two automated pipelines using Stable Diffusion to inpaint unusual objects with high diversity in semantics, by sampling multiple substantives from WordNet and ChatGPT. On the synthetically generated data, we evaluate and compare multiple open-vocabulary object detectors as well as a classical object detector. The synthetic data is derived from two real-world datasets, namely LostAndFound, a challenging out-of-distribution (OOD) detection benchmark, and the NuImages dataset. Our results indicate that inpainting can challenge open-vocabulary object detectors in terms of overlooking objects. Additionally, we find a strong dependence of open-vocabulary models on object location, rather than on object semantics. This provides a systematic approach to challenge open-vocabulary models and gives valuable insights on how data could be acquired to effectively improve these models.


72
Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis

Imanol Estepa ⋅ Jesús Rodríguez-de-Vera ⋅ Ignacio Sarasua ⋅ Bhalaji Nagarajan ⋅ Petia Radeva

While representation learning and generative modeling both seek to understand visual data, unifying both domains remains unexplored. Recent Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between both paradigms. However, they either extract information from discriminative pretrained models or rely solely on semantic token reconstruction, which requires an external tokenizer during training, introducing a significant computational overhead. In this work, we introduce Sorcen, a novel Unified SSL framework incorporating a synergic Contrastive-Reconstruction objective. Our novel Contrastive objective leverages the generative capabilities of Sorcen and eliminates the need for additional image crops or augmentations during training. Sorcen generates contrastive positive samples, called Echoes, directly in the semantic token space using the reconstruction objective. This on-the-fly Echo generation enables Sorcen to operate exclusively on precomputed tokens, eliminating the need for an online tokenizer during training. Sorcen reduces the computational overhead by 60.8% compared to the token reconstruction SoTA. Extensive experiments on ImageNet-1k demonstrate that Sorcen outperforms the previous Unified SSL SoTA by 0.4%, 1.48 FID, 1.76%, and 1.53% on linear probing, unconditional image generation, few-shot learning, and transfer learning, respectively. Additionally, Sorcen establishes a new single-crop MIM SoTA in linear probing and achieves SoTA performance in unconditional image generation, highlighting significant improvements and breakthroughs in Unified SSL models.


73
DICE: Discrete Inversion Enabling Controllable Editing for Masked Generative Models

Sen Zhang ⋅ Quan Dao ⋅ Ligong Han ⋅ Song Wen ⋅ Minhao Bai ⋅ Di Liu ⋅ Han Zhang ⋅ Felix Juefei-Xu ⋅ Chaowei Tan ⋅ Bo Liu ⋅ Martin Min ⋅ Kang Li ⋅ Faez Ahmed ⋅ Akash Srivastava ⋅ Hongdong Li ⋅ Junzhou Huang ⋅ Dimitri Metaxas

Recent advances in discrete diffusion models have demonstrated strong performance in image generation and masked language modeling, yet they remain limited in their capacity for controlled content editing. We propose DICE (Discrete Inversion for Controllable Editing), a novel framework that pioneers precise inversion capabilities for discrete diffusion models, including both masked generative and multinomial diffusion variants. Our key innovation lies in capturing noise sequences and masking patterns during the reverse diffusion process, enabling both accurate reconstruction and flexible editing without relying on predefined masks or attention-based manipulations. Through comprehensive experiments across image and text modalities using models such as Paella, VQ-Diffusion, RoBERTa and LLaDA, we demonstrate that DICE successfully maintains high fidelity to the original data while significantly expanding editing capabilities. These results establish new possibilities for fine-grained content manipulation in discrete spaces.


74
Deepfake Detection that Generalizes Across Benchmarks

Andrii Yermakov ⋅ Jan Čech ⋅ Jiri Matas ⋅ Mario Fritz

The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment. Although many approaches adapt foundation models by introducing significant architectural complexity, this work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of a pre-trained CLIP vision encoder. The proposed method, LNCLIP-DF, fine-tunes only the Layer Normalization parameters (0.03% of the total) and enhances generalization by enforcing a hyperspherical feature manifold using L2 normalization and latent space augmentations. We conducted an extensive evaluation on 13 benchmark datasets spanning from 2019 to 2025. The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC. Our analysis yields two primary findings for the field: 1) training on paired real-fake data from the same source video is essential for mitigating shortcut learning and improving generalization, and 2) detection difficulty on academic datasets has not strictly increased over time, with models trained on older, diverse datasets showing strong generalization capabilities. This work delivers a computationally efficient and reproducible method, proving that state-of-the-art generalization is attainable by making targeted, minimal changes to a pre-trained CLIP model. The code will be made publicly available upon acceptance.


75
Unified Video Anomaly Detection Model for Detecting Different Anomaly Types

Kijung Lee ⋅ Youngwan Jo ⋅ Sunghyun Ahn ⋅ Sanghyun Park

Video anomaly detection (VAD) is a crucial task for public safety and workforce reduction. Due to the rarity of abnormal events and the high cost of data collection, one-class classification (OCC) methods are extensively used. OCC methods are divided into object- and frame-centric approaches, each with its limitations. Object-centric methods fail to detect nonobject anomalies because they focus solely on objects, whereas frame-centric methods struggle to identify abnormalities because the background occupies a larger share of video frames than the foreground. To this end, we define three types of abnormal events, namely, human, appearance, and nonobject anomalies, and propose a unified VAD (UniVAD) model that effectively detects each defined anomaly type. UniVAD comprises three streams, namely, skeleton, local-visual, and global-visual, and each stream focuses on a specific type of anomaly. In addition, each stream uses an autoencoder; thus, we introduce the feature future past prediction task, which predicts past and future features based on the present feature, to suppress the strong generalization capacity of autoencoders. We validate the proposed model on three public benchmarks, ShanghaiTech, UBnormal, and NWPUCampus, and demonstrate that it achieves state-of-the-art performance by a significant margin.


76
SCORP: Scene-Consistent Object Refinement via Proxy Generation and Tuning

Ziwei Chen ⋅ Ziling Liu ⋅ Zitong Huang ⋅ Mingqi Gao ⋅ Feng Zheng

Missing viewpoints of objects are common in scene reconstruction, as camera paths typically prioritize capturing the overall scene structure rather than individual objects. This makes it highly challenging to achieve high-fidelity object-level modeling while maintaining accurate scene-level representation. Addressing this issue is critical for advancing downstream tasks requiring high-fidelity object reconstruction. In this paper, we introduce Scene-Consistent Object Refinement via Proxy Generation and Tuning (SCORP), a novel 3D enhancement framework that leverages 3D generative priors to recover fine-grained object geometry and appearance under missing views. Starting with proxy generation by substituting degraded objects using a 3D generation model, our method then progressively refines geometry and texture by aligning each proxy to its degraded counterpart in 7-DoF pose, followed by correcting spatial and appearance inconsistencies through registration-constrained enhancement. This two-stage proxy tuning ensures the high-fidelity geometry and appearance of the original object in unseen views while maintaining consistency in spatial positioning, observed geometry, and appearance. Across challenging benchmarks, our method achieves consistent gains over recent state-of-the-art baselines on both novel view synthesis and geometry completion tasks. Our code will be made publicly available to support future research.


77
CountingDINO: A Training-free Pipeline for Class-Agnostic Counting using Unsupervised Backbones

Giacomo Pacini ⋅ Lorenzo Bianchi ⋅ Luca Ciampi ⋅ Nicola Messina ⋅ Giuseppe Amato ⋅ Fabrizio Falchi

Class-agnostic counting (CAC) aims to estimate the number of objects in images without being restricted to predefined categories. However, while current exemplar-based CAC methods offer flexibility at inference time, they still rely heavily on labeled data for training, which limits scalability and generalization to many downstream use cases. In this paper, we introduce CountingDINO, the first training-free exemplar-based CAC framework that exploits a fully unsupervised feature extractor. Specifically, our approach employs self-supervised vision-only backbones to extract object-aware features, and it eliminates the need for annotated data throughout the entire proposed pipeline. At inference time, we extract latent object prototypes via ROI-Align from DINO features and use them as convolutional kernels to generate similarity maps. These are then transformed into density maps through a simple yet effective normalization scheme. We evaluate our approach on the FSC-147 benchmark, where we consistently outperform a baseline based on an SOTA unsupervised object detector under the same label- and training-free setting. Additionally, we achieve competitive results -- and in some cases surpass -- training-free methods that rely on supervised backbones, non-training-free unsupervised methods, as well as several fully supervised SOTA approaches. This demonstrates that label- and training-free CAC can be both scalable and effective. Code: https://anonymous.4open.science/r/CountingDINO-4590.
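
The inference steps described above (prototype extraction, similarity map, density map, count) can be sketched in a few lines. This is a minimal toy sketch, not the authors' implementation: it assumes a tiny hand-made feature grid instead of DINO features, mean pooling over the exemplar box instead of ROI-Align, a 1x1 prototype kernel, and a normalization in which the exemplar box is scaled to contain exactly one object. All function names are hypothetical.

```python
def mean_pool(features, box):
    """Average the feature vectors inside box = (row0, col0, row1, col1), end-exclusive.
    Stand-in for ROI-Align (assumption made for brevity)."""
    r0, c0, r1, c1 = box
    cells = [features[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    dim = len(cells[0])
    return [sum(v[d] for v in cells) / len(cells) for d in range(dim)]


def similarity_map(features, prototype):
    """Use the prototype as a 1x1 kernel: clipped dot product at every location."""
    return [
        [max(0.0, sum(f * p for f, p in zip(cell, prototype))) for cell in row]
        for row in features
    ]


def density_map(sim, box):
    """Scale the similarity map so the exemplar box sums to one object (assumed scheme)."""
    r0, c0, r1, c1 = box
    exemplar_mass = sum(sim[r][c] for r in range(r0, r1) for c in range(c0, c1))
    if exemplar_mass <= 0.0:
        raise ValueError("exemplar box has no positive similarity")
    return [[s / exemplar_mass for s in row] for row in sim]


def count_objects(features, box):
    """Estimated count = sum of the density map."""
    prototype = mean_pool(features, box)
    density = density_map(similarity_map(features, prototype), box)
    return sum(sum(row) for row in density)


# Toy 3x4 feature grid: three "object" cells, the rest background.
OBJ, BG = [1.0, 0.0], [0.0, 1.0]
features = [
    [OBJ, BG, BG, OBJ],
    [BG, BG, BG, BG],
    [BG, OBJ, BG, BG],
]
estimated = count_objects(features, box=(0, 0, 1, 1))  # exemplar covers cell (0, 0)
```

In this toy setting the exemplar box defines what one object "weighs" in the similarity map, so summing the normalized map gives the count without any training or labels.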

Point cloud registration is a fundamental task in 3D vision. Most existing methods only use geometric information for registration. Recently proposed RGB-D registration methods primarily focus on feature fusion or improving feature learning, which limits their ability to exploit image information and hinders their practical applicability. In this paper, we propose ViGG, a robust RGB-D registration method using mutual guidance. First, we solve clique alignment in a visual-geometric combination form, employing a geometric guidance design to suppress ambiguous cliques. Second, to mitigate accuracy degradation caused by noise in visual matches, we propose a visual-guided geometric matching method that utilizes visual priors to determine the search space, enabling the extraction of high-quality, noise-insensitive correspondences. This mutual guidance strategy brings our method superior robustness, making it applicable for various RGB-D registration tasks. The experiments on 3DMatch, ScanNet and KITTI datasets show that our method outperforms recent state-of-the-art methods in both learning-free and learning-based settings. Code is available at [Anonymous] (provided in supplementary material).


79
Gaussian Representations for Video

Sachin Shah ⋅ Anustup Choudhury ⋅ Guan-Ming Su ⋅ Jaclyn Pytlarz ⋅ Christopher Metzler ⋅ Trisha Mittal

We introduce Gaussian representations for videos (GaRV), a novel video encoding and decoding scheme based upon 3D Gaussians. Unlike traditional representations, which encode videos as sequences of frames, or neural representations, which encode videos within the weights of a neural network, we encode videos as a collection of 3D Gaussians within a space-time volume. The key advantage of our approach is that it enables efficient and flexible rasterization-based video decoding. With a slight drop in overall compression rate, GaRV offers an 8-50$\times$ improvement in decoding time and a 2.5-15$\times$ reduction in GPU memory compared with neural counterparts. Existing Gaussian video techniques require 2-30$\times$ more disk space, while also using more GPU resources than GaRV. Moreover, GaRV offers unique flexibility in how and when pixels are decoded: one can non-sequentially decode frames/regions without penalty and can selectively decode regions at high resolution to enable low-cost foveated video decoding.
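
To make the random-access property concrete, the following toy sketch evaluates a video as a sum of Gaussians in an (x, y, t) volume. It is only an illustration under strong assumptions that are not from the paper: isotropic Gaussians, a single grayscale channel, and plain additive blending instead of a rasterizer. The tuple layout and function names are hypothetical.

```python
import math


def decode_pixel(gaussians, x, y, t):
    """Sum the contribution of every Gaussian at space-time point (x, y, t).
    Each Gaussian is (cx, cy, ct, sigma, amplitude); isotropic for simplicity."""
    value = 0.0
    for cx, cy, ct, sigma, amplitude in gaussians:
        sq_dist = (x - cx) ** 2 + (y - cy) ** 2 + (t - ct) ** 2
        value += amplitude * math.exp(-sq_dist / (2.0 * sigma ** 2))
    return value


def decode_region(gaussians, t, rows, cols):
    """Decode only the requested rows/cols of frame t; no other frame is touched."""
    return [[decode_pixel(gaussians, x, y, t) for x in cols] for y in rows]


# One Gaussian centred at pixel (2, 2) of frame 0.
gaussians = [(2.0, 2.0, 0.0, 1.0, 1.0)]

center = decode_pixel(gaussians, 2.0, 2.0, 0.0)
# Jump straight to a 2x2 patch of frame 5 without decoding frames 0..4.
patch = decode_region(gaussians, t=5.0, rows=range(1, 3), cols=range(1, 3))
```

Because every pixel value depends only on the Gaussian parameters and the query coordinates, any frame or region can be decoded independently, which is the property the abstract refers to as non-sequential and foveated decoding.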

State-of-the-art object trackers primarily model appearance relations between the image template and the search region with Siamese networks. However, this well-established approach has a limited ability to leverage both motion and semantic cues of the target object, leading to increasing errors in challenging scenarios like drastic appearance changes and similar-looking distractors. To address the above weaknesses, we propose a novel tracking framework with Motion and Semantic Reasoning (MSRTrack), integrating short-term motion modeling and distinctive semantic features for robust tracking across diverse conditions. Powered by vision large language models (VLLMs) and the Segment Anything Model 2 (SAM2), MSRTrack identifies unique semantic attributes of the target, exploits motion cues across consecutive frames, and complements appearance-based trackers with strong semantic and dynamic reasoning capabilities. Unlike previous vision language tracking (VLT) methods that rely on broad captioning, MSRTrack automatically focuses on a concise set of key semantic attributes of the target, substantially improving target lost recovery and distractor rejection. MSRTrack achieves state-of-the-art performance across multiple tracking benchmarks, with 2.2% improvement on the LaSOT dataset, 9.5% improvement on the VastTrack dataset, and 1.4% on the TNL2K dataset.


81
Enabling High-Quality In-the-Wild Imaging from Severely Aberrated Metalens Bursts

Debabrata Mandal ⋅ Zhihan Peng ⋅ Yujie Wang ⋅ Praneeth Chakravarthula

Metalenses offer nanoscale control of light, enabling ultra-thin, lightweight optics that could revolutionize handheld consumer imaging, and augmented and virtual reality. However, their adoption is hindered by severe chromatic aberrations, light scattering, limited broadband performance, and low optical efficiency. In contrast, burst imaging, widely used in smartphone cameras, enhances handheld photography by reducing noise, improving high-dynamic range (HDR) imaging, and increasing resolution. Building on these insights, we design and prototype a $12,000\times$ thinner metalens compared to conventional compound optical lenses and introduce a multi-image restoration framework for noise, over- or under-saturation, and aberration removal, specifically tailored for handheld metalens cameras. Our framework features a lightweight network, memory-efficient burst fusion, and adaptive correction techniques to restore high-quality images from extremely degraded metalens captures. We evaluate our framework on a new large-scale metalens dataset and validate its effectiveness with several state-of-the-art burst imaging and restoration algorithms.

In medical healthcare, obtaining detailed annotations is challenging, highlighting the need for robust Vision-Language Models (VLMs). Pretrained VLMs enable fine-tuning on small datasets or zero-shot inference, achieving performance comparable to task-specific models. On the one hand, contrastive learning (CL) is a key paradigm for training VLMs but inherently requires large batch sizes for effective learning, making it computationally demanding and often limited to well-resourced institutions. On the other hand, with limited data in healthcare, it is important to prioritize knowledge extraction from both data and models during training to improve performance. Therefore, we focus on leveraging the momentum method combined with distillation to simultaneously address computational efficiency and knowledge exploitation. Our contributions can be summarized as follows: (1) leveraging momentum self-distillation to enhance multimodal learning, and (2) integrating momentum mechanisms with gradient accumulation to enlarge the effective batch size without increasing resource consumption. Our method surpasses state-of-the-art (SOTA) approaches in few-shot learning with over 90% AUC ROC and improves retrieval tasks by 2-3% across multiple datasets. Importantly, our method achieves high training efficiency with a single GPU while maintaining reasonable training time. Our approach aims to advance efficient multimodal learning by reducing resource requirements while improving performance over SOTA methods.
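
The two generic building blocks named above, gradient accumulation and a momentum (exponential moving average) teacher, can be shown on a one-parameter toy model. This is a sketch under stated assumptions and not the authors' method: it uses a scalar linear model with a mean-squared-error loss rather than a contrastive VLM objective, and the function names, learning rate, and momentum value are hypothetical.

```python
def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Average the gradients of equally sized micro-batches.
    For a mean loss this equals the gradient of the full batch."""
    chunks = [data[i:i + micro_batch_size] for i in range(0, len(data), micro_batch_size)]
    return sum(grad_mse(w, chunk) for chunk in chunks) / len(chunks)


def ema_update(teacher_w, student_w, momentum):
    """Momentum teacher: slow exponential moving average of the student weight."""
    return momentum * teacher_w + (1.0 - momentum) * student_w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
student_w, teacher_w = 0.0, 0.0

full = grad_mse(student_w, data)                                # one large batch
accum = accumulated_grad(student_w, data, micro_batch_size=2)   # two small batches

student_w -= 0.01 * accum                                       # one optimizer step
teacher_w = ema_update(teacher_w, student_w, momentum=0.9)      # teacher follows slowly
```

For a loss averaged over samples, accumulating over micro-batches reproduces the large-batch gradient while holding only one micro-batch in memory at a time. A contrastive loss couples samples within a batch, so the abstract's combination with momentum mechanisms involves more than this plain accumulation.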


83
Beyond Realism: Learning the Art of Expressive Composition with StickerNet

Haoming Lu ⋅ David Kocharian ⋅ Humphrey Shi

As a widely used operation in image editing workflows, image composition has traditionally been studied with a focus on achieving visual realism and semantic plausibility. However, in practical editing scenarios of the modern content creation landscape, many compositions are not intended to preserve realism. Instead, users of online platforms motivated by gaining community recognition often aim to create content that is more artistic, playful, or socially engaging. Taking inspiration from this observation, we define the expressive composition task, a new formulation of image composition that embraces stylistic diversity and looser placement logic, reflecting how users edit images on real-world creative platforms. To address this underexplored problem, we present StickerNet, a two-stage framework that first determines the composition type, then predicts placement parameters such as opacity, mask, location, and scale accordingly. Unlike prior work that constructs datasets by simulating object placements on real images, we directly build our dataset from 1.8 million editing actions collected on an anonymous online visual creation and editing platform, each reflecting user-community validated placement decisions. This grounding in authentic editing behavior ensures strong alignment between task definition and training supervision. User studies and quantitative evaluations show that StickerNet outperforms common baselines and closely matches human placement behavior, demonstrating the effectiveness of learning from real-world editing patterns despite the inherent ambiguity of the task. This work introduces a new direction in visual understanding that emphasizes expressiveness and user intent over realism.

Artistic text generation involves rendering textual content in visually creative and contextually appropriate designs, such as floral or geometric patterns. Despite advancements in Large Language Models (LLMs) like DALL-E and Qwen, maintaining both text accuracy and aesthetic alignment is a persistent challenge in this domain. This paper presents an analysis of current Vision-Language Models (VLMs) in generating artistic text, aiming to diagnose performance gaps and guide improvements in text accuracy and multimodal alignment. The objective of this study is to propose a comprehensive analysis framework to evaluate and improve artistic text generation by assessing models across three key dimensions: character-level text accuracy, semantic similarity, and visual-context alignment. A custom dataset of 1,000 prompts, each specifying a word and an artistic style, supports systematic evaluation. We benchmarked the performance of DALL·E, Qwen-VL, and Qwen-2.5B on text accuracy by comparing the generated text to the original prompt and measuring perceptual alignment using CLIP embeddings. The results show that the DALL-E model performs better than the other models, especially in generated text accuracy. This study highlights the potential of adopting LLMs for artistic text generation. Furthermore, we present exploratory results using ControlNet as an enhancement module, demonstrating its potential to improve text structure and alignment in generated images. Our findings expose critical limitations in current VLMs and offer actionable insights for advancing multimodal generation architectures, datasets, and evaluation strategies, as well as provide a framework for improving text accuracy and aesthetic coherence in creative applications, paving the way for advancements in multimodal AI systems.


85
MEDAL: multi-modal MEta-space Distillation and ALignment for Visual Compatibility Learning

Dween Sanny ⋅ Vinay Verma ⋅ Prateek Sircar ⋅ Deepak Gupta

Visual compatibility recommendation systems aim to surface compatible items (e.g., pants, shoes) that harmonise with a user‑selected product (e.g., shirt). Existing methods struggle in three key aspects: they rely on global CNN representations that overlook fine‑grained local cues critical for visual pairing; they force all categories into a single latent space, ignoring the fact that compatibility rules differ across product‑type pairs; and they demand costly, expert‑annotated outfit labels. We introduce MEDAL (Meta‑space Distillation and Alignment), a self‑supervised framework that addresses all three challenges simultaneously. MEDAL (i) employs a local–global augmentation curriculum inside a teacher–student ViT to emphasise patch‑level texture and pattern similarities while suppressing confounding global shape cues; (ii) partitions the joint feature manifold into learnable, pair‑specific meta‑spaces so that, for example, {shirt,pants} and {pants,shoes} relationships are modelled with distinct projection masks; and (iii) replaces manual labels with distantly supervised KD, harvesting pseudo‑compatible sets via object detection on web images, thus scaling to millions of real‑world examples. We further fuse perceptually uniform LUV colour histograms to capture global colour harmony often missed by pure vision transformers. Extensive experiments on Polyvore disjoint/non‑disjoint and a 2M‑image in‑house dataset show state‑of‑the‑art gains of up to +3.72/+2.7 FITB and +9.58 R@10 over the strongest baseline, whilst cutting annotation cost to zero. Qualitative studies confirm that MEDAL retrieves stylistically coherent outfits and correctly penalises mismatched colour palettes.


86
PS3: Part level instance segmentation in 3D

HONG-XUAN YEN ⋅ Chiamin Chen ⋅ Yanqing Wang ⋅ Yu-Lun Liu ⋅ Min Sun

Open-vocabulary 3D segmentation allows exploration of 3D environments using unrestricted natural language queries. Current approaches to open-vocabulary 3D instance segmentation largely concentrate on recognizing object-level instances but face difficulties when dealing with more fine-grained elements of a scene, such as object parts. Some previous work constructs hierarchical open-vocabulary 3D scene representations by geometric over-segmentation, which can't identify parts with similar geometry. In this work, we introduce PS3, an approach to generate 3D part proposals from multi-view 2D masks. PS3 outperforms baselines that rely on geometric over-segmentation in scene-scale open-vocabulary 3D part segmentation.


87
Saliency-Guided DETR for Moment Retrieval and Highlight Detection

Aleksandr Gordeev ⋅ Vladimir Dokholyan ⋅ Irina Tolstykh ⋅ Maksim Kuprashevich

With the rapid growth of available video content, the ability to search for specific moments within videos using textual queries has become increasingly relevant. This is crucial in many scenarios, from surveillance cameras where it may be necessary to find specific events in extensive video streams to searching for exciting movie scenes. However, existing approaches for video Moment Retrieval and Highlight Detection often struggle to effectively align text and video features, limiting their performance. We argue that utilizing recent foundational video models designed for video-text alignment can overcome these limitations. We propose a novel architecture that utilizes such models to test this hypothesis. Combined with our novel Saliency-Guided Cross Attention mechanism and a hybrid DETR architecture, our approach provides significantly improved results. To further enhance our approach, we developed InterVid-MR, a large-scale, high-quality dataset specifically designed for a pretraining stage. Extensive experiments and comparisons with current state-of-the-art methods confirm the effectiveness of the approach, achieving 58.8 mAP on QVHighlights, 60.7 R@1 mIoU on Charades-STA, and 42.4 R@1 mIoU on TACoS. These results highlight the efficiency and scalability of the method for video-language tasks in both zero-shot and fine-tuning scenarios.

Accurate dose calculation is critical in Gamma Knife Radiosurgery (GKRS), especially in regions near the skull where tissue heterogeneity can significantly alter dose distributions. Although the convolution algorithm-based dose calculation (Conv dose) using CT enables heterogeneity correction for an accurate treatment plan, it introduces additional clinical burdens, including longer planning times and increased radiation exposure. This study is the first to explore using the conditional wavelet Denoising Diffusion Probabilistic Model (cwDDPM) to generate radiation dose distributions. cwDDPM exploits the inherent sparsity of radiation dose maps in the wavelet domain to achieve more efficient learning and sampling. As a result, it produces synthetic convolution-based (sConv) doses quickly and accurately, without relying on CT imaging or computationally intensive convolution calculations. Quantitative results across isodose overlap and dose-volume metrics demonstrate that cwDDPM achieves high fidelity to the ground truth Conv dose and performs comparably to a state-of-the-art diffusion model while reducing inference time by up to 45-fold. cwDDPM also demonstrates robustness across different tumor locations, especially near the skull, where heterogeneity correction is most critical. These results suggest that cwDDPM is a promising solution for rapid, CT-free Conv dose generation in GKRS planning.


89
DermEVAL: A Dermatologist-Reviewed Benchmark for Multimodal Large Language Models

Hongjin Zhao ⋅ Weihao Li ⋅ Zhenyue Qin ⋅ Ge-Peng Ji ⋅ Yang Liu ⋅ Tom Gedeon ⋅ Nick Barnes

Clinical photographs play a crucial role in conversational computer-aided diagnosis, particularly in dermatology. However, existing skin disease benchmarks have notable limitations, including insufficient dataset size, the sole presence of categorical labels, the lack of expert inspections, and limited diversity in annotations. To address these shortcomings, we introduce DermEVAL, a large-scale benchmark specifically designed to evaluate the performance of Multimodal Large Language Models (MLLMs) in dermatology. Our benchmark includes image-text pairs depicting 16 distinct skin diseases, featuring a total of 11,347 representative images drawn from various dermatological datasets, carefully selected and annotated with the guidance of dermatologists. DermEVAL enables two primary tasks: visual question answering (VQA) and medical report generation (MRG), designed to simulate real-world medical diagnostics. We evaluate the performance of MLLMs in dermatology using multiple metrics, including traditional metrics and GPT-4V-based assessments. Our results indicate that accurately diagnosing skin diseases remains challenging for state-of-the-art MLLMs. We demonstrate that fine-tuning MLLMs using DermEVAL significantly improves their accuracy on dermatological tasks. We will release our code and benchmark.


90
SynPlay: Large-Scale Synthetic Human Data with Real-World Diversity for Aerial-View Perception

Jinsub Yim ⋅ Hyungtae Lee ⋅ Sungmin Eum ⋅ Yi-Ting Shen ⋅ Yan Zhang ⋅ Heesung Kwon ⋅ Shuvra Bhattacharyya

We introduce SynPlay, a large-scale synthetic human dataset purpose-built for advancing multi-perspective human identification, with a predominant focus on aerial-view perception. SynPlay departs from traditional synthetic datasets by addressing a critical but underexplored challenge: identifying humans in aerial scenes where subjects often occupy only tens of pixels in the image. In such scenarios, fine-grained details like facial features or textures become irrelevant, shifting the burden of recognition to human motion, behavior, and interactions. To meet this need, SynPlay implements a novel rule-guided motion generation framework that combines real-world motion capture with motion evolution graphs. This design enables human actions to evolve dynamically through high-level game rules rather than predefined scripts, resulting in effectively uncountable motion variations. Unlike existing synthetic datasets, which either focus on static visual traits or reuse a limited set of mocap-driven actions, SynPlay captures a wide spectrum of spontaneous behaviors, including complex interactions that naturally emerge from unscripted gameplay scenarios. SynPlay also introduces an extensive multi-camera setup that spans UAVs at random altitudes, CCTVs, and a freely roaming UGV, achieving true near-to-far perspective coverage in a single dataset. The majority of instances are captured from aerial viewpoints at varying scales, directly supporting the development of models for long-range human analysis, a setting where existing datasets fall short. Our data contains over 73k images and 6.5M human instances, with detailed annotations for detection, segmentation, and keypoint tasks. Extensive experiments demonstrate that training with SynPlay significantly improves human identification performance, especially in few-shot and data-scarce scenarios. SynPlay will be publicly released upon acceptance.


91
Do generative video models understand physical principles?

Saman Motamed ⋅ Laura Culp ⋅ Kevin Swersky ⋅ Priyank Jaini ⋅ Robert Geirhos

AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: Do video models learn "world models" that discover laws of physics, or, alternatively, are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited, and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our anonymous project page is here; data and code are open-sourced; code is available from the supplementary materials.


92
ZonUI-3B: Competitive GUI Grounding with a 3B VLM Trained on a Single Consumer GPU

ZongHan Hsieh ⋅ SHENGJING YANG ⋅ TZER-JEN WEI

In this paper, we present ZonUI-3B, a lightweight Vision-Language Model (VLM) that can be fully trained on a single consumer-grade GPU (RTX 4090) while delivering performance comparable to significantly larger models on GUI grounding tasks. The model incorporates several key innovations: (i) a combined cross-platform, multi-resolution dataset of 24K examples from diverse sources, including mobile, desktop, and web GUI screenshots, to effectively address data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, where initial cross-platform training establishes robust GUI understanding, followed by specialized fine-tuning on high-resolution data to significantly enhance model adaptability; and (iii) data curation and redundancy reduction strategies, demonstrating that randomly sampling a smaller subset with reduced redundancy achieves performance comparable to larger datasets, emphasizing data diversity over sheer volume. Empirical evaluation on standard GUI grounding benchmarks (ScreenSpot, ScreenSpot-v2, and the challenging ScreenSpot-Pro) highlights ZonUI-3B's exceptional accuracy, achieving 84.9% on ScreenSpot and 86.4% on ScreenSpot-v2, surpassing prior models under 4B parameters. Ablation studies validate the critical role of balanced sampling and two-stage fine-tuning in enhancing robustness, particularly in high-resolution desktop scenarios.


93
No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts

Girolamo Macaluso ⋅ Lorenzo Mandelli ⋅ Mirko Bicchierai ⋅ Stefano Berretti ⋅ Andrew Bagdanov

Diffusion models have recently advanced human motion generation, producing realistic and diverse animations from textual prompts. However, adapting these models to unseen actions or styles typically requires additional motion capture data and full retraining, which is costly and difficult to scale. We propose a post-training framework based on Reinforcement Learning that fine-tunes pretrained motion diffusion models using only textual prompts, without requiring any motion ground truth. Our approach employs a pretrained text–motion retrieval network as a reward signal and optimizes the diffusion policy with Denoising Diffusion Policy Optimization, effectively shifting the model’s generative distribution toward the target domain without relying on paired motion data. We evaluate our method on cross-dataset adaptation and leave-one-out motion experiments using the HumanML3D and KIT-ML datasets across both latent- and joint-space diffusion architectures. Results from quantitative metrics and user studies show that our approach consistently improves the quality and diversity of generated motions, while preserving performance on the original distribution. Our approach is a flexible, data-efficient, and privacy-preserving solution for motion adaptation.

The performance of deep learning-based models for X-ray prohibited item detection heavily relies on large-scale, diverse datasets, which are often unavailable. While data augmentation offers a promising solution, prevalent methods ignore the fundamental principles of X-ray imaging, leading to artifacts such as distorted material properties and unnatural thickness perturbations. To bridge this gap, we present MIX, a physics-grounded data augmentation pipeline. The core idea of MIX is to manipulate image attributes in a way that reflects real-world physical variations. Our contributions are twofold: (1) To address material ambiguity, MIX modulates foreground pseudo-colors by directly manipulating hue and saturation, informed by the relationship between color and effective atomic number. This forces the model to learn more robust material representations. (2) To simulate variations in object density and thickness, MIX introduces a novel thickness perturbation technique based on X-ray attenuation principles. This significantly improves the model's adaptability to geometric changes. Our proposed method seamlessly integrates with existing detectors and yields substantial performance gains across multiple benchmarks. Our work not only provides an effective augmentation solution but also highlights the critical need for domain-specific approaches in X-ray computer vision.
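The attenuation principle behind a thickness perturbation can be sketched as follows. This is an illustrative reading of the Beer-Lambert law, not the authors' implementation: transmitted intensity is I = I0 · exp(-μt), so scaling thickness t by a factor k maps the normalized transmission T = I / I0 to T^k. The function name `perturb_thickness`, the factor `k`, and the example values are assumptions made for this sketch.

```python
import math


def perturb_thickness(transmission, k):
    """Rescale normalized X-ray transmission T = I / I0 as if thickness changed by factor k.

    Beer-Lambert: T = exp(-mu * t), so thickness k * t gives exp(-mu * k * t) = T ** k.
    """
    if not 0.0 < transmission <= 1.0:
        raise ValueError("transmission must lie in (0, 1]")
    if k <= 0:
        raise ValueError("k must be positive")
    return transmission ** k


# Hypothetical pixel: attenuation coefficient mu = 0.5 per cm, thickness 2 cm.
mu, t = 0.5, 2.0
T = math.exp(-mu * t)
# Same material simulated at 1.5x the thickness (3 cm): less radiation gets through.
T_thicker = perturb_thickness(T, 1.5)
```

Applying the power law to the transmission, rather than adding a constant to pixel values, keeps the perturbed image consistent with a physically thicker or thinner object of the same material.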


95
Reverse Personalization

Han-Wei Kung ⋅ Tuomas Varanka ⋅ Nicu Sebe

Recent text-to-image diffusion models have demonstrated remarkable ability to generate realistic facial images conditioned on textual prompts and human identities. This has enabled the creation of personalized facial imagery. However, existing prompt-based methods for removing or modifying identity-specific features rely either on the subject being well-represented in the distribution of the pre-trained model or require model fine-tuning for specific identities. In this work, we analyze the identity generation process in diffusion models and introduce a reverse personalization framework for effective face anonymization. Our approach leverages conditional diffusion inversion, allowing direct manipulation of images without relying on text prompts. To generalize beyond subjects present in the model's training data, we incorporate an identity-guided conditioning branch. Unlike prior anonymization methods, which lack the ability to control facial attributes, our framework supports flexible, attribute-controllable anonymization. We demonstrate that our method achieves state-of-the-art performance in identity removal, attribute preservation, and image quality, offering a practical and scalable solution for privacy-preserving face generation.

Contrastive learning methods in self-supervised settings have primarily focused on pre-training encoders, while decoders are typically introduced and trained separately for downstream dense prediction tasks. However, this conventional approach overlooks the potential benefits of jointly pre-training both encoder and decoder. In this paper, we propose DeCon, an efficient encoder-decoder self-supervised learning (SSL) framework that supports joint contrastive pre-training. We first extend existing SSL architectures to accommodate diverse decoders and their corresponding contrastive losses. Then, we introduce a weighted encoder-decoder contrastive loss with non-competing objectives to enable the joint pre-training of encoder-decoder architectures. By adapting an established contrastive SSL framework for dense prediction tasks, DeCon achieves new state-of-the-art results: on COCO object detection and instance segmentation when pre-trained on the COCO dataset, and across almost all dense downstream benchmark tasks when pre-trained on COCO+ and ImageNet-1K. Our results demonstrate that joint pre-training enhances the representation power of the encoder and improves performance in dense prediction tasks. This gain persists across heterogeneous decoder architectures, various encoder architectures, and in out-of-domain limited-data scenarios.

We propose a novel framework for training 3D-aware Generative Adversarial Networks (GANs) from a collection of 2D images, effectively learning both image distribution and 3D geometric configurations without relying on strong 3D priors such as camera poses, depth information, or target-specific 3D models. To achieve these goals, we introduce hyper-pose embeddings alongside a novel pose disentanglement technique that effectively separates pose and scene information. This crucial disentanglement helps the generative model overcome the inherent conflict between learning photo-realism and accurate 3D geometry. Furthermore, we propose soft contrastive learning to robustly handle the continuous nature of camera poses, and a non-match loss to further enhance disentanglement and refine embedding training. With extensive experiments, we show the outstanding performance of our method in 3D-aware image synthesis, particularly on challenging datasets with complex or diverse objects.

Image inpainting aims to restore missing regions in images in a visually plausible and structurally consistent manner. However, existing methods often struggle with irregular holes and complex structural patterns due to the limitations of fixed-kernel convolutions and static attention mechanisms. In this paper, we propose Mask-Aware Deformable Inpainting Network (MADIN), an image inpainting framework that enables position-aware control based on mask information. The proposed model employs a Two-stage Offset Estimator that jointly utilizes query features and mask signals to reliably predict reference positions even within masked regions. Moreover, the introduction of Adaptive Offset Range Scaling allows the model to flexibly access broader contextual information by adjusting the offset magnitude according to the masking ratio. By effectively combining convolutional operations with attention mechanisms, MADIN integrates both local and global information while maintaining spatial structure without requiring explicit positional encoding. Extensive quantitative and qualitative experiments on the CelebA-HQ and Places2 datasets demonstrate that MADIN achieves superior restoration performance, despite having a lightweight structure of only 29M parameters and an inference speed under 48 ms. Our method outperforms existing state-of-the-art approaches across key metrics including PSNR, SSIM, LPIPS, and FID.


99
Salience-SGG: Enhancing Unbiased Scene Graph Generation with Iterative Salience Estimation

Runfeng Qu ⋅ Ole Hall ⋅ Pia Bideau ⋅ Julie Ouerfelli-Ethier ⋅ Martin Rolfs ⋅ Klaus Obermayer ⋅ Olaf Hellwich

Scene Graph Generation (SGG) suffers from a long-tailed distribution, where a few predicate classes dominate while many others are underrepresented, leading to biased models that underperform on rare relations. Unbiased-SGG methods address this by implementing debiasing strategies, but often at the cost of spatial understanding, resulting in over-reliance on semantic priors. We introduce Salience-SGG, a novel framework featuring an Iterative Salience Decoder (ISD) that emphasizes triplets with salient spatial structures. To support this, we propose semantic-agnostic salience labels guiding ISD. Evaluations on Visual Genome, Open Images V6, and GQA-200 show that Salience-SGG achieves state-of-the-art performance and improves existing Unbiased-SGG methods.

Wildfire smoke is transparent, amorphous, and often visually confounded with clouds, making early-stage detection particularly challenging. In this work, we introduce a benchmark, called SmokeBench, to evaluate the ability of multimodal large language models (MLLMs) to recognize and localize wildfire smoke in images. The benchmark consists of four tasks: (1) smoke classification, (2) tile-based smoke localization, (3) grid-based smoke localization, and (4) smoke detection. We evaluate several MLLMs, including Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro. Our results show that while some models can classify the presence of smoke when it covers a large area, all models struggle with accurate localization, especially in the early stages. Further analysis reveals that smoke volume is strongly correlated with model performance, whereas contrast plays a comparatively minor role. These findings highlight critical limitations of current MLLMs for safety-critical wildfire monitoring and underscore the need for methods that improve early-stage smoke localization.


101
Visibility guided Self-Supervised Occlusion Resilient Human Pose Estimation

Arindam Dutta ⋅ Sarosij Bose ⋅ Rohit Kundu ⋅ Calvin-Khang Ta ⋅ Saketh Bachu ⋅ Konstantinos Karydis ⋅ Amit Roy-Chowdhury

Occlusion remains a significant challenge for existing human pose estimation algorithms, often resulting in inaccurate and anatomically implausible predictions. Although recent occlusion-robust methods report strong performance, they typically rely heavily on supervised learning and privileged information, such as multiview data or temporal sequences. Furthermore, these models often fail under domain changes. Domain-adaptive human pose estimation seeks to mitigate this issue; however, when occlusions are present in the target domain, a common occurrence in real-world applications, performance of these algorithms deteriorates significantly. To address these challenges, we propose VisOR, a novel Visibility guided Self-Supervised algorithm for Occlusion-Resilient Human Pose Estimation. VisOR achieves robustness to both domain shifts and occlusions by integrating contextual reasoning with iterative pseudo-label refinement. It mitigates the overfitting to noisy labels from occluded regions via a visibility-driven curriculum learning strategy, which progressively introduces the model to increasingly occluded training samples. Additionally, VisOR is regularized by a learned human pose prior that maintains anatomical plausibility throughout the adaptation process. Recognizing the scarcity of human pose datasets with realistic occlusions, we introduce BOW: Blended Occlusions in-the-Wild, a rigorously constructed context-aware synthetic benchmark designed to evaluate the occlusion resilience of human pose estimation algorithms. BOW offers a diverse range of context-aware occlusions across both indoor and outdoor environments, simulating real-world conditions. Through extensive experiments, we demonstrate that VisOR outperforms current state-of-the-art methods by ~7% in challenging occluded human pose estimation benchmarks and provides a baseline performance on BOW, against existing algorithms.

Sketch colorization is highly demanded in the field of art, as it offers a valuable tool for artists, designers, and illustrators to explore novel possibilities and express their creativity. Given a sketch, it can be colorized in various styles, such as rendering a facial sketch with different hair colors. To achieve this, previous works often resorted to sampling style codes from a simple Gaussian distribution to produce varied images. However, due to the inherent scarcity of information in sketches, the resulting colorized images often exhibit noticeable artifacts and lack of diversity. In this paper, our aim is to generate informative style representations that can effectively compensate for missing content information in sketches while enabling diverse generations. We improve style granularity by extracting style information from CLIP, and achieve the disentanglement of content and style by establishing semantic correspondence between sketches and color images in the CLIP space. Furthermore, to alleviate the artifacts and blending issues caused by semantic deficiency, we simultaneously train a recolorization model in an end-to-end manner. The recolorization model shares the same style space as the colorization model. In this way, we can construct multiple pseudo-sketch-image pairs for each sketch, which are used to provide pixel-level supervision for the colorization model, thus significantly facilitating the learning of semantic correspondence. Experiments demonstrate that our method effectively mitigates artifacts in colorized results and produces more semantically rich colors.


103
GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection

Yu Wang ⋅ Juhyung Ha ⋅ Frangil Ramirez ⋅ Yuchen Wang ⋅ David Crandall

Active Speaker Detection (ASD) seeks to determine who is speaking at each moment by modeling the complex interplay between audio and visual modalities. While most state-of-the-art approaches rely on late fusion, combining multimodal features only at high semantic levels, they often fail to capture the fine-grained cross-modal interactions present at lower layers, interactions that are critical for robust performance in unconstrained scenarios. In this work, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1 mAP (+0.5%) on Ego4D, UniTalk, and WASD, respectively, and delivers competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments further demonstrate the generalization capability of our model, while comprehensive ablations validate the complementary benefits of each proposed component.


104
MR-Pruner: Training-free Multi-resolution Visual Token Pruning for Multi-modal Large Language Models

Seunghoon Han ⋅ Hyewon Lee ⋅ Soyoung Park ⋅ Jong-Ryul Lee ⋅ Sungsu Lim

Large Language Models (LLMs) extended to multi-modal inputs have led to Multi-modal LLMs (MLLMs) that perform strongly on vision-language tasks. Recent MLLMs adopt multi-resolution inputs to capture both global context and local details, but this substantially increases visual tokens and computational cost. Existing pruning methods reduce redundancy but are designed for single-resolution settings, overlooking the characteristics of multi-resolution tokens. We observe two key properties: tokens from different resolutions follow distinct distributions of information content, and tokens across resolutions exhibit mutual complementarity, such that pruning one type can often be compensated by the other. Based on this observation, we propose a Multi-Resolution Token Pruning method (MR-Pruner), a training-free, graph-based pruning framework for multi-resolution MLLMs. MR-Pruner incorporates three components (Intra-resolution, Cross-resolution Token Scoring, and Informativeness-aware Token Pruning) that adaptively allocate pruning ratios and facilitate information propagation across resolutions. Experiments on eight benchmarks show that MR-Pruner achieves superior efficiency–performance trade-offs. For example, when only 10% of the visual tokens are retained, it leads to an average performance degradation of 3.6%. For reproducibility, the source code is available at https://anonymous.4open.science/r/MR-Pruner.


105
SpikeRain: Towards Energy-Efficient Single Image Deraining with Spiking Neural Networks

Md Tanvir Islam ⋅ Inzamamul Alam ⋅ Sambit Bakshi ⋅ Khan Muhammad ⋅ Javier Del Ser ⋅ Sangtae Ahn

With the rapid deployment of vision systems on edge devices, energy-efficient and temporally aware image deraining models are increasingly needed. We propose SpikeRain, a spiking neural network (SNN) that achieves competitive deraining performance with substantially lower computational cost than conventional artificial neural networks (ANNs). Unlike ANN-based approaches with dense activations and high memory demands, SpikeRain leverages the event-driven sparse-firing nature of spiking neurons for efficient temporal integration and contextual learning. Built on an encoder-decoder framework, SpikeRain incorporates three spiking-native modules: a Dense Spiking Residual Block (DSRB) for temporal integration and feature reuse, a Multi-Dimensional Spiking Attention (MDSA) module to model temporal, channel, and spatial dependencies, and an Adaptive Residual Feature Enhancement (ARFE) block with gated attention to refine salient features. Experiments on synthetic and real-world benchmarks show that SpikeRain achieves state-of-the-art PSNR and SSIM while reducing parameters by approximately 40% and FLOPs by approximately 89%, with energy efficiency on par with existing SNN methods. These results highlight the potential of SNNs for real-time low-power image restoration on neuromorphic platforms. Anonymous code is available for review at https://anonymous.4open.science/r/SpikeRain-D18C/.


106
MBTI: Metric-Based Textual Inversion for Fine-Grained Image Generation

ByungKwan Chae ⋅ Youngjae Choi ⋅ Heewon Kim

Textual Inversion has recently gained attention for generating diverse text-guided images by learning a custom class from just a few reference images. However, the generated images often struggle to distinguish between fine-grained classes with similar visual characteristics. To address this challenge, we propose a novel technique for fine-grained image generation called Metric Based Textual Inversion (MBTI). MBTI leverages inter-class relationships from reference images of different classes to encode their characteristics into new pseudo-words, enhancing fine-grained image generation. Learning inter-class information is facilitated by maximizing the distances between the pseudo-words in the text embedding space. MBTI employs a simple selection rule for embeddings and a basic distance metric. Experimental results demonstrate that MBTI successfully generates images for fine-grained classes with distinct characteristics, which are crucial for accurately identifying the image classes. By leveraging its ability to highlight and preserve fine-grained details as a data augmentation technique, MBTI also significantly enhances the performance of fine-grained image classification.
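The idea of pushing pseudo-word embeddings apart can be sketched with a simple separation objective. The abstract names only "a basic distance metric", so the Euclidean distance used here, the function names, and the toy two-dimensional embeddings are all assumptions for illustration rather than the MBTI formulation.

```python
import math
from itertools import combinations


def mean_pairwise_distance(embeddings):
    """Average Euclidean distance over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def separation_loss(embeddings):
    """Loss that decreases as pseudo-word embeddings move apart.

    Minimizing this term maximizes the mean pairwise distance.
    """
    return -mean_pairwise_distance(embeddings)


# Toy 2-D pseudo-word embeddings for two fine-grained classes (hypothetical values).
close_pair = [[0.0, 0.0], [1.0, 0.0]]
far_pair = [[0.0, 0.0], [3.0, 4.0]]
```

In practice a term like this would be added to the usual Textual Inversion reconstruction objective, so that the embeddings separate while still describing their reference images.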

Variations in environment and sensor (ES) conditions (lighting, ISO, shutter, and aperture) cause domain shifts that degrade visual recognition. While a recent robustness benchmark, ImageNet-ES, shows that these shifts differ from conventional augmentations, collecting physically recaptured data is costly and hard to scale. We present CycleGAN-ES, a per-condition unpaired translation framework that simulates ES-style variations from a small set of real targets. Trained on Tiny-ImageNet and ImageNet-ES domain pairs with as few as 200 images per target domain and minimal tuning, CycleGAN-ES produces a synthetic counterpart, ImageNet-sES (IN-sES). The generated images exhibit high-fidelity ES effects both qualitatively and quantitatively, reproducing characteristic exposure and noise behaviors (e.g., highlight clipping at long exposure and increased high-ISO noise). In benchmark evaluations, augmenting training with ImageNet-sES improves robustness to ES shifts on ImageNet-ES, achieves complementary gains when combined with standard augmentation strategies, and transfers to other corruption domains such as ImageNet-C. The learned translators further transfer to new datasets (e.g., CIFAR-100) without retraining. To the best of our knowledge, this is the first systematic study of ES simulation anchored to real recaptures at ImageNet scale. Our results establish ES simulation as a scalable, practical route to incorporating ES-driven style diversity into training pipelines and lay the groundwork for broader real-world robustness evaluation.


108
MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval

Seojeong Park ⋅ Jiho Choi ⋅ Kyungjune Baek ⋅ Hyunjung Shim

Video Moment Retrieval (MR) aims to localize moments within a video based on a given natural language query. Given the prevalent use of platforms like YouTube for information retrieval, the demand for MR techniques is growing significantly. Recent DETR-based models have made notable advances in performance but still struggle with accurately localizing short moments. Through data analysis, we identified limited feature diversity in short moments, which motivated the development of MomentMix. MomentMix generates new short-moment samples by employing two augmentation strategies: ForegroundMix and BackgroundMix, each enhancing the ability to understand the query-relevant and irrelevant frames, respectively. Additionally, our analysis of prediction bias revealed that models particularly struggle to predict the center positions and lengths of short moments. To address this, we propose a Length-Aware Decoder, which conditions length through a novel bipartite matching process. Our extensive studies demonstrate the efficacy of our length-aware approach, especially in localizing short moments, leading to improved overall performance. Our method surpasses state-of-the-art DETR-based methods on benchmark datasets, achieving the highest R1 and mAP on QVHighlights and the highest R1@0.7 on TACoS and Charades-STA (such as a 9.62% gain in R1@0.7 and a 16.9% gain in mAP average for QVHighlights). The code is available at https://anonymous.4open.science/r/LA-8E4B.
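For readers unfamiliar with the R1@0.7 metric quoted above, the following minimal sketch shows how it is conventionally computed: a query counts as a hit when its top-ranked predicted span overlaps the ground-truth span with temporal IoU of at least 0.7. The function names and the example spans are assumptions for illustration and are not taken from the paper's code.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) spans, in seconds."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, threshold=0.7):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    if not gts:
        raise ValueError("need at least one query")
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)


# Two hypothetical queries, both with a 10-second ground-truth moment.
predictions = [(2.0, 10.0), (0.0, 4.0)]
ground_truth = [(0.0, 10.0), (0.0, 10.0)]
score = recall_at_1(predictions, ground_truth)
```

Short moments are penalized heavily under this metric: a fixed boundary error of one or two seconds removes a much larger share of the IoU for a short span than for a long one.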


109
From Lightweight CNNs to SpikeNets: Benchmarking Accuracy–Energy Tradeoffs with Pruned Spiking SqueezeNet

Radib Kabir ⋅ Tawsif Tashwar Dipto ⋅ Mehedi Ahamed ⋅ Sabbir Ahmed ⋅ Md Hasanul Kabir

Spiking Neural Networks (SNNs) are increasingly studied as energy-efficient alternatives to Convolutional Neural Networks (CNNs), particularly for edge intelligence. However, prior work has largely emphasized large-scale models, leaving the design and evaluation of lightweight CNN-to-SNN pipelines underexplored. In this paper, we present the first systematic benchmark of lightweight SNNs obtained by converting compact CNN architectures into spiking networks, where activations are modeled with Leaky-Integrate-and-Fire (LIF) neurons and trained using surrogate gradient descent under a unified setup. We construct spiking variants of ShuffleNet, SqueezeNet, MnasNet, and MixNet, and evaluate them on CIFAR-10, CIFAR-100, and TinyImageNet, measuring accuracy, F1-score, parameter count, computational complexity, and energy consumption. Our results show that SNNs can achieve up to 15.7× higher energy efficiency than their CNN counterparts while retaining competitive accuracy. Among these, the SNN variant of SqueezeNet consistently outperforms other lightweight SNNs. To further optimize this model, we apply a structured pruning strategy that removes entire redundant fire modules, yielding a pruned architecture, SNN SqueezeNet-P. This pruned model improves CIFAR-10 accuracy by 6% and reduces parameters by 19% compared to the original SNN SqueezeNet. Crucially, it narrows the gap with CNN SqueezeNet, achieving nearly the same accuracy (only 1% lower) but with an 88.1% reduction in energy consumption due to sparse spike-driven computations. Together, these findings establish lightweight SNNs as practical, low-power alternatives for edge deployment, highlighting a viable path toward deploying high-performance, low-power intelligence on the edge.
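
As a rough illustration of the Leaky-Integrate-and-Fire (LIF) dynamics mentioned above, the sketch below simulates a single discrete-time LIF neuron. The decay factor `beta`, the threshold, and the hard reset to zero are assumed values and conventions chosen for illustration; the paper's exact neuron model and hyperparameters are not given in the abstract.

```python
from typing import List


def lif_forward(inputs: List[float], beta: float = 0.5,
                threshold: float = 1.0) -> List[int]:
    """Simulate one discrete-time LIF neuron and return its spike train."""
    membrane = 0.0
    spikes = []
    for current in inputs:
        membrane = beta * membrane + current  # leak, then integrate the input
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after a spike (an assumed reset rule)
        else:
            spikes.append(0)
    return spikes


spike_train = lif_forward([0.6, 0.6, 0.6, 0.6])
```

The thresholding step is not differentiable, which is why training such networks relies on a surrogate gradient in the backward pass. The spike train is binary and often sparse, which is the property the energy savings are attributed to.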


110
AFRAgent: An Adaptive Feature Renormalization Based High Resolution Aware GUI agent

Neeraj Anand ⋅ Rishabh Jain ⋅ Sohan Patnaik ⋅ Balaji Krishnamurthy ⋅ Mausoom Sarkar

There is a growing demand for mobile user interface (UI) automation, driven by its broad applications across industries. With the advent of visual language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, thus optimizing automation workflows. Recent approaches leverage VLMs for this problem due to their ability to 1) process on-screen content directly, 2) remain independent of device-specific APIs by utilizing human actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge for task understanding. However, these models often have trouble accurately identifying widgets and determining actions due to limited spatial information in vision encoder features. Additionally, top-performing models are often large, requiring extensive training and resulting in inference delays. In this work, we introduce AFRAgent, an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization technique (a token-level affine transformation) that effectively enriches low-resolution image embeddings and fuses high-resolution details. We evaluate AFRAgent on Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.


111
BanglaProtha: Evaluating Vision Language Models in Underrepresented Long-tail Cultural Contexts

Md Fahim ⋅ Md Sakib Ul Rahman Sourove ⋅ Akm Mazumder ⋅ Md Ishmam ⋅ Md Adib ⋅ Fariha Tanjim Shifat ⋅ Fabiha Haider ⋅ Md Bhuiyan

The advanced multimodal processing of current vision language models (VLMs) has prompted rigorous benchmarking in multicultural settings, revealing a clear inclination toward Western culture. While the bias likely stems from the predominance of Western-centric images in the VLM pretraining data, the resulting long-tail distribution problem is only exacerbated in underrepresented cultural settings, such as Bengali. Our work explores this problem through an aspect-based evaluation of several classes of VLMs on the rich Bengali culture. Our BanglaProtha dataset is a VQA dataset, containing images that encapsulate Bengali cultural elements, questions in native Bengali, and semantically similar multiple-choice answer options. Our experiments provide behavioral insights into VLMs across prompting and fine-tuning strategies, cultural aspects, model size, and augmentation methods. Our work serves as a diagnostic tool for addressing and mitigating inequalities in multicultural and multilingual settings, thereby supporting efforts to democratize AI systems.

This paper introduces CommonForms, a web-scale dataset for form field detection. It casts the problem of form field detection as object detection: given an image of a page, predict the location and type (Text Input, Choice Button, Signature) of form fields. The dataset is constructed by filtering Common Crawl to find PDFs that have fillable elements. Starting with 8 million documents, the filtering process is used to arrive at a final dataset of roughly 55k documents that have over 450k pages. Analysis shows that the dataset contains a diverse mixture of languages and domains; one third of the pages are non-English, and among the 14 classified domains, no domain makes up more than 25% of the dataset. In addition, this paper presents a family of form field detectors, FFDNet-Small and FFDNet-Large, which attain a very high average precision on the CommonForms test set. Each model cost less than $500 to train. Ablation results show that high-resolution inputs are crucial for high-quality form field detection, and that the cleaning process improves data efficiency over using all PDFs that have fillable fields in Common Crawl. A qualitative analysis shows that they outperform a popular, commercially available PDF reader that can prepare forms. Unlike the most popular commercially available solutions, FFDNet can predict checkboxes in addition to text and signature fields. This is, to our knowledge, the first large-scale dataset released for form field detection, as well as the first open source models. The dataset, models, and code will be released after review.


113
Beyond Real Weights: Hypercomplex Representations for Stable Quantization

Jawad Ibn Ahad ⋅ Maisha Rahman ⋅ Amrijit Biswas ⋅ Muhammad Kabir ⋅ Robin Krambroeckers ⋅ Sifat Momen ⋅ Nabeel Mohammed ⋅ Shafin Rahman

Vision language models (VLMs) demand immense parameter capacity to align high-dimensional visual features with linguistic representations, making them highly sensitive to quantization. We introduce a hypercomplex quantization framework that encodes model weights in complex space, $\mathbb{C}^n$ rather than $\mathbb{R}^n$, where a single complex weight simultaneously represents coupled real and imaginary components. Formally, we view quantization as an isomorphism $\varphi: \mathbb{R}^2 \to \mathbb{C}$, allowing each quantized parameter to preserve both magnitude and angular phase information under constrained bit-widths. This coupling reduces representational redundancy while maintaining alignment fidelity between modalities. In practice, replacing large feed-forward projections with hypercomplex operators yields a parameterization that is half the size in storage but twice as expressive per weight, stabilizing training dynamics even under aggressive quantization. Beyond compression, hypercomplex quantization provides a natural inductive bias for multimodal fusion, since visual embeddings are inherently spatial phase-rich and thus more faithfully preserved in hypercomplex form. Our framework enables VLMs to sustain high cross-modal alignment accuracy while operating with significantly compressed memory footprints, offering a principled path toward efficient yet stable multimodal intelligence.
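
The coupling of real and imaginary components can be made concrete with ordinary complex multiplication: one complex weight a + bi, stored as two real numbers, acts on a pair (x, y) like the 2×2 real matrix [[a, -b], [b, a]], which has four entries when stored densely. The snippet below only illustrates that identity, under the assumption that pairs of real features are packed into complex numbers. It is not the authors' quantization scheme, and the function names are hypothetical.

```python
from typing import Tuple


def complex_apply(weight: complex, x: float, y: float) -> Tuple[float, float]:
    """Apply one complex weight to the feature pair (x, y) viewed as x + yi."""
    out = weight * complex(x, y)
    return out.real, out.imag


def matrix_apply(a: float, b: float, x: float, y: float) -> Tuple[float, float]:
    """Apply the equivalent 2x2 real matrix [[a, -b], [b, a]] to (x, y)."""
    return a * x - b * y, b * x + a * y


via_complex = complex_apply(complex(2.0, 3.0), 1.0, 4.0)
via_matrix = matrix_apply(2.0, 3.0, 1.0, 4.0)
```

Both calls give the same pair, so two stored numbers reproduce a structured four-entry linear map. The price is that only rotation-and-scaling maps of this form can be represented.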


114
mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

Kyeongseon Kim ⋅ Baek Seong-Eun ⋅ Lee Jung-Mok ⋅ Tae-Hyun Oh

While Scalable Vector Graphic (SVG) codes appear as either plain text or visually as images, they are structured representations that encode geometric and layout information. However, existing methods typically convert SVGs into raster images, discarding their structural details. Similarly, previous sentence embedding methods generate high-quality text embeddings but do not extend to structured or visual modalities such as SVGs. To address these challenges, we propose the first training-free multimodal embedding method that uses a Multimodal Large Language Model (MLLM) to project text, images, and SVG code into an aligned space. Our method consists of two main components: (1) multimodal Explicit One-word Limitation (mEOL), which produces compact, semantically grounded embeddings across modalities without training; and (2) a semantic SVG module that rewrites SVG code by generating missing or non-descriptive components through visual reasoning. This lets the model embed structural signals overlooked in prior work. Our approach not only introduces the first SVG retrieval setting but also achieves strong empirical performance, surpassing prior methods including training-based models by up to +20.5% Recall@1 on a repurposed VGBench dataset. These results demonstrate that structural cues can significantly enhance semantic alignment in multimodal embeddings, enabling effective retrieval without any fine-tuning.


115
Contrastive Integrated Gradients: A Feature Attribution-Based Method for Explaining Whole Slide Image Classification

Anh Vu ⋅ Tuan Vo ⋅ Ngoc Bui ⋅ Nam Le ⋅ AKASH AWASTHI ⋅ Huy Vo ⋅ Thanh-Huy Nguyen ⋅ Zhu Han ⋅ Chandra Mohan ⋅ Hien Nguyen

Interpretability is essential in Whole Slide Image (WSI) analysis for computational pathology, where understanding model predictions helps build trust in AI-assisted diagnostics. While Integrated Gradients (IG) and related attribution methods have shown promise, applying them directly to WSIs introduces challenges due to their high-resolution nature. These methods capture model decision patterns but may overlook class-discriminative signals that are crucial for distinguishing between tumor subtypes. In this work, we introduce Contrastive Integrated Gradients (CIG), a novel attribution method that enhances interpretability by computing contrastive gradients in logit space. First, CIG highlights class-discriminative regions by comparing feature importance relative to a reference class, offering sharper differentiation between tumor and non-tumor areas. Second, CIG satisfies the axioms of integrated attribution, ensuring consistency and theoretical soundness. Third, we propose two attribution quality metrics, MIL-AIC and MIL-SIC, which measure how predictive information and model confidence evolve with access to salient regions, particularly under weak supervision. We validate CIG across three datasets spanning distinct cancer types: CAMELYON16 (breast cancer metastasis in lymph nodes), TCGA-RCC (renal cell carcinoma), and TCGA-Lung (lung cancer). Experimental results demonstrate that CIG yields more informative attributions both quantitatively, using MIL-AIC and MIL-SIC, and qualitatively, through visualizations that align closely with ground truth tumor regions, underscoring its potential for interpretable and trustworthy WSI-based diagnostics.
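
For context, the sketch below approximates standard Integrated Gradients (not the contrastive variant proposed here) on a toy function and checks the completeness axiom, which states that attributions sum to the difference between the model output at the input and at the baseline. The toy model, the helper names, and the midpoint Riemann sum are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], Sequence[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight path from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def toy_model(p: Sequence[float]) -> float:
    """Hypothetical model F(x) = x0^2 + 3*x0*x1."""
    return p[0] ** 2 + 3 * p[0] * p[1]


def toy_grad(p: Sequence[float]) -> List[float]:
    """Analytic gradient of toy_model."""
    return [2 * p[0] + 3 * p[1], 3 * p[0]]


attributions = integrated_gradients(toy_grad, x=[1.0, 2.0], baseline=[0.0, 0.0])
completeness_gap = sum(attributions) - (toy_model([1.0, 2.0]) - toy_model([0.0, 0.0]))
```

In a real pipeline the gradient would come from automatic differentiation of the classifier's logits rather than from a hand-written function.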

Whole Slide Images (WSIs) are high-resolution digital scans widely used in medical diagnostics. Due to their immense size, WSI classification is typically approached using Multiple Instance Learning (MIL), where a slide is partitioned into individual tiles, disrupting its spatial structure. Recent MIL methods often incorporate spatial context through rigid spatial assumptions (e.g., fixed kernels), which limit their ability to capture the intricate tissue structures crucial for an accurate diagnosis. To address this limitation, we propose Probabilistic Spatial Attention MIL (PSA-MIL), a novel attention-based MIL framework that integrates spatial context into the attention mechanism through learnable distance-decayed priors, formulated within a probabilistic interpretation of self-attention as a posterior distribution. This formulation enables a dynamic inference of spatial relationships during training, eliminating the need for predefined assumptions often imposed by previous approaches. Additionally, we introduce a diversity loss that encourages spatial variations among attention heads, ensuring each head captures distinct representations. Furthermore, we address the computational challenge that long sequences, such as those in WSI analysis, pose for transformer-based architectures by introducing a spatial pruning strategy for the posterior, thereby reducing computational costs while maintaining performance. Together, PSA-MIL enables a more data-driven and adaptive integration of spatial context, moving beyond predefined constraints. Extensive experiments on multiple datasets and tasks demonstrate that our method outperforms both contextual and non-contextual models, setting a new state-of-the-art while significantly reducing computational costs.


117
Fully Unsupervised Self-debiasing of Text-to-Image Diffusion Models

Korada Sri Vardhana ⋅ Shrikrishna Lolla ⋅ Soma Biswas

Text-to-image (T2I) diffusion models have achieved widespread success due to their ability to generate high-resolution, photorealistic images. These models are trained on large-scale datasets, like LAION-5B, often scraped from the internet. However, since this data contains numerous biases, the models inherently learn and reproduce them, resulting in stereotypical outputs. We introduce SelfDebias, a fully unsupervised test-time debiasing method applicable to any diffusion model that uses a UNet as its noise predictor. SelfDebias identifies semantic clusters in an image encoder's embedding space and uses these clusters to guide the diffusion process during inference, minimizing the KL divergence between the output distribution and the uniform distribution. Unlike supervised approaches, SelfDebias does not require human-annotated datasets or external classifiers trained for each generated concept. Instead, it is designed to automatically identify semantic modes. Extensive experiments show that SelfDebias generalizes across prompts and diffusion model architectures, including both conditional and unconditional models. It effectively debiases images not only along key demographic dimensions but also for more abstract concepts whose biases are harder to identify, while maintaining the visual fidelity of the generated images. The code will be made public upon acceptance.
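
The objective named above, the KL divergence between the output distribution over semantic clusters and the uniform distribution, can be illustrated on hard cluster assignments. This is a minimal sketch of the quantity only. The function name and the use of discrete cluster labels are assumptions, and the guidance mechanism that minimizes this value during diffusion sampling is not shown.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def kl_to_uniform(cluster_ids: Sequence[Hashable], num_clusters: int) -> float:
    """KL(p || u), with p the empirical cluster distribution and u uniform."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    kl = 0.0
    for count in counts.values():
        p = count / total
        kl += p * math.log(p * num_clusters)  # log(p / (1 / num_clusters))
    return kl


balanced = kl_to_uniform([0, 1, 2, 3], num_clusters=4)
collapsed = kl_to_uniform([0, 0, 0, 0], num_clusters=4)
```

A batch spread evenly over the clusters scores zero, while a batch that collapses onto one cluster scores log(4), the maximum for four clusters.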


118
Model-free Domain Adaptation for Concealed Multimodal Large-Language Models

Yu Mitsuzumi ⋅ Akisato Kimura ⋅ Hisashi Kashima

Multimodal large-language models (MLLMs) exhibit remarkable capability for various vision tasks but still struggle with the domain-shift problem, in which their performance degrades for data from unfamiliar domains. Since the latest MLLMs often conceal their model resources (i.e., data, parameters, and outputs) from use in training, current domain adaptation methods cannot satisfactorily address this problem due to their dependence on those resources. To this end, we introduce a novel domain adaptation setup, "model-free domain adaptation (MFDA)" of MLLMs, to investigate whether we can address domain adaptation problems without using any resources of the concealed models. As a proof of concept for MFDA, we built a method named model-transferable domain-adaptable visual prompting (MTDA-VP). During training, this method executes cross-model visual prompting on surrogate models with a domain adaptation objective so that the visual prompts simultaneously acquire model transferability and domain adaptability. At test time, we can adapt the concealed MLLMs to the target domain by just inputting the test images with the trained prompt into the models. In addition, we developed two techniques, cross-model pseudo labeling (CMPL) and cross-model gradient alignment (CMGA), to further enhance model transferability and domain adaptability of the visual prompts. We empirically confirmed that MTDA-VP improved the performance of several MLLMs with large margins on two datasets.

Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination and confidently describe objects or attributes not present in the image. Current training-free interventions struggle to maintain accuracy in open-ended and long-form generation scenarios. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge by targeting two key biases: spatial perception bias, which distributes attention disproportionately across image tokens, and modality bias, which shifts focus from visual to textual inputs over time. CAAC employs a two-step approach: Visual-Token Calibration (VTC) to balance attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) to reinforce visual grounding guided by the model’s confidence. This confidence-driven adjustment ensures consistent visual alignment during generation. Experiments on CHAIR, AMBER, and POPE benchmarks demonstrate that CAAC outperforms baselines, particularly in long-form generations, effectively reducing hallucination. Data and code are available at https://anonymous.4open.science/r/CAAC-5D7F/.


120
LiDAR-DHMT: LiDAR-Adaptive Dual Hierarchical Mask Transformer for Robust Freespace Detection and Semantic Segmentation

Siyu Chen ⋅ Ting Han ⋅ Changshe Zhang ⋅ Xin Luo ⋅ Huan Chen ⋅ Meiliu Wu ⋅ Guorong Cai ⋅ jinhe su

Inaccurate freespace detection remains a significant challenge to the safety of autonomous driving. However, we observe that current multisource fusion approaches that rely on converting LiDAR point clouds into depth maps often lose crucial 3D geometric cues. This compromises the spatial consistency of predictions, especially in complex urban scenes. To address this limitation, we propose LiDAR-DHMT (LiDAR-Adaptive Dual-branch Hierarchical Mask Transformer), a novel framework designed for spatially consistent freespace detection and semantic segmentation. Our key innovation lies in the introduction of a 3D Relative Position Bias module, which effectively captures LiDAR's inherent spatial priors. This is coupled with a Dynamic Bias Attention mechanism that adaptively incorporates the 3D positional cues into the Transformer's attention computation, enhancing spatial coherence. Additionally, we employ a Mask Interaction module and a global-local fusion strategy to jointly model contextual semantics and fine-grained structural details. Extensive experiments conducted on the KITTI Road, KITTI-360, and Cityscapes datasets demonstrate that LiDAR-DHMT consistently outperforms existing state-of-the-art methods, achieving a competitive 97.59% F1 score in freespace detection and 69.45% and 84.4% mIoU in semantic segmentation. Our findings suggest that LiDAR-DHMT offers a practical solution for deploying robust freespace perception in complex urban driving environments.


121
Mixed Diffusion for 3D Indoor Scene Synthesis

Siyi Hu ⋅ Diego Martín Arroyo ⋅ Stephanie Debats ⋅ Fabian Manhardt ⋅ Luca Carlone ⋅ Federico Tombari

Generating realistic 3D scenes is an area of growing interest in computer vision and robotics. However, creating high-quality, diverse synthetic 3D content often requires expert intervention, making it costly and complex. Recently, efforts to automate this process with learning techniques, particularly diffusion models, have shown significant improvements in tasks like furniture rearrangement. However, applying diffusion models to floor-conditioned indoor scene synthesis remains under-explored. This task is especially challenging as it requires arranging objects in continuous space while selecting from discrete object categories, posing unique difficulties for conventional diffusion methods. To bridge this gap, we present MiDiffusion, a novel mixed discrete-continuous diffusion model designed to synthesize plausible 3D indoor scenes given a floor plan and pre-arranged objects. We represent a scene layout by a 2D floor plan and a set of objects, each defined by category, location, size, and orientation. Our approach uniquely applies structured corruption across mixed discrete semantic and continuous geometric domains, resulting in a better-conditioned problem for denoising. Evaluated on the 3D-FRONT dataset, MiDiffusion outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis. Additionally, it effectively handles partial object constraints via a corruption-and-masking strategy without task-specific training, demonstrating advantages in scene completion and furniture arrangement tasks.


122
PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models

Zilu Guo ⋅ Hongbin Lin ⋅ Zhihao Yuan ⋅ Chaoda Zheng ⋅ Pengshuo Qiu ⋅ Dongzhi Jiang ⋅ Renrui Zhang ⋅ Chun-Mei Feng ⋅ Zhen Li

3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements. However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets. Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps. To this end, we introduce PiSA-Engine (Point-Self-Augmented-Engine), a new framework for generating instruction point-language datasets enriched with 3D spatial semantics. We observe that existing 3D MLLMs offer a comprehensive understanding of point clouds for annotation, while 2D MLLMs excel at cross-validation by providing complementary information. By integrating holistic 2D and 3D insights from off-the-shelf MLLMs, PiSA-Engine enables a continuous cycle of high-quality data generation. We select PointLLM as the baseline and adopt this co-evolution training framework to develop an enhanced 3D MLLM, termed PointLLM-PiSA. Additionally, we identify limitations in previous 3D benchmarks, which often feature coarse language captions and insufficient category diversity, resulting in inaccurate evaluations. To address this gap, we further introduce PiSA-Bench, a comprehensive 3D benchmark covering six key aspects with detailed and diverse labels. Experimental results demonstrate PointLLM-PiSA's state-of-the-art performance in zero-shot 3D object captioning and generative classification on our PiSA-Bench, achieving significant improvements of 46.45% (+8.33%) and 63.75% (+16.25%), respectively. We will release the code, datasets, and benchmark.


123
MARS: a Multimodal Alignment and Ranking System for Few-Shot Segmentation

Nico Catalano ⋅ Stefano Samele ⋅ Paolo Pertino ⋅ Matteo Matteucci

Few Shot Segmentation aims to segment novel object classes given only a handful of labeled examples, enabling rapid adaptation with minimal supervision. Current literature crucially lacks a selection method that goes beyond visual similarity between the query and example images, leading to suboptimal predictions. We present MARS, a plug-and-play ranking system that leverages multimodal cues to filter and merge mask proposals robustly. Starting from a set of mask predictions for a single query image, we score, filter, and merge them to improve results. Proposals are evaluated using multimodal scores computed at local and global levels. Extensive experiments on COCO-20i, Pascal-5i, LVIS-92i, and FSS-1000 demonstrate that integrating all four scoring components is crucial for robust ranking, validating our contribution. As MARS can be effortlessly integrated with various mask proposal systems, we deploy it across a wide range of top-performer methods and achieve new state-of-the-art results on multiple existing benchmarks. Code will be available upon acceptance.

Deep neural networks exploit shortcuts—spurious correlations like laterality markers (spatial) or scanner-specific noise (spectral)—that severely compromise generalization in medical imaging. While recent work addresses individual shortcut types through model architecture or loss modifications, there is an argument for preprocessing the data itself, providing a more model-agnostic and visually interpretable approach. Furthermore, many healthcare applications face multiple concurrent shortcuts that are both spatial and spectral, which existing methods struggle to handle. We present SilverLining, an attention-based preprocessing framework that simultaneously identifies and mitigates both spatial and spectral shortcuts without introducing new spurious correlations. Our key insight is that naive removal of shortcut features can itself create new shortcuts, where models learn to exploit the removal patterns as new spurious correlations. We address this through a novel confounder-free correction strategy that maintains consistent preprocessing patterns across all classes in both spatial and frequency domains, preventing new confounders. Extensive experiments demonstrate SilverLining's effectiveness: achieving 0.87 AUC on controlled vision tasks and 0.94 AUC on counter-shortcut medical imaging evaluation where shortcuts are reversed; and improving cross-institutional chest X-ray classification from 0.72 to 0.77 AUC. Our data-centric approach provides an effective solution for reducing multiple types of data shortcuts without architectural modifications, creating preprocessed datasets that improve model robustness.


125
Distilling Diversity and Control in Diffusion Models

Rohit Gandikota ⋅ David Bau

Distilled diffusion models suffer from a critical limitation: reduced sample diversity compared to their base counterparts. In this work, we uncover that despite this diversity loss, distilled models retain the fundamental concept representations of base models. We demonstrate control distillation, where control mechanisms like Concept Sliders and LoRAs trained on base models can be seamlessly transferred to distilled models and vice-versa, effectively distilling control without any retraining. This preservation of representational structure prompted our investigation into the mechanisms of sample-diversity collapse during distillation. To understand how distillation affects diversity, we utilize x0 visualization as an analysis and debugging tool to reveal how models predict final outputs at intermediate steps. Through x0 visualization, we identify generation artifacts and inconsistencies, and demonstrate that initial diffusion timesteps disproportionately determine output diversity, while later steps primarily refine details. Based on these insights, we introduce diversity distillation, a hybrid inference approach that strategically employs the base model for only the first critical timestep before transitioning to the efficient distilled model. Our experiments demonstrate that this simple modification not only restores the diversity capabilities from base to distilled models but surprisingly exceeds them, while maintaining nearly the computational efficiency of distilled inference, all without requiring additional training or model modifications.
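
The hybrid inference idea described above can be sketched as a sampling loop that routes the first step to the base model and the remaining steps to the distilled model. Everything here is a simplified assumption: the step functions are toy stand-ins for real denoisers, the latent is a single float rather than a tensor, and the sketch ignores how the two models' timestep schedules would need to be reconciled in practice.

```python
from typing import Callable, List, Sequence, Tuple

StepFn = Callable[[float, int], float]  # (latent, timestep) -> next latent


def hybrid_sample(
    latent: float,
    timesteps: Sequence[int],
    base_step: StepFn,
    distilled_step: StepFn,
    num_base_steps: int = 1,
) -> Tuple[float, List[str]]:
    """Use the base model for the first steps, then the distilled model."""
    schedule = []
    for index, t in enumerate(timesteps):
        if index < num_base_steps:
            latent = base_step(latent, t)
            schedule.append("base")
        else:
            latent = distilled_step(latent, t)
            schedule.append("distilled")
    return latent, schedule


# Toy stand-ins: each "model" just adds a constant so the routing is visible.
final_latent, used = hybrid_sample(
    0.0,
    timesteps=[999, 749, 499, 249],
    base_step=lambda x, t: x + 10.0,
    distilled_step=lambda x, t: x + 1.0,
)
```

Because only one step goes through the base model in this configuration, the added cost over fully distilled inference is a single base-model call.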

We present a novel approach for generating realistic speaking and talking faces by synthesizing a person’s voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt/driving text, the driving image, and the voice profile of an individual, then combines them and passes them to the multi-entangled latent space to produce key-value pairs and queries for the audio and video modality generation pipeline. The multi-entangled latent space is responsible for establishing the spatiotemporal person-specific features between the modalities. Further, entangled features are passed to the respective decoder of each modality for output audio and video generation. Our experiments and analysis through standard metrics demonstrate the effectiveness of our model. We also developed a domain-specific dataset for the problem. All model checkpoints, code, and the proposed dataset can be found at https://github.com/narratingForYou/NarratingForYou.


127
Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters

Mizanur Rahman Jewel ⋅ Mohamed Elmahallawy ⋅ Sanjay Madria ⋅ Samuel Frimpong

Underground mining disasters create extreme environmental conditions—pervasive darkness, dense dust, and structural collapse—that severely degrade visual information, making accurate situational awareness exceptionally difficult for both human responders and conventional vision systems. To address this, we propose MDSE, Multimodal Disaster Situation Explainer, a novel vision-language framework that automatically generates detailed textual explanations of post-disaster underground scenes. MDSE has three-fold innovations: (i) Context-Aware Cross-Attention that robustly aligns visual and textual features even under severe degradation; (ii) Segmentation-aware dual pathway visual encoding that combines global and region-focused embeddings; and (iii) Resource-Efficient Transformer-Based Language Model that generates expressive captions with minimal compute cost. Importantly, we present the Underground Mine Disaster (UMD) dataset, the first large-scale image-caption corpus of real underground disaster scenes with expert annotations, for training and rigorous evaluation. Extensive experiments on the UMD dataset and related benchmarks show that MDSE significantly outperforms state-of-the-art vision-language captioning models, producing more accurate and contextually relevant descriptions that capture crucial details in obscured environments. These results demonstrate that MDSE's multimodal architecture and domain-focused data substantially improve situational awareness for underground emergency response.


128
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

Carlos Plou ⋅ Cesar Borja ⋅ Ruben Martinez-Cantin ⋅ Ana Murillo

Finding information in hour-long videos is a challenging task even for top-performing Vision Language Models (VLMs), as encoding visual content quickly exceeds available context windows. To tackle this challenge, we present FALCONEye, a novel video agent based on a training-free, model-agnostic meta-architecture composed of a VLM and a Large Language Model (LLM). FALCONEye answers open-ended questions using an exploration-based search algorithm guided by calibrated confidence from the VLM’s answers. We also introduce FALCON-Bench, a benchmark that extends the question answering problem to Video Answer Search, requiring models to return both the answer and its supporting temporal window for open-ended questions in hour-long videos. With just a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents on FALCON-Bench. It further demonstrates its generalization capability on the MLVU benchmark with shorter videos and different tasks, surpassing GPT-4o on single-detail tasks while slashing inference cost by roughly an order of magnitude.


129
AUTOCORRELATION-BASED FIDUCIAL MARKERS FOR TRACEABILITY

BENCHEIKH ISMAIL ⋅ Max Dunitz ⋅ Marie d'Autume ⋅ Marc Pic ⋅ Enric Meinhardt-Llopis ⋅ Gabriele Facciolo ⋅ Pablo Musé

Classical approaches to the rectification of a single image of a product, without stereo correspondences, require spatial landmarks. These landmarks are often conspicuous as they are generally built from high-contrast elementary shapes that can be detected with simple algorithms. To rectify complex deformations or ensure the robustness of homography rectification to landmark occlusion or tampering, one can use chessboard patterns of markers with elements that break quadrilateral symmetry, such as the three eyes of a QR code. However, these marker boards are even more conspicuous than a single marker. Motivated by traceability applications, which require stealth and robust fiducial markers that can rectify complex deformations, we introduce self-rectifying textures. These stealth textures place fiducial markers in the image autocorrelation. In this way, arbitrary crops of the texture can be rectified using only these spatially invariant statistical properties. Affine transformations of an image correspond to linear transformations of the autocorrelation, without phase component. Exploiting this fact, self-rectifying textures enable local estimation of the linear component of a planar deformation by identifying landmarks in the autocorrelation image, such as peaks, whose location in the untransformed texture is known. The translation component can be recovered independently via the phase correlation. A rectifying map, modulo translations, can also be fit directly to local observations of the differential of the deformation, without access to the rectified texture or need for phase correlation. Self-rectifying textures can be used for communication, watermarking, authentication, surface identification, calibration, and geometry processing.
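
The property exploited above, that translation drops out of the autocorrelation while the linear part of a transformation carries over, can be checked on a small integer texture. This is a discrete sketch with periodic (circular) boundaries and only two simple transformations, a shift and a transpose. The helper names and the toy texture are assumptions for illustration and do not reproduce the authors' textures or rectification procedure.

```python
from typing import List

Image = List[List[int]]


def circular_autocorrelation(img: Image) -> Image:
    """r[dy][dx] = sum over pixels of img[y][x] * img[y + dy][x + dx], wrapped."""
    height, width = len(img), len(img[0])
    return [
        [
            sum(
                img[y][x] * img[(y + dy) % height][(x + dx) % width]
                for y in range(height)
                for x in range(width)
            )
            for dx in range(width)
        ]
        for dy in range(height)
    ]


def circular_shift(img: Image, shift_y: int, shift_x: int) -> Image:
    """Translate the image with wrap-around."""
    height, width = len(img), len(img[0])
    return [
        [img[(y - shift_y) % height][(x - shift_x) % width] for x in range(width)]
        for y in range(height)
    ]


def transpose(img: Image) -> Image:
    """Swap the two image axes (a simple linear transformation)."""
    return [list(row) for row in zip(*img)]


texture = [[(7 * y + 3 * x * x + x * y) % 5 for x in range(6)] for y in range(6)]
reference = circular_autocorrelation(texture)
shift_invariant = circular_autocorrelation(circular_shift(texture, 2, 5)) == reference
transpose_equivariant = (
    circular_autocorrelation(transpose(texture)) == transpose(reference)
)
```

Both checks hold exactly because the texture is integer-valued: the shifted texture has the same autocorrelation, and transposing the texture transposes the autocorrelation.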


130
GrounDiff: Diffusion-Based Ground Surface Generation from Digital Surface Models

Oussema Dhaouadi ⋅ Johannes Meier ⋅ Jacques Kaiser ⋅ Daniel Cremers

Digital Terrain Models (DTMs) represent the bare-earth elevation and are important in numerous geospatial applications. Such data models cannot be directly measured by sensors and are typically generated from Digital Surface Models (DSMs) derived from LiDAR or photogrammetry. Traditional filtering approaches rely on manually tuned parameters, while learning-based methods require well-designed architectures, often combined with post-processing. To address these challenges, we introduce Ground Diffusion (GrounDiff), the first diffusion-based framework that iteratively removes non-ground structures by formulating the problem as a denoising task. We incorporate a gated design with confidence-guided generation that enables selective filtering. To increase scalability, we further propose Prior-Guided Stitching (PrioStitch), which employs a downsampled global prior automatically generated using GrounDiff to guide local high-resolution predictions. We evaluate our method on the DSM-to-DTM translation task across diverse datasets, showing that GrounDiff consistently outperforms deep learning-based state-of-the-art methods, reducing RMSE by up to 93% on ALS2DTM and up to 47% on USGS benchmarks. In the task of road reconstruction, which requires both high precision and smoothness, our method achieves up to 81% lower distance error compared to specialized techniques on the GeRoD benchmark, while maintaining competitive surface smoothness using only DSM inputs, without task-specific optimization. Our variant for road reconstruction, GrounDiff+, is specifically designed to produce even smoother surfaces, further surpassing state-of-the-art methods.

Machine learning is evolving towards high-order models that necessitate pre-training on extensive datasets, a process associated with significant overheads. Traditional models, despite having pre-trained weights, are becoming obsolete due to architectural differences that obstruct the effective transfer and initialization of these weights. To address these challenges, we introduce a novel framework, QuadraNet V2, which leverages quadratic neural networks to create efficient and sustainable high-order learning models. Our method initializes the primary term of the quadratic neuron using a standard neural network, while the quadratic term is employed to adaptively enhance the learning of data non-linearity or shifts. This integration of pre-trained primary terms with quadratic terms, which possess advanced modeling capabilities, significantly augments the information characterization capacity of the high-order network. By utilizing existing pre-trained weights, QuadraNet V2 reduces the required GPU hours for training by 90% to 98.4% compared to training from scratch, demonstrating both efficiency and effectiveness.
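
The idea of reusing pre-trained weights as the primary term of a quadratic neuron can be illustrated with a small sketch. The quadratic form (w_a . x)(w_b . x) + w . x + b and the initialization scheme below are assumptions chosen for illustration, not the definition used by QuadraNet V2; the class name `QuadraticNeuron` is hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


class QuadraticNeuron:
    """y = (w_a . x) * (w_b . x) + w_linear . x + bias (assumed form)."""

    def __init__(self, pretrained_weights, pretrained_bias):
        # Primary (linear) term copied from an existing pre-trained neuron.
        self.w_linear = list(pretrained_weights)
        self.bias = pretrained_bias
        # Only one quadratic factor is zeroed: the product starts at zero,
        # yet the gradient with respect to w_b, (w_a . x) * x, is non-zero.
        self.w_a = [1.0] * len(pretrained_weights)
        self.w_b = [0.0] * len(pretrained_weights)

    def forward(self, x):
        primary = dot(self.w_linear, x) + self.bias
        quadratic = dot(self.w_a, x) * dot(self.w_b, x)
        return primary + quadratic


# Hypothetical pre-trained linear neuron and input.
neuron = QuadraticNeuron(pretrained_weights=[0.5, -1.0], pretrained_bias=0.25)
x = [2.0, 1.0]
output_at_init = neuron.forward(x)      # equals the pre-trained linear output

neuron.w_b = [0.0, 1.0]                 # pretend training moved the quadratic term
output_after_update = neuron.forward(x)
```

Under these assumptions the network starts from the pre-trained behaviour and only departs from it as the quadratic weights are learned.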


132
AutoSew: A Geometric Approach to Stitching Prediction with Graph Neural Networks

Pablo Ríos ⋅ Elena Garces ⋅ Jorge Lopez-Moreno

Automating garment assembly from sewing patterns remains a significant challenge due to the lack of standardized annotation protocols and the frequent absence of semantic cues. Existing methods often rely on panel labels or handcrafted heuristics, which limit their applicability to real-world, non-conforming patterns. We present AutoSew, a fully automatic, geometry-based approach for predicting stitch correspondences directly from 2D pattern contours. AutoSew formulates the problem as a graph matching task, leveraging a Graph Neural Network to capture local and global geometric context, and employing a differentiable optimal transport solver to infer stitching relationships, including multi-edge connections. To support this task, we update the GarmentCodeData dataset, modifying over 18k patterns with realistic multi-edge annotations that reflect industrial assembly scenarios. AutoSew achieves a 96% F1-score and successfully assembles 73.3% of test garments without error, outperforming existing methods while relying solely on geometric input. Our results demonstrate that geometry alone can robustly guide stitching prediction, enabling scalable garment assembly without manual input.

Continuous eye tracking is critical for applications in human-computer interaction, including biometric authentication, gaze-based systems, and affective-cognitive modeling. Recent interest in neuromorphic event cameras has grown due to their sub-microsecond latency in capturing eye movement dynamics. However, existing event-based eye-tracking methods face challenges such as limited labels, sub-optimal event accumulation, and a lack of frameworks that capture fine-grained temporal relationships within event volumes. To address these, we propose a directed acyclic graph-based semi-supervised approach with a framework that is adaptable across multiple closely related tasks, including gaze tracking, pupil tracking, and eye-based emotion recognition. Our approach enables efficient spatiotemporal learning with a 95.5% parameter reduction compared to existing methods, achieving significant performance improvements: a 38.75% improvement in pupil tracking accuracy, 68% and 63% reductions in gaze angle error on the EV-Eye and EBV-Eye datasets respectively, and a 3.3% improvement in emotion recognition across all evaluated datasets.


134
Federated Model Synchronization for Diagnostic Redefinition through a Novel Selective Parameter Unlearning

Mayank Kundalwal Kundalwal ⋅ Mamta Mamta ⋅ Deepak Mishra ⋅ Asif Ekbal

Federated learning (FL) allows multiple medical institutions to collaboratively train machine learning models without sharing sensitive patient data, preserving privacy. However, as medical guidelines and disease classifications change over time, existing models can become outdated and may need updates to stay relevant. We propose a novel approach that efficiently updates federated models by selectively removing outdated knowledge without requiring full retraining. Our approach uses gradient-based Shapley value approximations to identify and modify the most important model parameters linked to obsolete diagnostic categories. This enables precise unlearning of outdated information while preserving performance on current diagnoses. We validate our method on the PathMNIST and COVID-19 Radiography datasets, showing that it can effectively eliminate specific diagnostic classes with minimal loss in accuracy for relevant conditions. Our method only requires a single communication round among clients and offers better control than previous techniques by targeting individual parameters instead of whole channels. This makes it especially useful for keeping federated medical models aligned with evolving medical knowledge.
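
The selection step described in the abstract above, scoring individual parameters and modifying the ones most tied to an obsolete class, can be sketched as follows. This is a simplified stand-in under stated assumptions, not the authors' method: it uses a single linear logit w . x, scores each weight by the first-order "gradient times parameter" attribution |w_j * x_j| averaged over samples, and resets the top-k weights to zero. The function names, weights, and samples are hypothetical.

```python
def parameter_importance(weights, samples):
    """Mean |w_j * x_j| over samples for a linear logit w . x.

    For a linear logit, d(logit)/d(w_j) = x_j, so this is the
    gradient-times-parameter attribution of each weight.
    """
    n = len(samples)
    return [
        sum(abs(w * x[j]) for x in samples) / n
        for j, w in enumerate(weights)
    ]


def unlearn_top_k(weights, importance, k):
    """Zero the k weights with the highest importance, keep the rest."""
    ranked = sorted(range(len(weights)), key=lambda j: importance[j], reverse=True)
    to_reset = set(ranked[:k])
    return [0.0 if j in to_reset else w for j, w in enumerate(weights)]


# Hypothetical weights feeding the logit of an obsolete diagnostic class,
# and two hypothetical feature vectors from that class.
obsolete_class_weights = [2.0, -0.1, 0.5, -3.0]
samples = [[1.0, 1.0, 0.0, 1.0], [1.0, 0.0, 2.0, 0.5]]

importance = parameter_importance(obsolete_class_weights, samples)
updated_weights = unlearn_top_k(obsolete_class_weights, importance, k=2)
```

Operating on individual weights rather than whole channels mirrors the finer granularity the abstract claims, although a real system would also have to check accuracy on the classes that remain.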

Single-image super-resolution has seen significant performance improvements thanks to advances in deep learning models. However, most deep learning models are scale-specific and support only integer scale factors. Separate scale-specific networks require large amounts of memory during deployment. A single model for arbitrary-scale image super-resolution is therefore a long-standing demand. Unlike existing solutions based on implicit representation functions, we propose a fully convolutional arbitrary-scale upscaling module. Our proposed module has fewer parameters and consumes less memory and inference time than existing ones. As it is based on a simple convolutional neural network, it can be plugged into other networks for arbitrary-scale transformation. We also show that the proposed upscaling module can be extended to super-resolution under homographic transformation. We perform extensive experiments on widely used benchmark datasets, and the results show that our upscaling module performs comparably to recently developed approaches while offering the added benefits of being simple and computationally inexpensive.