Poster Session
Poster Session 6 + Refreshments
Cycle-consistent Multi-graph Matching for Self-supervised Annotation of C. Elegans
Sebastian Stricker ⋅ Christoph Karg ⋅ Lisa Hutschenreiter ⋅ Bogdan Savchynskyy ⋅ Dagmar Kainmueller
In this work we present a novel approach for unsupervised multi-graph matching, which applies to problems for which a Gaussian distribution of keypoint features can be assumed. We leverage cycle consistency as loss for self-supervised learning, and determine Gaussian parameters through Bayesian Optimization, yielding a highly efficient approach that scales to large datasets. Our fully unsupervised approach enables us to reach the accuracy of state-of-the-art supervised methodology for the biomedical use case of semantic cell annotation in 3D microscopy images of the worm C. elegans. To this end, our approach yields the first unsupervised atlas of C. elegans, i.e. a model of the joint distribution of all of its cell nuclei, without the need for any ground truth cell annotation. This advancement enables highly efficient semantic annotation of cells in large microscopy datasets, overcoming a current key bottleneck. Beyond C. elegans, our approach offers fully unsupervised construction of cell-level atlases for any model organism with a stereotyped body plan down to the level of unique semantic cell labels, and thus bears the potential to catalyze respective biomedical studies in a range of further species.
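The cycle-consistency idea named in the abstract above can be illustrated with a minimal sketch. It assumes each pairwise matching is a permutation stored as an index list; the function names and the toy data are hypothetical and are not taken from the authors' implementation, which additionally fits Gaussian parameters via Bayesian Optimization.

```python
from itertools import permutations


def compose(p_ab, p_bc):
    """Chain two matchings given as index lists: keypoint i in a -> b -> c."""
    return [p_bc[j] for j in p_ab]


def cycle_inconsistency(matchings, n_graphs):
    """Fraction of keypoints where a->b->c disagrees with the direct a->c
    matching, pooled over all ordered triples of distinct graphs."""
    errors, total = 0, 0
    for a, b, c in permutations(range(n_graphs), 3):
        via_b = compose(matchings[(a, b)], matchings[(b, c)])
        direct = matchings[(a, c)]
        errors += sum(x != y for x, y in zip(via_b, direct))
        total += len(direct)
    return errors / total


def matching_from_labels(labels_a, labels_b):
    """Build the matching a->b implied by shared (hypothetical) cell labels."""
    index_in_b = {label: j for j, label in enumerate(labels_b)}
    return [index_in_b[label] for label in labels_a]


# Toy data: three graphs observing the same four labelled keypoints in
# different orders. Matchings derived from shared labels are cycle consistent.
labels = {0: [0, 1, 2, 3], 1: [2, 0, 3, 1], 2: [3, 2, 1, 0]}
consistent = {
    (a, b): matching_from_labels(labels[a], labels[b])
    for a in labels for b in labels if a != b
}
print(cycle_inconsistency(consistent, 3))
```

A score of zero means every composed matching agrees with the direct one; any corrupted pairwise matching raises the score without requiring ground-truth labels, which is what makes the quantity usable as a self-supervised signal.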
Automated Suturing Skill Assessment in Robot-assisted Surgery from Endoscopic Videos using Clinically-guided Evaluation Criteria
Atharva Deo ⋅ Ujjwal Pasupulety ⋅ Nicholas Matsumoto ⋅ Jay Moran ⋅ Cherine Yang ⋅ Jeanine Kim ⋅ Rafal Kocielnik ⋅ Aurash Naser-Tavakolian ⋅ Andrew Hung
Surgery continues to be perceived as an art, where proficiency is primarily achieved through years of experience. Artificial Intelligence research has yielded insight into the performance of expert surgeons and their associations with patient outcomes. Clinician expertise has led to the development of systematic assessments for fundamental skills (e.g., End-to-End Assessment of Suturing Expertise [EASE]) that contribute to positive outcomes. However, evaluating these skills requires manual expert review of endoscopic videos and is prone to inconsistencies between human raters. In this work, we present AutoEASE, the first end-to-end pipeline to automatically assess suturing performance from raw endoscopic video data using EASE rubrics. Our system utilizes a Mixture of Expert models (MoE): Multiscale Vision Transformers and 3D convolutional neural networks trained on Robot-assisted Radical Prostatectomy videos with over 13,000 data points. For a given stitch clip, the MoE pipeline first determines each phase (needle handling, driving, withdrawal) of a continuous stitch and predicts a binary score (fail / ideal) for seven sub-skills based on rubrics defined in EASE. AutoEASE achieves 0.98 AUC while detecting each phase. For EASE score prediction, the complete end-to-end pipeline attains $\geq$ 0.77 AUC in sub-skills associated with needle handling and driving. The promising performance of AutoEASE at the individual stitch level demonstrates the feasibility of developing more sophisticated assessment and reporting tools for complete surgical procedures objectively and at scale.
Deep Image Decomposition for Medical Imaging Anonymization and Curation
Yael Elkin ⋅ Gal Arie ⋅ Tammy Raviv Raviv
Medical scans often include patient identifiers and clinical annotations that must be removed prior to data sharing or use in downstream model training. With machine learning now central to clinical imaging analysis, reliable removal of such non-imaging artifacts is essential for preserving patient privacy, reducing bias, and improving data quality. However, this crucial curation step is frequently overlooked or addressed heuristically. We present a deep learning framework that automatically detects and removes overlaid text, markers, and other non-imaging elements from clinical scans while restoring the underlying image content. The model comprises two components: a detection module that localizes non-imaging regions, and a dual-generator architecture for unsupervised image decomposition, where one generator reconstructs the imaging content and the other produces the non-imaging components. Unlike conventional inpainting, our method bypasses explicit segmentation by leveraging explainable AI (XAI) maps from the detection module to guide artifact masking and restoration. We demonstrate robust curation performance on three datasets, one MRI and two ultrasound, for both public and private sources. Results show high visual quality (Turing-test validated) and strong quantitative scores (SSIM, PSNR, FID). Importantly, training downstream classification and segmentation models with scans curated by our method substantially improves results compared to models trained on data containing overlaid annotations. In fact, our performance on various metrics (e.g., accuracy, F1 score, IoU, and Dice) is comparable to those obtained with clean, marker-free training data. Code is included with the submission. Our private dataset will be released upon acceptance.
Intraoperative 2D/3D Registration via Spherical Similarity Learning and Differentiable Levenberg-Marquardt Optimization
Minheng Chen ⋅ Youyong Kong
Intraoperative 2D/3D registration aligns preoperative 3D volumes with real-time 2D radiographs, enabling accurate overlay of additional auxiliary anatomical information that is not visible in intraoperative imaging onto the surgical scene. This provides precise localization of instruments and implants, enhancing surgical accuracy and safety. A recently proposed fully differentiable similarity learning framework, which enables neural networks to approximate the geodesic distance between two poses on the manifold in SE(3), has garnered considerable attention. It greatly increases the capture range of registration and mitigates the effects of substantial disturbances on registration. However, existing methods approximate the Riemannian manifold within Euclidean space, leading to an inaccurate portrayal of the manifold's local structure and a lengthy convergence process. To address the above limitations, we explore similarity learning on non-Euclidean spherical feature spaces to improve the ability to capture and fit complex manifold features. We extract feature embeddings using a CNN-Transformer encoder, project them into spherical space, and approximate their geodesic distances with Riemannian geodesic distances in the bi-invariant SO(4) space. This enables the learning of a more expressive and geometrically consistent deep similarity metric, enhancing the network’s ability to distinguish subtle pose differences. Fully differentiable Levenberg-Marquardt optimization is adopted to replace the existing gradient descent method to accelerate the convergence of the search during the inference phase. Extensive experiments and ablation studies on real and synthetic datasets demonstrate that our approach achieves superior registration accuracy in both patient-specific and patient-agnostic scenarios.
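To make the notion of a geodesic distance between poses more concrete, the sketch below computes the standard geodesic angle between two rotations on SO(3). This is only the rotational part of a pose; it is an illustrative assumption and does not reproduce the SE(3) similarity or the SO(4) spherical embedding described in the abstract above. The helper `rot_z` is hypothetical.

```python
import math


def rot_z(theta):
    """3x3 rotation matrix about the z-axis by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def geodesic_angle(r1, r2):
    """Geodesic distance on SO(3): the rotation angle of R1^T R2.

    trace(R1^T R2) equals the elementwise (Frobenius) inner product of
    the two matrices, and the angle is arccos((trace - 1) / 2).
    """
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


print(geodesic_angle(rot_z(0.3), rot_z(1.0)))
```

For two rotations about the same axis the result is simply the difference of the angles, which gives a quick sanity check of the formula.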
ACuRE: Accurate Continuity-Regularized SpO2 Estimation Using Liquid Time-Constant Networks
Shahzad Ahmad ⋅ DR. MISHRA ⋅ Sania Bano ⋅ Sukalpa Chanda ⋅ Yogesh Rawat
Blood oxygen saturation (SpO$_2$) is a vital measure of respiratory and circulatory health, essential for detecting hypoxemia in conditions like chronic obstructive pulmonary disease and heart failure. Current non-contact SpO$_2$ estimation methods using remote photoplethysmography (rPPG) struggle with motion artifacts, illumination variability, and limited temporal modeling, hindering their practical use. We propose ACuRE, a novel framework that integrates a two-branch 3D-ResNet-18 for AC/DC signal separation, Liquid Time-Constant (LTC) networks for continuous-time dynamics, and a physics-informed partial differential equation (PDE) loss based on mass conservation. ACuRE overcomes these challenges by isolating pulsatile (AC) and baseline (DC) signals for enhanced robustness, using LTC networks to capture nonlinear physiological dynamics, and applying PDE regularization to ensure signal continuity. This achieves a significant reduction in mean absolute error compared to baselines, with strong performance under motion and illumination stress. Evaluated across multiple datasets, ACuRE demonstrates robust accuracy and generalization, offering a scalable solution for video-based health monitoring in telemedicine and low-resource settings.
CAST: Evaluating Multi-Object Trackers with Context-Aware Switch and Transfer Scores
Jin Bai ⋅ Gregory Hager
Multi-object tracking (MOT) has been a subject of intensive research for decades. Multiple standard datasets and benchmarks have been set up, along with several evaluation metrics, such as MOTA, IDF1 and HOTA. These metrics have become the de facto standard for comparing and ranking trackers on standardized datasets to measure progress. In this paper, we focus on MOTA and HOTA, and present a study of cases where these metrics' behaviors may not be desirable. In addition, we demonstrate how they might not be ideal when used as a tool to inspect a tracker's failure cases. We point out that these issues are related to the sizes of the context windows in which they measure association quality, where MOTA is too nearsighted while HOTA can be too holistic depending on the task settings. In this paper, we rethink the familiar notion of identity switches (IDSw) proposed in MOTA, and propose a generalized version of it by introducing a context window when evaluating the ID assignment choice for each detection. We show that the proposed metric named CAST mitigates the limitations of HOTA and MOTA, and demonstrate its usefulness when diagnosing model failures. Our code and toolkit will be available for the community to advance both the development and application of MOT.
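As background for the identity-switch notion the abstract above builds on, here is a minimal sketch of MOTA-style IDSw counting for a single ground-truth track. It assumes the per-frame matching between ground truth and predictions has already been done; the function name is hypothetical, and the sketch does not implement CAST itself.

```python
def count_id_switches(assigned_ids):
    """Count identity switches for one ground-truth track.

    assigned_ids holds the predicted track ID matched to the ground-truth
    object in each frame, or None when the object is unmatched. A switch
    is counted whenever the matched ID differs from the last matched ID.
    """
    switches = 0
    last_id = None
    for pred_id in assigned_ids:
        if pred_id is None:
            continue  # unmatched frames do not change the remembered ID
        if last_id is not None and pred_id != last_id:
            switches += 1
        last_id = pred_id
    return switches


# A one-frame glitch costs two switches; a permanent change costs one.
print(count_id_switches([1, 1, 2, 1, 1]))
print(count_id_switches([1, 1, 2, 2, 2]))
```

Because each decision only compares against the most recent matched ID, a brief glitch is penalized more than a permanent identity change, which is one way to read the "nearsighted" behavior the abstract attributes to MOTA.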
Advancing Player Identification and Tracking with Global ID Fusion (GIF)
Karol Wojtulewicz ⋅ Minxing Liu ⋅ Niklas Carlsson
Rapid player motion, occlusions, substitutions, and jersey changes pose major challenges for identity-consistent tracking in sports. Existing multi-object tracking (MOT) methods struggle in long-term and multi-perspective settings like broadcast footage, where views change frequently. To address this, we first introduce MuPNIT, the first MOT and ReID benchmark to capture these multi-perspective dynamics along with long-term appearance variations across seasons, teams, and jersey changes. Second, we propose Global ID Fusion (GIF), a novel context-aware tracking-with-identification framework that enables robust tracking of both seen and unseen players. Unlike prior approaches, GIF performs single-pass global ID association and supports zero-shot identity recognition. Our approach achieves state-of-the-art results, improving HOTA by 25.3% and IDF1 by 79.5% over OC-SORT. Finally, to assess identity consistency, we introduce five Global ID metrics that reveal tradeoffs in tracking stability. By bridging MOT and ReID, our work advances identity-aware player tracking in sports and sets a new benchmark with applications in sports analytics, surveillance, and long-term person search.
Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs
SAINITHIN ARTHAM ⋅ Avijit Dasgupta ⋅ Shankar Gangisetty ⋅ Jawahar CV
Predicting a driver’s intent (e.g., turns, lane changes) is a critical capability for modern Advanced Driver Assistance Systems (ADAS). While recent Multimodal Large Language Models (MLLMs) show promise in general vision-language tasks, we find that zero-shot MLLMs still lag behind domain-specific approaches for Driver Intention Prediction (DIP). To address this, we introduce DriveXplain, a zero-shot framework based on MLLMs that leverages rich visual cues such as optical flow and road semantics to automatically generate both intention maneuver (what) and rich natural language explanations (why). These maneuver-explanation pairs are then distilled into a compact MLLM, which jointly learns to predict intentions and corresponding explanations. We show that incorporating explanations during training leads to substantial gains over models trained solely on labels, as distilling explanations instills reasoning capabilities by enabling the model to understand not only what decisions to make but also why those decisions are made. Comprehensive experiments across structured (Brain4Cars, AIDE) and unstructured (DAAD) datasets demonstrate that our approach achieves state-of-the-art results in the DIP task, outperforming zero-shot and domain-specific baselines. We also present ablation studies to evaluate key design choices in our framework. This work sets a direction for more explainable and generalizable intention prediction in autonomous driving systems. We plan to release our codebase to support research.
LASER: Lip Landmark Assisted Speaker Detection for Robustness
Le Thien Phuc Nguyen ⋅ Zhuoran Yu ⋅ Yong Jae Lee
Active Speaker Detection (ASD) aims to identify who is speaking in complex visual scenes. While humans naturally rely on lip-audio synchronization, existing ASD models often misclassify non-speaking instances when lip movements and audio are unsynchronized. To address this, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER), which explicitly incorporates lip landmarks during training to guide the model’s attention to speech-relevant regions. Given a face track, LASER extracts visual features and encodes 2D lip landmarks into dense maps. To handle failure cases such as low resolution or occlusion, we introduce an auxiliary consistency loss that aligns lip-aware and face-only predictions, removing the need for landmark detectors at test time. LASER outperforms state-of-the-art models across both in-domain and out-of-domain benchmarks. To further evaluate robustness in realistic conditions, we introduce LASER-bench, a curated dataset of modern video clips with varying levels of background noise. On the high-noise subset, LASER improves mAP by 3.3 and 4.3 points over LoCoNet and TalkNet, respectively, demonstrating strong resilience to real-world acoustic challenges.
VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
Ying Cheng ⋅ Yu-Ho Lin ⋅ Min-Hung Chen ⋅ Fu-En Yang ⋅ Shang-Hong Lai
Video anomaly understanding (VAU) aims to provide detailed interpretation and semantic comprehension of anomalous events within videos, addressing limitations of traditional methods that focus solely on detecting and localizing anomalies. However, existing approaches often neglect the deeper causal relationships and interactions between objects, which are critical for understanding anomalous behaviors. In this paper, we propose VADER, an LLM-driven framework for Video Anomaly unDErstanding, which integrates keyframe object Relation features with visual cues to enhance anomaly comprehension from video. Specifically, VADER first applies an Anomaly Scorer to assign per-frame anomaly scores, followed by a Context-AwarE Sampling (CAES) strategy to capture the causal context of each anomalous event. A Relation Feature Extractor and a COntrastive Relation Encoder (CORE) jointly model dynamic object interactions, producing compact relational representations for downstream reasoning. These visual and relational cues are integrated with LLMs to generate detailed, causally grounded descriptions and support robust anomaly-related question answering. Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks, advancing the frontier of explainable video anomaly analysis.
Diffusion models have emerged as the leading approach for style transfer, yet they struggle with photo-realistic transfers, often producing painting-like results or missing detailed stylistic elements. Current methods inadequately address unwanted influence from original content styles and style reference content features. We introduce SCAdapter, a novel technique leveraging CLIP image space to effectively separate and integrate content and style features. Our key innovation systematically extracts pure content from content images and style elements from style references, ensuring authentic transfers. This approach is enhanced through three components: Controllable Style Adaptive Instance Normalization (CSAdaIN) for precise multi-style blending, KVS Injection for targeted style integration, and a style transfer consistency objective maintaining process coherence. Comprehensive experiments demonstrate SCAdapter significantly outperforms state-of-the-art methods in both conventional and diffusion-based baselines. By eliminating DDIM inversion and inference-stage optimization, our method achieves at least 2x faster inference than other diffusion-based approaches, making it both more effective and efficient for practical applications.
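For readers unfamiliar with the building block that CSAdaIN extends, the following is a minimal sketch of standard Adaptive Instance Normalization (AdaIN) on a single feature channel. It assumes plain AdaIN as commonly defined, not the controllable multi-style variant in the abstract above, whose details are not given here; the function name and toy values are hypothetical.

```python
import statistics


def adain(content, style, eps=1e-5):
    """Standard AdaIN on one channel: normalize the content features to
    zero mean and unit variance, then rescale them to the style channel's
    standard deviation and shift them to its mean."""
    mu_c = statistics.fmean(content)
    mu_s = statistics.fmean(style)
    sd_c = statistics.pstdev(content)
    sd_s = statistics.pstdev(style)
    return [sd_s * (x - mu_c) / (sd_c + eps) + mu_s for x in content]


content_channel = [1.0, 2.0, 3.0, 4.0]
style_channel = [10.0, 20.0, 30.0, 40.0, 50.0]
stylized = adain(content_channel, style_channel)
print(statistics.fmean(stylized), statistics.pstdev(stylized))
```

The output keeps the relative ordering of the content activations while adopting the first- and second-order statistics of the style channel, which is why AdaIN is often described as transferring style through feature statistics.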
T2LF: LLM-Guided Multimodal Diffusion for Text-to-Light Field Synthesis
Soyoung Yoon ⋅ Namhyuk Ahn ⋅ In Kyu Park
We present a novel text-driven approach for light field (LF) synthesis. Existing methods typically generate LFs from given images, requiring users to find reference images, which makes it difficult to construct the desired scene directly and limits scene diversity. Moreover, existing methods are mainly designed for limited baselines from training datasets, making it difficult to implement various viewpoint changes and consequently limiting the flexibility of motion. In contrast, our method directly synthesizes LFs from user-provided text descriptions by leveraging the scene understanding capabilities of a multi-modal large language model (LLM) and the generative power of a diffusion model. Given a text prompt describing the desired LF, the multimodal LLM extracts relevant information for LF synthesis, which then guides a diffusion model to produce diverse scenes and motions. This approach enables LF synthesis even with a pre-trained model not initially designed for this purpose, requiring only minimal fine-tuning. The proposed framework enables visually diverse LF synthesis with only text input. Experimental results demonstrate that the synthesized LFs exhibit geometric consistency and achieve advanced synthesis quality compared to existing methods.
VideoSketcher: A Training-Free Approach for Coherent Video Sketch Transfer
Huining Li ⋅ Bangzhen Liu ⋅ Rui Yang ⋅ Yang Zhou ⋅ Chenshu Xu ⋅ Xufang PANG ⋅ Shengfeng He
Generating high-quality sketches from video requires a nuanced understanding of semantic content and visual structure, particularly for complex scenes across diverse sketch styles. Efficient and flexible video-to-sketch style transformation remains a significant challenge. We introduce VideoSketcher, a training-free framework for style-controllable sketch video generation that preserves frame structure while applying specified sketch aesthetics. Leveraging text-to-image diffusion models, VideoSketcher utilizes strong semantic priors without the need for extensive training. Our approach enforces temporal consistency by retaining latent information across frames and employs a Time-Linked Attention mechanism to capture structural elements from the source video and inject stylistic information from the reference image. To bridge the semantic gap between sketches and original video content, we introduce Sketch Directive Amplification for selective transfer of stylistic features. Additionally, a Stroke Graph Regularization strategy, comprising line and point loss, refines line consistency in the latent space. Extensive experiments validate VideoSketcher's superior temporal stability and fidelity across diverse sketch styles and content. Video demos can be found in the supplementary materials.
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
Yan-Bo Lin ⋅ Kevin Lin ⋅ Zhengyuan Yang ⋅ Linjie Li ⋅ Jianfeng Wang ⋅ Chung-Ching Lin ⋅ Xiaofei Wang ⋅ Gedas Bertasius ⋅ Lijuan Wang
In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AVED-Bench, designed explicitly for zero-shot audio-video editing. AVED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AVED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AVED demonstrates superior results on both AVED-Bench and the recent OAVE dataset, validating its generalization capabilities.
SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis
Hou In Ivan Tam ⋅ Hou In Derek Pun ⋅ Austin Wang ⋅ Angel Chang ⋅ Manolis Savva
Despite recent advances in text-conditioned 3D indoor scene generation, there remain gaps in the evaluation of these methods. Existing metrics often measure realism by comparing generated scenes to a set of ground-truth scenes, but they overlook how well scenes follow the input text and capture implicit expectations of plausibility. We present SceneEval, an evaluation framework designed to address these limitations. SceneEval introduces fine-grained metrics for explicit user requirements—including object counts, attributes, and spatial relationships—and complementary metrics for implicit expectations such as support, collisions, and navigability. Together, these provide interpretable and comprehensive assessments of scene quality. To ground evaluation, we curate SceneEval-500, a benchmark of 500 text descriptions with detailed annotations of expected scene properties. This dataset establishes a common reference for reproducible and systematic comparison across scene generation methods. We evaluate six recent scene generation approaches using SceneEval and demonstrate its ability to provide detailed assessments of the generated scenes, highlighting strengths and areas for improvement across multiple dimensions. Our results identify significant gaps in current methods, underscoring the need for further research toward practical and controllable scene synthesis.
IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers
Gihwan Kim ⋅ Jemin Lee ⋅ Hyungshin Kim
Previous Quantization-Aware Training (QAT) methods for Vision Transformers rely on expensive retraining to recover accuracy loss in non-linear layer quantization, limiting their use in resource-constrained environments. In contrast, existing Post-Training Quantization (PTQ) methods either partially quantize non-linear functions or adjust activation distributions to maintain accuracy but fail to achieve fully integer-only inference. In this paper, we introduce IPTQ-ViT, a PTQ framework for fully integer-only Vision Transformers without retraining. We present novel approximation functions: a polynomial-based GELU optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy in PTQ. In addition, we propose a unified metric integrating quantization sensitivity, perturbation, and computational cost to select the optimal approximation function per activation layer. IPTQ-ViT outperforms previous PTQ methods, achieving up to 6.44%p (avg. 1.78%p) top-1 accuracy improvement for image classification and 1.0 mAP for object detection. IPTQ-ViT is the first fully integer-only PTQ method for Vision Transformers, surpassing partially integer-based PTQ methods in both W8A8 and W4A8 quantization and achieving comparable accuracy to QAT methods.
MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data
Siarhei Sheludzko ⋅ Dhimitrios Duka ⋅ Bernt Schiele ⋅ Hilde Kühne ⋅ Anna Kukleva
Contrastive learning has become a fundamental approach in both uni-modal and multi-modal frameworks. This learning paradigm pulls positive pairs of samples closer while pushing negatives apart. In the uni-modal setting (e.g., image-based learning), previous research has shown that the strength of these forces can be controlled through the temperature parameter. In this work, we propose Multi-Modal Temperature and Margin Schedules, extending the concept of uni-modal temperature scheduling to multi-modal contrastive learning. Our method dynamically adjusts the temperature in the contrastive loss during training, modulating the attraction and repulsion forces in the multi-modal setting. Additionally, recognizing that standard multi-modal datasets often follow imbalanced, long-tail distributions, we adapt the temperature based on the local distribution of each training sample. Specifically, samples from dense clusters are assigned a higher temperature to better preserve their semantic structure. Furthermore, we demonstrate that temperature scheduling can be effectively integrated within a max-margin framework, thereby unifying the two predominant approaches in multi-modal contrastive learning: InfoNCE loss and max-margin objective. We evaluate our approach on four widely used image- and video-language datasets, Flickr30K, MSCOCO, EPIC-KITCHENS-100, and YouCook2, and show that our dynamic temperature and margin schedules improve performance and lead to new state-of-the-art results in the field.
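The role of the temperature in the contrastive loss can be seen in a small sketch of InfoNCE for a single anchor. The cosine-shaped schedule below is an assumption chosen for illustration; the abstract above does not specify the schedule's form, and the function names, period, and bounds are hypothetical.

```python
import math


def info_nce(similarities, positive_index, temperature):
    """InfoNCE loss for one anchor: negative log-softmax of the positive
    pair's similarity after dividing all similarities by the temperature."""
    logits = [s / temperature for s in similarities]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[positive_index]


def cosine_temperature(step, period, t_min, t_max):
    """Hypothetical schedule: oscillates between t_max (at step 0) and
    t_min (at half a period) along a cosine curve."""
    phase = 2.0 * math.pi * step / period
    return t_min + 0.5 * (t_max - t_min) * (1.0 + math.cos(phase))


sims = [0.9, 0.5, 0.1]  # similarity of the anchor to one positive, two negatives
for step in (0, 25, 50):
    tau = cosine_temperature(step, period=100, t_min=0.1, t_max=1.0)
    print(step, round(tau, 3), round(info_nce(sims, 0, tau), 4))
```

A lower temperature sharpens the softmax, so the same similarities yield a smaller loss when the positive already ranks first and a much larger one when it does not; scheduling the temperature therefore changes how strongly negatives are pushed away over the course of training.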
Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training
Kaixuan Lu ⋅ Mehmet Onurcan Kaya ⋅ Dim Papadopoulos
Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $\text{AP}_{50}$ on YouTubeVIS-2019 $\texttt{val}$ set, surpassing the previous state-of-the-art VideoCutLER by 4.4%, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. We will release the code and models upon acceptance.
Locally Explaining Prediction Behavior via Gradual Interventions and Measuring Property Gradients
Niklas Penzel ⋅ Joachim Denzler
Deep learning models achieve high predictive performance but lack intrinsic interpretability, hindering our understanding of the learned prediction behavior. Existing local explainability methods focus on associations, neglecting the causal drivers of model predictions. Other approaches adopt a causal perspective but primarily provide global, model-level explanations. However, for specific inputs, it's unclear whether globally identified factors apply locally. To address this limitation, we introduce a novel framework for local interventional explanations by leveraging recent advances in image-to-image editing models. Our approach performs gradual interventions on semantic properties to quantify the corresponding impact on a model's predictions using a novel score, the expected property gradient magnitude. We demonstrate the effectiveness of our approach through an extensive empirical evaluation on a wide range of architectures and tasks. First, we validate it in a synthetic scenario and demonstrate its ability to locally identify biases. Afterward, we apply our approach to investigate medical skin lesion classifiers, analyze network training dynamics, and study a pre-trained CLIP model with real-life interventional data. Our results highlight the potential of interventional explanations on the property level to reveal new insights into the behavior of deep models.
Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone
Tristan Amadei ⋅ Enric Meinhardt-Llopis ⋅ Benedicte Bascle ⋅ Corentin ABGRALL ⋅ Gabriele Facciolo
Image-based localization in GNSS-denied environments is critical for UAV autonomy. State-of-the-art methods typically achieve this by matching onboard aerial images to a database of geo-referenced satellite images. However, these methods are fundamentally data-hungry, requiring large-scale, paired satellite and UAV imagery for training, which is often expensive and impractical. To address this, we propose a novel training paradigm that eliminates the need for any UAV data during training by learning to localize from satellite-view reference images alone. This is enabled by a data augmentation strategy that simulates the challenging visual domain shift from satellite to real-world UAV imagery. We introduce CAEVL, an efficient model designed to leverage this paradigm. To validate our approach, we release ViLD, a new, challenging dataset of real-world UAV images.
Where is the Watermark? Interpretable Watermark Detection at the Block Level
Maria Bulychev ⋅ Neil Grant Marchant ⋅ Benjamin Rubinstein
Recent advances in generative AI have enabled the creation of highly realistic digital content, raising concerns around authenticity, ownership, and misuse. While watermarking has become an increasingly important mechanism to trace and protect digital media, most existing image watermarking schemes operate as black boxes, producing global detection scores without offering any insight into how or where the watermark is present. This lack of transparency impacts user trust and makes it difficult to interpret the impact of tampering. In this paper, we present a post-hoc image watermarking method that combines localised embedding with region-level interpretability. Our approach embeds watermark signals in the discrete wavelet transform domain using a statistical block-wise strategy. This allows us to generate detection maps that reveal which regions of an image are likely watermarked or altered. We show that our method achieves strong robustness against common image transformations while remaining sensitive to semantic manipulations. At the same time, the watermark remains highly imperceptible. Compared to prior post-hoc methods, our approach offers more interpretable detection while retaining competitive robustness. For example, our watermarks are robust to cropping up to half the image.
PointSt3R: Point Tracking through 3D Grounded Correspondence
Rhodri Guerrier ⋅ Adam Harley ⋅ Dima Damen
Recent advances in foundational 3D reconstruction models, such as DUSt3R and MASt3R, have shown great potential in 2D and 3D correspondence in static scenes. In this paper we propose to adapt them for the task of point tracking through 3D grounded correspondence. We first demonstrate that these models are competitive point trackers when focusing on static points, present in current point tracking benchmarks ($+34.3\%$ on EgoPoints static vs. CoTracker2). As these models are trained exclusively on static correspondence data, we propose to combine the reconstruction loss with training for dynamic correspondence, fine-tuning MASt3R using a relatively small amount of dynamic synthetic data. Specifically, we achieve competitive 2D point tracking results on a number of datasets (e.g. 71.0 $\delta_{avg}$ on TAP-Vid-DAVIS compared to 75.7 for CoTracker2) without any temporal knowledge. Furthermore, we also show that 3D tracking can actually be improved on TAP-Vid-3D PStudio ($+1.9\%$ when compared against CoTracker3+ZoeDepth).
Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation
Daniel Kienzle ⋅ Katja Ludwig ⋅ Julian Lorenz ⋅ Shin'ichi Satoh ⋅ Rainer Lienhart
Obtaining the precise 3D motion of a table tennis ball from standard monocular videos is a challenging problem, as existing methods trained on synthetic data struggle to generalize to the noisy, imperfect ball and table detections of the real world. This is primarily due to the inherent lack of 3D ground truth trajectories and spin annotations for real-world video. To overcome this, we propose a novel two-stage pipeline that divides the problem into a front-end perception task and a back-end 2D-to-3D uplifting task. This separation allows us to train the front-end components with abundant 2D supervision from our newly created TTHQ dataset, while the back-end uplifting network is trained exclusively on physically-correct synthetic data. We specifically re-engineer the uplifting model to be robust to common real-world artifacts, such as missing detections and varying frame rates. By integrating a ball detector and a table keypoint detector, our approach transforms a proof-of-concept uplifting method into a practical, robust, and high-performing end-to-end application for 3D table tennis trajectory and spin analysis.
FairVLM: Enhancing Fairness and Prompt Sensitivity in Vision Language Models for Medical Image Segmentation
Md Motiur Rahman ⋅ Saeka Rahman ⋅ Smriti Bhatt ⋅ Miad Faezipour
Vision-language models (VLMs) have demonstrated substantial promise in medical image segmentation by utilizing radiology reports as prompts to segment regions of interest. However, VLM deployment in clinical settings is challenged by two intertwined issues: i) demographic bias, where performance varies across demographic groups, and ii) prompt sensitivity, where semantically similar prompts yield inconsistent outputs. These challenges are interconnected; demographic underrepresentation can worsen a model’s sensitivity to prompts, and prompt instability can more heavily affect certain demographic groups. In this study, we present FairVLM, a unified framework that addresses both demographic disparity and prompt sensitivity in VLMs. FairVLM integrates three key components: (1) Semantic-Retaining Counterfactual Prompting (SRCP), which generates clinically consistent and diverse prompt variations via large language models; (2) Demographic-Aware Feature Normalization (DAFN), a lightweight module that mitigates latent representation bias across demographic groups; and (3) a Fairness-Calibrated Loss (FCL) that explicitly penalizes performance disparities while encouraging prompt consistency. Extensive evaluations on the Harvard-FairSeg dataset show that FairVLM significantly improves equity-scaled segmentation. It also reduces demographic disparity (DI) by over 65\% and relative performance gap (RPG) by over 60\%, while maintaining or boosting overall accuracy. FairVLM is robust to prompt changes, with less than 0.5\% performance drop across varied prompts, and also generalizes well on unseen datasets. These findings present FairVLM as a new state-of-the-art, robust, and adaptable framework for a fair, prompt-invariant vision-language model. Code and data are available at https://github.com/.../FairVLM.
SHaSaM: Submodular Hard Sample Mining for Fair Facial Attribute Recognition
Anay Majee ⋅ Rishabh Iyer
Deep neural networks often inherit social and demographic biases from annotated data during model training, leading to unfair predictions, especially in the presence of sensitive attributes such as race, age, and gender. Existing methods fall prey to the inherent data imbalance between attribute groups and inadvertently emphasize sensitive attributes, worsening unfairness and performance. To surmount these challenges, we propose SHaSaM (Submodular Hard Sample Mining), a novel combinatorial approach that models fairness-driven representation learning as a submodular hard-sample mining problem. Our two-stage approach comprises SHaSaM-MINE, which introduces a submodular subset selection strategy to mine hard positives and negatives and thereby mitigates data imbalance, and SHaSaM-LEARN, which introduces a family of combinatorial loss functions based on Submodular Conditional Mutual Information to maximize the decision boundary between target classes while minimizing the influence of sensitive attributes. This unified formulation restricts the model from learning features tied to sensitive attributes, significantly enhancing fairness without sacrificing performance. Experiments on CelebA and UTKFace demonstrate that SHaSaM achieves state-of-the-art results, with up to 2.7 points improvement in model fairness (Equalized Odds) and a 3.5% gain in Accuracy, with ~2.5x faster convergence compared to existing methods.
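The Equalized Odds figure quoted above measures how far a classifier's true-positive and false-positive rates differ between sensitive-attribute groups. The abstract does not state which variant of the metric is used, so the sketch below assumes one common form: the larger of the two rate gaps between exactly two groups. The function names and the toy data are hypothetical.

```python
def positive_rate(preds, labels, groups, group, label):
    """Estimate P(pred = 1 | true label = label, sensitive group = group)."""
    selected = [p for p, y, a in zip(preds, labels, groups) if a == group and y == label]
    if not selected:
        raise ValueError("no samples for this group/label combination")
    return sum(selected) / len(selected)

def equalized_odds_gap(preds, labels, groups):
    """Larger of the TPR gap and the FPR gap between two groups (0 means perfectly equalized)."""
    group_ids = sorted(set(groups))
    if len(group_ids) != 2:
        raise ValueError("this sketch handles exactly two groups")
    g0, g1 = group_ids
    tpr_gap = abs(positive_rate(preds, labels, groups, g0, 1)
                  - positive_rate(preds, labels, groups, g1, 1))
    fpr_gap = abs(positive_rate(preds, labels, groups, g0, 0)
                  - positive_rate(preds, labels, groups, g1, 0))
    return max(tpr_gap, fpr_gap)

# Toy binary predictions: group 0 is classified perfectly, group 1 at chance level.
preds = [1, 1, 0, 0, 1, 0, 0, 1]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
gap = equalized_odds_gap(preds, labels, groups)
```

In this toy case both the true-positive and the false-positive rates differ by 0.5 between the groups, so the gap is 0.5; a fairer model would drive it toward zero.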
DM3Net: Dual-Camera Super-Resolution via Domain Modulation and Multi-scale Matching
Cong Guan ⋅ Jiacheng Ying ⋅ Osamu Yoshie ⋅ Yuya Ieiri
Dual-camera super-resolution is highly practical for smartphone photography, where the wide-angle image is super-resolved using the telephoto image as a reference. In this paper, we propose DM$^3$Net, a novel dual-camera super-resolution network based on Domain Modulation and Multi-scale Matching. To bridge the domain gap between the high-resolution domain and the degraded domain, we learn two compressed global representations from image pairs corresponding to the two domains. To enable reliable transfer of high-frequency structural details from the reference image, we design a multi-scale matching module that conducts patch-level feature matching and retrieval across multiple receptive fields to improve matching accuracy and robustness. We also introduce Key Pruning, which significantly reduces memory usage and inference time with little loss in model performance. Experimental results on three real-world datasets demonstrate that DM$^3$Net outperforms state-of-the-art approaches.
SuperRivolution: Fine-Scale Rivers from Coarse Temporal Satellite Imagery
Rangel Daroya ⋅ Subhransu Maji
Satellite missions provide valuable optical data for monitoring rivers at diverse spatial and temporal scales. However, accessibility remains a challenge: high-resolution imagery is ideal for fine-grained monitoring but is typically scarce and expensive compared to low-resolution imagery. To address this gap, we introduce SuperRivolution, a framework that improves river segmentation resolution by leveraging information from time series of low-resolution satellite images. We contribute a new benchmark dataset of 9,810 low-resolution temporal images paired with high-resolution labels from an existing river monitoring dataset. Using this benchmark, we investigate multiple strategies for river segmentation, including ensembling single-image models, applying image super-resolution, and developing end-to-end models trained on temporal sequences. SuperRivolution significantly outperforms single-image methods and baseline temporal approaches, narrowing the gap with supervised high-resolution models. For example, the F1 score for river segmentation improves from 60.9% to 80.5%, while the state-of-the-art model operating on high-resolution images achieves 94.1%. Similar improvements are also observed in river width estimation tasks. Our results highlight the potential of publicly available low-resolution satellite archives for fine-scale river monitoring.
Exemplar-free class-incremental learning (EFCIL) aims to retain old knowledge acquired in previous tasks while learning new classes, without storing previous images due to storage constraints or privacy concerns. In EFCIL, the plasticity-stability dilemma (learning new tasks versus catastrophic forgetting) is a significant challenge, primarily due to the unavailability of images from earlier tasks. In this paper, we introduce adversarial pseudo-replay (APR), a method that perturbs the images of the new task with an adversarial attack to synthesize pseudo-replay images online without storing any replay samples. During new-task training, the adversarial attack is conducted on the new task images with augmented old class mean prototypes as targets, and the resulting images are used for knowledge distillation to prevent semantic drift. Moreover, we calibrate the covariance matrices to compensate for the semantic drift after each task by learning a transfer matrix on the pseudo-replay samples. Our method reconciles stability and plasticity, achieving state-of-the-art performance on both cold-start and warm-start settings of the standard EFCIL benchmarks.
DiffRegCD: Integrated Registration and Change Detection with Diffusion Features
Seyedehanita Madani ⋅ Rama Chellappa ⋅ Vishal Patel
Change detection (CD) is critical in computer vision and remote sensing, with applications in monitoring, disaster response, and urban analysis. Most CD models assume co-registered inputs, but real imagery often suffers from parallax, viewpoint shifts, or long temporal gaps, leading to severe misalignment. Conventional register-then-detect pipelines and recent joint frameworks (e.g., BiFA, ChangeRD) remain limited: they rely on regression-only flow, global homographies, or synthetic perturbations that fail under large displacements. We propose DiffRegCD, an integrated framework that couples dense registration and change detection. DiffRegCD reformulates correspondence as a Gaussian-smoothed classification task, delivering sub-pixel accuracy and stable training. It builds on frozen multi-scale features from a pretrained denoising diffusion model, which provide invariance to viewpoint and illumination variation. Supervision is enabled by controlled affine perturbations applied to standard CD datasets, yielding paired ground truth for both flow and change detection without pseudo-labels. Experiments on aerial (LEVIR-CD, DSIFN-CD, WHU-CD, SYSU-CD) and ground-level (VL-CMU-CD) datasets show that DiffRegCD outperforms recent baselines and remains robust under wide temporal and viewpoint variation, establishing diffusion features and classification-based correspondence as a strong foundation for integrated CD.
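The idea of treating correspondence as a Gaussian-smoothed classification task can be illustrated in one dimension. The sketch below is an assumption-laden toy, not the DiffRegCD formulation, whose bin layout, dimensionality and loss are not given in the abstract. It discretizes a displacement into integer bins, builds a Gaussian soft target around the true sub-pixel displacement, trains against it with cross-entropy, and decodes a continuous estimate as the expected bin value. All function names, the bin range and `sigma` are hypothetical.

```python
import math

def gaussian_target(true_disp, bins, sigma=1.0):
    """Soft classification target: normalized Gaussian weights centred on the true displacement."""
    weights = [math.exp(-0.5 * ((b - true_disp) / sigma) ** 2) for b in bins]
    total = sum(weights)
    return [w / total for w in weights]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(target, logits):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def decode(probs, bins):
    """Sub-pixel estimate: expected bin value under the predicted distribution."""
    return sum(p * b for p, b in zip(probs, bins))

bins = list(range(-8, 9))               # integer displacement bins (assumed range)
target = gaussian_target(2.3, bins)     # true displacement lies between two bins
decoded = decode(target, bins)          # a perfect prediction recovers roughly 2.3

perfect_logits = [math.log(t) for t in target]
uniform_logits = [0.0] * len(bins)
loss_perfect = cross_entropy(target, perfect_logits)
loss_uniform = cross_entropy(target, uniform_logits)
```

Because the target spreads probability over neighbouring bins, the expected value can land between integer bins, which is one way a classification head can yield sub-pixel estimates.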
HDR Reconstruction Boosting with Training-Free and Exposure-Consistent Diffusion
Yo-Tin Lin ⋅ Sykai Chen ⋅ Hou-Ning Hu ⋅ Yen-Yu Lin ⋅ Yu-Lun Liu
Single LDR to HDR reconstruction remains challenging for over-exposed regions where traditional methods often fail due to complete information loss. We present a training-free approach that enhances existing indirect HDR reconstruction methods through diffusion-based inpainting. Our method combines text-guided diffusion models with SDEdit refinement to generate plausible content in over-exposed areas while maintaining consistency across multi-exposure LDR images. Unlike previous approaches requiring extensive training, our method seamlessly integrates with existing indirect HDR reconstruction techniques through an iterative compensation mechanism that ensures luminance coherence across multiple exposures. We demonstrate significant improvements in both perceptual quality and quantitative metrics on standard HDR datasets and in-the-wild captures. Results show that our method effectively recovers natural details in challenging scenarios while preserving the advantages of existing HDR reconstruction pipelines.
Optimizing against Infeasible Inclusions from Data for Semantic Segmentation through Morphology
Shamik Basu ⋅ Luc Van Gool ⋅ Christos Sakaridis
State-of-the-art semantic segmentation models are typically optimized in a data-driven fashion, minimizing solely per-pixel or per-segment classification objectives on their training data. This purely data-driven paradigm often leads to absurd segmentations, especially when the domain of input images is shifted from the one encountered during training. For instance, state-of-the-art models may assign the label "road" to a segment that is included in another segment labeled "sky", even though the ground truth of the dataset at hand dictates that such an inclusion is not feasible. Our method, Infeasible Semantic Inclusions (InSeIn), first extracts explicit inclusion constraints that govern spatial class relations from the semantic segmentation training set at hand in an offline, data-driven fashion, and then enforces a morphological yet differentiable loss that penalizes violations of these constraints during training to promote prediction feasibility. InSeIn is a light-weight plug-and-play method, constitutes a novel step towards minimizing infeasible semantic inclusions in the predictions of learned segmentation models, and yields consistent and significant performance improvements over diverse state-of-the-art networks across the ADE20K, Cityscapes, and ACDC datasets. Code and models will be made publicly available.
3D Cell Oversegmentation Correction via Geo-Wasserstein Divergence
Peter Chen ⋅ Bryan Chang ⋅ Olivia Creasey ⋅ Julie Sneddon ⋅ Zev Gartner ⋅ Yining Liu
3D cell segmentation methods are often hindered by oversegmentation, where a single cell is incorrectly split into multiple fragments. This degrades the final segmentation quality and is notoriously difficult to resolve, as oversegmentation errors often resemble natural gaps between adjacent cells. Our work makes two key contributions. First, for 3D cell segmentation, we are the first to formulate oversegmentation as a concrete problem and propose a geometric framework to identify and correct these errors. Our approach builds a pre-trained classifier using both 2D geometric and 3D topological features extracted from flawed 3D segmentation results. Second, we introduce a novel metric, Geo-Wasserstein divergence, to quantify changes in 2D geometries. This captures the evolving trends of cell mask shape in a geometry-aware manner. We validate our method through extensive experiments on in-domain plant datasets, including both synthesized and real oversegmented cases, as well as on out-of-domain animal datasets to demonstrate transfer learning performance. An ablation study further highlights the contribution of the Geo-Wasserstein divergence. A clear pipeline is provided for end-users to build pre-trained models for any labeled dataset.
TopoRec: Point Cloud Recognition Using Topological Data Analysis
Anirban Ghosh ⋅ Iliya Kulbaka ⋅ Ian Dahlin ⋅ Ayan Dutta
Point cloud-based object/place recognition remains a problem of interest in applications such as autonomous driving, scene reconstruction, and localization. Extracting a meaningful global descriptor from a query point cloud that can be matched with the descriptors of the database point clouds is a challenging problem, and it becomes harder still when the query point cloud is noisy or has been transformed (e.g., rotated). To this end, we propose a novel methodology, named TopoRec, which utilizes Topological Data Analysis (TDA) for extracting local descriptors from a point cloud, thereby eliminating the need for resource-intensive GPU-based machine learning training. More specifically, we use the ATOL vectorization method to generate vectors for point clouds. To test the quality of the proposed TopoRec technique, we evaluate it on multiple real-world (e.g., Oxford RobotCar, NCLT) and realistic (e.g., ShapeNet) point cloud datasets for large-scale place and object recognition, respectively. Unlike existing learning-based approaches such as PointNetVLAD and PCAN, our method does not require extensive training, making it easily adaptable to new environments. Despite this, it consistently outperforms both state-of-the-art learning-based and handcrafted baselines (e.g., M2DP, ScanContext) on standard benchmark datasets, demonstrating superior accuracy and strong generalization.
MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping
Vineet Bhat ⋅ Naman Patel ⋅ Prashanth Krishnamurthy ⋅ Ramesh Karri ⋅ Farshad Khorrami
Robotic manipulation of unseen objects via natural language commands remains challenging. Language-driven robotic grasping (LDRG) predicts stable grasp poses from natural language queries and RGB-D images. We propose MapleGrasp, a novel framework that leverages mask-guided feature pooling for efficient vision-language driven grasping. Our two-stage training first predicts segmentation masks from CLIP-based vision-language features. The second stage pools features within these masks to generate pixel-level grasp predictions, improving efficiency and reducing computation. Incorporating mask pooling results in a 7% improvement over prior approaches on the OCID-VLG benchmark. Furthermore, we introduce RefGraspNet, an open-source dataset eight times larger than existing alternatives, significantly enhancing model generalization for open-vocabulary grasping. MapleGrasp achieves a strong grasping accuracy of 89% against competing methods on the RefGraspNet benchmark. Our method achieves comparable performance to larger Vision-Language-Action models on the LIBERO benchmark, and shows significantly better generalization to unseen tasks. Real-world experiments on a Franka arm demonstrate a 73% success rate with unseen objects, surpassing competitive baselines by 11%. Code will be released after publication.
Remote Sensing Forestry Similarity Convolution
Shikuan Wang ⋅ Yuangong Chen ⋅ Jianzhou Gong ⋅ Lingyi Meng ⋅ Mengquan Wu ⋅ Longxing Liu ⋅ Haiwei Yuan ⋅ Guo Mingbin
Recent advancements in convolutional neural networks (CNNs) have significantly propelled the field of remote sensing forestry mapping. However, traditional convolution operations exhibit inherent limitations in extracting complex forest features: their fixed receptive fields struggle to accommodate multi-scale forest attributes, and their insufficient focus on background information impairs the overall feature representation. To address these challenges, we propose Similar Convolution (SimConv), which introduces dynamic convolution kernel size selection by modeling feature relationships. SimConv adaptively adjusts the receptive field based on the semantic relevance of input features, enhancing the capture of forestry background information and improving the distinction between target features. Building upon this, we introduce SIMNet, a feature extraction network that integrates SimConv at its core. Experimental results across multiple remote sensing datasets demonstrate that SIMNet outperforms existing methods in terms of feature extraction accuracy.
RampWatch: An In-the-Wild Dataset and Text-Guided Detection Framework for Recreational Vessels
Malik Muhammad Asim ⋅ Claire Smallwood ⋅ Abdullah Tariq ⋅ Johnny Lo ⋅ Syed Zulqarnain Gilani
Detecting small, recreational vessels in coastal environments remains a persistent challenge due to complex backgrounds, dynamic lighting conditions, and the scarcity of annotated data for non-commercial maritime traffic. Despite their socio-economic significance, recreational boats are underrepresented in existing datasets and are poorly detected by standard object detectors, particularly in open-vocabulary scenarios. To address this gap, we present RampWatch, an in-the-wild dataset curated from surveillance footage at multiple boat ramps. RampWatch provides instance annotations across seven categories of recreational vessels, captured under diverse weather, lighting, and occlusion conditions. To benchmark detection in this domain, we introduce YOLO-TG, a novel detection framework that augments YOLOv11 with a text encoder for open-vocabulary recognition and a self-attention module for enhanced spatial reasoning. YOLO-TG adopts a dual-stream design: visual features are extracted via a hierarchical YOLO backbone, while semantic embeddings from natural language prompts are encoded by a frozen language encoder. These modalities are fused via lightweight cross-modal attention, enabling text-guided detection without retraining. YOLO-TG achieves a 12% relative improvement in mAP@50–95 over strong YOLOv11 baselines on RampWatch, and demonstrates robust cross-domain generalization, with gains of +22% on the Singapore Maritime Dataset and +4.3% on the Split Port Ship Classification Dataset. These results highlight the effectiveness of cross-modal grounding and domain-specific datasets for advancing open-world maritime surveillance.
Enhancing Reverse Distillation with Core Exemplar Learning for Unified Multi-Class Anomaly Detection
Heechul Lim ⋅ Min-Soo Kim ⋅ Hyun-Boo Lee ⋅ Suk-Ju Kang ⋅ Kang-Wook Chon ⋅ Haeyun Lee
In electronics manufacturing, anomaly detection methods face significant challenges due to class distribution imbalance and training instability when handling multiple classes simultaneously under varying imaging conditions. To address these challenges, we propose Reverse Distillation with Core-Exemplar Learning (RDCEL), a unified anomaly detection framework incorporating domain adaptation and novel metric learning strategies. RDCEL integrates unsupervised domain adaptation to align the covariance of the source and target domains and uses soft label-based coreset learning to handle diverse class distributions. It also leverages a coreset repulsion loss to minimize redundancy among coreset representations, fostering a more stable and dispersed embedding space across multiple classes. By aligning spatial statistics across different classes, RDCEL effectively addresses inter-class discrepancies, enabling consistent anomaly scoring under a unified test setting. Extensive experiments show RDCEL significantly outperforms state-of-the-art methods on MVTec AD and VisA datasets, achieving superior accuracy, stable AUROC performance, and faster convergence.
Leveraging Sparsity for Privacy in Collaborative Inference
Maximilian Hoefler ⋅ Karsten Mueller ⋅ Wojciech Samek
Collaborative inference (CI) is hampered by high communication costs and privacy risks, with existing defenses often forcing a trade-off between efficiency and formal privacy guarantees. In this work, we present a framework that leverages activation sparsity as a dual-purpose mechanism to address both challenges simultaneously. Our approach uses a lightweight Sparse Autoencoder (SAE) to learn a sparse representation, which is then protected by a novel two-channel noise mechanism grounded in information theory. This design provides a tunable privacy budget while remaining computationally inexpensive. Evaluations on CIFAR-10, Tiny-ImageNet, and FaceScrub show that our method achieves a state-of-the-art privacy-utility trade-off, sustaining high accuracy at sparsity levels of up to 97\%, while offering superior resilience against strong model inversion attacks. Our results underline that sparsity can be transformed from an effective compression tool into a powerful and theoretically-grounded privacy defense, paving the way for more practical and trustworthy CI systems. We provide code at https://github.com/an7123/privacy_ci.
Improvise, Adapt, Overcome — Telescopic Adapters for Efficient fine-tuning of Vision Language Models in Medical Imaging
Ujjwal Mishra ⋅ Vinita Shukla ⋅ Praful Hambarde ⋅ Amit Shukla
Adapting Vision Language Segmentation Models (VLSMs) to medical imaging domains requires significant computational overhead when using conventional fine-tuning approaches. Existing Parameter-Efficient Fine-Tuning (PEFT) methods apply uniform adapter dimensions across all transformer layers, leading to suboptimal parameter allocation and reduced adaptation efficiency. We introduce Telescopic Adapters, a novel PEFT framework that employs depth-aware scaling to progressively increase adapter capacity from shallow to deep transformer layers. Our method integrates lightweight bottleneck modules within CLIPSeg's vision and text encoders, with adapter dimensions dynamically scaled based on layer depth and semantic relevance. Using only 613k trainable parameters (244× fewer than end-to-end fine-tuning), Telescopic Adapters achieve superior performance across five diverse medical datasets spanning polyp segmentation, skin lesion detection, and breast ultrasound imaging. Comprehensive ablation studies demonstrate that deeper layers require substantially more adaptation capacity than shallow layers, validating our telescopic scaling hypothesis. Our approach establishes a new paradigm for efficient medical VLSM fine-tuning, enabling deployment in resource-constrained clinical environments while maintaining competitive segmentation accuracy.
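To make the depth-aware scaling idea concrete, the sketch below assigns a bottleneck width to each transformer layer that grows from shallow to deep layers, then compares the resulting adapter parameter count against a uniform allocation. The abstract does not give the actual schedule, so the linear rule, the layer count, the hidden size and the widths used here are all hypothetical, and the "semantic relevance" factor mentioned above is not modelled.

```python
def telescopic_dims(num_layers, min_dim, max_dim):
    """Bottleneck width per layer, growing linearly from min_dim (shallow) to max_dim (deep)."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers == 1:
        return [max_dim]
    step = (max_dim - min_dim) / (num_layers - 1)
    return [round(min_dim + i * step) for i in range(num_layers)]

def adapter_params(hidden, bottleneck):
    """Parameters of one bottleneck adapter: down- and up-projection weights plus biases."""
    down = hidden * bottleneck + bottleneck
    up = bottleneck * hidden + hidden
    return down + up

# Assumed configuration: 12 layers, hidden size 768, widths from 8 to 64.
dims = telescopic_dims(12, 8, 64)
telescopic_total = sum(adapter_params(768, d) for d in dims)
uniform_total = 12 * adapter_params(768, 64)
```

Under these assumed numbers the telescopic allocation uses fewer parameters than giving every layer the largest width, while still concentrating capacity in the deeper layers.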
SVD-Det: A Lightweight Framework for Video Forgery Detection Using Semantic and Visual Defect Cues
Tsung-Shan Yang ⋅ Tianyu Zhang ⋅ Feng Qian ⋅ Bing Yan ⋅ Chung Chieh Kuo
With the rapid proliferation of AI-generated content (AIGC) on multimedia platforms, efficient and reliable video forgery detection has become increasingly important. Existing approaches often rely on either visual artifacts or semantic inconsistencies, but suffer from high computational costs, limiting their deployment at scale. In this work, we propose SVD-Det, a lightweight and efficient pipeline that leverages both semantic and Visual Defect cues to detect forged videos. SVD-Det fuses spatiotemporal representations from raw RGB frames and compression-induced distortions using a 3D-Swin Transformer, and augments semantic understanding via CLIP-based embeddings. To integrate these heterogeneous modalities, we introduce Domain-Query Attention (DoQA), a novel attention mechanism that hierarchically aggregates spatial and temporal features. Experiments across seven video generation domains demonstrate that SVD-Det not only achieves state-of-the-art detection performance but also reduces model size and inference time by over 97% and 98%, respectively, compared to LMM-based baselines. Our results highlight the practicality and robustness of SVD-Det for scalable AIGC detection in real-world scenarios.
Joint Optimization of Camera Model and Deep Neural Network for Image Recognition
Youta Noboru ⋅ Yuko Ozasa ⋅ Masayuki Tanaka
In this paper, we propose joint optimization of a camera model and a deep neural network (DNN) for image classification and object detection tasks. The camera model consists of an image sensor model, which is parameterized by the camera spectral sensitivity (CSS), and an image signal processing (ISP) model. We assume the camera model is composed of a three-sensor imager without a demosaicing process and an ISP with simple color correction and gamma correction. The DNN then classifies or detects objects. A key contribution of this paper is the joint optimization of not only the ISP model and DNN but also the image sensor model. For stable joint optimization, we have implemented a fully differentiable camera model, which allows the camera model and the DNN to be optimized jointly. Experimental comparisons with the flower and leaf datasets show that our approach outperforms existing approaches. Furthermore, we demonstrate that our approach is also effective for the object detection task. The source code will be made publicly available upon publication of this paper.
Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models
Masayuki Kawarada ⋅ Kosuke Yamada ⋅ Antonio Tejero-de-Pablos ⋅ Naoto Inoue
Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre); obtaining them has been challenging. Although recent vision foundation models, such as CLIP, offer a rich representation of the overall image, they are not amenable to focusing on the specified condition. In this paper, we propose a method to leverage a large vision-language model (LVLM) to generate conditional image embeddings: the DIOR framework. DIOR is a novel training-free approach that prompts the LVLM to describe an image with a single word related to the given condition. The hidden state vector of the LVLM's last token is directly extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or specialized prior knowledge. Comprehensive experimental results on conditional image similarity tasks demonstrate that DIOR outperforms existing training-free baselines, including CLIP. Furthermore, DIOR achieves superior performance compared to methods requiring additional training across multiple settings.
ReFineVQA: Iterative Refinement of Video Description via Feedback Generation for Video Question Answering
Jeongwan Shin ⋅ Chan Hur ⋅ Seongmin Cho ⋅ Jae-Ho Choi ⋅ Hyeyoung Park
Video question answering is a non-trivial task that demands joint understanding of visual contents and linguistic questions as well as temporal reasoning across video frames. Recent agent-based approaches address this by conducting multi-step reasoning with large language models (LLMs) across frame-level captions generated by vision-language models, but suffer from limited temporal coherence across frames. An alternative direction based on video language models (VideoLMs) directly captures temporal dynamics via video-level descriptions, but often lacks fine-grained visual cues due to a restricted number of input frames and a strong dependency on input prompts. To tackle these challenges, we propose ReFineVQA, a training-free framework that can easily be plugged into existing VideoLMs with iterative, LLM-guided description refinements. Specifically, the VideoLM produces an initial description, followed by LLM feedback determining whether the description suffices for the question and guiding further visual extraction, which in turn enhances the description quality while preserving temporal context. Plugged into state-of-the-art VideoLMs, ReFineVQA yields consistent gains across diverse benchmarks (NExT-QA, EgoSchema, Video-MME, ActivityNet, and StreamingBench), even with a small external LLM of 3.8B parameters.
MIST: Multilingual Incidental Dataset for Scene Text Detection
Saumya Vijay Mundra ⋅ Ajoy Mondal ⋅ Jawahar CV
Scene text detection has progressed rapidly, largely driven by curated datasets and benchmarks. However, many of these have reached evaluation saturation and are heavily biased toward focused scenes, limiting their effectiveness in real-world environments where detection is hindered by environmental factors. To address this, we introduce MIST, a Multilingual Incidental Scene Text dataset featuring diverse text instances across 11 languages. MIST provides language, transcription, legibility, and fine-grained polygon-shaped annotations across 12K scene images and 600K word-level text instances. Images are captured along roads using a GoPro mounted on a moving car to capture real-world complexities, ensuring the scenes are incidental rather than deliberately framed. MIST establishes a new challenging benchmark to enable robust evaluation of scene text detection methods in real-world scenarios. The datasets and code will be made publicly available.
Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios
Mingwei Tang ⋅ Jiahao Nie ⋅ Guang Yang ⋅ Ziqing Cui ⋅ Jie Li
Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision–language models, recent methods incorporate textual descriptions as auxiliary guidance to enhance fusion quality. However, simply incorporating coarse-grained descriptions hampers the understanding of fine-grained details and poses challenges for precise cross-modal alignment. To address these limitations, we propose Multi-grained Text-guided Image Fusion (MTIF), a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content, guiding image fusion through a hierarchical cross-modal modulation module. Second, it involves supervision signals at each granularity to facilitate alignment between visual and textual features and enhance the utility of auxiliary text. Third, it adopts a saliency-driven enrichment module to augment training data with dense semantic content, further strengthening the cross-modal modulation and alignment. Extensive experiments show that MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks.
MemeTAG: Keyword-Driven Meme Classification through Tag Embedding Reconstruction
Akshit Sharma ⋅ Prashant Patil
The proliferation of harmful internet memes poses a significant societal threat, yet their automated classification remains a formidable algorithmic challenge due to the nuanced, multimodal nature of their content. To address this, we introduce MemeTAG, a novel dual-objective framework that pioneers a keyword-aware approach to meme classification. Our core innovation is a two-part semantic guidance mechanism: first, we leverage a pretrained Vision-Language Model to generate a set of descriptive keywords that capture the high-level semantics. Second, we introduce the Aggregated Tag Inference Network (ATIN), an attention-based module that distills these keywords into a single, rich semantic embedding. This embedding serves as a target for a novel auxiliary reconstruction loss, which compels the model to learn deeply aligned visual and textual features. This approach, combined with an efficient three-stage training strategy, establishes a new state of the art on the HarMeme, Hateful Memes Challenge (HMC), and PrideMM datasets, decisively outperforming existing methods.
Leveraging Semantic Attribute Binding for Free-Lunch Color Control in Diffusion Models
Héctor Laria ⋅ Alexandra Gomez-Villa ⋅ Jiang Qin ⋅ Muhammad Atif Butt ⋅ Bogdan Raducanu ⋅ Javier Vazquez-Corral ⋅ Joost van de Weijer ⋅ Kai Wang
Recent advances in text-to-image (T2I) diffusion models have enabled remarkable control over various attributes, yet precise color specification remains a fundamental challenge. Existing approaches, such as ColorPeel, rely on model personalization, requiring additional optimization and limiting flexibility in specifying arbitrary colors. In this work, we introduce ColorWave, a novel training-free approach that achieves exact RGB-level color control in diffusion models without fine-tuning. By systematically analyzing the cross-attention mechanisms within IP-Adapter, we uncover an implicit binding between textual color descriptors and reference image features. Leveraging this insight, our method rewires these bindings to enforce precise color attribution while preserving the generative capabilities of pretrained models. Our approach maintains generation quality and diversity, outperforming prior methods in accuracy and applicability across diverse object categories. Through extensive evaluations, we demonstrate that ColorWave establishes a new paradigm for structured, color-consistent diffusion-based image synthesis.
MedPEFT-CL: Dual-Phase Parameter-Efficient Continual Learning with Medical Semantic Adapter and Bidirectional Memory Consolidation
ZIYUAN GAO ⋅ Philippe Morel
Medical vision-language segmentation models suffer from catastrophic forgetting when adapting to new anatomical structures, requiring complete retraining that limits their clinical deployment. Although continual learning has been studied for various applications, approaches designed specifically for medical vision-language tasks remain underexplored. We propose MedPEFT-CL, a parameter-efficient continual learning framework that addresses both efficient learning of new tasks and preservation of previous knowledge through a dual-phase architecture based on CLIPSeg. Our dual-phase architecture features an adaptive learning phase that employs semantic similarity-based adapter allocation and parameter-efficient fine-tuning for medical tasks through prompt similarity analysis, and a knowledge consolidation phase employing bidirectional Fisher-memory coordination. This creates a reinforcing cycle: consolidation directs replay priorities while new tasks provide challenging samples that improve retention strategies. Our key contributions are: (1) a semantic-driven adapter allocation mechanism that enables efficient learning of new medical tasks, (2) a bi-modal LoRA adaptation that significantly reduces trainable parameters while maintaining cross-modal learning, and (3) bidirectional Fisher-memory coordination that prevents catastrophic forgetting of previous medical tasks. Extensive experiments across diverse medical datasets demonstrate superior forgetting mitigation and performance retention with minimal parameter overhead, making the framework effective for continual learning in medical vision-language scenarios.
Splatter Layout: Geometry-embedded 3D Reconstruction via Surface Unfolding
Bryan Heryanto ⋅ Tackgeun You ⋅ Chanwoo Kim ⋅ Hwasup Lim
We propose Splatter Layout, a single-view feedforward 3D reconstruction method that jointly predicts Gaussian Splats, a mesh, and a point cloud in an unfolded surface layout aligned with the input image. Unlike prior approaches, which suffer from inconsistent surface representations, our unfolded layout extends reconstruction to invisible regions by predicting parameters from visible neighbors and placing them near adjacent counterparts. To achieve this, we supervise the pipeline using data generated from an unfolding network, ensuring bijective mappings and input-view alignment. Since the pipeline is layout-agnostic, it can be readily extended to diverse object categories, including humans and vehicles. Without modifying the underlying architecture, Splatter Layout substantially improves the geometric fidelity of Splatter Images. Its unfolded layout yields a coherent and interpretable feature organization, resolving prior inconsistencies while establishing dense correspondences with a template mesh for tasks such as animation and appearance editing.
HOLO: Holistic Lightweight Optimization for Scene Understanding with Auto-Annotation and Multimodal Learning
Xiaoyun Hu ⋅ Xiaohan Yan ⋅ Nan Wang ⋅ Gang Wei ⋅ Zhicheng Wang
Vision-language models (VLMs) have achieved remarkable success across various domains. However, their application to 3D scene understanding remains largely underexplored. Existing 3D VLMs predominantly focus on object-level tasks and often emphasize instance-centric representations within a scene, lacking holistic scene-level descriptions. In this work, we propose an automated annotation framework that leverages multi-view images to partition 3D scenes into localized point cloud sub-regions, which are then enriched with precise semantic information, all without any manual intervention. We processed ScanNet v2 and ScanNet++ to construct SceneCap, a large-scale dataset designed for scene-level description. To demonstrate the benefits of our framework for scene understanding, we introduce Natural Interactive Universal Multimodal Observer, NIUMO-LLM, a lightweight yet high-performing model that adapts vision-language capabilities for comprehensive 3D scene understanding through training on SceneCap. We further demonstrate that NIUMO-LLM achieves state-of-the-art (SOTA) performance on both scene description benchmarks and object-level tasks, requiring only 12 hours of training on a single NVIDIA A800 GPU. This design significantly reduces computational demands, lowering the barrier for MLLM-related research. For review purposes, we anonymously share representative samples at https://randomname432.github.io/HOLO/.
PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval
Osman Tursun ⋅ Sinan Kalkan ⋅ Simon Denman ⋅ Clinton Fookes
Zero-shot Composed Image Retrieval (ZS-CIR) enables image search using a reference image and text prompt without requiring specialized text-image composition networks trained on large-scale paired data. However, current ZS-CIR approaches face three critical limitations in their reliance on composed text embeddings: static query embedding representations, insufficient utilization of image embeddings, and suboptimal performance when fusing text and image embeddings. To address these challenges, we introduce the Prompt Directional Vector (PDV), a simple yet effective training-free enhancement that captures semantic modifications induced by user prompts. PDV enables three key improvements: (1) dynamic composed text embeddings where prompt adjustments are controllable via a scaling factor, (2) composed image embeddings through semantic transfer from text prompts to image features, and (3) weighted fusion of composed text and image embeddings that enhances retrieval by balancing visual and semantic similarity. Our approach serves as a plug-and-play enhancement for existing ZS-CIR methods with minimal computational overhead. Extensive experiments across multiple benchmarks demonstrate that PDV consistently improves retrieval performance when integrated with state-of-the-art ZS-CIR approaches, particularly for methods that generate accurate compositional embeddings. The code will be released upon publication.
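The three uses of PDV listed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the abstract does not define the PDV formula, so the sketch assumes it is the difference between the composed text embedding and the reference caption embedding. All function names and the parameters `alpha` (scaling factor) and `beta` (fusion weight) are hypothetical.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length so queries can be compared by cosine similarity."""
    return v / np.linalg.norm(v)

def prompt_directional_vector(base_text_emb, composed_text_emb):
    """Assumed PDV: the shift in text-embedding space induced by the user prompt."""
    return composed_text_emb - base_text_emb

def composed_text_embedding(base_text_emb, pdv, alpha=1.0):
    """(1) Dynamic text embedding: alpha scales how strongly the prompt is applied."""
    return l2_normalize(base_text_emb + alpha * pdv)

def composed_image_embedding(image_emb, pdv, alpha=1.0):
    """(2) Transfer the same semantic shift onto the reference image embedding."""
    return l2_normalize(image_emb + alpha * pdv)

def fused_query(text_query, image_query, beta=0.5):
    """(3) Weighted fusion of the composed text and composed image queries."""
    return l2_normalize(beta * text_query + (1.0 - beta) * image_query)
```

With `alpha=0` the composed text query falls back to the reference caption embedding, and with `alpha=1` it reproduces the original composed text embedding, which is one way to read "controllable via a scaling factor".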
GRAPE (Gaussian Rendering for Accelerated Pixel Enhancement) Brings Fast and Lightweight Arbitrary Super-Resolution
Jung In Jang ⋅ Kyong Hwan Jin
We present GRAPE (Gaussian Rendering for Accelerated Pixel Enhancement), a fast, lightweight method for arbitrary-scale super-resolution (ASSR) based on 2D Gaussian splatting. Lookup-table (LUT) schemes are limited to preset scale factors and struggle with varied textures, while implicit neural representations (INRs) are slow because they require per-coordinate queries; moreover, prior Gaussian-splatting approaches rely on heavy networks or complex processing. GRAPE overcomes these limitations with a compact design in which a single point-wise layer predicts anisotropic Gaussian parameters (RGB value, rotation, scale, and offset) and a differentiable rasterizer then renders the high-resolution image in one pass. The entire model, including both encoder and decoder, contains just 1.56 M parameters and requires only 1.10 GB of GPU memory, yet achieves 68.55 FPS at ×4 on Urban100, whose average image size is 984.51×797.81. This is more than 310× faster than GSASR, a 20.45 M-parameter model that runs at 0.22 FPS. Although GRAPE does not further improve perceptual fidelity over heavier networks, it remains competitively close, providing an attractive quality–efficiency trade-off across DIV2K, Set5, Set14, BSD100, and Urban100. Consequently, GRAPE is well suited to resource-limited deployments or interactive applications that require rapid screen updates. The source code will be made publicly available at https://github.com/username/GRAPE.
Fetal and Neonatal Cortical Surface Reconstruction with Anatomical Normal-guidance and Perceptual Enhancements
Jiyang Lee ⋅ Woori Bae ⋅ U-Geun Ji ⋅ Hanyeol Yang ⋅ Jong-Min Lee
Accurate reconstruction of cortical surfaces from fetal and neonatal brain MRI plays a fundamental role in neuroscience research and clinical applications. Despite recent advances in deep learning-based approaches, accurate fetal and neonatal surface reconstruction remains challenging due to low tissue contrast, narrow sulci, and complex folding patterns. We present ANPE (Anatomical Normal-guidance with Perceptual Enhancements), a simple yet effective framework that enhances cortical surface reconstruction networks in three ways. First, our method enforces anatomically plausible deformations by adaptively integrating normal vectors with velocity vectors to simultaneously capture global structure and fine-grained details, significantly improving geometric coherence. Second, we amplify structural transitions and tissue boundaries without requiring explicit segmentation or signed distance functions, eliminating the dependency on additional processing steps and making our method more efficient and widely applicable. Finally, we utilize a context-aware loss function that goes beyond traditional point-wise losses by integrating a pre-trained feature extractor to capture hierarchical contextual and structural similarities, blending surface-based and image-based features to ensure anatomically meaningful reconstructions. Our framework is shown to be efficient, more accurate, and biologically informed, and it presents a new baseline for fetal and neonatal cortical surface reconstruction.
View-aware Cross-modal Distillation for Multi-view Action Recognition
Trung Thanh Nguyen ⋅ Yasutomo Kawanishi ⋅ Vijay John ⋅ Takahiro Komamizu ⋅ Ichiro Ide
The widespread use of multi-sensor systems has increased research in multi-view action recognition. While existing approaches in multi-view setups with fully overlapping sensors benefit from consistent view coverage, partially overlapping settings where actions are visible in only a subset of views remain underexplored. This challenge becomes more severe in real-world scenarios, as many systems provide only limited input modalities and rely on sequence-level annotations instead of dense frame-level labels. In this study, we propose View-aware Cross-modal Knowledge Distillation (ViCoKD), a framework that distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. Moreover, we propose a View-aware Consistency module to address view misalignment, where the same action may appear differently or only partially across viewpoints. It enforces prediction alignment when the action is co-visible across views, guided by human-detection masks and confidence-weighted Jensen–Shannon divergence between their predicted class distributions. Experiments on the real-world MultiSensor-Home dataset show that ViCoKD consistently outperforms competitive distillation methods across multiple backbones and environments, delivering significant gains and surpassing the teacher model under limited conditions.
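As a rough sketch of the view-consistency idea described above, the snippet below computes a Jensen–Shannon divergence between the class distributions predicted from two views and applies it only when the action is co-visible. The specific confidence weight (product of the two maximum class probabilities) and the boolean `co_visible` flag are assumptions for illustration; the abstract does not specify how ViCoKD derives them from the human-detection masks.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two class distributions (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps)))
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps)))
    return 0.5 * (kl_pm + kl_qm)

def view_consistency_loss(p_a, p_b, co_visible):
    """Penalize disagreement between two views only when both see the action.

    The confidence weight used here is an illustrative assumption.
    """
    if not co_visible:
        return 0.0
    weight = float(np.max(p_a) * np.max(p_b))
    return weight * js_divergence(p_a, p_b)
```

The divergence is symmetric and bounded by ln 2, so a single pair of confidently disagreeing views cannot dominate the loss.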
NRGMark: Localized Watermarking for Energy Transparency in Images
Shruti Agarwal ⋅ Élie Michel ⋅ Vishal Asnani ⋅ Tania Mathern ⋅ John Collomosse
We present NRGMark, a region-based image watermarking framework to embed energy use and provenance metadata into images. Targeting composite graphic designs such as posters, NRGMark enables imperceptible watermarking of distinct visual elements each carrying independent metadata on their environmental impact, such as the energy consumption associated with generative AI (GenAI) use. NRGMark extends image watermark encoder-decoder models by incorporating an object localization network to detect and decode multiple watermarked regions within a document, even under image transformations and physical print–scan degradation. NRGMark interoperates with several watermarking techniques and the emerging C2PA open standard for media provenance to encode environmental impact metadata. We demonstrate NRGMark on both synthetic and real-world design layouts, illustrating its potential to support energy transparency in the age of GenAI.
Test-Time Consistency in Vision Language Models
Shih-Han Chou ⋅ Shivam Chandhok ⋅ James Little ⋅ Leonid Sigal
Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks, yet they often exhibit inconsistent behavior when faced with semantically equivalent inputs, undermining their reliability and robustness. Recent benchmarks, such as MM-R$^3$, highlight that even state-of-the-art VLMs can produce divergent responses across semantically equivalent inputs, despite maintaining high average accuracy. Prior work addresses this issue by modifying model architectures or conducting large-scale fine-tuning on curated datasets. In contrast, we propose a simple and effective test-time consistency framework that enhances semantic consistency without supervised re-training. Our method is entirely post-hoc, model-agnostic, and applicable to any VLM with access to its weights. Given a single test point, we enforce consistent predictions via two complementary objectives: (i) a Cross-Entropy Agreement Loss that aligns predictive distributions across semantically equivalent inputs, and (ii) a Pseudo-Label Consistency Loss that draws outputs toward a self-averaged consensus. Our method is plug-and-play and leverages information from the single test input itself to improve consistency. Experiments on the MM-R$^3$ benchmark show that our framework yields substantial gains in consistency across state-of-the-art models, establishing a new direction for inference-time adaptation in VLMs.
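The two objectives named above might be sketched as follows for a set of predicted answer distributions obtained from semantically equivalent inputs. The exact loss definitions are not given in the abstract, so both functions are assumptions: the agreement term is taken as the mean pairwise cross-entropy, and the pseudo-label is taken as the argmax of the averaged prediction.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_k p_k log q_k for two discrete distributions."""
    return float(-np.sum(p * np.log(q + eps)))

def agreement_loss(probs):
    """Mean pairwise cross-entropy between predictions for equivalent inputs."""
    n = len(probs)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum(cross_entropy(probs[i], probs[j]) for i, j in pairs) / len(pairs)

def pseudo_label_loss(probs, eps=1e-12):
    """Cross-entropy of each prediction against the self-averaged consensus label."""
    consensus = np.mean(probs, axis=0)   # average prediction over the variants
    label = int(np.argmax(consensus))    # assumed pseudo-label
    return float(np.mean([-np.log(p[label] + eps) for p in probs]))
```

Both terms shrink when the variants agree on a confident answer, which is the behavior a test-time consistency objective is meant to encourage.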
Revisiting an Old Perspective Projection for Monocular 3D Morphable Models Regression
Toby Chong ⋅ Ryota Nakajima
Fitting 3D morphable models to video is a key technique in content creation. In particular, regression-based approaches have produced fast and accurate results by matching the rendered output of the morphable model to the target image. These methods typically achieve stable performance by using orthographic projection, which removes ambiguity related to focal length and object distance. However, this simplification makes them unsuitable for close-up footage, such as that captured with head-mounted cameras. To address this limitation, we introduce a novel shrinkage parameter to the orthographic projection, enabling the incorporation of a pseudo-perspective effect while preserving the stability of the original projection. We present several techniques that allow this parameter to be integrated into existing orthographic methods with minimal changes through fine-tuning. We demonstrate the effectiveness of our modification through both quantitative and qualitative comparisons using a custom dataset recorded with head-mounted cameras.
General and Domain-Specific Zero-shot Detection of Generated Images via Conditional Likelihood
Roy Betser ⋅ Omer Hofman ⋅ Roman Vainshtein ⋅ Guy Gilboa
The rapid advancement of generative models, particularly diffusion-based methods, has significantly improved the realism of synthetic images. As new generative models continuously emerge, detecting generated images remains a critical challenge. While fully supervised and few-shot methods have been proposed, maintaining an updated dataset is time-consuming and challenging. Consequently, zero-shot methods have gained increasing attention in recent years. We find that existing zero-shot methods often struggle to adapt to specific image domains, such as artistic images, limiting their real-world applicability. In this work, we introduce CLIDE, a novel zero-shot detection method based on conditional likelihood approximation. Our approach computes likelihoods conditioned on real images, enabling adaptation across diverse image domains. We extensively evaluate CLIDE, demonstrating state-of-the-art performance on a large-scale general dataset and significantly outperforming existing methods in domain-specific cases. These results demonstrate the robustness of our method and underscore the need for broad, domain-aware generalization in the AI-generated image detection task.
Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation
Amirhossein Dadashzadeh ⋅ Parsa Esmati ⋅ Majid Mirmehdi
Recent advances in Source-Free Unsupervised Video Domain Adaptation (SFUVDA) leverage vision-language models to enhance pseudo-label generation. However, challenges such as noisy pseudo-labels and over-confident predictions limit their effectiveness in adapting well across domains. We propose Co-STAR, a novel framework that integrates curriculum learning with collaborative self-training between a source-trained teacher and a contrastive vision-language model (CLIP). Our curriculum learning approach employs a reliability-based weight function that measures bidirectional prediction alignment between the teacher and CLIP, balancing between confident and uncertain predictions. This function preserves uncertainty for difficult samples, while prioritizing reliable pseudo-labels when the predictions from both models closely align. To further improve adaptation, we propose Adaptive Curriculum Regularization, which modifies the learning priority of samples in a probabilistic, adaptive manner based on their confidence scores and prediction stability, mitigating overfitting to noisy and over-confident samples. Extensive experiments across multiple video domain adaptation benchmarks demonstrate that Co-STAR consistently outperforms state-of-the-art SFUVDA methods. Code will be available at https://anonymised.
Being Positive about Negative Queries: Exclusion Aware Multimodal Retrieval using Disentangled Representations
Prachi Jha ⋅ Sumit Bhatia ⋅ Srikanta Bedathur
The handling of exclusion in multimodal retrieval remains an underexplored challenge with significant implications for the accuracy and reliability of information retrieval systems. Although existing approaches have advanced multimodal understanding, they typically lack mechanisms to process exclusion explicitly. To address this, we propose a novel model, ExclMM, that leverages disentangled representations to effectively handle exclusion in multimodal retrieval. Our approach enables precise differentiation between the presence and absence of specific elements in an image, outperforming existing methods. To rigorously evaluate our model, we construct a dataset, ExcluCOCO, that pairs exclusion-based queries with ground-truth images sourced from MSCOCO. This dataset serves as a robust benchmark for assessing exclusion comprehension in multimodal contexts. By explicitly incorporating exclusion, our work advances multimodal retrieval by introducing both a model tailored for exclusion-aware retrieval and a benchmark to facilitate future research in this domain.
Dynamic Neural Networks (DNNs) have emerged as a promising solution to improve the computational efficiency of deep neural networks by adaptively adjusting inference complexity based on input characteristics. Despite their advantages, the deployment of dynamic networks in real-world applications remains challenging because most methods are hard to adapt to practical use cases such as object detection and because inference infrastructure offers little support for them. In this work, we present a dynamic neural network architecture specifically designed for object detection. Using our method, we build a variety of Pareto-optimal models for object detection on COCO in the 7-10 GFLOPs range. Additionally, to measure the routing efficacy, we introduce an evaluation metric that facilitates standardized benchmarking across different dynamic network approaches. Finally, we evaluate a deployment pipeline utilizing the ONNX format, thus building a DNN that shows speedup in a realistic deployment scenario. Experimental results demonstrate the performance and practical viability of our approach for efficient object detection in resource-constrained scenarios.
Semantic Map Guided Bird's-Eye View Learning for Online HD Map Construction
Huantao Ren ⋅ Hesham Eraqi ⋅ ABM Musa ⋅ Mohamed Moustafa
Vectorized High-Definition (HD) maps offer rich and precise environmental information about driving scenes, playing a crucial role in improving driver safety by supporting autonomous driving and advanced driver-assistance systems (ADAS). Processing individual camera images creates a fragmented view of the world, requiring complex and error-prone merging. Existing multi-view camera methods train deep neural networks to directly generate unified bird’s-eye view (BEV) features used to learn HD map construction. Nevertheless, a significant limitation is the lack of direct supervision of the learned BEV features based on the ground-truth map elements. To overcome this limitation, we propose a novel method, referred to as Semantic Map Guidance (SMG), for explicit alignment of the learned BEV features and the corresponding semantic representations by utilizing ground-truth labels during training. We demonstrate the effectiveness of the proposed SMG method by incorporating it into multiple state-of-the-art BEV-based methods for the online HD map construction task. We perform extensive experiments on two widely used HD map datasets, nuScenes and Argoverse 2, demonstrating that SMG, without any bells and whistles, consistently improves the accuracy of all the tested networks while using the same base network implementation and hyperparameters and adding no inference time.
ISALux: Illumination and Semantics-Aware Transformer Employing Mixture of Experts for Low Light Image Enhancement
Raul Balmez ⋅ Alexandru Brateanu ⋅ Ciprian Orhei ⋅ Codruta Ancuti ⋅ Cosmin Ancuti
In this paper we introduce ISALux, a novel transformer-based approach for Low-Light Image Enhancement (LLIE) that seamlessly integrates illumination and semantic priors. Our architecture includes an original self-attention block, Hybrid Illumination and Semantics-Aware Multi-Headed Self-Attention (HISA-MSA), which integrates illumination and semantic segmentation maps for enhanced feature extraction. ISALux employs two self-attention modules to independently process illumination and semantic features, selectively enriching each other to regulate luminance and highlight structural variations in real-world scenarios. A Mixture of Experts (MoE)-based Feed-Forward Network (FFN) enhances contextual learning, with a gating mechanism conditionally activating the top K experts for specialized processing. To address overfitting in LLIE methods caused by distinct light patterns in benchmarking datasets, we enhance the HISA-MSA module with low-rank matrix adaptations (LoRA). Extensive qualitative and quantitative evaluations across multiple specialized datasets demonstrate that ISALux is competitive with state-of-the-art (SOTA) methods. Additionally, an ablation study highlights the contribution of each component in the proposed model.
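The gating step mentioned above, in which only the top K experts are activated, can be illustrated with a generic top-K routing sketch. This is not ISALux's implementation: the linear gate `gate_w`, the softmax over the selected experts only, and the per-token (rather than batched) interface are assumptions chosen to keep the example short.

```python
import numpy as np

def top_k_moe(x, gate_w, experts, k=2):
    """Route one token through the k highest-scoring experts and mix their outputs."""
    logits = x @ gate_w                       # one gating score per expert
    top = np.argsort(logits)[-k:]             # indices of the k largest scores
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                  # softmax over the selected experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top))
```

Because the remaining experts are never evaluated, the compute cost scales with K rather than with the total number of experts.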
FastPose-ViT: A Vision Transformer for Real-Time Spacecraft Pose Estimation
Pierre Ancey ⋅ Andrew Price ⋅ Saqib Javed ⋅ Mathieu Salzmann
Estimating the 6-degrees-of-freedom (6DoF) pose of a spacecraft from a single image is critical for autonomous operations like in-orbit servicing and space debris removal. Existing state-of-the-art methods often rely on iterative Perspective-n-Point (PnP)-based algorithms, which are computationally intensive and ill-suited for real-time deployment on resource-constrained edge devices. To overcome these limitations, we propose FastPose-ViT, a Vision Transformer (ViT)-based architecture that directly regresses the 6DoF pose. Our approach processes cropped images from object bounding boxes and introduces a novel mathematical formalism to map these localized predictions back to the full-image scale. This formalism is derived from the principles of projective geometry and the concept of "apparent rotation," where the model predicts an apparent rotation matrix that is then corrected to find the true orientation. We demonstrate that our method outperforms other non-PnP strategies and achieves performance competitive with state-of-the-art PnP-based techniques on the SPEED dataset. Furthermore, we validate our model’s suitability for real-world space missions by quantizing it and deploying it on power-constrained edge hardware. On the NVIDIA Jetson Orin Nano, our end-to-end pipeline achieves a latency of ~75 ms per frame under sequential execution, and a non-blocking throughput of up to 33 FPS when stages are scheduled concurrently. To foster reproducibility, our codebase, model weights, and training and evaluation scripts will be released under an MIT license upon acceptance.
QuEENet: Quantum-Enhanced Expressive Network for Image Classification
Shashank Bayal ⋅ Dawane Govind ⋅ Komal Komal ⋅ SANTOSH VIPPARTHI ⋅ Subrahmanyam Murala
This paper presents QuEENet, a hybrid quantum-classical architecture for image classification that incorporates parameterized quantum circuits within a convolutional neural network. This study investigates how quantum circuit expressivity and entanglement strategies influence classification performance, with a focus on configurations involving a CNOT gate followed by a rotational gate on the target qubit. Non-Clifford gates, such as $R_x/R_y/R_z$, support larger state-space coverage and expressivity in quantum models. QuEENet explores the use of non-Clifford gates in parameterized quantum circuits. While non-Clifford gates are theoretically critical for universal quantum computation, their role in image classification tasks is unexplored. Experimental results across multiple benchmark datasets suggest that while increased expressivity via non-Clifford gates can be beneficial, it should be carefully balanced with circuit interpretability and trainability. QuEENet demonstrates that hybrid models can leverage quantum circuits not merely as architectural novelties, but as controllable modules for enhancing learning in classical pipelines. An extensive ablation study was conducted across multiple datasets to highlight the effects of Clifford and non-Clifford gate combinations and entanglement configurations.
SOLAR: Switchable Output Layer for Accuracy and Robustness in Once-for-All Training
Shaharyar Ahmed Khan Tareen ⋅ Lei Fan ⋅ Xiaojing Yuan ⋅ Qin Lin ⋅ Bin Hu
Once-for-All (OFA) training enables a single super-net to generate multiple sub-nets tailored to diverse deployment scenarios, supporting flexible trade-offs among accuracy, robustness, and model-size without retraining. However, as the number of supported sub-nets increases, excessive parameter sharing in the backbone limits representational capacity, leading to degraded calibration and reduced overall performance. To address this, we propose SOLAR (Switchable Output Layer for Accuracy and Robustness in Once-for-All Training), a simple yet effective technique that assigns each sub-net a separate classification head. By decoupling the logit learning process across sub-nets, the Switchable Output Layer (SOL) reduces representational interference and improves optimization, without altering the shared backbone. We evaluate SOLAR on five datasets (SVHN, CIFAR-10, STL-10, CIFAR-100, and TinyImageNet) using four super-net backbones (ResNet-34, WideResNet-16-8, WideResNet-40-2, and MobileNetV2) for two OFA training frameworks (OATS and SNNs). Experiments show that SOLAR outperforms the baseline methods: compared to OATS, it improves accuracy of sub-nets up to 1.26 %, 4.71 %, 1.67 %, and 1.76 %, and robustness up to 9.01 %, 7.71 %, 2.72 %, and 1.26 % on SVHN, CIFAR-10, STL-10, and CIFAR-100, respectively. Compared to SNNs, it improves TinyImageNet accuracy by up to 2.93 %, 2.34 %, and 1.35 % using ResNet-34, WideResNet-16-8, and MobileNetV2 backbones (with 8 sub-nets), respectively.
FG-TRACER: Tracing Information Flow in Multimodal Large Language Models in Free-Form Generation
Alessia Saporita ⋅ Vittorio Pipoli ⋅ Federico Bolelli ⋅ Lorenzo Baraldi ⋅ Andrea Acquaviva ⋅ ELISA FICARRA
Multimodal Large Language Models (MLLMs) have achieved impressive performance across a variety of vision–language tasks. However, their internal working mechanisms remain largely underexplored. In this work, we introduce FG-TRACER, a framework designed to analyze the information flow between visual and textual modalities in MLLMs in free-form generation. Notably, our numerically stabilized computational method enables the first systematic analysis of multimodal information flow in underexplored domains such as image captioning and chain-of-thought (CoT) reasoning. We apply FG-TRACER to two state-of-the-art MLLMs—LLaMA 3.2-Vision and LLaVA 1.5—across three vision–language benchmarks—TextVQA, COCO 2014, and ChartQA—and we conduct a word-level analysis of multimodal integration. Our findings uncover distinct patterns of multimodal fusion across models and tasks, demonstrating that fusion dynamics are both model- and task-dependent. Overall, FG-TRACER offers a robust methodology for probing the internal mechanisms of MLLMs in free-form settings, providing new insights into their multimodal reasoning strategies. Our source code is publicly available at https://anonymous.4open.science/r/FG-TRACER-CB5A/.
Patch Your Matcher: Correspondence-Aware Image-to-Image Translation Unlocks Cross-Modal Matching via Single-Modality Priors
Anton Frolov ⋅ Volker Rodehorst
Matching between image modalities is a high-impact research area. Current state-of-the-art (SOTA) methods rely on extensive multi-million-scale training protocols, which demand significant computational resources. While the necessary training effort scales quadratically with the number of optimized modalities, large-scale training schemes are also impractical for common cases where only two known modalities are matched. We propose Patch Your Matcher (PYM) (see https://anonymous.4open.science/r/patch-your-matcher-433), a universal method for leveraging pre-trained single-modality matchers for cross-modal matching (see Figure 1). PYM learns image-to-image (I2I) translations that map new modalities into the original matcher's modality using a novel adversarial learning approach based on explicit evaluation of 6-DoF two-view correspondence plausibility. Experiments with ELoFTR [Wang et al., 2024] demonstrate dramatic relative improvement of cross-modal matching accuracy, averaging 474.75% on unseen datasets, and even approaching 60.53% of the improvement achieved through extensive SOTA cross-modal training [Ren et al., 2024].
Diversity Preserving Coresets for Image Quality Assessment
Arpita Nema ⋅ Hanwei Zhu ⋅ Xi Zhang ⋅ Weisi Lin
Coresets are compact, representative subsets of large datasets. While coreset selection methods have been extensively investigated in image classification, their direct application to Image Quality Assessment (IQA) is hindered by the incoherent and structurally distinct nature of content and quality representations in IQA tasks. To address this gap, we propose Q-Diverse coreset, a framework tailored for IQA. Our method begins by extracting dual-view embeddings that are both content-aware and quality-aware, capturing semantic and perceptual nuances. Rather than directly combining these heterogeneous features, we construct separate pairwise distance matrices and fuse them in the distance space. The fused distances are then converted into a graph-based structure from which spectral embeddings are derived. Finally, a geometric diversity-based sampling strategy is applied in the spectral space to select a coreset that maximizes representativeness. Notably, Q-Diverse operates in a label-free manner, making it especially valuable in IQA, where collecting subjective quality annotations is expensive and time-consuming. Experimental results on seven IQA benchmarks demonstrate that Q-Diverse enables the effective training of deep learning-based IQA architectures even with limited data, while largely retaining performance. It achieves SRCC and PLCC values within 0.045 and 0.042 of those obtained from full-data training, using only 10% of the dataset on average. Our results establish Q-Diverse as a coreset selection method that enables efficient dataset curation as well as training and fine-tuning of deep learning-based IQA models.
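The fuse-then-sample part of this pipeline can be illustrated with a simplified sketch. It is not the Q-Diverse method itself: the spectral-embedding step is skipped and sampling runs directly on the fused distances, and the max-normalization, the mixing weight `alpha`, and greedy farthest-point selection are assumptions standing in for details the abstract does not give.

```python
def normalize(dist):
    """Scale a pairwise distance matrix so that its largest entry is 1."""
    peak = max(max(row) for row in dist)
    if peak == 0:
        return dist
    return [[d / peak for d in row] for row in dist]


def fuse(content_dist, quality_dist, alpha=0.5):
    """Fuse two views in distance space (assumed: convex combination)."""
    c, q = normalize(content_dist), normalize(quality_dist)
    n = len(c)
    return [[alpha * c[i][j] + (1.0 - alpha) * q[i][j] for j in range(n)]
            for i in range(n)]


def farthest_point_sampling(dist, k, start=0):
    """Greedily pick k indices, each maximizing distance to those already chosen."""
    selected = [start]
    min_dist = list(dist[start])  # distance of every item to the selected set
    while len(selected) < k:
        nxt = max(range(len(dist)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(a, b) for a, b in zip(min_dist, dist[nxt])]
    return selected


# Toy data: four items on a line; both "views" use the same distances here.
xs = [0.0, 1.0, 2.0, 10.0]
toy = [[abs(a - b) for b in xs] for a in xs]
coreset = farthest_point_sampling(fuse(toy, toy), k=3)
```

No labels are used anywhere in the selection, which matches the label-free property stated in the abstract.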
SAVE: Sparse Autoencoder‑Driven Visual Information Enhancement for Mitigating Object Hallucination
Sangha Park ⋅ Seungryong Yoo ⋅ Jisoo Mok ⋅ Sungroh Yoon
Although Multimodal Large Language Models (MLLMs) have advanced substantially, they remain vulnerable to object hallucination caused by language priors and visual information loss. To address this, we propose SAVE (Sparse Autoencoder‑Driven Visual Information Enhancement), a framework that mitigates hallucination by steering the model along Sparse Autoencoder (SAE) latent features. A binary object‑presence question‑answering probe identifies the SAE features most indicative of the model’s visual information processing, referred to as visual understanding features. Steering the model along these identified features reinforces grounded visual understanding and effectively reduces hallucination. With its simple design, SAVE outperforms state‑of‑the‑art training‑free methods on standard benchmarks, achieving a 10%p improvement in CHAIR_S and consistent gains on POPE and MMHal‑Bench. Extensive evaluations across multiple models and layers confirm the robustness and generalizability of our approach. Further analysis reveals that steering along visual understanding features suppresses the generation of uncertain object tokens and increases attention to image tokens, mitigating hallucination.
NavMapFusion: Diffusion-based Fusion of Navigation Maps for Online Vectorized HD Map Construction
Thomas Monninger ⋅ Zihan Zhang ⋅ Steffen Staab ⋅ Sihao Ding
Accurate environmental representations are essential for autonomous driving, providing the foundation for safe and efficient navigation. Traditionally, high‑definition (HD) maps provide this representation of the static road infrastructure to the autonomous system a priori. However, because the real world is constantly changing, such maps must be constructed online from on‑board sensor data. Navigation‑grade standard-definition (SD) maps are widely available, but their resolution is insufficient for direct deployment. Instead, they can be used as a coarse prior to guide the online map construction process. We propose NavMapFusion, a diffusion‑based framework that performs iterative denoising conditioned on high-fidelity sensor data and on low-fidelity navigation maps. This paper strives to answer: (1) How can coarse, potentially outdated navigation maps guide online map construction? (2) What advantages do diffusion models offer for map fusion? We demonstrate that diffusion-based map construction provides a robust framework for map fusion. Our key insight is that discrepancies between the prior map and online perception naturally correspond to noise within the diffusion process; consistent regions reinforce the map construction, whereas outdated segments are suppressed. On the nuScenes benchmark, NavMapFusion conditioned on coarse road lines from OpenStreetMap data reaches a 21.4% relative improvement at a 100 m perception range, and even stronger improvements at larger perception ranges, while maintaining real‑time capabilities. By fusing low‑fidelity priors with high‑fidelity sensor data, the proposed method generates accurate and up-to-date environment representations, paving the way towards safer and more reliable autonomous driving.
GFT: Graph Feature Tuning for Efficient Point Cloud Analysis
Manish Dhakal ⋅ Venkat Dasari ⋅ Rajshekhar Sunderraman ⋅ Yi Ding
Parameter-efficient fine-tuning (PEFT) significantly reduces computational and memory costs by updating only a small subset of the model's parameters, enabling faster adaptation to new tasks with minimal loss in performance. Previous studies have introduced PEFTs tailored for point cloud data, as general approaches are suboptimal. To further reduce the number of trainable parameters, we propose a point-cloud-specific PEFT, termed Graph Features Tuning (GFT), which learns a dynamic graph from initial tokenized inputs of the transformer using a lightweight graph convolution network and passes these graph features to deeper layers via skip connections and efficient cross-attention modules. Extensive experiments on object classification and segmentation tasks show that GFT operates in the same domain, rivalling existing methods, while reducing the trainable parameters.
QAL: A Loss for Recall–Precision Balance in 3D Reconstruction
Pranay Meshram ⋅ Yash Turkar ⋅ Kartikeya Singh ⋅ Praveen Raj Masilamani ⋅ Charuvahan Adhivarahan ⋅ Karthik Dantu
Volumetric learning underpins many 3D vision tasks such as completion, reconstruction, and mesh generation, yet training objectives still rely on Chamfer Distance (CD) or Earth Mover’s Distance (EMD), which fail to balance recall and precision. We propose Quality-Aware Loss (QAL), a drop-in replacement for CD/EMD that combines a coverage-weighted nearest-neighbor term with an uncovered–ground-truth attraction term, explicitly decoupling recall and precision into tunable components. Across diverse pipelines, QAL achieves consistent coverage gains, improving by an average of +4.3 pts over CD and +2.8 pts over the best alternatives. Though modest in percentage, these improvements reliably recover thin structures and under-represented regions that CD/EMD overlook. Extensive ablations confirm stable performance across hyperparameters and across output resolutions, while full retraining on PCN and ShapeNet demonstrates generalization across datasets and backbones. Moreover, QAL-trained completions yield higher grasp scores under GraspNet evaluation, showing that improved coverage translates directly into more reliable robotic manipulation. QAL thus offers a principled, interpretable, and practical objective for robust 3D vision and safety-critical robotics pipelines.
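To make the recall/precision framing concrete, the sketch below computes the two directional terms of the standard Chamfer Distance separately. It illustrates the baseline that QAL is said to replace, not QAL itself, whose exact weighting is not given in the abstract. The function names are hypothetical, and unsquared Euclidean distances are an assumption (some CD variants use squared distances).

```python
import math


def nearest_distance(point, cloud):
    """Euclidean distance from a point to its nearest neighbor in a cloud."""
    return min(math.dist(point, other) for other in cloud)


def chamfer_terms(pred, gt):
    """Return the two directional Chamfer terms separately.

    pred_to_gt: average distance from predicted points to the ground truth
                (penalizes spurious predictions, a precision-like term).
    gt_to_pred: average distance from ground-truth points to the prediction
                (penalizes missing regions, a recall-like term).
    """
    pred_to_gt = sum(nearest_distance(p, gt) for p in pred) / len(pred)
    gt_to_pred = sum(nearest_distance(g, pred) for g in gt) / len(gt)
    return pred_to_gt, gt_to_pred


# A prediction that is accurate but misses half of the ground truth.
pred = [(0.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
precision_term, recall_term = chamfer_terms(pred, gt)
```

In this toy case the precision-like term is zero while the recall-like term is not, which is the kind of imbalance a single summed CD value hides.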
Meta-YOLO: Metadata-Guided Real-Time Object Detector in Aerial Imagery
Deukryeol Yoon ⋅ Seonghak Kim ⋅ Young Hwa Sung ⋅ Jinho Jung
Aerial object detection is constrained by tiny targets, large scale variation, and strict real-time limits. It supports traffic monitoring, disaster response, and infrastructure inspection. Yet current detectors often ignore available platform metadata and process frames in isolation. This omission prevents receptive fields from adapting to scale variation and reduces accuracy. We propose Meta-YOLO to exploit platform metadata for scale-aware aerial object detection in real time. Meta-YOLO injects normalized telemetry into spatial sampling to guide feature extraction. It modulates deformable convolution offsets using a spatial metadata map aligned with the image. This links visual features with platform state and enables receptive fields to adapt to object scale. Built on YOLOX, Meta-YOLO adds two modules: feature modulation and offset correction. Evaluated on 327K aerial frames with metadata, Meta-YOLO achieves up to +8.7 AP gains over YOLOX in lightweight regimes and consistently outperforms other recent detectors. It preserves real-time throughput with negligible overhead and improves accuracy without extra visual processing.
Q-Former Autoencoder: A Modern Framework for Medical Anomaly Detection
Francesco Dalmonte ⋅ Emirhan Bayar ⋅ Emre Akbas ⋅ Iuliana Georgescu
Unsupervised anomaly detection in medical images is an important yet challenging task due to the diversity of possible anomalies and the practical impossibility of collecting comprehensively annotated datasets. In this paper, we propose a modernized autoencoder-based framework, the Q-Former Autoencoder, that leverages state-of-the-art pretrained vision foundation models for medical anomaly detection. Instead of training encoders from scratch, we directly utilize frozen foundation models as feature extractors, enabling rich, multi-stage, high-level representations without domain-specific fine-tuning. We introduce the Q-Former architecture as the bottleneck, which enables us to control the length of the reconstruction sequence while efficiently aggregating multi-scale features. Additionally, we incorporate a perceptual loss computed using features from a trained Masked Autoencoder, guiding the reconstruction towards semantically meaningful structures. Our framework is evaluated on four diverse medical anomaly detection benchmarks (BraTS2021, RESC, RSNA, and LiverCT), achieving state-of-the-art results. Our results highlight the potential of foundation model encoders, pretrained on natural images, to generalize effectively to medical image analysis tasks without further fine-tuning. Code and models will be released upon acceptance.
AusSmoke meets MultiNatSmoke: a fully-labelled diverse smoke segmentation dataset
Weihao Li ⋅ Hongjin Zhao ⋅ Gao Zhu ⋅ Ge-Peng Ji ⋅ Nicholas Wilson ⋅ Marta Yebra ⋅ Nick Barnes
Wildfires are an escalating global concern due to their devastating impacts on the environment, economy, and human health, with notable incidents such as the 2019-2020 Australian bushfires and the 2025 California wildfires underscoring the severity of these events. AI-enabled camera-based smoke detection has emerged as a promising approach for the rapid detection of wildfires. However, existing wildfire smoke segmentation datasets that are used for training detection and segmentation models are limited in scale, geographically constrained, and often rely on synthetic imagery, which hinders effective training and generalization. To overcome these limitations, we present AusSmoke, a new smoke segmentation dataset collected in Australia to address the shortage of data from this region. Furthermore, we introduce a multinational, geographically diverse, and substantially larger fully-labelled benchmark, called MultiNatSmoke, that consolidates publicly available international datasets with the newly collected Australian imagery, expanding the scale by an order of magnitude over previous collections. Finally, we benchmark smoke segmentation models, demonstrating improved performance and enhanced generalization across diverse geographical contexts.
NAPP: Noise-Adaptive Prototype Perturbation for Few-Shot Learning
Il Kim ⋅ Sang Yun ⋅ Dongheon Lee ⋅ Seong Kim Kim ⋅ Joonki Paik
Few-shot learning aims to generalize deep models to novel categories with only a handful of labeled examples, but existing methods remain vulnerable to task-irrelevant noise, unstable prototype estimation, and limited adaptability under domain shift. To address these issues, we propose the Noise-Adaptive Prototype Perturbation Network (NAPP), a framework that enhances robustness and generalization for few-shot image classification. NAPP introduces three key innovations: (1) a Noise Cancellation Mechanism embedded in Vision Transformer self-attention layers that dynamically suppresses spurious, task-irrelevant features; (2) a MixPerturbation Module that perturbs class prototypes through augmented feature combinations, producing more stable and transferable prototype representations; and (3) an Adaptive Noise-Conditioned Meta-Learning scheme that fine-tunes less than 0.02% of noise-related parameters at meta-test time, enabling efficient and rapid adaptation to unseen classes without eroding pretrained knowledge. Extensive experiments demonstrate that NAPP achieves competitive or superior performance compared to state-of-the-art few-shot classification methods across both in-domain and challenging cross-domain benchmarks. These results highlight NAPP as a parameter-efficient and domain-robust framework, underscoring its practical effectiveness in real-world few-shot learning scenarios.
GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting
Madhav Agarwal ⋅ Mingtian Zhang ⋅ Laura Sevilla-Lara ⋅ Steven McDonagh
Speech-driven talking heads have emerged recently, enabling interactive avatars. However, real-world applications are limited, as current methods are either accurate but slow or fast yet temporally unstable. Diffusion methods provide realistic image generation, yet struggle with one-shot generation, while Gaussian Splatting approaches run in real time, but inaccuracies in tracking or mappings of Gaussians lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this by mapping Gaussian Splatting via 3D Morphable Models (3DMM) to generate person-specific avatars and introduce transformer-based prediction of 3DMM parameters, directly from audio, to drive temporal consistency. From monocular video and independent speech input signals we generate real-time talking head videos with lip-sync, where we report competitive quantitative and qualitative video-generation performance.
Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting
Hao-Jen Chien ⋅ Yi-Chuan Huang ⋅ Chung-Ho Wu ⋅ Wei-Lun Chao ⋅ Yu-Lun Liu
Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, and a static view is rendered by fixing the model’s time parameter, which retains nearby temporal variation. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their last well-observed past states, while defective states are anchored to future states with stronger supervision. The anchor's influence decays with temporal distance to respect true motion. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference.
The task of style transfer for 3D Gaussian splats has been explored in many previous works, but these require reconstructing or fine-tuning the splat while incorporating style information or optimizing a feature extraction network on the splat representation. We propose a reconstruction- and optimization-free approach to stylizing 3D Gaussian Splats. This is done by generating a graph structure across the implicit surface of the splat representation. A feed-forward, surface-based stylization method is then used and interpolated back to the individual splats in the scene. This allows for any style image and 3D Gaussian Splat to be used without any additional training or optimization. This also allows for fast stylization of splats, achieving speeds under 2 minutes even on consumer-grade hardware. We demonstrate the quality results this approach achieves and compare to other Gaussian Splat style transfer methods. Code will be publicly available upon publication.
PEaRL: Pathway-Enhanced Representation Learning for Gene and Pathway Expression Prediction from Histology
Sejuti Majumder ⋅ Saarthak Kapse ⋅ Moinak Bhattacharya ⋅ Xuan Xu ⋅ Alisa Yurovsky ⋅ Prateek Prasanna
Integrating histopathology with spatial transcriptomics (ST) provides a powerful opportunity to link tissue morphology with molecular function. Yet most existing multimodal approaches rely on a small set of highly variable genes, which limits predictive scope and overlooks the coordinated biological programs that shape tissue phenotypes. We present PEaRL (Pathway Enhanced Representation Learning), a multimodal framework that represents transcriptomics through pathway activation scores computed with ssGSEA. By encoding biologically coherent pathway signals with a transformer and aligning them with histology features via contrastive learning, PEaRL reduces dimensionality, improves interpretability, and strengthens cross-modal correspondence. Across three cancer ST datasets—breast, skin, and lymph node—PEaRL consistently outperforms state-of-the-art methods, yielding higher accuracy for both gene- and pathway-level expression prediction (up to 58.9% and 20.4% increase in Pearson correlation coefficient compared to SOTA). These results demonstrate that grounding transcriptomic representation in pathways produces more biologically faithful and interpretable multimodal models, advancing computational pathology beyond gene-level embeddings.
Flood-LDM: Generalizable Latent Diffusion Models for rapid and accurate zero-shot High-Resolution Flood Mapping
Sun Han Neo ⋅ Sachith Seneviratne ⋅ Herath Mudiyanselage Viraj Vidura Herath ⋅ Abhishek Saha ⋅ Sanka Rasnayaka ⋅ Lucy Marshall
Flood prediction is critical for emergency planning and response to mitigate human and economic losses. Traditional physics-based hydrodynamic models generate accurate high-resolution flood maps using numerical methods that require fine-grid discretization, which is computationally intensive for large areas and impractical for real-time applications. While recent studies have applied convolutional neural networks for flood map super-resolution with good accuracy and speed, these models suffer from limited generalizability when applied to unseen areas or novel flood scenarios. In this paper, we propose a novel approach that leverages latent diffusion models to perform super-resolution on coarse-grid flood maps, with the objective of achieving the accuracy of fine-grid flood maps while significantly reducing inference time. Our experimental results demonstrate that latent diffusion models can substantially decrease the computational time required to produce high-fidelity flood maps without compromising on accuracy, thus enabling their use in real-time flood risk management. Moreover, diffusion models exhibit superior generalizability across different physical locations, with transfer learning further accelerating adaptation to new geographic regions with limited effort. Finally, by incorporating physics-informed inputs into the model, our approach addresses the common limitation of black-box behavior in machine learning, thereby enhancing interpretability. Code will be made publicly available upon acceptance.
2D-to-3D human pose lifting is an ill-posed problem due to depth ambiguity and occlusion. Existing methods that rely on spatial and temporal consistency alone are insufficient to resolve these problems, especially in the presence of significant occlusions or highly dynamic actions. Semantic information, however, offers a complementary signal that can help disambiguate such cases. To this end, we propose ActionPose, a framework that leverages action knowledge by aligning motion embeddings with text embeddings of fine-grained action labels. ActionPose operates in two stages: pretraining and fine-tuning. In the pretraining stage, the model simultaneously learns to recognize actions and reconstruct 3D poses from masked and noisy 2D poses. During the fine-tuning stage, the model is further refined using real-world 3D human pose estimation datasets without action labels. Additionally, our framework incorporates masked body parts and masked time windows in motion modeling, encouraging the model to leverage semantic information when spatial and temporal consistency is unreliable. Experiments demonstrate the effectiveness of ActionPose, achieving state-of-the-art performance in 3D pose estimation on public datasets, including Human3.6M and MPI-INF-3DHP. Specifically, ActionPose achieves an MPJPE of 36.7 mm on Human3.6M with detected 2D poses as input and 15.5 mm on MPI-INF-3DHP with ground-truth 2D poses as input.
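For readers unfamiliar with the reported metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. The sketch below shows the single-frame computation; the function name and the toy coordinates are illustrative, and any alignment protocol used in the paper's evaluation is not modeled here.

```python
import math


def mpjpe(predicted, target):
    """Mean per-joint position error for one pose.

    Both arguments are sequences of (x, y, z) joint positions in the same
    unit (e.g. millimetres) and in the same joint order.
    """
    if len(predicted) != len(target):
        raise ValueError("poses must contain the same number of joints")
    return sum(math.dist(p, t) for p, t in zip(predicted, target)) / len(predicted)


# Two-joint toy pose: one joint is off by 5 mm, the other is exact.
error_mm = mpjpe([(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
                 [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)])
```

Dataset-level scores such as those quoted above are obtained by averaging this per-pose value over all evaluated frames.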
SphereEdit: Spherical Semantic Editing in Diffusion Models
Salamata Konate ⋅ Hassan Hamidi ⋅ Elham Dolatabadi ⋅ Frank Rudzicz ⋅ Laleh Seyyed-Kalantari
Despite significant advances in diffusion models, achieving precise and composable image editing without task-specific training remains a challenge. Existing approaches often rely on iterative optimization or linear latent operations, which are slow, brittle, and prone to attribute entanglement (e.g., editing “lipstick” inadvertently alters skin tone). We introduce SphereEdit, a training-free framework that leverages the spherical geometry of diffusion embeddings and token-aware cross-attention to enable interpretable, fine-grained control. We represent semantic attributes as unit vector directions in the denoiser’s prediction space and show that antipodal symmetry ("old" is approximately the negation of "young") naturally supports bidirectional edits, while approximate orthogonality enables clean composition through spherical coefficients. At inference, these directions modulate cross-attention activations, producing spatially localized edits without optimization or fine-tuning. SphereEdit achieves sharper, more disentangled edits than prior baselines, while remaining plug-and-play and applicable across diverse image editing tasks.
FedEFC: Federated Learning Using Enhanced Forward Correction Against Noisy Labels
Seunghun Yu ⋅ Jin-Hyun Ahn ⋅ Joonhyuk Kang
Federated Learning (FL) is a powerful framework for privacy-preserving distributed learning. It enables multiple clients to collaboratively train a global model without sharing raw data. However, handling noisy labels in FL remains a major challenge due to heterogeneous data distributions and communication constraints, which can severely degrade model performance. To address this issue, we propose FedEFC, a novel method designed to tackle the impact of noisy labels in FL. FedEFC mitigates this issue through two key techniques: (1) prestopping, which prevents overfitting to mislabeled data by dynamically halting training at an optimal point, and (2) loss correction, which adjusts model updates to account for label noise. In particular, we develop an effective loss correction tailored to the unique challenges of FL, including data heterogeneity and decentralized training. Furthermore, we provide a theoretical analysis, leveraging the composite proper loss property, to demonstrate that the FL objective function under noisy label distributions can be aligned with the clean label distribution. Extensive experimental results validate the effectiveness of our approach, showing that it consistently outperforms existing FL techniques in mitigating the impact of noisy labels, particularly under heterogeneous data settings (e.g., achieving up to 41.64% relative performance improvement over the existing loss correction method).
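As background for the loss-correction component, the sketch below shows the standard forward correction for a single example: the model's clean-class probabilities are pushed through a label-noise transition matrix before the negative log-likelihood is taken. This is the generic technique, not FedEFC's enhanced variant, whose details are not given in the abstract; the function name and the example matrix are assumptions.

```python
import math


def forward_corrected_nll(clean_probs, transition, noisy_label):
    """Negative log-likelihood of a noisy label under forward correction.

    clean_probs[i]   : model probability that the true class is i
    transition[i][j] : assumed probability that true class i is observed as j
    """
    noisy_prob = sum(p * transition[i][noisy_label]
                     for i, p in enumerate(clean_probs))
    return -math.log(noisy_prob)


# Example with an assumed 2-class noise matrix (rows sum to 1).
T = [[0.8, 0.2],
     [0.3, 0.7]]
loss = forward_corrected_nll([0.9, 0.1], T, noisy_label=0)
```

With an identity transition matrix the expression reduces to the ordinary cross-entropy, so the correction only changes training where label noise is assumed.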
We introduce the novel task of Photo Dating by Facial Age Aggregation, which aims to estimate the year a photograph was taken by leveraging information from the faces of people present in the image. To facilitate this research, we publicly release a new dataset containing over 1.6 million annotated faces, primarily from movie stills, with identity and birth year annotations. Uniquely, our dataset provides annotations for multiple individuals within a single image, enabling the study of multi-face information aggregation. We propose a probabilistic framework that formally combines career-based temporal priors with visual evidence from modern face recognition and age estimation models to infer the capture year. Our experiments demonstrate that aggregating evidence from multiple faces consistently improves performance, and that the approach significantly outperforms strong scene-based baselines, particularly for images containing several identifiable individuals.
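One simple way to picture multi-face aggregation is a product of per-face likelihoods over candidate years, where each face contributes the probability of the age implied by that year and the person's birth year. The sketch below is only an illustration under strong assumptions (known identities, independent faces, and a hypothetical `year_posterior` helper); it is not the authors' probabilistic framework, and the optional `prior` merely stands in for the career-based prior mentioned above.

```python
def year_posterior(faces, years, prior=None):
    """Combine age evidence from several faces into a distribution over years.

    faces : list of (birth_year, age_probs), where age_probs maps an integer
            age to the age estimator's probability for that age
    years : iterable of candidate capture years
    prior : optional dict mapping year -> prior probability
    """
    scores = {}
    for year in years:
        score = prior.get(year, 0.0) if prior is not None else 1.0
        for birth_year, age_probs in faces:
            # Age implied by this candidate year for this person.
            score *= age_probs.get(year - birth_year, 0.0)
        scores[year] = score
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no candidate year is consistent with all faces")
    return {year: s / total for year, s in scores.items()}


# Two people with made-up birth years and age estimates.
faces = [(1960, {30: 0.5, 31: 0.5}),
         (1970, {20: 0.2, 21: 0.8})]
posterior = year_posterior(faces, range(1985, 1996))
best_year = max(posterior, key=posterior.get)
```

Each additional face narrows the set of years that remain plausible, which is consistent with the reported benefit of images containing several identifiable individuals.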
Scalable Video Action Anticipation with Cross Linear Attentive Memory
Zeyun Zhong ⋅ Manuel Martin ⋅ David Schneider ⋅ David Lerch ⋅ Chengzhi Wu ⋅ Frederik Diederichs ⋅ Juergen Gall ⋅ Jürgen Beyerer
Recent advances in action anticipation rely heavily on Transformer architectures to learn discriminative representations of the past observation, incurring high computational and memory overhead that limits their applicability to long videos. While temporal processors with linear complexity like RNNs and state-space models offer efficient alternatives, their sequential nature risks overlooking subtle cues in observed frames that could enhance future anticipation. We address this limitation with Cross Linear Attentive Memory (CLAM), a memory module that selectively retrieves complementary context cues from frame features. By reformulating linear attention to replace traditional cross-attention, CLAM achieves linear computation complexity and constant memory usage relative to input length. Finally, by fusing the outputs of the temporal processor and CLAM, a non-autoregressive Transformer decoder generates future actions in one shot with high accuracy. Experiments on egocentric (EpicKitchens100 and Ego4D) and third-person (Thumos14) benchmarks demonstrate our model’s superior anticipation accuracy and scalability, processing longer sequences with significantly less latency growth than alternatives. Our approach also achieves promising results in online action detection.
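The constant-memory property of linear attention can be seen in a small sketch: instead of storing every key and value, the memory keeps only a running key-value summary and a running key sum. This is the generic kernelized linear-attention formulation with an `elu(x) + 1` feature map, used here as an assumption; it is not claimed to match CLAM's actual reformulation, and the class name is hypothetical.

```python
import math


def feature_map(vec):
    """elu(x) + 1, a common positive feature map for linear attention."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]


class LinearAttentionMemory:
    """Hypothetical fixed-size memory queried with linear attention."""

    def __init__(self, key_dim, value_dim):
        self.kv = [[0.0] * value_dim for _ in range(key_dim)]  # sum of phi(k) v^T
        self.k_sum = [0.0] * key_dim                           # sum of phi(k)

    def write(self, key, value):
        """Fold one frame feature into the summary; cost is independent of history."""
        phi = feature_map(key)
        for i, k in enumerate(phi):
            self.k_sum[i] += k
            for j, v in enumerate(value):
                self.kv[i][j] += k * v

    def read(self, query):
        """Return the attention-weighted average of all values written so far."""
        phi = feature_map(query)
        norm = sum(q * s for q, s in zip(phi, self.k_sum))
        return [sum(q * row[j] for q, row in zip(phi, self.kv)) / norm
                for j in range(len(self.kv[0]))]


memory = LinearAttentionMemory(key_dim=2, value_dim=3)
for frame_key in ([0.5, -1.0], [2.0, 0.3], [-0.7, 0.1]):
    memory.write(frame_key, [1.0, 2.0, 3.0])
context = memory.read([0.2, -0.4])
```

The stored state has `key_dim * value_dim + key_dim` numbers regardless of how many frames are written, and each write or read costs the same amount of work, giving linear total computation in the sequence length.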
DUDA: Distilled Unsupervised Domain Adaptation for Lightweight Semantic Segmentation
Beomseok Kang ⋅ Niluthpol Mithun ⋅ Abhinav Rajvanshi ⋅ Han-pang Chiu ⋅ Supun Samarasekera
Unsupervised Domain Adaptation (UDA) is essential for enabling semantic segmentation in new domains without requiring costly pixel-wise annotations. State-of-the-art (SOTA) UDA methods primarily use self-training with architecturally identical teacher and student networks, relying on Exponential Moving Average (EMA) updates. However, these approaches face substantial performance degradation with lightweight models due to inherent architectural inflexibility leading to low-quality pseudo-labels. To address this, we propose Distilled Unsupervised Domain Adaptation (DUDA), a novel framework that combines EMA-based self-training with knowledge distillation (KD). Our method employs an auxiliary student network to bridge the architectural gap between heavyweight and lightweight models for EMA-based updates, resulting in improved pseudo-label quality. DUDA employs a strategic fusion of UDA and KD, incorporating innovative elements such as gradual distillation from large to small networks, inconsistency loss prioritizing poorly adapted classes, and learning with multiple teachers. Extensive experiments across four UDA benchmarks demonstrate DUDA's superiority in achieving SOTA performance with lightweight models, often surpassing the performance of heavyweight models from other approaches.
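The EMA teacher update used in self-training can be written in a few lines. The sketch treats parameters as flat lists of floats and assumes teacher and student share the same shape, which is exactly the architectural constraint the abstract says DUDA works around with an auxiliary student; the function name and the decay value are illustrative.

```python
def ema_update(teacher, student, decay=0.999):
    """Return teacher parameters moved slightly towards the student's.

    new_teacher = decay * teacher + (1 - decay) * student, element-wise.
    Requires both parameter lists to have identical length and ordering.
    """
    if len(teacher) != len(student):
        raise ValueError("EMA needs architecturally identical networks")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


teacher_params = [1.0, 2.0]
student_params = [0.0, 2.0]
teacher_params = ema_update(teacher_params, student_params, decay=0.9)
```

The length check makes the limitation explicit: a heavyweight teacher cannot be averaged with a lightweight student directly, so some intermediate network of matching architecture is needed.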
Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study
Arushi Rai ⋅ Adriana Kovashka
While there has been rapid progress in video-LLMs capable of advanced reasoning, prior work shows that these models struggle on the challenging task of sports feedback generation and require expensive and difficult-to-collect finetuning feedback data from the target sport. This limitation is evident from the poor generalization to sports unseen during finetuning. Furthermore, traditional text generation evaluation metrics (e.g., BLEU-4, METEOR, ROUGE-L, BERTScore), originally developed for machine translation and summarization, fail to capture the unique aspects of sports feedback quality. To address the first problem, we propose using auxiliary freely available web data from the target domain, such as competition videos and coaching manuals, in addition to existing sports feedback from a disjoint source domain, to improve sports feedback generation performance on the target domain. To improve evaluation, we propose two evaluation metrics: (1) specificity and (2) actionability. Together, our approach enables more meaningful and practical generation of sports feedback under limited annotations.
START: Spatial and Textual Learning for Chart Understanding
Zhuoming Liu ⋅ Xiaofeng Gao ⋅ Feiyang Niu ⋅ Qiaozi Gao ⋅ Liu Liu ⋅ Robinson Piramuthu
Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property); grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, the Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM’s understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset generated with a novel data-generation pipeline that first leverages a vision-language model (VLM) to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a Large Language Model (LLM) to ascertain the positions of chart elements that capture the chart's visual structure, addressing challenges that existing methods cannot handle. To evaluate a model’s ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains across model sizes and benchmarks over the base models and surpasses prior state-of-the-art by a clear margin. Code, data and models will be publicly available.
VRAgent: Self-Refining Agent for Zero-Shot Multimodal Video Retrieval
Ketul Shah ⋅ Pankaj Nathani ⋅ Rama Chellappa ⋅ Fabian Caba Heilbron
Recent advances in Vision-Language Models (VLMs) and Large Language Models (LLMs) have demonstrated remarkable zero-shot capabilities while becoming increasingly accessible. They have enabled zero‑shot text‑to‑video search, yet they still struggle on real‑world queries that demand temporally aligned reasoning across vision and speech. We present VRAgent, an agentic retrieval framework that leverages a central LLM as a planner which (i) decomposes a free‑form user query into a tool‑instruction set spanning visual, dialogue and other modality‑specific retrievers, and (ii) iteratively self‑refines this plan by scoring its own outputs and rewriting the next instruction set. The resulting closed‑loop optimization acts at test time and requires no additional training data or gradient updates. VRAgent is modular by design: adding a new modality is as simple as adding a corresponding foundation model to the toolbox. On our newly proposed MM‑MSRVTT and TVR‑1200 multimodal benchmarks, VRAgent improves average recall by +8.3% and +1.7% over the best zero‑shot baselines, while on single‑modality MSR‑VTT and DiDeMo it obtains consistent gains of +3.7% and +4.1%. An interactive variant that asks the user up to two multiple‑choice questions pushes average recall to 79.7% on MSR‑VTT, underscoring the value of on‑the‑fly human feedback.
MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps
Sharat Bhat ⋅ Harshita Khandelwal ⋅ Tushar Kataria ⋅ Vivek Gupta
Maps are powerful carriers of structured and contextual knowledge, encompassing geography, demographics, infrastructure, and environmental patterns. Reasoning over such knowledge requires models to integrate spatial relationships, visual cues, real-world context, and domain-specific expertise, capabilities that current large language models (LLMs) and vision–language models (VLMs) still struggle to exhibit consistently. Yet, datasets used to benchmark VLMs on map-based reasoning remain narrow in scope, restricted to specific domains, and heavily reliant on artificially generated content (outputs from LLMs or pipeline-based methods), offering limited depth for evaluating genuine geospatial reasoning. To address this gap, we present MapVerse, a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question–answer pairs across 1,025 maps, spanning ten diverse map categories and multiple question categories for each. The dataset provides a rich setting for evaluating map reading, interpretation, and multimodal reasoning. We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps. Beyond overall performance, we conduct fine-grained categorical analyses to assess model inference across multiple dimensions and investigate the visual factors shaping reasoning outcomes. Our findings reveal that while current VLMs perform competitively on classification-style tasks, both open- and closed-source models fall short on advanced tasks requiring complex spatial reasoning. All code and data will be publicly released via GitHub upon acceptance.
IMPACT: Interpretable Most Important Person Analysis and Classification using Transformer-based Models
Akshat Rampuria ⋅ Kamakshya Nayak ⋅ Kamalakar Thakare ⋅ Tushar Joshi ⋅ Aditya Singh ⋅ Haesol Park ⋅ Heeseung Choi ⋅ Debi Dogra ⋅ Ig-Jae Kim
Identifying the Most Important Person (MIP) in complex social and sports events remains a challenging problem due to the dynamic nature of group interactions, subtle visual cues, and context-dependent semantics. Traditional methods often struggle to accurately capture the interplay between individuals and the overarching activity, especially in unstructured real-world environments. In addition, the lack of strong supervision and the need for a deeper contextual understanding further complicate the task. In this work, we propose IMPACT, a novel multi-modal framework that leverages recent advances in vision language models to bridge the gap between visual perception and semantic reasoning. Our approach integrates structured scene understanding, natural language generation, and cross-modal learning to jointly model activity recognition and MIP localization. The method integrates language, vision, and spatial reasoning to improve scene interpretability as well as accuracy in group activity recognition tasks. By incorporating language-based representations, the proposed method enables interpretable and robust performance in sports-centric group activity scenarios. Comprehensive experiments on C-Sports and NCAA datasets demonstrate that the framework significantly enhances the localization of key individuals as well as the accuracy of activity prediction, laying the groundwork for a holistic scene understanding in human-centric video and image analysis. Our proposed method achieves an accuracy of 81.6% when compared with human annotator markings and an increase in mAP scores by $\sim 5\%$ for MIP identification.
SurgXBench: Explainable Vision-Language Model Benchmark for Surgery
Jiajun Cheng ⋅ Xianwu Zhao ⋅ Sainan Liu ⋅ Xiaofan Yu ⋅ Ravi Prakash ⋅ Patrick Codd ⋅ Jonathan Katz ⋅ Shan Lin
Innovations in digital intelligence are transforming robotic surgery with more informed decision-making. Real-time awareness of surgical instrument presence and actions (e.g., cutting tissue) is essential for such systems. Yet, despite decades of research, most machine learning models for this task are trained on small datasets and still struggle to generalize. Recently, Vision-Language Models (VLMs) have brought transformative advances in reasoning across visual and textual modalities. Their unprecedented generalization capabilities suggest great potential for advancing intelligent robotic surgery. However, surgical VLMs remain underexplored, and existing models show limited performance, highlighting the need for benchmark studies to assess their capabilities and limitations and to inform future development. To this end, we benchmark the zero-shot performance of several advanced VLMs on two public robotic-assisted laparoscopic datasets for instrument and action classification. Beyond standard evaluation, we integrate explainable AI to visualize VLM attention and uncover causal explanations behind their predictions. This provides a previously underexplored perspective in this field for evaluating the reliability of model predictions. We also propose several explainability analysis-based metrics to complement standard evaluations. Our analysis reveals that surgical VLMs, despite domain-specific training, often rely on weak contextual cues rather than clinically relevant visual evidence, highlighting the need for stronger visual and reasoning supervision in surgical applications.
SeqFeedNet: Sequential Feature Feedback Network for Background Subtraction
Yu-Shun Huang ⋅ Yi-Xiang Yang
Background subtraction (BGS) is a fundamental task in computer vision with applications in video surveillance, object tracking, and recognition. Despite recent advancements, many deep learning-based BGS algorithms rely on large models to extract high-level representations, demanding significant computational resources and leading to inefficiencies in processing video streams. To address these limitations, we introduce the Sequential Feature Feedback Network (SeqFeedNet), a novel supervised algorithm for BGS in unseen videos that operates without additional pre-processing models. SeqFeedNet incorporates time-scale diverse sequential features and employs a feedback mechanism for each iteration. Moreover, we propose the Sequential Fit Training (SeqFiT) technique, enhancing model convergence during training. Evaluated on the CDnet 2014 dataset, SeqFeedNet not only achieves a $\sim5\times$ increase in inference speed but also exceeds the F-Measure scores of the leading supervised algorithms, making it highly suitable for real-world applications. Our experiments demonstrate that SeqFeedNet surpasses the state-of-the-art network without a pre-trained segmentation model by 3.83% F-Measure on the CDnet 2014 dataset, establishing a new benchmark for efficient and effective BGS in unseen videos.
Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors
Son Tung Nguyen ⋅ Alejandro Fontan ⋅ Michael Milford ⋅ Tobias Fischer
Recent learning-based visual localization methods have shown strong performance by incorporating global descriptors to disambiguate visually similar landmarks in large-scale environments. However, existing approaches typically derive these descriptors from geometrical cues alone (e.g., via a covisibility graph), decoupling them from visual information and limiting their discriminative power. This disconnection reduces robustness under noisy geometrical constraints, particularly when these constraints are derived from potentially unreliable overlapping scores. In this paper, we propose an aggregator module that learns global descriptors consistent with both geometrical relationships and visual similarity. This dual consistency ensures that images receive similar descriptors only when they are both visually similar and structurally connected in the covisibility graph. The aggregator improves descriptor quality by correcting erroneous associations between unrelated image pairs that arise from noisy overlapping scores. We leverage a batch mining strategy based solely on the covisibility graph with a modified contrastive loss, eliminating the need for manual place label annotation and enabling efficient training across diverse environments. Experiments on challenging benchmarks demonstrate that our approach significantly improves localization accuracy in large-scale environments while maintaining similar computational and memory efficiency. The code accompanying this paper will be released.
VitaGlyph: Vitalizing Artistic Typography with Flexible Dual-branch Diffusion Models
Kailai Feng ⋅ Yabo Zhang ⋅ Haodong Yu ⋅ Zhilong Ji ⋅ Jinfeng Bai ⋅ Hongzhi Zhang ⋅ Wangmeng Zuo
Artistic typography is a technique that enables one to visualize the meaning of an input character in an imaginable and readable manner. With powerful text-to-image diffusion models, existing methods directly design the overall geometry and texture of the input character, making it challenging to ensure both creativity and legibility. In this paper, we introduce a dual-branch, training-free method called VitaGlyph, enabling flexible artistic typography with controllable geometry changes while maintaining legibility. The key insight of VitaGlyph is to treat the input character as a scene composed of a Subject and its Surrounding, which are rendered with varying degrees of geometric transformation. To enhance the visual appeal and creativity of the generated artistic typography, the Subject flexibly expresses the essential concept of the input character, while the Surrounding enriches relevant background without altering the shape. Specifically, we implement VitaGlyph through a three-phase framework: (i) Knowledge Acquisition leverages large language models to design text descriptions for the Subject and Surrounding. (ii) Regional Interpretation detects the part that matches the subject description most closely and refines the structure using Semantic Typography. (iii) Attentional Compositional Generation separately renders the textures of the Subject and Surrounding and blends them in an attention-based manner. Experiments demonstrate that VitaGlyph not only achieves better artistry and legibility, but also manages to depict multiple customized concepts, facilitating more creative and pleasing artistic typography generation.
AortaDiff: A Unified Multitask Diffusion Framework for Contrast-Free AAA Imaging
Yuxuan Ou ⋅ NING BI ⋅ Jiazhen Pan ⋅ Jiancheng Yang ⋅ Boliang Yu ⋅ Usama Zidan ⋅ Regent Lee ⋅ Vicente Grau
While contrast-enhanced CT (CECT) is standard for assessing abdominal aortic aneurysms (AAA), the required iodinated contrast agents pose significant risks, including nephrotoxicity, patient allergies, and environmental harm. To reduce contrast agent use, recent deep learning methods have focused on generating synthetic CECT from non-contrast CT (NCCT) scans. However, most adopt a multi-stage pipeline that first generates images and then performs segmentation, which leads to error accumulation and fails to leverage shared semantic and anatomical structures. To address this, we propose a unified deep learning framework that generates synthetic CECT images from NCCT scans while simultaneously segmenting the aortic lumen and thrombus. Our approach integrates conditional diffusion models (CDM) with multi-task learning, enabling end-to-end joint optimization of image synthesis and anatomical segmentation. Unlike previous multitask diffusion models, our approach requires no initial predictions (e.g., a coarse segmentation mask), shares both encoder and decoder parameters across tasks, and employs a semi-supervised training strategy to learn from scans with missing segmentation labels, a common constraint in real-world clinical data. We evaluated our method on a cohort of 264 patients, where it consistently outperformed state-of-the-art single-task and multi-stage models. For image synthesis, our model achieved a PSNR of 25.61 dB, compared to 23.80 dB from a single-task CDM. For anatomical segmentation, it improved the lumen Dice score to 0.89 from 0.87 and the challenging thrombus Dice score to 0.53 from 0.48 (nnU-Net). These segmentation enhancements led to more accurate clinical measurements, reducing the lumen diameter MAE to 4.19 mm from 5.78 mm and the thrombus area error to 33.85% from 41.45% when compared to nnU-Net.
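The two headline metrics in the abstract above, PSNR (in dB) for image synthesis and the Dice score for segmentation, have standard definitions. The following is a minimal sketch of both, assuming NumPy arrays, images scaled to a known `data_range`, and binary masks. The function names and the handling of empty masks are illustrative assumptions and are not taken from the authors' evaluation code.

```python
import numpy as np


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)


def dice(mask_a, mask_b):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2.0 * np.logical_and(a, b).sum() / total
```

Higher is better for both: PSNR grows as the mean squared error shrinks, and Dice ranges from 0 (no overlap) to 1 (identical masks).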
DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment
Sheng-Hao Liao ⋅ Shang-Fu Chen ⋅ Tai-Ming Huang ⋅ Wen-Huang Cheng ⋅ Kailung Hua
Drag-based image editing using generative models provides intuitive and fine-grained control over image structures. However, existing methods rely heavily on manually provided masks and textual prompts to preserve semantic fidelity and motion precision. Removing these constraints creates a fundamental trade-off: substantial visual artifacts without masks and poor spatial control without prompts. To address these limitations, we propose DirectDrag, a novel mask-free and prompt-free editing framework. DirectDrag enables precise and efficient manipulation with minimal user input while maintaining high image fidelity and accurate point alignment. DirectDrag introduces two key innovations. First, we design an Auto Soft Mask Generation module that intelligently infers editable regions from point displacement, automatically localizing deformation along movement paths while preserving contextual integrity through the generative model's inherent capacity. Second, we develop a Readout-Guided Feature Alignment mechanism that leverages intermediate diffusion activations to maintain structural consistency during point-based edits, substantially improving visual fidelity. Despite operating without manual mask or prompt, DirectDrag achieves superior image quality compared to existing methods while maintaining competitive drag accuracy. Extensive experiments on DragBench and real-world scenarios demonstrate the effectiveness and practicality of DirectDrag for high-quality, interactive image manipulation.
Bridging the Domain Gap in Small Multimodal Models: A Dual-level Alignment Perspective
Aveen Dayal ⋅ Peketi Divya ⋅ Nidhi Tiwari ⋅ Linga Reddy Cenkeramaddi ⋅ C Mohan ⋅ Abhinav Kumar
Small Multimodal Models (SMMs) fine-tuned with the Low-Rank Adaptation (LoRA) technique perform well on vision–language tasks, yet LoRA remains vulnerable to distribution shift. Unsupervised Domain Adaptation (UDA) is a common remedy for this issue, but existing theory and methods are designed primarily for single- or dual-encoder architectures, overlooking the encoder–decoder structure of SMMs, whose fusion mechanism introduces additional shift. This work bridges this gap in two steps. First, we derive a dual-divergence risk bound that separates encoder divergence from fusion divergence and illustrate its tightness compared to the classical encoder-only bound with a negation-flip example. Second, motivated by this theory, we propose Dual-level Adversarial Alignment (DuAA), a two-stage alignment algorithm. DuAA inserts domain-discriminative adapters after the encoder and within the decoder to minimize both divergences. Furthermore, DuAA employs selective pseudo-labeling to refine target semantics. We compile twelve new cross-domain VQA tasks with distinct visual and textual shifts from existing datasets and observe that DuAA consistently outperforms standard fine-tuning across all tasks.
Crash2DocAI: Automated Integration of Post-Crash Car Part Images into Technical Reports
Václav Diviš ⋅ Jessica Giovagnola ⋅ Khalil Ben Chikha ⋅ Marek Hrúz
Car-crash safety assessments require experts to analyze and document numerous vehicle components from various angles, resulting in a large number of post-crash images. Currently, this process relies on manual image classification and integration into structured reports, a time-consuming and error-prone workflow that limits scalability and consistency. In this paper, we present Crash2DocAI, a tool designed to automate the classification and integration of post-crash car part images into technical reports. Our system leverages ConvNeXt, a state-of-the-art image classification model, which achieves a top-1 accuracy of 94.4% on a newly compiled dataset of 5,772 publicly available post-crash images spanning 32 car part categories. To enable real-time deployment on CPU-only devices, we apply structured pruning and quantization, reducing the model size from 334.3 MB to 77.6 MB and inference time from 342 ms to 94 ms per image while preserving classification performance. To enhance the robustness of our tool, we introduce an Out-of-Model-Scope (OMS) monitor based on Mahalanobis distance, which filters images outside the target domain. This binary detector achieves a precision of 71% and a recall of 95%, with only a 1% overhead on inference time. We further demonstrate the practical utility of Crash2DocAI in real-world scenarios through a user study involving 26 automotive safety experts. The results reflect a 90% speed-up and significantly more consistent completion times. Finally, we release the National Highway Traffic Safety Administration-Post-Crash Car Parts (NHTSA-PCCP) dataset to the research community, along with the application and evaluation materials at: (link provided in camera-ready version)
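The abstract above mentions an out-of-scope monitor based on Mahalanobis distance without giving details. One common way to build such a detector is sketched below. It assumes a single Gaussian fitted to in-domain feature vectors, a hand-chosen threshold, and a small ridge term for numerical stability. The class name, the threshold, and the choice of features are hypothetical and may differ from the authors' design.

```python
import numpy as np


class MahalanobisMonitor:
    """Flags feature vectors that lie far from the in-domain distribution."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed to be tuned on validation data
        self.mean = None
        self.precision = None

    def fit(self, features):
        """Estimate mean and inverse covariance from in-domain features (n, d)."""
        features = np.asarray(features, dtype=np.float64)
        self.mean = features.mean(axis=0)
        cov = np.atleast_2d(np.cov(features, rowvar=False))
        cov = cov + 1e-6 * np.eye(features.shape[1])  # ridge keeps cov invertible
        self.precision = np.linalg.inv(cov)
        return self

    def distance(self, x):
        """Mahalanobis distance of one feature vector to the fitted Gaussian."""
        diff = np.asarray(x, dtype=np.float64) - self.mean
        return float(np.sqrt(diff @ self.precision @ diff))

    def is_out_of_scope(self, x):
        return self.distance(x) > self.threshold
```

In a pipeline like the one described, `fit` would typically be called once on features of known in-domain images, and `is_out_of_scope` on each incoming image before classification. Raising the threshold trades recall for precision.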
Conversational Image Generation: Towards Multi-Round Personalized Generation with Multi-Modal Language Models
Haochen Zhang ⋅ Animesh Sinha ⋅ Felix Juefei-Xu ⋅ Haoyu Ma ⋅ Kunpeng Li ⋅ Zhipeng Fan ⋅ Xiaoliang Dai ⋅ Tingbo Hou ⋅ Peizhao Zhang ⋅ Zecheng He
Recent advancements in diffusion models have significantly enhanced personalized image generation, enabling high-fidelity synthesis of human-subject-specific images. However, existing approaches are constrained by the inherent limitations of diffusion models, which lack conversational capabilities, and operate in a single-round setting, restricting user interaction. In this work, we propose a novel framework that integrates multi-modal large language models (MLLMs) for multi-round conversational personalization. To achieve this, we identify a performance bottleneck in the detokenizer of current MLLMs, which struggles to reconstruct fine-grained facial identity details. Thus, we enhance the detokenizer with a personalization-enhanced Diffusion Transformer (DiT). We also introduce a multi-stage instruction fine-tuning strategy to balance face preservation and prompt alignment effectively. To support multi-round generation, we implement a chat-history caching mechanism and construct the first multi-round personalization dataset from video clips. Experimental results demonstrate that our approach achieves state-of-the-art performance among MLLM-based personalization methods. To the best of our knowledge, this is the first work to enable conversational personalization, unlocking new capabilities for MLLMs in personalized image generation.
CSF-Net: Context-Semantic Fusion Network for Large Mask Inpainting
Chae-Yeon Heo ⋅ Yeong-Jun Cho
In this paper, we propose a semantic-guided framework to address the challenging problem of large-mask image inpainting, where essential visual content is missing and contextual cues are limited. To compensate for the limited context, we leverage a pretrained Amodal Completion (AC) model to generate structure-aware candidates that serve as semantic priors for the missing regions. We introduce Context-Semantic Fusion Network (CSF-Net), a transformer-based architecture that fuses these candidates with contextual features to produce a semantic guidance image for image inpainting. This guidance improves inpainting quality by promoting structural accuracy and semantic consistency. CSF-Net can be seamlessly integrated into existing inpainting models without architectural changes and consistently enhances performance across diverse masking conditions. Extensive experiments on the Places365 dataset demonstrate that CSF-Net effectively reduces object hallucination while enhancing visual realism and semantic alignment.
SceneShine: Illumination-aware Human Scene Gaussian Re-Splatting from Mobile Device Video
Xuqian Ren ⋅ Wenjia Wang ⋅ Mai Nguyen ⋅ Juho Kannala ⋅ Esa Rahtu
Realistic integration of humans into novel 3D scenes requires accurate relighting and shadow generation, yet current vanilla 3D Gaussian Splatting (3DGS) methods struggle with these challenges. We present SceneShine, an illumination-aware 3DGS framework designed for seamless human-scene integration, featuring physically-based avatar relighting and shadow generation for realistic scene composition. Relighting human surfaces in in-the-wild videos is challenging due to the inherent ambiguity in physics-based rendering, which struggles to simultaneously model scene lighting and BRDF properties accurately. To tackle this, we leverage a pseudo-global light map prior to guide BRDF parameter decomposition during training, effectively reducing relighting artifacts. We further incorporate point-based ray tracing to handle human-scene occlusions and dynamically recalculate scene colors, ensuring realistic shadow generation. We further propose a synthetic dataset for evaluation. Extensive experiments demonstrate that our method outperforms existing approaches in both reconstruction quality and identity preservation while achieving convincing illumination-aware composition.
See, Think, Learn: A Self-Taught Multimodal Reasoner
Sourabh Sharma ⋅ Sonam Gupta ⋅ Sadbhawna Thakur
Vision-Language Models (VLMs) have achieved remarkable progress in integrating visual perception with language understanding. However, effective multimodal reasoning requires both accurate perception and robust reasoning, and weakness in either limits the performance of VLMs. Prior efforts to enhance reasoning often depend on high-quality chain-of-thought (CoT) data, obtained via labor-intensive human annotations, costly proprietary models, or self-training methods that overlook perception. To address these limitations, we propose a simple yet effective self-training framework called "See-Think-Learn" (STL). At its core, STL introduces a structured reasoning template that encourages the model to see before thinking: first extracting visual attributes in textual form, then using them to guide reasoning. The framework jointly improves perception and reasoning by having the model generate and learn from its own structured rationales in a self-training loop. Furthermore, we augment the training data with negative rationales, i.e. explanations that justify why certain answer choices are incorrect, to enhance the model's ability to distinguish between correct and misleading responses. This fosters more discriminative and robust learning. Experiments across diverse domains show that STL consistently outperforms baselines trained directly only on answers or self-generated reasoning, while qualitative analysis confirms the high quality of its rationales. STL thus provides a cost-effective solution to enhance multimodal reasoning ability of VLMs. We will make the code publicly available upon acceptance.
SpecGen: Neural Spectral BRDF Generation via Spectral-Spatial Tri-plane Aggregation
Jin Zhenyu ⋅ Wenjie Li ⋅ Zhanyu Ma ⋅ Heng Guo
Synthesizing spectral images across different wavelengths is essential for photorealistic rendering. Unlike conventional spectral uplifting methods that convert RGB images into spectral ones, we introduce SpecGen, a novel method that generates spectral bidirectional reflectance distribution functions (BRDFs) from a single RGB image of a sphere. This enables spectral image rendering under arbitrary illuminations and shapes covered by the corresponding material. A key challenge in spectral BRDF generation is the scarcity of measured spectral BRDF data. To address this, we propose the Spectral-Spatial Tri-plane Aggregation (SSTA) network, which models reflectance responses across wavelengths and incident-outgoing directions, allowing the training strategy to leverage abundant RGB BRDF data to enhance spectral BRDF generation. Experiments show that our method accurately reconstructs spectral BRDFs from limited spectral data and surpasses state-of-the-art methods in hyperspectral image reconstruction, achieving an improvement of 8 dB in PSNR. Codes and data will be released upon acceptance.
Pretraining Helps When Capacity Allows: Evidence from Ultra-Small ConvNets
Srikanth Muralidharan ⋅ Heitor Medeiros ⋅ Masih Aminbeidokhti ⋅ Eric Granger ⋅ Marco Pedersoli
Robust visual recognition on embedded platforms requires models that both generalize out‑of‑distribution (OOD) and fit into tiny compute/memory budgets. While pre‑training is a standard route to robustness for mid/large backbones, its value in the ultra‑small regime remains unclear. We present a capacity‑aware study of pre‑training for two efficient ConvNet families (EfficientNet and MobileNetV3) scaled from “small” to “ultra‑small” via a simple, reproducible recipe. We compare three initializations (COCO detection pre‑training, ImageNet classification pre‑training, and training from scratch) on two axes of distribution shift: (i) cross‑dataset RGB$\rightarrow$RGB transfer between LLVIP and FLIR, and (ii) cross‑modality detection where models are fine‑tuned on RGB and evaluated on infrared (IR). A complementary classification study on DomainNet probes whether the trends extend beyond detection. Across settings, we find that pre‑training’s benefit is conditional on both backbone capacity and shift difficulty. Task‑aligned COCO detection pre‑training is the most reliable starting point at moderate sizes and for the easier transfer direction. In the low‑capacity regimes, differences are typically within run‑to‑run variation, and training from scratch can match or surpass pre‑training. Classification mirrors this capacity gating. Our results test the premise "pre‑training always helps" and instead quantify when task‑aligned pre‑training pays off for ultra‑small backbones and when it likely does not. The code will be available online after acceptance.
Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs
Sinan Mutlu ⋅ Georgios Fotios Angelis ⋅ Savas Ozkan ⋅ Paul Wisbey ⋅ Anastasios Drosou ⋅ Mete Ozay
Realistic and smooth full-body tracking is crucial for immersive AR/VR applications. Existing systems primarily track head and hands via Head Mounted Devices (HMDs) and controllers, making the 3D full-body reconstruction incomplete. One potential approach is to generate the full-body motions from sparse inputs collected from limited sensors using a Neural Network (NN) model. In this paper, we propose a novel method based on a multi-layer perceptron (MLP) backbone that is enhanced with residual connections and a novel NN-component called Memory-Block. In particular, Memory-Block represents missing sensor data with trainable code-vectors, which are combined with the sparse signals from previous time instances to improve the temporal consistency. Furthermore, we formulate our solution as a multi-task learning problem, allowing our MLP-backbone to learn robust representations that boost accuracy. Our experiments show that our method outperforms state-of-the-art baselines by substantially reducing prediction errors. Moreover, it achieves 72 FPS on mobile HMDs, which ultimately improves the accuracy-running time tradeoff.
UltraClean: A Simple Framework to Train Robust Neural Networks against Backdoor Attacks
Bingyin Zhao ⋅ Yingjie Lao
Backdoor attacks are emerging threats to deep neural networks, which typically embed malicious behaviors into a victim model by injecting poisoned samples. Adversaries can activate the injected backdoor during inference by presenting the trigger on input images. Prior defensive methods have achieved remarkable success in countering dirty-label backdoor attacks, where poisoned samples are often mislabeled. However, these approaches do not work for a more recent type of backdoor, the clean-label backdoor attack, which imperceptibly modifies poisoned data while keeping labels consistent. More complex and powerful algorithms are demanded to defend against such stealthy attacks. In this paper, we propose UltraClean, a general framework that simplifies the identification of poisoned samples and defends against both dirty-label and clean-label backdoor attacks. Given that backdoor triggers introduce adversarial noise that intensifies in feed-forward propagation, UltraClean first generates two variants of training samples using off-the-shelf denoising functions. It then measures the susceptibility of training samples by leveraging the error amplification effect in DNNs, which dilates the noise difference between the original image and denoised variants. Lastly, it filters out poisoned samples based on the susceptibility to thwart the backdoor implantation. Despite its simplicity, UltraClean achieves a superior detection rate across various datasets and significantly reduces the backdoor attack success rate while maintaining a decent model accuracy on clean data, outperforming existing defensive methods by a large margin.
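The three steps named in the abstract above (denoise, score susceptibility, filter) can be illustrated with a loose sketch. It assumes `model` is any callable returning an output vector, `denoisers` is a list of callables, susceptibility is the summed absolute change in model output between an image and its denoised variants, and a fixed fraction of the highest-scoring samples is dropped. All names, the scoring formula, and the drop rule are hypothetical stand-ins, not the authors' exact procedure.

```python
import numpy as np


def susceptibility(model, image, denoisers):
    """Total change in model output between an image and its denoised variants."""
    reference = np.asarray(model(image), dtype=np.float64)
    score = 0.0
    for denoise in denoisers:
        variant_output = np.asarray(model(denoise(image)), dtype=np.float64)
        score += float(np.abs(variant_output - reference).sum())
    return score


def split_by_susceptibility(model, images, denoisers, drop_fraction):
    """Return (kept_indices, dropped_indices); the most susceptible are dropped."""
    scores = np.array([susceptibility(model, img, denoisers) for img in images])
    n_drop = int(round(drop_fraction * len(images)))
    order = np.argsort(scores, kind="stable")  # ascending susceptibility
    n_keep = len(images) - n_drop
    kept = sorted(order[:n_keep].tolist())
    dropped = sorted(order[n_keep:].tolist())
    return kept, dropped
```

The intuition is that a clean image changes little under denoising, whereas removing a trigger pattern shifts the model output noticeably, so poisoned samples tend to collect at the high end of the score ranking.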
GorillaWatch: An Automated System for In-the-Wild Gorilla Re-Identification and Population Monitoring
Maximilian Schall ⋅ Felix Knöfel ⋅ Noah König ⋅ Jan Kubeler ⋅ Maximilian von Klinski ⋅ Joan Linnemann ⋅ Xiaoshi Liu ⋅ Iven Schlegelmilch ⋅ Ole Woyciniuk ⋅ Alexandra Schild ⋅ Dante Wasmuht ⋅ Magdalena Bermejo Espinet ⋅ Germán Illera Basas ⋅ Gerard de Melo
Monitoring critically endangered Western Lowland Gorillas is hampered by the immense manual effort required to re-identify individuals from vast amounts of camera trap footage. The primary obstacle to automating this process has been the lack of large-scale, "in-the-wild" video datasets suitable for training robust deep learning models. To address this critical gap, we introduce a comprehensive benchmark suite of three novel datasets: Gorilla Wild, the largest in-the-wild video dataset for primate re-identification to date, designed for challenging cross-encounter evaluation; Gorilla Zoo, for assessing cross-domain generalization; and Gorilla Tracking, a meticulously annotated dataset for evaluating multi-object tracking. Building on these datasets, we present GorillaWatch, a complete end-to-end pipeline that integrates state-of-the-art detection, tracking, and re-identification. Our technical contributions include a novel multi-frame self-supervised pretraining strategy that leverages temporal consistency in tracklets to learn powerful, domain-specific features without manual labels. We systematically adapt and evaluate various foundation models, video transformers, and ensemble techniques, achieving state-of-the-art performance on gorilla re-identification. Furthermore, we introduce a constrained clustering method that uses spatiotemporal metadata to accurately perform unsupervised population counting. An adaptation of AttnLRP for representation learning provides interpretability, ensuring our model focuses on meaningful biological traits.
Reciprocal Teaching: Dynamic Multi-Model Teacher-Student Learning for Multiple Noisy Annotations
Wenjie Ai ⋅ Cuong Nguyen ⋅ Adrian Hilton ⋅ Gustavo Carneiro
As datasets grow in size, expert-based annotation becomes increasingly impractical, making crowdsourcing a scalable and cost-effective alternative. In crowdsourcing, samples are typically annotated by multiple workers and aggregated via majority voting, a process that overlooks annotator-specific biases and introduces noisy labels that can impair downstream models. Traditional multi-rater methods attempt to model annotator biases (e.g., with transition matrices) but often overfit when faced with many classes or few, noisy annotators. By contrast, Learning with Noisy Labels (LNL) assumes a single noisy label per sample and has demonstrated that robust strategies (e.g., semi-supervised and multi-model learning) usually outperform bias-estimation methods, though these approaches remain underexplored in multi-annotator settings. To bridge this gap, we propose the Reciprocal Teacher-student Learning from Multi-rater Noisy Annotation (RETINA), a framework that integrates LNL techniques into multi-rater learning. RETINA trains annotator-specific models to capture individual labeling patterns and employs a dynamic teacher–student process, where the teacher identifies clean and noisy samples to guide the student. Experiments on synthetic and real-world benchmarks, including our proposed SynMRL benchmark, show that RETINA outperforms existing multi-rater methods, particularly in high-noise, low-annotator, many-class settings.
DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis
Numan Saeed ⋅ Tausifa Jan Saleem ⋅ Fadillah Maani ⋅ Muhammad Ridzuan ⋅ Hu Wang ⋅ Mohammad Yaqub
Deep learning for medical imaging is hampered by task-specific models that lack generalizability and prognostic capabilities, while existing 'universal' approaches suffer from simplistic conditioning and poor medical semantic understanding. To address these limitations, we introduce DuPLUS, a deep learning framework for efficient multimodal medical image analysis. DuPLUS introduces a novel vision-language framework that leverages hierarchical semantic prompts for fine-grained control over the analysis task, a capability absent in prior universal models. To enable extensibility to other medical tasks, it includes a hierarchical, text-controlled architecture driven by a unique dual-prompt mechanism. For segmentation, DuPLUS generalizes across three imaging modalities and ten anatomically diverse medical datasets, encompassing more than 30 organs and tumor types. It outperforms the state-of-the-art task-specific and universal models on 8 out of 10 datasets. We demonstrate the extensibility of its text-controlled architecture through seamless integration of electronic health record (EHR) data for prognosis prediction; on a head and neck cancer dataset, DuPLUS achieves a Concordance Index (CI) of 0.69. Parameter-efficient fine-tuning enables rapid adaptation to new tasks and modalities from varying centers, establishing DuPLUS as a versatile and clinically relevant solution for medical image analysis. The code for this work is made available at: https://anonymous.4open.science/r/DuPLUS-6C52
TriaGS: Differentiable Triangulation-Guided Geometric Consistency for 3D Gaussian Splatting
Quan Hong ⋅ Tuan Dang
3D Gaussian Splatting is crucial for real-time novel view synthesis due to its efficiency and ability to render photorealistic images. However, building a 3D Gaussian is guided solely by photometric loss, which can result in inconsistencies in reconstruction. This under-constrained process often results in "floater" artifacts and an unstructured geometry, preventing the extraction of high-fidelity surfaces. To address this issue, our paper introduces a novel method that improves reconstruction by enforcing global geometry consistency through constrained multi-view triangulation. Our approach aims to achieve a consensus on 3D representation in the physical world by utilizing various estimated views. We optimize this process by evaluating a 3D point against a robust consensus point, which is re-triangulated from a bundle of neighboring views in a self-supervised fashion. We demonstrate the effectiveness of our method across multiple datasets, achieving state-of-the-art results. On the DTU dataset, our method attains a mean Chamfer Distance of $0.50$ mm, outperforming comparable explicit methods. We will make our code open-source to facilitate community validation and ensure reproducibility.
SDT-6D: Fully Sparse Depth-Transformer for Staged End-to-End 6D Pose Estimation in Industrial Multi-View Bin Picking
Nico Leuze ⋅ Maximilian Hoh ⋅ Samed Doğan ⋅ Nicolas Rodriguez Pena ⋅ Alfred Schöttl
Accurately recovering 6D object poses in densely packed industrial bin-picking environments remains a significant challenge, owing to occlusions, specular reflections, and textureless parts. We introduce a holistic depth-only 6D pose estimation approach that fuses multi-view depth maps into either a fine-grained 3D point cloud, in its vanilla version, or a sparse Truncated Signed Distance Field (TSDF). At the core of our framework lies a staged heatmap mechanism that yields scene-adaptive attention priors across different resolutions, steering computation toward foreground regions, thus keeping memory requirements at high resolutions feasible. Alongside this, we propose a density-aware sparse transformer block that dynamically attends to (self-) occlusions and the non-uniform distribution of 3D representations. While sparse 3D processing has proven effective for long-range perception, its potential in close-range robotic applications remains underexplored. The proposed framework operates fully sparse, enabling high-resolution volumetric representations to capture fine geometric details crucial for accurate pose estimation in cluttered scenes. Our method processes the entire scene as a whole, predicting the 6D pose via a novel per-voxel voting strategy, allowing simultaneous pose predictions for an arbitrary number of target objects. We validate our method on the recently published IPD and MV-YCB multi-view datasets, demonstrating competitive performance in heavily cluttered industrial and household bin-picking scenarios.
Generalized Category Discovery for LiDAR Semantic Segmentation
Minseok Kim ⋅ Jiyong Boo ⋅ Kuk-Jin Yoon
Novel Category Discovery (NCD) methods for LiDAR Semantic Segmentation (LSS) assume that labeled and unlabeled points coexist in every scan and that all unlabeled points belong solely to novel categories. We formalize this into a more practical task, Generalized Category Discovery for LSS (GCDLSS), in which the labeled and unlabeled subsets are disjoint and the unlabeled data contain a mixture of known and novel categories. Existing 2D GCD methods fail under this setting, struggling to distinguish the two groups in sparse, imbalanced LiDAR data. To address this limitation, we present a unified framework that (i) employs a learnable adaptive threshold to obtain point-wise anomaly scores to capture candidates, (ii) refines these candidates through a clustering-based filtering mechanism, and (iii) stabilizes training with a novel-feature queue that supplies reliable novel features even when a scene lacks novel categories. This explicit modeling of novel categories preserves segmentation quality for known classes while markedly improving discovery performance—a direction not explored in prior GCD or NCD methods. Extensive experiments on SemanticKITTI and nuScenes demonstrate that our approach consistently surpasses adapted baselines, establishing a strong benchmark for future work in open-world LSS.
ICONIC-444: A 3.1-Million-Image Dataset for OOD Detection Research
Gerhard Krumpl ⋅ Henning Avenhaus ⋅ Horst Possegger
Current progress in out-of-distribution (OOD) detection is limited by the lack of large, high-quality datasets with clearly defined OOD categories across varying difficulty levels (near- to far-OOD) that support both fine- and coarse-grained computer vision tasks. To address this limitation, we introduce ICONIC-444 (Image Classification and OOD Detection with Numerous Intricate Complexities), a large-scale image dataset containing over 3.1 million RGB images spanning 444 classes tailored for OOD detection research. Captured with a prototype industrial sorting machine, ICONIC-444 closely mimics real-world tasks. It complements existing datasets by offering structured, diverse data suited for rigorous OOD evaluation across a spectrum of task complexities. We define four reference tasks within ICONIC-444 to benchmark and advance OOD detection research and provide baseline results for 22 state-of-the-art post-hoc OOD detection methods.
Any Detector Can Detect Anything
Thomas Huang ⋅ Siyuan Li ⋅ Martin Danelljan ⋅ Henghui Ding ⋅ Luc Van Gool ⋅ Fisher Yu
Visual prompt-based detection enables generalization to arbitrary novel instances in the target image by using one or a few visual templates. Previous methods rely on complex relation or explicit feature matching modules, and their designs are deeply coupled with specific detectors, greatly limiting their applicability. Instead, we propose our 'Any Detector can Detect Anything' framework that can enable any detector to detect any object given a single or a few visual templates. Specifically, we design an adapter called Template-Aware Adapter that can be added on top of any existing detector architecture to inject visual template information directly into the detection features. After integration, localization is done on the feature maps as in standard object detectors, effectively transforming any detector into a visual prompt-based detector. Furthermore, we revisit current visual prompt detection benchmarks and correct their unrealistic test assumptions and class splits, which limit the usability of the developed algorithms in the real world. We introduce a set of realistic benchmarks to remedy these issues. We comprehensively evaluate the proposed model on both existing and our new benchmarks, outperforming current state-of-the-art one-shot and few-shot detection methods by a large margin. Our code and benchmark will be released.
Towards Unconstrained Cross-View Pose Estimation
Alexander Wollam ⋅ Kyle Ashley ⋅ Maxim Shugaev ⋅ Oliver Arend ⋅ Ilya Semenov ⋅ Hadis Dashtestani ⋅ Sumved Ravi ⋅ Nathan Jacobs
Cross-view pose estimation entails predicting the relative 3 Degrees-of-Freedom (3DoF) pose of an image within an aerial view. Existing work focuses on imagery in controlled settings featuring highly constrained parameters. In contrast, a wide variety of camera parameterizations are seen in-the-wild across tasks where such estimation is useful. To address this gap, we propose a method capable of performing cross-view pose estimation in these less constrained scenarios with ground-view images of unknown FoV, pitch, roll, and projection type (panoramic or rectilinear). Namely, our method avoids common assumptions (such as the gravity/horizon alignment needed for geometric-based projections) and relies purely on a transformer to learn the cross-view relationships in a data-driven manner, paired with prediction modules to enable continuous querying of the pose search space. Evaluations of our approach demonstrate its ability to perform competitively with the state-of-the-art on the VIGOR benchmark, while maintaining performance in those harder, less constrained scenarios. This supports our work as the first generalized approach to this task that is capable of operating with less-constrained imagery. The code will be made available at a later date.
Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization
Meryem Taşyürek ⋅ Tuğçe Kızıltepe ⋅ Hacer Keles
In this work, we propose DARSLP, a simple gloss-free, transformer-based sign language production (SLP) framework that directly maps spoken-language text to sign pose sequences. We first train a pose autoencoder that encodes sign poses into a compact latent space using an articulator-based disentanglement strategy, where features corresponding to the face, right hand, left hand, and body are modeled separately to promote structured and interpretable representation learning. Next, a non-autoregressive transformer decoder is trained to predict these latent representations from word-level text embeddings of the input sentence. To guide this process, we apply channel-aware regularization by aligning predicted latent distributions with priors extracted from the ground-truth encodings using a KL divergence loss. The contribution of each channel to the loss is weighted according to its associated articulator region, enabling the model to account for the relative importance of different articulators during training. Our approach does not rely on gloss supervision or pretrained models, and achieves state-of-the-art results on the PHOENIX14T and CSL-Daily datasets. Source code will be released upon acceptance.
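The channel-aware regularization described above can be sketched as a weighted sum of per-channel KL terms. This is a minimal illustration under stated assumptions: each latent channel is modeled as a scalar Gaussian, the KL direction (prediction relative to prior) is a guess, and the articulator names and weights are hypothetical placeholders rather than values from the paper.

```python
import math


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ) for scalar Gaussians."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)


def channel_weighted_kl(pred, prior, channel_articulator, articulator_weight):
    """Sum per-channel KL terms, each scaled by the weight of its articulator.

    pred, prior: lists of (mean, std) pairs, one per latent channel.
    channel_articulator: articulator name for each channel.
    articulator_weight: hypothetical weights per articulator.
    """
    total = 0.0
    for (mu_p, s_p), (mu_q, s_q), name in zip(pred, prior, channel_articulator):
        total += articulator_weight[name] * gaussian_kl(mu_p, s_p, mu_q, s_q)
    return total


# Two channels: one assigned to the right hand, one to the body (illustrative values).
loss = channel_weighted_kl(
    pred=[(1.0, 1.0), (0.0, 1.0)],
    prior=[(0.0, 1.0), (0.0, 1.0)],
    channel_articulator=["right_hand", "body"],
    articulator_weight={"right_hand": 2.0, "body": 1.0},
)
```

With these weights, a mismatch in a hand channel costs twice as much as the same mismatch in a body channel.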
SmoothDiffusion-VE: Real-time Generative Video Editing Using Adaptive Feature Cache
Mustafa Munir ⋅ Sophia Zalewski ⋅ Shiqiu Liu ⋅ David Tarjan ⋅ Sushmitha Belede ⋅ Anjul Patney ⋅ Radu Marculescu
Video editing with diffusion models presents significant challenges, especially under real-time constraints. Current methods either enhance temporal consistency at the cost of slow processing or rely on frame-by-frame editing, leading to flickering and temporal artifacts. To address both challenges, we propose SmoothDiffusion-VE, a streaming-based editing approach that improves temporal consistency and processing speed through our proposed Adaptive Feature Cache (AFC) and motion-guided attention. The AFC dynamically adjusts the caching behavior based on perceptual similarity (LPIPS) between frames, i.e., shifting to a mini-cache mode for similar frames to reduce computational load. Conversely, significant frame changes trigger deeper caching to maintain robust temporal coherence. Our motion-guided attention selectively focuses on dynamic regions using optical flow, reducing unnecessary computations in static areas and accelerating processing. SmoothDiffusion-VE runs at 60 FPS on one A100 GPU and 28 FPS on one RTX 4090 GPU, achieving a 1564$\times$ speedup over Plug-and-Play Diffusion (PNP) and a 1916$\times$ speedup over Diffusion Motion Transfer (DMT), delivering a powerful solution for fast and consistent video editing.
SafeguardGS: 3D Gaussian Primitive Pruning While Avoiding Catastrophic Scene Destruction
Yongjae Lee ⋅ Zhaoliang Zhang ⋅ Deliang Fan
3D Gaussian Splatting (3DGS) has advanced novel view synthesis, but its densification process leads to an excessive number of Gaussian primitives, which negatively impacts rendering speed and memory usage. Although many 3DGS pruning techniques have been proposed to address this issue, a comprehensive analysis is still lacking. In this paper, we are the first to categorize 3DGS pruning techniques into two approaches: scene-level importance-threshold-based pruning and pixel-level importance-rank-based pruning, defined by their scope of importance calculation (scene-level vs. pixel-level) and their pruning criteria (threshold vs. rank). Our studies reveal that while the former leads to disastrous quality drops under extreme Gaussian primitive decimation, the latter not only sustains high rendering quality but also provides a natural pruning boundary, i.e., a safeguard for Gaussian pruning. We further propose multiple pruning score functions. From our extensive studies on various pruning score functions, we discover that color similarity with blending weight is the most effective factor for identifying insignificant primitives. In our experiments, our proposed method, SafeguardGS, with the optimal score function achieves the highest PSNR-per-primitive performance under an extreme pruning setting, retaining only about 10% of the primitives from the original 3DGS scene, i.e., a $10\times$ compression ratio.
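The contrast between the two pruning families can be sketched on toy data. The data layout (a mapping from pixel to per-primitive scores), the use of summed scores as scene-level importance, and the parameters `top_k` and `threshold` are all assumptions for illustration; the paper's actual score functions are not reproduced here.

```python
def prune_by_pixel_rank(pixel_contributions, top_k):
    """Pixel-level, rank-based: keep every primitive that ranks in the
    top_k by score for at least one pixel."""
    keep = set()
    for contributions in pixel_contributions.values():
        ranked = sorted(contributions, key=lambda c: c[1], reverse=True)
        keep.update(pid for pid, _ in ranked[:top_k])
    return keep


def prune_by_scene_threshold(pixel_contributions, threshold):
    """Scene-level, threshold-based: sum each primitive's scores over all
    pixels (assumed importance) and keep those at or above the threshold."""
    totals = {}
    for contributions in pixel_contributions.values():
        for pid, score in contributions:
            totals[pid] = totals.get(pid, 0.0) + score
    return {pid for pid, total in totals.items() if total >= threshold}


# Toy scene: pixel (0, 1) is covered only by weakly contributing primitives.
scene = {
    (0, 0): [("a", 0.9), ("b", 0.1)],
    (0, 1): [("c", 0.05), ("b", 0.02)],
}
kept_by_rank = prune_by_pixel_rank(scene, top_k=1)
kept_by_threshold = prune_by_scene_threshold(scene, threshold=0.5)
```

In this toy scene an aggressive scene-level threshold removes every primitive covering pixel (0, 1), while rank-based pruning always retains the best primitive per pixel, which is one way to read the "natural pruning boundary" mentioned above.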
Uncertainty-Aware Vision-Language Segmentation for Medical Imaging
Aryan Das ⋅ Tanishq Rachamalla ⋅ Koushik Biswas ⋅ Swalpa Roy ⋅ Vinay Verma
Medical image segmentation is crucial for computer-aided diagnosis, surgical planning, and clinical research, requiring precise delineation of anatomical structures and pathological regions across various imaging modalities. Traditional deep learning approaches, primarily based on convolutional neural networks (CNNs) and transformers, have shown high performance but are limited by their reliance on visual features alone, which restricts their generalization and integration of clinical knowledge. The increasing availability of multimodal medical data, including paired image-text records from electronic health systems, offers a promising solution to these limitations. Vision-language segmentation (VLS) leverages natural language inputs, such as radiology reports or anatomical queries, to guide the segmentation process. This multimodal approach bridges the gap between low-level visual cues and high-level clinical concepts, reduces the need for task-specific supervision, and facilitates more intuitive human-AI interaction in medical workflows. Despite recent advancements, VLS in the medical domain faces significant challenges, including the subtlety of pathological features, inter-reader variability, and the need for fine-grained spatial accuracy. Privacy constraints and the scarcity of well-aligned, high-quality image-text pairs further complicate model training and evaluation, limiting the applicability of general-purpose VLS models in clinical settings. This paper introduces an innovative uncertainty-guided multimodal vision-language segmentation model designed to address these challenges. Our model integrates visual and textual data through advanced cross-modal learning techniques, enhancing segmentation accuracy and robustness. By incorporating uncertainty guidance, the model improves spatial precision and better captures domain-specific visual-linguistic relationships, making it more suitable for clinical applications.
Stabilizing Direct Training of Spiking Neural Networks: Membrane Potential Initialization and Threshold-robust Surrogate Gradient
Hyunho Kook ⋅ Byeongho Yu ⋅ Jeong Oh ⋅ Eunhyeok Park
Recent advancements in the direct training of Spiking Neural Networks (SNNs) have demonstrated high-quality outputs even at early timesteps, paving the way for novel energy-efficient AI paradigms. However, the inherent non-linearity and temporal dependencies in SNNs introduce persistent challenges, such as temporal covariate shift (TCS) and unstable gradient flow with learnable neuron thresholds. In this paper, we present two key innovations: MP-Init (Membrane Potential Initialization) and TrSG (Threshold-robust Surrogate Gradient). MP-Init addresses TCS by aligning the initial membrane potential with its stationary distribution, while TrSG stabilizes gradient flow with respect to threshold voltage during training. Extensive experiments validate our approach, achieving state-of-the-art accuracy on both static and dynamic image datasets.
DocWaveDiff: A Predict-and-Refine Approach for Document Image Enhancement with Wavelet U-Nets and Diffusion models
Matteo Marulli ⋅ Marco Bertini
OCR and document layout analysis algorithms are essential components of AI-based document systems, yet they are typically trained on clean, degradation-free images. When applied to degraded documents, such as blurred scans or pages spoiled by handwritten text, their performance drops significantly. To address this issue, we propose DocWaveDiff, a novel document restoration method based on a predict-and-refine diffusion framework incorporating wavelet U-Nets. Given a degraded image patch and optionally its prior features, our Early Predictor generates an initial restoration, which is then refined by a Denoiser Refiner that estimates the residual image. The combination of these outputs yields the final restored result. We evaluate DocWaveDiff on multiple public benchmarks and demonstrate its strong performance across various document degradation scenarios, including deblurring and handwriting removal. Our results confirm that integrating wavelet transforms into the predict-and-refine framework enhances restoration quality and supports more robust document understanding systems.
Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting
Rishikesh Bhyri ⋅ Brian Quaranto ⋅ Junsong Yuan ⋅ Peter Kim ⋅ Nan Xi
Accurate counting of surgical instruments in Operating Rooms (OR) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress in large visual-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To address this problem, we introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process by enforcing a structured visual chain, rather than relying on classic object detection, which is unordered. This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes. To further enforce the physical plausibility of the visual chain, we introduce the neighboring loss function, which explicitly models the spatial constraints inherent to densely packed surgical instruments. We also present SurgCount-HD, a new dataset comprising around 1,300 high-density surgical instrument images. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches for counting (e.g., CountGD, CrowdDiff) as well as Multimodal Large Language Models (e.g., Qwen, ChatGPT) in the challenging task of dense surgical instrument counting. The code and models will be made publicly available upon completion of the review process.
Benefiting from the powerful data learning and representation capabilities of neural networks, Learned Image Compression (LIC) methods have demonstrated better Rate-Distortion (RD) performance than traditional image compression frameworks. In this paper, we analyze the role of latent variables in image compression, both qualitatively and quantitatively. We then propose a latent variable compensation method to mitigate the loss introduced by quantization. We also introduce a regularization term for the latent variable mean square error into the loss function, providing more explicit guidance for the compression and reconstruction of the model's internal latent variables. Additionally, we propose a noise compensation method that acts as a plug-and-play component to enhance reconstruction in lossy image compression without significant additional encoding or decoding time. We also present a data augmentation technique involving image inversion, which helps the training set conform to the symmetry inherent in probabilistic modeling for image compression tasks. Extensive experiments demonstrate that the proposed method enhances rate-distortion metrics and visual quality.
AugMapNet: Improving Spatial Latent Structure via BEV Grid Augmentation for Enhanced Vectorized Online HD Map Construction
Thomas Monninger ⋅ Md Zafar Anwar ⋅ Stanislaw Antol ⋅ Steffen Staab ⋅ Sihao Ding
Autonomous driving requires an understanding of infrastructure elements, such as lanes and crosswalks. To navigate safely, this understanding must be derived from sensor data in real time and needs to be represented in vectorized form. Learned Bird's-Eye View (BEV) encoders are commonly used to combine a set of camera images from multiple views into one joint latent BEV grid. Traditionally, from this latent space, an intermediate raster map is predicted, providing dense spatial supervision but requiring post-processing into the desired vectorized form. More recent models directly derive infrastructure elements as polylines using vectorized map decoders, providing instance-level information. Our approach, $\mathrm{\textbf{Aug}}$mentation $\mathrm{\textbf{Map}}$ $\mathrm{\textbf{Net}}$work (AugMapNet), proposes latent BEV grid augmentation, a novel technique that significantly enhances the latent BEV representation. AugMapNet combines vector decoding and dense spatial supervision more effectively than existing architectures while remaining as straightforward to integrate and as generic as auxiliary supervision. Experiments on the nuScenes and Argoverse2 datasets demonstrate significant improvements in vectorized map prediction performance of up to 13.3% over the StreamMapNet baseline on the 60 m range and greater improvements on larger ranges. We confirm transferability by applying our method to another baseline and find similar improvements. A detailed analysis of the latent BEV grid confirms a more structured latent space of AugMapNet and shows the value of our novel concept beyond pure performance improvement. The code will be released upon paper acceptance.
From Cognitive Priors to Instance Semantics: A Unified Framework for Multi-task Affective Computing
Guanyu Hu ⋅ Dimitrios Kollias ⋅ Xinyu Yang
Understanding human affect via Valence-Arousal, Expressions, and Action Units is essential for human-machine interaction. While recent multi-task learning (MTL) methods aim to unify these tasks, they often overlook three key challenges: (i) the assumption of complete annotations for all tasks, which leads to underutilization of single-task datasets with disjoint labels; (ii) task conflicts arising from Noisy Gradients, Negative Transfer (NT), and Task-specific Performance Misalignment (TPM); and (iii) the absence of unified modeling across all three affective task types: regression, detection, and classification. We introduce COIN, a novel two-stage MTL framework that bridges Cognitive Priors and Instance Semantics for robust MTL training. First, we propose a cognitively guided cross-task label induction strategy that enables supervision propagation under sparse annotations and mitigates NT, resulting in strong task-specific experts. Then, we propose two complementary branches to tackle TPM: (i) the first transfers knowledge from task-optimal experts and jointly optimizes task-specific objectives under partial supervision; (ii) the second enhances visual-language consistency using Class-Conditioned Prompts and Instance-Adaptive Prompts. Experiments demonstrate the framework's ability to achieve robust cross-task and generalization performance across six diverse datasets. Code and models will be released upon the paper's acceptance.
FuLLaMa: Training-free Diffusion-based Object Removal with Context Preservation
Ilke Demir ⋅ Umur Ciftci
Diffusion models have demonstrated remarkable capabilities in image inpainting tasks, yet they often struggle to maintain semantic consistency and fine-grained details when filling large masked regions. Existing approaches typically require extensive fine-tuning or are trained from scratch, still losing the context, patterns, or realism. We introduce FuLLaMa, a novel training-free framework for diffusion-based removal that preserves semantic embeddings as well as the image quality throughout the infill process. FuLLaMa enhances traditional removal algorithms with the information manifold of LVLMs and generation capability of DMs. Through Adaptive Parameter Manifold Navigation (APMN), DM is guided to generate content that harmonizes with the existing context and structure of the image, without introducing new elements. FuLLaMa achieves a high and balanced performance compared to 6 SOTA object removal algorithms, on 2 datasets, for various tasks such as large area, small object, multi-instance, and patterned removal; using 9 visual and 5 contextual evaluation metrics. We conduct several ablation studies for the system and objective design, also showcasing mask-based, language-based, and point-and-click removal applications. Our work establishes context and quality co-preservation as a fundamental principle for diffusion-based removal.
STRinGS: Selective Text Refinement in Gaussian Splatting
Abhinav Raundhal ⋅ Gaurav Behera ⋅ P Narayanan ⋅ Ravi Kiran Sarvadevabhatla ⋅ Makarand Tapaswi
Text as signs, labels, or instructions is a critical element of real-world scenes as they can convey important contextual information. 3D representations such as 3D Gaussian Splatting (3DGS) struggle to preserve fine-grained text details, while achieving high visual fidelity. Small errors in textual element reconstruction can lead to significant semantic loss. We propose STRinGS, a text-aware, selective refinement framework to address this issue for 3DGS reconstruction. Our method treats text and non-text regions separately, refining text regions first and merging them with non-text regions later for full-scene optimization. STRinGS produces sharp, readable text even in challenging configurations. We introduce a text readability measure OCR Character Error Rate (CER) to evaluate the efficacy on text regions. STRinGS results in a 63.6% relative improvement over 3DGS at just 7K iterations. We also introduce a curated dataset STRinGS-360 with diverse text scenarios to evaluate text readability in 3D reconstruction. Our method and dataset together push the boundaries of 3D scene understanding in text-rich environments, paving the way for more robust text-aware reconstruction methods.
VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning
Madhavaram Vivek Vardhan ⋅ Vartika Sengar ⋅ Arkadipta De ⋅ Charu Sharma
Scene understanding and reasoning has been a fundamental problem in 3D computer vision, requiring models to identify objects, their properties, and spatial or comparative relationships among the objects. Existing approaches enable this by creating scene graphs using multiple inputs such as 2D images, depth maps, object labels, and annotated relationships from a specific reference view. However, these methods often struggle with generalization and produce inaccurate spatial relationships like "left/right", which become inconsistent across different viewpoints. To address these limitations, we propose Viewpoint-Invariant ZerO-shot scene graph generation for 3D scene Reasoning (VIZOR). VIZOR is a training-free, end-to-end framework that constructs dense, viewpoint-invariant 3D scene graphs directly from raw 3D scenes. The generated scene graph is unambiguous, as spatial relationships are defined relative to each object’s front-facing direction, making them consistent regardless of the reference view. Furthermore, it infers open-vocabulary relationships that describe spatial and proximity relationships among scene objects without requiring annotated training data. We conduct extensive quantitative and qualitative evaluations to assess the effectiveness of VIZOR on scene graph generation and downstream tasks, such as query-based object grounding. VIZOR outperforms state-of-the-art methods, showing clear improvements in scene graph generation and achieving 22% and 4.81% gains in zero-shot grounding accuracy on the Replica and Nr3D datasets, respectively.
Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering
Jongha Kim ⋅ Byungoh Ko ⋅ Jeehye Na ⋅ Jinsung Yoon ⋅ Hyunwoo Kim
Despite the remarkable capabilities of Large Vision Language Models (LVLMs), they still lack detailed knowledge about specific entities. Retrieval-augmented Generation (RAG) is a widely adopted solution that enhances LVLMs by providing additional contexts from an external Knowledge Base. However, we observe that previous decoding methods for RAG are sub-optimal as they fail to sufficiently leverage multiple relevant contexts and suppress the negative effects of irrelevant contexts. To this end, we propose Relevance-aware Multi-context Contrastive Decoding (RMCD), a novel decoding method for RAG. RMCD outputs a final prediction by combining outputs predicted with each context, where each output is weighted based on its relevance to the question. By doing so, RMCD effectively aggregates useful information from multiple relevant contexts while also counteracting the negative effects of irrelevant ones. Experiments show that RMCD consistently outperforms other decoding methods across multiple LVLMs, achieving the best performance on three knowledge-intensive visual question-answering benchmarks. Also, RMCD can be simply applied by replacing the decoding method of LVLMs without additional training. Analyses also show that RMCD is robust to the retrieval results, consistently performing the best across the weakest to the strongest retrieval results.
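The idea of weighting each context's output by its relevance can be sketched as follows. This shows only a relevance-weighted aggregation of per-context logits; the softmax weighting, the function names, and the toy values are assumptions, and the contrastive component that counteracts irrelevant contexts in RMCD is not reproduced.

```python
import math


def softmax(values):
    # Numerically stable softmax over a list of floats.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def relevance_weighted_logits(per_context_logits, relevance):
    """Combine next-token logits predicted with each retrieved context,
    weighting each context by a softmax over its relevance score
    (assumed weighting scheme)."""
    weights = softmax(relevance)
    vocab_size = len(per_context_logits[0])
    return [
        sum(w * logits[token] for w, logits in zip(weights, per_context_logits))
        for token in range(vocab_size)
    ]


# Two contexts that disagree about a two-token vocabulary (illustrative values).
logits = [[2.0, 0.0], [0.0, 2.0]]
trusting_first = relevance_weighted_logits(logits, relevance=[10.0, -10.0])
undecided = relevance_weighted_logits(logits, relevance=[0.0, 0.0])
```

When one context is far more relevant, the combined logits follow it almost exactly; with equal relevance the two predictions are averaged.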
CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation
Shizhe Sun ⋅ Wataru Ohyama
We propose Cross-Attention-based Non-local Knowledge Distillation (CanKD), a novel feature-based knowledge distillation framework that leverages cross-attention mechanisms to enhance the knowledge transfer process. Unlike traditional self-attention-based distillation methods that align teacher and student feature maps independently, CanKD enables each pixel in the student feature map to dynamically consider all pixels in the teacher feature map. This non-local knowledge transfer more thoroughly captures pixel-wise relationships, improving feature representation learning. Our method introduces only an additional loss function to achieve superior performance compared with existing attention-guided distillation methods. Extensive experiments on object detection and image segmentation tasks demonstrate that CanKD outperforms state-of-the-art feature and hybrid distillation methods. These experimental results highlight CanKD's potential as a new paradigm for attention-guided distillation in computer vision tasks.
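As a rough illustration of letting every student pixel attend to all teacher pixels, the sketch below computes plain scaled dot-product cross-attention between two feature maps. It omits learned projections, and the residual connection and mean-squared-error loss are assumptions chosen for simplicity, not the loss defined by CanKD.

```python
import numpy as np

def cross_attention(student, teacher):
    """Student pixels act as queries; teacher pixels act as keys and values.

    student, teacher: (C, H, W) feature maps of identical shape.
    Returns the attended features (N, C) and the attention map (N, N),
    where N = H * W.
    """
    c = student.shape[0]
    q = student.reshape(c, -1).T            # (N, C)
    k = teacher.reshape(c, -1).T            # (N, C)
    scores = q @ k.T / np.sqrt(c)           # (N, N)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ k, attn

def distill_loss(student, teacher):
    """Assumed loss: MSE between attention-enhanced student and teacher features."""
    c = student.shape[0]
    attended, _ = cross_attention(student, teacher)
    enhanced = student.reshape(c, -1).T + attended  # residual connection
    return float(np.mean((enhanced - teacher.reshape(c, -1).T) ** 2))

rng = np.random.default_rng(0)
s = rng.normal(size=(8, 4, 4))
t = rng.normal(size=(8, 4, 4))
print(distill_loss(s, t))
```

Each row of the attention map is a distribution over all teacher pixels, which is what makes the transfer non-local rather than position-by-position.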
FlyPose: Towards Robust Human Pose Estimation From Aerial Views
Hassaan Farooq ⋅ Marvin Brenner ⋅ Peter Stütz
Unmanned Aerial Vehicles (UAVs) are increasingly deployed in close proximity to humans for applications such as parcel delivery, traffic monitoring, disaster response and infrastructure inspections. Ensuring safe and reliable operation in these human-populated environments demands accurate perception of human poses and actions from an aerial viewpoint. However, person detection and human pose estimation from onboard UAVs present unique challenges due to factors like low resolution, steep viewing angles, occlusion, and limited computational resources. In this work, we develop FlyPose, a lightweight top-down human pose estimation model optimized for aerial imagery and able to run on edge devices. We compare the effectiveness of current approaches and improve the person detection and pose estimation results on aerial datasets. Through multi-dataset training, we achieve an average improvement of 12.3 AP in person detection on the test sets of Manipal-UAV, VisDrone, and HIT-UAV, as well as a custom aerial pose estimation dataset. For 2D human pose estimation we report an improvement of 16.3 mAP on the challenging UAV-Human dataset. FlyPose runs with an inference latency of 19.5 milliseconds on a Jetson Orin Developer Kit. Our model aims to provide a foundation for human-aware UAV applications with real-time demands and provides more accurate human poses for downstream tasks.
MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
Tooba Tehreem Sheikh ⋅ Jean Lahoud ⋅ Rao Anwer ⋅ Fahad Khan ⋅ Salman Khan ⋅ Hisham Cholakkal
Traditional object detection models in medical imaging operate within a closed-set paradigm, limiting their ability to detect objects of novel labels. Open-vocabulary object detection (OVOD) addresses this limitation but remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment. To bridge this gap, we introduce MedROV, the first Real-time Open Vocabulary detection model for medical imaging. To enable open-vocabulary learning, we curate a large-scale dataset, Omnis, with 600K detection samples across nine imaging modalities and introduce a pseudo-labeling strategy to handle missing annotations from multi-source datasets. Additionally, we enhance generalization by incorporating knowledge from a large pre-trained foundation model. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures. Experimental results demonstrate that MedROV outperforms the previous state-of-the-art foundation model for medical image detection with an average absolute improvement of 40 mAP50, and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS, setting a new benchmark in medical detection. Our source code, dataset, and trained model will be made publicly available.
Exploiting Label-Independent Regularization from Spatial Patterns for Whole Slide Image Analysis
Weiyi Wu ⋅ Xinwen Xu ⋅ Chongyang Gao ⋅ Xingjian Diao ⋅ Siting Li ⋅ Jiang Gui
Whole slide images, with their gigapixel-scale panoramas of tissue samples, are pivotal for precise disease diagnosis. However, their analysis is hindered by immense data size and scarce annotations. Existing MIL methods face challenges due to the fundamental imbalance where a single bag-level label must guide the learning of numerous patch-level features. This sparse supervision makes it difficult to reliably identify discriminative patches during training, leading to unstable optimization and suboptimal solutions. We propose a spatially regularized MIL framework that leverages inherent spatial relationships among patch features as label-independent regularization signals. Our approach learns a shared representation space by jointly optimizing feature-induced spatial reconstruction and label-guided classification objectives, enforcing consistency between intrinsic structural patterns and supervisory signals. Experimental results on multiple public datasets demonstrate significant improvements over state-of-the-art methods, offering a promising direction for pathology image analysis.
SVS-GAN for Semantic Synthesis of Traffic Videos for Autonomous Driving
Khaled Seyam ⋅ Julian Wiederer ⋅ Markus Braun ⋅ Bin Yang
Autonomous driving demands robust perception modules trained on diverse scenarios, yet collecting and annotating real-world datasets is both expensive and often lacks sufficient coverage of all possible driving conditions. Semantic Image Synthesis (SIS), the process of generating realistic images from semantic label maps, has proven effective for producing large-scale labeled data. However, extending SIS to the video domain as Semantic Video Synthesis (SVS), where entire sequences are generated from semantic maps, remains underexplored. We introduce SVS-GAN, a framework specifically tailored for SVS that generates high-quality, temporally coherent videos at a resolution of 1024$\times$512 in real time (45 FPS). Our approach leverages a deformable motion triple-pyramid generator and a segmentation-aware discriminator to ensure strong semantic alignment and visual fidelity. Through this combination of tailored architecture and loss design, we bridge the gap between SIS and SVS, outperforming state-of-the-art GAN- and diffusion-based baselines on both Cityscapes and KITTI-360. When combined with a semantic-map generator, SVS-GAN enables controllable generation of diverse driving scenarios, providing a scalable source of labeled video for data augmentation and closed-loop testing.
Integrating Multi-scale and Multi-filtration Topological Features for Medical Image Classification
Pengfei Gu ⋅ Huimin Li ⋅ Haoteng Tang ⋅ Dongkuan Xu ⋅ Erik Enriquez ⋅ Dongchul Kim ⋅ Bin Fu ⋅ Danny Chen
Modern deep neural networks have shown remarkable performance in medical image classification. However, such networks either emphasize pixel-intensity features instead of fundamental anatomical structures (e.g., those encoded by topological invariants), or they capture only simple topological features via single-parameter persistence. In this paper, we propose a new topology-guided classification framework that extracts multi-scale and multi-filtration persistent topological features and integrates them into vision classification backbones. For an input image, we first compute cubical persistence diagrams (PDs) across multiple image resolutions/scales. We then develop a "vineyard" algorithm that consolidates these PDs into a single, stable diagram capturing signatures at varying granularities, from global anatomy to subtle local irregularities that may indicate early-stage disease. To further exploit richer topological representations produced by multiple filtrations, we design a cross-attention-based neural network that directly processes the consolidated final PDs. The resulting topological embeddings are fused with feature maps from CNNs or Transformers. By integrating multi-scale and multi-filtration topologies into an end-to-end architecture, our approach enhances the model's capacity to recognize complex anatomical structures. Evaluations on three public datasets show consistent, considerable improvements over strong baselines and state-of-the-art methods, demonstrating the value of our comprehensive topological perspective for robust and interpretable medical image classification.
NeuroBridge: Few-Shot Cross-Modal Neuron Re-identification via Dual-Channel Deep Metric Learning
Wenwei Li ⋅ Mingwei Liao ⋅ Lingyi Cai ⋅ Anan LI
Associating the in-vivo function of neurons with their ex-vivo anatomical structure is a central challenge in neuroscience. However, this field is constrained by a critical bottleneck: the extreme difficulty of acquiring paired cross-modal data, leading to a persistent scarcity of large-scale datasets. This inherent limitation frames the re-identification of the same neuron as a formidable few-shot, fine-grained visual recognition task. To address this challenge, we propose a novel deep metric learning framework designed to learn modality-invariant feature representations for single neurons under these data-scarce conditions. The core of this framework is a dual-channel network architecture that explicitly disentangles and fuses the local morphological information of the neuron's soma with the global topological context of the dendritic arbor, thereby capturing a more robust neural signature. To maximize data efficiency, we integrate a Circle Loss objective with a Multi-Similarity hard-sample mining strategy, which effectively optimizes the embedding space for better class separation. On a cross-modal neuron dataset that realistically reflects experimental data scarcity, our method demonstrates excellent performance, achieving a Recall of 77.4% and a Specificity of 90.1% on the test set. Extensive ablation studies and comparative analyses validate the effectiveness of our proposed method, establishing a new strong baseline for this critical yet data-limited biomedical application. To foster future research in this field, we will release our code, dataset, and pre-trained models.
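For readers unfamiliar with the Circle Loss objective mentioned above, the sketch below follows its standard published form on cosine similarities. The margin and scale values are the commonly used defaults rather than values reported in this abstract, and the Multi-Similarity hard-sample mining step is not included.

```python
import numpy as np

def _logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def circle_loss(sp, sn, margin=0.25, gamma=64.0):
    """Circle Loss for one anchor.

    sp: similarities to positives (same neuron), sn: similarities to negatives.
    margin and gamma are assumed defaults, not taken from the abstract.
    """
    sp = np.asarray(sp, dtype=float)
    sn = np.asarray(sn, dtype=float)
    # Adaptive weights: pairs far from their optimum get larger weights.
    alpha_p = np.clip(1.0 + margin - sp, 0.0, None)
    alpha_n = np.clip(sn + margin, 0.0, None)
    logit_p = -gamma * alpha_p * (sp - (1.0 - margin))
    logit_n = gamma * alpha_n * (sn - margin)
    # log(1 + sum(exp(logit_n)) * sum(exp(logit_p))), computed stably.
    return float(np.logaddexp(0.0, _logsumexp(logit_n) + _logsumexp(logit_p)))

well_separated = circle_loss(sp=[0.95, 0.90], sn=[0.05, 0.10])
poorly_separated = circle_loss(sp=[0.30], sn=[0.70])
print(well_separated < poorly_separated)
```

The loss is close to zero when positives are near similarity 1 and negatives near 0, and grows sharply when the two overlap, which is the class-separation behavior the abstract refers to.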
Sketch3R: Rapid and Realistic 3D VR Sketch Creation to Shape Retrieval
Mritunjoy Halder ⋅ Shivam Shukla ⋅ Lokender Tiwari ⋅ Raghav Mittal ⋅ Brojeshwar Bhowmick
Large 3D shape repositories are rapidly expanding, driven by advances in generative modeling, making efficient shape retrieval increasingly important for authoring tools. While text queries capture high-level semantics, they often fail to convey precise geometric details. 3D sketches provide a more expressive means of representing shape geometry, and recent AR/VR developments have made sketch-based retrieval practical. However, existing 3D sketch datasets face three major limitations: (1) reliance on quad meshes or voxel hulls, which often fail on complex or non-manifold shapes; (2) use of fixed-size point clouds that discard stroke connectivity and limit geometric fidelity; and (3) dependence on expensive curve-based or multi-view rendering pipelines, which hinder large-scale data generation. Limited point cloud representations also fail to capture sketch connectivity and topology when used to train retrieval models. To address these challenges, we propose Sketch3R, a scalable framework that converts arbitrary 3D meshes into human-like VR sketches using a graph-based representation that preserves stroke connectivity and adapts to sketch complexity. Leveraging this representation, Sketch3R employs a lightweight graph-attention Siamese network for efficient and accurate sketch-to-shape retrieval. Experiments demonstrate that our method outperforms prior approaches in both accuracy and speed, while robustly handling 3D shapes across diverse topologies.
PhysEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education
Megha Mariam K M ⋅ Aditya Arun ⋅ Zakaria Laskar ⋅ Jawahar CV
Generative AI models, particularly Text-to-Video (T2V) systems, offer a promising avenue for transforming science education by automating the creation of engaging and intuitive visual explanations. In this work, we take a first step toward evaluating their potential in physics education by introducing a dedicated benchmark for explanatory video generation. The benchmark is designed to assess how well T2V models can convey core physics concepts through visual illustrations. Each physics concept in our benchmark is decomposed into granular teaching points, with each point accompanied by a carefully crafted prompt intended for visual explanation of the teaching point. T2V models are evaluated on their ability to generate accurate videos in response to these prompts. Our aim is to systematically explore the feasibility of using T2V models to generate high-quality, curriculum-aligned educational content—paving the way toward scalable, accessible, and personalized learning experiences powered by AI. Our evaluation reveals that current models produce visually coherent videos with smooth motion and minimal flickering, yet their conceptual accuracy is less reliable. Performance in areas such as mechanics, fluids and optics is encouraging, but models struggle with electromagnetism and thermodynamics, where abstract interactions are harder to depict. These findings underscore the gap between visual quality and conceptual correctness in educational video generation. We hope this benchmark helps the community close that gap and move toward T2V systems that can deliver accurate, curriculum-aligned physics content at scale.
Dual-Domain Multimodal Hyperbolic Fusion for Cardiopulmonary Disease Diagnosis in Emergency Care
Ke Nan ⋅ Maggie Samaan ⋅ Benjamin Burns ⋅ Xia Ning ⋅ Yuchi Han ⋅ Yuan Xue
Differentiating between cardiac and pulmonary diseases in emergency settings presents a significant challenge due to overlapping symptoms like dyspnea and chest pain, where misdiagnosis can lead to inappropriate interventions and increased morbidity. While electrocardiograms (ECGs) and chest X-rays (CXRs) provide complementary diagnostic information, existing multimodal fusion approaches fail to fully capture the complex relationships between these fundamentally different data modalities. To address these limitations, we propose DDMF-Net, a Dual-Domain Multimodal Fusion Network that explicitly unifies multi-domain features—from both frequency and spatial/temporal perspectives—and conducts cross-modality fusion of ECG, CXR signals and clinical parameters in hyperbolic space, thereby enhancing the modeling of complex cardiopulmonary pathophysiology. Our framework contains three innovations: (1) a frequency fusion module that captures complementary spectral patterns across modalities, (2) an inter-domain fusion module that dynamically balances domain-specific features, and (3) a hyperbolic cross-attention module with soft-entailment loss that effectively models hierarchical relationships between low-level imaging/signal data and high-level clinical parameters. Evaluated on four MIMIC datasets, DDMF-Net achieves state-of-the-art performance with over 2.9% improvement in micro-AUC, enabling more accurate differentiation of cardiac and pulmonary conditions in time-sensitive emergency settings. Code is publicly available at https://anonymous.4open.science/r/Dualdomainmultimodalfusionnetwork.