WACV 2026 Schedule

FRI 6 MAR

7:30 a.m.

Registration

(ends 5:00 PM)

8 a.m.

Poster Pickup [8:00-2:00]

(ends 2:00 PM)

8:30 a.m.

Workshop:

Workshop on Large Foundation Models in Biology and Biomedicine

(ends 5:00 PM)

Workshop:

6th Real-World Surveillance: Applications and Challenges

(ends 5:00 PM)

Workshop:

SAFE 2026 – Synthetic & Adversarial ForEnsics

(ends 5:00 PM)

Workshop:

Pixels to Patients: Bridging CV State-of-Art with Clinical Impact

(ends 5:00 PM)

Workshop:

4th Workshop on Computer Vision for Winter Sports

(ends 12:00 PM)

Workshop:

HARVEST-Vision: International Workshop on Applications of CV and HPC in Agriculture

(ends 12:00 PM)

Workshop:

International Workshop on Smart Waste Monitoring (WasteVision)

(ends 12:00 PM)

9:30 a.m.

Break:

Coffee Break

(ends 10:15 AM)

1 p.m.

Tutorial:

Machine Unlearning, Privacy, and AI Governance: Exploring Connections, Understanding Limitations, and Interrogating Policy Assumptions

(ends 5:00 PM)

Workshop:

EVGEN - Event-based Vision in the Era of Generative AI - Transforming Perception and Visual Innovation

(ends 5:00 PM)

Workshop:

VisionDocs: 3rd Workshop on Computer Vision Systems for Document Analysis and Recognition

(ends 5:00 PM)

Workshop:

The Second Workshop on Computer Vision for Geospatial Image Analysis

(ends 5:00 PM)

Workshop:

Synthetic Realities and Data in Biometric Analysis and Security

(ends 5:00 PM)

3 p.m.

Break:

Coffee Break

(ends 3:45 PM)

Poster Sessions (based on workshop schedule) [3:00-3:45]

(ends 3:45 PM)

SAT 7 MAR

8 a.m.

Registration

(ends 5:00 PM)

Poster Pickup [8:00-2:00]

(ends 2:00 PM)

8:30 a.m.

Tutorial:

Beyond Vision: Multimodal Perspectives for Cross-View Geo-Localization

(ends 12:00 PM)

Workshop:

5th Workshop on Image/Video/Audio Quality Assessment in Computer Vision, VLM and Diffusion Model

(ends 5:00 PM)

Workshop:

LENS: Learning and Exploitation of Latent Space Geometries

(ends 5:00 PM)

Workshop:

3rd Workshop on Computer Vision for Earth Observation (CV4EO) Applications

(ends 5:00 PM)

Workshop:

Foundational Models Beyond the Visual Spectrum

(ends 12:00 PM)

Workshop:

3rd Physical Retail AI Workshop

(ends 12:00 PM)

Workshop:

VReID-XFD: Video-based Human Recognition at Extreme Far Distances

(ends 12:00 PM)

Workshop:

WACV-2026 Workshop On Generative, Adversarial, Manipulation and Presentation Attacks In Biometrics

(ends 12:00 PM)

9:30 a.m.

Break:

Coffee Break

(ends 10:15 AM)

1 p.m.

Workshop:

Visual Art, Generative AI, and the Legal/Ethical Dilemma

(ends 5:00 PM)

Workshop:

Workshop on Generative AI for Photography

(ends 5:00 PM)

Workshop:

Robust and Generalized Lane Topology Understanding and HD Map Generation through CoT Design

(ends 5:00 PM)

Workshop:

Large Language and Vision Models for Autonomous Driving

(ends 5:00 PM)

Workshop:

Scene Graph for Structured Intelligence

(ends 5:00 PM)

3 p.m.

Poster Sessions (based on workshop schedule) [3:00-3:45]

(ends 3:45 PM)

Break:

Coffee Break

(ends 3:45 PM)

SUN 8 MAR

8 a.m.

Registration

(ends 5:00 PM)

8:30 a.m.

Remarks:

Opening Remarks and Paper Awards

(ends 9:00 AM)

9 a.m.

Poster Pickup [9:00-4:00]

(ends 4:00 PM)

Keynote:

Sparse View Synthesis

Ravi Ramamoorthi

(ends 10:00 AM)

10 a.m.

Break:

Coffee Break

(ends 10:15 AM)

10:15 a.m.

Oral Session 1A: Generative Models I [10:15-11:15]

Orals 10:15-11:15

[10:15] DreamAnywhere: Object-Centric Panoramic 3D Scene Generation

[10:27] ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models

[10:39] Odo: Depth-Guided Diffusion for Identity-Preserving Body Reshaping

[10:51] BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis

[11:03] Reinforcement Learning-based Adaptive Control of Classifier-Free Guidance and Timestep Embeddings in Diffusion Models

(ends 11:15 AM)

Oral Session 1B: 3D Computer Vision I [10:15-11:15]

Orals 10:15-11:15

[10:15] TS-PCI: Point Cloud Frame Interpolation with Time-Aware Point Cloud Sampling and Self-Supervised Learning Strategy

[10:27] Enhanced Back-Projection of Vision Features for 3D Symmetry Detection

[10:39] OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting

[10:51] UnderWater SLAM with Laser-light sectioning method using ST-GAT

[11:03] Leveraging Pretrained Representations for Cross-Modal Point Cloud Completion

(ends 11:15 AM)

11:15 a.m.

Poster Session 1 [11:15-1:00]

Posters 11:15-1:00

DreamAnywhere: Object-Centric Panoramic 3D Scene Generation

ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models

Odo: Depth-Guided Diffusion for Identity-Preserving Body Reshaping

BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis

Reinforcement Learning-based Adaptive Control of Classifier-Free Guidance and Timestep Embeddings in Diffusion Models

TS-PCI: Point Cloud Frame Interpolation with Time-Aware Point Cloud Sampling and Self-Supervised Learning Strategy

Enhanced Back-Projection of Vision Features for 3D Symmetry Detection

OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting

UnderWater SLAM with Laser-light sectioning method using ST-GAT

Leveraging Pretrained Representations for Cross-Modal Point Cloud Completion

Referring Change Detection in Remote Sensing Imagery

Inpaint360GS: Efficient Object-Aware 3D Inpainting via Gaussian Splatting for 360° Scenes

A Multi-Agent Diffusion Approach for MRI Anomaly Segmentation via Modality-Specific LoRA Specialization

GenHSI: Controllable Generation of Human-Scene Interaction Videos

SymNet: A Multi-Task Network for Joint Radio Map Reconstruction and Transmitter Localization

LooC: Effective Low-Dimensional Codebook for Compositional Vector Quantization

End-to-End Fine-Tuning of 3D Texture Generation using Differentiable Rewards

IDEAL-M3D: Instance Diversity-Enriched Active Learning for Monocular 3D Detection

FAE-Net: Fashion Attribute Editing via Disentangled Latent Conditioning in Diffusion Models

DiT-VTON: Diffusion Transformer Framework for Unified Multi-Category Virtual Try-On and Virtual Try-All with Integrated Image Editing

MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data

TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors

Towards Fast and Scalable Normal Integration using Continuous Components

A framework for real-time Surgical Phase Recognition with application to Robot-Assisted Partial Nephrectomy

A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions

Forget Less by Learning Together through Concept Consolidation

Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?

VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction

ChameleonTuner: Automatic ISP Color Tuning in Subjective Scenarios

ART: Actor-Related Tubelet for Detecting Complex-shaped Action Tubes

Training-free Multi-view 4D Human Motion Reconstruction Virtual Reality System

EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation

OW-Rep: Open World Object Detection with Instance Representation Learning

Cluster-Guided Adversarial Perturbations for Robust Contrastive Learning

A Universal Self-Attention Enhancement for Bridging Low-bit Quantization and Vision Transformers

Overcoming Fine-Grained Visual Challenges in Animal Re-Identification via Semantic Feature Alignment

M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

RAT4D: Rig and Animate Objects without Surface Templates in 4D

AFL-PRF: Adaptive Federated Learning for Low-Quality Data: Enhancing Performance, Robustness, and Fairness

Eff-GRot: Efficient and Generalizable Rotation Estimation with Transformers

DreamMakeup: Face Makeup Customization using Latent Diffusion Models

Morphing Through Time: Diffusion-Based Bridging of Temporal Gaps for Robust Alignment in Change Detection

AdaptViG: Adaptive Vision GNN with Exponential Decay Gating

MooTrack360: A Novel Fisheye Camera Dataset for Robust Multi Dairy Cow Detection and Tracking

MDUNet: Multimodal Decoding UNet for Passive Occluder-Aided Non-line-of-sight 3D Imaging

Interleaved Vision-and-Language Generation via Generative Voken

Root Completion from Intraoral Scans of Tooth Crowns using Diffusion with Patch Perturbation

Systematic Analysis of the Unintentional CSAM-Generation-Potential of Text-to-Image Models

SGPMIL: Sparse Gaussian Process Multiple Instance Learning

Understanding Human-Like Biases in VLMs via Subjective Face Analytics

M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models

Revisiting Layer Normalization for Point Cloud Test Time Adaptation

Learning to Animate Images from A Few Videos to Portray Delicate Human Actions

One-Shot Fine-Grained Re-Identification of Paint Marked Honey Bees using Vision Foundation Models

From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance

Mitigating Backdoor Attacks via Trigger Reconstruction and Model Hardening

Network-agnostic distortion-robust projections for wide-angle image understanding

4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos

Text Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters

Cluster-based Pseudo-labeling for Semi-Supervised LiDAR Semantic Segmentation

Human Pose Aggregation for Multi-View Temporal Video Alignment

Marshaled Learning: Bridging Large Neural Networks with Memory-Constrained Trusted Execution Environments in Federated Learning

Feature-Disentangling RGB-NIR Fusion Network for Remote Driver Physiological Measurement

Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation

Zero-Shot Video Deraining with Video Diffusion Models

Joint Modeling of Corruption-Driven and Information-Limited Uncertainty for Robust 3D Gaussian Splatting

Data-Driven Lipschitz Continuity: A Cost-Effective Approach to Improve Adversarial Robustness

CADE: Continual Weakly-supervised Video Anomaly Detection with Ensembles

PaRaChute: Pathology-Radiology Cross-Modal Fusion for Missing-Modality-Robust Survival Prediction

How to Design and Train Your Implicit Neural Representation for Video Compression

Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?

Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis

DICE: Discrete Inversion Enabling Controllable Editing for Masked Generative Models

Deepfake Detection that Generalizes Across Benchmarks

Unified Video Anomaly Detection Model for Detecting Different Anomaly Types

SCORP: Scene-Consistent Object Refinement via Proxy Generation and Tuning

CountingDINO: A Training-free Pipeline for Class-Agnostic Counting using Unsupervised Backbones

ViGG: Robust RGB-D Point Cloud Registration using Visual-Geometric Mutual Guidance

Gaussian Representations for Video

MSRTrack: LLM-Powered Object Tracking with Motion and Semantic Reasoning

Enabling High-Quality In-the-Wild Imaging from Severely Aberrated Metalens Bursts

Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources

Beyond Realism: Learning the Art of Expressive Composition with StickerNet

Analysis of Text Accuracy and Visual Alignment in Vision-Language Models for Artistic Text Generation

MEDAL: multi-modal MEta-space Distillation and ALignment for Visual Compatibility Learning

PS3: Part level instance segmentation in 3D

Saliency-Guided DETR for Moment Retrieval and Highlight Detection

Accelerated Dose Generation in Gamma Knife Radiosurgery Using a Wavelet Diffusion Model for Sparse Representation

DermEVAL: A Dermatologist-Reviewed Benchmark for Multimodal Large Language Models

SynPlay: Large-Scale Synthetic Human Data with Real-World Diversity for Aerial-View Perception

Do generative video models understand physical principles?

ZonUI-3B: Competitive GUI Grounding with a 3B VLM Trained on a Single Consumer GPU

No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts

MIX-based Foreground and Background Patch Augmentation Guided by Physics and Material Properties for X-ray Detection

Reverse Personalization

Beyond the Encoder: Joint Encoder-Decoder Contrastive Pre-Training Improves Dense Prediction

HyperPose: Hyper-pose Embeddings for 3D-Aware Generative Models with Self-Supervised Disentangling of Pose and Scene

Learning Mask-Aware Offsets: Two-branch Deformable Attention Networks for Inpainting with Masked Region Avoidance

Salience-SGG: Enhancing Unbiased Scene Graph Generation with Iterative Salience Estimation

SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection

Visibility guided Self-Supervised Occlusion Resilient Human Pose Estimation

Diverse Sketch Colorization with Content-Enhanced Style Representation and Recolorization Distillation

GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection

MR-Pruner: Training-free Multi-resolution Visual Token Pruning for Multi-modal Large Language Models

SpikeRain: Towards Energy-Efficient Single Image Deraining with Spiking Neural Networks

MBTI: Metric-Based Textual Inversion for Fine-Grained Image Generation

ImageNet-sES: A First Systematic Study of Sensor–Environment Simulation Anchored by Real Recaptures

MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval

From Lightweight CNNs to SpikeNets: Benchmarking Accuracy–Energy Tradeoffs with Pruned Spiking SqueezeNet

AFRAgent: An Adaptive Feature Renormalization Based High Resolution Aware GUI agent

BanglaProtha: Evaluating Vision Language Models in Underrepresented Long-tail Cultural Contexts

CommonForms: A Large, Diverse Dataset for Form Field Detection

Beyond Real Weights: Hypercomplex Representations for Stable Quantization

mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

Contrastive Integrated Gradients: A Feature Attribution-Based Method for Explaining Whole Slide Image Classification

PSA-MIL: A Probabilistic Spatial Attention-Based Multiple Instance Learning for Whole Slide Image Classification

Fully Unsupervised Self-debiasing of Text-to-Image Diffusion Models

Model-free Domain Adaptation for Concealed Multimodal Large-Language Models

CAAC: Confidence-Aware Attention Calibration to Reduce Hallucinations in Large Vision-Language Models

LiDAR-DHMT: LiDAR-Adaptive Dual Hierarchical Mask Transformer for Robust Freespace Detection and Semantic Segmentation

Mixed Diffusion for 3D Indoor Scene Synthesis

PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models

MARS: a Multimodal Alignment and Ranking System for Few-Shot Segmentation

SilverLining: Data-First Mitigation of Spatial and Spectral Shortcuts Without Introducing New Confounders

Distilling Diversity and Control in Diffusion Models

Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space

Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters

FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

Autocorrelation-Based Fiducial Markers for Traceability

GrounDiff: Diffusion-Based Ground Surface Generation from Digital Surface Models

QuadraNet V2: Efficient and Sustainable Training of High-Order Neural Networks with Quadratic Adaptation

AutoSew: A Geometric Approach to Stitching Prediction with Graph Neural Networks

SaccadeX: Directed Acyclic Graph-based Semi-Supervised Learning of Continuous Ocular Dynamics from Sparse Neuromorphic Streams

Federated Model Synchronization for Diagnostic Redefinition through a Novel Selective Parameter Unlearning

A Fast, Simple, and Flexible Scale Informative Feature Transform Module for Arbitrary Scale Image Super-Resolution

(ends 1:00 PM)

Exhibits + Demos 1 [11:15-5:45]

(ends 5:45 PM)

noon

Break:

Lunch

(ends 1:30 PM)

12:30 p.m.

Panel Discussion:

Bridging the Gap Between Academic Benchmarks and Real-World Deployment in Computer Vision: The Path to Translation

(ends 1:30 PM)

1:45 p.m.

Oral Session 2A: Vision+Language and Other Modalities I [1:45-2:45]

Orals 1:45-2:45

[1:45] MageBench: Bridging Large Multimodal Models to Agents

[1:57] You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction

[2:09] InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation

[2:21] ITSELF: Attention Guided Fine-Grained Alignment for Vision–Language Retrieval

[2:33] MarineEval: Assessing the Marine Intelligence of Vision-Language Models

(ends 2:45 PM)

Oral Session 2B: Biometrics, Face, Gesture, and Body Pose I [1:45-2:45]

Orals 1:45-2:45

[1:45] Identity Verification from Human Scent using Channel Representation of 2D Gas Chromatography-Mass Spectrometry Data

[1:57] milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion

[2:09] OpenCowID: Zero-Shot Visual Identification of Dairy Cows

[2:21] QCFace: Image Quality Control for boosting Face Representation & Recognition

[2:33] MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions

(ends 2:45 PM)

2:45 p.m.

Break:

Courtesy Break

(ends 3:00 PM)

3 p.m.

Oral Session 3A: Low-level and Physics-based Vision [3:00-4:00]

Orals 3:00-3:48

[3:00] BrightRate: Quality Assessment for User-Generated HDR Videos

[3:12] Reviving Unsupervised Optical Flow: Concept Reevaluation, Multi-Scale Advances and Full Open-Source Release

[3:24] UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations

[3:36] DRWKV: Focusing on Object Edges for Low-Light Image Enhancement

(ends 4:00 PM)

Oral Session 3B: Image Recognition and Understanding I [3:00-4:00]

Orals 3:00-3:48

[3:00] Layout Anything: One Transformer for Universal Room Layout Estimation

[3:12] BOP-Distrib: Revisiting 6D Pose Estimation Benchmarks for Better Evaluation under Visual Ambiguities

[3:24] Cosine Similarity is Almost All You Need (for Prototypical-Part Models)

[3:36] Orca: Object Recognition and Comprehension for Archiving Marine Species

(ends 4:00 PM)

4 p.m.

Poster Session 2 + Refreshments [4:00-5:45]

Posters 4:00-5:45

MageBench: Bridging Large Multimodal Models to Agents

You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction

InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation

ITSELF: Attention Guided Fine-Grained Alignment for Vision–Language Retrieval

MarineEval: Assessing the Marine Intelligence of Vision-Language Models

Identity Verification from Human Scent using Channel Representation of 2D Gas Chromatography-Mass Spectrometry Data

milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion

OpenCowID: Zero-Shot Visual Identification of Dairy Cows

QCFace: Image Quality Control for boosting Face Representation & Recognition

MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions

BrightRate: Quality Assessment for User-Generated HDR Videos

Reviving Unsupervised Optical Flow: Concept Reevaluation, Multi-Scale Advances and Full Open-Source Release

UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations

DRWKV: Focusing on Object Edges for Low-Light Image Enhancement

Layout Anything: One Transformer for Universal Room Layout Estimation

BOP-Distrib: Revisiting 6D Pose Estimation Benchmarks for Better Evaluation under Visual Ambiguities

Cosine Similarity is Almost All You Need (for Prototypical-Part Models)

Orca: Object Recognition and Comprehension for Archiving Marine Species

Multimodal Medical Image Binding via Shared Text Embeddings

PHYSPLAT: a Framework for Photorealistic Hybrid Simulation of Real and Synthetic Elements using 3D Gaussian Splatting

ArchitectHead: Continuous Level of Detail Control for 3D Gaussian Head Avatars

Uncertainty-Aware Subset Selection for Robust Visual Explainability under Distribution Shifts

The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs

FAST-EQA: Efficient Embodied Question Answering with Global and Local Region Relevancy

RapidMV: Leveraging Spatio-Angular Latent Space for Efficient and Consistent Text-to-Multi-View Synthesis

Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning

AD2: Analysis and Detection of Adversarial Threats in Visual Perception for End-to-End Autonomous Driving Systems

SSplain: Sparse and Smooth Explainer for Retinopathy of Prematurity Classification

Tables Guide Vision: Learning to See the Heart through Tabular Data

SAFER-AiD: Saccade-Assisted Foveal-peripheral vision Enhanced Reconstruction for Adversarial Defense

Enhancing Object Detection Training via Joint Image-Annotation Generation

Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions

From SAM to DINOv2: Towards Distilling Foundation Models to Lightweight Baselines for Generalized Polyp Segmentation

Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model

BoxSplitGen: A Generative Model for 3D Part Bounding Boxes in Varying Granularity

3D Gaussian Point Encoders

HumanGuideNet: Adapter-Based Alignment of Deep Neural Networks with Human Similarity Judgments

Low-Rank Expert Merging for Multi-Source Domain Adaptation in Person Re-Identification

MEGA-PCC: A Mamba-based Efficient Approach for Joint Geometry and Attribute Point Cloud Compression

HyPCA-Net: Advancing Multimodal Fusion in Medical Image Analysis

CycleSL: Server-Client Cyclical Update Driven Scalable Split Learning

3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting

Segmentation-Aware Latent Diffusion for Satellite Image Super-Resolution: Enabling Smallholder Farm Boundary Delineation

mmWeaver: Environment-Specific mmWave Signal Synthesis from a Photo and Activity Description

Detection-Driven Object Count Optimization for Text-to-Image Diffusion Models

Roadside Monocular 3D Detection Prompted by 2D Detection

UniCalib: Targetless LiDAR-camera Calibration via Probabilistic Flow on Unified Depth Representations

Color Bind: Exploring Color Perception in Text-to-Image Models

ASC: Learning Augmentation Severity-Consistent Representations Improves Generalization via Augmentation Search

Detecting Out-of-Distribution Objects through Class-Conditioned Inpainting

Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars

Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models

False Alarm Rectification for Early Smoke Segmentation

Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space

Semi-Supervised Hierarchical Open-Set Classification

ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora

CURIO: Curvature-Aligned and Efficient OCR for Low-Resource Historical Manuscripts

DoTA: Latent Distribution Conditioned Data Attribution for Diffusion Models

Learnable Query-Enhanced Pose Transformation

CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video

Evaluating Text-to-Image and Text-to-Video Synthesis with a Conditional Frechet Distance

From Bands to Depth: Understanding Bathymetry Decisions on Sentinel-2

Pyramidal Spectrum: Frequency-based Hierarchically Vector Quantized VAE for Videos

PRISM-CAFO: Prior-conditioned Remote-sensing Infrastructure Segmentation and Mapping for CAFOs

Dressing the Imagination: A Dataset for AI-Powered Translation of Text into Fashion Outfits and A Novel NeRA Adapter for Enhanced Feature Adaptation

EllipssianNet: Image-guided Sampling of 2D Gaussians for Gaussian Splatting

Zero-Shot Coreset Selection via Iterative Subspace Sampling

MorphXAI: An Explainable Framework for Morphological Analysis of Parasites in Blood Smear Images

WWE-UIE: A Wavelet & White Balance Efficient Network for Underwater Image Enhancement

SPAR-Det: Segmentation-guided and Prior-Aided Routing for Small Object Detection

GeoHSAF: Geometric Hippocampus Shape Analysis Framework for Longitudinal Alzheimer's Disease Classification

BAFIS: Dataset + Framework to assess occupational Bias and Human Preference in modern Text-to-image Models

Imitating the Functionality of Image-to-Image Models Using a Single Example

Beyond the Highlights: Video Retrieval with Salient and Surrounding Contexts

SCORE: Soft Label Compression-Centric Dataset Condensation via Coding Rate Optimization

Sketch2Stitch: GANs for Abstract Sketch-Based Dress Synthesis

AnyBald: Toward Realistic Diffusion-Based Hair Removal In-The-Wild

Understanding Generative AI Capabilities in Everyday Image Editing Tasks

Reconstructing Realistic and Relightable Eyes

1LoRA: Summation Compression for Very-Low Rank Adaptation

Illuminating Darkness: Learning to Enhance Low-light Images In-the-Wild

DOODLE: Diffusion-based Out-of-Distribution Learning for Open-set LiDAR Semantic Segmentation

RobustFormer: Noise-Robust Pre-training for Images and Videos

CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning

Trajectory Tactics: When Transformers Learn Exploration to Generate Online Signature

BrandFusion: Aligning Image Generation with Brand Styles

Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models

FARF-Net: Frequency-guided Adaptive Receptive Field Network for Edge-enhanced Polyp Segmentation

Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery

Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding

DenseBEV: Transforming BEV Grid Cells into 3D Objects

Human knowledge integrated multi-modal learning for single source domain generalization

GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction

ScoliGaitX: A Deep Multi-Modal Fusion Network for Scoliosis Assessment via Gait Video Analysis

Non-Contact Blood Pressure Estimation from Face Videos via Physiology-Aware Contrastive Learning

Ordinal-Aware Multimodal Engagement Recognition for Collaborative Learning

DynaGSLAM: Real-Time Gaussian-Splatting SLAM for Online Rendering, Tracking, Motion Predictions of Moving Objects in Dynamic Scenes

Test-Time Adaptation through Semantically-guided Feature Decomposition for Few-shot Chest X-ray Diagnosis

FlowMorph: Revealing an Optimizable Flow Latent Space for Controlled Image Morphing

PVeRA: Probabilistic Vector-Based Random Matrix Adaptation

Harnessing Object Grounding for Time-Sensitive Video Understanding

TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning

Revisiting Retentive Networks for Fast Range-View 3D LiDAR Semantic Segmentation

Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention

Moiré Zero: An Efficient and High-Performance Neural Architecture for Moiré Removal

LENVIZ: A High-Resolution Low-Exposure Night Vision Benchmark Dataset

RobustGait: Robustness Analysis for Appearance Based Gait Recognition

A-V Representation Learning via Audio Shift Prediction for Multimodal Deepfake Detection and Temporal Localization

CraftSVG: Multi-Object Text-to-SVG Synthesis via Layout Guided Diffusion

Shift-Equivariant Complex-Valued Convolutional Neural Networks

ObjectMeshDeform: Towards recovering precise 3D geometry of real objects via image-guided mesh deformation of 3D generative priors

Memory-Augmented Representation for Efficient Event-based Visuomotor Policy Learning with Adaptive Perception and Control

PADM: A Physics-aware Diffusion Model for Attenuation Correction

NerVast: Compression-Efficient Scaling of Implicit Neural Video Representations via Scene-based Parameter-sharing (Yunheon Lee, Juncheol Ye, Jaehong Kim, Dongsu Han)

CineVerse: Consistent Keyframe Synthesis for Cinematic Scene Composition

MMCM: Multimodality-aware Metric using Clustering-based Modes for Probabilistic Human Motion Prediction

Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning

ConsensusXAI: A framework to examine class-wise agreement in medical imaging

Real-Time Tracking of Flexible Markers in Low-Contrast Fluoroscopy Using a Deep Neural Network Trained Solely on Synthetic Data

OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models

DMAT: An End-to-End Framework for Joint Atmospheric Turbulence Mitigation and Object Detection

Divide and Refine: Enhancing Multimodal Representation and Explainability for Emotion Recognition in Conversation

iMotion-LLM: Instruction-Conditioned Trajectory Generation

SasMamba: A Lightweight Structure-Aware Stride State Space Model for 3D Human Pose Estimation

SUGAR: A Sweeter Spot for Generative Unlearning of Many Identities

Mitigating the Modality Gap: Few-Shot Out-of-Distribution Detection with Multi-modal Prototypes and Image Bias Estimation

Matching Semantically Similar Non-Identical Objects

UI-Styler: Ultrasound Image Style Transfer with Class-Aware Prompts for Cross-Device Diagnosis Using a Frozen Black-Box Inference Network

Tables Decoded: DELTA for Structure, TARQA for Understanding

What Happens When: Learning Temporal Orders of Events in Videos

UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning

Personalized Image Privacy Advisors via Federated Daisy-Chaining

DiRe: Diversity-promoting Regularization for Dataset Condensation

Vision-informed Semantic Text Alignment for Open-set Recognition in Remote Sensing

Anatomy-VLM: A Fine-grained Vision-Language Model for Medical Interpretation

Temporal Object Captioning for Street Scene Videos from LiDAR Tracks

STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences

RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph

Conditional Text-to-Image Generation with Reference Guidance

Improved Wildfire Spread Prediction with Time-Series Data and the WSTS+ Benchmark

From Few-Shot to Zero-Shot Pallet Load Recognition: A Deployed Embedding-Based Vision System for Industrial Logistics

Graph-Based Spectral Attention with Multi-Spectral Images for Illuminant Estimation

(ends 5:45 PM)

MON 9 MAR

8 a.m.

Registration

(ends 5:00 PM)

8:30 a.m.

Keynote:

Applications of Computer Vision in Healthcare: The Road to Autonomy

Dorin Comaniciu

(ends 9:30 AM)

9 a.m.

Poster Pickup [9:00-4:00]

(ends 4:00 PM)

9:30 a.m.

Break:

Courtesy Break

(ends 9:45 AM)

9:45 a.m.

Oral Session 4A: Image Recognition and Understanding II [9:45-10:45]

Orals 9:45-10:33

[9:45] Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing

[9:57] Extreme Amodal Face Detection

[10:09] ENCORE: A Neural Collapse Perspective on Out-of-Distribution Detection in Deep Neural Networks

[10:21] Performance of Conformal Prediction in Capturing Aleatoric Uncertainty

(ends 10:45 AM)

Oral Session 4B: Machine Learning I [9:45-10:45]

Orals 9:45-10:45

[9:45] Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination

[9:57] Unified Alignment Protocol: Making Sense of the Unlabeled Data in New Domains

[10:09] Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning

[10:21] Learning from Unknown for Open-Set Test-Time Adaptation

[10:33] Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling

(ends 10:45 AM)

10:45 a.m.

Exhibits + Demos 2 [10:45-6:00]

(ends 6:00 PM)

Poster Session 3 [10:45-12:30]

Posters 10:45-12:30

Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing

Extreme Amodal Face Detection

ENCORE: A Neural Collapse Perspective on Out-of-Distribution Detection in Deep Neural Networks

Performance of Conformal Prediction in Capturing Aleatoric Uncertainty

Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination

Unified Alignment Protocol: Making Sense of the Unlabeled Data in New Domains

Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning

Learning from Unknown for Open-Set Test-Time Adaptation

StreetView-Waste: A Multi-Task Dataset for Urban Waste Management

AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM

SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering

Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation

Dragonite: Single-Step Drag-based Image Editing with Geometric-Semantic Guidance

Augmenting with NeRFs: Fast Relocalization on Densified Datasets

GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving

Towards Reliable Test-Time Adaptation: Style Invariance as a Correctness Likelihood

TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model

Large Sign Language Models: Toward 3D American Sign Language Translation

Crafting Descriptive Information for a Zero-shot Method to Improve Knowledge-Based Visual Question Answering Performance

Learning Action Hierarchies via Hybrid Geometric Diffusion

Learning Group Actions In Disentangled Latent Image Representations

GAITGen: Disentangled Motion-Pathology Impaired Gait Generative Model -- Bringing Motion Generation to the Clinical Domain

Gradient-Free Classifier Guidance for Diffusion Model Sampling

Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

Guided Model Merging for Hybrid Data Learning: Leveraging Centralized Data to Refine Decentralized Models

Fused Similarity Measure Based Alignment with Dual-Scale Adaptive Selection for Weakly Supervised Video Anomaly Detection

PointNet4D: A lightweight 4D Point Cloud Video Backbone for Online and Offline Perception in Robotic Applications

Cross-Modal Event Encoder: Bridging Image–Text Knowledge to Event Streams

SAIL: Self-supervised Learning of Lighting-Invariant Representations from Real Images with Latent Diffusion

PSDiffusion: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment

NERVE: Neighbourhood & Entropy-Guided Random-Walk for Training Free Open-Vocabulary Segmentation

CLIP’s Visual Embedding Projector is a Few-shot Cornucopia

RegionAligner: Bridging Ego-Exo Views for Object Correspondence via Unified Text-Visual Learning

Towards Streaming LiDAR Object Detection with Point Clouds as Egocentric Sequences

Show Me: Unifying Instructional Image and Video Generation with Diffusion Models

VOCAL: Visual Odometry via ContrAstive Learning

Pose-Diverse Multi-View Virtual Try-on from a Single Frontal Image via Diffusion Transformer

SimForce: Force and Surface Electromyography from Full Body Video with Graph Neural Nets

SCATR: Mitigating New Instance Suppression in LiDAR-based Tracking-by-Attention via Second Chance Assignment and Track Query Dropout

AirLock+: Scaling UAV-to-Satellite Image Registration for Target Geolocalization and Geospatial Augmented Reality

Blur2Sharp: Human Novel Pose and View Synthesis with Generative Prior Refinement

Power of Boundary and Reflection: Semantic Transparent Object Segmentation using Pyramid Vision Transformer with Transparent Cues

HEART-PFL: Stable Personalized Federated Learning under Heterogeneity with Hierarchical Directional Alignment and Adversarial Knowledge Transfer

Detecting Social Engagement of Elderly From Lifelog Image-streams to Identify Effective Cues for Autobiographic Recall

HistoMILKD: A Multiple Instance Learning based Multi-Teacher Knowledge Distillation Framework for Whole Slide Image Classification

Stroke Modeling Enables Vectorized Character Generation with Large Vectorized Glyph Model

CasTex: Cascaded Text-to-Texture Synthesis via Explicit Texture Maps and Physically-Based Shading

DATTA: Domain-Adversarial Test-Time Adaptation for Cross-Domain WiFi-Based Human Activity Recognition

MixER: From Cross-Modal to Mixed-Modal Visible-Infrared Re-Identification

LighthouseGS: Indoor Structure-aware 3D Gaussian Splatting for Panorama-Style Mobile Captures

FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation

Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression

MANTA: Physics-Informed Generalized Underwater Object Tracking

KD360-VoxelBEV: LiDAR and 360-degree Camera Cross Modality Knowledge Distillation for Bird’s-Eye-View Segmentation

MAFM³: Modular Adaptation of Foundation Models for Multi-Modal Medical AI

SPOC: Spatially-Progressing Object State Change Segmentation in Video

Self-Supervised Visual Prompting for Cross-Domain Road Damage Detection

Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image

Multi-Modal Soccer Scene Analysis with Masked Pre-Training

Visual Detector Compression via Location-Aware Discriminant Analysis

Latent Uncertainty-Aware Multi-View SDF Scan Completion

Comp4D: Compositional 4D Scene Generation

SENCA-st: Integrating Spatial Transcriptomics and Histopathology with Cross Attention Shared Encoder for Region Identification in Cancer Pathology

Towards High-Fidelity, Identity-Preserving Real-Time Makeup Transfer: Decoupling Style Generation

Feature Inversion as a Lens on Vision Encoders

MuSACo: Multimodal Subject-Specific Selection and Adaptation for Expression Recognition with Co-Training

SCALEX: Scalable Concept and Latent Exploration for Diffusion Models

DCSHARP: 3D Gaussian Splatting with Direction Cosine Spherical Harmonics and Shape-Aware Pruning

DOTGraph: CLIP-Driven Feature Disentanglement and Optimal Transport based Graph Learning for Few-Shot Segmentation

Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models

Sketch-guided Cage-based 3D Gaussian Splatting Deformation

Autoregressive Styled Text Image Generation, but Make it Reliable

Universal Neural Architecture Space: Covering ConvNets, Transformers and Everything in Between

Sea-CLIP: Mining Semantic-Aware Representations for Few-Shot Anomaly Detection with CLIP

CLIP-IT: CLIP-based Pairing of Histology Images with Privileged Textual Information

LightGazeNet: A Lightweight GNN-based Architecture for Gaze Estimation

Codebook Knowledge with Mamba-Transformer For Low-Light Image Enhancement

TiCLS: Tightly Coupled Language Text Spotter

Perception-Inspired Color Space Design for Photo White Balance Editing

CLUE: Bringing Machine Unlearning to Mobile Devices

FairScene: Learning Class-Disentangled 2D/3D Representations for Semantic Scene Completion

Frequency Is What You Need: Considering Word Frequency When Text Masking Benefits Vision-Language Model Pre-training

Event-based Graph Representation with Spatial and Motion Vectors for Asynchronous Object Detection

ART-ASyn: Anatomy-aware Realistic Texture-based Anomaly Synthesis Framework for Chest X-Rays

High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization

MuseDance: A Diffusion-based Music-Driven Image Animation System

A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback

Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings

DTMIR-Pro: Domain Translation with Prompt-based Latent-Space Generalization for Multi-Weather Image Restoration

ObjectCore – Efficient Few-shot Logical Anomaly Detection using Object Representations

A Novel Metric for Detecting Memorization in Generative Models for Brain MRI Synthesis

Unsupervised Modular Adaptive Region Growing and RegionMix Classification for Wind Turbine Segmentation

Gen-AFFECT: Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy

FlowEO: Generative Unsupervised Domain Adaptation for Earth Observation

Efficient Vision Transformers via Token Merging with Head-wise Attention Correction

2S-CEDiff: A Two-Stage Diffusion Framework for Generating High-Fidelity Contrast-Enhanced CT Images from Non-Contrast Scans

JOCA: Task-Driven Joint Optimisation of Camera Hardware and Adaptive Camera Control Algorithms

CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding

Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Mobile-Oriented Video Diffusion: Enabling Text-to-Video Generation on Mobile Devices Without Retraining, Compression, or Pruning

Workzone3D: A Multimodal Dataset for 3D Work Zone Perception in Autonomous Driving

Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models

Training-free Detection of Text-to-video Generations via Over-coherence

ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion

Efficient Text-Guided Convolutional Adapter for the Diffusion Model

FLoMo-Net: A Novel Task-Adaptive Mixture of Experts Routing Framework with Frequency and Uncertainty Correction for Medical Image Segmentation

From Detection to Anticipation: Online Understanding of Struggles across Various Tasks and Activities

CaRS: A Causal Intervention Segmentation Framework and Benchmark Dataset for Autonomous Driving under Transitional Weather Conditions

Domain Generalizing DINO for Visual Regression via Latent Distractor Subspace Consistency

WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion

Decoupling Shape and Texture in SAM-2 via Controlled Texture Replacement

HumanBench: Two Heads, No Legs, But Mostly Human, the State of Generative Capabilities in T2I Models

SOAF: Scene Occlusion-aware Neural Acoustic Field

FedSCAl: Leveraging Server and Client Alignment for Unsupervised Federated Source-Free Domain Adaptation

Structure-Aware Feature Rectification with Region Adjacency Graphs for Training-free Open-Vocabulary Semantic Segmentation

One Model, Many Behaviors: Training-Induced Effects on Out-of-Distribution Detection

TalkingHeadBench: A Multi-Modal Benchmark & Analysis of Talking-Head DeepFake Detection

Rethinking Real Image Editing: Unleashing Diverse Editing Operators via Multi-Objective Optimization

Decomposition Sampling for Efficient Region Annotations in Active Learning

DNA: Dual-branch Network with Adaptation for Open-Set Online Handwriting Generation

STEG-AIW: Spatio-Temporal Gating and Adaptive-Timestep Inference for Efficient Spiking Neural Networks

Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction

Self-Supervised Compression and Artifact Correction for Streaming Underwater Imaging Sonar

Perceptually Guided 3DGS Streaming and Rendering for Mixed Reality

Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting

Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models

Guided Texture Segmentation via Coordinate-Aware Class-Ratio Mapping

Empowering Source-Free Domain Adaptation via MLLM-Guided Reliability-Based Curriculum Learning

ExDDV: A New Dataset for Explainable Deepfake Detection in Video

AGENet: Adaptive Edge-aware Geodesic Distance Learning for Few-Shot Medical Image Segmentation

KMOPS: Keypoint-Driven Method for Multi-Object Pose and Metric Size Estimation from Stereo Images

OPFormer: Object Pose Estimation leveraging foundation model with geometric encoding

Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling

Hybrid State Representation for Video Procedure Planning

(ends 12:30 PM)

noon

Break:

Lunch

(ends 1:30 PM)

Doctoral Consortium:

DC Event

(ends 2:00 PM)

1:30 p.m.

Oral Session 5A: Generative Models II [1:30-2:30]

Orals 1:30-2:30

[1:30] CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization

[1:42] DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy

[1:54] VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping

[2:06] VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework

[2:18] Fine-grained Defocus Blur Control for Generative Image Models

(ends 2:30 PM)

Oral Session 5B: Remote Sensing and Sensors [1:30-2:30]

Orals 1:30-2:18

[1:30] CalibBEV: LiDAR-Camera Calibration via BEV Alignment

[1:42] X-JEPA: A Novel Joint Learning Cross-Modal Predictive Alignment Framework for Remote Sensing Image Retrieval

[1:54] SSMRadNet: A Sample-wise State-Space Framework for Efficient and Ultra-Light Radar Segmentation and Object Detection

[2:06] Rank-based Geographical Regularization: Revisiting Contrastive Self-Supervised Learning for Multispectral Remote Sensing Imagery

(ends 2:30 PM)

2:30 p.m.

Break:

Courtesy Break

(ends 2:45 PM)

2:45 p.m.

Oral Session 6A: 3D Computer Vision II [2:45-3:45]

Orals 2:45-3:45

[2:45] OMeGa: Joint Optimization of Explicit Meshes and Gaussian Splats for Robust Scene-Level Surface Reconstruction

[2:57] Confidence Through Parallel Attention for Depth and Uncertainty Estimation in Dynamic Environments

[3:09] BiNAR: A Bi-Modal Framework for Non-Aligned RGB-IR 3D Reconstruction via Gaussian Splatting

[3:21] Spec-Gloss Surfels and Normal–Diffuse Priors for Relightable Glossy Objects

[3:33] Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning

(ends 3:45 PM)

Oral Session 6B: Video Recognition and Understanding I [2:45-3:45]

Orals 2:45-3:45

[2:45] Ego-EXTRA: video-language Egocentric Dataset for EXpert-TRAinee assistance

[2:57] Similarity-aware Probabilistic Embeddings Modeling for Video-Text Retrieval

[3:09] PromptGAR: Flexible Promptive Group Activity Recognition

[3:21] Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

[3:33] Broadcast2Pitch: Game State Reconstruction from Unconstrained Soccer Videos

(ends 3:45 PM)

3:45 p.m.

Meeting:

PAMI-TC meeting

(ends 4:30 PM)

4:30 p.m.

Poster Session 4 + Reception [4:30-6:15]

Posters 4:30-6:15

CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization

DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy

VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping

VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework

Fine-grained Defocus Blur Control for Generative Image Models

CalibBEV: LiDAR-Camera Calibration via BEV Alignment

X-JEPA: A Novel Joint Learning Cross-Modal Predictive Alignment Framework for Remote Sensing Image Retrieval

SSMRadNet: A Sample-wise State-Space Framework for Efficient and Ultra-Light Radar Segmentation and Object Detection

Rank-based Geographical Regularization: Revisiting Contrastive Self-Supervised Learning for Multispectral Remote Sensing Imagery

OMeGa: Joint Optimization of Explicit Meshes and Gaussian Splats for Robust Scene-Level Surface Reconstruction

Confidence Through Parallel Attention for Depth and Uncertainty Estimation in Dynamic Environments

BiNAR: A Bi-Modal Framework for Non-Aligned RGB-IR 3D Reconstruction via Gaussian Splatting

Spec-Gloss Surfels and Normal–Diffuse Priors for Relightable Glossy Objects

Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning

Ego-EXTRA: video-language Egocentric Dataset for EXpert-TRAinee assistance

Similarity-aware Probabilistic Embeddings Modeling for Video-Text Retrieval

PromptGAR: Flexible Promptive Group Activity Recognition

Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

Broadcast2Pitch: Game State Reconstruction from Unconstrained Soccer Videos

VLMs Guided Interpretable Decision Making in Autonomous Driving

DARB-Splatting: Generalizing Splatting with Decaying Anisotropic Radial Basis Functions

Surgical Gaussian Surfels: Highly Accurate Real-time Surgical Scene Rendering using Gaussian Surfels

Gated Temporal Fusion Transformers for Robust Multi-Object Tracking

SIAM: Synchronous Interaction Attention for Human Mesh Recovery

SynchroRaMa: Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding

Are All Marine Species Created Equal? Performance Disparities in Underwater Object Detection

LVM-Lite: Training Large Vision Models with Efficient Sequential Modeling

HiGlassRM: Learning to Remove High-prescription Glasses via Synthetic Dataset Generation

Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups

SeaClips: A Video Dataset for Maritime Object Detection

UniDiff: Parameter-Efficient Adaptation of Diffusion Models for Land Cover Classification with Multi-Modal Remotely Sensed Imagery and Sparse Annotations

CropAT: Leveraging Diffusion-Generated Target-Like Cropped Objects for Pseudo-Label Refinement in Domain-Adaptive Object Detection

Beyond Faces: A Multimodal Person Clustering for Unconstrained Environments

Eye-for-an-eye: Appearance Transfer with Dense Semantic Correspondence in Diffusion Models

Towards Photorealistic Style Transfer with Multimodal Guidance and Robustness to Content Images in Arbitrary Styles

Hierarchical Adaptive networks with Task vectors for Test-Time Adaptation

Automated Pore Detection from In-Situ FDM 3D Printing Video: A Comparative Evaluation of Modern Segmentation Models

Safe Vision-Language Models via Unsafe Weights Manipulation

SOPHY: Generating Simulation-Ready Objects with Physical Materials

Generalization of Real World Video Deblurring By Image-to-Image Translation

A Dataset and Framework for Learning State-invariant Object Representations

Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition

Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation

Countering Multi-modal Representation Collapse through Rank-targeted Fusion

Learning Unified Spatio-temporal Representations for Efficient Compressed Video Understanding

More Than Memory Savings: Zeroth-Order Optimization Mitigates Forgetting in Continual Learning

Global Focal and Radial Distortion Averaging from Radial Fundamental Matrices for Robust Self-Calibration

UCDSC: Open Set UnCertainty aware Deep Simplex Classifier for Medical Image Datasets

DODA: Adapting Object Detectors to Dynamic Agricultural Environments in Real-Time with Diffusion

Structured Context Learning for Generic Event Boundary Detection

Sun-E: Dataset and Benchmark for Event-Based Sun Sensing

FCC: Fully Connected Correlation for One-Shot Segmentation

MoRe: Monocular Geometry Refinement via Graph Optimization for Cross-View Consistency

ProSkill: Segment-Level Skill Assessment in Procedural Videos

MergeSlide: Continual Model Merging and Task-to-Class Prompt-Aligned Inference for Lifelong Learning on Whole Slide Images

CoL2A: Convolution-free Local Linear Attention for SpatioTemporal Event Processing

Dronaquatics: Real-time Swimming Analytics Using Drone Captured Imagery

Histopath-C: Towards Realistic Domain Shifts for Histopathology Vision-Language Adaptation

Robust Multimodal Emotion Recognition from Incomplete Modalities via Query-Based Unimodal and Cross-Modal Learning

WiSE-OD: Benchmarking Robustness in Infrared Object Detection

Patch-wise Retrieval: A Bag of Practical Techniques for Instance-level Matching

Gaussian Swaying: Surface-Based Framework for Aerodynamic Simulation with 3D Gaussians

Restora-Flow: Mask-Guided Image Restoration with Flow Matching

PrevMatch: Revisiting and Maximizing Temporal Knowledge in Semi-Supervised Semantic Segmentation

One-shot Portrait Stylization via Geometric Alignment

Zero-Shot Table Extraction in Business Documents: A Unified Benchmark with Error Taxonomy and Ecological Analysis

SegMango: Early Deep Mango Yield Prediction based on Flower Segmentation and Weather Data

Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery

PoseGaussian: Pose-Driven Novel View Synthesis for Robust 3D Human Reconstruction

Timestamp Query Transformer for Temporal Action Segmentation

QC-SF: Improving Computer Vision for Airborne LiDAR Point Clouds of Boreal Forests with Quebec Simulated Forest Dataset

RemEdit: Efficient Diffusion Editing with Riemannian Geometry

GrowTAS: Progressive Expansion from Small to Large Subnets for Efficient ViT Architecture Search

Subspace-Guided Knowledge Distillation for Efficient Model Transfer

TimeRefine: Temporal Grounding with Time Refining Video LLM

Curve Skeletonization in Continuous domain for Meshes and Point Clouds

Gene-DML: Dual-Pathway Multi-Level Discrimination for Gene Expression Prediction from Histopathology Images

Color Preserving CMOS-SPAD Fusion for Multi-Frame HDR

Unsupervised Segmentation by Diffusing, Walking and Cutting

Learning spatio-temporal feature representations for video-based gaze estimation

ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models

Edge-Aware Image Manipulation via Diffusion Models with a Novel Structure-Preservation Loss

3D Superquadric Splatting

Controllable Long-term Motion Generation with Extended Joint Targets

ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos

RPT-SR: Regional Prior attention Transformer for infrared image Super-Resolution

LASOR: Towards Clinically Transparent and Explainable Ophthalmic Report Generation via Lesion-Aware Segmentation

Quantifying the Limits of Segmentation Foundation Models: Modeling Challenges in Segmenting Tree-Like and Low-Contrast Objects

Lorentz Entailment Cone for Semantic Segmentation

WarpRF: Multi-View Consistency for Training-Free Uncertainty Quantification and Applications in Radiance Fields

GAEA: A Geolocation Aware Conversational Assistant

WSSSP-Net: Weakly Supervised Semantic Segmentation Plugin Network for Face Anti-Spoofing

CONCORD: Concept-Informed Diffusion for Dataset Distillation

Improving Out-of-Distribution Detection Using Segmented Images and Cross-View Attention Fusion

An improved architecture for part-based animal re-identification through semantic segmentation distillation

FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks

Hestia: Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction

R3: Reconstruction, Raw, and Rain: Deraining Directly in the Bayer Domain

An Efficient Multi-Rater Setup Towards Personalized and Diversified Medical Image Segmentation

HiMix: Hierarchical Visual-Textual Mixing Network for Lesion Segmentation

FSP-DETR: Few-Shot Prototypical Parasitic Ova Detection

DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions

Context-Preserving Dermoscopic Editing: Mask-Guided Lesion-Aware Diffusion for Attribute Modification

How I Met Your Bias: Investigating Bias Amplification in Diffusion Models

BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries

ProtoGMVAE: A Variational Auto-Encoder with True Gaussian Mixture Prior for Prototypical-based Self-Explainability

AEON: Adaptive Embedding Optimized Noise for Robust Watermarking in Diffusion Models

Revisiting Vision–Language Foundations for No-Reference Image Quality Assessment

Semi-supervised Domain Adaptation via Mutual Alignment through Joint Error

Unified Control for Inference-Time Guidance of Denoising Diffusion Models

Learning Subglacial Bed Topography from Sparse Radar with Physics-Guided Residuals

4D Multimodal Co-attention Fusion Network with Latent Contrastive Alignment for Alzheimer's Diagnosis

One-Cycle Structured Pruning via Stability-Driven Subnetwork Search

PredMapNet: Future and Historical Reasoning for Consistent Online HD Vectorized Map Construction

Multi-view stereo with multiple projectors for oneshot entire shape scan based on Neural SDF and DSSS demultiplexing

FreeCond: Free Lunch in the Input Conditions of Text-Guided Inpainting

ControlEvents: Controllable Synthesis of Event Camera Data with Foundational Prior from Image Diffusion Models

DPBridge: Latent Diffusion Bridge for Dense Prediction

PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection

SurfDist: Interpretable Three-Dimensional Instance Segmentation Using Curved Surface Patches

CRISP: Cylindrical Rendering for In-Stream Point Clouds

INRetouch: Context Aware Implicit Neural Representation for Photography Retouching

Line Art Colorization with Offset Prior-based Diffusion Model

Food Image Generation on Multi-Noun Categories

RobuMTL: Enhancing Multi-Task Learning Robustness Against Weather Conditions

EndoPBR: Photorealistic Synthetic Data for Surgical 3D Vision via Physically-based Rendering

Inpainting of Sparse Depth Maps from Monocular Depth-from-Focus on Pixel Processor Arrays

DMS2F-HAD: A Dual-branch Mamba-based Spatial–Spectral Fusion Network for Hyperspectral Anomaly Detection

F-ViTA: Foundation Model Guided Visible to Infrared Translation

KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding

FujiView: Multimodal Late-Fusion for Predicting Scenic Visibility

Improving Animal Pose Estimation through Species Similarity Measures and Rigorous Label Definition

Grounding Descriptions in Images informs Zero-Shot Visual Recognition

Semi-supervised Key-Point Estimation for Echocardiography Video

Anatomically-guided masked autoencoder pre-training for aneurysm detection

Style-Friendly SNR Sampler for Style-Driven Generation

Lose Your Self (LoYS): an adversarial entropy-based unsupervised approach for model debiasing

MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes

Grounding Degradations in Natural Language for All-In-One Video Restoration

ControlVP: Interactive Geometric Refinement of AI-Generated Images with Consistent Vanishing Points

Enhancing Vision Language Corruption Robustness using Cross Distribution & Prompted Denoisers

BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining

Towards Egocentric 3D Hand Pose Estimation in Unseen Domains

Direct Visual Grounding by Directing Attention of Visual Tokens

CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow

(ends 6:15 PM)

TUE 10 MAR

8 a.m.

Registration

(ends 2:00 PM)

8:30 a.m.

Keynote:

A Short History of Video Understanding - Past, Present, and Future

Hilde Kühne

(ends 9:30 AM)

9 a.m.

Poster Pickup [9:00-4:00]

(ends 4:00 PM)

9:30 a.m.

Break:

Courtesy Break

(ends 9:45 AM)

9:45 a.m.

Oral Session 7A: Biometrics, Face, Gesture, and Body Pose II [9:45-10:45]

Orals 9:45-10:45

[9:45] Motion-Aware Graph Fusion Network for 3D Human Pose Estimation

[9:57] UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training

[10:09] Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities

[10:21] VAST-ReID: A Low-Light Benchmark Dataset for Person Re-Identification with Visual and Attribute-Rich Semantic Tracking

[10:33] DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors

(ends 10:45 AM)

Oral Session 7B: Vision+Language and Other Modalities II [9:45-10:45]

Orals 9:45-10:45

[9:45] DREAM: Dynamic Prompts and GuidedMix for Efficient Continual Adaptation of Visual-Language Models

[9:57] brat: Aligned Multi-View Embeddings for Brain MRI Analysis

[10:09] Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score

[10:21] Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation

[10:33] CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering

(ends 10:45 AM)

10:45 a.m.

Exhibits + Demos 3 [10:45-2:00]

(ends 2:00 PM)

Poster Session 5 [10:45-12:15]

Posters 10:45-12:15

Motion-Aware Graph Fusion Network for 3D Human Pose Estimation

UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training

Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities

VAST-ReID: A Low-Light Benchmark Dataset for Person Re-Identification with Visual and Attribute-Rich Semantic Tracking

DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors

DREAM: Dynamic Prompts and GuidedMix for Efficient Continual Adaptation of Visual-Language Models

brat: Aligned Multi-View Embeddings for Brain MRI Analysis

Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score

Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation

CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering

Isolating the Role of Temporal Information in Video Saliency: A Controlled Experimental Analysis

Diffusion-Based Action Recognition Generalizes to Untrained Domains

CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation

Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information

Causality-Driven Audits of Model Robustness

BAFLE-DCT: Bypassing Adversarial Filters via Frequency-Selective Embedding in the DCT Domain

Logit-Adjusted Test-Time Adaptation under Partial Class Imbalance

Test Time Adaptation Using Adaptive Quantile Recalibration

OSEG: Improving Diffusion sampling through Orthogonal Smoothed Energy Guidance

Hymavi: A Hybrid Mamba-Attention Network in Multi-View Framework for Volumetric Medical Image Segmentation

RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding

IMKD: Intensity-Aware Multi-Level Knowledge Distillation for Camera-Radar Fusion

LogicCBMs: Logic-Enhanced Concept-Based Learning

Understanding the Visual Projection Space of Multimodal LLMs

NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining

SSMT-Net: A Semi-Supervised Multitask Transformer-Based Network for Thyroid Nodule Segmentation in Ultrasound Images

Gaussian Splatting Map Registration with Orthographic Bird's-Eye-View Renderings

MUSE: Model-based Uncertainty-aware Similarity Estimation for zero-shot 2D Object Detection and Segmentation

MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection

ODEt(ODEl): Shortcutting the Time and the Length in Diffusion and Flow Models for Faster Sampling

TM-Adapter: Temporal Merge Adapter for Efficient Global Temporal Modeling

Diagnose Like A REAL Pathologist: An Uncertainty-Focused Approach for Trustworthy Multi-Resolution Multiple Instance Learning

Align Video Diffusion Model with Online Video-Centric Preference Optimization

SceneProp: Combining Neural Network and Markov Random Field for Scene-Graph Grounding

GHOST: Getting to the Bottom of Hallucinations with A Multi-round Consistency Benchmark

GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts

Zero-Shot Domain Generalisation via Prompt-Driven Feature Refinement

Can Image Splicing and Copy-Move Forgery Be Detected by the Same Model? Forensim: An Attention-Based State-Space Approach

Distilling Offline Action Detection Models into Real-Time Streaming Models

AuthGuard: Generalizable Deepfake Detection via Language Guidance

GroupPortrait: Multi-ID Portrait Generation with High Identity Preservation and Fine-Grained Control

GFT-GCN: Privacy-Preserving 3D Face Mesh Recognition with Spectral Diffusion

Denoise, Divide, Distill, and Predict (D3P): Towards Forecasting Long-horizon Real-world Anomaly from Normalcy

See, Record, Do: Automated Generation of UI Workflows from Tutorial Videos

Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control

FocalComm: Hard Instance-Aware Multi-Agent Perception

SFMNet: Sparse Focal Modulation for 3D Object Detection

MoSCo: Real-time and Efficient Text-to-Motion Synthesis via Delta Training

VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion

Predicting Task fMRI Contrasts from Resting-State fMRI Using Sparse 3D Convolutions

SceneEdited: A City-Scale Benchmark for 3D HD Map Updating via Image-Guided Change Detection

Overcoming Small Data Limitations in Video-Based Infant Respiration Estimation

Non-Aligned Reference Image Quality Assessment for Novel View Synthesis

Pointmap-Conditioned Diffusion for Consistent Novel View Synthesis

TED-4DGS: Temporally Activated and Embedding-based Deformation for 4DGS Compression

QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain

Single-step Diffusion for Image Compression at Ultra-Low Bitrates

Virtually Unrolling the Herculaneum Papyri by Diffeomorphic Spiral Fitting

Diffusion Noise Optimization for Synthetic VLM Training

Histogram Assisted Quality Aware Generative Model for Resolution Invariant NIR Image Colorization

Polymorph: Energy-Efficient Multi-Label Classification for Video Streams on Embedded Devices

A Unified Diffusion-Based Framework for Multi-Agent Trajectory Prediction Integrating Structured Multi-Modal Representations

ChartQA-X: Generating Explanations for Visual Chart Reasoning

Distribution Highlighted Reference-based Label Distribution Learning for Facial Age Estimation

T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation

UniTabBank: A Large Scale Multi-Lingual, Multi-Layout, Multi-Type, Multi-Format Dataset for Table Detection

R-MMA: Enhancing Vision-Language Models with Recurrent Adapters for Few-Shot and Cross-Domain Generalization

Training-Free Few-Shot Segmentation via Vision-Language Guided Prompting

High-Level Semantics and Low-Level Features Fusion for Multi-Scale Object Detection in Dynamic Construction Environments

Mean-Shift Distillation for Diffusion Mode Seeking

TacticalCalib: End-to-End 6-DoF Camera Pose Regression for Tactical Camera Calibration

F-INR: Functional Tensor Decomposition for Implicit Neural Representations

V2XScene: Multi-View Consistent 3D Scene Simulation for Collaborative Perception

WiSAR3D - Aerial LiDAR dataset for 3D object detection

A Deep Network for Object Detection on Inland Waters

CoreCaption: Core Caption based Text-to-Video Retrieval

ZebraPose: Zebra Detection and Pose Estimation using only Synthetic Data

FAIR-SIGHT: Fairness Assurance in Image Recognition via Simultaneous Conformal Thresholding and Dynamic Output Repair

GDoFS: Gaussian DoF Separation for Plausible 3D Geometry in Sparse-View 3DGS

Learning Beyond Labels: Self-Supervised Handwritten Text Recognition

Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment

Neural Geometry Image-Based Representations with Optimal Transport (OT)

RealDroneVision: Dataset and Architecture Advancements for Small-Object Drone Detection

PerVL-Bench: Benchmarking Multimodal Personalization for Large Vision–Language Models

TRACE: Confounder-free Adversarial Fine-tuning for Robust Object Detection

Zero-LEAD: Source-Free Universal Domain Adaptation for Abdominal Multi-Organ Segmentation

Correcting and Quantifying Systematic Errors in 3D Box Annotations for Autonomous Driving

FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding

Point2Pose: A Generative Framework for 3D Human Pose Estimation with Multi-View Point Cloud Dataset

UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models

Optimal Transport for Rectified Flow Image Editing: Unifying Inversion-Based and Direct Methods

Synthesizing Compositional Videos from Text Description

S2O: Static to Openable Enhancement for Articulated 3D Objects

HodgeFormer: Transformers for Learnable Operators on Triangular Meshes through Data-Driven Hodge Matrices

Exploring Automated Recognition of Instructional Activity and Discourse from Multimodal Classroom Data

From Prompt to Production: Automating Brand-Safe Marketing Imagery with Text-to-Image Models

Equivariant Sampling for Improving Diffusion Model-based Image Restoration

PoseAdapt: Sustainable Human Pose Estimation via Continual Learning Benchmarks and Toolkit

SAVeD: Learning to Denoise Low-SNR Video for Improved Downstream Performance

ATM: Enhanced Alignment for Text-to-Motion Generation

Memoire: Learning User Personas from Gallery Tags for Personalized Photo Curation

CSGaussian: Progressive Rate-Distortion Compression and Segmentation for 3D Gaussian Splatting

Test-Time Adaptation for Video Highlight Detection Using Meta-Auxiliary Learning and Cross-Modality Hallucinations

SD-CSFL: A Synthetic Data-Driven Conformity Scoring Framework for Robust Federated Learning

Alignment and Distillation: A Robust Framework for Multimodal Domain Generalizable Human Action Recognition

Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport

Clear Sights on Site: A Spatial-Adaptive Channel Network for Deblurring Construction Site Images

SegMo: Segment-aligned Text to 3D Human Motion Generation

From Darkness to Detail: Frequency-Aware SSMs for Low-Light Vision

Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships

ForestSplats: Deformable transient field for Gaussian Splatting in the Wild

Graph Query Networks for Object Detection with Automotive Radar

FlowCLAS: Enhancing Normalizing Flow-Based Anomaly Segmentation Via Contrastive Learning

ScoreNet: Netting Lightweight Quality Scores for Better Visual Assessment with Large Multi-Modality Models

VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics

Digital Forensic AI You Can Explain: A Case Study on Video Source Camera Identification

Modeling and Learning Multiple Hypotheses for Monocular 3D Object Detection

SGD-Mix: Enhancing Domain-Specific Image Classification with Label-Preserving Data Augmentation

DreamCatcher: Efficient Multi-Concept Customization via Representation Finetuning

Enhancing Monocular 3D Hand Reconstruction with Learned Texture Priors

PALMS+: Modular Image-Based Floor Plan Localization Leveraging Depth Foundation Model

IPCD: Intrinsic Point-Cloud Decomposition

Multimodal Graph Representation Learning over Arbitrary Sets of Modalities

FNOPT: Resolution-Agnostic, Self-Supervised Cloth Simulation using Meta-Optimization with Fourier Neural Operators

D2Mamba: Dual Domain Guided Informed Search in State Space Model for Underwater Image Enhancement

Image-Guided Semantic Pseudo-LiDAR Point Generation for 3D Object Detection

HABIT: Human Action Benchmark for Interactive Traffic in CARLA

EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models

AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization

Exploring the Boundaries of Diffusion Models for Offline Writer Identification with Sparse and Intra-Variable Data

CLoCKDistill: Consistent Location-and-Context-aware Knowledge Distillation for DETRs

MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding

(ends 12:15 PM)

noon

Break:

Lunch

(ends 1:30 PM)

1:30 p.m.

Oral Session 8A: Biomedical, Healthcare, and Medicine [1:30-2:30]

Orals 1:30-2:30

[1:30] Cycle-consistent Multi-graph Matching for Self-supervised Annotation of C. Elegans

[1:42] Automated Suturing Skill Assessment in Robot-assisted Surgery from Endoscopic Videos using Clinically-guided Evaluation Criteria

[1:54] Deep Image Decomposition for Medical Imaging Anonymization and Curation

[2:06] Intraoperative 2D/3D Registration via Spherical Similarity Learning and Differentiable Levenberg-Marquardt Optimization

[2:18] ACuRE: Accurate Continuity-Regularized SpO2 Estimation Using Liquid Time-Constant Networks

(ends 2:30 PM)

Oral Session 8B: Video Recognition and Understanding II [1:30-2:30]

Orals 1:30-2:30

[1:30] CAST: Evaluating Multi-Object Trackers with Context-Aware Switch and Transfer Scores

[1:42] Advancing Player Identification and Tracking with Global ID Fusion (GIF)

[1:54] Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs

[2:06] LASER: Lip Landmark Assisted Speaker Detection for Robustness

[2:18] VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models

(ends 2:30 PM)

2:45 p.m.

Oral Session 9B: Machine Learning II [2:45-3:45]

Orals 2:45-3:33

[2:45] IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers

[2:57] MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data

[3:09] Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training

[3:21] Locally Explaining Prediction Behavior via Gradual Interventions and Measuring Property Gradients

(ends 3:45 PM)

Oral Session 9A: Generative Models III [2:45-3:45]

Orals 2:45-3:45

[2:45] SCAdapter: Content-Style Disentanglement for Diffusion Style Transfer

[2:57] T2LF: LLM-Guided Multimodal Diffusion for Text-to-Light Field Synthesis

[3:09] VideoSketcher: A Training-Free Approach for Coherent Video Sketch Transfer

[3:21] Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

[3:33] SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis

(ends 3:45 PM)

3:45 p.m.

Poster Session 6 + Refreshments [3:45-5:30]

Posters 3:45-5:30

Cycle-consistent Multi-graph Matching for Self-supervised Annotation of C. Elegans

Automated Suturing Skill Assessment in Robot-assisted Surgery from Endoscopic Videos using Clinically-guided Evaluation Criteria

Deep Image Decomposition for Medical Imaging Anonymization and Curation

Intraoperative 2D/3D Registration via Spherical Similarity Learning and Differentiable Levenberg-Marquardt Optimization

ACuRE: Accurate Continuity-Regularized SpO2 Estimation Using Liquid Time-Constant Networks

CAST: Evaluating Multi-Object Trackers with Context-Aware Switch and Transfer Scores

Advancing Player Identification and Tracking with Global ID Fusion (GIF)

Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs

LASER: Lip Landmark Assisted Speaker Detection for Robustness

VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models

SCAdapter: Content-Style Disentanglement for Diffusion Style Transfer

T2LF: LLM-Guided Multimodal Diffusion for Text-to-Light Field Synthesis

VideoSketcher: A Training-Free Approach for Coherent Video Sketch Transfer

Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis

IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers

MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data

Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training

Locally Explaining Prediction Behavior via Gradual Interventions and Measuring Property Gradients

Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone

Where is the Watermark? Interpretable Watermark Detection at the Block Level

PointSt3R: Point Tracking through 3D Ground Correspondence

Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation

FairVLM: Enhancing Fairness and Prompt Sensitivity in Vision Language Models for Medical Image Segmentation

SHaSaM: Submodular Hard Sample Mining for Fair Facial Attribute Recognition

DM3Net: Dual-Camera Super-Resolution via Domain Modulation and Multi-scale Matching

SuperRivolution: Fine-Scale Rivers from Coarse Temporal Satellite Imagery

Adversarial Pseudo-replay for Exemplar-free Class-incremental Learning

DiffRegCD: Integrated Registration and Change Detection with Diffusion Features

HDR Reconstruction Boosting with Training-Free and Exposure-Consistent Diffusion

Optimizing against Infeasible Inclusions from Data for Semantic Segmentation through Morphology

3D Cell Oversegmentation Correction via Geo-Wasserstein Divergence

TopoRec: Point Cloud Recognition Using Topological Data Analysis

MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping

Remote Sensing Forestry Similarity Convolution

RampWatch: An In-the-Wild Dataset and Text-Guided Detection Framework for Recreational Vessels

Enhancing Reverse Distillation with Core Exemplar Learning for Unified Multi-Class Anomaly Detection

Leveraging Sparsity for Privacy in Collaborative Inference

Improvise, Adapt, Overcome — Telescopic Adapters for Efficient fine-tuning of Vision Language Models in Medical Imaging

SVD-Det: A Lightweight Framework for Video Forgery Detection Using Semantic and Visual Defect Cues

Joint Optimization of Camera Model and Deep Neural Network for Image Recognition

Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models

ReFineVQA: Iterative Refinement of Video Description via Feedback Generation for Video Question Answering

MIST: Multilingual Incidental Dataset for Scene Text Detection

Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios

MemeTAG: Keyword-Driven Meme Classification through Tag Embedding Reconstruction

Leveraging Semantic Attribute Binding for Free-Lunch Color Control in Diffusion Models

MedPEFT-CL: Dual-Phase Parameter-Efficient Continual Learning with Medical Semantic Adapter and Bidirectional Memory Consolidation

Splatter Layout: Geometry-embedded 3D Reconstruction via Surface Unfolding

HOLO: Holistic Lightweight Optimization for Scene Understanding with Auto-Annotation and Multimodal Learning

PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval

GRAPE (Gaussian Rendering for Accelerated Pixel Enhancement) Brings Fast and Lightweight Arbitrary Super-Resolution

Fetal and Neonatal Cortical Surface Reconstruction with Anatomical Normal-guidance and Perceptual Enhancements

View-aware Cross-modal Distillation for Multi-view Action Recognition

NRGMark: Localized Watermarking for Energy Transparency in Images

Test-Time Consistency in Vision Language Models

Revisiting an Old Perspective Projection for Monocular 3D Morphable Models Regression

General and Domain-Specific Zero-shot Detection of Generated Images via Conditional Likelihood

Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation

Being Positive about Negative Queries: Exclusion Aware Multimodal Retrieval using Disentangled Representations

DualRes: Production-ready Dynamic Object Detection

Semantic Map Guided Bird's-Eye View Learning for Online HD Map Construction

ISALux: Illumination and Semantics-Aware Transformer Employing Mixture of Experts for Low Light Image Enhancement

FastPose-ViT: A Vision Transformer for Real-Time Spacecraft Pose Estimation

QuEENet: Quantum-Enhanced Expressive Network for Image Classification

SOLAR: Switchable Output Layer for Accuracy and Robustness in Once-for-All Training

FG-TRACER: Tracing Information Flow in Multimodal Large Language Models in Free-Form Generation

Patch Your Matcher: Correspondence-Aware Image-to-Image Translation Unlocks Cross-Modal Matching via Single-Modality Priors

Diversity Preserving Coresets for Image Quality Assessment

SAVE: Sparse Autoencoder‑Driven Visual Information Enhancement for Mitigating Object Hallucination

NavMapFusion: Diffusion-based Fusion of Navigation Maps for Online Vectorized HD Map Construction

GFT: Graph Feature Tuning for Efficient Point Cloud Analysis

QAL: A Loss for Recall–Precision Balance in 3D Reconstruction

Meta-YOLO: Metadata-Guided Real-Time Object Detector in Aerial Imagery

Q-Former Autoencoder: A Modern Framework for Medical Anomaly Detection

AusSmoke meets MultiNatSmoke: a fully-labelled diverse smoke segmentation dataset

NAPP: Noise-Adaptive Prototype Perturbation for Few-Shot Learning

GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting

Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting

Optimization-Free Style Transfer for 3D Gaussian Splats

PEaRL: Pathway-Enhanced Representation Learning for Gene and Pathway Expression Prediction from Histology

Flood-LDM: Generalizable Latent Diffusion Models for rapid and accurate zero-shot High-Resolution Flood Mapping

LangPose: Language-Aligned Motion for Robust 3D Human Pose Estimation

SphereEdit: Spherical Semantic Editing in Diffusion Models

FedEFC: Federated Learning Using Enhanced Forward Correction Against Noisy Labels

Photo Dating by Facial Age Aggregation

Scalable Video Action Anticipation with Cross Linear Attentive Memory

DUDA: Distilled Unsupervised Domain Adaptation for Lightweight Semantic Segmentation

Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study

START: Spatial and Textual Learning for Chart Understanding

VRAgent: Self-Refining Agent for Zero-Shot Multimodal Video Retrieval

MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps

IMPACT: Interpretable Most Important Person Analysis and Classification using Transformer-based Models

SurgXBench: Explainable Vision-Language Model Benchmark for Surgery

SeqFeedNet: Sequential Feature Feedback Network for Background Subtraction

Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors

VitaGlyph: Vitalizing Artistic Typography with Flexible Dual-branch Diffusion Models

AortaDiff: A Unified Multitask Diffusion Framework for Contrast-Free AAA Imaging

DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment

Bridging the Domain Gap in Small Multimodal Models: A Dual-level Alignment Perspective

Crash2DocAI: Automated Integration of Post-Crash Car Part Images into Technical Reports

Conversational Image Generation: Towards Multi-Round Personalized Generation with Multi-Modal Language Models

CSF-Net: Context-Semantic Fusion Network for Large Mask Inpainting

SceneShine: Illumination-aware Human Scene Gaussian Re-Splatting from Mobile Device Video

See, Think, Learn: A Self-Taught Multimodal Reasoner

SpecGen: Neural Spectral BRDF Generation via Spectral-Spatial Tri-plane Aggregation

Pretraining Helps When Capacity Allows: Evidence from Ultra-Small ConvNets

Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs

UltraClean: A Simple Framework to Train Robust Neural Networks against Backdoor Attacks

GorillaWatch: An Automated System for In-the-Wild Gorilla Re-Identification and Population Monitoring

Reciprocal Teaching: Dynamic Multi-Model Teacher-Student Learning for Multiple Noisy Annotations

DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis

TriaGS: Differentiable Triangulation-Guided Geometric Consistency for 3D Gaussian Splatting

SDT-6D: Fully Sparse Depth-Transformer for Staged End-to-End 6D Pose Estimation in Industrial Multi-View Bin Picking

Generalized Category Discovery for LiDAR Semantic Segmentation

ICONIC-444: A 3.1-Million-Image Dataset for OOD Detection Research

Any Detector Can Detect Anything

Towards Unconstrained Cross-View Pose Estimation

Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization

SmoothDiffusion-VE: Real-time Generative Video Editing Using Adaptive Feature Cache

SafeguardGS: 3D Gaussian Primitive Pruning While Avoiding Catastrophic Scene Destruction

Uncertainty-Aware Vision-Language Segmentation for Medical Imaging

Stabilizing Direct Training of Spiking Neural Networks: Membrane Potential Initialization and Threshold-robust Surrogate Gradient

DocWaveDiff: A Predict-and-Refine approach for Document Image Enhancement with Wavelet U-Nets and Diffusion models

Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting

Rethinking Latent Variable in Learned Image Compression

AugMapNet: Improving Spatial Latent Structure via BEV Grid Augmentation for Enhanced Vectorized Online HD Map Construction

From Cognitive Priors to Instance Semantics: A Unified Framework for Multi-task Affective Computing

FuLLaMa: Training-free Diffusion-based Object Removal with Context Preservation

STRinGS: Selective Text Refinement in Gaussian Splatting

VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning

Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering

CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation

FlyPose: Towards Robust Human Pose Estimation From Aerial Views

MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities

Exploiting Label-Independent Regularization from Spatial Patterns for Whole Slide Image Analysis

SVS-GAN for Semantic Synthesis of Traffic Videos for Autonomous Driving

Integrating Multi-scale and Multi-filtration Topological Features for Medical Image Classification

NeuroBridge: Few-Shot Cross-Modal Neuron Re-identification via Dual-Channel Deep Metric Learning

Sketch3R: Rapid and Realistic 3D VR Sketch Creation to Shape Retrieval

PhysEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education

Dual-Domain Multimodal Hyperbolic Fusion for Cardiopulmonary Disease Diagnosis in Emergency Care

(ends 5:30 PM)