WACV 2026 Events with Videos

Below are all events that have video recordings. We are still processing recordings; they will appear here as we upload them to the website.

Keynotes

Sparse View Synthesis
Applications of Computer Vision in Healthcare: The Road to Autonomy
A Short History of Video Understanding - Past, Present, and Future

Meetings

PAMI-TC meeting

Posters

VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction
BanglaProtha: Evaluating Vision Language Models in Underrepresented Long-tail Cultural Contexts
Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis
Mitigating Backdoor Attacks via Trigger Reconstruction and Model Hardening
No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts
Revisiting Layer Normalization for Point Cloud Test Time Adaptation
Leveraging Pretrained Representations for Cross-Modal Point Cloud Completion
Cluster-Guided Adversarial Perturbations for Robust Contrastive Learning
ZonUI-3B: Competitive GUI Grounding with a 3B VLM Trained on a Single Consumer GPU
CountingDINO: A Training-free Pipeline for Class-Agnostic Counting using Unsupervised Backbones
FAE-Net: Fashion Attribute Editing via Disentangled Latent Conditioning in Diffusion Models
Understanding Human-Like Biases in VLMs via Subjective Face Analytics
Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters
A Fast, Simple, and Flexible Scale Informative Feature Transform Module for Arbitrary Scale Image Super-Resolution
ImageNet-sES: A First Systematic Study of Sensor–Environment Simulation Anchored by Real Recaptures
M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models
Interleaved Vision-and-Language Generation via Generative Voken
From Lightweight CNNs to SpikeNets: Benchmarking Accuracy–Energy Tradeoffs with Pruned Spiking SqueezeNet
PSA-MIL: A Probabilistic Spatial Attention-Based Multiple Instance Learning for Whole Slide Image Classification
DreamAnywhere: Object-Centric Panoramic 3D Scene Generation
Enabling High-Quality In-the-Wild Imaging from Severely Aberrated Metalens Bursts
A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions
Beyond the Encoder: Joint Encoder-Decoder Contrastive Pre-Training Improves Dense Prediction
Gaussian Representations for Video
Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?
GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection
IDEAL-M3D: Instance Diversity-Enriched Active Learning for Monocular 3D Detection
PS3: Part level instance segmentation in 3D
Training-free Multi-view 4D Human Motion Reconstruction Virtual Reality System
Mixed Diffusion for 3D Indoor Scene Synthesis
MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval
A Multi-Agent Diffusion Approach for MRI Anomaly Segmentation via Modality-Specific LoRA Specialization
Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space
ChameleonTuner: Automatic ISP Color Tuning in Subjective Scenarios
EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation
Inpaint360GS: Efficient Object-Aware 3D Inpainting via Gaussian Splatting for 360° Scenes
Towards Fast and Scalable Normal Integration using Continuous Components
Cluster-based Pseudo-labeling for Semi-Supervised LiDAR Semantic Segmentation
Root Completion from Intraoral Scans of Tooth Crowns using Diffusion with Patch Perturbation
SilverLining: Data-First Mitigation of Spatial and Spectral Shortcuts Without Introducing New Confounders
BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis
Network-agnostic distortion-robust projections for wide-angle image understanding
mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
TS-PCI: Point Cloud Frame Interpolation with Time-Aware Point Cloud Sampling and Self-Supervised Learning Strategy
Odo: Depth-Guided Diffusion for Identity-Preserving Body Reshaping
MEDAL: multi-modal MEta-space Distillation and ALignment for Visual Compatibility Learning
Model-free Domain Adaptation for Concealed Multimodal Large-Language Models
Human Pose Aggregation for Multi-View Temporal Video Alignment
M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models
Data-Driven Lipschitz Continuity: A Cost-Effective Approach to Improve Adversarial Robustness
CAAC: Confidence-Aware Attention Calibration to Reduce Hallucinations in Large Vision-Language Models
OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting
Autocorrelation-Based Fiducial Markers for Traceability
Accelerated Dose Generation in Gamma Knife Radiosurgery Using a Wavelet Diffusion Model for Sparse Representation
A framework for real-time Surgical Phase Recognition with application to Robot-Assisted Partial Nephrectomy
Eff-GRot: Efficient and Generalizable Rotation Estimation with Transformers
Saliency-Guided DETR for Moment Retrieval and Highlight Detection
Federated Model Synchronization for Diagnostic Redefinition through a Novel Selective Parameter Unlearning
PaRaChute: Pathology-Radiology Cross-Modal Fusion for Missing-Modality-Robust Survival Prediction
Beyond Realism: Learning the Art of Expressive Composition with StickerNet
SGPMIL: Sparse Gaussian Process Multiple Instance Learning
Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources
ART: Actor-Related Tubelet for Detecting Complex-shaped Action Tubes
Learning to Animate Images from A Few Videos to Portray Delicate Human Actions
Marshaled Learning: Bridging Large Neural Networks with Memory-Constrained Trusted Execution Environments in Federated Learning
OW-Rep: Open World Object Detection with Instance Representation Learning
TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors
CADE: Continual Weakly-supervised Video Anomaly Detection with Ensembles
MBTI: Metric-Based Textual Inversion for Fine-Grained Image Generation
AutoSew: A Geometric Approach to Stitching Prediction with Graph Neural Networks
4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos
SpikeRain: Towards Energy-Efficient Single Image Deraining with Spiking Neural Networks
Zero-Shot Video Deraining with Video Diffusion Models
A Universal Self-Attention Enhancement for Bridging Low-bit Quantization and Vision Transformers
From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance
Morphing Through Time: Diffusion-Based Bridging of Temporal Gaps for Robust Alignment in Change Detection
SCORP: Scene-Consistent Object Refinement via Proxy Generation and Tuning
CommonForms: A Large, Diverse Dataset for Form Field Detection
Text Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters
MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
End-to-End Fine-Tuning of 3D Texture Generation using Differentiable Rewards
Reverse Personalization
SymNet: A Multi-Task Network for Joint Radio Map Reconstruction and Transmitter Localization
MIX-based Foreground and Background Patch Augmentation Guided by Physics and Material Properties for X-ray Detection
ViGG: Robust RGB-D Point Cloud Registration using Visual-Geometric Mutual Guidance
MR-Pruner: Training-free Multi-resolution Visual Token Pruning for Multi-modal Large Language Models
UnderWater SLAM with Laser-light sectioning method using ST-GAT
LooC: Effective Low-Dimensional Codebook for Compositional Vector Quantization
Fully Unsupervised Self-debiasing of Text-to-Image Diffusion Models
Salience-SGG: Enhancing Unbiased Scene Graph Generation with Iterative Salience Estimation
Systematic Analysis of the Unintentional CSAM-Generation-Potential of Text-to-Image Models
Enhanced Back-Projection of Vision Features for 3D Symmetry Detection
Unified Video Anomaly Detection Model for Detecting Different Anomaly Types
MARS: a Multimodal Alignment and Ranking System for Few-Shot Segmentation
GrounDiff: Diffusion-Based Ground Surface Generation from Digital Surface Models
AdaptViG: Adaptive Vision GNN with Exponential Decay Gating
Overcoming Fine-Grained Visual Challenges in Animal Re-Identification via Semantic Feature Alignment
Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?
HyperPose: Hyper-pose Embeddings for 3D-Aware Generative Models with Self-Supervised Disentangling of Pose and Scene
MooTrack360: A Novel Fisheye Camera Dataset for Robust Multi Dairy Cow Detection and Tracking
Diverse Sketch Colorization with Content-Enhanced Style Representation and Recolorization Distillation
SaccadeX: Directed Acyclic Graph-based Semi-Supervised Learning of Continuous Ocular Dynamics from Sparse Neuromorphic Streams
Beyond Real Weights: Hypercomplex Representations for Stable Quantization
Analysis of Text Accuracy and Visual Alignment in Vision-Language Models for Artistic Text Generation
ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models
How to Design and Train Your Implicit Neural Representation for Video Compression
QuadraNet V2: Efficient and Sustainable Training of High-Order Neural Networks with Quadratic Adaptation
Feature-Disentangling RGB-NIR Fusion Network for Remote Driver Physiological Measurement
Deepfake Detection that Generalizes Across Benchmarks
Reinforcement Learning-based Adaptive Control of Classifier-Free Guidance and Timestep Embeddings in Diffusion Models
Zero-Shot Coreset Selection via Iterative Subspace Sampling
MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions
SAFER-AiD: Saccade-Assisted Foveal-peripheral vision Enhanced Reconstruction for Adversarial Defense
Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding
Temporal Object Captioning for Street Scene Videos from LiDAR Tracks
RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph
QCFace: Image Quality Control for boosting Face Representation & Recognition
CraftSVG: Multi-Object Text-to-SVG Synthesis via Layout Guided Diffusion
Pyramidal Spectrum: Frequency-based Hierarchically Vector Quantized VAE for Videos
Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning
Revisiting Retentive Networks for Fast Range-View 3D LiDAR Semantic Segmentation
Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning
You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction
Multimodal Medical Image Binding via Shared Text Embeddings
Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention
Harnessing Object Grounding for Time-Sensitive Video Understanding
MEGA-PCC: A Mamba-based Efficient Approach for Joint Geometry and Attribute Point Cloud Compression
Layout Anything: One Transformer for Universal Room Layout Estimation
TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning
CycleSL: Server-Client Cyclical Update Driven Scalable Split Learning
Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model
MarineEval: Assessing the Marine Intelligence of Vision-Language Models
CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video
Tables Guide Vision: Learning to See the Heart through Tabular Data
OpenCowID: Zero-Shot Visual Identification of Dairy Cows
UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations
GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction
Human knowledge integrated multi-modal learning for single source domain generalization
milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion
RobustFormer: Noise-Robust Pre-training for Images and Videos
Imitating the Functionality of Image-to-Image Models Using a Single Example
Shift-Equivariant Complex-Valued Convolutional Neural Networks
LENVIZ: A High-Resolution Low-Exposure Night Vision Benchmark Dataset
AnyBald: Toward Realistic Diffusion-Based Hair Removal In-The-Wild
WWE-UIE: A Wavelet & White Balance Efficient Network for Underwater Image Enhancement
Personalized Image Privacy Advisors via Federated Daisy-Chaining
Illuminating Darkness: Learning to Enhance Low-light Images In-the-Wild
SSplain: Sparse and Smooth Explainer for Retinopathy of Prematurity Classification
Divide and Refine: Enhancing Multimodal Representation and Explainability for Emotion Recognition in Conversation
A-V Representation Learning via Audio Shift Prediction for Multimodal Deepfake Detection and Temporal Localization
Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models
Orca: Object Recognition and Comprehension for Archiving Marine Species
ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora
PVeRA: Probabilistic Vector-Based Random Matrix Adaptation
From Few-Shot to Zero-Shot Pallet Load Recognition: A Deployed Embedding-Based Vision System for Industrial Logistics
Tables Decoded: DELTA for Structure, TARQA for Understanding
DiRe: Diversity-promoting Regularization for Dataset Condensation
AD2: Analysis and Detection of Adversarial Threats in Visual Perception for End-to-End Autonomous Driving Systems
BoxSplitGen: A Generative Model for 3D Part Bounding Boxes in Varying Granularity
Graph-Based Spectral Attention with Multi-Spectral Images for Illuminant Estimation
SUGAR: A Sweeter Spot for Generative Unlearning of Many Identities
DynaGSLAM: Real-Time Gaussian-Splatting SLAM for Online Rendering, Tracking, Motion Predictions of Moving Objects in Dynamic Scenes
PADM: A Physics-aware Diffusion Model for Attenuation Correction
ObjectMeshDeform: Towards recovering precise 3D geometry of real objects via image-guided mesh deformation of 3D generative priors
STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences
Vision-informed Semantic Text Alignment for Open-set Recognition in Remote Sensing
Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions
FlowMorph: Revealing an Optimizable Flow Latent Space for Controlled Image Morphing
CURIO: Curvature-Aligned and Efficient OCR for Low-Resource Historical Manuscripts
ScoliGaitX: A Deep Multi-Modal Fusion Network for Scoliosis Assessment via Gait Video Analysis
Beyond the Highlights: Video Retrieval with Salient and Surrounding Contexts
Uncertainty-Aware Subset Selection for Robust Visual Explainability under Distribution Shifts
Enhancing Object Detection Training via Joint Image-Annotation Generation
UI-Styler: Ultrasound Image Style Transfer with Class-Aware Prompts for Cross-Device Diagnosis Using a Frozen Black-Box Inference Network
BAFIS: Dataset + Framework to assess occupational Bias and Human Preference in modern Text-to-image Models
NerVast: Compression-Efficient Scaling of Implicit Neural Video Representations via Scene-based Parameter-sharing (Yunheon Lee, Juncheol Ye, Jaehong Kim, Dongsu Han)
Dressing the Imagination: A Dataset for AI-Powered Translation of Text into Fashion Outfits and A Novel NeRA Adapter for Enhanced Feature Adaptation
Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery
Roadside Monocular 3D Detection Prompted by 2D Detection
Learnable Query-Enhanced Pose Transformation
BOP-Distrib: Revisiting 6D Pose Estimation Benchmarks for Better Evaluation under Visual Ambiguities
From Bands to Depth: Understanding Bathymetry Decisions on Sentinel-2
MMCM: Multimodality-aware Metric using Clustering-based Modes for Probabilistic Human Motion Prediction
mmWeaver: Environment-Specific mmWave Signal Synthesis from a Photo and Activity Description
Test-Time Adaptation through Semantically-guided Feature Decomposition for Few-shot Chest X-ray Diagnosis
SPAR-Det: Segmentation-guided and Prior-Aided Routing for Small Object Detection
DRWKV: Focusing on Object Edges for Low-Light Image Enhancement
Matching Semantically Similar Non-Identical Objects
3D Gaussian Point Encoders
Segmentation-Aware Latent Diffusion for Satellite Image Super-Resolution: Enabling Smallholder Farm Boundary Delineation
Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models
GeoHSAF: Geometric Hippocampus Shape Analysis Framework for Longitudinal Alzheimer's Disease Classification
Cosine Similarity is Almost All You Need (for Prototypical-Part Models)
RobustGait: Robustness Analysis for Appearance Based Gait Recognition
DenseBEV: Transforming BEV Grid Cells into 3D Objects
Color Bind: Exploring Color Perception in Text-to-Image Models
HyPCA-Net: Advancing Multimodal Fusion in Medical Image Analysis
DMAT: An End-to-End Framework for Joint Atmospheric Turbulence Mitigation and Object Detection
DoTA: Latent Distribution Conditioned Data Attribution for Diffusion Models
Trajectory Tactics: When Transformers Learn Exploration to Generate Online Signature
Reconstructing Realistic and Relightable Eyes
MorphXAI: An Explainable Framework for Morphological Analysis of Parasites in Blood Smear Images
Semi-Supervised Hierarchical Open-Set Classification
Ordinal-Aware Multimodal Engagement Recognition for Collaborative Learning
EllipssianNet: Image-guided Sampling of 2D Gaussians for Gaussian Splatting
Reviving Unsupervised Optical Flow: Concept Reevaluation, Multi-Scale Advances and Full Open-Source Release
ArchitectHead: Continuous Level of Detail Control for 3D Gaussian Head Avatars
From SAM to DINOv2: Towards Distilling Foundation Models to Lightweight Baselines for Generalized Polyp Segmentation
Conditional Text-to-Image Generation with Reference Guidance
UniCalib: Targetless LiDAR-camera Calibration via Probabilistic Flow on Unified Depth Representations
Understanding Generative AI Capabilities in Everyday Image Editing Tasks
FAST-EQA: Efficient Embodied Question Answering with Global and Local Region Relevancy
Mitigating the Modality Gap: Few-Shot Out-of-Distribution Detection with Multi-modal Prototypes and Image Bias Estimation
OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models
Identity Verification from Human Scent using Channel Representation of 2D Gas Chromatography-Mass Spectrometry Data
Detection-Driven Object Count Optimization for Text-to-Image Diffusion Models
ConsensusXAI: A framework to examine class-wise agreement in medical imaging
PHYSPLAT: a Framework for Photorealistic Hybrid Simulation of Real and Synthetic Elements using 3D Gaussian Splatting
BrandFusion: Aligning Image Generation with Brand Styles
Detecting Out-of-Distribution Objects through Class-Conditioned Inpainting
ITSELF: Attention Guided Fine-Grained Alignment for Vision–Language Retrieval
InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation
FARF-Net: Frequency-guided Adaptive Receptive Field Network for Edge-enhanced Polyp Segmentation
Real-Time Tracking of Flexible Markers in Low-Contrast Fluoroscopy Using a Deep Neural Network Trained Solely on Synthetic Data
Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space
PRISM-CAFO: Prior-conditioned Remote-sensing Infrastructure Segmentation and Mapping for CAFOs
The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs
SasMamba: A Lightweight Structure-Aware Stride State Space Model for 3D Human Pose Estimation
Low-Rank Expert Merging for Multi-Source Domain Adaptation in Person Re-Identification
3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting
Perception-Inspired Color Space Design for Photo White Balance Editing
Efficient Text-Guided Convolutional Adapter for the Diffusion Model
Rethinking Real Image Editing: Unleashing Diverse Editing Operators via Multi-Objective Optimization
Power of Boundary and Reflection: Semantic Transparent Object Segmentation using Pyramid Vision Transformer with Transparent Cues
Towards Reliable Test-Time Adaptation: Style Invariance as a Correctness Likelihood
DTMIR-Pro: Domain Translation with Prompt-based Latent-Space Generalization for Multi-Weather Image Restoration
Sea-CLIP: Mining Semantic-Aware Representations for Few-Shot Anomaly Detection with CLIP
Towards Streaming LiDAR Object Detection with Point Clouds as Egocentric Sequences
LighthouseGS: Indoor Structure-aware 3D Gaussian Splatting for Panorama-Style Mobile Captures
SENCA-st: Integrating Spatial Transcriptomics and Histopathology with Cross Attention Shared Encoder for Region Identification in Cancer Pathology
Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models
Dragonite: Single-Step Drag-based Image Editing with Geometric-Semantic Guidance
Visual Detector Compression via Location-Aware Discriminant Analysis
NERVE: Neighbourhood & Entropy-Guided Random-Walk for Training Free Open-Vocabulary Segmentation
2S-CEDiff: A Two-Stage Diffusion Framework for Generating High-Fidelity Contrast-Enhanced CT Images from Non-Contrast Scans
DOTGraph: CLIP-Driven Feature Disentanglement and Optimal Transport based Graph Learning for Few-Shot Segmentation
Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling
Detecting Social Engagement of Elderly From Lifelog Image-streams to Identify Effective Cues for Autobiographic Recall
TiCLS: Tightly Coupled Language Text Spotter
Learning from Unknown for Open-Set Test-Time Adaptation
Frequency Is What You Need: Considering Word Frequency When Text Masking Benefits Vision-Language Model Pre-training
Multi-Modal Soccer Scene Analysis with Masked Pre-Training
HEART-PFL: Stable Personalized Federated Learning under Heterogeneity with Hierarchical Directional Alignment and Adversarial Knowledge Transfer
SCATR: Mitigating New Instance Suppression in LiDAR-based Tracking-by-Attention via Second Chance Assignment and Track Query Dropout
Crafting Descriptive Information for a Zero-shot Method to Improve Knowledge-Based Visual Question Answering Performance
Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory
High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization
Unified Alignment Protocol: Making Sense of the Unlabeled Data in New Domains
FedSCAl: Leveraging Server and Client Alignment for Unsupervised Federated Source-Free Domain Adaptation
Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction
Autoregressive Styled Text Image Generation, but Make it Reliable
FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation
Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models
Learning Group Actions In Disentangled Latent Image Representations
SimForce: Force and Surface Electromyography from Full Body Video with Graph Neural Nets
CLIP-IT: CLIP-based Pairing of Histology Images with Privileged Textual Information
LightGazeNet: A Lightweight GNN-based Architecture for Gaze Estimation
DATTA: Domain-Adversarial Test-Time Adaptation for Cross-Domain WiFi-Based Human Activity Recognition
Hybrid State Representation for Video Procedure Planning
Blur2Sharp: Human Novel Pose and View Synthesis with Generative Prior Refinement
Universal Neural Architecture Space: Covering ConvNets, Transformers and Everything in Between
Cross-Modal Event Encoder: Bridging Image–Text Knowledge to Event Streams
KD360-VoxelBEV: LiDAR and 360-degree Camera Cross Modality Knowledge Distillation for Bird’s-Eye-View Segmentation
Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation
Sketch-guided Cage-based 3D Gaussian Splatting Deformation
Gradient-Free Classifier Guidance for Diffusion Model Sampling
CasTex: Cascaded Text-to-Texture Synthesis via Explicit Texture Maps and Physically-Based Shading
Efficient Vision Transformers via Token Merging with Head-wise Attention Correction
SOAF: Scene Occlusion-aware Neural Acoustic Field
Fused Similarity Measure Based Alignment with Dual-Scale Adaptive Selection for Weakly Supervised Video Anomaly Detection
A Novel Metric for Detecting Memorization in Generative Models for Brain MRI Synthesis
Feature Inversion as a Lens on Vision Encoders
Stroke Modeling Enables Vectorized Character Generation with Large Vectorized Glyph Model
Towards High-Fidelity, Identity-Preserving Real-Time Makeup Transfer: Decoupling Style Generation
Guided Texture Segmentation via Coordinate-Aware Class-Ratio Mapping
Structure-Aware Feature Rectification with Region Adjacency Graphs for Training-free Open-Vocabulary Semantic Segmentation
Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression
Guided Model Merging for Hybrid Data Learning: Leveraging Centralized Data to Refine Decentralized Models
Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination
AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM
DCSHARP: 3D Gaussian Splatting with Direction Cosine Spherical Harmonics and Shape-Aware Pruning
Empowering Source-Free Domain Adaptation via MLLM-Guided Reliability-Based Curriculum Learning
ENCORE: A Neural Collapse Perspective on Out-of-Distribution Detection in Deep Neural Networks
StreetView-Waste: A Multi-Task Dataset for Urban Waste Management
FairScene: Learning Class-Disentangled 2D/3D Representations for Semantic Scene Completion
ExDDV: A New Dataset for Explainable Deepfake Detection in Video
Gen-AFFECT: Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy
SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering
SCALEX: Scalable Concept and Latent Exploration for Diffusion Models
Extreme Amodal Face Detection
ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion
Augmenting with NeRFs: Fast Relocalization on Densified Datasets
CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
VOCAL: Visual Odometry via ContrAstive Learning
Training-free Detection of Text-to-video Generations via Over-coherence
GAITGen: Disentangled Motion-Pathology Impaired Gait Generative Model -- Bringing Motion Generation to the Clinical Domain
One Model, Many Behaviors: Training-Induced Effects on Out-of-Distribution Detection
Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image
Show Me: Unifying Instructional Image and Video Generation with Diffusion Models
CaRS: A Causal Intervention Segmentation Framework and Benchmark Dataset for Autonomous Driving under Transitional Weather Conditions
Comp4D: Compositional 4D Scene Generation
OPFormer: Object Pose Estimation leveraging foundation model with geometric encoding
FLoMo-Net: A Novel Task-Adaptive Mixture of Experts Routing Framework with Frequency and Uncertainty Correction for Medical Image Segmentation
KMOPS: Keypoint-Driven Method for Multi-Object Pose and Metric Size Estimation from Stereo Images
TalkingHeadBench: A Multi-Modal Benchmark & Analysis of Talking-Head DeepFake Detection
Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting
Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning
Domain Generalizing DINO for Visual Regression via Latent Distractor Subspace Consistency
Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries
WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion
MuSACo: Multimodal Subject-Specific Selection and Adaptation for Expression Recognition with Co-Training
FlowEO: Generative Unsupervised Domain Adaptation for Earth Observation
Self-Supervised Visual Prompting for Cross-Domain Road Damage Detection
A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback
Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment
JOCA: Task-Driven Joint Optimisation of Camera Hardware and Adaptive Camera Control Algorithms
Perceptually Guided 3DGS Streaming and Rendering for Mixed Reality
HistoMILKD: A Multiple Instance Learning based Multi-Teacher Knowledge Distillation Framework for Whole Slide Image Classification
Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing
HumanBench: Two Heads, No Legs, But Mostly Human, the State of Generative Capabilities in T2I Models
CLIP’s Visual Embedding Projector is a Few-shot Cornucopia
Workzone3D: A Multimodal Dataset for 3D Work Zone Perception in Autonomous Driving
SPOC: Spatially-Progressing Object State Change Segmentation in Video
Pose-Diverse Multi-View Virtual Try-on from a Single Frontal Image via Diffusion Transformer
GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving
Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings
Self-Supervised Compression and Artifact Correction for Streaming Underwater Imaging Sonar
MANTA: Physics-Informed Generalized Underwater Object Tracking
AGENet: Adaptive Edge-aware Geodesic Distance Learning for Few-Shot Medical Image Segmentation
MuseDance: A Diffusion-based Music-Driven Image Animation System
Unsupervised Modular Adaptive Region Growing and RegionMix Classification for Wind Turbine Segmentation
ART-ASyn: Anatomy-aware Realistic Texture-based Anomaly Synthesis Framework for Chest X-Rays
MAFM³: Modular Adaptation of Foundation Models for Multi-Modal Medical AI
Mobile-Oriented Video Diffusion: Enabling Text-to-Video Generation on Mobile Devices Without Retraining, Compression, or Pruning
From Detection to Anticipation: Online Understanding of Struggles across Various Tasks and Activities
Codebook Knowledge with Mamba-Transformer For Low-Light Image Enhancement
CLUE: Bringing Machine Unlearning to Mobile Devices
Decoupling Shape and Texture in SAM-2 via Controlled Texture Replacement
TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model
Event-based Graph Representation with Spatial and Motion Vectors for Asynchronous Object Detection
DNA: Dual-branch Network with Adaptation for Open-Set Online Handwriting Generation
Latent Uncertainty-Aware Multi-View SDF Scan Completion
Decomposition Sampling for Efficient Region Annotations in Active Learning
Curve Skeletonization in Continuous domain for Meshes and Point Clouds
F-ViTA: Foundation Model Guided Visible to Infrared Translation
BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries
AEON: Adaptive Embedding Optimized Noise for Robust Watermarking in Diffusion Models
MoRe: Monocular Geometry Refinement via Graph Optimization for Cross-View Consistency
CropAT: Leveraging Diffusion-Generated Target-Like Cropped Objects for Pseudo-Label Refinement in Domain-Adaptive Object Detection
GAEA: A Geolocation Aware Conversational Assistant
TimeRefine: Temporal Grounding with Time Refining Video LLM
Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation
Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning
Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery
Unsupervised Segmentation by Diffusing, Walking and Cutting
Eye-for-an-eye: Appearance Transfer with Dense Semantic Correspondence in Diffusion Models
Sun-E: Dataset and Benchmark for Event-Based Sun Sensing
Revisiting Vision–Language Foundations for No-Reference Image Quality Assessment
More Than Memory Savings: Zeroth-Order Optimization Mitigates Forgetting in Continual Learning
Histopath-C: Towards Realistic Domain Shifts for Histopathology Vision-Language Adaptation
UCDSC: Open Set UnCertainty aware Deep Simplex Classifier for Medical Image Datasets
DARB-Splatting: Generalizing Splatting with Decaying Anisotropic Radial Basis Functions
Subspace-Guided Knowledge Distillation for Efficient Model Transfer
HiMix: Hierarchical Visual-Textual Mixing Network for Lesion Segmentation
SIAM: Synchronous Interaction Attention for Human Mesh Recovery
R3: Reconstruction, Raw, and Rain: Deraining Directly in the Bayer Domain
VLMs Guided Interpretable Decision Making in Autonomous Driving
Surgical Gaussian Surfels: Highly Accurate Real-time Surgical Scene Rendering using Gaussian Surfels
GrowTAS: Progressive Expansion from Small to Large Subnets for Efficient ViT Architecture Search
Learning spatio-temporal feature representations for video-based gaze estimation
Grounding Descriptions in Images informs Zero-Shot Visual Recognition
Lorentz Entailment Cone for Semantic Segmentation
Countering Multi-modal Representation Collapse through Rank-targeted Fusion
Spec-Gloss Surfels and Normal–Diffuse Priors for Relightable Glossy Objects
DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy
One-Cycle Structured Pruning via Stability-Driven Subnetwork Search
FujiView: Multimodal Late-Fusion for Predicting Scenic Visibility
INRetouch: Context Aware Implicit Neural Representation for Photography Retouching
Inpainting of Sparse Depth Maps from Monocular Depth-from-Focus on Pixel Processor Arrays
Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups
An Efficient Multi-Rater Setup Towards Personalized and Diversified Medical Image Segmentation
OMeGa: Joint Optimization of Explicit Meshes and Gaussian Splats for Robust Scene-Level Surface Reconstruction
Hierarchical Adaptive networks with Task vectors for Test-Time Adaptation
Style-Friendly SNR Sampler for Style-Driven Generation
PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection
WiSE-OD: Benchmarking Robustness in Infrared Object Detection
Lose Your Self (LoYS): an adversarial entropy-based unsupervised approach for model debiasing
A Dataset and Framework for Learning State-invariant Object Representations
Towards Egocentric 3D Hand Pose Estimation in Unseen Domains
PrevMatch: Revisiting and Maximizing Temporal Knowledge in Semi-Supervised Semantic Segmentation
ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models
Quantifying the Limits of Segmentation Foundation Models: Modeling Challenges in Segmenting Tree-Like and Low-Contrast Objects
Robust Multimodal Emotion Recognition from Incomplete Modalities via Query-Based Unimodal and Cross-Modal Learning
Multi-view stereo with multiple projectors for oneshot entire shape scan based on Neural SDF and DSSS demultiplexing
DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions
Towards Photorealistic Style Transfer with Multimodal Guidance and Robustness to Content Images in Arbitrary Styles
EndoPBR: Photorealistic Synthetic Data for Surgical 3D Vision via Physically-based Rendering
Beyond Faces: A Multimodal Person Clustering for Unconstrained Environments
ControlEvents: Controllable Synthesis of Event Camera Data with Foundational Prior from Image Diffusion Models
SSMRadNet: A Sample-wise State-Space Framework for Efficient and Ultra-Light Radar Segmentation and Object Detection
MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes
Patch-wise Retrieval: A Bag of Practical Techniques for Instance-level Matching
ControlVP: Interactive Geometric Refinement of AI-Generated Images with Consistent Vanishing Points
Are All Marine Species Created Equal? Performance Disparities in Underwater Object Detection
Ego-EXTRA: video-language Egocentric Dataset for EXpert-TRAinee assistance
Improving Animal Pose Estimation through Species Similarity Measures and Rigorous Label Definition
ProtoGMVAE: A Variational Auto-Encoder with True Gaussian Mixture Prior for Prototypical-based Self-Explainability
CalibBEV: LiDAR-Camera Calibration via BEV Alignment
CoL2A: Convolution-free Local Linear Attention for SpatioTemporal Event Processing
FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks
Context-Preserving Dermoscopic Editing: Mask-Guided Lesion-Aware Diffusion for Attribute Modification
X-JEPA: A Novel Joint Learning Cross-Modal Predictive Alignment Framework for Remote Sensing Image Retrieval
Zero-Shot Table Extraction in Business Documents: A Unified Benchmark with Error Taxonomy and Ecological Analysis
Improving Out-of-Distribution Detection Using Segmented Images and Cross-View Attention Fusion
WarpRF: Multi-View Consistency for Training-Free Uncertainty Quantification and Applications in Radiance Fields
Direct Visual Grounding by Directing Attention of Visual Tokens
Color Preserving CMOS-SPAD Fusion for Multi-Frame HDR
ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos
CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow
SynchroRaMa: Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding
HiGlassRM: Learning to Remove High-prescription Glasses via Synthetic Dataset Generation
Learning Unified Spatio-temporal Representations for Efficient Compressed Video Understanding
Rank-based Geographical Regularization: Revisiting Contrastive Self-Supervised Learning for Multispectral Remote Sensing Imagery
Line Art Colorization with Offset Prior-based Diffusion Model
WSSSP-Net: Weakly Supervised Semantic Segmentation Plugin Network for Face Anti-Spoofing
KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding
An improved architecture for part-based animal re-identification through semantic segmentation distillation
FCC: Fully Connected Correlation for One-Shot Segmentation
Enhancing Vision Language Corruption Robustness using Cross Distribution & Prompted Denoisers
DMS2F-HAD: A Dual-branch Mamba-based Spatial–Spectral Fusion Network for Hyperspectral Anomaly Detection
3D Superquadric Splatting
ProSkill: Segment-Level Skill Assessment in Procedural Videos
SeaClips: A Video Dataset for Maritime Object Detection
CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization
RemEdit: Efficient Diffusion Editing with Riemannian Geometry
VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping
SOPHY: Generating Simulation-Ready Objects with Physical Materials
Automated Pore Detection from In-Situ FDM 3D Printing Video: A Comparative Evaluation of Modern Segmentation Models
Edge-Aware Image Manipulation via Diffusion Models with a Novel Structure-Preservation Loss
DPBridge: Latent Diffusion Bridge for Dense Prediction
LASOR: Towards Clinically Transparent and Explainable Ophthalmic Report Generation via Lesion-Aware Segmentation
VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework
Semi-supervised Domain Adaptation via Mutual Alignment through Joint Error
SegMango: Early Deep Mango Yield Prediction based on Flower Segmentation and Weather Data
RobuMTL: Enhancing Multi-Task Learning Robustness Against Weather Conditions
One-shot Portrait Stylization via Geometric Alignment
Semi-supervised Key-Point Estimation for Echocardiography Video
PredMapNet: Future and Historical Reasoning for Consistent Online HD Vectorized Map Construction
Similarity-aware Probabilistic Embeddings Modeling for Video-Text Retrieval
UniDiff: Parameter-Efficient Adaptation of Diffusion Models for Land Cover Classification with Multi-Modal Remotely Sensed Imagery and Sparse Annotations
Hestia: Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction
Dronaquatics: Real-time Swimming Analytics Using Drone Captured Imagery
Generalization of Real World Video Deblurring By Image-to-Image Translation
Confidence Through Parallel Attention for Depth and Uncertainty Estimation in Dynamic Environments
Controllable Long-term Motion Generation with Extended Joint Targets
How I Met Your Bias: Investigating Bias Amplification in Diffusion Models
Safe Vision-Language Models via Unsafe Weights Manipulation
Broadcast2Pitch: Game State Reconstruction from Unconstrained Soccer Videos
FSP-DETR: Few-Shot Prototypical Parasitic Ova Detection
Global Focal and Radial Distortion Averaging from Radial Fundamental Matrices for Robust Self-Calibration
Food Image Generation on Multi-Noun Categories
4D Multimodal Co-attention Fusion Network with Latent Contrastive Alignment for Alzheimer's Diagnosis
Learning Subglacial Bed Topography from Sparse Radar with Physics-Guided Residuals
Restora-Flow: Mask-Guided Image Restoration with Flow Matching
Gated Temporal Fusion Transformers for Robust Multi-Object Tracking
RPT-SR: Regional Prior attention Transformer for infrared image Super-Resolution
BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining
SegMo: Segment-aligned Text to 3D Human Motion Generation
ChartQA-X: Generating Explanations for Visual Chart Reasoning
SFMNet: Sparse Focal Modulation for 3D Object Detection
Zero‑Shot Domain Generalisation via Prompt-Driven Feature Refinement
LogicCBMs: Logic-Enhanced Concept-Based Learning
TacticalCalib: End-to-End 6-DoF Camera Pose Regression for Tactical Camera Calibration
T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation
Clear Sights on Site: A Spatial-Adaptive Channel Network for Deblurring Construction Site Images
GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts
brat: Aligned Multi-View Embeddings for Brain MRI Analysis
CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering
Learning Beyond Labels: Self-Supervised Handwritten Text Recognition
GFT-GCN: Privacy-Preserving 3D Face Mesh Recognition with Spectral Diffusion
Distribution Highlighted Reference-based Label Distribution Learning for Facial Age Estimation
Can Image Splicing and Copy-Move Forgery Be Detected by the Same Model? Forensim: An Attention-Based State-Space Approach
DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors
GDoFS: Gaussian DoF Separation for Plausible 3D Geometry in Sparse-View 3DGS
MoSCo: Real-time and Efficient Text-to-Motion Synthesis via Delta Training
TED-4DGS: Temporally Activated and Embedding-based Deformation for 4DGS Compression
Digital Forensic AI You Can Explain: A Case Study on Video Source Camera Identification
SceneProp: Combining Neural Network and Markov Random Field for Scene-Graph Grounding
PoseAdapt: Sustainable Human Pose Estimation via Continual Learning Benchmarks and Toolkit
FAIR-SIGHT: Fairness Assurance in Image Recognition via Simultaneous Conformal Thresholding and Dynamic Output Repair
FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding
IPCD: Intrinsic Point-Cloud Decomposition
TRACE: Confounder-free Adversarial Fine-tuning for Robust Object Detection
DREAM: Dynamic Prompts and GuidedMix for Efficient Continual Adaptation of Visual-Language Models
Virtually Unrolling the Herculaneum Papyri by Diffeomorphic Spiral Fitting
PALMS+: Modular Image-Based Floor Plan Localization Leveraging Depth Foundation Model
Overcoming Small Data Limitations in Video-Based Infant Respiration Estimation
Zero-LEAD: Source-Free Universal Domain Adaptation for Abdominal Multi-Organ Segmentation
PerVL-Bench: Benchmarking Multimodal Personalization for Large Vision–Language Models
AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization
SD-CSFL: A Synthetic Data-Driven Conformity Scoring Framework for Robust Federated Learning
See, Record, Do: Automated Generation of UI Workflows from Tutorial Videos
D2Mamba: Dual Domain Guided Informed Search in State Space Model for Underwater Image Enhancement
TM-Adapter: Temporal Merge Adapter for Efficient Global Temporal Modeling
CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation
Enhancing Monocular 3D Hand Reconstruction with Learned Texture Priors
Diffusion Noise Optimization for Synthetic VLM Training
BAFLE-DCT: Bypassing Adversarial Filters via Frequency-Selective Embedding in the DCT Domain
Histogram Assisted Quality Aware Generative Model for Resolution Invariant NIR Image Colorization
SSMT-Net: A Semi-Supervised Multitask Transformer-Based Network for Thyroid Nodule Segmentation in Ultrasound Images
Understanding the Visual Projection Space of Multimodal LLMs
NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining
R-MMA: Enhancing Vision-Language Models with Recurrent Adapters for Few-Shot and Cross-Domain Generalization
Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score
Optimal Transport for Rectified Flow Image Editing: Unifying Inversion-Based and Direct Methods
Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
VAST-ReID: A Low-Light Benchmark Dataset for Person Re-Identification with Visual and Attribute-Rich Semantic Tracking
A Deep Network for Object Detection on Inland Waters
DreamCatcher: Efficient Multi-Concept Customization via Representation Finetuning
ZebraPose: Zebra Detection and Pose Estimation using only Synthetic Data
ODEt(ODEl): Shortcutting the Time and the Length in Diffusion and Flow Models for Faster Sampling
UniTabBank: A Large Scale Multi-Lingual, Multi-Layout, Multi-Type, Multi-Format Dataset for Table Detection
VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion
Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control
IMKD: Intensity-Aware Multi-Level Knowledge Distillation for Camera-Radar Fusion
SGD-Mix: Enhancing Domain-Specific Image Classification with Label-Preserving Data Augmentation
Training-Free Few-Shot Segmentation via Vision-Language Guided Prompting
Memoire: Learning User Personas from Gallery Tags for Personalized Photo Curation
Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation
Correcting and Quantifying Systematic Errors in 3D Box Annotations for Autonomous Driving
Synthesizing Compositional Videos from Text Description
Polymorph: Energy-Efficient Multi-Label Classification for Video Streams on Embedded Devices
RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding
RealDroneVision: Dataset and Architecture Advancements for Small-Object Drone Detection
CSGaussian: Progressive Rate-Distortion Compression and Segmentation for 3D Gaussian Splatting
Modeling and Learning Multiple Hypotheses for Monocular 3D Object Detection
HodgeFormer: Transformers for Learnable Operators on Triangular Meshes through Data-Driven Hodge Matrices
Diffusion-Based Action Recognition Generalizes to Untrained Domains
WiSAR3D - Aerial LiDAR dataset for 3D object detection
ATM: Enhanced Alignment for Text-to-Motion Generation
Gaussian Splatting Map Registration with Orthographic Bird's-Eye-View Renderings
FNOPT: Resolution-Agnostic, Self-Supervised Cloth Simulation using Meta-Optimization with Fourier Neural Operators
UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training
A Unified Diffusion-Based Framework for Multi-Agent Trajectory Prediction Integrating Structured Multi-Modal Representations
Point2Pose: A Generative Framework for 3D Human Pose Estimation with Multi-View Point Cloud Dataset
V2XScene: Multi-View Consistent 3D Scene Simulation for Collaborative Perception
Test Time Adaptation Using Adaptive Quantile Recalibration
OSEG: Improving Diffusion sampling through Orthogonal Smoothed Energy Guidance
Image-Guided Semantic Pseudo-LiDAR Point Generation for 3D Object Detection
Distilling Offline Action Detection Models into Real-Time Streaming Models
FlowCLAS: Enhancing Normalizing Flow-Based Anomaly Segmentation Via Contrastive Learning
Single-step Diffusion for Image Compression at Ultra-Low Bitrates
Motion-Aware Graph Fusion Network for 3D Human Pose Estimation
Multimodal Graph Representation Learning over Arbitrary Sets of Modalities
UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
SAVeD: Learning to Denoise Low-SNR Video for Improved Downstream Performance
Causality-Driven Audits of Model Robustness
Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport
Denoise, Divide, Distill, and Predict (D3P): Towards Forecasting Long-horizon Real-world Anomaly from Normalcy
Test-Time Adaptation for Video Highlight Detection Using Meta-Auxiliary Learning and Cross-Modality Hallucinations
Alignment and Distillation: A Robust Framework for Multimodal Domain Generalizable Human Action Recognition
Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information
ForestSplats: Deformable transient field for Gaussian Splatting in the Wild
Predicting Task fMRI Contrasts from Resting-State fMRI Using Sparse 3D Convolutions
SceneEdited: A City-Scale Benchmark for 3D HD Map Updating via Image-Guided Change Detection
MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection
ScoreNet: Netting Lightweight Quality Scores for Better Visual Assessment with Large Multi-Modality Models
From Prompt to Production: Automating Brand-Safe Marketing Imagery with Text-to-Image Models
GroupPortrait: Multi-ID Portrait Generation with High Identity Preservation and Fine-Grained Control
From Darkness to Detail: Frequency-Aware SSMs for Low-Light Vision
Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities
MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding
Logit-Adjusted Test-Time Adaptation under Partial Class Imbalance
S2O: Static to Openable Enhancement for Articulated 3D Objects
CoreCaption: Core Caption based Text-to-Video Retrieval
Exploring the Boundaries of Diffusion Models for Offline Writer Identification with Sparse and Intra-Variable Data
Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment
Isolating the Role of Temporal Information in Video Saliency: A Controlled Experimental Analysis
Diagnose Like A REAL Pathologist: An Uncertainty-Focused Approach for Trustworthy Multi-Resolution Multiple Instance Learning
Hymavi: A Hybrid Mamba-Attention Network in Multi-View Framework for Volumetric Medical Image Segmentation
MUSE: Model-based Uncertainty-aware Similarity Estimation for zero-shot 2D Object Detection and Segmentation
Exploring Automated Recognition of Instructional Activity and Discourse from Multimodal Classroom Data
Align Video Diffusion Model with Online Video-Centric Preference Optimization
HABIT: Human Action Benchmark for Interactive Traffic in CARLA
VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics
F-INR: Functional Tensor Decomposition for Implicit Neural Representations
FocalComm: Hard Instance-Aware Multi-Agent Perception
High-Level Semantics and Low-Level Features Fusion for Multi-Scale Object Detection in Dynamic Construction Environments
Neural Geometry Image-Based Representations with Optimal Transport (OT)
Graph Query Networks for Object Detection with Automotive Radar
SVD-Det: A Lightweight Framework for Video Forgery Detection Using Semantic and Visual Defect Cues
PointSt3R: Point Tracking through 3D Ground Correspondence
Optimization-Free Style Transfer for 3D Gaussian Splats
SeqFeedNet: Sequential Feature Feedback Network for Background Subtraction
CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation
GRAPE (Gaussian Rendering for Accelerated Pixel Enhancement) Brings Fast and Lightweight Arbitrary Super-Resolution
LASER: Lip Landmark Assisted Speaker Detection for Robustness
Intraoperative 2D/3D Registration via Spherical Similarity Learning and Differentiable Levenberg-Marquardt Optimization
Reciprocal Teaching: Dynamic Multi-Model Teacher-Student Learning for Multiple Noisy Annotations
MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
DM3Net: Dual-Camera Super-Resolution via Domain Modulation and Multi-scale Matching
DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment
Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering
Splatter Layout: Geometry-embedded 3D Reconstruction via Surface Unfolding
SVS-GAN for Semantic Synthesis of Traffic Videos for Autonomous Driving
Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs
Any Detector Can Detect Anything
VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
QuEENet: Quantum-Enhanced Expressive Network for Image Classification
Leveraging Sparsity for Privacy in Collaborative Inference
CAST: Evaluating Multi-Object Trackers with Context-Aware Switch and Transfer Scores
Adversarial Pseudo-replay for Exemplar-free Class-incremental Learning
SDT-6D: Fully Sparse Depth-Transformer for Staged End-to-End 6D Pose Estimation in Industrial Multi-View Bin Picking
Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study
DualRes: Production-ready Dynamic Object Detection
DocWaveDiff: A Predict-and-Refine approach for Document Image Enhancement with Wavelet U-Nets and Diffusion models
UltraClean: A Simple Framework to Train Robust Neural Networks against Backdoor Attacks
VRAgent: Self-Refining Agent for Zero-Shot Multimodal Video Retrieval
ICONIC-444: A 3.1-Million-Image Dataset for OOD Detection Research
Sketch3R: Rapid and Realistic 3D VR Sketch Creation to Shape Retrieval
Conversational Image Generation: Towards Multi-Round Personalized Generation with Multi-Modal Language Models
LangPose: Language-Aligned Motion for Robust 3D Human Pose Estimation
Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models
Joint Optimization of Camera Model and Deep Neural Network for Image Recognition
Where is the Watermark? Interpretable Watermark Detection at the Block Level
Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training
Scalable Video Action Anticipation with Cross Linear Attentive Memory
IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers
Bridging the Domain Gap in Small Multimodal Models: A Dual-level Alignment Perspective
FlyPose: Towards Robust Human Pose Estimation From Aerial Views
VitaGlyph: Vitalizing Artistic Typography with Flexible Dual-branch Diffusion Models
Optimizing against Infeasible Inclusions from Data for Semantic Segmentation through Morphology
IMPACT: Interpretable Most Important Person Analysis and Classification using Transformer-based Models
MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps
Cycle-consistent Multi-graph Matching for Self-supervised Annotation of C. Elegans
DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis
Dual-Domain Multimodal Hyperbolic Fusion for Cardiopulmonary Disease Diagnosis in Emergency Care
CSF-Net: Context-Semantic Fusion Network for Large Mask Inpainting
SmoothDiffusion-VE: Real-time Generative Video Editing Using Adaptive Feature Cache
Advancing Player Identification and Tracking with Global ID Fusion (GIF)
Rethinking Latent Variable in Learned Image Compression
HDR Reconstruction Boosting with Training-Free and Exposure-Consistent Diffusion
DiffRegCD: Integrated Registration and Change Detection with Diffusion Features
Flood-LDM: Generalizable Latent Diffusion Models for rapid and accurate zero-shot High-Resolution Flood Mapping
PhysEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education
Automated Suturing Skill Assessment in Robot-assisted Surgery from Endoscopic Videos using Clinically-guided Evaluation Criteria
FastPose-ViT: A Vision Transformer for Real-Time Spacecraft Pose Estimation
Enhancing Reverse Distillation with Core Exemplar Learning for Unified Multi-Class Anomaly Detection
FG-TRACER: Tracing Information Flow in Multimodal Large Language Models in Free-Form Generation
Remote Sensing Forestry Similarity Convolution
STRinGS: Selective Text Refinement in Gaussian Splatting
ISALux: Illumination and Semantics-Aware Transformer Employing Mixture of Experts for Low Light Image Enhancement
SOLAR: Switchable Output Layer for Accuracy and Robustness in Once-for-All Training
Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors
Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs
Fetal and Neonatal Cortical Surface Reconstruction with Anatomical Normal-guidance and Perceptual Enhancements
SphereEdit: Spherical Semantic Editing in Diffusion Models
Stabilizing Direct Training of Spiking Neural Networks: Membrane Potential Initialization and Threshold-robust Surrogate Gradient
MemeTAG: Keyword-Driven Meme Classification through Tag Embedding Reconstruction
FairVLM: Enhancing Fairness and Prompt Sensitivity in Vision Language Models for Medical Image Segmentation
MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data
Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation
ReFineVQA: Iterative Refinement of Video Description via Feedback Generation for Video Question Answering
RampWatch: An In-the-Wild Dataset and Text-Guided Detection Framework for Recreational Vessels
TopoRec: Point Cloud Recognition Using Topological Data Analysis
SafeguardGS: 3D Gaussian Primitive Pruning While Avoiding Catastrophic Scene Destruction
PEaRL: Pathway-Enhanced Representation Learning for Gene and Pathway Expression Prediction from Histology
Integrating Multi-scale and Multi-filtration Topological Features for Medical Image Classification
VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning
Diversity Preserving Coresets for Image Quality Assessment
MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping
Exploiting Label-Independent Regularization from Spatial Patterns for Whole Slide Image Analysis
GFT: Graph Feature Tuning for Efficient Point Cloud Analysis
GorillaWatch: An Automated System for In-the-Wild Gorilla Re-Identification and Population Monitoring
See, Think, Learn: A Self-Taught Multimodal Reasoner
View-aware Cross-modal Distillation for Multi-view Action Recognition
SpecGen: Neural Spectral BRDF Generation via Spectral-Spatial Tri-plane Aggregation
QAL: A Loss for Recall–Precision Balance in 3D Reconstruction
Improvise, Adapt, Overcome — Telescopic Adapters for Efficient fine-tuning of Vision Language Models in Medical Imaging
FedEFC: Federated Learning Using Enhanced Forward Correction Against Noisy Labels
AortaDiff: A Unified Multitask Diffusion Framework for Contrast-Free AAA Imaging
NavMapFusion: Diffusion-based Fusion of Navigation Maps for Online Vectorized HD Map Construction
Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting
GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting
Pretraining Helps When Capacity Allows: Evidence from Ultra-Small ConvNets
Meta-YOLO: Metadata-Guided Real-Time Object Detector in Aerial Imagery
MedPEFT-CL: Dual-Phase Parameter-Efficient Continual Learning with Medical Semantic Adapter and Bidirectional Memory Consolidation
Patch Your Matcher: Correspondence-Aware Image-to-Image Translation Unlocks Cross-Modal Matching via Single-Modality Priors
Test-Time Consistency in Vision Language Models
Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation
Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting
Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone
AugMapNet: Improving Spatial Latent Structure via BEV Grid Augmentation for Enhanced Vectorized Online HD Map Construction
Uncertainty-Aware Vision-Language Segmentation for Medical Imaging
Semantic Map Guided Bird's-Eye View Learning for Online HD Map Construction
Towards Unconstrained Cross-View Pose Estimation
Being Positive about Negative Queries: Exclusion Aware Multimodal Retrieval using Disentangled Representations
SCAdapter: Content-Style Disentanglement for Diffusion Style Transfer
VideoSketcher: A Training-Free Approach for Coherent Video Sketch Transfer
From Cognitive Priors to Instance Semantics: A Unified Framework for Multi-task Affective Computing
Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization
Photo Dating by Facial Age Aggregation
SceneShine: Illumination-aware Human Scene Gaussian Re-Splatting from Mobile Device Video
MIST: Multilingual Incidental Dataset for Scene Text Detection
NeuroBridge: Few-Shot Cross-Modal Neuron Re-identification via Dual-Channel Deep Metric Learning
General and Domain-Specific Zero-shot Detection of Generated Images via Conditional Likelihood
Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios
Crash2DocAI: Automated Integration of Post-Crash Car Part Images into Technical Reports
HOLO: Holistic Lightweight Optimization for Scene Understanding with Auto-Annotation and Multimodal Learning
SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
Leveraging Semantic Attribute Binding for Free-Lunch Color Control in Diffusion Models
ACuRE: Accurate Continuity-Regularized SpO2 Estimation Using Liquid Time-Constant Networks
Q-Former Autoencoder: A Modern Framework for Medical Anomaly Detection
SuperRivolution: Fine-Scale Rivers from Coarse Temporal Satellite Imagery
T2LF: LLM-Guided Multimodal Diffusion for Text-to-Light Field Synthesis
SurgXBench: Explainable Vision-Language Model Benchmark for Surgery
Deep Image Decomposition for Medical Imaging Anonymization and Curation
Locally Explaining Prediction Behavior via Gradual Interventions and Measuring Property Gradients
NRGMark: Localized Watermarking for Energy Transparency in Images
SAVE: Sparse Autoencoder‑Driven Visual Information Enhancement for Mitigating Object Hallucination
PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval

Remarks

Opening Remarks and Paper Awards

Tutorials

Beyond Vision: Multimodal Perspectives for Cross-View Geo-Localization

Workshops

Foundational Models Beyond the Visual Spectrum
LENS: Learning and Exploitation of Latent Space Geometries
3rd Physical Retail AI Workshop
WACV 2026 Workshop Proposal Scene Graph for Structured Intelligence
Workshop on Generative AI for Photography
