WACV 2026 Events with Videos
Below are all events that have video recordings. We are still processing recordings; they will appear here as we upload them to the website.
Keynotes
Meetings
Posters
- VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction
- BanglaProtha: Evaluating Vision Language Models in Underrepresented Long-tail Cultural Contexts
- Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis
- Mitigating Backdoor Attacks via Trigger Reconstruction and Model Hardening
- No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts
- Revisiting Layer Normalization for Point Cloud Test Time Adaptation
- Leveraging Pretrained Representations for Cross-Modal Point Cloud Completion
- Cluster-Guided Adversarial Perturbations for Robust Contrastive Learning
- ZonUI-3B: Competitive GUI Grounding with a 3B VLM Trained on a Single Consumer GPU
- CountingDINO: A Training-free Pipeline for Class-Agnostic Counting using Unsupervised Backbones
- FAE-Net: Fashion Attribute Editing via Disentangled Latent Conditioning in Diffusion Models
- Understanding Human-Like Biases in VLMs via Subjective Face Analytics
- Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters
- A Fast, Simple, and Flexible Scale Informative Feature Transform Module for Arbitrary Scale Image Super-Resolution
- ImageNet-sES: A First Systematic Study of Sensor–Environment Simulation Anchored by Real Recaptures
- M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models
- Interleaved Vision-and-Language Generation via Generative Voken
- From Lightweight CNNs to SpikeNets: Benchmarking Accuracy–Energy Tradeoffs with Pruned Spiking SqueezeNet
- PSA-MIL: A Probabilistic Spatial Attention-Based Multiple Instance Learning for Whole Slide Image Classification
- DreamAnywhere: Object-Centric Panoramic 3D Scene Generation
- Enabling High-Quality In-the-Wild Imaging from Severely Aberrated Metalens Bursts
- A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions
- Beyond the Encoder: Joint Encoder-Decoder Contrastive Pre-Training Improves Dense Prediction
- Gaussian Representations for Video
- Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?
- GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection
- IDEAL-M3D: Instance Diversity-Enriched Active Learning for Monocular 3D Detection
- PS3: Part level instance segmentation in 3D
- Training-free Multi-view 4D Human Motion Reconstruction Virtual Reality System
- Mixed Diffusion for 3D Indoor Scene Synthesis
- MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval
- A Multi-Agent Diffusion Approach for MRI Anomaly Segmentation via Modality-Specific LoRA Specialization
- Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space
- ChameleonTuner: Automatic ISP Color Tuning in Subjective Scenarios
- EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation
- Inpaint360GS: Efficient Object-Aware 3D Inpainting via Gaussian Splatting for 360° Scenes
- Towards Fast and Scalable Normal Integration using Continuous Components
- Cluster-based Pseudo-labeling for Semi-Supervised LiDAR Semantic Segmentation
- Root Completion from Intraoral Scans of Tooth Crowns using Diffusion with Patch Perturbation
- SilverLining: Data-First Mitigation of Spatial and Spectral Shortcuts Without Introducing New Confounders
- BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis
- Network-agnostic distortion-robust projections for wide-angle image understanding
- mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
- PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models
- FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
- TS-PCI: Point Cloud Frame Interpolation with Time-Aware Point Cloud Sampling and Self-Supervised Learning Strategy
- Odo: Depth-Guided Diffusion for Identity-Preserving Body Reshaping
- MEDAL: multi-modal MEta-space Distillation and ALignment for Visual Compatibility Learning
- Model-free Domain Adaptation for Concealed Multimodal Large-Language Models
- Human Pose Aggregation for Multi-View Temporal Video Alignment
- M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models
- Data-Driven Lipschitz Continuity: A Cost-Effective Approach to Improve Adversarial Robustness
- CAAC: Confidence-Aware Attention Calibration to Reduce Hallucinations in Large Vision-Language Models
- OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting
- Autocorrelation-Based Fiducial Markers for Traceability
- Accelerated Dose Generation in Gamma Knife Radiosurgery Using a Wavelet Diffusion Model for Sparse Representation
- A framework for real-time Surgical Phase Recognition with application to Robot-Assisted Partial Nephrectomy
- Eff-GRot: Efficient and Generalizable Rotation Estimation with Transformers
- Saliency-Guided DETR for Moment Retrieval and Highlight Detection
- Federated Model Synchronization for Diagnostic Redefinition through a Novel Selective Parameter Unlearning
- PaRaChute: Pathology-Radiology Cross-Modal Fusion for Missing-Modality-Robust Survival Prediction
- Beyond Realism: Learning the Art of Expressive Composition with StickerNet
- SGPMIL: Sparse Gaussian Process Multiple Instance Learning
- Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources
- ART: Actor-Related Tubelet for Detecting Complex-shaped Action Tubes
- Learning to Animate Images from A Few Videos to Portray Delicate Human Actions
- Marshaled Learning: Bridging Large Neural Networks with Memory-Constrained Trusted Execution Environments in Federated Learning
- OW-Rep: Open World Object Detection with Instance Representation Learning
- TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors
- CADE: Continual Weakly-supervised Video Anomaly Detection with Ensembles
- MBTI: Metric-Based Textual Inversion for Fine-Grained Image Generation
- AutoSew: A Geometric Approach to Stitching Prediction with Graph Neural Networks
- 4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos
- SpikeRain: Towards Energy-Efficient Single Image Deraining with Spiking Neural Networks
- Zero-Shot Video Deraining with Video Diffusion Models
- A Universal Self-Attention Enhancement for Bridging Low-bit Quantization and Vision Transformers
- From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance
- Morphing Through Time: Diffusion-Based Bridging of Temporal Gaps for Robust Alignment in Change Detection
- SCORP: Scene-Consistent Object Refinement via Proxy Generation and Tuning
- CommonForms: A Large, Diverse Dataset for Form Field Detection
- Text Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters
- MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
- End-to-End Fine-Tuning of 3D Texture Generation using Differentiable Rewards
- Reverse Personalization
- SymNet: A Multi-Task Network for Joint Radio Map Reconstruction and Transmitter Localization
- MIX-based Foreground and Background Patch Augmentation Guided by Physics and Material Properties for X-ray Detection
- ViGG: Robust RGB-D Point Cloud Registration using Visual-Geometric Mutual Guidance
- MR-Pruner: Training-free Multi-resolution Visual Token Pruning for Multi-modal Large Language Models
- UnderWater SLAM with Laser-light sectioning method using ST-GAT
- LooC: Effective Low-Dimensional Codebook for Compositional Vector Quantization
- Fully Unsupervised Self-debiasing of Text-to-Image Diffusion Models
- Salience-SGG: Enhancing Unbiased Scene Graph Generation with Iterative Salience Estimation
- Systematic Analysis of the Unintentional CSAM-Generation-Potential of Text-to-Image Models
- Enhanced Back-Projection of Vision Features for 3D Symmetry Detection
- Unified Video Anomaly Detection Model for Detecting Different Anomaly Types
- MARS: a Multimodal Alignment and Ranking System for Few-Shot Segmentation
- GrounDiff: Diffusion-Based Ground Surface Generation from Digital Surface Models
- AdaptViG: Adaptive Vision GNN with Exponential Decay Gating
- Overcoming Fine-Grained Visual Challenges in Animal Re-Identification via Semantic Feature Alignment
- Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?
- HyperPose: Hyper-pose Embeddings for 3D-Aware Generative Models with Self-Supervised Disentangling of Pose and Scene
- MooTrack360: A Novel Fisheye Camera Dataset for Robust Multi Dairy Cow Detection and Tracking
- Diverse Sketch Colorization with Content-Enhanced Style Representation and Recolorization Distillation
- SaccadeX: Directed Acyclic Graph-based Semi-Supervised Learning of Continuous Ocular Dynamics from Sparse Neuromorphic Streams
- Beyond Real Weights: Hypercomplex Representations for Stable Quantization
- Analysis of Text Accuracy and Visual Alignment in Vision-Language Models for Artistic Text Generation
- ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models
- How to Design and Train Your Implicit Neural Representation for Video Compression
- QuadraNet V2: Efficient and Sustainable Training of High-Order Neural Networks with Quadratic Adaptation
- Feature-Disentangling RGB-NIR Fusion Network for Remote Driver Physiological Measurement
- Deepfake Detection that Generalizes Across Benchmarks
- Reinforcement Learning-based Adaptive Control of Classifier-Free Guidance and Timestep Embeddings in Diffusion Models
- Zero-Shot Coreset Selection via Iterative Subspace Sampling
- MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions
- SAFER-AiD: Saccade-Assisted Foveal-peripheral vision Enhanced Reconstruction for Adversarial Defense
- Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding
- Temporal Object Captioning for Street Scene Videos from LiDAR Tracks
- RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph
- QCFace: Image Quality Control for boosting Face Representation & Recognition
- CraftSVG: Multi-Object Text-to-SVG Synthesis via Layout Guided Diffusion
- Pyramidal Spectrum: Frequency-based Hierarchically Vector Quantized VAE for Videos
- Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning
- Revisiting Retentive Networks for Fast Range-View 3D LiDAR Semantic Segmentation
- Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning
- You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction
- Multimodal Medical Image Binding via Shared Text Embeddings
- Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention
- Harnessing Object Grounding for Time-Sensitive Video Understanding
- MEGA-PCC: A Mamba-based Efficient Approach for Joint Geometry and Attribute Point Cloud Compression
- Layout Anything: One Transformer for Universal Room Layout Estimation
- TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning
- CycleSL: Server-Client Cyclical Update Driven Scalable Split Learning
- Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model
- MarineEval: Assessing the Marine Intelligence of Vision-Language Models
- CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video
- Tables Guide Vision: Learning to See the Heart through Tabular Data
- OpenCowID: Zero-Shot Visual Identification of Dairy Cows
- UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations
- GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction
- Human knowledge integrated multi-modal learning for single source domain generalization
- milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion
- RobustFormer: Noise-Robust Pre-training for Images and Videos
- Imitating the Functionality of Image-to-Image Models Using a Single Example
- Shift-Equivariant Complex-Valued Convolutional Neural Networks
- LENVIZ: A High-Resolution Low-Exposure Night Vision Benchmark Dataset
- AnyBald: Toward Realistic Diffusion-Based Hair Removal In-The-Wild
- WWE-UIE: A Wavelet & White Balance Efficient Network for Underwater Image Enhancement
- Personalized Image Privacy Advisors via Federated Daisy-Chaining
- Illuminating Darkness: Learning to Enhance Low-light Images In-the-Wild
- SSplain: Sparse and Smooth Explainer for Retinopathy of Prematurity Classification
- Divide and Refine: Enhancing Multimodal Representation and Explainability for Emotion Recognition in Conversation
- A-V Representation Learning via Audio Shift Prediction for Multimodal Deepfake Detection and Temporal Localization
- Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models
- Orca: Object Recognition and Comprehension for Archiving Marine Species
- ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora
- PVeRA: Probabilistic Vector-Based Random Matrix Adaptation
- From Few-Shot to Zero-Shot Pallet Load Recognition: A Deployed Embedding-Based Vision System for Industrial Logistics
- Tables Decoded: DELTA for Structure, TARQA for Understanding
- DiRe: Diversity-promoting Regularization for Dataset Condensation
- AD2: Analysis and Detection of Adversarial Threats in Visual Perception for End-to-End Autonomous Driving Systems
- BoxSplitGen: A Generative Model for 3D Part Bounding Boxes in Varying Granularity
- Graph-Based Spectral Attention with Multi-Spectral Images for Illuminant Estimation
- SUGAR: A Sweeter Spot for Generative Unlearning of Many Identities
- DynaGSLAM: Real-Time Gaussian-Splatting SLAM for Online Rendering, Tracking, Motion Predictions of Moving Objects in Dynamic Scenes
- PADM: A Physics-aware Diffusion Model for Attenuation Correction
- ObjectMeshDeform: Towards recovering precise 3D geometry of real objects via image-guided mesh deformation of 3D generative priors
- STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences
- Vision-informed Semantic Text Alignment for Open-set Recognition in Remote Sensing
- Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions
- FlowMorph: Revealing an Optimizable Flow Latent Space for Controlled Image Morphing
- CURIO: Curvature-Aligned and Efficient OCR for Low-Resource Historical Manuscripts
- ScoliGaitX: A Deep Multi-Modal Fusion Network for Scoliosis Assessment via Gait Video Analysis
- Beyond the Highlights: Video Retrieval with Salient and Surrounding Contexts
- Uncertainty-Aware Subset Selection for Robust Visual Explainability under Distribution Shifts
- Enhancing Object Detection Training via Joint Image-Annotation Generation
- UI-Styler: Ultrasound Image Style Transfer with Class-Aware Prompts for Cross-Device Diagnosis Using a Frozen Black-Box Inference Network
- BAFIS: Dataset + Framework to assess occupational Bias and Human Preference in modern Text-to-image Models
- NerVast: Compression-Efficient Scaling of Implicit Neural Video Representations via Scene-based Parameter-sharing (Yunheon Lee, Juncheol Ye, Jaehong Kim, Dongsu Han)
- Dressing the Imagination: A Dataset for AI-Powered Translation of Text into Fashion Outfits and A Novel NeRA Adapter for Enhanced Feature Adaptation
- Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery
- Roadside Monocular 3D Detection Prompted by 2D Detection
- Learnable Query-Enhanced Pose Transformation
- BOP-Distrib: Revisiting 6D Pose Estimation Benchmarks for Better Evaluation under Visual Ambiguities
- From Bands to Depth: Understanding Bathymetry Decisions on Sentinel-2
- MMCM: Multimodality-aware Metric using Clustering-based Modes for Probabilistic Human Motion Prediction
- mmWeaver: Environment-Specific mmWave Signal Synthesis from a Photo and Activity Description
- Test-Time Adaptation through Semantically-guided Feature Decomposition for Few-shot Chest X-ray Diagnosis
- SPAR-Det: Segmentation-guided and Prior-Aided Routing for Small Object Detection
- DRWKV: Focusing on Object Edges for Low-Light Image Enhancement
- Matching Semantically Similar Non-Identical Objects
- 3D Gaussian Point Encoders
- Segmentation-Aware Latent Diffusion for Satellite Image Super-Resolution: Enabling Smallholder Farm Boundary Delineation
- Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models
- GeoHSAF: Geometric Hippocampus Shape Analysis Framework for Longitudinal Alzheimer's Disease Classification
- Cosine Similarity is Almost All You Need (for Prototypical-Part Models)
- RobustGait: Robustness Analysis for Appearance Based Gait Recognition
- DenseBEV: Transforming BEV Grid Cells into 3D Objects
- Color Bind: Exploring Color Perception in Text-to-Image Models
- HyPCA-Net: Advancing Multimodal Fusion in Medical Image Analysis
- DMAT: An End-to-End Framework for Joint Atmospheric Turbulence Mitigation and Object Detection
- DoTA: Latent Distribution Conditioned Data Attribution for Diffusion Models
- Trajectory Tactics: When Transformers Learn Exploration to Generate Online Signature
- Reconstructing Realistic and Relightable Eyes
- MorphXAI: An Explainable Framework for Morphological Analysis of Parasites in Blood Smear Images
- Semi-Supervised Hierarchical Open-Set Classification
- Ordinal-Aware Multimodal Engagement Recognition for Collaborative Learning
- EllipssianNet: Image-guided Sampling of 2D Gaussians for Gaussian Splatting
- Reviving Unsupervised Optical Flow: Concept Reevaluation, Multi-Scale Advances and Full Open-Source Release
- ArchitectHead: Continuous Level of Detail Control for 3D Gaussian Head Avatars
- From SAM to DINOv2: Towards Distilling Foundation Models to Lightweight Baselines for Generalized Polyp Segmentation
- Conditional Text-to-Image Generation with Reference Guidance
- UniCalib: Targetless LiDAR-camera Calibration via Probabilistic Flow on Unified Depth Representations
- Understanding Generative AI Capabilities in Everyday Image Editing Tasks
- FAST-EQA: Efficient Embodied Question Answering with Global and Local Region Relevancy
- Mitigating the Modality Gap: Few-Shot Out-of-Distribution Detection with Multi-modal Prototypes and Image Bias Estimation
- OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models
- Identity Verification from Human Scent using Channel Representation of 2D Gas Chromatography-Mass Spectrometry Data
- Detection-Driven Object Count Optimization for Text-to-Image Diffusion Models
- ConsensusXAI: A framework to examine class-wise agreement in medical imaging
- PHYSPLAT: a Framework for Photorealistic Hybrid Simulation of Real and Synthetic Elements using 3D Gaussian Splatting
- BrandFusion: Aligning Image Generation with Brand Styles
- Detecting Out-of-Distribution Objects through Class-Conditioned Inpainting
- ITSELF: Attention Guided Fine-Grained Alignment for Vision–Language Retrieval
- InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation
- FARF-Net: Frequency-guided Adaptive Receptive Field Network for Edge-enhanced Polyp Segmentation
- Real-Time Tracking of Flexible Markers in Low-Contrast Fluoroscopy Using a Deep Neural Network Trained Solely on Synthetic Data
- Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space
- PRISM-CAFO: Prior-conditioned Remote-sensing Infrastructure Segmentation and Mapping for CAFOs
- The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs
- SasMamba: A Lightweight Structure-Aware Stride State Space Model for 3D Human Pose Estimation
- Low-Rank Expert Merging for Multi-Source Domain Adaptation in Person Re-Identification
- 3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting
- Perception-Inspired Color Space Design for Photo White Balance Editing
- Efficient Text-Guided Convolutional Adapter for the Diffusion Model
- Rethinking Real Image Editing: Unleashing Diverse Editing Operators via Multi-Objective Optimization
- Power of Boundary and Reflection: Semantic Transparent Object Segmentation using Pyramid Vision Transformer with Transparent Cues
- Towards Reliable Test-Time Adaptation: Style Invariance as a Correctness Likelihood
- DTMIR-Pro: Domain Translation with Prompt-based Latent-Space Generalization for Multi-Weather Image Restoration
- Sea-CLIP: Mining Semantic-Aware Representations for Few-Shot Anomaly Detection with CLIP
- Towards Streaming LiDAR Object Detection with Point Clouds as Egocentric Sequences
- LighthouseGS: Indoor Structure-aware 3D Gaussian Splatting for Panorama-Style Mobile Captures
- SENCA-st: Integrating Spatial Transcriptomics and Histopathology with Cross Attention Shared Encoder for Region Identification in Cancer Pathology
- Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models
- Dragonite: Single-Step Drag-based Image Editing with Geometric-Semantic Guidance
- Visual Detector Compression via Location-Aware Discriminant Analysis
- NERVE: Neighbourhood & Entropy-Guided Random-Walk for Training Free Open-Vocabulary Segmentation
- 2S-CEDiff: A Two-Stage Diffusion Framework for Generating High-Fidelity Contrast-Enhanced CT Images from Non-Contrast Scans
- DOTGraph: CLIP-Driven Feature Disentanglement and Optimal Transport based Graph Learning for Few-Shot Segmentation
- Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling
- Detecting Social Engagement of Elderly From Lifelog Image-streams to Identify Effective Cues for Autobiographic Recall
- TiCLS: Tightly Coupled Language Text Spotter
- Learning from Unknown for Open-Set Test-Time Adaptation
- Frequency Is What You Need: Considering Word Frequency When Text Masking Benefits Vision-Language Model Pre-training
- Multi-Modal Soccer Scene Analysis with Masked Pre-Training
- HEART-PFL: Stable Personalized Federated Learning under Heterogeneity with Hierarchical Directional Alignment and Adversarial Knowledge Transfer
- SCATR: Mitigating New Instance Suppression in LiDAR-based Tracking-by-Attention via Second Chance Assignment and Track Query Dropout
- Crafting Descriptive Information for a Zero-shot Method to Improve Knowledge-Based Visual Question Answering Performance
- Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory
- High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization
- Unified Alignment Protocol: Making Sense of the Unlabeled Data in New Domains
- FedSCAl: Leveraging Server and Client Alignment for Unsupervised Federated Source-Free Domain Adaptation
- Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction
- Autoregressive Styled Text Image Generation, but Make it Reliable
- FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation
- Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models
- Learning Group Actions In Disentangled Latent Image Representations
- SimForce: Force and Surface Electromyography from Full Body Video with Graph Neural Nets
- CLIP-IT: CLIP-based Pairing of Histology Images with Privileged Textual Information
- LightGazeNet: A Lightweight GNN-based Architecture for Gaze Estimation
- DATTA: Domain-Adversarial Test-Time Adaptation for Cross-Domain WiFi-Based Human Activity Recognition
- Hybrid State Representation for Video Procedure Planning
- Blur2Sharp: Human Novel Pose and View Synthesis with Generative Prior Refinement
- Universal Neural Architecture Space: Covering ConvNets, Transformers and Everything in Between
- Cross-Modal Event Encoder: Bridging Image–Text Knowledge to Event Streams
- KD360-VoxelBEV: LiDAR and 360-degree Camera Cross Modality Knowledge Distillation for Bird’s-Eye-View Segmentation
- Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation
- Sketch-guided Cage-based 3D Gaussian Splatting Deformation
- Gradient-Free Classifier Guidance for Diffusion Model Sampling
- CasTex: Cascaded Text-to-Texture Synthesis via Explicit Texture Maps and Physically-Based Shading
- Efficient Vision Transformers via Token Merging with Head-wise Attention Correction
- SOAF: Scene Occlusion-aware Neural Acoustic Field
- Fused Similarity Measure Based Alignment with Dual-Scale Adaptive Selection for Weakly Supervised Video Anomaly Detection
- A Novel Metric for Detecting Memorization in Generative Models for Brain MRI Synthesis
- Feature Inversion as a Lens on Vision Encoders
- Stroke Modeling Enables Vectorized Character Generation with Large Vectorized Glyph Model
- Towards High-Fidelity, Identity-Preserving Real-Time Makeup Transfer: Decoupling Style Generation
- Guided Texture Segmentation via Coordinate-Aware Class-Ratio Mapping
- Structure-Aware Feature Rectification with Region Adjacency Graphs for Training-free Open-Vocabulary Semantic Segmentation
- Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression
- Guided Model Merging for Hybrid Data Learning: Leveraging Centralized Data to Refine Decentralized Models
- Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination
- AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM
- DCSHARP: 3D Gaussian Splatting with Direction Cosine Spherical Harmonics and Shape-Aware Pruning
- Empowering Source-Free Domain Adaptation via MLLM-Guided Reliability-Based Curriculum Learning
- ENCORE: A Neural Collapse Perspective on Out-of-Distribution Detection in Deep Neural Networks
- StreetView-Waste: A Multi-Task Dataset for Urban Waste Management
- FairScene: Learning Class-Disentangled 2D/3D Representations for Semantic Scene Completion
- ExDDV: A New Dataset for Explainable Deepfake Detection in Video
- Gen-AFFECT: Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy
- SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering
- SCALEX: Scalable Concept and Latent Exploration for Diffusion Models
- Extreme Amodal Face Detection
- ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion
- Augmenting with NeRFs: Fast Relocalization on Densified Datasets
- CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
- VOCAL: Visual Odometry via ContrAstive Learning
- Training-free Detection of Text-to-video Generations via Over-coherence
- GAITGen: Disentangled Motion-Pathology Impaired Gait Generative Model -- Bringing Motion Generation to the Clinical Domain
- One Model, Many Behaviors: Training-Induced Effects on Out-of-Distribution Detection
- Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image
- Show Me: Unifying Instructional Image and Video Generation with Diffusion Models
- CaRS: A Causal Intervention Segmentation Framework and Benchmark Dataset for Autonomous Driving under Transitional Weather Conditions
- Comp4D: Compositional 4D Scene Generation
- OPFormer: Object Pose Estimation leveraging foundation model with geometric encoding
- FLoMo-Net: A Novel Task-Adaptive Mixture of Experts Routing Framework with Frequency and Uncertainty Correction for Medical Image Segmentation
- KMOPS: Keypoint-Driven Method for Multi-Object Pose and Metric Size Estimation from Stereo Images
- TalkingHeadBench: A Multi-Modal Benchmark & Analysis of Talking-Head DeepFake Detection
- Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting
- Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning
- Domain Generalizing DINO for Visual Regression via Latent Distractor Subspace Consistency
- Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries
- WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion
- MuSACo: Multimodal Subject-Specific Selection and Adaptation for Expression Recognition with Co-Training
- FlowEO: Generative Unsupervised Domain Adaptation for Earth Observation
- Self-Supervised Visual Prompting for Cross-Domain Road Damage Detection
- A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback
- Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment
- JOCA: Task-Driven Joint Optimisation of Camera Hardware and Adaptive Camera Control Algorithms
- Perceptually Guided 3DGS Streaming and Rendering for Mixed Reality
- HistoMILKD: A Multiple Instance Learning based Multi-Teacher Knowledge Distillation Framework for Whole Slide Image Classification
- Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing
- HumanBench: Two Heads, No Legs, But Mostly Human, the State of Generative Capabilities in T2I Models
- CLIP’s Visual Embedding Projector is a Few-shot Cornucopia
- Workzone3D: A Multimodal Dataset for 3D Work Zone Perception in Autonomous Driving
- SPOC: Spatially-Progressing Object State Change Segmentation in Video
- Pose-Diverse Multi-View Virtual Try-on from a Single Frontal Image via Diffusion Transformer
- GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving
- Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings
- Self-Supervised Compression and Artifact Correction for Streaming Underwater Imaging Sonar
- MANTA: Physics-Informed Generalized Underwater Object Tracking
- AGENet: Adaptive Edge-aware Geodesic Distance Learning for Few-Shot Medical Image Segmentation
- MuseDance: A Diffusion-based Music-Driven Image Animation System
- Unsupervised Modular Adaptive Region Growing and RegionMix Classification for Wind Turbine Segmentation
- ART-ASyn: Anatomy-aware Realistic Texture-based Anomaly Synthesis Framework for Chest X-Rays
- MAFM³: Modular Adaptation of Foundation Models for Multi-Modal Medical AI
- Mobile-Oriented Video Diffusion: Enabling Text-to-Video Generation on Mobile Devices Without Retraining, Compression, or Pruning
- From Detection to Anticipation: Online Understanding of Struggles across Various Tasks and Activities
- Codebook Knowledge with Mamba-Transformer For Low-Light Image Enhancement
- CLUE: Bringing Machine Unlearning to Mobile Devices
- Decoupling Shape and Texture in SAM-2 via Controlled Texture Replacement
- TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model
- Event-based Graph Representation with Spatial and Motion Vectors for Asynchronous Object Detection
- DNA: Dual-branch Network with Adaptation for Open-Set Online Handwriting Generation
- Latent Uncertainty-Aware Multi-View SDF Scan Completion
- Decomposition Sampling for Efficient Region Annotations in Active Learning
- Curve Skeletonization in Continuous domain for Meshes and Point Clouds
- F-ViTA: Foundation Model Guided Visible to Infrared Translation
- BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries
- AEON: Adaptive Embedding Optimized Noise for Robust Watermarking in Diffusion Models
- MoRe: Monocular Geometry Refinement via Graph Optimization for Cross-View Consistency
- CropAT: Leveraging Diffusion-Generated Target-Like Cropped Objects for Pseudo-Label Refinement in Domain-Adaptive Object Detection
- GAEA: A Geolocation Aware Conversational Assistant
- TimeRefine: Temporal Grounding with Time Refining Video LLM
- Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation
- Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning
- Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery
- Unsupervised Segmentation by Diffusing, Walking and Cutting
- Eye-for-an-eye: Appearance Transfer with Dense Semantic Correspondence in Diffusion Models
- Sun-E: Dataset and Benchmark for Event-Based Sun Sensing
- Revisiting Vision–Language Foundations for No-Reference Image Quality Assessment
- More Than Memory Savings: Zeroth-Order Optimization Mitigates Forgetting in Continual Learning
- Histopath-C: Towards Realistic Domain Shifts for Histopathology Vision-Language Adaptation
- UCDSC: Open Set UnCertainty aware Deep Simplex Classifier for Medical Image Datasets
- DARB-Splatting: Generalizing Splatting with Decaying Anisotropic Radial Basis Functions
- Subspace-Guided Knowledge Distillation for Efficient Model Transfer
- HiMix: Hierarchical Visual-Textual Mixing Network for Lesion Segmentation
- SIAM: Synchronous Interaction Attention for Human Mesh Recovery
- R3: Reconstruction, Raw, and Rain: Deraining Directly in the Bayer Domain
- VLMs Guided Interpretable Decision Making in Autonomous Driving
- Surgical Gaussian Surfels: Highly Accurate Real-time Surgical Scene Rendering using Gaussian Surfels
- GrowTAS: Progressive Expansion from Small to Large Subnets for Efficient ViT Architecture Search
- Learning spatio-temporal feature representations for video-based gaze estimation
- Grounding Descriptions in Images informs Zero-Shot Visual Recognition
- Lorentz Entailment Cone for Semantic Segmentation
- Countering Multi-modal Representation Collapse through Rank-targeted Fusion
- Spec-Gloss Surfels and Normal–Diffuse Priors for Relightable Glossy Objects
- DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy
- One-Cycle Structured Pruning via Stability-Driven Subnetwork Search
- FujiView: Multimodal Late-Fusion for Predicting Scenic Visibility
- INRetouch: Context Aware Implicit Neural Representation for Photography Retouching
- Inpainting of Sparse Depth Maps from Monocular Depth-from-Focus on Pixel Processor Arrays
- Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups
- An Efficient Multi-Rater Setup Towards Personalized and Diversified Medical Image Segmentation
- OMeGa: Joint Optimization of Explicit Meshes and Gaussian Splats for Robust Scene-Level Surface Reconstruction
- Hierarchical Adaptive networks with Task vectors for Test-Time Adaptation
- Style-Friendly SNR Sampler for Style-Driven Generation
- PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection
- WiSE-OD: Benchmarking Robustness in Infrared Object Detection
- Lose Your Self (LoYS): an adversarial entropy-based unsupervised approach for model debiasing
- A Dataset and Framework for Learning State-invariant Object Representations
- Towards Egocentric 3D Hand Pose Estimation in Unseen Domains
- PrevMatch: Revisiting and Maximizing Temporal Knowledge in Semi-Supervised Semantic Segmentation
- ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models
- Quantifying the Limits of Segmentation Foundation Models: Modeling Challenges in Segmenting Tree-Like and Low-Contrast Objects
- Robust Multimodal Emotion Recognition from Incomplete Modalities via Query-Based Unimodal and Cross-Modal Learning
- Multi-view stereo with multiple projectors for oneshot entire shape scan based on Neural SDF and DSSS demultiplexing
- DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions
- Towards Photorealistic Style Transfer with Multimodal Guidance and Robustness to Content Images in Arbitrary Styles
- EndoPBR: Photorealistic Synthetic Data for Surgical 3D Vision via Physically-based Rendering
- Beyond Faces: A Multimodal Person Clustering for Unconstrained Environments
- ControlEvents: Controllable Synthesis of Event Camera Data with Foundational Prior from Image Diffusion Models
- SSMRadNet: A Sample-wise State-Space Framework for Efficient and Ultra-Light Radar Segmentation and Object Detection
- MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes
- Patch-wise Retrieval: A Bag of Practical Techniques for Instance-level Matching
- ControlVP: Interactive Geometric Refinement of AI-Generated Images with Consistent Vanishing Points
- Are All Marine Species Created Equal? Performance Disparities in Underwater Object Detection
- Ego-EXTRA: video-language Egocentric Dataset for EXpert-TRAinee assistance
- Improving Animal Pose Estimation through Species Similarity Measures and Rigorous Label Definition
- ProtoGMVAE: A Variational Auto-Encoder with True Gaussian Mixture Prior for Prototypical-based Self-Explainability
- CalibBEV: LiDAR-Camera Calibration via BEV Alignment
- CoL2A: Convolution-free Local Linear Attention for SpatioTemporal Event Processing
- FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks
- Context-Preserving Dermoscopic Editing: Mask-Guided Lesion-Aware Diffusion for Attribute Modification
- X-JEPA: A Novel Joint Learning Cross-Modal Predictive Alignment Framework for Remote Sensing Image Retrieval
- Zero-Shot Table Extraction in Business Documents: A Unified Benchmark with Error Taxonomy and Ecological Analysis
- Improving Out-of-Distribution Detection Using Segmented Images and Cross-View Attention Fusion
- WarpRF: Multi-View Consistency for Training-Free Uncertainty Quantification and Applications in Radiance Fields
- Direct Visual Grounding by Directing Attention of Visual Tokens
- Color Preserving CMOS-SPAD Fusion for Multi-Frame HDR
- ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos
- CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow
- SynchroRaMa: Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding
- HiGlassRM: Learning to Remove High-prescription Glasses via Synthetic Dataset Generation
- Learning Unified Spatio-temporal Representations for Efficient Compressed Video Understanding
- Rank-based Geographical Regularization: Revisiting Contrastive Self-Supervised Learning for Multispectral Remote Sensing Imagery
- Line Art Colorization with Offset Prior-based Diffusion Model
- WSSSP-Net: Weakly Supervised Semantic Segmentation Plugin Network for Face Anti-Spoofing
- KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding
- An improved architecture for part-based animal re-identification through semantic segmentation distillation
- FCC: Fully Connected Correlation for One-Shot Segmentation
- Enhancing Vision Language Corruption Robustness using Cross Distribution & Prompted Denoisers
- DMS2F-HAD: A Dual-branch Mamba-based Spatial–Spectral Fusion Network for Hyperspectral Anomaly Detection
- 3D Superquadric Splatting
- ProSkill: Segment-Level Skill Assessment in Procedural Videos
- SeaClips: A Video Dataset for Maritime Object Detection
- CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization
- RemEdit: Efficient Diffusion Editing with Riemannian Geometry
- VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping
- SOPHY: Generating Simulation-Ready Objects with Physical Materials
- Automated Pore Detection from In-Situ FDM 3D Printing Video: A Comparative Evaluation of Modern Segmentation Models
- Edge-Aware Image Manipulation via Diffusion Models with a Novel Structure-Preservation Loss
- DPBridge: Latent Diffusion Bridge for Dense Prediction
- LASOR: Towards Clinically Transparent and Explainable Ophthalmic Report Generation via Lesion-Aware Segmentation
- VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework
- Semi-supervised Domain Adaptation via Mutual Alignment through Joint Error
- SegMango: Early Deep Mango Yield Prediction based on Flower Segmentation and Weather Data
- RobuMTL: Enhancing Multi-Task Learning Robustness Against Weather Conditions
- One-shot Portrait Stylization via Geometric Alignment
- Semi-supervised Key-Point Estimation for Echocardiography Video
- PredMapNet: Future and Historical Reasoning for Consistent Online HD Vectorized Map Construction
- Similarity-aware Probabilistic Embeddings Modeling for Video-Text Retrieval
- UniDiff: Parameter-Efficient Adaptation of Diffusion Models for Land Cover Classification with Multi-Modal Remotely Sensed Imagery and Sparse Annotations
- Hestia: Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction
- Dronaquatics: Real-time Swimming Analytics Using Drone Captured Imagery
- Generalization of Real World Video Deblurring By Image-to-Image Translation
- Confidence Through Parallel Attention for Depth and Uncertainty Estimation in Dynamic Environments
- Controllable Long-term Motion Generation with Extended Joint Targets
- How I Met Your Bias: Investigating Bias Amplification in Diffusion Models
- Safe Vision-Language Models via Unsafe Weights Manipulation
- Broadcast2Pitch: Game State Reconstruction from Unconstrained Soccer Videos
- FSP-DETR: Few-Shot Prototypical Parasitic Ova Detection
- Global Focal and Radial Distortion Averaging from Radial Fundamental Matrices for Robust Self-Calibration
- Food Image Generation on Multi-Noun Categories
- 4D Multimodal Co-attention Fusion Network with Latent Contrastive Alignment for Alzheimer's Diagnosis
- Learning Subglacial Bed Topography from Sparse Radar with Physics-Guided Residuals
- Restora-Flow: Mask-Guided Image Restoration with Flow Matching
- Gated Temporal Fusion Transformers for Robust Multi-Object Tracking
- RPT-SR: Regional Prior attention Transformer for infrared image Super-Resolution
- BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining
- SegMo: Segment-aligned Text to 3D Human Motion Generation
- ChartQA-X: Generating Explanations for Visual Chart Reasoning
- SFMNet: Sparse Focal Modulation for 3D Object Detection
- Zero-Shot Domain Generalisation via Prompt-Driven Feature Refinement
- LogicCBMs: Logic-Enhanced Concept-Based Learning
- TacticalCalib: End-to-End 6-DoF Camera Pose Regression for Tactical Camera Calibration
- T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation
- Clear Sights on Site: A Spatial-Adaptive Channel Network for Deblurring Construction Site Images
- GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts
- brat: Aligned Multi-View Embeddings for Brain MRI Analysis
- CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering
- Learning Beyond Labels: Self-Supervised Handwritten Text Recognition
- GFT-GCN: Privacy-Preserving 3D Face Mesh Recognition with Spectral Diffusion
- Distribution Highlighted Reference-based Label Distribution Learning for Facial Age Estimation
- Can Image Splicing and Copy-Move Forgery Be Detected by the Same Model? Forensim: An Attention-Based State-Space Approach
- DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors
- GDoFS: Gaussian DoF Separation for Plausible 3D Geometry in Sparse-View 3DGS
- MoSCo: Real-time and Efficient Text-to-Motion Synthesis via Delta Training
- TED-4DGS: Temporally Activated and Embedding-based Deformation for 4DGS Compression
- Digital Forensic AI You Can Explain: A Case Study on Video Source Camera Identification
- SceneProp: Combining Neural Network and Markov Random Field for Scene-Graph Grounding
- PoseAdapt: Sustainable Human Pose Estimation via Continual Learning Benchmarks and Toolkit
- FAIR-SIGHT: Fairness Assurance in Image Recognition via Simultaneous Conformal Thresholding and Dynamic Output Repair
- FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding
- IPCD: Intrinsic Point-Cloud Decomposition
- TRACE: Confounder-free Adversarial Fine-tuning for Robust Object Detection
- DREAM: Dynamic Prompts and GuidedMix for Efficient Continual Adaptation of Visual-Language Models
- Virtually Unrolling the Herculaneum Papyri by Diffeomorphic Spiral Fitting
- PALMS+: Modular Image-Based Floor Plan Localization Leveraging Depth Foundation Model
- Overcoming Small Data Limitations in Video-Based Infant Respiration Estimation
- Zero-LEAD: Source-Free Universal Domain Adaptation for Abdominal Multi-Organ Segmentation
- PerVL-Bench: Benchmarking Multimodal Personalization for Large Vision–Language Models
- AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization
- SD-CSFL: A Synthetic Data-Driven Conformity Scoring Framework for Robust Federated Learning
- See, Record, Do: Automated Generation of UI Workflows from Tutorial Videos
- D2Mamba: Dual Domain Guided Informed Search in State Space Model for Underwater Image Enhancement
- TM-Adapter: Temporal Merge Adapter for Efficient Global Temporal Modeling
- CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation
- Enhancing Monocular 3D Hand Reconstruction with Learned Texture Priors
- Diffusion Noise Optimization for Synthetic VLM Training
- BAFLE-DCT: Bypassing Adversarial Filters via Frequency-Selective Embedding in the DCT Domain
- Histogram Assisted Quality Aware Generative Model for Resolution Invariant NIR Image Colorization
- SSMT-Net: A Semi-Supervised Multitask Transformer-Based Network for Thyroid Nodule Segmentation in Ultrasound Images
- Understanding the Visual Projection Space of Multimodal LLMs
- NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining
- R-MMA: Enhancing Vision-Language Models with Recurrent Adapters for Few-Shot and Cross-Domain Generalization
- Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score
- Optimal Transport for Rectified Flow Image Editing: Unifying Inversion-Based and Direct Methods
- Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
- VAST-ReID: A Low-Light Benchmark Dataset for Person Re-Identification with Visual and Attribute-Rich Semantic Tracking
- A Deep Network for Object Detection on Inland Waters
- DreamCatcher: Efficient Multi-Concept Customization via Representation Finetuning
- ZebraPose: Zebra Detection and Pose Estimation using only Synthetic Data
- ODEt(ODEl): Shortcutting the Time and the Length in Diffusion and Flow Models for Faster Sampling
- UniTabBank: A Large Scale Multi-Lingual, Multi-Layout, Multi-Type, Multi-Format Dataset for Table Detection
- VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion
- Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control
- IMKD: Intensity-Aware Multi-Level Knowledge Distillation for Camera-Radar Fusion
- SGD-Mix: Enhancing Domain-Specific Image Classification with Label-Preserving Data Augmentation
- Training-Free Few-Shot Segmentation via Vision-Language Guided Prompting
- Memoire: Learning User Personas from Gallery Tags for Personalized Photo Curation
- Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation
- Correcting and Quantifying Systematic Errors in 3D Box Annotations for Autonomous Driving
- Synthesizing Compositional Videos from Text Description
- Polymorph: Energy-Efficient Multi-Label Classification for Video Streams on Embedded Devices
- RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding
- RealDroneVision: Dataset and Architecture Advancements for Small-Object Drone Detection
- CSGaussian: Progressive Rate-Distortion Compression and Segmentation for 3D Gaussian Splatting
- Modeling and Learning Multiple Hypotheses for Monocular 3D Object Detection
- HodgeFormer: Transformers for Learnable Operators on Triangular Meshes through Data-Driven Hodge Matrices
- Diffusion-Based Action Recognition Generalizes to Untrained Domains
- WiSAR3D - Aerial LiDAR dataset for 3D object detection
- ATM: Enhanced Alignment for Text-to-Motion Generation
- Gaussian Splatting Map Registration with Orthographic Bird's-Eye-View Renderings
- FNOPT: Resolution-Agnostic, Self-Supervised Cloth Simulation using Meta-Optimization with Fourier Neural Operators
- UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training
- A Unified Diffusion-Based Framework for Multi-Agent Trajectory Prediction Integrating Structured Multi-Modal Representations
- Point2Pose: A Generative Framework for 3D Human Pose Estimation with Multi-View Point Cloud Dataset
- V2XScene: Multi-View Consistent 3D Scene Simulation for Collaborative Perception
- Test Time Adaptation Using Adaptive Quantile Recalibration
- OSEG: Improving Diffusion sampling through Orthogonal Smoothed Energy Guidance
- Image-Guided Semantic Pseudo-LiDAR Point Generation for 3D Object Detection
- Distilling Offline Action Detection Models into Real-Time Streaming Models
- FlowCLAS: Enhancing Normalizing Flow-Based Anomaly Segmentation Via Contrastive Learning
- Single-step Diffusion for Image Compression at Ultra-Low Bitrates
- Motion-Aware Graph Fusion NetWork for 3D Human Pose Estimation
- Multimodal Graph Representation Learning over Arbitrary Sets of Modalities
- UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
- SAVeD: Learning to Denoise Low-SNR Video for Improved Downstream Performance
- Causality-Driven Audits of Model Robustness
- Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport
- Denoise, Divide, Distill, and Predict (D3P): Towards Forecasting Long-horizon Real-world Anomaly from Normalcy
- Test-Time Adaptation for Video Highlight Detection Using Meta-Auxiliary Learning and Cross-Modality Hallucinations
- Alignment and Distillation: A Robust Framework for Multimodal Domain Generalizable Human Action Recognition
- Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information
- ForestSplats: Deformable transient field for Gaussian Splatting in the Wild
- Predicting Task fMRI Contrasts from Resting-State fMRI Using Sparse 3D Convolutions
- SceneEdited: A City-Scale Benchmark for 3D HD Map Updating via Image-Guided Change Detection
- MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection
- ScoreNet: Netting Lightweight Quality Scores for Better Visual Assessment with Large Multi-Modality Models
- From Prompt to Production: Automating Brand-Safe Marketing Imagery with Text-to-Image Models
- GroupPortrait: Multi-ID Portrait Generation with High Identity Preservation and Fine-Grained Control
- From Darkness to Detail: Frequency-Aware SSMs for Low-Light Vision
- Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities
- MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding
- Logit-Adjusted Test-Time Adaptation under Partial Class Imbalance
- S2O: Static to Openable Enhancement for Articulated 3D Objects
- CoreCaption: Core Caption based Text-to-Video Retrieval
- Exploring the Boundaries of Diffusion Models for Offline Writer Identification with Sparse and Intra-Variable Data
- Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment
- Isolating the Role of Temporal Information in Video Saliency: A Controlled Experimental Analysis
- Diagnose Like A REAL Pathologist: An Uncertainty-Focused Approach for Trustworthy Multi-Resolution Multiple Instance Learning
- Hymavi: A Hybrid Mamba-Attention Network in Multi-View Framework for Volumetric Medical Image Segmentation
- MUSE: Model-based Uncertainty-aware Similarity Estimation for zero-shot 2D Object Detection and Segmentation
- Exploring Automated Recognition of Instructional Activity and Discourse from Multimodal Classroom Data
- Align Video Diffusion Model with Online Video-Centric Preference Optimization
- HABIT: Human Action Benchmark for Interactive Traffic in CARLA
- VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics
- F-INR: Functional Tensor Decomposition for Implicit Neural Representations
- FocalComm: Hard Instance-Aware Multi-Agent Perception
- High-Level Semantics and Low-Level Features Fusion for Multi-Scale Object Detection in Dynamic Construction Environments
- Neural Geometry Image-Based Representations with Optimal Transport (OT)
- Graph Query Networks for Object Detection with Automotive Radar
- SVD-Det: A Lightweight Framework for Video Forgery Detection Using Semantic and Visual Defect Cues
- PointSt3R: Point Tracking through 3D Ground Correspondence
- Optimization-Free Style Transfer for 3D Gaussian Splats
- SeqFeedNet: Sequential Feature Feedback Network for Background Subtraction
- CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation
- GRAPE (Gaussian Rendering for Accelerated Pixel Enhancement) Brings Fast and Lightweight Arbitrary Super-Resolution
- LASER: Lip Landmark Assisted Speaker Detection for Robustness
- Intraoperative 2D/3D Registration via Spherical Similarity Learning and Differentiable Levenberg-Marquardt Optimization
- Reciprocal Teaching: Dynamic Multi-Model Teacher-Student Learning for Multiple Noisy Annotations
- MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
- DM3Net: Dual-Camera Super-Resolution via Domain Modulation and Multi-scale Matching
- DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment
- Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering
- Splatter Layout: Geometry-embedded 3D Reconstruction via Surface Unfolding
- SVS-GAN for Semantic Synthesis of Traffic Videos for Autonomous Driving
- Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs
- Any Detector Can Detect Anything
- VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
- QuEENet: Quantum-Enhanced Expressive Network for Image Classification
- Leveraging Sparsity for Privacy in Collaborative Inference
- CAST: Evaluating Multi-Object Trackers with Context-Aware Switch and Transfer Scores
- Adversarial Pseudo-replay for Exemplar-free Class-incremental Learning
- SDT-6D: Fully Sparse Depth-Transformer for Staged End-to-End 6D Pose Estimation in Industrial Multi-View Bin Picking
- Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study
- DualRes: Production-ready Dynamic Object Detection
- DocWaveDiff: A Predict-and-Refine approach for Document Image Enhancement with Wavelet U-Nets and Diffusion models
- UltraClean: A Simple Framework to Train Robust Neural Networks against Backdoor Attacks
- VRAgent: Self-Refining Agent for Zero-Shot Multimodal Video Retrieval
- ICONIC-444: A 3.1-Million-Image Dataset for OOD Detection Research
- Sketch3R: Rapid and Realistic 3D VR Sketch Creation to Shape Retrieval
- Conversational Image Generation: Towards Multi-Round Personalized Generation with Multi-Modal Language Models
- LangPose: Language-Aligned Motion for Robust 3D Human Pose Estimation
- Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models
- Joint Optimization of Camera Model and Deep Neural Network for Image Recognition
- Where is the Watermark? Interpretable Watermark Detection at the Block Level
- Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training
- Scalable Video Action Anticipation with Cross Linear Attentive Memory
- IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers
- Bridging the Domain Gap in Small Multimodal Models: A Dual-level Alignment Perspective
- FlyPose: Towards Robust Human Pose Estimation From Aerial Views
- VitaGlyph: Vitalizing Artistic Typography with Flexible Dual-branch Diffusion Models
- Optimizing against Infeasible Inclusions from Data for Semantic Segmentation through Morphology
- IMPACT: Interpretable Most Important Person Analysis and Classification using Transformer-based Models
- MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps
- Cycle-consistent Multi-graph Matching for Self-supervised Annotation of C. Elegans
- DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis
- Dual-Domain Multimodal Hyperbolic Fusion for Cardiopulmonary Disease Diagnosis in Emergency Care
- CSF-Net: Context-Semantic Fusion Network for Large Mask Inpainting
- SmoothDiffusion-VE: Real-time Generative Video Editing Using Adaptive Feature Cache
- Advancing Player Identification and Tracking with Global ID Fusion (GIF)
- Rethinking Latent Variable in Learned Image Compression
- HDR Reconstruction Boosting with Training-Free and Exposure-Consistent Diffusion
- DiffRegCD: Integrated Registration and Change Detection with Diffusion Features
- Flood-LDM: Generalizable Latent Diffusion Models for rapid and accurate zero-shot High-Resolution Flood Mapping
- PhysEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education
- Automated Suturing Skill Assessment in Robot-assisted Surgery from Endoscopic Videos using Clinically-guided Evaluation Criteria
- FastPose-ViT: A Vision Transformer for Real-Time Spacecraft Pose Estimation
- Enhancing Reverse Distillation with Core Exemplar Learning for Unified Multi-Class Anomaly Detection
- FG-TRACER: Tracing Information Flow in Multimodal Large Language Models in Free-Form Generation
- Remote Sensing Forestry Similarity Convolution
- STRinGS: Selective Text Refinement in Gaussian Splatting
- ISALux: Illumination and Semantics-Aware Transformer Employing Mixture of Experts for Low Light Image Enhancement
- SOLAR: Switchable Output Layer for Accuracy and Robustness in Once-for-All Training
- Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors
- Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs
- Fetal and Neonatal Cortical Surface Reconstruction with Anatomical Normal-guidance and Perceptual Enhancements
- SphereEdit: Spherical Semantic Editing in Diffusion Models
- Stabilizing Direct Training of Spiking Neural Networks: Membrane Potential Initialization and Threshold-robust Surrogate Gradient
- MemeTAG: Keyword-Driven Meme Classification through Tag Embedding Reconstruction
- FairVLM: Enhancing Fairness and Prompt Sensitivity in Vision Language Models for Medical Image Segmentation
- MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data
- Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation
- ReFineVQA: Iterative Refinement of Video Description via Feedback Generation for Video Question Answering
- RampWatch: An In-the-Wild Dataset and Text-Guided Detection Framework for Recreational Vessels
- TopoRec: Point Cloud Recognition Using Topological Data Analysis
- SafeguardGS: 3D Gaussian Primitive Pruning While Avoiding Catastrophic Scene Destruction
- PEaRL: Pathway-Enhanced Representation Learning for Gene and Pathway Expression Prediction from Histology
- Integrating Multi-scale and Multi-filtration Topological Features for Medical Image Classification
- VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning
- Diversity Preserving Coresets for Image Quality Assessment
- MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping
- Exploiting Label-Independent Regularization from Spatial Patterns for Whole Slide Image Analysis
- GFT: Graph Feature Tuning for Efficient Point Cloud Analysis
- GorillaWatch: An Automated System for In-the-Wild Gorilla Re-Identification and Population Monitoring
- See, Think, Learn: A Self-Taught Multimodal Reasoner
- View-aware Cross-modal Distillation for Multi-view Action Recognition
- SpecGen: Neural Spectral BRDF Generation via Spectral-Spatial Tri-plane Aggregation
- QAL: A Loss for Recall–Precision Balance in 3D Reconstruction
- Improvise, Adapt, Overcome — Telescopic Adapters for Efficient fine-tuning of Vision Language Models in Medical Imaging
- FedEFC: Federated Learning Using Enhanced Forward Correction Against Noisy Labels
- AortaDiff: A Unified Multitask Diffusion Framework for Contrast-Free AAA Imaging
- NavMapFusion: Diffusion-based Fusion of Navigation Maps for Online Vectorized HD Map Construction
- Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting
- GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting
- Pretraining Helps When Capacity Allows: Evidence from Ultra-Small ConvNets
- Meta-YOLO: Metadata-Guided Real-Time Object Detector in Aerial Imagery
- MedPEFT-CL: Dual-Phase Parameter-Efficient Continual Learning with Medical Semantic Adapter and Bidirectional Memory Consolidation
- Patch Your Matcher: Correspondence-Aware Image-to-Image Translation Unlocks Cross-Modal Matching via Single-Modality Priors
- Test-Time Consistency in Vision Language Models
- Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation
- Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting
- Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone
- AugMapNet: Improving Spatial Latent Structure via BEV Grid Augmentation for Enhanced Vectorized Online HD Map Construction
- Uncertainty-Aware Vision-Language Segmentation for Medical Imaging
- Semantic Map Guided Bird's-Eye View Learning for Online HD Map Construction
- Towards Unconstrained Cross-View Pose Estimation
- Being Positive about Negative Queries: Exclusion Aware Multimodal Retrieval using Disentangled Representations
- SCAdapter: Content-Style Disentanglement for Diffusion Style Transfer
- VideoSketcher: A Training-Free Approach for Coherent Video Sketch Transfer
- From Cognitive Priors to Instance Semantics: A Unified Framework for Multi-task Affective Computing
- Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization
- Photo Dating by Facial Age Aggregation
- SceneShine: Illumination-aware Human Scene Gaussian Re-Splatting from Mobile Device Video
- MIST: Multilingual Incidental Dataset for Scene Text Detection
- NeuroBridge: Few-Shot Cross-Modal Neuron Re-identification via Dual-Channel Deep Metric Learning
- General and Domain-Specific Zero-shot Detection of Generated Images via Conditional Likelihood
- Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios
- Crash2DocAI: Automated Integration of Post-Crash Car Part Images into Technical Reports
- HOLO: Holistic Lightweight Optimization for Scene Understanding with Auto-Annotation and Multimodal Learning
- SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis
- Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
- Leveraging Semantic Attribute Binding for Free-Lunch Color Control in Diffusion Models
- ACuRE: Accurate Continuity-Regularized SpO2 Estimation Using Liquid Time-Constant Networks
- Q-Former Autoencoder: A Modern Framework for Medical Anomaly Detection
- SuperRivolution: Fine-Scale Rivers from Coarse Temporal Satellite Imagery
- T2LF: LLM-Guided Multimodal Diffusion for Text-to-Light Field Synthesis
- SurgXBench: Explainable Vision-Language Model Benchmark for Surgery
- Deep Image Decomposition for Medical Imaging Anonymization and Curation
- Locally Explaining Prediction Behavior via Gradual Interventions and Measuring Property Gradients
- NRGMark: Localized Watermarking for Energy Transparency in Images
- SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination
- PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval
Remarks
Workshops