WACV 2026 Events with Videos
Posters
- Mitigating Backdoor Attacks via Trigger Reconstruction and Model Hardening
- No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts
- Revisiting Layer Normalization for Point Cloud Test Time Adaptation
- How to Design and Train Your Implicit Neural Representation for Video Compression
- CountingDINO: A Training-free Pipeline for Class-Agnostic Counting using Unsupervised Backbones
- FAE-Net: Fashion Attribute Editing via Disentangled Latent Conditioning in Diffusion Models
- Reinforcement Learning-based Adaptive Control of Classifier-Free Guidance and Timestep Embeddings in Diffusion Models
- Understanding Human-Like Biases in VLMs via Subjective Face Analytics
- A Fast, Simple, and Flexible Scale Informative Feature Transform Module for Arbitrary Scale Image Super-Resolution
- ImageNet-sES: A First Systematic Study of Sensor–Environment Simulation Anchored by Real Recaptures
- Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis
- Cluster-Guided Adversarial Perturbations for Robust Contrastive Learning
- ZonUI-3B: Competitive GUI Grounding with a 3B VLM Trained on a Single Consumer GPU
- Interleaved Vision-and-Language Generation via Generative Voken
- PSA-MIL: A Probabilistic Spatial Attention-Based Multiple Instance Learning for Whole Slide Image Classification
- DreamAnywhere: Object-Centric Panoramic 3D Scene Generation
- Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters
- A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions
- Beyond the Encoder: Joint Encoder-Decoder Contrastive Pre-Training Improves Dense Prediction
- M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models
- Enabling High-Quality In-the-Wild Imaging from Severely Aberrated Metalens Bursts
- From Lightweight CNNs to SpikeNets: Benchmarking Accuracy–Energy Tradeoffs with Pruned Spiking SqueezeNet
- Leveraging Pretrained Representations for Cross-Modal Point Cloud Completion
- Gaussian Representations for Video
- Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?
- GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection
- IDEAL-M3D: Instance Diversity-Enriched Active Learning for Monocular 3D Detection
- Mixed Diffusion for 3D Indoor Scene Synthesis
- MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval
- ChameleonTuner: Automatic ISP Color Tuning in Subjective Scenarios
- Cluster-based Pseudo-labeling for Semi-Supervised LiDAR Semantic Segmentation
- Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space
- SilverLining: Data-First Mitigation of Spatial and Spectral Shortcuts Without Introducing New Confounders
- PS3: Part level instance segmentation in 3D
- Inpaint360GS: Efficient Object-Aware 3D Inpainting via Gaussian Splatting for 360° Scenes
- Towards Fast and Scalable Normal Integration using Continuous Components
- Network-agnostic distortion-robust projections for wide-angle image understanding
- Root Completion from Intraoral Scans of Tooth Crowns using Diffusion with Patch Perturbation
- FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
- EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation
- Odo: Depth-Guided Diffusion for Identity-Preserving Body Reshaping
- mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
- PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models
- Model-free Domain Adaptation for Concealed Multimodal Large-Language Models
- TS-PCI: Point Cloud Frame Interpolation with Time-Aware Point Cloud Sampling and Self-Supervised Learning Strategy
- M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models
- MEDAL: multi-modal MEta-space Distillation and ALignment for Visual Compatibility Learning
- Data-Driven Lipschitz Continuity: A Cost-Effective Approach to Improve Adversarial Robustness
- Human Pose Aggregation for Multi-View Temporal Video Alignment
- OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting
- Saliency-Guided DETR for Moment Retrieval and Highlight Detection
- Federated Model Synchronization for Diagnostic Redefinition through a Novel Selective Parameter Unlearning
- Beyond Realism: Learning the Art of Expressive Composition with StickerNet
- BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis
- Autocorrelation-Based Fiducial Markers for Traceability
- Accelerated Dose Generation in Gamma Knife Radiosurgery Using a Wavelet Diffusion Model for Sparse Representation
- A framework for real-time Surgical Phase Recognition with application to Robot-Assisted Partial Nephrectomy
- Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources
- ART: Actor-Related Tubelet for Detecting Complex-shaped Action Tubes
- Eff-GRot: Efficient and Generalizable Rotation Estimation with Transformers
- Learning to Animate Images from A Few Videos to Portray Delicate Human Actions
- PaRaChute: Pathology-Radiology Cross-Modal Fusion for Missing-Modality-Robust Survival Prediction
- Marshaled Learning: Bridging Large Neural Networks with Memory-Constrained Trusted Execution Environments in Federated Learning
- OW-Rep: Open World Object Detection with Instance Representation Learning
- TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors
- CADE: Continual Weakly-supervised Video Anomaly Detection with Ensembles
- MBTI: Metric-Based Textual Inversion for Fine-Grained Image Generation
- AutoSew: A Geometric Approach to Stitching Prediction with Graph Neural Networks
- SpikeRain: Towards Energy-Efficient Single Image Deraining with Spiking Neural Networks
- From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance
- Morphing Through Time: Diffusion-Based Bridging of Temporal Gaps for Robust Alignment in Change Detection
- SGPMIL: Sparse Gaussian Process Multiple Instance Learning
- SCORP: Scene-Consistent Object Refinement via Proxy Generation and Tuning
- Zero-Shot Video Deraining with Video Diffusion Models
- Text Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters
- MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
- End-to-End Fine-Tuning of 3D Texture Generation using Differentiable Rewards
- Reverse Personalization
- SymNet: A Multi-Task Network for Joint Radio Map Reconstruction and Transmitter Localization
- MIX-based Foreground and Background Patch Augmentation Guided by Physics and Material Properties for X-ray Detection
- MR-Pruner: Training-free Multi-resolution Visual Token Pruning for Multi-modal Large Language Models
- LooC: Effective Low-Dimensional Codebook for Compositional Vector Quantization
- Fully Unsupervised Self-debiasing of Text-to-Image Diffusion Models
- Salience-SGG: Enhancing Unbiased Scene Graph Generation with Iterative Salience Estimation
- ViGG: Robust RGB-D Point Cloud Registration using Visual-Geometric Mutual Guidance
- Systematic Analysis of the Unintentional CSAM-Generation-Potential of Text-to-Image Models
- 4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos
- Enhanced Back-Projection of Vision Features for 3D Symmetry Detection
- MARS: a Multimodal Alignment and Ranking System for Few-Shot Segmentation
- GrounDiff: Diffusion-Based Ground Surface Generation from Digital Surface Models
- AdaptViG: Adaptive Vision GNN with Exponential Decay Gating
- Overcoming Fine-Grained Visual Challenges in Animal Re-Identification via Semantic Feature Alignment
- UnderWater SLAM with Laser-light sectioning method using ST-GAT
- Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?
- HyperPose: Hyper-pose Embeddings for 3D-Aware Generative Models with Self-Supervised Disentangling of Pose and Scene
- Diverse Sketch Colorization with Content-Enhanced Style Representation and Recolorization Distillation
- SaccadeX: Directed Acyclic Graph-based Semi-Supervised Learning of Continuous Ocular Dynamics from Sparse Neuromorphic Streams
- Beyond Real Weights: Hypercomplex Representations for Stable Quantization
- Analysis of Text Accuracy and Visual Alignment in Vision-Language Models for Artistic Text Generation
- Unified Video Anomaly Detection Model for Detecting Different Anomaly Types
- QuadraNet V2: Efficient and Sustainable Training of High-Order Neural Networks with Quadratic Adaptation
- Feature-Disentangling RGB-NIR Fusion Network for Remote Driver Physiological Measurement
- Deepfake Detection that Generalizes Across Benchmarks
- VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction
- MooTrack360: A Novel Fisheye Camera Dataset for Robust Multi Dairy Cow Detection and Tracking
- ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models
- Zero-Shot Coreset Selection via Iterative Subspace Sampling
- MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions
- SAFER-AiD: Saccade-Assisted Foveal-peripheral vision Enhanced Reconstruction for Adversarial Defense
- Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding
- Temporal Object Captioning for Street Scene Videos from LiDAR Tracks
- QCFace: Image Quality Control for boosting Face Representation & Recognition
- CraftSVG: Multi-Object Text-to-SVG Synthesis via Layout Guided Diffusion
- Pyramidal Spectrum: Frequency-based Hierarchically Vector Quantized VAE for Videos
- Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning
- Revisiting Retentive Networks for Fast Range-View 3D LiDAR Semantic Segmentation
- Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning
- You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction
- Multimodal Medical Image Binding via Shared Text Embeddings
- Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention
- Harnessing Object Grounding for Time-Sensitive Video Understanding
- MEGA-PCC: A Mamba-based Efficient Approach for Joint Geometry and Attribute Point Cloud Compression
- Layout Anything: One Transformer for Universal Room Layout Estimation
- TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning
- CycleSL: Server-Client Cyclical Update Driven Scalable Split Learning
- Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model
- MarineEval: Assessing the Marine Intelligence of Vision-Language Models
- CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video
- Tables Guide Vision: Learning to See the Heart through Tabular Data
- OpenCowID: Zero-Shot Visual Identification of Dairy Cows
- UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations
- GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction
- milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion
- RobustFormer: Noise-Robust Pre-training for Images and Videos
- Imitating the Functionality of Image-to-Image Models Using a Single Example
- Shift-Equivariant Complex-Valued Convolutional Neural Networks
- LENVIZ: A High-Resolution Low-Exposure Night Vision Benchmark Dataset
- AnyBald: Toward Realistic Diffusion-Based Hair Removal In-The-Wild
- WWE-UIE: A Wavelet & White Balance Efficient Network for Underwater Image Enhancement
- Personalized Image Privacy Advisors via Federated Daisy-Chaining
- Illuminating Darkness: Learning to Enhance Low-light Images In-the-Wild
- SSplain: Sparse and Smooth Explainer for Retinopathy of Prematurity Classification
- Divide and Refine: Enhancing Multimodal Representation and Explainability for Emotion Recognition in Conversation
- A-V Representation Learning via Audio Shift Prediction for Multimodal Deepfake Detection and Temporal Localization
- Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models
- Orca: Object Recognition and Comprehension for Archiving Marine Species
- ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora
- PVeRA: Probabilistic Vector-Based Random Matrix Adaptation
- From Few-Shot to Zero-Shot Pallet Load Recognition: A Deployed Embedding-Based Vision System for Industrial Logistics
- Tables Decoded: DELTA for Structure, TARQA for Understanding
- DiRe: Diversity-promoting Regularization for Dataset Condensation
- AD2: Analysis and Detection of Adversarial Threats in Visual Perception for End-to-End Autonomous Driving Systems
- BoxSplitGen: A Generative Model for 3D Part Bounding Boxes in Varying Granularity
- Graph-Based Spectral Attention with Multi-Spectral Images for Illuminant Estimation
- SUGAR: A Sweeter Spot for Generative Unlearning of Many Identities
- DynaGSLAM: Real-Time Gaussian-Splatting SLAM for Online Rendering, Tracking, Motion Predictions of Moving Objects in Dynamic Scenes
- PADM: A Physics-aware Diffusion Model for Attenuation Correction
- ObjectMeshDeform: Towards recovering precise 3D geometry of real objects via image-guided mesh deformation of 3D generative priors
- STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences
- Vision-informed Semantic Text Alignment for Open-set Recognition in Remote Sensing
- Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions
- FlowMorph: Revealing an Optimizable Flow Latent Space for Controlled Image Morphing
- CURIO: Curvature-Aligned and Efficient OCR for Low-Resource Historical Manuscripts
- ScoliGaitX: A Deep Multi-Modal Fusion Network for Scoliosis Assessment via Gait Video Analysis
- Beyond the Highlights: Video Retrieval with Salient and Surrounding Contexts
- Uncertainty-Aware Subset Selection for Robust Visual Explainability under Distribution Shifts
- Enhancing Object Detection Training via Joint Image-Annotation Generation
- UI-Styler: Ultrasound Image Style Transfer with Class-Aware Prompts for Cross-Device Diagnosis Using a Frozen Black-Box Inference Network
- BAFIS: Dataset + Framework to assess occupational Bias and Human Preference in modern Text-to-image Models
- NerVast: Compression-Efficient Scaling of Implicit Neural Video Representations via Scene-based Parameter-sharing (Yunheon Lee, Juncheol Ye, Jaehong Kim, Dongsu Han)
- Dressing the Imagination: A Dataset for AI-Powered Translation of Text into Fashion Outfits and A Novel NeRA Adapter for Enhanced Feature Adaptation
- Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery
- Roadside Monocular 3D Detection Prompted by 2D Detection
- Learnable Query-Enhanced Pose Transformation
- BOP-Distrib: Revisiting 6D Pose Estimation Benchmarks for Better Evaluation under Visual Ambiguities
- From Bands to Depth: Understanding Bathymetry Decisions on Sentinel-2
- MMCM: Multimodality-aware Metric using Clustering-based Modes for Probabilistic Human Motion Prediction
- Test-Time Adaptation through Semantically-guided Feature Decomposition for Few-shot Chest X-ray Diagnosis
- SPAR-Det: Segmentation-guided and Prior-Aided Routing for Small Object Detection
- DRWKV: Focusing on Object Edges for Low-Light Image Enhancement
- Matching Semantically Similar Non-Identical Objects
- 3D Gaussian Point Encoders
- Segmentation-Aware Latent Diffusion for Satellite Image Super-Resolution: Enabling Smallholder Farm Boundary Delineation
- Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models
- GeoHSAF: Geometric Hippocampus Shape Analysis Framework for Longitudinal Alzheimer's Disease Classification
- Cosine Similarity is Almost All You Need (for Prototypical-Part Models)
- RobustGait: Robustness Analysis for Appearance Based Gait Recognition
- DenseBEV: Transforming BEV Grid Cells into 3D Objects
- Color Bind: Exploring Color Perception in Text-to-Image Models
- HyPCA-Net: Advancing Multimodal Fusion in Medical Image Analysis
- DMAT: An End-to-End Framework for Joint Atmospheric Turbulence Mitigation and Object Detection
- DoTA: Latent Distribution Conditioned Data Attribution for Diffusion Models
- Trajectory Tactics: When Transformers Learn Exploration to Generate Online Signature
- Reconstructing Realistic and Relightable Eyes
- MorphXAI: An Explainable Framework for Morphological Analysis of Parasites in Blood Smear Images
- Semi-Supervised Hierarchical Open-Set Classification
- Ordinal-Aware Multimodal Engagement Recognition for Collaborative Learning
- EllipssianNet: Image-guided Sampling of 2D Gaussians for Gaussian Splatting
- Reviving Unsupervised Optical Flow: Concept Reevaluation, Multi-Scale Advances and Full Open-Source Release
- ArchitectHead: Continuous Level of Detail Control for 3D Gaussian Head Avatars
- From SAM to DINOv2: Towards Distilling Foundation Models to Lightweight Baselines for Generalized Polyp Segmentation
- UniCalib: Targetless LiDAR-camera Calibration via Probabilistic Flow on Unified Depth Representations
- Understanding Generative AI Capabilities in Everyday Image Editing Tasks
- FAST-EQA: Efficient Embodied Question Answering with Global and Local Region Relevancy
- Mitigating the Modality Gap: Few-Shot Out-of-Distribution Detection with Multi-modal Prototypes and Image Bias Estimation
- OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models
- Identity Verification from Human Scent using Channel Representation of 2D Gas Chromatography-Mass Spectrometry Data
- Detection-Driven Object Count Optimization for Text-to-Image Diffusion Models
- ConsensusXAI: A framework to examine class-wise agreement in medical imaging
- PHYSPLAT: a Framework for Photorealistic Hybrid Simulation of Real and Synthetic Elements using 3D Gaussian Splatting
- BrandFusion: Aligning Image Generation with Brand Styles
- ITSELF: Attention Guided Fine-Grained Alignment for Vision–Language Retrieval
- InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation
- FARF-Net: Frequency-guided Adaptive Receptive Field Network for Edge-enhanced Polyp Segmentation
- Real-Time Tracking of Flexible Markers in Low-Contrast Fluoroscopy Using a Deep Neural Network Trained Solely on Synthetic Data
- Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space
- PRISM-CAFO: Prior-conditioned Remote-sensing Infrastructure Segmentation and Mapping for CAFOs
- The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs
- SasMamba: A Lightweight Structure-Aware Stride State Space Model for 3D Human Pose Estimation
- Low-Rank Expert Merging for Multi-Source Domain Adaptation in Person Re-Identification
- 3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting
- Mobile-Oriented Video Diffusion: Enabling Text-to-Video Generation on Mobile Devices Without Retraining, Compression, or Pruning
- Towards Reliable Test-Time Adaptation: Style Invariance as a Correctness Likelihood
- Decomposition Sampling for Efficient Region Annotations in Active Learning
- TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model
- Event-based Graph Representation with Spatial and Motion Vectors for Asynchronous Object Detection
- SCATR: Mitigating New Instance Suppression in LiDAR-based Tracking-by-Attention via Second Chance Assignment and Track Query Dropout
- DTMIR-Pro: Domain Translation with Prompt-based Latent-Space Generalization for Multi-Weather Image Restoration
- Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models
- Power of Boundary and Reflection: Semantic Transparent Object Segmentation using Pyramid Vision Transformer with Transparent Cues
- Towards Streaming LiDAR Object Detection with Point Clouds as Egocentric Sequences
- LighthouseGS: Indoor Structure-aware 3D Gaussian Splatting for Panorama-Style Mobile Captures
- Dragonite: Single-Step Drag-based Image Editing with Geometric-Semantic Guidance
- NERVE: Neighbourhood & Entropy-Guided Random-Walk for Training Free Open-Vocabulary Segmentation
- Cross-Modal Event Encoder: Bridging Image–Text Knowledge to Event Streams
- 2S-CEDiff: A Two-Stage Diffusion Framework for Generating High-Fidelity Contrast-Enhanced CT Images from Non-Contrast Scans
- Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling
- FedSCAl: Leveraging Server and Client Alignment for Unsupervised Federated Source-Free Domain Adaptation
- DOTGraph: CLIP-Driven Feature Disentanglement and Optimal Transport based Graph Learning for Few-Shot Segmentation
- Multi-Modal Soccer Scene Analysis with Masked Pre-Training
- TiCLS: Tightly Coupled Language Text Spotter
- Detecting Social Engagement of Elderly From Lifelog Image-streams to Identify Effective Cues for Autobiographic Recall
- Learning from Unknown for Open-Set Test-Time Adaptation
- Frequency Is What You Need: Considering Word Frequency When Text Masking Benefits Vision-Language Model Pre-training
- Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory
- HEART-PFL: Stable Personalized Federated Learning under Heterogeneity with Hierarchical Directional Alignment and Adversarial Knowledge Transfer
- SENCA-st: Integrating Spatial Transcriptomics and Histopathology with Cross Attention Shared Encoder for Region Identification in Cancer Pathology
- Unified Alignment Protocol: Making Sense of the Unlabeled Data in New Domains
- FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation
- Crafting Descriptive Information for a Zero-shot Method to Improve Knowledge-Based Visual Question Answering Performance
- Perception-Inspired Color Space Design for Photo White Balance Editing
- Autoregressive Styled Text Image Generation, but Make it Reliable
- Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models
- High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization
- Learning Group Actions In Disentangled Latent Image Representations
- Empowering Source-Free Domain Adaptation via MLLM-Guided Reliability-Based Curriculum Learning
- Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation
- CLIP-IT: CLIP-based Pairing of Histology Images with Privileged Textual Information
- DATTA: Domain-Adversarial Test-Time Adaptation for Cross-Domain WiFi-Based Human Activity Recognition
- Blur2Sharp: Human Novel Pose and View Synthesis with Generative Prior Refinement
- Stroke Modeling Enables Vectorized Character Generation with Large Vectorized Glyph Model
- Universal Neural Architecture Space: Covering ConvNets, Transformers and Everything in Between
- TalkingHeadBench: A Multi-Modal Benchmark & Analysis of Talking-Head DeepFake Detection
- KD360-VoxelBEV: LiDAR and 360-degree Camera Cross Modality Knowledge Distillation for Bird’s-Eye-View Segmentation
- CasTex: Cascaded Text-to-Texture Synthesis via Explicit Texture Maps and Physically-Based Shading
- LightGazeNet: A Lightweight GNN-based Architecture for Gaze Estimation
- Sketch-guided Cage-based 3D Gaussian Splatting Deformation
- Efficient Vision Transformers via Token Merging with Head-wise Attention Correction
- Fused Similarity Measure Based Alignment with Dual-Scale Adaptive Selection for Weakly Supervised Video Anomaly Detection
- SOAF: Scene Occlusion-aware Neural Acoustic Field
- A Novel Metric for Detecting Memorization in Generative Models for Brain MRI Synthesis
- Guided Texture Segmentation via Coordinate-Aware Class-Ratio Mapping
- ExDDV: A New Dataset for Explainable Deepfake Detection in Video
- One Model, Many Behaviors: Training-Induced Effects on Out-of-Distribution Detection
- Towards High-Fidelity, Identity-Preserving Real-Time Makeup Transfer: Decoupling Style Generation
- Guided Model Merging for Hybrid Data Learning: Leveraging Centralized Data to Refine Decentralized Models
- Structure-Aware Feature Rectification with Region Adjacency Graphs for Training-free Open-Vocabulary Semantic Segmentation
- Feature Inversion as a Lens on Vision Encoders
- DCSHARP: 3D Gaussian Splatting with Direction Cosine Spherical Harmonics and Shape-Aware Pruning
- Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression
- Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination
- CaRS: A Causal Intervention Segmentation Framework and Benchmark Dataset for Autonomous Driving under Transitional Weather Conditions
- ENCORE: A Neural Collapse Perspective on Out-of-Distribution Detection in Deep Neural Networks
- AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM
- Sea-CLIP: Mining Semantic-Aware Representations for Few-Shot Anomaly Detection with CLIP
- Extreme Amodal Face Detection
- StreetView-Waste: A Multi-Task Dataset for Urban Waste Management
- Training-free Detection of Text-to-video Generations via Over-coherence
- FairScene: Learning Class-Disentangled 2D/3D Representations for Semantic Scene Completion
- MANTA: Physics-Informed Generalized Underwater Object Tracking
- GAITGen: Disentangled Motion-Pathology Impaired Gait Generative Model -- Bringing Motion Generation to the Clinical Domain
- Gen-AFFECT: Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy
- SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering
- ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion
- Augmenting with NeRFs: Fast Relocalization on Densified Datasets
- CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
- SCALEX: Scalable Concept and Latent Exploration for Diffusion Models
- Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image
- VOCAL: Visual Odometry via ContrAstive Learning
- Show Me: Unifying Instructional Image and Video Generation with Diffusion Models
- Comp4D: Compositional 4D Scene Generation
- AGENet: Adaptive Edge-aware Geodesic Distance Learning for Few-Shot Medical Image Segmentation
- Self-Supervised Visual Prompting for Cross-Domain Road Damage Detection
- Latent Uncertainty-Aware Multi-View SDF Scan Completion
- OPFormer: Object Pose Estimation leveraging foundation model with geometric encoding
- SimForce: Force and Surface Electromyography from Full Body Video with Graph Neural Nets
- KMOPS: Keypoint-Driven Method for Multi-Object Pose and Metric Size Estimation from Stereo Images
- Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting
- FLoMo-Net: A Novel Task-Adaptive Mixture of Experts Routing Framework with Frequency and Uncertainty Correction for Medical Image Segmentation
- FlowEO: Generative Unsupervised Domain Adaptation for Earth Observation
- Domain Generalizing DINO for Visual Regression via Latent Distractor Subspace Consistency
- Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries
- MuSACo: Multimodal Subject-Specific Selection and Adaptation for Expression Recognition with Co-Training
- HumanBench: Two Heads, No Legs, But Mostly Human, the State of Generative Capabilities in T2I Models
- A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback
- JOCA: Task-Driven Joint Optimisation of Camera Hardware and Adaptive Camera Control Algorithms
- Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing
- Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning
- Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment
- GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving
- Perceptually Guided 3DGS Streaming and Rendering for Mixed Reality
- HistoMILKD: A Multiple Instance Learning based Multi-Teacher Knowledge Distillation Framework for Whole Slide Image Classification
- ART-ASyn: Anatomy-aware Realistic Texture-based Anomaly Synthesis Framework for Chest X-Rays
- WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion
- From Detection to Anticipation: Online Understanding of Struggles across Various Tasks and Activities
- Workzone3D: A Multimodal Dataset for 3D Work Zone Perception in Autonomous Driving
- Pose-Diverse Multi-View Virtual Try-on from a Single Frontal Image via Diffusion Transformer
- MAFM³: Modular Adaptation of Foundation Models for Multi-Modal Medical AI
- Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings
- Rethinking Real Image Editing: Unleashing Diverse Editing Operators via Multi-Objective Optimization
- Self-Supervised Compression and Artifact Correction for Streaming Underwater Imaging Sonar
- Decoupling Shape and Texture in SAM-2 via Controlled Texture Replacement
- MuseDance: A Diffusion-based Music-Driven Image Animation System
- Unsupervised Modular Adaptive Region Growing and RegionMix Classification for Wind Turbine Segmentation
- CLIP’s Visual Embedding Projector is a Few-shot Cornucopia
- Codebook Knowledge with Mamba-Transformer For Low-Light Image Enhancement
- DNA: Dual-branch Network with Adaptation for Open-Set Online Handwriting Generation
- CLUE: Bringing Machine Unlearning to Mobile Devices
- SPOC: Spatially-Progressing Object State Change Segmentation in Video
- Visual Detector Compression via Location-Aware Discriminant Analysis
- BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries
- BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining
- Dronaquatics: Real-time Swimming Analytics Using Drone Captured Imagery
- Restora-Flow: Mask-Guided Image Restoration with Flow Matching
- Towards Photorealistic Style Transfer with Multimodal Guidance and Robustness to Content Images in Arbitrary Styles
- Robust Multimodal Emotion Recognition from Incomplete Modalities via Query-Based Unimodal and Cross-Modal Learning
- GAEA: A Geolocation Aware Conversational Assistant
- Hestia: Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction
- Controllable Long-term Motion Generation with Extended Joint Targets
- 3D Superquadric Splatting
- FSP-DETR: Few-Shot Prototypical Parasitic Ova Detection
- Style-Friendly SNR Sampler for Style-Driven Generation
- Edge-Aware Image Manipulation via Diffusion Models with a Novel Structure-Preservation Loss
- SynchroRaMa: Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding
- Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation
- Confidence Through Parallel Attention for Depth and Uncertainty Estimation in Dynamic Environments
- ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models
- RPT-SR: Regional Prior attention Transformer for infrared image Super-Resolution
- Eye-for-an-eye: Appearance Transfer with Dense Semantic Correspondence in Diffusion Models
- CropAT: Leveraging Diffusion-Generated Target-Like Cropped Objects for Pseudo-Label Refinement in Domain-Adaptive Object Detection
- GrowTAS: Progressive Expansion from Small to Large Subnets for Efficient ViT Architecture Search
- HiMix : Hierarchical Visual-Textual Mixing Network for Lesion Segmentation
- One-Cycle Structured Pruning via Stability-Driven Subnetwork Search
- Unsupervised Segmentation by Diffusing, Walking and Cutting
- Broadcast2Pitch: Game State Reconstruction from Unconstrained Soccer Videos
- VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework
- AEON: Adaptive Embedding Optimized Noise for Robust Watermarking in Diffusion Models
- TimeRefine: Temporal Grounding with Time Refining Video LLM
- MoRe: Monocular Geometry Refinement via Graph Optimization for Cross-View Consistency
- F-ViTA: Foundation Model Guided Visible to Infrared Translation
- Revisiting Vision–Language Foundations for No-Reference Image Quality Assessment
- More Than Memory Savings: Zeroth-Order Optimization Mitigates Forgetting in Continual Learning
- Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning
- UCDSC: Open Set UnCertainty aware Deep Simplex Classifier for Medical Image Datasets
- CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow
- Beyond Faces: A Multimodal Person Clustering for Unconstrained Environments
- CoL2A: Convolution-free Local Linear Attention for SpatioTemporal Event Processing
- 4D Multimodal Co-attention Fusion Network with Latent Contrastive Alignment for Alzheimer's Diagnosis
- DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy
- DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions
- Lorentz Entailment Cone for Semantic Segmentation
- Countering Multi-modal Representation Collapse through Rank-targeted Fusion
- WSSSP-Net: Weakly Supervised Semantic Segmentation Plugin Network for Face Anti-Spoofing
- INRetouch: Context Aware Implicit Neural Representation for Photography Retouching
- OMeGa: Joint Optimization of Explicit Meshes and Gaussian Splats for Robust Scene-Level Surface Reconstruction
- Are All Marine Species Created Equal? Performance Disparities in Underwater Object Detection
- PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection
- VLMs Guided Interpretable Decision Making in Autonomous Driving
- Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups
- Learning spatio-temporal feature representations for video-based gaze estimation
- Sun-E: Dataset and Benchmark for Event-Based Sun Sensing
- DPBridge: Latent Diffusion Bridge for Dense Prediction
- Rank-based Geographical Regularization: Revisiting Contrastive Self-Supervised Learning for Multispectral Remote Sensing Imagery
- Lose Your Self (LoYS): an adversarial entropy-based unsupervised approach for model debiasing
- How I Met Your Bias: Investigating Bias Amplification in Diffusion Models
- EndoPBR: Photorealistic Synthetic Data for Surgical 3D Vision via Physically-based Rendering
- A Dataset and Framework for Learning State-invariant Object Representations
- An Efficient Multi-Rater Setup Towards Personalized and Diversified Medical Image Segmentation
- SSMRadNet : A Sample-wise State-Space Framework for Efficient and Ultra-Light Radar Segmentation and Object Detection
- Zero-Shot Table Extraction in Business Documents: A Unified Benchmark with Error Taxonomy and Ecological Analysis
- ControlEvents: Controllable Synthesis of Event Camera Data with Foundational Prior from Image Diffusion Models
- R3: Reconstruction, Raw, and Rain: Deraining Directly in the Bayer Domain
- PredMapNet: Future and Historical Reasoning for Consistent Online HD Vectorized Map Construction
- An improved architecture for part-based animal re-identification through semantic segmentation distillation
- Grounding Descriptions in Images informs Zero-Shot Visual Recognition
- Semi-supervised Key-Point Estimation for Echocardiography Video
- Improving Animal Pose Estimation through Species Similarity Measures and Rigorous Label Definition
- Subspace-Guided Knowledge Distillation for Efficient Model Transfer
- MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes
- Patch-wise Retrieval: A Bag of Practical Techniques for Instance-level Matching
- KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding
- Quantifying the Limits of Segmentation Foundation Models: Modeling Challenges in Segmenting Tree-Like and Low-Contrast Objects
- Curve Skeletonization in Continuous domain for Meshes and Point Clouds
- VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping
- SIAM: Synchronous Interaction Attention for Human Mesh Recovery
- Global Focal and Radial Distortion Averaging from Radial Fundamental Matrices for Robust Self-Calibration
- Multi-view stereo with multiple projectors for oneshot entire shape scan based on Neural SDF and DSSS demultiplexing
- FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks
- Surgical Gaussian Surfels: Highly Accurate Real-time Surgical Scene Rendering using Gaussian Surfels
- FujiView: Multimodal Late-Fusion for Predicting Scenic Visibility
- PrevMatch: Revisiting and Maximizing Temporal Knowledge in Semi-Supervised Semantic Segmentation
- Context-Preserving Dermoscopic Editing: Mask-Guided Lesion-Aware Diffusion for Attribute Modification
- DMS2F-HAD: A Dual-branch Mamba-based Spatial–Spectral Fusion Network for Hyperspectral Anomaly Detection
- X-JEPA: A Novel Joint Learning Cross-Modal Predictive Alignment Framework for Remote Sensing Image Retrieval
- WarpRF: Multi-View Consistency for Training-Free Uncertainty Quantification and Applications in Radiance Fields
- ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos
- ProtoGMVAE: A Variational Auto-Encoder with True Gaussian Mixture Prior for Prototypical-based Self-Explainability
- Gated Temporal Fusion Transformers for Robust Multi-Object Tracking
- ControlVP: Interactive Geometric Refinement of AI-Generated Images with Consistent Vanishing Points
- Line Art Colorization with Offset Prior-based Diffusion Model
- Color Preserving CMOS-SPAD Fusion for Multi-Frame HDR
- ProSkill: Segment-Level Skill Assessment in Procedural Videos
- Ego-EXTRA: video-language Egocentric Dataset for EXpert-TRAinee assistance
- Automated Pore Detection from In-Situ FDM 3D Printing Video: A Comparative Evaluation of Modern Segmentation Models
- SeaClips: A Video Dataset for Maritime Object Detection
- SegMango: Early Deep Mango Yield Prediction based on Flower Segmentation and Weather Data
- CalibBEV: LiDAR-Camera Calibration via BEV Alignment
- Direct Visual Grounding by Directing Attention of Visual Tokens
- WiSE-OD: Benchmarking Robustness in Infrared Object Detection
- HiGlassRM: Learning to Remove High-prescription Glasses via Synthetic Dataset Generation
- CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization
- Learning Unified Spatio-temporal Representations for Efficient Compressed Video Understanding
- SOPHY: Generating Simulation-Ready Objects with Physical Materials
- Towards Egocentric 3D Hand Pose Estimation in Unseen Domains
- DARB-Splatting: Generalizing Splatting with Decaying Anisotropic Radial Basis Functions
- Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery
- LASOR: Towards Clinically Transparent and Explainable Ophthalmic Report Generation via Lesion-Aware Segmentation
- Semi-supervised Domain Adaptation via Mutual Alignment through Joint Error
- Safe Vision-Language Models via Unsafe Weights Manipulation
- RemEdit: Efficient Diffusion Editing with Riemannian Geometry
- RobuMTL: Enhancing Multi-Task Learning Robustness Against Weather Conditions
- One-shot Portrait Stylization via Geometric Alignment
- FCC: Fully Connected Correlation for One-Shot Segmentation
- Enhancing Vision Language Corruption Robustness using Cross Distribution & Prompted Denoisers
- Inpainting of Sparse Depth Maps from Monocular Depth-from-Focus on Pixel Processor Arrays
- UniDiff: Parameter-Efficient Adaptation of Diffusion Models for Land Cover Classification with Multi-Modal Remotely Sensed Imagery and Sparse Annotations
- Histopath-C: Towards Realistic Domain Shifts for Histopathology Vision-Language Adaptation
- Hierarchical Adaptive networks with Task vectors for Test-Time Adaptation
- Generalization of Real World Video Deblurring By Image-to-Image Translation
- Similarity-aware Probabilistic Embeddings Modeling for Video-Text Retrieval
- Food Image Generation on Multi-Noun Categories
- Spec-Gloss Surfels and Normal–Diffuse Priors for Relightable Glossy Objects
- Improving Out-of-Distribution Detection Using Segmented Images and Cross-View Attention Fusion
- FocalComm: Hard Instance-Aware Multi-Agent Perception
- High-Level Semantics and Low-Level Features Fusion for Multi-Scale Object Detection in Dynamic Construction Environments
- Neural Geometry Image-Based Representations with Optimal Transport (OT)
- Graph Query Networks for Object Detection with Automotive Radar
- CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering
- Zero‑Shot Domain Generalisation via Prompt-Driven Feature Refinement
- ChartQA-X: Generating Explanations for Visual Chart Reasoning
- SFMNet: Sparse Focal Modulation for 3D Object Detection
- T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation
- LogicCBMs: Logic-Enhanced Concept-Based Learning
- Distribution Highlighted Reference-based Label Distribution Learning for Facial Age Estimation
- Can Image Splicing and Copy-Move Forgery Be Detected by the Same Model? Forensim: An Attention-Based State-Space Approach
- TacticalCalib: End-to-End 6-DoF Camera Pose Regression for Tactical Camera Calibration
- Clear Sights on Site: A Spatial-Adaptive Channel Network for Deblurring Construction Site Images
- GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts
- brat: Aligned Multi-View Embeddings for Brain MRI Analysis
- Learning Beyond Labels: Self-Supervised Handwritten Text Recognition
- GFT-GCN: Privacy-Preserving 3D Face Mesh Recognition with Spectral Diffusion
- DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors
- GDoFS: Gaussian DoF Separation for Plausible 3D Geometry in Sparse-View 3DGS
- MoSCo: Real-time and Efficient Text-to-Motion Synthesis via Delta Training
- IPCD: Intrinsic Point-Cloud Decomposition
- TED-4DGS: Temporally Activated and Embedding-based Deformation for 4DGS Compression
- Digital Forensic AI You Can Explain: A Case Study on Video Source Camera Identification
- SceneProp: Combining Neural Network and Markov Random Field for Scene-Graph Grounding
- PALMS+: Modular Image-Based Floor Plan Localization Leveraging Depth Foundation Model
- PoseAdapt: Sustainable Human Pose Estimation via Continual Learning Benchmarks and Toolkit
- FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding
- FAIR-SIGHT: Fairness Assurance in Image Recognition via Simultaneous Conformal Thresholding and Dynamic Output Repair
- PerVL-Bench: Benchmarking Multimodal Personalization for Large Vision–Language Models
- Diffusion Noise Optimization for Synthetic VLM Training
- See, Record, Do: Automated Generation of UI Workflows from Tutorial Videos
- TRACE: Confounder-free Adversarial Fine-tuning for Robust Object Detection
- DREAM: Dynamic Prompts and GuidedMix for Efficient Continual Adaptation of Visual-Language Models
- Virtually Unrolling the Herculaneum Papyri by Diffeomorphic Spiral Fitting
- Overcoming Small Data Limitations in Video-Based Infant Respiration Estimation
- Zero-LEAD: Source-Free Universal Domain Adaptation for Abdominal Multi-Organ Segmentation
- AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization
- SD-CSFL: A Synthetic Data-Driven Conformity Scoring Framework for Robust Federated Learning
- CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation
- D2Mamba: Dual Domain Guided Informed Search in State Space Model for Underwater Image Enhancement
- TM-Adapter: Temporal Merge Adapter for Efficient Global Temporal Modeling
- SegMo: Segment-aligned Text to 3D Human Motion Generation
- BAFLE-DCT: Bypassing Adversarial Filters via Frequency-Selective Embedding in the DCT Domain
- Enhancing Monocular 3D Hand Reconstruction with Learned Texture Priors
- Optimal Transport for Rectified Flow Image Editing: Unifying Inversion-Based and Direct Methods
- Histogram Assisted Quality Aware Generative Model for Resolution Invariant NIR Image Colorization
- DreamCatcher: Efficient Multi-Concept Customization via Representation Finetuning
- SSMT-Net: A Semi-Supervised Multitask Transformer-Based Network for Thyroid Nodule Segmentation in Ultrasound Images
- Understanding the Visual Projection Space of Multimodal LLMs
- Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score
- NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining
- R-MMA: Enhancing Vision-Language Models with Recurrent Adapters for Few-Shot and Cross-Domain Generalization
- Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
- VAST-ReID: A Low-Light Benchmark Dataset for Person Re-Identification with Visual and Attribute-Rich Semantic Tracking
- A Deep Network for Object Detection on Inland Waters
- ODEt(ODEl): Shortcutting the Time and the Length in Diffusion and Flow Models for Faster Sampling
- ZebraPose: Zebra Detection and Pose Estimation using only Synthetic Data
- Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control
- IMKD: Intensity-Aware Multi-Level Knowledge Distillation for Camera-Radar Fusion
- Memoire: Learning User Personas from Gallery Tags for Personalized Photo Curation
- UniTabBank: A Large Scale Multi-Lingual, Multi-Layout, Multi-Type, Multi-Format Dataset for Table Detection
- VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion
- Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation
- SGD-Mix: Enhancing Domain-Specific Image Classification with Label-Preserving Data Augmentation
- Training-Free Few-Shot Segmentation via Vision-Language Guided Prompting
- RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding
- Correcting and Quantifying Systematic Errors in 3D Box Annotations for Autonomous Driving
- Synthesizing Compositional Videos from Text Description
- Polymorph: Energy-Efficient Multi-Label Classification for Video Streams on Embedded Devices
- RealDroneVision: Dataset and Architecture Advancements for Small-Object Drone Detection
- CSGaussian: Progressive Rate-Distortion Compression and Segmentation for 3D Gaussian Splatting
- Diffusion-Based Action Recognition Generalizes to Untrained Domains
- Modeling and Learning Multiple Hypotheses for Monocular 3D Object Detection
- HodgeFormer: Transformers for Learnable Operators on Triangular Meshes through Data-Driven Hodge Matrices
- ATM: Enhanced Alignment for Text-to-Motion Generation
- WiSAR3D - Aerial LiDAR dataset for 3D object detection
- Gaussian Splatting Map Registration with Orthographic Bird's-Eye-View Renderings
- FNOPT: Resolution-Agnostic, Self-Supervised Cloth Simulation using Meta-Optimization with Fourier Neural Operators
- UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training
- A Unified Diffusion-Based Framework for Multi-Agent Trajectory Prediction Integrating Structured Multi-Modal Representations
- Point2Pose: A Generative Framework for 3D Human Pose Estimation with Multi-View Point Cloud Dataset
- V2XScene: Multi-View Consistent 3D Scene Simulation for Collaborative Perception
- Test Time Adaptation Using Adaptive Quantile Recalibration
- OSEG: Improving Diffusion sampling through Orthogonal Smoothed Energy Guidance
- FlowCLAS: Enhancing Normalizing Flow-Based Anomaly Segmentation Via Contrastive Learning
- Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport
- Multimodal Graph Representation Learning over Arbitrary Sets of Modalities
- Image-Guided Semantic Pseudo-LiDAR Point Generation for 3D Object Detection
- Distilling Offline Action Detection Models into Real-Time Streaming Models
- SAVeD: Learning to Denoise Low-SNR Video for Improved Downstream Performance
- Single-step Diffusion for Image Compression at Ultra-Low Bitrates
- Motion-Aware Graph Fusion Network for 3D Human Pose Estimation
- Hybrid State Representation for Video Procedure Planning
- UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
- Causality-Driven Audits of Model Robustness
- Alignment and Distillation: A Robust Framework for Multimodal Domain Generalizable Human Action Recognition
- Test-Time Adaptation for Video Highlight Detection Using Meta-Auxiliary Learning and Cross-Modality Hallucinations
- Denoise, Divide, Distill, and Predict (D3P): Towards Forecasting Long-horizon Real-world Anomaly from Normalcy
- Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information
- ForestSplats: Deformable transient field for Gaussian Splatting in the Wild
- Predicting Task fMRI Contrasts from Resting-State fMRI Using Sparse 3D Convolutions
- MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection
- SceneEdited: A City-Scale Benchmark for 3D HD Map Updating via Image-Guided Change Detection
- ScoreNet: Netting Lightweight Quality Scores for Better Visual Assessment with Large Multi-Modality Models
- From Prompt to Production: Automating Brand-Safe Marketing Imagery with Text-to-Image Models
- GroupPortrait: Multi-ID Portrait Generation with High Identity Preservation and Fine-Grained Control
- From Darkness to Detail: Frequency-Aware SSMs for Low-Light Vision
- Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities
- MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding
- Logit-Adjusted Test-Time Adaptation under Partial Class Imbalance
- S2O: Static to Openable Enhancement for Articulated 3D Objects
- CoreCaption: Core Caption based Text-to-Video Retrieval
- Align Video Diffusion Model with Online Video-Centric Preference Optimization
- HABIT: Human Action Benchmark for Interactive Traffic in CARLA
- Exploring the Boundaries of Diffusion Models for Offline Writer Identification with Sparse and Intra-Variable Data
- Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment
- Isolating the Role of Temporal Information in Video Saliency: A Controlled Experimental Analysis
- Diagnose Like A REAL Pathologist: An Uncertainty-Focused Approach for Trustworthy Multi-Resolution Multiple Instance Learning
- Hymavi : A Hybrid Mamba-Attention Network in Multi-View Framework for Volumetric Medical Image Segmentation
- MUSE: Model-based Uncertainty-aware Similarity Estimation for zero-shot 2D Object Detection and Segmentation
- Exploring Automated Recognition of Instructional Activity and Discourse from Multimodal Classroom Data
- VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics
- F-INR: Functional Tensor Decomposition for Implicit Neural Representations
- SAVE: Sparse Autoencoder‑Driven Visual Information Enhancement for Mitigating Object Hallucination
- PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval
- PointSt3R: Point Tracking through 3D Ground Correspondence
- CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation
- Optimization-Free Style Transfer for 3D Gaussian Splats
- Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
- Locally Explaining Prediction Behavior via Gradual Interventions and Measuring Property Gradients
- SeqFeedNet: Sequential Feature Feedback Network for Background Subtraction
- LASER: Lip Landmark Assisted Speaker Detection for Robustness
- GRAPE (Gaussian Rendering for Accelerated Pixel Enhancement) Brings Fast and Lightweight Arbitrary Super-Resolution
- UltraClean: A Simple Framework to Train Robust Neural Networks against Backdoor Attacks
- MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
- DM3Net: Dual-Camera Super-Resolution via Domain Modulation and Multi-scale Matching
- LangPose: Language-Aligned Motion for Robust 3D Human Pose Estimation
- Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering
- Splatter Layout: Geometry-embedded 3D Reconstruction via Surface Unfolding
- Scalable Video Action Anticipation with Cross Linear Attentive Memory
- CAST: Evaluating Multi-Object Trackers with Context-Aware Switch and Transfer Scores
- DocWaveDiff: A Predict-and-Refine Approach for Document Image Enhancement with Wavelet U-Nets and Diffusion Models
- Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs
- Intraoperative 2D/3D Registration via Spherical Similarity Learning and Differentiable Levenberg-Marquardt Optimization
- VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
- QuEENet: Quantum-Enhanced Expressive Network for Image Classification
- Any Detector Can Detect Anything
- Adversarial Pseudo-replay for Exemplar-free Class-incremental Learning
- Leveraging Sparsity for Privacy in Collaborative Inference
- DualRes: Production-ready Dynamic Object Detection
- SDT-6D: Fully Sparse Depth-Transformer for Staged End-to-End 6D Pose Estimation in Industrial Multi-View Bin Picking
- CSF-Net: Context-Semantic Fusion Network for Large Mask Inpainting
- Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study
- Joint Optimization of Camera Model and Deep Neural Network for Image Recognition
- Conversational Image Generation: Towards Multi-Round Personalized Generation with Multi-Modal Language Models
- NRGMark: Localized Watermarking for Energy Transparency in Images
- VRAgent: Self-Refining Agent for Zero-Shot Multimodal Video Retrieval
- IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers
- ICONIC-444: A 3.1-Million-Image Dataset for OOD Detection Research
- Sketch3R: Rapid and Realistic 3D VR Sketch Creation to Shape Retrieval
- Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models
- ReFineVQA: Iterative Refinement of Video Description via Feedback Generation for Video Question Answering
- FG-TRACER: Tracing Information Flow in Multimodal Large Language Models in Free-Form Generation
- Bridging the Domain Gap in Small Multimodal Models: A Dual-level Alignment Perspective
- Dual-Domain Multimodal Hyperbolic Fusion for Cardiopulmonary Disease Diagnosis in Emergency Care
- VitaGlyph: Vitalizing Artistic Typography with Flexible Dual-branch Diffusion Models
- IMPACT: Interpretable Most Important Person Analysis and Classification using Transformer-based Models
- DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment
- MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps
- Cycle-consistent Multi-graph Matching for Self-supervised Annotation of C. Elegans
- SVS-GAN for Semantic Synthesis of Traffic Videos for Autonomous Driving
- SmoothDiffusion-VE: Real-time Generative Video Editing Using Adaptive Feature Cache
- Fetal and Neonatal Cortical Surface Reconstruction with Anatomical Normal-guidance and Perceptual Enhancements
- Optimizing against Infeasible Inclusions from Data for Semantic Segmentation through Morphology
- Where is the Watermark? Interpretable Watermark Detection at the Block Level
- Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training
- DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis
- ACuRE: Accurate Continuity-Regularized SpO2 Estimation Using Liquid Time-Constant Networks
- Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios
- DiffRegCD: Integrated Registration and Change Detection with Diffusion Features
- Advancing Player Identification and Tracking with Global ID Fusion (GIF)
- FlyPose: Towards Robust Human Pose Estimation From Aerial Views
- HDR Reconstruction Boosting with Training-Free and Exposure-Consistent Diffusion
- Automated Suturing Skill Assessment in Robot-assisted Surgery from Endoscopic Videos using Clinically-guided Evaluation Criteria
- Flood-LDM: Generalizable Latent Diffusion Models for rapid and accurate zero-shot High-Resolution Flood Mapping
- Enhancing Reverse Distillation with Core Exemplar Learning for Unified Multi-Class Anomaly Detection
- Rethinking Latent Variable in Learned Image Compression
- Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting
- ISALux: Illumination and Semantics-Aware Transformer Employing Mixture of Experts for Low Light Image Enhancement
- Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors
- Remote Sensing Forestry Similarity Convolution
- STRinGS: Selective Text Refinement in Gaussian Splatting
- SphereEdit: Spherical Semantic Editing in Diffusion Models
- SOLAR: Switchable Output Layer for Accuracy and Robustness in Once-for-All Training
- Stabilizing Direct Training of Spiking Neural Networks: Membrane Potential Initialization and Threshold-robust Surrogate Gradient
- Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs
- FairVLM: Enhancing Fairness and Prompt Sensitivity in Vision Language Models for Medical Image Segmentation
- MemeTAG: Keyword-Driven Meme Classification through Tag Embedding Reconstruction
- PhysEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education
- MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping
- MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data
- Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation
- RampWatch: An In-the-Wild Dataset and Text-Guided Detection Framework for Recreational Vessels
- TopoRec: Point Cloud Recognition Using Topological Data Analysis
- SafeguardGS: 3D Gaussian Primitive Pruning While Avoiding Catastrophic Scene Destruction
- VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning
- FastPose-ViT: A Vision Transformer for Real-Time Spacecraft Pose Estimation
- Meta-YOLO: Metadata-Guided Real-Time Object Detector in Aerial Imagery
- Diversity Preserving Coresets for Image Quality Assessment
- Exploiting Label-Independent Regularization from Spatial Patterns for Whole Slide Image Analysis
- GorillaWatch: An Automated System for In-the-Wild Gorilla Re-Identification and Population Monitoring
- See, Think, Learn: A Self-Taught Multimodal Reasoner
- View-aware Cross-modal Distillation for Multi-view Action Recognition
- QAL : A Loss for Recall–Precision Balance in 3D Reconstruction
- Improvise, Adapt, Overcome — Telescopic Adapters for Efficient fine-tuning of Vision Language Models in Medical Imaging
- FedEFC: Federated Learning Using Enhanced Forward Correction Against Noisy Labels
- SpecGen: Neural Spectral BRDF Generation via Spectral-Spatial Tri-plane Aggregation
- GFT: Graph Feature Tuning for Efficient Point Cloud Analysis
- AortaDiff: A Unified Multitask Diffusion Framework for Contrast-Free AAA Imaging
- PEaRL: Pathway-Enhanced Representation Learning for Gene and Pathway Expression Prediction from Histology
- GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting
- NavMapFusion: Diffusion-based Fusion of Navigation Maps for Online Vectorized HD Map Construction
- Pretraining Helps When Capacity Allows: Evidence from Ultra-Small ConvNets
- Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone
- Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation
- Integrating Multi-scale and Multi-filtration Topological Features for Medical Image Classification
- MedPEFT-CL: Dual-Phase Parameter-Efficient Continual Learning with Medical Semantic Adapter and Bidirectional Memory Consolidation
- VideoSketcher: A Training-Free Approach for Coherent Video Sketch Transfer
- Test-Time Consistency in Vision Language Models
- AugMapNet: Improving Spatial Latent Structure via BEV Grid Augmentation for Enhanced Vectorized Online HD Map Construction
- Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting
- From Cognitive Priors to Instance Semantics: A Unified Framework for Multi-task Affective Computing
- Semantic Map Guided Bird's-Eye View Learning for Online HD Map Construction
- Crash2DocAI: Automated Integration of Post-Crash Car Part Images into Technical Reports
- Towards Unconstrained Cross-View Pose Estimation
- SCAdapter: Content-Style Disentanglement for Diffusion Style Transfer
- Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization
- SceneShine: Illumination-aware Human Scene Gaussian Re-Splatting from Mobile Device Video
- Photo Dating by Facial Age Aggregation
- MIST: Multilingual Incidental Dataset for Scene Text Detection
- NeuroBridge: Few-Shot Cross-Modal Neuron Re-identification via Dual-Channel Deep Metric Learning
- SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis
- T2LF: LLM-Guided Multimodal Diffusion for Text-to-Light Field Synthesis
- General and Domain-Specific Zero-shot Detection of Generated Images via Conditional Likelihood
- HOLO: Holistic Lightweight Optimization for Scene Understanding with Auto-Annotation and Multimodal Learning
- Q-Former Autoencoder: A Modern Framework for Medical Anomaly Detection
- Leveraging Semantic Attribute Binding for Free-Lunch Color Control in Diffusion Models
- Being Positive about Negative Queries: Exclusion Aware Multimodal Retrieval using Disentangled Representations
- SurgXBench: Explainable Vision-Language Model Benchmark for Surgery
- SuperRivolution: Fine-Scale Rivers from Coarse Temporal Satellite Imagery
- Patch Your Matcher: Correspondence-Aware Image-to-Image Translation Unlocks Cross-Modal Matching via Single-Modality Priors
- Deep Image Decomposition for Medical Imaging Anonymization and Curation
- SVD-Det: A Lightweight Framework for Video Forgery Detection Using Semantic and Visual Defect Cues
- Reciprocal Teaching: Dynamic Multi-Model Teacher-Student Learning for Multiple Noisy Annotations