Track: Oral Session 9A: Generative Models III

Tue 10 March 14:45 - 14:57 PDT

SCAdapter: Content-Style Disentanglement for Diffusion Style Transfer

Luan Thanh Trinh

Diffusion models have emerged as the leading approach for style transfer, yet they struggle with photo-realistic transfers, often producing painting-like results or missing detailed stylistic elements. Current methods inadequately address the unwanted influence of the content image's original style and of the style reference's content features. We introduce SCAdapter, a novel technique leveraging CLIP image space to effectively separate and integrate content and style features. Our key innovation systematically extracts pure content from content images and style elements from style references, ensuring authentic transfers. This approach is enhanced through three components: Controllable Style Adaptive Instance Normalization (CSAdaIN) for precise multi-style blending, KVS Injection for targeted style integration, and a style transfer consistency objective maintaining process coherence. Comprehensive experiments demonstrate that SCAdapter significantly outperforms state-of-the-art methods, including both conventional and diffusion-based baselines. By eliminating DDIM inversion and inference-stage optimization, our method achieves at least 2x faster inference than other diffusion-based approaches, making it both more effective and efficient for practical applications.
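
For readers unfamiliar with the normalization that CSAdaIN builds on, the following is a minimal sketch of standard Adaptive Instance Normalization with a weighted blend of several styles' statistics. It is not the paper's CSAdaIN, whose details the abstract does not give. The function names, the flat-list feature layout, and the weighting scheme are assumptions made for illustration only.

```python
from statistics import fmean, pstdev


def channel_stats(channel, eps=1e-5):
    """Return (mean, std) of one feature channel given as a flat list of floats."""
    # eps keeps the std strictly positive so it is safe to divide by.
    return fmean(channel), pstdev(channel) + eps


def adain_blend(content, styles, weights, eps=1e-5):
    """Standard AdaIN with a weighted blend of style statistics (illustrative).

    content: list of channels, each a flat list of floats.
    styles:  list of style feature maps, each with the same channel count.
    weights: one non-negative weight per style; normalized to sum to 1 here.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weights = [w / total for w in weights]

    out = []
    for c, channel in enumerate(content):
        mu_c, sigma_c = channel_stats(channel, eps)
        # Blend the per-style channel statistics (assumed scheme, not from the paper).
        stats = [channel_stats(style[c], eps) for style in styles]
        target_mu = sum(w * mu for w, (mu, _) in zip(weights, stats))
        target_sigma = sum(w * sigma for w, (_, sigma) in zip(weights, stats))
        # Normalize the content channel, then re-scale and shift to the blended stats.
        out.append([target_sigma * (v - mu_c) / sigma_c + target_mu for v in channel])
    return out


# Toy usage: one channel, two styles blended with equal weight.
content = [[0.0, 2.0]]
style_a = [[10.0, 14.0]]
style_b = [[20.0, 24.0]]
blended = adain_blend(content, [style_a, style_b], weights=[1.0, 1.0])
print(blended)
```

Because AdaIN is affine in the style statistics, blending the statistics with weights that sum to 1 gives the same result as blending the per-style AdaIN outputs with those weights.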

Tue 10 March 14:57 - 15:09 PDT

T2LF: LLM-Guided Multimodal Diffusion for Text-to-Light Field Synthesis

Soyoung Yoon ⋅ Namhyuk Ahn ⋅ In Kyu Park

We present a novel text-driven approach for light field (LF) synthesis. Existing methods typically generate LFs from given images, requiring users to find reference images, which makes it difficult to construct the desired scene directly and limits scene diversity. Moreover, existing methods are mainly designed for limited baselines from training datasets, making it difficult to implement various viewpoint changes and consequently limiting the flexibility of motion. In contrast, our method directly synthesizes LFs from user-provided text descriptions by leveraging the scene understanding capabilities of a multimodal large language model (LLM) and the generative power of a diffusion model. Given a text prompt describing the desired LF, the multimodal LLM extracts relevant information for LF synthesis, which then guides a diffusion model to produce diverse scenes and motions. This approach enables LF synthesis even with a pre-trained model not initially designed for this purpose, requiring only minimal fine-tuning. The proposed framework enables visually diverse LF synthesis with only text input. Experimental results demonstrate that the synthesized LFs exhibit geometric consistency and achieve improved synthesis quality compared to existing methods.

Tue 10 March 15:09 - 15:21 PDT

VideoSketcher: A Training-Free Approach for Coherent Video Sketch Transfer

Huining Li ⋅ Bangzhen Liu ⋅ Rui Yang ⋅ Yang Zhou ⋅ Chenshu Xu ⋅ Xufang PANG ⋅ Shengfeng He

Generating high-quality sketches from video requires a nuanced understanding of semantic content and visual structure, particularly for complex scenes across diverse sketch styles. Efficient and flexible video-to-sketch style transformation remains a significant challenge. We introduce VideoSketcher, a training-free framework for style-controllable sketch video generation that preserves frame structure while applying specified sketch aesthetics. Leveraging text-to-image diffusion models, VideoSketcher utilizes strong semantic priors without the need for extensive training. Our approach enforces temporal consistency by retaining latent information across frames and employs a Time-Linked Attention mechanism to capture structural elements from the source video and inject stylistic information from the reference image. To bridge the semantic gap between sketches and original video content, we introduce Sketch Directive Amplification for selective transfer of stylistic features. Additionally, a Stroke Graph Regularization strategy, comprising line and point loss, refines line consistency in the latent space. Extensive experiments validate VideoSketcher's superior temporal stability and fidelity across diverse sketch styles and content. Video demos can be found in the supplementary materials.

Tue 10 March 15:21 - 15:33 PDT

Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

Yan-Bo Lin ⋅ Kevin Lin ⋅ Zhengyuan Yang ⋅ Linjie Li ⋅ Jianfeng Wang ⋅ Chung-Ching Lin ⋅ Xiaofei Wang ⋅ Gedas Bertasius ⋅ Lijuan Wang

In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AVED-Bench, designed explicitly for zero-shot audio-video editing. AVED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AVED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AVED achieves superior results on both AVED-Bench and the recent OAVE dataset, validating its generalization capabilities.

Tue 10 March 15:33 - 15:45 PDT

SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis

Hou In Ivan Tam ⋅ Hou In Derek Pun ⋅ Austin Wang ⋅ Angel Chang ⋅ Manolis Savva

Despite recent advances in text-conditioned 3D indoor scene generation, there remain gaps in the evaluation of these methods. Existing metrics often measure realism by comparing generated scenes to a set of ground-truth scenes, but they overlook how well scenes follow the input text and capture implicit expectations of plausibility. We present SceneEval, an evaluation framework designed to address these limitations. SceneEval introduces fine-grained metrics for explicit user requirements—including object counts, attributes, and spatial relationships—and complementary metrics for implicit expectations such as support, collisions, and navigability. Together, these provide interpretable and comprehensive assessments of scene quality. To ground evaluation, we curate SceneEval-500, a benchmark of 500 text descriptions with detailed annotations of expected scene properties. This dataset establishes a common reference for reproducible and systematic comparison across scene generation methods. We evaluate six recent scene generation approaches using SceneEval and demonstrate its ability to provide detailed assessments of the generated scenes, highlighting strengths and areas for improvement across multiple dimensions. Our results identify significant gaps in current methods, underscoring the need for further research toward practical and controllable scene synthesis.
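
To make the distinction between explicit and implicit checks concrete, here is a minimal sketch of one check of each kind: an object-count check against an annotation, and a collision check between objects. It is an illustration under simplifying assumptions, not SceneEval's implementation. The scene representation (dicts with `name`, `category`, `min`, `max`), the function names, and the use of axis-aligned bounding boxes are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_satisfied(scene_objects, expected_counts):
    """Explicit check: map each expected category to True if the count matches exactly."""
    actual = Counter(obj["category"] for obj in scene_objects)
    return {cat: actual.get(cat, 0) == n for cat, n in expected_counts.items()}


def boxes_overlap(a, b):
    """True if two axis-aligned boxes overlap with positive volume on all three axes.

    Boxes that merely touch (shared face or edge) are not counted as overlapping.
    """
    return all(
        a["min"][i] < b["max"][i] and b["min"][i] < a["max"][i] for i in range(3)
    )


def colliding_pairs(scene_objects):
    """Implicit check: list the name pairs whose bounding boxes interpenetrate."""
    return [
        (p["name"], q["name"])
        for p, q in combinations(scene_objects, 2)
        if boxes_overlap(p, q)
    ]


# Toy scene (hypothetical data): the nightstand touches the bed, the chair intersects it.
scene = [
    {"name": "bed", "category": "bed", "min": (0.0, 0.0, 0.0), "max": (2.0, 0.5, 1.5)},
    {"name": "nightstand", "category": "nightstand", "min": (2.0, 0.0, 0.0), "max": (2.5, 0.5, 0.5)},
    {"name": "chair", "category": "chair", "min": (1.5, 0.0, 1.0), "max": (2.2, 0.9, 1.6)},
]

print(count_satisfied(scene, {"bed": 1, "chair": 2}))
print(colliding_pairs(scene))
```

Bounding boxes are a coarse stand-in for real geometry; a mesh-based test would be needed to avoid false positives for objects such as a chair tucked under a table.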