

Oral Session

Oral Session 1A: Generative Models I

Sun 8 Mar 10:15 a.m. PDT — 11:15 a.m. PDT

Sun 8 March 10:15 - 10:27 PDT

DreamAnywhere: Object-Centric Panoramic 3D Scene Generation

Edoardo Dominici ⋅ Jozef Hladký ⋅ Floor Verhoeven ⋅ Lukas Radl ⋅ Thomas Deixelberger ⋅ Stefan Ainetter ⋅ Philipp Drescher ⋅ Stefan Hauswiesner ⋅ Arno Coomans ⋅ Giacomo Nazzaro ⋅ Konstantinos Vardis ⋅ Markus Steinberger

Recent advances in text-to-3D scene generation have demonstrated significant potential to transform content creation across multiple industries. Although the research community has made impressive progress in addressing the challenges of this complex task, existing methods often generate environments that are only front-facing, lack visual fidelity, exhibit limited scene understanding, and are typically fine-tuned for either indoor or outdoor settings. In this work, we address these issues and propose DreamAnywhere, a modular system for the fast generation and prototyping of 3D scenes. Our system synthesizes a 360° panoramic image from text, decomposes it into background and objects, constructs a complete 3D representation through hybrid inpainting, and lifts object masks to detailed 3D objects that are placed in the virtual environment. DreamAnywhere supports immersive navigation and intuitive object-level editing, making it ideal for scene exploration, visual mock-ups, and rapid prototyping -- all with minimal manual modeling. These features make our system particularly suitable for low-budget movie production, enabling quick iteration on scene layout and visual tone without the overhead of traditional 3D workflows. Our modular pipeline is highly customizable as it allows components to be replaced independently. Compared to current state-of-the-art text and image-based 3D scene generation approaches, DreamAnywhere shows significant improvements in coherence in novel view synthesis and achieves competitive image quality, demonstrating its effectiveness across diverse and challenging scenarios. A comprehensive user study demonstrates a clear preference for our method over existing approaches, validating both its technical robustness and practical usefulness.

Sun 8 March 10:27 - 10:39 PDT

ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models

Sibo Dong ⋅ Ismail Shaheen ⋅ Maggie Shen ⋅ Rupayan Mallick ⋅ Sarah Bargal

Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key challenge is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past image-text pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose a multi-modal history adapter for text-to-image diffusion models, ViSTA. It consists of (1) a multi-modal history fusion module to extract relevant history features and (2) a history adapter to condition the generation on the extracted relevant features. We also introduce a salient history selection strategy during inference, where the most salient history text-image pair is selected, improving the quality of the conditioning. Furthermore, we propose to employ a Visual Question Answering-based metric, TIFA, to assess text-image alignment in visual storytelling, providing a more targeted and interpretable assessment of generated images. Evaluated on the StorySalon and FlintStonesSV datasets, our proposed ViSTA model is not only consistent across different frames, but also well-aligned with the narrative text descriptions.

Sun 8 March 10:39 - 10:51 PDT

Odo: Depth-Guided Diffusion for Identity-Preserving Body Reshaping

Siddharth Khandelwal ⋅ Sridhar Kamath ⋅ Arjun Jain

Human shape editing enables controllable transformation of a person's body shape, such as thin, muscular, or overweight, while preserving pose, identity, clothing, and background. Unlike human pose editing, which has advanced rapidly, shape editing remains relatively underexplored. Current approaches typically rely on 3D morphable models or image warping, often introducing unrealistic body proportions, texture distortions, and background inconsistencies due to alignment errors and deformations. A key limitation is the lack of large-scale, publicly available datasets for training and evaluating body shape manipulation methods. In this work, we introduce the first large-scale dataset of 18,573 images across 1,523 subjects, specifically designed for controlled human shape editing. It features diverse variations in body shape, including fat, muscular and thin, captured under consistent identity, clothing, and background conditions. Using this dataset, we propose Odo, an end-to-end diffusion-based method that enables realistic and intuitive body reshaping guided by simple semantic attributes. Our approach combines a frozen UNet that preserves fine-grained appearance and background details from the input image with a ControlNet that guides shape transformation using target SMPL depth maps. Extensive experiments demonstrate that our method outperforms prior approaches, achieving per-vertex reconstruction errors as low as 7.5 mm, significantly lower than the 13.6 mm observed in baseline methods, while producing realistic results that accurately match the desired target shapes.
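
For readers unfamiliar with the metric, a per-vertex reconstruction error is commonly computed as the mean Euclidean distance between corresponding vertices of a predicted and a target mesh. The abstract does not state the exact definition used, so the sketch below is an assumption; the function name `mean_per_vertex_error_mm` and the two-vertex toy meshes are hypothetical and purely illustrative.

```python
import math


def mean_per_vertex_error_mm(pred, target):
    """Mean Euclidean distance between corresponding vertices.

    Both inputs are sequences of (x, y, z) tuples in millimetres.
    This is a common definition, assumed here; it is not taken from the paper.
    """
    if not pred or len(pred) != len(target):
        raise ValueError("meshes must be non-empty and have the same vertex count")
    total = sum(math.dist(p, t) for p, t in zip(pred, target))
    return total / len(pred)


# Toy meshes (hypothetical): per-vertex distances are 5 mm and 12 mm.
pred = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
target = [(3.0, 4.0, 0.0), (10.0, 0.0, 12.0)]
error = mean_per_vertex_error_mm(pred, target)
print(f"mean per-vertex error: {error:.1f} mm")
```

Under this definition, a lower value means the reshaped body surface lies closer to the target shape on average.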

Sun 8 March 10:51 - 11:03 PDT

BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis

Seong-Eun Hong ⋅ SooBin Lim ⋅ JuYeong Hwang ⋅ Minwook Chang ⋅ Hyeongyeop Kang

Text-to-motion generation allows language-driven animation, yet current models struggle to deliver long-range coherence and fine-grained limb coordination. A competitive system must (i) preserve temporal consistency across hundreds of frames, (ii) synchronize limb motions, and (iii) align nuanced sentences with a spectrum of plausible trajectories. We introduce BiPO, the first part-based bidirectional autoregressive network trained with a lightweight Partial Occlusion regulariser. Each limb attends to both past and future frames for anticipatory coordination, while stochastic masking weakens spurious cross-part dependencies and encourages varied solutions. On HumanML3D and KIT-ML, BiPO lowers FID by 15–30% relative to MoMask and BAMM, secures the highest human-perceived realism scores, and sets new state-of-the-art results on motion-editing tasks requiring infill from partial sequences. These findings demonstrate that bidirectional reasoning coupled with Partial Occlusion yields a length-agnostic, high-fidelity framework for expressive, language-conditioned motion synthesis.

Recent advances in reinforcement learning (RL) have enabled effective reward-based finetuning of text-to-image diffusion models, improving their alignment with user preferences. However, existing RL methods typically optimize only the denoising UNet while relying on fixed generation strategies, limiting their flexibility and controllability. In this work, we propose ADOPT, an adaptive diffusion policy training framework that unifies the optimization of Classifier-Free Guidance (CFG) scaling and timestep embedding modulation within a single RL paradigm. Specifically, ADOPT learns a prompt-conditioned policy to adjust the CFG strength dynamically and to modulate timestep embeddings via learnable curve-based scaling, enhancing both semantic guidance and temporal understanding of the diffusion process. Extensive experiments demonstrate that ADOPT consistently improves semantic alignment, aesthetic quality, and human preference scores across diverse prompt datasets, while maintaining efficient inference cost. Our results highlight the potential of jointly optimizing adaptive control strategies to unlock greater flexibility and performance for reward-driven diffusion generation.
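
As background for the CFG scaling mentioned above, the standard Classifier-Free Guidance rule combines an unconditional and a conditional noise prediction using a guidance scale. The sketch below shows only that standard rule with the scale passed in per call; it does not reproduce ADOPT's learned policy or its timestep-embedding modulation, and the function name `cfg_combine` and the toy vectors are hypothetical.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: eps = eps_u + scale * (eps_c - eps_u).

    eps_uncond / eps_cond are flat lists of floats standing in for the
    model's noise predictions. scale = 0 gives the unconditional prediction,
    scale = 1 gives the conditional one, and scale > 1 extrapolates past it.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions (hypothetical values, not model outputs).
eps_u = [0.0, 0.0]
eps_c = [1.0, 2.0]

# A fixed strategy uses one scale everywhere; an adaptive strategy would
# supply a different scale per prompt or per step (values here are arbitrary).
for scale in (0.0, 1.0, 2.0):
    print(scale, cfg_combine(eps_u, eps_c, scale))
```

Because the scale is an ordinary argument, a learned policy could choose it dynamically without changing the combination rule itself.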