GenHSI: Controllable Generation of Human-Scene Interaction Videos
Abstract
Large-scale pre-trained video diffusion models have exhibited remarkable capabilities in diverse video generation. However, existing solutions face several challenges in generating long videos with rich human-scene interactions (HSI), including unrealistic dynamics and affordances, a lack of subject identity preservation, and the need for expensive training. To this end, we propose GenHSI, a training-free method for controllable generation of long HSI videos with 3D awareness. Taking inspiration from movie animation, we subdivide video synthesis into three stages: (1) script writing, (2) pre-visualization, and (3) animation. Given an image of a scene, an image of a character, and a user description, these three stages produce long videos that preserve human identity and exhibit rich, plausible HSI. Script writing converts a complex text prompt involving a chain of HSI into simple atomic actions, which the pre-visualization stage uses to generate 3D keyframes. To synthesize plausible human interaction poses in these 3D keyframes, we use a pre-trained 2D inpainting diffusion model to generate plausible 2D human interactions under view canonicalization, which eliminates the multi-view fitting required by previous works. We then lift these interactions to 3D with a robust iterative optimization informed by contact cues and reasoning from vision-language models (VLMs). Conditioned on these 3D keyframes, pre-trained video diffusion models can generate consistent long videos with plausible dynamics and affordances in a 3D-aware manner. To our knowledge, we are the first to synthesize a long video sequence containing a chain of HSI actions, without training, from image references of the scene and character. Experiments demonstrate that our method generates HSI videos that preserve scene content and character identity with plausible human-scene interactions from a single scene image.