Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space
Abstract
We present a novel approach for generating realistic talking faces by synthesizing a person's voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt (driving) text, the driving image, and the individual's voice profile, then fuses these encodings in a multi-entangled latent space that produces the key-value pairs and queries for the audio and video generation pipelines. The multi-entangled latent space captures spatiotemporal, person-specific correspondences between the modalities. The entangled features are then passed to each modality's decoder to generate the output audio and video. Experiments and analysis with standard metrics demonstrate the effectiveness of our model. We also construct a domain-specific dataset for this task. All model checkpoints, code, and the proposed dataset can be found at https://github.com/narratingForYou/NarratingForYou.
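To make the key-value/query mechanism concrete, the following is a minimal sketch of cross-modal attention in the style the abstract describes: a decoder's queries attend over key-value pairs drawn from a shared (entangled) latent. All names, dimensions, and the fusion scheme here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: queries from one modality's decoder
    attend over key-value pairs taken from a shared latent space."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (Tq, Tk) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ values                         # (Tq, d) context vectors

rng = np.random.default_rng(0)
d_model = 64
# Hypothetical entangled latent: stand-in for fused text/image/voice features.
entangled_latent = rng.standard_normal((30, d_model))
# Hypothetical audio-decoder queries attending over the shared latent.
audio_queries = rng.standard_normal((10, d_model))
audio_context = cross_attention(audio_queries, entangled_latent, entangled_latent)
print(audio_context.shape)  # (10, 64)
```

In this toy setup the same `entangled_latent` would serve both the audio and the video decoder, which is one plausible reading of how a shared latent space couples the two generation pipelines.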