GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio-Driven Gaussian Splatting
Abstract
Speech-driven talking heads have recently emerged, enabling interactive avatars. However, real-world applications remain limited, as current methods are either accurate but slow or fast yet temporally unstable. Diffusion methods provide realistic image generation yet struggle with one-shot generation, while Gaussian Splatting approaches are real-time but suffer from inaccuracies in tracking or in the mapping of Gaussians, leading to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this by mapping Gaussian Splatting via a 3D Morphable Model (3DMM) to generate person-specific avatars, and we introduce transformer-based prediction of 3DMM parameters directly from audio to drive temporal consistency. From monocular video and an independent speech input signal, we generate real-time, lip-synced talking-head videos, and we report competitive quantitative and qualitative video-generation performance.
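To illustrate the audio-to-parameter stage described above, the following is a minimal sketch (not the paper's implementation): a hypothetical transformer encoder that maps per-frame audio features to per-frame 3DMM parameters, which would in turn drive the Gaussian-Splatting avatar. All module names, feature dimensions, and the choice of mel-spectrogram inputs are illustrative assumptions.

```python
# Minimal sketch, assuming mel-spectrogram audio features and a generic
# transformer encoder; the actual architecture and dimensions are not
# specified in the abstract.
import torch
import torch.nn as nn

class AudioTo3DMM(nn.Module):
    def __init__(self, audio_dim=80, model_dim=256, n_params=64,
                 n_layers=4, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(audio_dim, model_dim)   # audio features -> tokens
        enc_layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.out_proj = nn.Linear(model_dim, n_params)   # tokens -> 3DMM params

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim)
        x = self.in_proj(audio_feats)
        x = self.encoder(x)            # temporal context across frames
        return self.out_proj(x)        # (batch, frames, n_params)

# Usage: one clip of 100 audio frames -> per-frame 3DMM parameters.
model = AudioTo3DMM()
params = model(torch.randn(1, 100, 80))
print(params.shape)  # torch.Size([1, 100, 64])
```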