MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement
This paper presents a generic method for generating full facial 3D animation
from speech. Existing approaches to audio-driven facial animation exhibit
uncanny or static upper face animation, fail to produce accurate and plausible
co-articulation, or rely on person-specific models that limit their scalability.
To improve upon existing models, we propose a generic audio-driven facial
animation approach that achieves highly realistic motion synthesis results for
the entire face. At the core of our approach is a categorical latent space for
facial animation that disentangles audio-correlated and audio-uncorrelated
information based on a novel cross-modality loss. Our approach ensures highly
accurate lip motion, while also synthesizing plausible animation of the parts
of the face that are uncorrelated to the audio signal, such as eye blinks and
eyebrow motion. We demonstrate that our approach outperforms several baselines
and obtains state-of-the-art quality both qualitatively and quantitatively. A
perceptual user study demonstrates that our approach is deemed more realistic
than the current state-of-the-art in over 75% of cases. We recommend watching
the supplemental video before reading the paper:
this https URL
Authors
Alexander Richard, Michael Zollhoefer, Yandong Wen, Fernando de la Torre, Yaser Sheikh
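To illustrate the idea of a cross-modality loss for disentangling audio-correlated from audio-uncorrelated facial motion, below is a minimal PyTorch sketch. All module names, tensor shapes, and the upper/lower-face vertex masks are illustrative assumptions, not the paper's implementation, and the sketch uses a simple continuous latent rather than the categorical latent space described in the abstract.

```python
# Hypothetical sketch of a cross-modality disentanglement loss (not the
# authors' code). Shapes, masks, and modules are illustrative assumptions.
import torch
import torch.nn as nn

V = 6172          # number of face-mesh vertices (illustrative)
D_AUDIO = 128     # per-frame audio feature size (illustrative)
D_LATENT = 64     # size of the fused latent (illustrative)

class FusionAutoencoder(nn.Module):
    """Toy stand-in: fuses an expression frame and an audio frame into a
    latent code and decodes it back to mesh vertex offsets."""
    def __init__(self):
        super().__init__()
        self.expr_enc = nn.Linear(V * 3, D_LATENT)
        self.audio_enc = nn.Linear(D_AUDIO, D_LATENT)
        self.dec = nn.Linear(D_LATENT, V * 3)

    def forward(self, expr, audio):
        # expr: (B, V, 3) vertex offsets, audio: (B, D_AUDIO) features
        z = self.expr_enc(expr.flatten(1)) + self.audio_enc(audio)
        return self.dec(z).view(-1, V, 3)

def cross_modality_loss(model, expr, audio, expr_other, audio_other,
                        upper_mask, lower_mask):
    """Two reconstructions of the same target frame:
    - keep the true audio but swap in another frame's expression, and require
      the lower face (lips/jaw) to still be correct -> audio must drive it;
    - keep the true expression but swap in another frame's audio, and require
      the upper face (eyes/brows) to still be correct -> expression drives it.
    """
    target = expr
    recon_audio_kept = model(expr_other, audio)    # audio is the only valid cue
    recon_expr_kept = model(expr, audio_other)     # expression is the only valid cue
    lower_err = ((recon_audio_kept - target) ** 2 * lower_mask).mean()
    upper_err = ((recon_expr_kept - target) ** 2 * upper_mask).mean()
    return lower_err + upper_err

# Usage on random data, just to show the shapes involved.
model = FusionAutoencoder()
B = 4
expr = torch.randn(B, V, 3)
audio = torch.randn(B, D_AUDIO)
expr_other = torch.randn(B, V, 3)       # expression from a different frame
audio_other = torch.randn(B, D_AUDIO)   # audio from a different frame
upper_mask = torch.zeros(1, V, 1)
upper_mask[:, : V // 2] = 1.0           # toy upper/lower split of the mesh
lower_mask = 1.0 - upper_mask
loss = cross_modality_loss(model, expr, audio, expr_other, audio_other,
                           upper_mask, lower_mask)
loss.backward()
```

The intent of such a loss is that lip and jaw motion can only be reconstructed from the audio stream, while eye blinks and brow motion can only be reconstructed from the expression stream, which is one way to realize the disentanglement the abstract describes.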