Generating Holistic 3D Human Motion from Speech
This work addresses the problem of generating 3D holistic body motions from human speech.
Given a speech recording, we synthesize sequences of 3D body poses, hand gestures, and facial expressions that are realistic and diverse.
To achieve this, we first build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
Specifically, we employ an autoencoder for face motions, and a compositional vector-quantized variational autoencoder (VQ-VAE) for the body and hand motions.
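To make the body-and-hand pipeline concrete, below is a minimal PyTorch sketch of the vector-quantization step at the core of a VQ-VAE; the layer names, codebook size, and loss weighting are illustrative assumptions, not the implementation used in this work.

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Maps continuous encoder features to the nearest entry of a learned codebook."""
    def __init__(self, num_codes=256, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment loss

    def forward(self, z):                      # z: (batch, time, code_dim)
        flat = z.reshape(-1, z.shape[-1])
        # Squared Euclidean distance from each frame feature to every code.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2.0 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        idx = dist.argmin(dim=1)               # one discrete token per frame
        z_q = self.codebook(idx).view_as(z)
        # Codebook and commitment losses; straight-through estimator for gradients.
        loss = ((z_q - z.detach()).pow(2).mean()
                + self.beta * (z - z_q.detach()).pow(2).mean())
        z_q = z + (z_q - z).detach()
        return z_q, idx.view(z.shape[:-1]), loss

A compositional variant would keep separate codebooks (for instance, one for the body and one for the hands) and quantize each part independently before the parts are combined again at decoding time.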
Additionally, we propose a cross-conditional autoregressive model that generates body poses and hand gestures, leading to coherent and realistic motions.
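As a rough illustration of cross-conditioning, the sketch below autoregressively predicts body and hand tokens, where each stream sees the audio features together with the past tokens of both streams; the GRU backbone, dimensions, and greedy decoding are assumptions made for brevity, not the authors' model.

import torch
import torch.nn as nn

class CrossConditionalPredictor(nn.Module):
    def __init__(self, num_codes=256, audio_dim=128, hidden=256):
        super().__init__()
        self.body_emb = nn.Embedding(num_codes, hidden)
        self.hand_emb = nn.Embedding(num_codes, hidden)
        # Each stream is conditioned on the audio plus BOTH past token streams.
        self.body_rnn = nn.GRU(audio_dim + 2 * hidden, hidden, batch_first=True)
        self.hand_rnn = nn.GRU(audio_dim + 2 * hidden, hidden, batch_first=True)
        self.body_head = nn.Linear(hidden, num_codes)
        self.hand_head = nn.Linear(hidden, num_codes)

    @torch.no_grad()
    def generate(self, audio):                 # audio: (batch, T, audio_dim)
        B, T, _ = audio.shape
        body = torch.zeros(B, 1, dtype=torch.long, device=audio.device)  # start tokens
        hand = torch.zeros(B, 1, dtype=torch.long, device=audio.device)
        hb = hh = None
        for t in range(T):
            ctx = torch.cat([audio[:, t:t + 1],
                             self.body_emb(body[:, -1:]),
                             self.hand_emb(hand[:, -1:])], dim=-1)
            ob, hb = self.body_rnn(ctx, hb)
            oh, hh = self.hand_rnn(ctx, hh)
            body = torch.cat([body, self.body_head(ob).argmax(-1)], dim=1)
            hand = torch.cat([hand, self.hand_head(oh).argmax(-1)], dim=1)
        return body[:, 1:], hand[:, 1:]        # discrete motion tokens per frame

The predicted token sequences would then be mapped back to body poses and hand gestures by the corresponding VQ-VAE decoders.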
Extensive experiments and user studies demonstrate that our proposed approach achieves state-of-the-art performance both qualitatively and quantitatively.
Authors
Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo Bolkart, Dacheng Tao, Michael J. Black
In this work, we focus on generating the conversational body motion, hand gestures, and facial expressions of a talking person from speech.
To do this, we must learn a cross-modal mapping between audio and 3D holistic body motion, which is challenging in practice for several reasons.
First, datasets of 3D holistic body meshes with synchronous speech recordings are scarce, and they are difficult to acquire because they require complex motion capture systems.
Second, real humans often vary in shape, and their faces and hands are highly deformable.
Third, as different body parts correlate differently with speech audio, it is difficult to generate realistic and diverse holistic body motions efficiently.
We address the above challenges and learn to model the conversational dynamics in a data-driven way, modeling the face, body, and hands with separate, dedicated models.
To overcome the issue of data scarcity, we present a new set of 3D holistic body mesh annotations with synchronous audio, obtained from in-the-wild videos.
These videos were previously used for 2D and 3D gesture modeling, but only with 2D body keypoint annotations or 3D keypoints of the holistic body.
Apart from facilitating speech and motion modeling, our dataset can also support broad research topics like realistic digital human rendering.
Conclusion
In this work, we propose the first approach to generate 3D holistic body meshes from speech.
For the body and hands, we enable diverse generation and coherent prediction with a compositional VQ-VAE and cross-conditional modeling, respectively.
We devise a simple and effective encoder-decoder for realistic face generation with accurate lip shape.
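As an illustration of what such an encoder-decoder could look like, here is a minimal temporal-convolutional sketch that regresses per-frame facial parameters (for example, expression coefficients and jaw pose) from audio features; the architecture and dimensions are assumptions for exposition, not the model proposed here.

import torch
import torch.nn as nn

class SpeechToFace(nn.Module):
    def __init__(self, audio_dim=128, hidden=256, face_dim=103):
        super().__init__()
        self.encoder = nn.Sequential(          # temporal context over the audio
            nn.Conv1d(audio_dim, hidden, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
        )
        self.decoder = nn.Sequential(          # per-frame face parameters
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, face_dim, kernel_size=1),
        )

    def forward(self, audio):                  # audio: (batch, T, audio_dim)
        x = audio.transpose(1, 2)              # Conv1d expects (batch, channels, T)
        return self.decoder(self.encoder(x)).transpose(1, 2)  # (batch, T, face_dim)

In practice, such a model would be trained with a reconstruction loss on the face parameters, possibly weighting the mouth region more heavily to encourage accurate lip shapes.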
The dataset annotations are obtained with an empirical approach designed to work reliably on in-the-wild videos.
Experimental results demonstrate that our proposed approach achieves state-of-the-art performance both qualitatively and quantitatively.