In this paper, we focus on audio-driven co-speech gesture video generation. Given a speaker image and the corresponding speech audio, we generate aligned speaker image sequences.
Co-speech gesture is crucial for human-machine interaction and digital entertainment.
While previous works mostly map speech audio to human skeletons (e.g., 2D keypoints), directly generating speakers' gestures in the image domain remains unsolved.
To this end, we propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech motion patterns as well as fine-grained rhythmic movements.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (i.e., 2D skeletons).
Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common motion patterns from the implicit motion representation into codebooks, and 2) we devise a co-speech gesture GPT with motion refinement (Co-Speech GPT) to complement the subtle rhythmic motion details.
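As a rough illustration of the codebook idea, the snippet below sketches a VQ-style nearest-neighbour lookup that snaps per-frame motion features onto a finite set of learned motion patterns. It is a minimal sketch in PyTorch; the function name, tensor shapes, and codebook size are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal VQ-style codebook lookup: each frame's motion feature is replaced by
# its nearest codebook entry, yielding a discrete index sequence of motion
# patterns. Shapes and names are illustrative assumptions.
import torch

def quantize_motion(motion_feats: torch.Tensor, codebook: torch.Tensor):
    """motion_feats: (T, D) per-frame implicit motion features.
    codebook:       (K, D) learned motion-pattern embeddings.
    Returns the quantized features and their discrete code indices."""
    dists = torch.cdist(motion_feats, codebook)  # (T, K) pairwise L2 distances
    indices = dists.argmin(dim=-1)               # (T,)  discrete pattern codes
    quantized = codebook[indices]                # (T, D) snapped features
    return quantized, indices

# Example: 32 frames of 256-d motion features against 512 hypothetical patterns.
feats = torch.randn(32, 256)
codes = torch.randn(512, 256)
quantized, indices = quantize_motion(feats, codes)
```

The discrete indices are what make common gesture patterns reusable: a sequence model can predict codes from audio instead of regressing raw high-dimensional motion.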
Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture videos.
Authors
Xian Liu, Qianyi Wu, Hang Zhou, Yuanqi Du, Wayne Wu, Dahua Lin, Ziwei Liu
Humans naturally emit co-speech gestures to complement the verbal channel and express their thoughts.
Such non-verbal behaviors ease speech comprehension and bridge the gap between communicators for better credibility.
Therefore, equipping social robots with such conversational skills constitutes a crucial step toward human-machine interaction.
To this end, researchers delve into the task of co-speech gesture generation, where audio-coherent human gesture sequences are synthesized in the form of structural human representations (i.e., skeletons).
However, such representation contains no appearance information of the target speaker, which is crucial for human perception.
While current studies effectively learn the mapping from encoded audio features to human skeletons in a data-driven manner, we pinpoint two important observations: 1) hand-crafted structural human priors like 2D/3D skeletons eliminate information about articulated human body regions, and 2) such zeroth-order motion representations fail to formulate first-order motion such as the local affine transformations used in image animation.
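To make the zeroth- vs. first-order distinction concrete, the sketch below warps coordinates around a single keypoint with a local 2x2 affine (Jacobian) term, in the spirit of first-order image animation; keypoints alone would only capture the translation offset. The function name, shapes, and values are hypothetical, not taken from any specific implementation.

```python
# Hedged sketch: zeroth-order motion moves points by a keypoint offset only,
# while a first-order (local affine) term also captures rotation/scale/shear
# around the keypoint. Names, shapes, and values are illustrative assumptions.
import torch

def local_affine_warp(coords: torch.Tensor, kp_src: torch.Tensor,
                      kp_drv: torch.Tensor, jacobian: torch.Tensor) -> torch.Tensor:
    """coords:        (N, 2) coordinates in the driving frame near the keypoint.
    kp_src / kp_drv:  (2,) matching keypoint in the source / driving frame.
    jacobian:         (2, 2) local affine term describing first-order motion."""
    return kp_src + (coords - kp_drv) @ jacobian.T

# With an identity Jacobian this reduces to a pure keypoint offset (zeroth order).
coords = torch.rand(1024, 2)
warped = local_affine_warp(coords,
                           kp_src=torch.tensor([0.10, 0.20]),
                           kp_drv=torch.tensor([0.15, 0.25]),
                           jacobian=torch.eye(2))
```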
Accordingly, we explore the problem of audio-driven co-speech gesture video generation, i.e., using a framework to generate speaker image sequences driven by speech audio.