We introduce a state-of-the-art method for editable dance generation that creates realistic, physically plausible dances while remaining faithful to the input music.
We introduce a new metric for physical plausibility and extensively evaluate the quality of dances generated by our method through (1) multiple quantitative benchmarks on physical plausibility, beat alignment, and diversity, and, more importantly, (2) a large-scale user study demonstrating a significant improvement over previous state-of-the-art methods.
Creating new dances or dance animations is uniquely difficult because dance movements are expressive and freeform, yet precisely structured by music.
In practice, this requires tedious hand animation or motion capture solutions, which can be expensive and impractical.
On the other hand, using computational methods to generate dances automatically can alleviate the burden of the creation process and enable many applications: such methods can help animators create new dances, or provide interactive characters in video games or virtual reality with realistic and varied movements driven by user-provided music.
In this work, we propose a state-of-the-art method for dance generation that creates realistic, physically plausible dance motions based on input music.
Our method uses a transformer-based diffusion model paired with a strong music feature extractor.
This unique diffusion-based approach confers powerful editing capabilities well-suited to dance, including joint-wise conditioning and in-betweening.
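A common way to realize such editing with a diffusion model is to overwrite the user-constrained joints or frames with appropriately noised reference values at every denoising step, so the model only fills in the unconstrained portion. The sketch below illustrates this inpainting-style procedure; the denoiser.p_sample interface and the variable names are assumptions for illustration, not the exact implementation of our sampler.

```python
import torch

def edit_sample(denoiser, x_T, music_feats, known_motion, known_mask, alphas_cumprod):
    """Illustrative constrained sampling loop (inpainting-style editing).

    known_motion: reference motion values to keep fixed, shape [B, T, D]
    known_mask:   1 where values are user-constrained (e.g. selected joints,
                  or the first/last frames for in-betweening), 0 elsewhere
    """
    x = x_T
    for t in reversed(range(len(alphas_cumprod))):
        # Replace constrained entries with a noised copy of the reference motion
        # at the current noise level, so the denoiser fills in only the
        # unconstrained joints/frames consistently with the music conditioning.
        a_bar = alphas_cumprod[t]
        noised_ref = a_bar.sqrt() * known_motion + (1 - a_bar).sqrt() * torch.randn_like(known_motion)
        x = known_mask * noised_ref + (1 - known_mask) * x
        x = denoiser.p_sample(x, t, cond=music_feats)  # one reverse-diffusion step (assumed interface)
    # Final hard overwrite so the user constraints are exactly satisfied.
    return known_mask * known_motion + (1 - known_mask) * x
```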
We improve on previous hand-crafted audio feature extraction strategies by leveraging music audio representations from a pre-trained generative music model that has previously demonstrated strong performance on music-specific prediction tasks.
In addition to the advantages conferred by these modeling choices, we observe flaws in previous evaluation metrics and propose a new metric that captures the physical accuracy of ground contact behavior without explicit physical modeling.
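As a rough illustration of how such learned representations can condition the motion model, the sketch below maps raw audio to one embedding per motion frame; the music_encoder interface and the simple interpolation-based alignment are hypothetical stand-ins rather than the exact pipeline we use.

```python
import torch
import torchaudio

def extract_music_features(wav_path, music_encoder, motion_fps=30):
    """Map raw audio to one conditioning vector per motion frame.

    `music_encoder` stands in for a pre-trained generative music model
    (hypothetical interface); it is assumed to return a sequence of
    embeddings [T_audio, C] at some internal rate, which we resample to
    the motion frame rate before feeding them to the diffusion model.
    """
    audio, sr = torchaudio.load(wav_path)
    audio = audio.mean(dim=0, keepdim=True)          # mix down to mono
    with torch.no_grad():
        feats = music_encoder(audio, sr)             # [T_audio, C] (assumed)
    # Linearly interpolate embeddings to one vector per motion frame.
    n_frames = int(audio.shape[-1] / sr * motion_fps)
    feats = torch.nn.functional.interpolate(
        feats.T.unsqueeze(0), size=n_frames, mode="linear", align_corners=False
    ).squeeze(0).T                                   # [n_frames, C]
    return feats
```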
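To convey the intuition behind such a metric, the sketch below penalizes horizontal center-of-mass acceleration in frames where neither foot is planted on the ground; the joint indices, velocity threshold, and center-of-mass proxy are illustrative assumptions, and the sketch is not the exact metric we propose.

```python
import numpy as np

def contact_consistency_score(joints, fps=30, foot_idx=(7, 8), vel_thresh=0.05):
    """Illustrative contact-plausibility score (not the exact proposed metric).

    joints: [T, J, 3] global joint positions in meters, assuming y is up.
    Intuition: horizontal acceleration of the body should be explainable by
    a foot that is planted (near-zero velocity) on the ground; acceleration
    without any static foot contact is physically implausible and penalized.
    """
    com = joints.mean(axis=1)                          # crude center-of-mass proxy [T, 3]
    com_acc = np.diff(com, n=2, axis=0) * fps ** 2     # finite-difference acceleration [T-2, 3]
    foot_vel = np.linalg.norm(
        np.diff(joints[:, foot_idx, :], axis=0) * fps, axis=-1
    )                                                  # foot speeds [T-1, 2]
    static = (foot_vel < vel_thresh).any(axis=-1)[1:]  # is any foot planted? [T-2]
    horiz_acc = np.linalg.norm(com_acc[:, [0, 2]], axis=-1)  # x/z components (y is vertical)
    # Penalize horizontal acceleration only in frames with no planted foot.
    penalty = horiz_acc * (~static)
    return float(penalty.mean())
```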
Results
We propose a diffusion-based model that generates realistic and long-form dance sequences conditioned on music.
Our method is able to create arbitrarily long dance sequences by chaining together locally consistent shorter clips.
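One simple way to realize such chaining, assuming a constrained sampler like the editing sketch above, is to pin the first frames of each new clip to the last frames of the previously generated clip so that consecutive clips agree on their shared boundary; the sketch below is illustrative rather than our exact procedure.

```python
import torch

def generate_long_dance(constrained_sample, music_segments, clip_len, motion_dim, overlap):
    """Illustrative chaining of fixed-length clips into one long dance.

    `constrained_sample(x_T, cond, known, mask)` stands for a constrained
    diffusion sampler (e.g. the editing sketch above). Each clip's first
    `overlap` frames are pinned to the last `overlap` frames of the
    previous clip, which keeps consecutive clips locally consistent.
    """
    full, prev_tail = [], None
    for seg_feats in music_segments:                 # per-clip music conditioning
        x_T = torch.randn(1, clip_len, motion_dim)   # start each clip from Gaussian noise
        known = torch.zeros_like(x_T)
        mask = torch.zeros_like(x_T)
        if prev_tail is not None:
            known[:, :overlap] = prev_tail           # pin the seam frames
            mask[:, :overlap] = 1.0
        clip = constrained_sample(x_T, seg_feats, known, mask)
        # Keep only the frames beyond the overlap after the first clip.
        full.append(clip if prev_tail is None else clip[:, overlap:])
        prev_tail = clip[:, -overlap:]
    return torch.cat(full, dim=1)
```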
Moreover, we demonstrate that our model admits powerful editing capabilities, allowing users to freely specify both temporal and joint-wise constraints.
We evaluate our model on multiple automated metrics and a large user study, and find that it achieves state-of-the-art results on the AIST++ dataset and generalizes well to in-the-wild music inputs.