MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
We propose the first joint audio-video generation framework that delivers engaging watching and listening experiences simultaneously, towards high-quality realistic videos.
Two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian noise.
To ensure semantic consistency across modalities, we propose a novel random-shift based attention block bridging the two subnets, which enables efficient cross-modal alignment and thus reinforces audio-video fidelity for each other (see the sketch after this paragraph).
Extensive experiments show superior results in unconditional audio-video pair generation, and zero-shot conditional tasks (e.g., video-to-audio).
In particular, we achieve the best FVD and FAD scores on the Landscape and AIST++ dancing datasets.
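As a rough illustration of the random-shift based attention described above, the following PyTorch sketch lets each video frame attend to a small, randomly shifted window of audio tokens instead of the full audio sequence. The module name, tensor shapes, and window size are illustrative assumptions, not the released implementation.

```python
# A minimal sketch of random-shift cross-modal attention, assuming video features
# of shape (B, F, Dv) (one token per frame) and audio features of shape (B, T, Da).
# Names and hyperparameters here are illustrative, not the authors' code.
import torch
import torch.nn as nn


class RandomShiftCrossAttention(nn.Module):
    def __init__(self, dim_video: int, dim_audio: int, dim: int = 256, window: int = 4):
        super().__init__()
        self.window = window                      # size of the attended audio window per video frame
        self.to_q = nn.Linear(dim_video, dim)     # queries from video tokens
        self.to_k = nn.Linear(dim_audio, dim)     # keys from audio tokens
        self.to_v = nn.Linear(dim_audio, dim)     # values from audio tokens
        self.proj = nn.Linear(dim, dim_video)

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        B, F, _ = video.shape
        _, T, _ = audio.shape
        q = self.to_q(video)                      # (B, F, dim)
        k = self.to_k(audio)                      # (B, T, dim)
        v = self.to_v(audio)                      # (B, T, dim)

        # Random shift: each video frame attends only to a small audio window whose
        # start is offset by a random shift, instead of attending to all T tokens.
        shift = torch.randint(0, T, (1,)).item()
        starts = (torch.arange(F) * T // F + shift) % T          # (F,)
        idx = (starts[:, None] + torch.arange(self.window)) % T  # (F, window)

        k_win = k[:, idx]                         # (B, F, window, dim)
        v_win = v[:, idx]                         # (B, F, window, dim)
        attn = torch.einsum("bfd,bfwd->bfw", q, k_win) / q.shape[-1] ** 0.5
        out = torch.einsum("bfw,bfwd->bfd", attn.softmax(dim=-1), v_win)
        return video + self.proj(out)             # residual connection back into the video subnet
```

Restricting attention to a shifted window keeps the cross-modal cost linear in the number of frames, while the random offset exposes each frame to varying audio context across layers and denoising steps.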
Authors
Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo
We propose the first Multi-Modal Diffusion model (MM-Diffusion), consisting of two coupled denoising autoencoders for joint audio-video generation in the open domain.
Such a design enables the joint distribution over both modalities to be learned, which greatly reduces temporal redundancy in video and audio and facilitates efficient cross-modal interactions.
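To make the coupled denoising idea concrete, below is a minimal DDPM-style sampling sketch in which a single coupled network predicts noise for both modalities at every reverse step. The function `mm_unet`, the noise schedule, and the tensor shapes are assumptions for illustration, not MM-Diffusion's actual sampling code.

```python
# A minimal sketch of joint reverse diffusion over two modalities, assuming a coupled
# model `mm_unet(video_t, audio_t, t)` that returns noise estimates for both streams.
# Shapes, schedule, and the model itself are illustrative assumptions.
import torch


@torch.no_grad()
def sample_joint(mm_unet, video_shape, audio_shape, betas, device="cpu"):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start both modalities from independent Gaussian noise.
    video = torch.randn(video_shape, device=device)
    audio = torch.randn(audio_shape, device=device)

    for t in reversed(range(len(betas))):
        t_batch = torch.full((video_shape[0],), t, device=device, dtype=torch.long)
        # One coupled network sees both noisy modalities and predicts noise for each,
        # so the two streams are denoised under a joint distribution.
        eps_v, eps_a = mm_unet(video, audio, t_batch)

        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        video = (video - coef * eps_v) / alphas[t].sqrt()
        audio = (audio - coef * eps_a) / alphas[t].sqrt()
        if t > 0:  # add noise on all but the final step
            video = video + betas[t].sqrt() * torch.randn_like(video)
            audio = audio + betas[t].sqrt() * torch.randn_like(audio)
    return video, audio
```

With a trained coupled network, this would be called with, e.g., a linear beta schedule and shapes such as `video_shape=(1, 16, 3, 64, 64)` and a matching audio waveform shape; these values are placeholders rather than the paper's settings.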
Result
In this paper, we propose MM-Diffusion, a novel multi-modal diffusion model for joint audio and video generation.
Our work pushes content generation based on single-modality diffusion models one step forward: the proposed multi-modal diffusion model can generate realistic audio and video in a joint manner.
Superior performance is achieved on widely-used audio-video benchmarks under both objective evaluations and Turing tests, which can be attributed to the new formulation for multi-modal diffusion and the designed coupled U-Net.