We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle diverse video synthesis tasks with a single model.
We introduce a 3-dimensional tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning.
Extensive experiments are conducted to demonstrate the quality, efficiency, and flexibility of MAGVIT.
Our experiments show that (i) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models, and (ii) a single MAGVIT model supports ten diverse video generation tasks and generalizes across videos from different visual domains.
Authors
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
Inspired by the recent success of generative image transformers such as DALL·E and related approaches, we propose an efficient and effective multi-task video generation model by leveraging masked token modeling and multi-task learning.
Specifically, we design a 3D quantization model that tokenizes a video with high fidelity into a low-dimensional spatial-temporal manifold, and we propose an embedding method that models a video condition using a multivariate mask, showing its efficacy in training.
We build and train a single model to perform a variety of video generation tasks and demonstrate the model's efficiency, effectiveness, and flexibility against state-of-the-art approaches.
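To make the training objective concrete, below is a minimal sketch of one masked-token training step in PyTorch. The `transformer` callable, the `mask_id` placeholder, and the cosine masking schedule are illustrative assumptions for exposition, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def masked_modeling_step(transformer, tokens, mask_id):
    """One masked video token modeling step (illustrative sketch)."""
    # tokens: (batch, seq_len) integer visual tokens from the 3D tokenizer
    batch, seq_len = tokens.shape
    # Sample a per-example masking ratio; a cosine schedule is common in
    # masked generative transformers (an assumption here, not quoted from
    # the paper).
    ratio = torch.cos(0.5 * torch.pi * torch.rand(batch, 1, device=tokens.device))
    mask = torch.rand(batch, seq_len, device=tokens.device) < ratio  # True = masked
    inputs = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    logits = transformer(inputs)  # (batch, seq_len, vocab_size)
    # Reconstruct only the masked positions with cross-entropy.
    return F.cross_entropy(logits[mask], tokens[mask])
```

At inference, generation starts from a masked (or partially conditioned) token sequence and is refined over a small number of parallel decoding steps, which underlies the inference-time speedups over autoregressive decoding reported above.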
Results
We propose a 3D vector-quantized (3D-VQ) model for video generation.
The proposed model quantizes a video into 4 × 16 × 16 visual tokens, where the visual codebook size is 1024. The token sequence includes 1 task prompt token, 1 class token, and 1024 visual tokens.
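As a quick sanity check on these numbers, the snippet below reproduces the sequence length implied by this layout (variable names are ours, not from the MAGVIT codebase):

```python
# Back-of-the-envelope check of the token sequence layout described above.
latent_shape = (4, 16, 16)   # temporal x height x width token grid
codebook_size = 1024         # entries in the visual codebook

num_visual = latent_shape[0] * latent_shape[1] * latent_shape[2]  # 4*16*16 = 1024
seq_len = 1 + 1 + num_visual  # task prompt + class token + visual tokens
print(num_visual, seq_len)    # 1024 1026
```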
We train the model to generate 16-frame videos on three standard benchmarks: class-conditional generation on UCF-101 and frame prediction on BAIR Robot Pushing and Kinetics-600.
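Different tasks amount to different condition masks over the token grid. The sketch below illustrates this idea; the task names and mask layout are assumptions for exposition, not the paper's exact interface.

```python
import torch

def make_condition_mask(task, latent_shape=(4, 16, 16)):
    """Illustrative mapping from a task to a condition mask (hypothetical)."""
    t, h, w = latent_shape
    mask = torch.ones(t, h, w, dtype=torch.bool)  # True = position to generate
    if task == "frame_prediction":
        mask[0] = False  # keep the first latent frame's tokens as the condition
    # "class_conditional": all positions are generated; the class label enters
    # through the dedicated class token in the sequence instead.
    return mask.flatten()  # align with the flattened visual-token order
```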
For multi-task video generation, we quantitatively evaluate the model on settings with 8 to 10 tasks.
Extensive experiments demonstrate the generation quality, efficiency, and flexibility of the model across these multi-task settings.