SinFusion: Training Diffusion Models on a Single Image or Video
Diffusion models have made tremendous progress in image and video generation, exceeding GANs in quality and diversity.
However, they are usually trained on very large datasets and are not naturally adapted to manipulate a given input image or video.
In this paper we show how this can be resolved by training a diffusion model on a single input image (or video).
Our image/video-specific diffusion model (SinFusion) learns the appearance and dynamics of the single image or video, while utilizing the conditioning capabilities of diffusion models.
It can solve a wide array of image and video-specific manipulation tasks.
In particular, our model can learn, from a few frames, the motion and dynamics of a single input video; it can then generate diverse new video samples of the same dynamic scene, extrapolate short videos into long ones (both forward and backward in time), and perform video upsampling.
Diffusion models have taken the lead in recent years, surpassing generative adversarial networks in image quality and diversity and becoming the leading method in many vision tasks, such as text-to-image generation and super-resolution.
We present, for the first time, diffusion models trained on a single image/video, adapted to single-image/video tasks.
Once trained, SinFusion can generate new image/video samples with similar appearance and dynamics to the original input and perform various editing and manipulation tasks.
This is learned from very few frames (typically two to three dozen, though the effect is already apparent with even fewer).
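As a rough illustration of what training a diffusion model on a single image entails, the sketch below shows one standard DDPM denoising step applied to a random crop of that image. The crop-based sampling, the noise schedule, and the `model(x_t, t)` noise-prediction interface are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

# Illustrative linear noise schedule; the schedule and crop size
# actually used in the paper may differ.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def random_crop(image, size):
    """Sample a random crop from the single training image (C, H, W)."""
    _, h, w = image.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    return image[:, top:top + size, left:left + size]

def train_step(model, image, optimizer, crop_size=128):
    """One standard DDPM denoising step on a crop of the single image.
    `model(x_t, t)` is assumed to predict the added noise."""
    x0 = random_crop(image, crop_size).unsqueeze(0)          # (1, C, h, w)
    t = torch.randint(0, T, (1,))
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward diffusion
    loss = F.mse_loss(model(x_t, t), noise)                  # epsilon-prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```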
Results
We present a framework for visual synthesis tasks, especially for videos, whose results are not easily conveyed through the 2D format of a document.
Our main application is generating diverse videos of any length from a single given video, such that the output samples have appearance, structure and motion similar to the original input video.
Given a video, we can anticipate its future frames by initializing the generation process (described above) with the last frame of the input video.
Repeating this autoregressive generation process produces a new video of arbitrary length.
Since our predictor is also trained backward in time (predicting the previous frame using a negative temporal offset), it can generate new videos backwards in time; e.g., starting from the first frame of the balloons video causes them to land, even though these motions were never seen in the original video.
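The autoregressive extrapolation loop described above can be sketched as follows. Here `predictor(frame, offset)` is a hypothetical stand-in for the frame-conditioned diffusion sampler, with the sign of `offset` selecting forward or backward prediction; the names are illustrative, not taken from the paper's code.

```python
import torch

def extrapolate(video, predictor, num_new_frames, direction=+1):
    """Autoregressively extend a video forward (direction=+1) or
    backward (direction=-1) in time."""
    frames = list(video)                       # sequence of (C, H, W) frames
    frame = frames[-1] if direction > 0 else frames[0]
    generated = []
    for _ in range(num_new_frames):
        # Sample the next (or previous) frame conditioned on the current one.
        frame = predictor(frame, offset=direction)
        generated.append(frame)
    if direction > 0:
        return torch.stack(frames + generated)
    # For backward generation, put the new frames back in temporal order.
    return torch.stack(list(reversed(generated)) + frames)
```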
We can increase the temporal resolution of an input video or a generated video by performing frame interpolation.
This is done by applying the interpolator between successive frames and then correcting the appearance of the interpolated frames with the unconditional projector.
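A minimal sketch of this two-stage temporal upsampling is given below, assuming a hypothetical `interpolator(frame_a, frame_b)` that produces an in-between frame and an unconditional `projector(frame)` that corrects its appearance; both names are illustrative.

```python
import torch

def upsample_temporal(video, interpolator, projector):
    """Roughly double the frame rate: interpolate between every pair of
    successive frames, then correct each interpolated frame's appearance
    with the unconditional projector."""
    frames = list(video)
    out = []
    for prev_frame, next_frame in zip(frames[:-1], frames[1:]):
        out.append(prev_frame)
        mid = interpolator(prev_frame, next_frame)   # coarse in-between frame
        out.append(projector(mid))                   # appearance correction
    out.append(frames[-1])
    return torch.stack(out)
```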