The high dimensionality of images presents architecture and
sampling-efficiency challenges for likelihood-based generative models. Previous
approaches such as VQ-VAE use deep autoencoders to obtain compact
representations, which are more practical as inputs for likelihood-based
models. We present an alternative approach, inspired by common image
compression methods like JPEG, and convert images to quantized discrete cosine
transform (DCT) blocks, which are represented sparsely as a sequence of
(DCT channel, spatial location, DCT coefficient) triples. We propose a
Transformer-based autoregressive architecture, which is trained to sequentially
predict the conditional distribution of the next element in such sequences, and
which scales effectively to high-resolution images. On a range of image
datasets, we demonstrate that our approach can generate high-quality, diverse
images, with sample metric scores competitive with state-of-the-art methods. We
additionally show that simple modifications to our method yield effective image
colorization and super-resolution models.
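As a rough illustration of the sparse representation described above, below is a minimal sketch, not the authors' implementation: the 8x8 block size, the uniform quantization step, the grayscale input, and the triple ordering are all illustrative assumptions.

import numpy as np
from scipy.fft import dctn

BLOCK = 8  # assumed JPEG-style 8x8 blocks

def image_to_sparse_dct(img, quant_step=16.0):
    """Convert a 2D uint8 grayscale image (sides divisible by BLOCK) into a
    sparse sequence of (DCT channel, spatial location, DCT coefficient) triples."""
    h, w = img.shape
    triples = []
    for by in range(0, h, BLOCK):
        for bx in range(0, w, BLOCK):
            block = img[by:by+BLOCK, bx:bx+BLOCK].astype(np.float64) - 128.0
            coeffs = dctn(block, norm="ortho")              # 2D DCT of the block
            q = np.round(coeffs / quant_step).astype(np.int64)  # uniform quantization (assumed)
            pos = (by // BLOCK) * (w // BLOCK) + bx // BLOCK     # flattened block index
            for ch in range(BLOCK * BLOCK):                 # DCT channel = coefficient index
                val = q.flat[ch]
                if val != 0:                                # sparsity: keep non-zero coefficients only
                    triples.append((ch, pos, int(val)))
    return triples

An autoregressive Transformer, as proposed in the paper, would then be trained to model the conditional distribution of each successive element of such a sequence.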
Authors
Charlie Nash, Jacob Menick, Sander Dieleman, Peter W. Battaglia