Accelerating JPEG Decompression on GPUs
André Weißenberger, Bertil Schmidt
The JPEG compression format has been the standard for lossy image compression
for multiple decades, offering high compression rates with only minor
perceptual loss in image quality. For GPU-accelerated computer vision and deep learning
tasks, such as the training of image classification models, efficient JPEG
decoding is essential due to limitations in memory bandwidth. As many decoder
implementations are CPU-based, decoded image data has to be transferred to
accelerators like GPUs via interconnects such as PCI-E, which reduces overall
throughput. JPEG decoding therefore represents a considerable bottleneck
in these pipelines. In contrast, efficiency could be vastly increased by
utilizing a GPU-accelerated decoder. In this case, only compressed data needs
to be transferred, as decoding will be handled by the accelerators. In order to
design such a GPU-based decoder, the underlying algorithms must be parallelized
at a fine-grained level. However, parallel decoding of individual JPEG files
represents a complex task. In this paper, we present an efficient method for
JPEG image decompression on GPUs, which implements an important subset of the
JPEG standard. The proposed algorithm evaluates codeword locations at arbitrary
positions in the bitstream, thereby enabling parallel decompression of
independent chunks. Our performance evaluation shows that on an A100 (V100) GPU
our implementation can outperform the state-of-the-art implementations
libjpeg-turbo (CPU) and nvJPEG (GPU) by factors of up to 51 (34) and 8.0
(5.7), respectively. Furthermore, it achieves a speedup of up to 3.4 over nvJPEG accelerated
with the dedicated hardware JPEG decoder on an A100.
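To illustrate the chunk-parallel idea mentioned above, the following CUDA sketch shows one possible way to let each GPU thread speculatively decode a fixed-size bit chunk starting at an arbitrary, possibly misaligned position and record the codeword boundary at which it leaves its chunk. It assumes a simplified flat lookup-table Huffman decoder and omits JPEG specifics (byte stuffing, restart markers, DC prediction) as well as the later pass that stitches chunks together; the kernel name, table layout, and chunk size are illustrative assumptions, not the authors' implementation.

// Minimal sketch (not the paper's implementation): speculative, chunk-parallel
// Huffman decoding. Each thread decodes CHUNK_BITS of compressed data starting
// at a tentative bit offset and records the first codeword boundary at or
// beyond its chunk end. A follow-up pass (omitted) would exploit the
// self-synchronization property of Huffman codes to stitch chunks together.
#include <cstddef>
#include <cstdint>

constexpr int MAX_CODE_BITS = 16;   // JPEG Huffman codes are at most 16 bits long
constexpr int CHUNK_BITS    = 4096; // compressed bits per thread (assumed value)

// One entry per possible MAX_CODE_BITS-bit window: decoded symbol and the
// true length of the matched codeword. The caller must fill every entry.
struct LutEntry {
    uint8_t symbol;
    uint8_t length;   // in bits, 1..MAX_CODE_BITS
};

// Read MAX_CODE_BITS bits, MSB first, starting at an arbitrary bit offset.
__device__ uint32_t peek_bits(const uint8_t* stream, size_t stream_bits,
                              size_t bit_pos)
{
    uint32_t window = 0;
    for (int i = 0; i < MAX_CODE_BITS; ++i) {
        size_t p = bit_pos + i;
        uint32_t bit = (p < stream_bits)
            ? (stream[p >> 3] >> (7 - (p & 7))) & 1u
            : 0u;                     // pad past the end of the stream with zeros
        window = (window << 1) | bit;
    }
    return window;
}

// Each thread speculatively decodes one chunk. Its starting offset may fall
// inside a codeword; it still records where it exits the chunk, which a later
// pass can compare against the true decoding to detect synchronization.
// exit_bit and symbol_count must hold one entry per chunk.
__global__ void decode_chunks(const uint8_t* stream, size_t stream_bits,
                              const LutEntry* lut,
                              size_t* exit_bit, int* symbol_count)
{
    size_t chunk = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t start = chunk * CHUNK_BITS;
    if (start >= stream_bits) return;

    size_t end = start + CHUNK_BITS;
    if (end > stream_bits) end = stream_bits;

    size_t pos = start;
    int symbols = 0;
    while (pos < end) {
        // Assumes the LUT maps every window to a valid (symbol, length) pair.
        LutEntry e = lut[peek_bits(stream, stream_bits, pos)];
        pos += e.length;   // jump to the next (speculative) codeword boundary
        ++symbols;
    }
    exit_bit[chunk]     = pos;       // first boundary at or beyond the chunk end
    symbol_count[chunk] = symbols;
}

The sketch relies on the observation that a Huffman decoder started in the middle of a codeword typically re-synchronizes with the true codeword boundaries after a few symbols, so the recorded exit positions of neighboring chunks can later be chained and verified to obtain a consistent overall decoding.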