Accelerating Sparse Deep Neural Networks
Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, Paulius Micikevicius
As neural network model sizes have dramatically increased, so has the
interest in various techniques to reduce their parameter counts and accelerate
their execution. An active area of research in this field is sparsity:
encouraging zero values in parameters that can then be discarded from storage
or computation. While most research focuses on high levels of sparsity, two
challenges remain: universally maintaining model accuracy and achieving
significant speedups over dense execution on modern matrix-math hardware. To make sparsity
adoption practical, the NVIDIA Ampere GPU architecture introduces sparsity
support in its matrix-math units, Tensor Cores. We present the design and
behavior of Sparse Tensor Cores, which exploit a 2:4 (50%) sparsity pattern,
in which two values in every contiguous group of four are zero, to deliver
twice the math throughput of dense matrix units. We also describe
a simple workflow for training networks that both satisfy the 2:4 sparsity pattern
requirements and maintain accuracy, verifying it on a wide range of common
tasks and model architectures. This workflow makes it easy to prepare accurate
models for efficient deployment on Sparse Tensor Cores.
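As a concrete illustration of the 2:4 pattern, the sketch below zeroes out the
two smallest-magnitude weights in every contiguous group of four along each
row, leaving a matrix that meets the 50% structured-sparsity requirement. This
is a minimal NumPy example for exposition only; the function name and shapes
are illustrative, and it is not the tooling or API described in the paper.

```python
# Hypothetical sketch: enforce 2:4 sparsity by keeping the two
# largest-magnitude values in each contiguous group of four.
import numpy as np

def prune_2_to_4(weights: np.ndarray) -> np.ndarray:
    """Return a copy of `weights` with the two smallest-magnitude
    values in every group of four contiguous row elements set to zero."""
    rows, cols = weights.shape
    assert cols % 4 == 0, "row length must be a multiple of 4"
    groups = weights.reshape(rows, cols // 4, 4)        # split each row into groups of 4
    # Indices of the two largest-magnitude entries in each group (ascending sort, take last two).
    keep = np.argsort(np.abs(groups), axis=-1)[..., 2:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)        # True where a value is kept
    return (groups * mask).reshape(rows, cols)

# Example: a 2x8 matrix becomes 50% sparse, with exactly two nonzeros per group of four.
w = np.arange(1, 17, dtype=np.float32).reshape(2, 8)
print(prune_2_to_4(w))
```

A magnitude-based pruning step of this kind is one way to produce weights that
Sparse Tensor Cores can accelerate; in practice it is paired with retraining,
with the zeroed weights held at zero, to recover accuracy.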