TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models
The memory capacity of embedding tables in deep learning recommendation
models (DLRMs) is increasing dramatically, from tens of GBs to TBs across the
industry. Given this rapid growth, novel solutions are urgently needed to
enable fast and efficient DLRM innovation without exponentially increasing
infrastructure capacity demands. In this paper, we demonstrate the promising
potential of Tensor Train decomposition for DLRMs (TT-Rec), an important yet
under-investigated context.
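To make the decomposition concrete, the sketch below reconstructs a single row of an (M, N) embedding table from three TT cores without ever materializing the dense table. The factorizations, ranks, and function names are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical factorizations: M = 200*220*250 (vocabulary size) and
# N = 4*4*4 (embedding dimension), with TT-ranks (1, 16, 16, 1).
m, n, r = (200, 220, 250), (4, 4, 4), (1, 16, 16, 1)

# Core k holds r[k] x (m[k]*n[k]) x r[k+1] parameters; random stand-ins here.
cores = [np.random.randn(r[k], m[k] * n[k], r[k + 1]) for k in range(3)]

def tt_embedding_row(row: int) -> np.ndarray:
    """Gather one row of the implicit (M, N) table from the TT cores."""
    # Mixed-radix decomposition of the row index: row -> (i1, i2, i3).
    idx, rem = [], row
    for mk in reversed(m):
        rem, ik = divmod(rem, mk)
        idx.insert(0, ik)
    # Slice core k at sub-index i_k; each slice has shape (r[k], n[k], r[k+1]).
    slices = [cores[k].reshape(r[k], m[k], n[k], r[k + 1])[:, idx[k]]
              for k in range(3)]
    # Chain the slices: contract away the TT-ranks, keep the n_k axes.
    vec = np.einsum('aib,bjc,ckd->aijkd', *slices)
    return vec.reshape(-1)  # length N = n1*n2*n3 = 64

print(tt_embedding_row(12345).shape)  # -> (64,)
```

With these illustrative sizes, the three cores hold roughly 2.5e5 parameters in place of the roughly 7e8 entries of the dense 11M-by-64 table, which is where the compression comes from.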
We design and implement optimized kernels (TT-EmbeddingBag) to evaluate the
proposed TT-Rec design; TT-EmbeddingBag is 3 times faster than the
state-of-the-art TT implementation. We further optimize TT-Rec's performance
with batched matrix multiplication and with caching strategies for embedding
vector lookup operations.
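The sketch below illustrates, under the same illustrative shapes as above, how both optimizations might look: the per-row chain of small matrix products is replaced by two batched contractions over a whole batch of indices, and a simple LRU cache keeps materialized rows for frequently looked-up categories. The function names and cache policy are our assumptions, not the TT-EmbeddingBag API.

```python
from functools import lru_cache
import numpy as np

m, n, r = (200, 220, 250), (4, 4, 4), (1, 16, 16, 1)  # illustrative sizes
cores = [np.random.randn(r[k], m[k], n[k], r[k + 1]) for k in range(3)]

def tt_rows_batched(rows: np.ndarray) -> np.ndarray:
    """Materialize a batch of rows with two batched (GEMM-like) contractions."""
    # Vectorized mixed-radix split of the row indices.
    i3 = rows % m[2]
    i2 = (rows // m[2]) % m[1]
    i1 = rows // (m[2] * m[1])
    a = cores[0][:, i1]  # (1,  B, n1, r1)
    b = cores[1][:, i2]  # (r1, B, n2, r2)
    c = cores[2][:, i3]  # (r2, B, n3, 1)
    # One batched contraction per core boundary instead of B tiny products.
    ab = np.einsum('aBib,bBjc->Bijc', a, b)     # (B, n1, n2, r2)
    abc = np.einsum('Bijc,cBkd->Bijkd', ab, c)  # (B, n1, n2, n3, 1)
    return abc.reshape(len(rows), -1)           # (B, n1*n2*n3)

@lru_cache(maxsize=4096)
def cached_row(row: int) -> np.ndarray:
    # DLRM lookups are highly skewed toward hot categories, so caching
    # materialized rows avoids repeating the TT contraction for them.
    # Callers should treat the returned array as read-only.
    return tt_rows_batched(np.array([row]))[0]
```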
In addition, we analyze, both mathematically and empirically, the effect of
the weight initialization distribution on DLRM accuracy, and propose
initializing the tensor cores of TT-Rec from a sampled Gaussian distribution.
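To sketch why the core initialization matters: each reconstructed table entry is a sum of r1*r2 products of d = 3 core entries, so drawing every core from the same N(0, sigma^2) inflates the variance of the reconstructed weights. The variance-matching rule below, under an independence assumption on core entries, is one plausible reading of sampled-Gaussian initialization, not necessarily the paper's exact procedure.

```python
import numpy as np

def init_tt_cores(m, n, r, sigma=0.01, seed=0):
    """Draw TT cores so reconstructed entries have variance ~ sigma^2.

    Assumes independent zero-mean core entries. Each reconstructed entry
    sums prod(r[1:-1]) uncorrelated products of d core entries, giving
    Var = prod(r[1:-1]) * s**(2d), so s = (sigma**2 / prod(r[1:-1]))**(1/(2d)).
    """
    rng = np.random.default_rng(seed)
    d = len(m)
    s = (sigma**2 / np.prod(r[1:-1])) ** (1.0 / (2 * d))
    return [rng.normal(0.0, s, size=(r[k], m[k] * n[k], r[k + 1]))
            for k in range(d)]

cores = init_tt_cores(m=(200, 220, 250), n=(4, 4, 4), r=(1, 16, 16, 1))
```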
We evaluate TT-Rec across three important dimensions of the design space,
namely memory capacity, accuracy, and timing performance, by training
MLPerf-DLRM on the Criteo Kaggle and Terabyte data sets. TT-Rec achieves 117
times and 112 times model size compression for Kaggle and Terabyte,
respectively. This substantial model size reduction comes with no loss in
accuracy and no training time overhead compared to the uncompressed baseline.