Fast DistilBERT on CPUs
A fast pipeline for Transformer-based language models
In this work, we propose a new pipeline for creating and running fast, parallelized Transformer language models on high-performance processors, leveraging hardware-aware pruning, knowledge distillation, quantization, and our own inference runtime engine with optimized kernels for sparse and quantized operators.
We demonstrate the efficiency of our pipeline by creating a fast DistilBERT model that shows minimal accuracy loss on the SQuADv1.1 question-answering benchmark, and we report throughput results under typical production constraints and environments.
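To make the compression recipe concrete, the following is a minimal, generic sketch of the three model-compression steps named above (pruning, knowledge distillation, quantization) using standard PyTorch utilities. It is an illustration under our own assumptions, not the authors' hardware-aware method or their sparse/quantized runtime engine.

```python
import torch
import torch.nn.functional as F
import torch.nn.utils.prune as prune


def prune_linear_layers(model: torch.nn.Module, amount: float = 0.8) -> None:
    """Unstructured magnitude pruning of every Linear weight.

    Illustrative only; the paper's pruning is hardware-aware."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # bake the zeros into the weight tensor


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Knowledge-distillation loss: KL divergence between softened
    teacher and student distributions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


def quantize_linear_layers(model: torch.nn.Module) -> torch.nn.Module:
    """Post-training dynamic INT8 quantization of Linear layers."""
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```

In the paper's actual pipeline, the pruned, distilled, and quantized model is executed by the custom runtime engine with sparse and quantized kernels rather than by stock PyTorch.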
Authors
Haihao Shen, Ofir Zafrir, Bo Dong, Hengyu Meng, Xinyu Ye, Zhe Wang, Yi Ding, Hanwen Chang, Guy Boudoukh, Moshe Wasserblat
In this paper, we propose a new pipeline for creating and running fast Transformer models on commodity CPUs.
Our main contributions are threefold: 1) a hardware-aware extreme compression technique for fast Transformer models on CPUs; 2) an inference runtime engine with optimized kernels for sparse and quantized operators; and 3) a demonstration of up to 1.5x performance gain over Neural Magic's DeepSparse engine and up to 4.1x over ONNX Runtime on common CPUs from Amazon Web Services (AWS) under typical production constraints.
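As a rough illustration of the kind of CPU latency and throughput measurement such comparisons involve, the sketch below times a stock DistilBERT question-answering checkpoint with the Hugging Face transformers pipeline. The checkpoint name and timing loop are our own assumptions, not the paper's benchmark harness.

```python
import time

from transformers import pipeline

# Public DistilBERT checkpoint fine-tuned on SQuAD, used here as a stand-in
# for the compressed model produced by the pipeline described above.
qa = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
    device=-1,  # run on CPU
)

context = (
    "DistilBERT is a distilled version of BERT that retains most of its "
    "accuracy while being smaller and faster."
)
question = "What is DistilBERT?"

qa(question=question, context=context)  # warm-up run

n_runs = 100
start = time.perf_counter()
for _ in range(n_runs):
    qa(question=question, context=context)
elapsed = time.perf_counter() - start

print(f"avg latency: {1000 * elapsed / n_runs:.1f} ms")
print(f"throughput: {n_runs / elapsed:.1f} samples/s")
```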
Results
Our solution delivers the best inference performance among the evaluated engines in the production Transformer deployment scenario.
Our solution shows a consistent performance gain of 3.6x-4.1x over ONNX Runtime and outperforms Neural Magic's DeepSparse by up to 50% across a range of sequence lengths.
We also observe that our solution yields results comparable to the Neural Magic solution and performs 2x-3x better than ONNX Runtime.
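For reference, here is a hypothetical sketch of sweeping sequence lengths against an ONNX Runtime CPU baseline of the kind compared above. The model path, input names, and sequence lengths are placeholders, not the paper's actual benchmark configuration.

```python
import time

import numpy as np
import onnxruntime as ort

# Placeholder ONNX export of a DistilBERT QA model; such exports typically
# take input_ids and attention_mask, but names depend on how the model was exported.
sess = ort.InferenceSession(
    "distilbert-squad.onnx", providers=["CPUExecutionProvider"]
)

batch = 1
for seq_len in (16, 32, 64, 128, 256, 384):
    feed = {
        "input_ids": np.ones((batch, seq_len), dtype=np.int64),
        "attention_mask": np.ones((batch, seq_len), dtype=np.int64),
    }
    sess.run(None, feed)  # warm-up

    n_runs = 50
    start = time.perf_counter()
    for _ in range(n_runs):
        sess.run(None, feed)  # returns start/end logits for extractive QA
    elapsed = time.perf_counter() - start
    print(f"seq_len={seq_len}: {1000 * elapsed / n_runs:.2f} ms/query")
```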