Q8BERT: Quantized 8Bit BERT

10/14/2019
by Ofir Zafrir, et al.

Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement on many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters, and the emergence of even larger and more accurate models such as GPT2 and Megatron suggests a trend toward ever-bigger pre-trained Transformers. As a result, using these models in production environments is a complex task requiring large amounts of compute, memory, and power. In this work, we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by 4× with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference when deployed on hardware that supports 8-bit integer operations.
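To make the idea concrete, below is a minimal sketch of the general quantization-aware training recipe the abstract describes: symmetric 8-bit "fake" quantization applied during fine-tuning, with a straight-through estimator (STE) so gradients flow through the non-differentiable rounding step. This is an illustrative PyTorch sketch, not the paper's implementation; the class name `FakeQuant` and parameter `num_bits` are assumptions introduced here.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Quantize-dequantize in the forward pass; pass gradients
    through unchanged in the backward pass (straight-through estimator)."""

    @staticmethod
    def forward(ctx, x, num_bits=8):
        qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
        scale = x.abs().max().clamp(min=1e-8) / qmax
        q = torch.clamp(torch.round(x / scale), -qmax, qmax)
        return q * scale                           # dequantize: training stays in float

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                   # STE: identity gradient w.r.t. x

# Usage sketch: wrap a layer's weights (and, in practice, activations)
# during fine-tuning so the model learns to tolerate 8-bit rounding error.
w = torch.randn(768, 768, requires_grad=True)      # hypothetical BERT-sized weight
w_q = FakeQuant.apply(w)                           # use w_q in the layer's matmul
```

After training, the learned scales let the weights be stored and executed as true 8-bit integers (hence the 4× compression relative to 32-bit floats), which is where the speedup on integer-supporting hardware comes from.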
