Making BERT Learn Faster

Arka Talukdar
12 min read · May 3, 2021


The OG BERT

BERT and its many derivatives (ALBERT, RoBERTa, etc.) have become the mainstay of almost all NLP deep learning pipelines. It can be considered a milestone in NLP, similar to what ResNet trained on ImageNet was to the computer vision field. The BERT paper has become one of the most cited papers in the field and even led to "BERTology", a 20-page review paper on BERT.

Even Google Search now uses BERT!

The biggest problem with BERT is its size. Other issues include social bias, lack of explainability, and inability to generalize.

But there is a little fly in the ointment: BERT is hard to use in production. BERT-base contains 110M parameters, and the larger variant, BERT-large, contains 340M. Such large neural networks are problematic in practice: because of the large number of parameters, it is very difficult to deploy BERT on resource-constrained systems such as mobile devices, and its slow inference makes it unsuitable for real-time systems. That's why finding ways to make it faster is so important.

Making Neural Networks Fast

Making neural networks faster is a well-studied problem, and researchers have drawn on prior work from computer vision to take a crack at it.

These approaches roughly can be divided into several groups:

  1. Model training improvements (hardware, communication, and distributed training).
  2. Architecture improvements (change the architecture to a faster one, say, replace an RNN with a Transformer or a CNN; use layers that require fewer computations, and so on) or more clever optimization (learning rate schedule, number of warmup steps, larger batch size, etc.).
  3. Model distillation (training a smaller student model that learns from BERT as a teacher).
  4. Model compression (quantization and/or pruning).
  5. Bonus: TernaryBERT uses both compression and distillation to achieve speed.

Let’s explore how each of these methods has accelerated BERT training.

1. Model Training Improvements

Large-scale distributed training

The first and most obvious way to speed up BERT training is to distribute it over a larger cluster. While the original BERT was already trained on several machines, there are some optimized solutions for distributed training of BERT.


Nvidia’s BERT implementation

In this section, we will look at Nvidia's implementation in more detail. Nvidia's recipe relies on the LAMB optimizer (more on it below): for BERT, LAMB can reach a global batch size of 64K for input sequence length 128 (phase 1) and 32K for length 512 (phase 2). On a single GPU, reaching such a batch size would require a mini-batch of 64 combined with 1,024 gradient accumulation steps, and pre-training BERT this way would take months.
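To give a concrete picture of how such a batch is assembled on limited hardware, here is a minimal PyTorch-style sketch of gradient accumulation; the model, optimizer, and data loader are placeholders, not Nvidia's actual training script.

```python
import torch

# Effective batch = mini-batch (64) x accumulation steps (1024) = 65,536,
# i.e. the 64K global batch used in phase 1.
ACCUMULATION_STEPS = 1024

def train_one_epoch(model, optimizer, data_loader, device="cuda"):
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(data_loader):
        outputs = model(batch["input_ids"].to(device),
                        labels=batch["labels"].to(device))
        # Scale the loss so gradients are averaged over the effective batch.
        (outputs.loss / ACCUMULATION_STEPS).backward()
        if (step + 1) % ACCUMULATION_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```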

In 2019, Nvidia built DGX SuperPOD systems with 92 and 64 DGX-2H machines and finished the training in 47 and 67 minutes respectively.

BERT Performance

A DGX-2 costs about $400K. One way to access such infrastructure is through a cloud service, such as Microsoft Azure's announced NDv2 offering, which scales up to 800 Nvidia V100 Tensor Core GPUs. Yet whether this pays off depends on the organization's usage and use cases: for example, if the corpus changes over time or the model needs to be trained multiple times, the price of the cloud solution may add up.

On multi-node systems, LAMB can scale up to 1,024 GPUs with a 17x training speedup compared to the Adam optimizer.

LAMB itself is an example of more clever optimization (paired with super-powerful hardware): it is a layerwise adaptive large-batch optimization technique that reduced BERT training time from 3 days to just 76 minutes on a (also very expensive) TPUv3 Pod, i.e. 1,024 TPUv3 chips providing more than 100 PFLOPS of mixed-precision compute.
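For intuition, here is a simplified sketch of the LAMB update for a single weight tensor; it omits bias correction and other details of the paper and of Nvidia's implementation.

```python
import torch

def lamb_step(param, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    # Adam-style first and second moment estimates.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    update = m / (v.sqrt() + eps) + weight_decay * param
    # Layer-wise trust ratio: rescale the step by ||w|| / ||update||,
    # which is what lets LAMB tolerate very large batches.
    w_norm, u_norm = param.norm(), update.norm()
    trust_ratio = (w_norm / u_norm).item() if w_norm > 0 and u_norm > 0 else 1.0
    param.add_(update, alpha=-lr * trust_ratio)
```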

2. Architecture and Optimization Improvements

Architectures

Stacking Algorithm

Turning to solutions that rely more on architecture than on hardware, there is a progressive stacking method for training BERT. It is based on the observation that the self-attention distribution concentrates locally around each token's position and the start-of-sentence token, and that the attention distributions of a shallow model are similar to those of a deep one. Motivated by this, the authors propose a stacking algorithm that transfers knowledge from a shallow model to a deep model, and apply it progressively to accelerate BERT training. They achieve a training time about 25% shorter than the original BERT, mainly because, for the same number of steps, training a small model needs less computation.
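The core trick can be sketched in a few lines of PyTorch: initialize a deeper encoder by duplicating the layer stack of a trained shallow one (a sketch of the idea, not the authors' code).

```python
import copy
import torch.nn as nn

def progressively_stack(shallow_layers: nn.ModuleList) -> nn.ModuleList:
    # Initialize a 2L-layer encoder from a trained L-layer one by copying
    # the whole layer stack twice; training then continues from this warm start.
    return nn.ModuleList([copy.deepcopy(layer)
                          for layer in list(shallow_layers) * 2])

# Usage: train 3 layers, stack to 6, keep training, stack to 12, and so on.
```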

Other architectural improvements reducing the total amount of memory and/or computation are sparse factorizations of the attention matrix (aka Sparse Transformer by OpenAI) and block attention.

ALBERT

And finally, there is a possible architectural descendant of BERT called ALBERT (A Lite BERT) submitted to the ICLR 2020 conference.

ALBERT incorporates two parameter-reduction techniques.

The first one is a factorized embedding parameterization, separating the size of the hidden layers from the size of vocabulary embedding. This separation makes it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embeddings.

The second technique is cross-layer parameter sharing. This technique prevents the parameter from growing with the depth of the network.

Both techniques significantly reduce the number of parameters for BERT without seriously hurting performance, thus improving parameter efficiency.
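A back-of-the-envelope calculation shows why the factorized embedding helps; the sizes below (V = 30,000, H = 1,024, E = 128) are illustrative BERT/ALBERT-like values, not exact configuration numbers.

```python
# Embedding parameters with and without factorization.
V, H, E = 30_000, 1_024, 128

bert_style = V * H               # one V x H embedding matrix
albert_style = V * E + E * H     # V x E embedding followed by an E x H projection

print(f"{bert_style:,}")    # 30,720,000
print(f"{albert_style:,}")  # 3,971,072

# Cross-layer sharing works on the other axis: all Transformer layers reuse
# one set of weights, so the layer parameters do not grow with depth.
```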

ALBERT vs BERT Performance

An ALBERT configuration similar to BERT-large has 18x fewer parameters and can be trained about 1.7x faster.

It even outperforms heavily tuned RoBERTa!

3. Distillation

An interesting model compression method is distillation, a technique that transfers the knowledge of a large “teacher” network to a smaller “student” network. The “student” network is trained to mimic the behaviors of the “teacher” network.

A version of this strategy has already been pioneered by Rich Caruana and his collaborators. In their important paper, they demonstrate convincingly that the knowledge acquired by a large ensemble of models can be transferred to a single small model.

Geoffrey Hinton et al. showed that this technique can be applied to neural networks in their paper "Distilling the Knowledge in a Neural Network".
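The basic recipe looks roughly like the following; the temperature and mixing weight are illustrative defaults, not the values used by any particular BERT distillation.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soften both distributions with a temperature and match them with KL divergence.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    kd = kd * temperature ** 2  # usual correction for the gradient scale
    # Mix with the ordinary hard-label cross-entropy.
    return alpha * kd + (1 - alpha) * F.cross_entropy(student_logits, labels)
```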

DistilBERT

Since then, this approach has been applied to various neural networks, and you have probably heard of the BERT distillation called DistilBERT by HuggingFace.

Finally, on October 2nd, 2019, a paper on DistilBERT called "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" emerged and was submitted to NeurIPS 2019.

DistilBERT is a smaller language model trained with supervision from BERT, in which the authors removed the token-type embeddings and the pooler (used for the next-sentence classification task) and kept the rest of the architecture identical while reducing the number of layers by a factor of two.

You can use DistilBERT off the shelf with the help of the transformers Python package by HuggingFace (formerly known as pytorch-transformers and pytorch-pretrained-bert). Version 2.0.0 of the package supports TensorFlow 2.0/PyTorch interoperability.
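For example, with a recent version of the library (the exact API has shifted a bit since version 2.0.0), loading DistilBERT takes just a few lines:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("BERT is heavy; DistilBERT is lighter.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```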

DistilBERT authors also used a few training tricks from the recent RoBERTa paper which showed that the way BERT is trained is crucial for its final performance.

DistilBERT compares surprisingly well to BERT: authors were able to retain more than 95% of the performance while having 40% fewer parameters.

Comparison on the dev sets of the GLUE benchmark.

In terms of inference time, DistilBERT is more than 60% faster and smaller than BERT and 120% faster and smaller than ELMo+BiLSTM.

Inference speed

TinyBERT

Another BERT distillation soon emerged: TinyBERT by Huawei.

TinyBERT

To build a competitive TinyBERT, the authors first propose a new Transformer distillation method to distill the knowledge embedded in the teacher BERT. Specifically, they design several loss functions to fit different representations from BERT layers:

  1. the output of the embedding layer;
  2. the hidden states and attention matrices derived from the Transformer layers;
  3. the logits output by the prediction layer.

The attention-based fitting is inspired by recent findings that the attention weights learned by BERT can capture substantial linguistic knowledge, which suggests that this knowledge can be transferred well from the teacher BERT to the student TinyBERT. It is, however, ignored in existing KD methods for BERT such as Distilled BiLSTM_SOFT, BERT-PKD, and DistilBERT.
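In rough pseudocode, the layer-to-layer part of the objective looks like this; the dictionary layout and the single learned projection `proj` are simplifications of the paper's formulation, and the mapping that decides which teacher layers each student layer imitates is omitted.

```python
import torch.nn.functional as F

def transformer_distillation_loss(student, teacher, proj):
    # MSE on the embedding-layer outputs (student projected up to the teacher's width).
    loss = F.mse_loss(proj(student["embeddings"]), teacher["embeddings"])
    # MSE on the hidden states of the matched Transformer layers.
    for s_h, t_h in zip(student["hidden_states"], teacher["hidden_states"]):
        loss = loss + F.mse_loss(proj(s_h), t_h)
    # MSE on the attention matrices, which carry the linguistic knowledge.
    for s_a, t_a in zip(student["attentions"], teacher["attentions"]):
        loss = loss + F.mse_loss(s_a, t_a)
    return loss
```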

They then propose a novel two-stage learning framework consisting of general distillation and task-specific distillation. At the general distillation stage, the original BERT without fine-tuning acts as the teacher model, and the student TinyBERT learns to mimic its behavior by running the proposed Transformer distillation on a large-scale general-domain corpus. This yields a general TinyBERT that can be fine-tuned for various downstream tasks. At the task-specific distillation stage, they perform data augmentation to provide more task-related material for teacher-student learning and then re-run the Transformer distillation on the augmented data.

Both stages are essential for improving the performance and generalization capability of TinyBERT.

TinyBERT is effective and achieves results comparable to BERT-base on the GLUE datasets while being 7.5x smaller and 9.4x faster at inference.

TinyBERT comparison to other baselines

Other distillations

1. The “Distilling Task-Specific Knowledge from BERT into Simple Neural Networks” paper distilled BERT into a single-layer BiLSTM, achieving results comparable to ELMo while using roughly 100 times fewer parameters and 15 times less inference time.

BiLSTM_SOFT is a distilled BiLSTM trained on soft logit targets.

2. The “Patient Knowledge Distillation for BERT Model Compression” paper proposed a Patient Knowledge Distillation approach that was among the first attempts to use the hidden states of the teacher, not only the output of its last layer. The student model patiently learns from multiple intermediate layers of the teacher for incremental knowledge extraction. In the Patient-KD framework, the student is trained to imitate the representations only for the [CLS] token in the intermediate layers.

3. The “Extreme Language Model Compression with Optimal Subwords and Shared Projections” paper, submitted to ICLR 2020, focuses on a knowledge distillation technique for training a student model with a significantly smaller vocabulary as well as lower embedding and hidden-state dimensions. The authors employ a dual-training mechanism that trains the teacher and student models simultaneously to obtain optimal word embeddings for the student vocabulary. The method compresses the BERT-base model by more than 60x, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7 MB.

TinyBERT's results look better, but in any case a BERT-like model under 7 MB looks cool.

4. Quantization and Pruning

Source: Rasa article on BERT compression

Quantization decreases the numerical precision of a model’s weights.

Typically, models are trained in FP32 (32-bit floating point); they can then be quantized to FP16 (16-bit floating point), INT8 (8-bit integer), or even further to INT4 or INT1, reducing the model size by 2x, 4x, 8x, or 32x respectively. This is called post-training quantization.
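As a minimal example, PyTorch's dynamic quantization can be applied to a BERT-style model in one call; the model name and task head below are just placeholders.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Post-training dynamic quantization: nn.Linear weights are stored as INT8
# and dequantized on the fly during the matrix multiplications.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)
```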

Another (harder and less mature) option is quantization-aware training. FP16 training is becoming a commodity now, and ICLR 2020 has an interesting submission on state-of-the-art training results using an 8-bit floating-point representation across ResNet, GNMT, and Transformer models.

Pruning

An alternative to quantization is pruning. Pruning introduces zeros (a.k.a. sparsity) into the weight matrices, promising both memory and compute savings. For example, recent work by Huggingface, pruneBERT, was able to achieve 95% sparsity on BERT while fine-tuning for downstream tasks. Another promising work, from the lottery ticket hypothesis team at MIT, shows that one can obtain 70%-sparse pre-trained BERTs that achieve performance similar to the dense model when fine-tuned on downstream tasks. TensorFlow and PyTorch both offer support for experimenting with pruning.
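As a small illustration (not pruneBERT's actual recipe), PyTorch's pruning utilities can sparsify a single linear layer by magnitude like so:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(768, 768)

# Unstructured magnitude pruning: zero out the 90% smallest-magnitude weights.
prune.l1_unstructured(layer, name="weight", amount=0.9)

# Bake the zeros into the weight tensor (drops the pruning re-parametrization).
prune.remove(layer, "weight")
print(float((layer.weight == 0).float().mean()))  # ~0.9
```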

However, getting a speedup from pruning is even more challenging than from quantization, since CPUs don't like sparse computation very much. Indeed, the last time I checked, PyTorch's sparse-dense matrix multiplication is only faster than the dense-dense version if the sparse matrix contains more than 98% zeros! Typically, one can afford at most 90% or maybe 95% sparsity without losing too much accuracy.

Recent solutions, such as OctoML’s TVM, have started to tackle the sparse inference problem: https://medium.com/octoml/using-sparsity-in-apache-tvm-to-halve-your-cloud-bill-for-nlp-4964eb1ce4f2. Although only a comparison to Tensorflow was given, a near 2x speedup on pruneBERT seems fairly promising. Unfortunately, it only seems to work for AMD CPUs, possibly because it is not optimized for AVX512 specific to Intel CPUs.

SparseDNN offers 5x speedup for pruneBERT, and works for both Intel and AMD CPUs. SparseDNN also offers speedups for popular computer vision networks like ResNet and MobileNet.

Of note, currently no library can take advantage of both quantization and pruning at once. (Please comment if you know of one.) SparseDNN offers experimental support, but its sparse INT8 kernels are only marginally faster than the floating-point ones.

Bonus: TernaryBERT: Quantization Meets Distillation

On September 27, 2020, Huawei introduced TernaryBERT, a model that leverages both distillation and quantization to achieve accuracy comparable to the original BERT model with a roughly 15x decrease in size. What is truly remarkable about TernaryBERT is that its weights are ternarized, i.e. take one of three values: -1, 0, or 1 (and can hence be stored in only two bits each).

TernaryBERT cleverly pieces together existing quantization and distillation techniques.
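To make the idea concrete, here is a sketch of TWN-style weight ternarization (one of the quantizers this line of work builds on); TernaryBERT itself wraps such a quantizer in quantization-aware training plus distillation, which this snippet does not show.

```python
import torch

def ternarize(w):
    # Per-tensor threshold heuristic from Ternary Weight Networks.
    delta = 0.7 * w.abs().mean()
    mask = (w.abs() > delta).float()
    # Scaling factor: mean magnitude of the weights that survive the threshold.
    alpha = (w.abs() * mask).sum() / mask.sum()
    # Quantized weights take values in {-alpha, 0, +alpha}; the signs and zeros
    # fit in two bits per weight.
    return alpha * torch.sign(w) * mask
```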

References:

[1] Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Retrieved from https://arxiv.org/abs/1810.04805

[2] Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q. (2019). XLNet: Generalized Autoregressive Pretraining for Language Understanding. Retrieved from https://arxiv.org/abs/1906.08237

[3] Sun, Y., Wang, S., Li, Y., Feng, S., Tian, H., Wu, H., & Wang, H. (2019). ERNIE 2.0: A Continual Pre-training Framework for Language Understanding. Retrieved from https://arxiv.org/abs/1907.12412

[4] Cheong, R., & Daniel, R. (2019). transformers.zip: Compressing Transformers with Pruning and Quantization. Retrieved from http://web.stanford.edu/class/cs224n/reports/custom/15763707.pdf

[5] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., & Kalenichenko, D. (2017). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. Retrieved from https://arxiv.org/abs/1712.05877

[6] Gale, T., Elsen, E., & Hooker, S. (2019). The State of Sparsity in Deep Neural Networks. Retrieved from https://arxiv.org/abs/1902.09574

[7] Han, S., Pool, J., Tran, J., & Dally, W. (2015). Learning both Weights and Connections for Efficient Neural Networks. Retrieved from https://arxiv.org/abs/1506.02626

[8] Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2016). Pruning Convolutional Neural Networks for Resource Efficient Inference. Retrieved from https://arxiv.org/abs/1611.06440

[9] Michel, P., Levy, O., & Neubig, G. (2019). Are Sixteen Heads Really Better than One? Retrieved from https://arxiv.org/abs/1905.10650

[10] Romero, A., Ballas, N., Kahou, S., Chassang, A., Gatta, C., & Bengio, Y. (2014). FitNets: Hints for Thin Deep Nets. Retrieved from https://arxiv.org/abs/1412.6550

[11] Kim, Y., & Rush, A. (2016). Sequence-Level Knowledge Distillation. Retrieved from https://arxiv.org/abs/1606.07947

[12] Luo, P., Zhu, Z., Liu, Z., Wang, X., & Tang, X. (2016). Face Model Compression by Distilling Knowledge from Neurons. Retrieved from https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/download/11977/12130

[13] Mirzadeh, S., Farajtabar, M., Li, A., & Ghasemzadeh, H. (2019). Improved Knowledge Distillation via Teacher Assistant: Bridging the Gap Between Student and Teacher. Retrieved from https://arxiv.org/abs/1902.03393

[14] Tang, R., Lu, Y., Liu, L., Mou, L., Vechtomova, O., & Lin, J. (2019). Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. Retrieved from https://arxiv.org/abs/1903.12136v1

[15] Han, S., Mao, H., & Dally, W. (2015). Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization, and Huffman Coding. Retrieved from https://arxiv.org/abs/1510.00149
