BERT Busters: Outlier LayerNorm Dimensions that Disrupt BERT

05/14/2021
by   Olga Kovaleva, et al.

Multiple studies have shown that BERT is remarkably robust to pruning, yet few if any of its components retain high importance across downstream tasks. Contrary to this received wisdom, we demonstrate that pre-trained Transformer encoders are surprisingly fragile to the removal of a very small number of scaling factors and biases in the output layer normalization (fewer than 0.0001% of model weights). These are high-magnitude normalization parameters that emerge early in pre-training and appear consistently in the same dimensional position throughout the model. They are present in all six models of the BERT family that we examined, and removing them significantly degrades both MLM perplexity and downstream task performance. Our results suggest that layer normalization plays a much more important role than usually assumed.
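A minimal sketch of the kind of ablation described above, assuming a HuggingFace Transformers BERT checkpoint: it zeroes the scaling factor (gamma) and bias (beta) of a single output-LayerNorm dimension in every encoder layer. This is not the authors' code, and the dimension index used here is a placeholder rather than the outlier position reported in the paper.

```python
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical index; the paper identifies specific high-magnitude positions.
outlier_dim = 0

with torch.no_grad():
    for layer in model.bert.encoder.layer:
        # Output LayerNorm applied after the feed-forward sub-layer.
        layer.output.LayerNorm.weight[outlier_dim] = 0.0
        layer.output.LayerNorm.bias[outlier_dim] = 0.0

# The modified model can then be evaluated for MLM perplexity or fine-tuned
# on downstream tasks to measure the resulting performance drop.
```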
