Dynamic Masking Rate Schedules for MLM Pretraining
Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%; we instead dynamically schedule the masking rate throughout training. We found that linearly decreasing the masking rate from 30% to 15% over the course of pretraining improves average GLUE accuracy by 0.46% over a standard fixed 15% rate. We find that the benefits of scheduling come from being exposed to both high and low masking rate regimes. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models and achieve up to a 1.89x speedup in pretraining.
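Below is a minimal sketch of how a linear masking-rate schedule could be wired into an MLM data pipeline, assuming a step-indexed training loop; the function and parameter names (`linear_mask_rate`, `mask_tokens`, `mask_token_id`) are illustrative and not taken from the paper's implementation.

```python
# Sketch of a linearly decaying masking-rate schedule for MLM pretraining.
# Assumes PyTorch and a step-based training loop; names are hypothetical.
import torch


def linear_mask_rate(step: int, total_steps: int,
                     start_rate: float = 0.30, end_rate: float = 0.15) -> float:
    """Linearly decay the masking rate from start_rate to end_rate over training."""
    progress = min(step / max(total_steps, 1), 1.0)
    return start_rate + (end_rate - start_rate) * progress


def mask_tokens(input_ids: torch.Tensor, mask_rate: float,
                mask_token_id: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Randomly replace a fraction of tokens with [MASK]; return masked inputs and labels."""
    labels = input_ids.clone()
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_rate
    labels[~mask] = -100                  # only masked positions contribute to the loss
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id
    return masked_inputs, labels


# Example: halfway through pretraining the schedule gives a 22.5% masking rate.
print(linear_mask_rate(step=50_000, total_steps=100_000))  # 0.225
```

In a training loop, `linear_mask_rate` would be called once per batch so that early batches see the higher (30%) rate and later batches converge to the standard 15%.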