Parallel Attention and Feed-Forward Net Design for Pre-training and Inference on Transformers
In this paper, we introduce the Parallel Attention and Feed-Forward Net Design (PAF) for transformer models. Transformer models are the backbone of virtually all modern Natural Language Processing applications, so efforts to improve their efficiency can have an enormous impact. A transformer model consists of many layers, each containing an attention block followed by a feed-forward network (FFN) that processes the input based on the attention block's output. We refer to this standard design as the Series Attention and Feed-Forward Net Design (SAF). In our proposed PAF design, the FFN block's computations in each layer are made independent of the output of that layer's attention block. This decoupling allows the FFN block of each layer to run in parallel with the attention block of that layer. We evaluate the PAF design by training two large language models (RoBERTa-large and bert-large-uncased) and comparing them to their SAF counterparts on six tasks of the General Language Understanding Evaluation (GLUE) benchmark, which test a range of semantic attributes. The PAF models achieve nearly identical performance to their SAF counterparts on all six tasks. We also compare the time complexities of the attention and FFN blocks and find that running both blocks in parallel can, both in theory and in practice, yield speed gains of up to 1.5x to 2x. We leave the development of fast and efficient libraries implementing the PAF design for future work.
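The following is a minimal PyTorch sketch contrasting the two layer designs described above. The module structure, hyperparameters (d_model, n_heads, d_ff), and the pre-LayerNorm placement are illustrative assumptions, not the authors' exact implementation; the point is only that in PAF both branches read the same layer input and are therefore independent.

```python
import torch
import torch.nn as nn


class SAFLayer(nn.Module):
    """Series design: the FFN consumes the attention block's output."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h)
        x = x + a                          # attention must finish first ...
        return x + self.ffn(self.ln2(x))   # ... before the FFN can start


class PAFLayer(nn.Module):
    """Parallel design: attention and FFN both read the layer input."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln(x)
        a, _ = self.attn(h, h, h)   # these two branches are independent,
        f = self.ffn(h)             # so they can be executed in parallel
        return x + a + f


# Usage: both layers map (batch, seq_len, d_model) -> (batch, seq_len, d_model)
x = torch.randn(2, 16, 768)
print(SAFLayer()(x).shape, PAFLayer()(x).shape)
```

In the SAF layer the FFN's input depends on the attention output, forcing sequential execution; in the PAF sketch the two branches share the same input `h`, so a suitable runtime or custom kernel could schedule them concurrently.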