EdgeTran: Co-designing Transformers for Efficient Inference on Mobile Edge Platforms

03/24/2023

Automated design of efficient transformer models has recently attracted significant attention from industry and academia. However, most works focus only on certain metrics while searching for the best-performing transformer architecture. Furthermore, running traditional, complex, and large transformer models on low-compute edge platforms is a challenging problem. In this work, we propose a framework, called ProTran, to profile the hardware performance measures for a design space of transformer architectures and a diverse set of edge devices. We use this profiler in conjunction with the proposed co-design technique to obtain the best-performing models that have high accuracy on the given task and minimize latency, energy consumption, and peak power draw to enable edge deployment. We refer to our framework for co-optimizing accuracy and hardware performance measures as EdgeTran. It searches for the best transformer model and edge device pair. Finally, we propose GPTran, a multi-stage block-level grow-and-prune post-processing step that further improves accuracy in a hardware-aware manner. The obtained transformer model is 2.8× smaller and has a 0.8% higher GLUE score than the baseline (BERT-Base). Inference with it on the selected edge device enables 15.0% lower latency, 10.0× lower energy, and 10.8× lower peak power draw compared to an off-the-shelf GPU.
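To make the idea of searching for a model and device pair concrete, the following is a minimal sketch, not the EdgeTran search itself. It assumes a simple weighted-sum objective over profiled measures, which is a stand-in technique chosen for illustration. The names `Profile`, `select_pair`, and `PROFILES`, the weights, and every number in the table are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical profiled measures for one (model, device) pair."""
    accuracy: float       # task score in [0, 1], higher is better
    latency_ms: float     # lower is better
    energy_j: float       # lower is better
    peak_power_w: float   # lower is better


def select_pair(profiles, weights=(0.5, 0.2, 0.2, 0.1)):
    """Return the (model, device) key with the highest weighted score.

    Each hardware measure is divided by its maximum over all pairs so
    that the costs are comparable, then turned into a reward (1 - cost).
    The weights are assumptions, not values from the paper.
    """
    w_acc, w_lat, w_en, w_pow = weights
    max_lat = max(p.latency_ms for p in profiles.values())
    max_en = max(p.energy_j for p in profiles.values())
    max_pow = max(p.peak_power_w for p in profiles.values())

    def score(p):
        return (w_acc * p.accuracy
                + w_lat * (1 - p.latency_ms / max_lat)
                + w_en * (1 - p.energy_j / max_en)
                + w_pow * (1 - p.peak_power_w / max_pow))

    return max(profiles, key=lambda key: score(profiles[key]))


# Made-up profiling table: two models on two devices.
PROFILES = {
    ("small", "device_a"): Profile(0.78, 40.0, 2.0, 5.0),
    ("small", "device_b"): Profile(0.78, 25.0, 6.0, 15.0),
    ("large", "device_a"): Profile(0.82, 120.0, 7.0, 6.0),
    ("large", "device_b"): Profile(0.82, 60.0, 14.0, 18.0),
}

best = select_pair(PROFILES)
print(best)
```

With these made-up numbers, the smaller model on the lower-power device wins because its small accuracy deficit is outweighed by the hardware measures. Setting the weights to `(1.0, 0.0, 0.0, 0.0)` would instead select purely on accuracy, which shows how the trade-off depends on the chosen weighting.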
