FLAT: An Optimized Dataflow for Mitigating Attention Performance Bottlenecks

07/13/2021 · by Sheng-Chun Kao, et al.

Attention mechanisms form the backbone of state-of-the-art machine learning models for a variety of tasks. Deploying them on deep neural network (DNN) accelerators, however, is prohibitively challenging, especially for long sequences, as this work identifies. This is because operators in attention layers exhibit limited reuse opportunities and a memory footprint that grows quadratically with sequence length, leading to severe memory-boundedness. To address this, we introduce a new attention-tailored dataflow, termed FLAT, which identifies fusion opportunities within the attention layer and implements an on-chip memory-aware interleaved execution and tiling mechanism. FLAT increases the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer, and thus achieves better run time and compute resource utilization. In our evaluation, FLAT achieves 1.94x and 1.76x speedup and 49% and 42% energy reduction over baseline execution on state-of-the-art edge and cloud accelerators, respectively.
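To make the memory-footprint argument concrete, the sketch below shows attention computed one query tile at a time in NumPy, so the score matrix is produced and consumed tile by tile instead of being materialized in full. This is a minimal illustration of the general fusion-and-tiling idea only; the function name, `tile_size` parameter, and row-tile granularity are assumptions for exposition, not the FLAT dataflow or its accelerator mapping.

```python
import numpy as np

def tiled_attention(Q, K, V, tile_size=128):
    """Single-head attention computed one query tile at a time.

    Rather than materializing the full (L x L) score matrix, each query
    tile is scored against all keys, softmax-normalized, and immediately
    reduced against V, so the live intermediate stays O(tile_size x L)
    instead of O(L x L).
    """
    L, d = Q.shape
    out = np.empty_like(Q)
    for start in range(0, L, tile_size):
        end = min(start + tile_size, L)
        scores = Q[start:end] @ K.T / np.sqrt(d)        # (tile, L) scores
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        out[start:end] = weights @ V                    # consume scores immediately
    return out

# Example: long sequence where the full L x L matrix would be costly to buffer.
L, d = 4096, 64
Q, K, V = (np.random.randn(L, d).astype(np.float32) for _ in range(3))
print(tiled_attention(Q, K, V).shape)  # (4096, 64)
```

In this sketch the tile size plays the role of a knob trading on-chip buffer usage against the number of passes over K and V, which is the kind of capacity-aware tiling decision the paper's dataflow reasons about.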
