Implicit Kernel Attention
Attention computes dependencies between representations and encourages a model to focus on important, selective features. Among attention mechanisms, scaled dot-product attention is the most widely used. This paper proposes a generalized structure for scaled dot-product attention built from a similarity term and a magnitude term. We show that scaled dot-product attention factorizes into two parts: 1) an RBF kernel that measures the similarity of two instances and 2) an exponential L^2 norm that measures the importance of each individual instance. Building on this decomposition, we improve attention in two ways: implicit modeling of the kernel spectral density and a generalized L^p norm, which together yield a learnable and flexible attention structure. First, we estimate the kernel's spectral density with implicit probabilistic models, learning an appropriate kernel for a given dataset without manual kernel selection. Second, we introduce a generalized L^p norm on the hidden feature space, where p is a hyper-parameter that controls the scale of individual importance and the sparsity of the attention weights. We also show how to extend this implicit kernel modeling to multi-head attention in conjunction with a copula augmentation. Our generalized attention achieves better performance on text classification, translation, regression, and node classification tasks.
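To make the stated decomposition concrete, the following sketch spells out the underlying algebra, assuming unnormalized attention scores of the form exp(q^T k / sqrt(d)) for a query q, key k, and dimension d (this notation is introduced here for illustration and is not taken from the paper):

\exp\!\left(\frac{q^\top k}{\sqrt{d}}\right)
  = \underbrace{\exp\!\left(-\frac{\lVert q-k\rVert_2^2}{2\sqrt{d}}\right)}_{\text{RBF similarity}}
    \cdot
    \underbrace{\exp\!\left(\frac{\lVert q\rVert_2^2}{2\sqrt{d}}\right)\exp\!\left(\frac{\lVert k\rVert_2^2}{2\sqrt{d}}\right)}_{\text{exponential } L^2 \text{ magnitudes}},
  \qquad\text{since}\quad
  q^\top k = \tfrac{1}{2}\left(\lVert q\rVert_2^2 + \lVert k\rVert_2^2 - \lVert q-k\rVert_2^2\right).

Note that the query-magnitude factor is shared across all keys and cancels under the softmax normalization, so the key-magnitude term is what governs the importance of an individual instance. Per the abstract, the proposed generalization replaces the RBF kernel with an implicitly modeled kernel and the L^2 norm with an L^p norm.

As a further illustration, below is a minimal NumPy sketch of attention computed directly from this similarity-times-magnitude form, with a p parameter standing in for the generalized L^p magnitude term. The function name, the scaling, and the use of the squared L^p norm are assumptions made for this sketch, not the authors' implementation.

```python
import numpy as np

def decomposed_attention(Q, K, V, p=2.0):
    """Hypothetical sketch of attention built from the similarity/magnitude
    decomposition described in the abstract: an RBF-style similarity term times
    an exponential L^p magnitude term. With p=2 this reproduces standard
    softmax scaled dot-product attention (the query-magnitude factor cancels
    in the normalization). Not the authors' implementation."""
    d = Q.shape[-1]
    scale = 2.0 * np.sqrt(d)

    # Pairwise squared L2 distances between queries and keys -> RBF similarity term.
    sq_dists = np.sum((Q[:, None, :] - K[None, :, :]) ** 2, axis=-1)
    similarity = np.exp(-sq_dists / scale)

    # Squared L^p norm of each key -> exponential magnitude (importance) term.
    magnitude = np.exp(np.sum(np.abs(K) ** p, axis=-1) ** (2.0 / p) / scale)

    # Combine, normalize over keys, and aggregate the values.
    scores = similarity * magnitude[None, :]
    weights = scores / scores.sum(axis=-1, keepdims=True)
    return weights @ V

# Example usage on random inputs.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = decomposed_attention(Q, K, V, p=1.5)  # shape (4, 8)
```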