Efficient Kernel Transfer in Knowledge Distillation
Knowledge distillation is an effective way for model compression in deep learning. Given a large model (i.e., teacher model), it aims to improve the performance of a compact model (i.e., student model) by transferring the information from the teacher. An essential challenge in knowledge distillation is to identify the appropriate information to transfer. In early works, only the final output of the teacher model is used as the soft label to help the training of student models. Recently, the information from intermediate layers is also adopted for better distillation. In this work, we aim to optimize the process of knowledge distillation from the perspective of kernel matrix. The output of each layer in a neural network can be considered as a new feature space generated by applying a kernel function on original images. Hence, we propose to transfer the corresponding kernel matrix (i.e., Gram matrix) from teacher models to student models for distillation. However, the size of the whole kernel matrix is quadratic to the number of examples. To improve the efficiency, we decompose the original kernel matrix with Nyström method and then transfer the partial matrix obtained with landmark points, whose size is linear in the number of examples. More importantly, our theoretical analysis shows that the difference between the original kernel matrices of teacher and student can be well bounded by that of their corresponding partial matrices. Finally, a new strategy of generating appropriate landmark points is proposed for better distillation. The empirical study on benchmark data sets demonstrates the effectiveness of the proposed algorithm. Code will be released.
READ FULL TEXT