Convergence Analysis of Over-parameterized Deep Linear Networks, and the Principal Components Bias
Convolutional neural networks of different architectures seem to learn to classify images in the same order. To understand this phenomenon, we revisit the over-parameterized deep linear network model. Our analysis of this model's learning dynamics reveals that the convergence rate of its parameters is exponentially faster along directions corresponding to the larger principal components of the data, at a rate governed by the corresponding singular values. We term this convergence pattern the Principal Components bias (PC-bias). We show how the PC-bias streamlines the order of learning of both linear and non-linear networks, more prominently in the earlier stages of learning. We then compare our results to the spectral bias, showing that the two biases can be observed independently and affect the order of learning in different ways. Finally, we discuss how the PC-bias can explain several phenomena, including the benefits of prevalent initialization schemes, how early stopping may be related to PCA, and why deep networks converge more slowly when given random labels.
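A minimal sketch (not the authors' code) illustrating the PC-bias described above: gradient descent on a two-factor (deep linear) student fitting a random linear teacher. The data covariance, learning rate, and dimensions below are assumptions chosen for illustration; the error of the end-to-end map is expected to shrink much faster along the leading principal components of the data.

import numpy as np

rng = np.random.default_rng(0)
n, d, hidden = 2000, 5, 64

# Anisotropic data: per-coordinate standard deviations act as the principal
# component scales (assumed values, largest first).
scales = np.array([3.0, 2.0, 1.0, 0.5, 0.25])
X = rng.normal(size=(n, d)) * scales           # rows are samples
W_true = rng.normal(size=(d, d))               # linear "teacher"
Y = X @ W_true.T

# Over-parameterized deep linear student: end-to-end map M = W2 @ W1.
W1 = rng.normal(size=(hidden, d)) / np.sqrt(hidden)
W2 = rng.normal(size=(d, hidden)) / np.sqrt(hidden)

# Principal directions of the data (rows of Vt, sorted by variance).
_, _, Vt = np.linalg.svd(X, full_matrices=False)

lr = 5e-3
for step in range(1001):
    M = W2 @ W1
    P = X @ M.T - Y                            # residuals, shape (n, d)
    grad_M = P.T @ X / n                       # gradient w.r.t. end-to-end map
    gW2 = grad_M @ W1.T                        # chain rule through M = W2 @ W1
    gW1 = W2.T @ grad_M
    W2 -= lr * gW2
    W1 -= lr * gW1
    if step % 200 == 0:
        # Error of the end-to-end map projected onto each principal direction:
        # leading components should converge first.
        err = np.linalg.norm((W2 @ W1 - W_true) @ Vt.T, axis=0)
        print(step, np.round(err, 3))

Printing the per-component errors during training shows the leading directions collapsing toward zero long before the trailing ones, which is the convergence pattern the abstract refers to as the PC-bias.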