Linear Range in Gradient Descent
This paper defines the linear range as the range of parameter perturbations that lead to approximately linear perturbations in the states of a network. We compute the linear range by comparing the actual perturbations in states with the tangent solution of the network. The linear range offers a new criterion for gradients to be meaningful, and thus has many possible applications. In particular, we propose that the optimal learning rate at the start of training can be found automatically, by selecting a step size such that all minibatches remain within the linear range. We demonstrate our algorithm on a network with a canonical architecture and on a ResNet.
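The abstract does not spell out the procedure, so the following is only a minimal sketch of the idea it describes: compare the actual change in states under a parameter perturbation with the tangent (first-order) prediction, and shrink the step size until they agree. The toy network, the finite-difference approximation of the tangent solution, the relative tolerance `tol`, and the step-halving search are all illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def forward(theta, x):
    """Tiny two-layer tanh network with weights packed in a flat vector (illustrative)."""
    W1 = theta[:8].reshape(4, 2)
    W2 = theta[8:12].reshape(1, 4)
    return W2 @ np.tanh(W1 @ x)

def tangent(theta, x, delta, eps=1e-6):
    """Directional derivative J @ delta via central differences: a stand-in
    for the tangent solution the state perturbations are compared against."""
    return (forward(theta + eps * delta, x) - forward(theta - eps * delta, x)) / (2 * eps)

def in_linear_range(theta, x, delta, tol=0.1):
    """A perturbation delta is 'within linear range' if the actual change in
    states stays close to the tangent (first-order) prediction."""
    actual = forward(theta + delta, x) - forward(theta, x)
    predicted = tangent(theta, x, delta)
    return np.linalg.norm(actual - predicted) <= tol * np.linalg.norm(predicted)

# Halve a trial learning rate until the induced parameter perturbation
# (a random descent direction standing in for a minibatch gradient)
# keeps the states within the linear range.
rng = np.random.default_rng(0)
theta = rng.normal(size=12)
x = rng.normal(size=(2, 1))
g = rng.normal(size=12)  # placeholder for a minibatch gradient
lr = 1.0
while not in_linear_range(theta, x, -lr * g):
    lr /= 2
print(f"largest tested learning rate within linear range: {lr}")
```

In this sketch the deviation from the tangent prediction shrinks faster than the prediction itself as the step size decreases, so the halving loop always terminates; in practice the check would be run over all minibatches, as the abstract proposes.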