Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology
A recent line of research has provided convergence guarantees for gradient descent algorithms in the excessive over-parameterization regime, where the widths of all the hidden layers are required to be polynomially large in the number of training samples. However, the widths of practical deep networks are often large only in the first layer(s) and then start to decrease towards the output layer. This raises the interesting open question of whether similar results also hold in this empirically relevant setting. Existing theoretical insights suggest that the loss surface of this class of networks is well-behaved, but these results usually do not provide direct algorithmic guarantees for optimization. In this paper, we close the gap by showing that one wide layer followed by a pyramidal deep network topology suffices for gradient descent to find a global minimum at a geometric rate. Our proof is based on a weak form of the Polyak-Łojasiewicz inequality, which holds for deep pyramidal networks on the manifold of full-rank weight matrices.
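For context, here is a minimal sketch of how a Polyak-Łojasiewicz (PL) type condition yields a geometric rate for gradient descent. It uses the standard (strong) PL inequality with generic, assumed constants (smoothness constant $\beta$, PL constant $\mu$, step size $\eta \le 1/\beta$), rather than the weaker form on the manifold of full-rank weight matrices established in the paper. If the loss $L$ is $\beta$-smooth and satisfies $\|\nabla L(\theta)\|^2 \ge 2\mu\,(L(\theta) - L^*)$, then for the gradient descent update $\theta_{t+1} = \theta_t - \eta\,\nabla L(\theta_t)$,
\begin{align*}
L(\theta_{t+1}) - L^*
  &\le L(\theta_t) - L^* - \eta\,\|\nabla L(\theta_t)\|^2 + \tfrac{\beta\eta^2}{2}\,\|\nabla L(\theta_t)\|^2 \\
  &\le L(\theta_t) - L^* - \tfrac{\eta}{2}\,\|\nabla L(\theta_t)\|^2 \\
  &\le (1 - \eta\mu)\,\bigl(L(\theta_t) - L^*\bigr),
\end{align*}
where the first line is $\beta$-smoothness, the second uses $\eta \le 1/\beta$, and the third applies the PL inequality. Iterating gives $L(\theta_t) - L^* \le (1 - \eta\mu)^t\,\bigl(L(\theta_0) - L^*\bigr)$, i.e. a geometric rate.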