On variation of gradients of deep neural networks
We provide a theoretical explanation of the role of the number of nodes at each layer in deep neural networks. We prove that the largest variation of a deep neural network with ReLU activation function arises when the layer with the fewest nodes changes its activation pattern. An important implication is that deep neural network is a useful tool to generate functions most of whose variations are concentrated on a smaller area of the input space near the boundaries corresponding to the layer with the fewest nodes. In turn, this property makes the function more invariant to input transformation. That is, our theoretical result gives a clue about how to design the architecture of a deep neural network to increase complexity and transformation invariancy simultaneously.
READ FULL TEXT