Smaller generalization error derived for deep compared to shallow residual neural networks
Estimates of the generalization error are proved for a residual neural network with $L$ random Fourier features layers $\bar z_{\ell+1} = \bar z_\ell + \mathrm{Re}\sum_{k=1}^K \bar b_{\ell k}\, e^{\mathrm{i}\omega_{\ell k}\bar z_\ell} + \mathrm{Re}\sum_{k=1}^K \bar c_{\ell k}\, e^{\mathrm{i}\omega'_{\ell k}\cdot x}$. An optimal distribution for the frequencies $(\omega_{\ell k},\omega'_{\ell k})$ of the random Fourier features $e^{\mathrm{i}\omega_{\ell k}\bar z_\ell}$ and $e^{\mathrm{i}\omega'_{\ell k}\cdot x}$ is derived. The derivation is based on the corresponding generalization error for approximating function values $f(x)$. This generalization error turns out to be smaller than the estimate $\|\hat f\|^2_{L^1(\mathbb{R}^d)}/(LK)$ of the generalization error for random Fourier features with one hidden layer and the same total number of nodes $LK$, in the case where the $L^\infty$-norm of $f$ is much less than the $L^1$-norm of its Fourier transform $\hat f$. This understanding of an optimal distribution for random features is used to construct a new training method for a deep residual network that shows promising results.
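As a rough illustration of the layer update above, here is a minimal NumPy sketch of a forward pass through $L$ random Fourier features residual layers. The variable names and the i.i.d. Gaussian frequency sampling are assumptions for illustration only; the paper derives an optimal (non-Gaussian) frequency distribution and a corresponding training method, neither of which is reproduced here.

```python
import numpy as np

def rff_resnet_forward(x, L=4, K=32, seed=None):
    """Sketch of z_{l+1} = z_l + Re sum_k b_{lk} e^{i w_{lk} z_l}
                              + Re sum_k c_{lk} e^{i w'_{lk} . x}.

    Assumptions: scalar hidden state z, Gaussian frequencies, and
    random complex amplitudes scaled by 1/(L*K); the paper instead
    optimizes the frequency distribution and trains the amplitudes.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    z = 0.0  # initial hidden state z_0
    for _ in range(L):
        omega = rng.standard_normal(K)          # frequencies for e^{i w z}
        omega_x = rng.standard_normal((K, d))   # frequencies for e^{i w'.x}
        b = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / (L * K)
        c = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / (L * K)
        z = (z
             + np.sum(b * np.exp(1j * omega * z)).real
             + np.sum(c * np.exp(1j * (omega_x @ x))).real)
    return z

# Example: evaluate the network on a random input x in R^5
x = np.random.default_rng(0).standard_normal(5)
print(rff_resnet_forward(x, L=4, K=32, seed=1))
```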