On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks

08/10/2018
by   Fangyu Zou, et al.

Adaptive stochastic gradient descent methods such as AdaGrad, Adam, AdaDelta, Nadam, and AMSGrad have proven effective for non-convex stochastic optimization, notably for training deep neural networks. However, their convergence rates in the non-convex stochastic setting remain largely unexplored, apart from recent breakthrough results on AdaGrad [ward2018adagrad] and perturbed AdaGrad [li2018convergence]. In this paper, we propose two new adaptive stochastic gradient methods, AdaHB and AdaNAG, which integrate coordinate-wise AdaGrad with heavy-ball momentum and Nesterov accelerated gradient momentum, respectively. We jointly characterize the O(log T/√T) non-asymptotic convergence rates of AdaHB and AdaNAG in the non-convex stochastic setting by leveraging a newly developed unified formulation of the two momentum mechanisms. In particular, when the momentum term vanishes, we obtain the convergence rate of coordinate-wise AdaGrad in the non-convex stochastic setting as a byproduct.
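To illustrate the kind of update rule the abstract describes, here is a minimal sketch of coordinate-wise AdaGrad combined with heavy-ball momentum. It is a hypothetical illustration, not the authors' AdaHB algorithm: the function name `adahb_step`, the hyperparameters (lr, beta, eps), and the state layout are all assumptions introduced for this example.

```python
import numpy as np

def adahb_step(theta, grad, state, lr=0.01, beta=0.9, eps=1e-8):
    """One step of coordinate-wise AdaGrad with heavy-ball momentum.

    Hypothetical sketch of the general idea, not the paper's exact method:
    a per-coordinate AdaGrad accumulator rescales the stochastic gradient,
    and a heavy-ball buffer adds momentum to the rescaled step.
    """
    # Accumulate squared gradients coordinate-wise (AdaGrad accumulator).
    state["sq_sum"] += grad ** 2
    # Per-coordinate adaptive rescaling of the gradient.
    adapted_grad = grad / (np.sqrt(state["sq_sum"]) + eps)
    # Heavy-ball momentum: mix the previous direction with the adapted gradient.
    state["momentum"] = beta * state["momentum"] + lr * adapted_grad
    return theta - state["momentum"]

# Toy usage: minimize f(x) = ||x||^2 with noisy gradients.
rng = np.random.default_rng(0)
theta = rng.normal(size=5)
state = {"sq_sum": np.zeros(5), "momentum": np.zeros(5)}
for _ in range(500):
    grad = 2 * theta + 0.01 * rng.normal(size=5)  # stochastic gradient
    theta = adahb_step(theta, grad, state)
print(theta)  # iterates settle near the origin
```

An AdaNAG-style variant would instead evaluate the gradient at a look-ahead point in the Nesterov fashion; the paper's unified formulation treats both momentum mechanisms within a single analysis.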
