Faster Convergence of Local SGD for Over-Parameterized Models
Modern machine learning architectures are often highly expressive. They are usually over-parameterized and can interpolate the data by driving the empirical loss close to zero. We analyze the convergence of Local SGD (or FedAvg) for such over-parameterized models in the heterogeneous data setting and improve upon the existing literature by establishing the following convergence rates. We show an error bound of O(exp(-T)) for strongly convex loss functions, where T is the total number of iterations. For general convex loss functions, we establish an error bound of O(1/T) under a mild data similarity assumption and an error bound of O(K/T) otherwise, where K is the number of local steps. We also extend our results to non-convex loss functions by proving an error bound of O(K/T). Before our work, the best-known convergence rate for strongly convex loss functions was O(exp(-T/K)), and none existed for general convex or non-convex loss functions in the over-parameterized setting. We complement our results by providing problem instances in which these convergence rates are tight up to a constant factor under a reasonably small stepsize scheme. Finally, we validate our theoretical results using numerical experiments on real and synthetic data.
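To make the setting concrete, the following is a minimal sketch of Local SGD / FedAvg on a synthetic interpolating least-squares problem: each of M clients runs K local SGD steps between communication rounds, and the server averages the local models. The quadratic losses, the shared interpolating solution w_star, and all parameter values here are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Hypothetical per-client least-squares losses constructed so that a single
# parameter vector w_star interpolates every client's data (empirical loss 0),
# mimicking the over-parameterized / interpolation regime described above.
rng = np.random.default_rng(0)
num_clients, dim, n_per_client = 4, 10, 20
X = [rng.standard_normal((n_per_client, dim)) for _ in range(num_clients)]
w_star = rng.standard_normal(dim)            # shared interpolating solution
y = [A @ w_star for A in X]                  # every client is fit exactly by w_star

def stochastic_grad(m, w):
    """Gradient of 0.5 * (x_i^T w - y_i)^2 on one sample drawn from client m."""
    i = rng.integers(n_per_client)
    x = X[m][i]
    return (x @ w - y[m][i]) * x

def local_sgd(T, K, lr):
    """Local SGD / FedAvg: K local SGD steps per client, then server averaging."""
    w = np.zeros(dim)                        # common initial model
    for _ in range(T // K):                  # communication rounds
        local_models = []
        for m in range(num_clients):
            w_m = w.copy()
            for _ in range(K):               # K local steps without communication
                w_m -= lr * stochastic_grad(m, w_m)
            local_models.append(w_m)
        w = np.mean(local_models, axis=0)    # average the local models
    return w

w_hat = local_sgd(T=5000, K=10, lr=0.02)
print("distance to interpolating solution:", np.linalg.norm(w_hat - w_star))
```

Because the data are interpolated, the stochastic gradient noise vanishes at the optimum, which is the structural property that the paper exploits to obtain the faster rates stated above.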