Making Byzantine Decentralized Learning Efficient
Decentralized-SGD (D-SGD) distributes heavy learning tasks across multiple machines (a.k.a., nodes), effectively dividing the workload per node by the size of the system. However, a handful of Byzantine (i.e., misbehaving) nodes can jeopardize the entire learning procedure. This vulnerability is further amplified when the system is asynchronous. Although approaches that confer Byzantine resilience to D-SGD have been proposed, these significantly impact the efficiency of the process, to the point of even negating the benefit of decentralization. This naturally raises the question: can decentralized learning simultaneously enjoy Byzantine resilience and reduced workload per node? We answer positively by proposing an algorithm that ensures Byzantine resilience without losing the computational efficiency of D-SGD. Essentially, it weakens the impact of Byzantine nodes by reducing the variance in local updates using Polyak's momentum. Then, by establishing coordination between nodes via signed echo broadcast and a nearest-neighbor averaging scheme, we effectively tolerate Byzantine nodes whilst distributing the overhead amongst the non-Byzantine nodes. To demonstrate the correctness of our algorithm, we introduce and analyze a novel Lyapunov function that accounts for the non-Markovian model drift arising from the use of momentum. We also demonstrate the efficiency of our algorithm through experiments on several image classification tasks.
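To make the two ingredients named above concrete, here is a minimal, hypothetical sketch of (i) Polyak's momentum applied to local stochastic gradients and (ii) a nearest-neighbor averaging rule that keeps only the vectors closest to a node's own momentum before averaging. The specific filtering rule, constants, and function names are illustrative assumptions, and the signed echo broadcast layer is omitted; this is not the authors' implementation.

```python
# Hypothetical sketch of momentum-based local updates with
# nearest-neighbor averaging; not the paper's actual algorithm.
import numpy as np

def local_momentum(momentum, gradient, beta=0.9):
    """Polyak's momentum: exponential moving average of stochastic gradients."""
    return beta * momentum + (1.0 - beta) * gradient

def nearest_neighbor_average(own_vector, received_vectors, n_byzantine):
    """Assumed filtering rule: discard the n_byzantine vectors farthest from
    the node's own momentum, then average the rest together with it."""
    received = np.stack(received_vectors)
    dists = np.linalg.norm(received - own_vector, axis=1)
    keep = np.argsort(dists)[: len(received_vectors) - n_byzantine]
    return np.mean(np.concatenate([received[keep], own_vector[None]]), axis=0)

def honest_step(params, momentum, gradient, neighbor_momentums, lr=0.1, f=1):
    """One hypothetical step for an honest node tolerating up to f Byzantine neighbors."""
    momentum = local_momentum(momentum, gradient)
    aggregated = nearest_neighbor_average(momentum, neighbor_momentums, f)
    return params - lr * aggregated, momentum
```

The intuition reflected in the sketch is that momentum shrinks the variance of what honest nodes exchange, which in turn makes distance-based filtering of neighbors' contributions more reliable against Byzantine inputs.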