Optimizing the Transition Waste in Coded Elastic Computing
Distributed computing, in which a resource-intensive task is divided into subtasks and distributed among different machines, plays a key role in solving large-scale problems, e.g., machine learning on large datasets or massive computational problems arising in genomic research. Coded computing is an emerging paradigm in which redundancy is introduced into distributed computing to alleviate the impact of slow machines, or stragglers, on the completion time. Motivated by services recently made available in the cloud computing industry, e.g., EC2 Spot or Azure Batch, where spare/low-priority virtual machines are offered at a fraction of the price of on-demand instances but can be preempted on short notice, we investigate coded computing solutions over elastic resources, where the set of available machines may change in the middle of the computation. Our contributions are twofold. First, we propose an efficient method to minimize the transition waste, a newly introduced concept quantifying the total number of tasks that existing machines have to abandon or take on anew when a machine joins or leaves, for the cyclic elastic task allocation scheme recently proposed in the literature (Yang et al., ISIT'19). Second, we generalize that scheme and introduce new task allocation schemes based on finite geometry that achieve zero transition waste as long as the number of active machines varies within a fixed range. The proposed solutions can be applied on top of any existing coded computing scheme that tolerates stragglers.
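To make the quantity behind the transition waste concrete, the following is a minimal sketch that counts, for the machines present both before and after a change, how many tasks each one abandons or takes on anew. The function name `task_changes`, the dictionary representation of an allocation, and the toy allocations are illustrative assumptions, not taken from the paper; the paper's formal definition may refine this raw count, for instance by discounting changes that no allocation can avoid.

```python
def task_changes(before, after):
    """Count tasks abandoned or newly taken by machines present in both allocations.

    `before` and `after` map a machine name to the set of task indices it holds.
    This is a hypothetical helper for illustration only.
    """
    total = 0
    # Only machines that exist before and after the change are counted.
    for machine in before.keys() & after.keys():
        # Symmetric difference = tasks abandoned plus tasks taken on anew.
        total += len(before[machine] ^ after[machine])
    return total


# Toy example (assumed, not from the paper): machine C is preempted.
before = {"A": {0, 1}, "B": {2, 3}, "C": {4, 5}}

# Reallocation 1: A and B keep their tasks and split C's tasks.
after_small = {"A": {0, 1, 4}, "B": {2, 3, 5}}

# Reallocation 2: same loads, but A also hands task 1 over to B.
after_large = {"A": {0, 4, 5}, "B": {1, 2, 3}}

print(task_changes(before, after_small))  # 2
print(task_changes(before, after_large))  # 4
```

Both reallocations leave every task covered with the same per-machine load, yet the second forces twice as many changes on the surviving machines; avoidable churn of this kind is what a transition-aware allocation aims to eliminate.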