Asymptotic efficiency of restart and checkpointing

02/21/2018

Many tasks are subject to failure before completion. Two of the most common failure recovery strategies are restart and checkpointing. Under restart, once a failure occurs, the task is restarted from the beginning. Under checkpointing, the task is resumed from the preceding checkpoint after the failure. We study the asymptotic efficiency of restart for an infinite sequence of tasks whose sizes form a stationary sequence. We define asymptotic efficiency as the limit of the ratio of the total time to completion in the absence of failures to the total time to completion when failures take place. Whether the asymptotic efficiency is positive depends on how the tail of the task-size distribution compares with the tail of the distribution of the random variables governing failures. Our framework allows for variations in the failure rates and dependencies between task sizes. We also study a similar notion of asymptotic efficiency for checkpointing when the task is infinite a.s. and the inter-checkpoint times are i.i.d. Moreover, in checkpointing, when the failures are exponentially distributed, we prove the existence of an infinite sequence of universal checkpoints, which are always used whenever the system starts from any checkpoint that precedes them.
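The efficiency ratio defined above can be estimated numerically. The following is a minimal simulation sketch, not the paper's method: it assumes exponentially distributed times to failure, finite tasks, evenly spaced checkpoints, and no checkpointing overhead. The function names (`restart_time`, `checkpoint_time`, `empirical_efficiency`) and all parameter values are hypothetical choices made for illustration.

```python
import random


def restart_time(task_size, failure_rate, rng):
    """Time to finish one task when every failure restarts it from the beginning.

    Assumption: times to failure are i.i.d. exponential with rate `failure_rate`.
    """
    elapsed = 0.0
    while True:
        time_to_failure = rng.expovariate(failure_rate)
        if time_to_failure >= task_size:
            # The attempt runs to completion before the next failure.
            return elapsed + task_size
        # The attempt fails; the work done so far is lost.
        elapsed += time_to_failure


def checkpoint_time(task_size, interval, failure_rate, rng):
    """Time to finish one task that resumes from the preceding checkpoint.

    Assumption: checkpoints are evenly spaced every `interval` units of work
    and cost nothing, so each segment behaves like a small restart task.
    """
    elapsed = 0.0
    remaining = task_size
    while remaining > 0:
        segment = min(interval, remaining)
        elapsed += restart_time(segment, failure_rate, rng)
        remaining -= segment
    return elapsed


def empirical_efficiency(task_sizes, completion_times):
    """Failure-free total time divided by total time with failures."""
    return sum(task_sizes) / sum(completion_times)


if __name__ == "__main__":
    rng = random.Random(0)
    sizes = [1.0] * 20000  # hypothetical workload: unit-size tasks
    with_restart = [restart_time(s, 1.0, rng) for s in sizes]
    with_checkpoints = [checkpoint_time(s, 0.25, 1.0, rng) for s in sizes]
    print("restart efficiency:   ", empirical_efficiency(sizes, with_restart))
    print("checkpoint efficiency:", empirical_efficiency(sizes, with_checkpoints))
```

With bounded task sizes, as in this sketch, the estimate settles at a positive value. The regime in which efficiency vanishes involves task-size tails that are heavy relative to the failure distribution, which this example does not attempt to reproduce.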
