Checkpointing and Localized Recovery for Nested Fork-Join Programs

02/25/2021
by Claudia Fohry, et al.

While checkpointing is typically combined with a restart of the whole application, localized recovery permits all but the affected processes to continue. In task-based cluster programming, for instance, the application can then be finished on the intact nodes, and the lost tasks reassigned. This extended abstract proposes adapting a checkpointing and localized recovery technique, originally developed for independent tasks, to nested fork-join programs. We consider a Cilk-like work-stealing scheme with a work-first policy in a distributed-memory setting and describe the required algorithmic changes. The original technique has checkpointing overheads below 1%, and similar performance is expected for the adapted technique.
