Parallel Programming Laboratory

The recovery and rise of checkpoint/restart

Laxmikant Kale

Joint Laboratory for Petascale Computing Workshop (JLPC) 2012

Publication Type: Talk

Summary

Checkpoint/restart has often been considered a protocol that will not scale to large machines. The claim was made first for petascale machines and later for exascale machines, but the idea has proved more resilient than the claims. Within the US HPC community, and especially at the DOE, multiple researchers have started defending the scalability of checkpoint/restart schemes. In this talk, I will show some results in scalable checkpoint/restart and examine the reasons for its scalability. I will describe my viewpoint on where different categories of protocols, such as message logging, causal, and coordinated checkpoint/restart, will work better, and on the likely scenarios at exascale. The talk is aimed at spurring controversy and discussion. Although we will touch upon soft faults, the main focus of this talk will be fail-stop faults.

People

Laxmikant Kale
