A Scalable Double In-memory Checkpoint and Restart Scheme towards Exascale
Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS) 2012
Publication Type: Paper
Repository URL: papers/201202_FTXS
Abstract
As the size of supercomputers increases, the probability of system failure
grows substantially, posing an increasingly significant challenge for
scalability. It is important to provide resilience for long-running
applications. Checkpoint-based fault tolerance methods are effective
approaches to dealing with faults. With these methods, the state of the entire
parallel application is checkpointed to reliable storage. When a failure occurs,
the application is restarted from a recent checkpoint.
In previous work, we have demonstrated an efficient double in-memory checkpoint
and restart fault tolerance scheme, which leverages Charm++'s parallel objects
for checkpointing. In this paper, we further optimize the scheme by eliminating
several bottlenecks caused by serialized communication. We extend the in-memory
checkpointing scheme to work on the MPI communication layer, and demonstrate its
performance on very large-scale supercomputers. For example, when running a
million-atom molecular dynamics simulation on up to 64K cores of a BlueGene/P
machine, the checkpoint time was in milliseconds. The restart times were
measured to be less than 0.15 seconds on 64K cores.
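
The idea behind double in-memory checkpointing can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the Charm++ implementation: it assumes each node keeps one copy of its checkpoint in its own memory and a second copy in the memory of a "buddy" node, here taken to be the next rank in a ring. The names Node, buddy_of, checkpoint, and restart are hypothetical and chosen for illustration; the abstract itself does not specify these details.

```python
import copy


class Node:
    """Simulated node: live state plus two in-memory checkpoint slots."""

    def __init__(self, rank, state):
        self.rank = rank
        self.state = state
        self.local_ckpt = None  # copy of this node's own state
        self.buddy_ckpt = None  # copy of another node's state, held for it


def buddy_of(rank, n):
    # Assumption: the buddy is the next rank in a ring.
    return (rank + 1) % n


def checkpoint(nodes):
    """Store every node's state twice: locally and in its buddy's memory."""
    n = len(nodes)
    for node in nodes:
        node.local_ckpt = copy.deepcopy(node.state)
        nodes[buddy_of(node.rank, n)].buddy_ckpt = copy.deepcopy(node.state)


def restart(nodes, failed_rank):
    """Recover from the loss of one node using the surviving copies."""
    n = len(nodes)
    # The failed node's memory is gone; its state survives on its buddy.
    buddy = nodes[buddy_of(failed_rank, n)]
    nodes[failed_rank] = Node(failed_rank, copy.deepcopy(buddy.buddy_ckpt))
    # Surviving nodes roll back to their own local checkpoints.
    for node in nodes:
        if node.rank != failed_rank:
            node.state = copy.deepcopy(node.local_ckpt)
    # Re-create both copies everywhere so a later failure is also survivable.
    checkpoint(nodes)


if __name__ == "__main__":
    nodes = [Node(r, {"step": 0, "data": [r]}) for r in range(4)]
    checkpoint(nodes)
    for node in nodes:
        node.state["step"] = 5  # progress made after the checkpoint
    restart(nodes, failed_rank=2)
    print([node.state for node in nodes])
```

In this sketch no disk or parallel file system is involved in either path, which is the property that makes in-memory checkpoint and restart fast; the cost is that a simultaneous failure of a node and its buddy cannot be recovered from.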
TextRef
Gengbin Zheng, Xiang Ni and Laxmikant V. Kale, A Scalable Double In-memory Checkpoint and Restart Scheme towards Exascale, Proceedings of the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2012), Boston, USA