A Scalable Double In-memory Checkpoint and Restart Scheme towards Exascale
PPL Technical Report 2012
Publication Type: Paper
Repository URL: papers/201201_FTChkp
Abstract
As the size of supercomputers multiplies, the probability of system failure
grows substantially, posing an increasingly significant challenge for
scalability. It is important to provide resilience for long-running
applications. Checkpoint-based fault tolerance methods are an effective
way to deal with faults. With these methods, the state of the entire
parallel application is checkpointed to reliable storage. When a fault occurs,
the application is restarted from a recent checkpoint.
In previous work, we have demonstrated an efficient double in-memory checkpoint
and restart fault tolerance scheme, which leverages Charm++'s parallel objects
for checkpointing. In this paper, we further optimize the scheme by eliminating
sequential bottlenecks caused by serialized communication. We extend the
in-memory checkpointing scheme to work on the MPI communication layer and
demonstrate its performance on very large scale supercomputers. For a
million-atom molecular dynamics simulation on up to 32k cores, the checkpoint
times were in milliseconds. Even for large-memory benchmarks, such as a 5-point
stencil with 128 MB per core at checkpoint, the checkpoint time increased
to about 1.4 seconds on 16k cores. The restart times were measured to be less
than 0.4 seconds on 32k cores.
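
The abstract does not spell out the mechanism, so the following is only a rough illustration of the general idea behind double in-memory checkpointing: each processor keeps one copy of its checkpoint in its own memory and a second copy in the memory of a "buddy" processor, so that the state of a single failed processor can be recovered without touching disk. The sketch simulates this in one Python process. The names `Node`, `buddy_of`, `checkpoint`, and `restart`, as well as the ring-style buddy assignment, are assumptions made for illustration; they are not the Charm++ API or the paper's implementation.

```python
import copy


class Node:
    """One simulated processor (hypothetical model, not a Charm++ type)."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {}         # live application state
        self.local_ckpt = None  # checkpoint of this node's own state
        self.buddy_ckpt = None  # checkpoint held on behalf of another node


def buddy_of(rank, n):
    """Assumed pairing: each node's buddy is the next rank on a ring."""
    return (rank + 1) % n


def checkpoint(nodes):
    """Store every node's state twice: locally and in its buddy's memory."""
    n = len(nodes)
    for node in nodes:
        snapshot = copy.deepcopy(node.state)
        node.local_ckpt = snapshot
        nodes[buddy_of(node.rank, n)].buddy_ckpt = copy.deepcopy(snapshot)


def restart(nodes, failed_rank):
    """Recover from the loss of one node and roll everyone back."""
    n = len(nodes)
    # The failed node's memory is gone; model that with a fresh, empty node.
    fresh = Node(failed_rank)
    nodes[failed_rank] = fresh

    # The buddy still holds the second copy of the failed node's checkpoint.
    saved = nodes[buddy_of(failed_rank, n)].buddy_ckpt
    fresh.state = copy.deepcopy(saved)
    fresh.local_ckpt = copy.deepcopy(saved)

    # The replacement must again hold the copy for the node it is buddy to.
    predecessor = nodes[(failed_rank - 1) % n]
    fresh.buddy_ckpt = copy.deepcopy(predecessor.local_ckpt)

    # Surviving nodes roll back to their own local checkpoints.
    for node in nodes:
        if node.rank != failed_rank:
            node.state = copy.deepcopy(node.local_ckpt)
```

In this model, calling `checkpoint(nodes)` followed later by `restart(nodes, failed_rank)` returns every node to the state it had at the checkpoint, provided there are at least two nodes and a node and its buddy do not fail together. The loops here run sequentially for simplicity; in a real parallel runtime the copies would be exchanged concurrently between processors.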