System Support for Checkpoint/Restart of Charm++ and AMPI Applications
Thesis 2004
Publication Type: MS Thesis
Abstract
As both modern supercomputers and new-generation scientific
computing applications grow in size and complexity, the probability
of system failure rises commensurately. Making parallel computing
fault tolerant has become an increasingly important issue.
A checkpoint/restart mechanism provides fault tolerance
as well as other benefits for parallel programmers. This
thesis describes the motivation, design, and implementation of the
On-Disk Checkpoint/Restart Mechanism for the Charm++ and Adaptive MPI
(AMPI) programming framework. This mechanism has proven useful
in practice and can also be extended to implement other
fault-tolerance techniques.
TextRef
Chao Huang, "System Support for Checkpoint and Restart of Charm++ and
AMPI Applications", Dept. of Computer Science, University of Illinois, 2004.