A Fault Tolerance Protocol with Fast Fault Recovery
IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2007
Publication Type: Paper
Abstract
Large machines with tens or even hundreds of thousands of
processors are currently in use. Fault tolerance is an important
issue for these and the even larger machines of the future.
Checkpoint-based methods, currently used on most machines, roll back
all processors to their previous checkpoints after a crash. This wastes a
significant amount of computation, as all processors have to redo
all the computation from that checkpoint onwards. In addition,
recovery time in checkpoint-based fault tolerance protocols is
bounded by the time between the last checkpoint and the crash.
Protocols based on message logging avoid the problem of rolling
back all processors to their earlier state. However, the recovery
time of existing message logging protocols is no smaller than the
time between the last checkpoint and the crash. In this paper, we
present a fault tolerance protocol that provides fast restarts by
using the ideas of message logging and processor virtualization. We
evaluate our implementation of the protocol in the Charm++/Adaptive
MPI runtime system. We show that our protocol not only provides
fast restarts but also has low fault-free overhead for many
applications.
TextRef
Sayantan Chakravorty, Laxmikant V. Kale, "A Fault Tolerance Protocol with
Fast Fault Recovery", Parallel Programming Laboratory, Department of Computer
Science, University of Illinois at Urbana-Champaign, 2006.