Proactive Fault Tolerance in Large Systems
Workshop on High Performance Computing Reliability Issues at HPCA (HPCRI) 2005
Publication Type: Paper
Abstract
High-performance systems with thousands of processors have been
introduced in the recent past, and systems with hundreds of
thousands of processors should become available in the near future.
Since failures are likely to be frequent in such systems, schemes
for dealing with faults are important. In this paper, we introduce
a new fault tolerance solution for parallel applications that
proactively migrates execution from a processor where a failure is
imminent. Our approach assumes that some failures are predictable,
and leverages the fact that current hardware devices contain
various features supporting early indication of faults. By using
the concepts of processor virtualization in Charm++ and Adaptive
MPI (AMPI), we describe a mechanism that migrates objects when a
failure is expected to arise in a given processor, without
requiring spare processors. After the objects are migrated and a
load balancing scheme is applied, execution of the MPI application
can proceed efficiently on the remaining processors. We modify the
implementation of collective operations, such as reductions, so
that they continue to operate efficiently even after a processor is
evacuated and crashes. To demonstrate the feasibility of our
approach, we present preliminary performance data.
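
The evacuation idea described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the Charm++/AMPI implementation: the names `evacuate` and `reduction_tree`, the greedy least-loaded placement, and the binary reduction tree rebuilt over the surviving processors are all hypothetical choices made for illustration. The abstract does not specify the load balancing strategy or how the reduction tree is rearranged.

```python
import heapq


def evacuate(placement, loads, processors, failing):
    """Move every object off `failing` onto the surviving processors.

    placement:  dict mapping object id -> processor id
    loads:      dict mapping object id -> estimated load
    processors: list of all processor ids
    Returns a new placement; no spare processor is required.
    (Hypothetical helper; greedy least-loaded placement is an assumption.)
    """
    survivors = [p for p in processors if p != failing]
    if not survivors:
        raise ValueError("no surviving processor to receive objects")

    # Current load on each surviving processor.
    totals = {p: 0.0 for p in survivors}
    for obj, proc in placement.items():
        if proc != failing:
            totals[proc] += loads[obj]

    # Min-heap keyed on load, so the least-loaded survivor is popped first.
    heap = [(load, p) for p, load in totals.items()]
    heapq.heapify(heap)

    new_placement = dict(placement)
    # Place the heaviest evacuated objects first.
    evacuated = sorted(
        (o for o, p in placement.items() if p == failing),
        key=lambda o: loads[o],
        reverse=True,
    )
    for obj in evacuated:
        load, proc = heapq.heappop(heap)
        new_placement[obj] = proc
        heapq.heappush(heap, (load + loads[obj], proc))
    return new_placement


def reduction_tree(processors, failing):
    """Rebuild a binary reduction tree that skips the evacuated processor.

    Returns a dict mapping each surviving processor to its parent
    (None for the root). Rebuilding from scratch is an assumption made
    for simplicity.
    """
    survivors = [p for p in processors if p != failing]
    return {
        p: (survivors[(i - 1) // 2] if i > 0 else None)
        for i, p in enumerate(survivors)
    }


if __name__ == "__main__":
    placement = {"a": 0, "b": 0, "c": 1, "d": 2, "e": 3}
    loads = {"a": 4, "b": 1, "c": 2, "d": 3, "e": 1}
    print(evacuate(placement, loads, [0, 1, 2, 3], failing=0))
    print(reduction_tree([0, 1, 2, 3], failing=0))
```

In this toy run, processor 0 is predicted to fail, so its two objects are handed to whichever survivors are least loaded at that moment, and the reduction tree is rebuilt so that no survivor has processor 0 as its parent.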
TextRef
Sayantan Chakravorty, Celso Mendes and L. V. Kale,
"Proactive Fault Tolerance in Large Systems",
HPCRI Workshop in conjunction with HPCA 2005, 2005.