Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems
International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) 2012
Publication Type: Talk
Summary
An exascale machine is expected to be delivered in
the time frame 2018-2020. Such a machine will be able to tackle
some of the hardest computational problems and to extend our
understanding of Nature and the universe. However, to make
that a reality, the HPC community has to solve a few important
challenges. Resilience will become a prominent problem because
an exascale machine will experience frequent failures due to the
large number of components it will contain. Some form of fault
tolerance has to be incorporated in the system to maintain the
progress rate of applications as high as possible. At the same time, the
system will have to manage power more carefully.
Power matters in two ways. First, in a power-limited
environment, all the layers of the system have to adhere to
that limitation (including the fault tolerance layer). Second,
power will be relevant due to energy consumption: an exascale
installation will have to pay a large energy bill. It is fundamental
to increase our understanding of the energy profile of different
fault tolerance schemes. This paper presents an evaluation of
three different fault tolerance approaches: checkpoint/restart,
message logging, and parallel recovery. Using programs from
different programming models, we show that parallel recovery is the
most energy-efficient solution for an execution with failures. At
the same time, parallel recovery is able to finish the execution
faster than the other approaches. We explore the behavior of
these approaches at extreme scales using an analytical model.
At large scale, parallel recovery is predicted to reduce the total
execution time of an application by 17% and reduce the energy
consumption by 13% when compared to checkpoint/restart.
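
To make the idea of an analytical model concrete, the following is a minimal sketch of a textbook first-order model for checkpoint/restart only. It is not the model used in the paper: it assumes Young's approximation for the checkpoint interval, failures that on average lose half a checkpoint period, and a constant average power draw. All function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimal checkpoint interval (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def expected_time(work, checkpoint_cost, restart_cost, mtbf, interval):
    """First-order estimate of wall-clock time under periodic checkpointing.

    Assumption: each failure costs a restart plus, on average, half of one
    compute-and-checkpoint period of lost work.
    """
    # Failure-free time: useful work plus checkpointing overhead.
    base = work * (1.0 + checkpoint_cost / interval)
    # Average time lost per failure.
    loss_per_failure = restart_cost + (interval + checkpoint_cost) / 2.0
    # Solve T = base + (T / mtbf) * loss_per_failure for T.
    denom = 1.0 - loss_per_failure / mtbf
    if denom <= 0.0:
        raise ValueError("failures are too frequent for the job to make progress")
    return base / denom


def energy(time_seconds, avg_power_watts):
    """Energy in joules, assuming a constant average power draw (a simplification)."""
    return time_seconds * avg_power_watts


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    tau = young_interval(checkpoint_cost=50.0, mtbf=10_000.0)
    t = expected_time(work=100_000.0, checkpoint_cost=50.0,
                      restart_cost=100.0, mtbf=10_000.0, interval=tau)
    print(f"interval = {tau:.0f} s, time = {t:.0f} s, "
          f"energy = {energy(t, 1_000.0):.3e} J")
```

Comparing schemes in this style would require separate time and power terms for message logging and parallel recovery, which this sketch does not attempt.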