Clustering Message Passing Applications to Enhance Fault Tolerance Protocols
Joint Laboratory for Petascale Computing Workshop (JLPC) 2010
Publication Type: Talk
Summary
This talk describes an ongoing collaboration to find meaningful clusters in a parallel application based on its communication behavior. We first show the communication patterns of several MPI benchmarks and how standard graph partitioning techniques can group their ranks into subsets. For Charm++ applications, we describe the changes to the runtime system that let it find clusters dynamically, even in the presence of object migration. The cluster information is then used to improve two major message logging protocols for fault tolerance: in one, it reduces the memory overhead; in the other, it limits the number of processes that must roll back during recovery.
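To make the idea of grouping ranks by communication behavior concrete, here is a minimal sketch in Python. It is not the method used in the talk: it swaps in a simple greedy agglomerative heuristic as a stand-in for a standard graph partitioner, and the names `cluster_ranks`, `cut_volume`, the `comm` matrix, and the `max_size` bound are all hypothetical, chosen only for illustration.

```python
from itertools import combinations


def cluster_ranks(comm, max_size):
    """Group ranks by repeatedly merging the two clusters that exchange
    the most data, never letting a cluster exceed max_size ranks.

    comm[i][j] is an assumed byte count sent from rank i to rank j
    (illustrative input, not a format defined by the talk).
    """
    clusters = [{rank} for rank in range(len(comm))]

    def traffic(a, b):
        # Total bytes exchanged in both directions between two clusters.
        return sum(comm[i][j] + comm[j][i] for i in a for j in b)

    while True:
        best, best_weight = None, 0
        for a, b in combinations(range(len(clusters)), 2):
            if len(clusters[a]) + len(clusters[b]) > max_size:
                continue  # merging would violate the size bound
            weight = traffic(clusters[a], clusters[b])
            if weight > best_weight:
                best, best_weight = (a, b), weight
        if best is None:
            break  # no admissible pair communicates at all
        a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]  # b > a, so index a stays valid
    return sorted(sorted(c) for c in clusters)


def cut_volume(comm, clusters):
    """Bytes sent between ranks that ended up in different clusters."""
    owner = {rank: k for k, c in enumerate(clusters) for rank in c}
    n = len(comm)
    return sum(comm[i][j] for i in range(n) for j in range(n)
               if owner[i] != owner[j])


if __name__ == "__main__":
    # Four ranks: heavy traffic inside pairs (0,1) and (2,3), light across.
    comm = [
        [0, 100, 0, 0],
        [100, 0, 1, 0],
        [0, 1, 0, 100],
        [0, 0, 100, 0],
    ]
    clusters = cluster_ranks(comm, max_size=2)
    print(clusters, cut_volume(comm, clusters))
```

The cut volume is the quantity a graph partitioner typically tries to keep small: the less traffic crosses cluster boundaries, the more self-contained each cluster is. A production setting would more likely hand the communication graph to a dedicated partitioning library than use this quadratic heuristic.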