Adoption Protocols for Fanout-Optimal Fault-Tolerant Termination Detection
ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP) 2013
Publication Type: Talk
Summary
Termination detection signals the completion of a distributed operation, meaning that all processors are idle and no messages are in flight. Many such operations depend on it, including work-stealing algorithms, dynamic data exchange, and dynamically structured computations. As supercomputers grow and each job becomes more likely to encounter faults, systems that rely on termination detection need an algorithm that tolerates those faults. We provide a trio of new, practical fault tolerance schemes for a standard approach to termination detection. They are easy to implement, incur low overhead in both theory and practice, and have scalable costs when recovering from faults. These schemes tolerate all single-process faults and are probabilistically tolerant of faults affecting multiple processes. We combine the theoretical failure probabilities that we calculate for each algorithm with historical fault records from real machines to show that these algorithms have excellent overall survivability.
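The termination condition stated in the summary (every processor idle, no messages in flight) can be illustrated with a minimal sketch. This is not the algorithm from the talk: it assumes a consistent global snapshot of every process's state is available, which a real distributed detector cannot take for granted. The names `ProcessState` and `has_terminated` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ProcessState:
    """Hypothetical per-process record, captured in one consistent snapshot."""
    idle: bool      # True if the process currently has no local work
    sent: int       # total messages this process has sent
    received: int   # total messages this process has received


def has_terminated(states: Iterable[ProcessState]) -> bool:
    """Return True when all processes are idle and no message is in flight.

    Assumption: the counters come from a consistent global snapshot, so the
    difference between total sends and total receives equals the number of
    messages still in flight.
    """
    states = list(states)
    all_idle = all(s.idle for s in states)
    in_flight = sum(s.sent for s in states) - sum(s.received for s in states)
    return all_idle and in_flight == 0
```

For example, `has_terminated([ProcessState(True, 2, 1), ProcessState(True, 1, 2)])` evaluates to `True`, because both processes are idle and the three messages sent have all been received. A practical detector has to reach the same conclusion without a global snapshot, and a fault-tolerant one must also cope with processes that fail while the counts are being gathered.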