A Multi-level Scalable Startup for Parallel Applications
International Workshop on Runtime and Operating Systems for Supercomputers (ROSS) 2011
Publication Type: Paper
Repository URL: papers/201010_ScalableStartup
Abstract
High-performance parallel machines with hundreds of thousands of
processors and petascale performance are already in use, and even
larger exaflop-scale systems, which may have hundreds of millions
of cores, are planned. To run parallel applications on
machines of such massive scale, one of the biggest challenges is
the parallel startup process. This task involves two components:
(1) parallel launching of appropriate processes on the given set of
processors and (2) setting up communication channels to enable the
processes to communicate with each other after process launching
has completed. Most current startup mechanisms rely either on
daemons, which waste system resources, or on a startup manager,
which becomes a scalability bottleneck. In this paper, we
investigate the design and scalability of an SMP-aware, multi-level
startup scheme with batching of remote shell sessions, which
provides a complete solution for starting a parallel application
and facilitates its management during execution. It monitors
process health and can be used to support recovery from failures
and provide scalable interaction with the application. We
demonstrate the performance and scalability of this scheme by
applying it to start up Charm++ applications. In particular,
starting up a Charm++ program on 16,384 cores of Ranger (at TACC)
with Ethernet as the underlying communication layer takes only 25
seconds and attains a speedup of over 400% compared to MPICH2
startup (using hydra as process manager) and over 800% compared to
Open MPI startup on Ranger.
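
The abstract names the ingredients of the scheme (SMP awareness, multiple
levels, batched remote shell sessions) without giving details. The Python
sketch below is only an illustration of the general idea, not the paper's
implementation. It assumes exactly two levels, one launch per SMP node
rather than per core, and a fixed number of remote shell sessions opened
at a time. The names plan_two_level_launch, group_size, and batch_size
are hypothetical, as are the host names.

```python
def chunk(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_two_level_launch(nodes, group_size, batch_size):
    """Plan a two-level launch tree over one entry per SMP node.

    Hypothetical helper for illustration only.
    Returns (root_batches, subtree):
      root_batches: batches of group-leader nodes contacted by the root.
      subtree: maps each leader to the batches of nodes it starts itself.
    """
    if group_size < 1 or batch_size < 1:
        raise ValueError("group_size and batch_size must be positive")
    groups = chunk(nodes, group_size)
    # The first node of each group acts as that group's launcher.
    leaders = [group[0] for group in groups]
    # The root only opens sessions to leaders, batch_size at a time.
    root_batches = chunk(leaders, batch_size)
    # Each leader opens sessions to the rest of its group, also in batches.
    subtree = {group[0]: chunk(group[1:], batch_size) for group in groups}
    return root_batches, subtree


if __name__ == "__main__":
    # Illustrative assumption: 16,384 cores at 16 cores per node = 1,024 nodes.
    nodes = [f"node{i:04d}" for i in range(1024)]
    root_batches, subtree = plan_two_level_launch(nodes, group_size=32, batch_size=8)
    root_sessions = sum(len(batch) for batch in root_batches)
    print("sessions opened by root:", root_sessions)
    print("sessions opened per leader:", group_sessions := 32 - 1)
```

In this sketch the root opens one session per group leader instead of one
per node, which is the kind of relief from a single startup manager that
the abstract describes. The group and batch sizes are arbitrary
illustrative values, not figures from the paper.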
TextRef
Abhishek Gupta, Gengbin Zheng, and Laxmikant V. Kale, "A Multi-level Scalable
Startup for Parallel Applications", Proceedings of the International Workshop
on Runtime and Operating Systems for Supercomputers (ROSS), Tucson, AZ, May 2011