While NAMD is designed to be a scalable program, particularly for simulations of 100,000 atoms or more, at some point adding more processors to a simulation will provide little or no extra performance. If you are lucky enough to have access to a parallel machine, you should measure NAMD's parallel speedup for a variety of processor counts when running your particular simulation. The easiest and most accurate way to do this is to look at the ``Benchmark time:'' lines that are printed after 20 and 25 cycles (usually less than 500 steps). You can monitor performance during the entire simulation by adding ``outputTiming $<$steps$>$'' to your configuration file, but be careful to look at the ``wall time'' rather than ``CPU time'' fields on the ``TIMING:'' output lines produced. For an external measure of performance, you should run simulations of both 25 and 50 cycles (see the stepspercycle parameter) and base your estimate on the additional time needed for the longer simulation, in order to exclude startup costs and allow for initial load balancing.
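As a minimal sketch of this workflow (the configuration file name, log file name, processor count, and output interval below are placeholders; the outputTiming line is only needed if you want TIMING: output during a longer run):

\begin{verbatim}
# in the NAMD configuration file (here called mysim.namd):
outputTiming       100        ;# print a TIMING: line every 100 steps

# from the shell: run a short job, then inspect the benchmark output
charmrun namd2 +p16 mysim.namd > mysim.log
grep "Benchmark time:" mysim.log
grep "TIMING:" mysim.log      # compare the wall time field, not CPU time
\end{verbatim}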
We provide both standard (UDP) and new TCP based precompiled binaries for Linux clusters. We have observed that the TCP version is better on our dual processor clusters with gigabit ethernet, while the basic UDP version is superior on our single processor fast ethernet cluster. When using the UDP version with gigabit ethernet you can add the +giga option to adjust several tuning parameters. Additional performance may be gained by building NAMD against an SMP version of Charm++ such as net-linux-smp or net-linux-smp-icc. This will use a communication thread for each process to respond to network activity more rapidly. For dual processor clusters we have found that running two separate processes per node, each with its own communication thread, is faster than using the charmrun ++ppn option to run multiple worker threads. However, we have observed that when running on a single hyperthreaded processor (i.e., a newer Pentium 4) there is an additional 15% boost from running standalone with two threads (namd2 +p2) beyond running two processes (charmrun namd2 ++local +p2). For a cluster of single processor hyperthreaded machines an SMP version should provide very good scaling running one process per node, since the communication thread can run very efficiently on the second virtual processor. We are unable to ship an SMP build for Linux due to portability problems with the Linux pthreads implementation needed by Charm++. The new NPTL pthreads library in RedHat 9 fixes these problems, so an SMP port may become the standard shipping binary version in the future.
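For illustration, launch commands for the hardware configurations described above might look as follows (mysim.namd is a hypothetical configuration file, the processor counts are placeholders, and the multi-node run assumes the usual charmrun nodelist setup; measure each variant on your own cluster before committing to it):

\begin{verbatim}
# dual-processor node, non-SMP binary: two processes on the local machine
charmrun namd2 ++local +p2 mysim.namd

# UDP binary over gigabit ethernet: add the gigabit tuning option
charmrun namd2 +p16 +giga mysim.namd

# single hyperthreaded processor: standalone binary with two threads
namd2 +p2 mysim.namd
\end{verbatim}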
On some large machines with very high bandwidth interconnects you may be able to increase performance for PME simulations by adding either ``+strategy USE_MESH'' or ``+strategy USE_GRID'' to the command line. These flags instruct the Charm++ communication optimization library to reduce the number of messages sent during the PME 3D FFT by combining data into larger messages to be transmitted along each dimension of either a 2D mesh or a 3D grid, respectively. This reduces the number of messages sent per processor from N to 2*sqrt(N) or 3*cbrt(N), at the cost of doubling or tripling the total amount of data transmitted for the FFT.
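For example (the processor count and configuration file name are placeholders):

\begin{verbatim}
charmrun namd2 +p256 +strategy USE_MESH mysim.namd
# or, combining messages along all three dimensions:
charmrun namd2 +p256 +strategy USE_GRID mysim.namd
\end{verbatim}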
Extremely short cycle lengths (less than 10 steps) will also limit parallel scaling, since the atom migration at the end of each cycle sends many more messages than a normal force evaluation. Increasing pairlistdist from, e.g., cutoff + 1.5 to cutoff + 2.5, while also doubling stepspercycle from 10 to 20, may increase parallel scaling, but it is important to measure. When increasing stepspercycle, also try increasing pairlistspercycle by the same proportion.
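As an illustrative sketch of such a change (the cutoff and the starting values shown in the comments are assumptions about a typical input file; benchmark before and after):

\begin{verbatim}
cutoff             12.0      ;# unchanged (assumed value)
pairlistdist       14.5      ;# was 13.5 (cutoff + 1.5), now cutoff + 2.5
stepspercycle      20        ;# was 10
pairlistsPerCycle  4         ;# was 2, increased in the same proportion
\end{verbatim}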