+--------------------------------------------------------------------+
|                                                                    |
|                      NAMD 3.0.1 Release Notes                      |
|                                                                    |
+--------------------------------------------------------------------+

This file contains the following sections:

  - Problems?  Found a bug?
  - Installing NAMD
  - Running NAMD
  - CPU Affinity
  - GPU Acceleration
  - Compiling NAMD
  - Memory Usage
  - Improving Parallel Scaling
  - Endian Issues

----------------------------------------------------------------------

Problems?  Found a bug?

 1. Download and test the latest version of NAMD. 

 2. Please check NamdWiki, NAMD-L, and the rest of the NAMD web site
    resources at http://www.ks.uiuc.edu/Research/namd/.

 3. For problems compiling or running NAMD please review these notes,
    https://charm.readthedocs.io/en/latest/charm++/manual.html#appendix,
    and NamdOn... pages at http://www.ks.uiuc.edu/Research/namd/wiki/.
    If you do not understand the errors generated by your compiler,
    queueing system, ssh, or mpiexec you should seek assistance
    from a local expert familiar with your setup.

 4. For questions about using NAMD please subscribe to NAMD-L and post
    your question there so that others may benefit from the discussion.
    Please avoid sending attachments to NAMD-L by posting any related
    files on your web site and including the location in your message.
    If you are not familiar with molecular dynamics simulations please
    work through the tutorials and seek assistance from a local expert.

 5. Gather, in a single directory, all input and config files needed
    to reproduce your problem. 

 6. Run once, redirecting output to a file (if the problem occurs
    randomly do a short run and include additional log files showing
    the error). 

 7. Tar everything up (but not the namd3 or charmrun binaries) and
    compress it, e.g., "tar cf mydir.tar mydir; gzip mydir.tar".
    Please do not attach your files individually to your email message
    as this is error prone and tedious to extract.  The only exception
    may be the log of your run showing any error messages.

 8. For problems concerning compiling or using NAMD please consider
    subscribing to NAMD-L and posting your question there, and
    summarizing the resolution of your problem on NamdWiki so that
    others may benefit from your experience.  Please avoid sending large
    attachments to NAMD-L by posting any related files on a web site and
    including the location in your message.

 9. For bug reports, mail namd@ks.uiuc.edu with:
    - A synopsis of the problem as the subject (not "HELP" or "URGENT").
    - The NAMD version, platform, and number of CPUs on which the
      problem occurs (to be complete, just copy the first 20 lines
      of output).
    - A description of the problematic behavior and any error messages.
    - Whether the problem is consistent or random.
    - A complete log of your run showing any error messages.
    - The URL of your compressed tar file on a web server.

10. We'll get back to you with further questions or suggestions.  While
    we try to treat all of our users equally and respond to emails in a
    timely manner, other demands occasionally prevent us from doing so.
    Please understand that we must give our highest priority to crashes
    or incorrect results from the most recently released NAMD binaries.

----------------------------------------------------------------------

Installing NAMD

A NAMD binary distribution need only be untarred or unzipped and can
be run directly in the resulting directory.  When building from source
code (see "Compiling NAMD" below), "make release" will generate a
self-contained directory and .tar.gz or .zip archive that can be moved
to the desired installation location.

----------------------------------------------------------------------

Running NAMD

NAMD runs on a variety of serial and parallel platforms.  While it is 
trivial to launch a serial program, a parallel program depends on a 
platform-specific library such as MPI to launch copies of itself on other 
nodes and to provide access to a high performance network such as
InfiniBand if one is available.

For typical workstations (Windows, Linux, or Mac OS X) with only ethernet
networking, NAMD uses the Charm++ native communications layer and the
program charmrun to launch namd3 processes for parallel runs (either
exclusively on the local machine with the ++local option or on other hosts
as specified by a nodelist file).  The namd3 binaries for these platforms
can also be run directly (known as standalone mode) for single process runs.

Documentation on launching NAMD and other Charm++ programs may be found at
https://charm.readthedocs.io/en/latest/charm++/manual.html#running-charm-programs
in addition to the NAMD-oriented summary below.

-- Individual Linux or Mac OS X Workstations --

Mac OS X and Linux-x86_64-multicore released binaries are based on
"multicore" builds of Charm++ that can run multiple threads.
These multicore builds lack a network layer, so they can only be used on
a single machine.  For best performance use the new +auto-provision option,
which will automatically run one thread per processor:

  namd3 +auto-provision <configfile>

You may need to specify the full path to the namd3 binary.

The number of processors may still be specified with the +p option:

  namd3 +p<procs> <configfile>

For Windows support, we encourage use of WSL (Windows Subsystem for Linux) 
to run a Linux build of NAMD. For more information regarding WSL, see 
https://learn.microsoft.com/en-us/windows/wsl/install .

-- Multi-Copy Algorithm Support --

Multi-copy algorithms (such as replica exchange) require at least one
process per replica, plus a Charm++ build based on "LRTS" (low-level
run-time system).  Multi-copy-capable builds include netlrts, verbs, ucx,
and mpi.  The older net and ibverbs builds do not support multi-copy.
NAMD built on netlrts and verbs is launched with charmrun like the older
net and ibverbs layers, with +replicas <replicas> +stdout <format>
options added to divide the processes into <replicas> partitions that
write to separate log files with %d in <format> replaced by the replica.
For example, to run 8 replicas writing to rep-0.log through rep-7.log:

  charmrun namd3 ++local +p16 +replicas 8 +stdout rep-%d.log

The number of replicas must evenly divide the number of processes.
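The partitioning rule can be sketched as follows (an illustrative Python helper, not part of NAMD; the function name is hypothetical):

```python
# Illustrative sketch (not NAMD code): divide +p processes into
# +replicas partitions and expand the +stdout format for each replica.
def partition_replicas(nprocs, nreplicas, stdout_fmt="rep-%d.log"):
    if nprocs % nreplicas != 0:
        raise ValueError("replica count must evenly divide process count")
    procs_per_replica = nprocs // nreplicas
    # %d in the +stdout format is replaced by the replica index
    logs = [stdout_fmt % r for r in range(nreplicas)]
    return procs_per_replica, logs

# +p16 +replicas 8 gives 2 processes per replica, rep-0.log .. rep-7.log
```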

-- Linux Clusters with InfiniBand or Other High-Performance Networks --

Charm++ provides a special verbs network layer that uses InfiniBand
networks directly through the OpenFabrics OFED ibverbs library.  This
avoids efficiency and portability issues associated with MPI.  Look for
pre-built verbs NAMD binaries or specify verbs when building Charm++.
The verbs network layer should offer equivalent performance to the
old ibverbs layer, plus support for multi-copy algorithms (replicas).

A new UCX network layer provides higher performance on InfiniBand,
but must be compiled from source code for the specific version of
mpiexec that will be used to launch NAMD.  For details please see
https://charm.readthedocs.io/en/latest/charm++/manual.html#ucx

Intel Omni-Path networks are incompatible with the pre-built verbs
NAMD binaries.  Charm++ for verbs can be built with --with-qlogic 
to support Omni-Path, but the Charm++ MPI network layer performs
better than the verbs layer.  Hangs have been observed with Intel MPI
but not with OpenMPI, so OpenMPI is preferred.  See "Compiling NAMD"
below for MPI build instructions.  NAMD MPI binaries may be launched
directly with mpiexec rather than via the provided charmrun script.

Writing batch job scripts to run charmrun in a queueing system can be
challenging.  Since most clusters provide directions for using mpiexec
to launch MPI jobs, charmrun provides a ++mpiexec option to use mpiexec
to launch non-MPI binaries.  If "mpiexec -n <procs> ..." is not
sufficient to launch jobs on your cluster you will need to write an
executable mympiexec script like the following from TACC:

  #!/bin/csh
  shift; shift; exec ibrun $*

The job is then launched (with full paths where needed) as:

  charmrun +p<procs> ++mpiexec ++remote-shell mympiexec namd3 <configfile>

Charm++ now provides the option ++mpiexec-no-n for the common case
where mpiexec does not accept "-n <procs>" and instead derives the
number of processes to launch directly from the queueing system:

  charmrun +p<procs> ++mpiexec-no-n ++remote-shell ibrun namd3 <configfile>

For massively parallel machines with proprietary networks, NAMD uses
the system-provided MPI library (with a few exceptions including GNI on
Cray and PAMI on IBM) and standard system tools such as mpiexec or aprun
are used to launch jobs.  Since MPI libraries are very often incompatible 
between versions, you will likely need to recompile NAMD and its 
underlying Charm++ libraries to use these machines in parallel (the 
provided non-MPI binaries should still work for serial runs). The provided 
charmrun program for these platforms is only a script that attempts to 
translate charmrun options into mpiexec options, but due to the diversity 
of MPI libraries it often fails to work.

-- Linux or Other Unix Workstation Networks --

The netlrts binaries used for multi-copy algorithms as described above
can be run in parallel on a workstation network. The only difference 
is that you must provide a "nodelist" file listing the machines where 
namd3 processes should run, for example:

  group main
  host brutus
  host romeo

The "group main" line defines the default machine list.  Hosts brutus and 
romeo are the two machines on which to run the simulation.  Note that 
charmrun may run on one of those machines, or charmrun may run on a third 
machine.  All machines used for a simulation must be of the same type and 
have access to the same namd3 binary.

By default, the ancient insecure "rsh" command is used to start namd3 on 
each node specified in the nodelist file.  You can change this via the 
CONV_RSH environment variable, i.e., to use ssh instead of rsh run "setenv 
CONV_RSH ssh" or add it to your login or batch script.  You must be able 
to connect to each node via rsh/ssh without typing your password; this can 
be accomplished via a .rhosts files in your home directory, by an 
/etc/hosts.equiv file installed by your sysadmin, or by a 
.ssh/authorized_keys file in your home directory.  You should confirm that 
you can run "ssh hostname pwd" (or "rsh hostname pwd") without typing a 
password before running NAMD.  Contact your local sysadmin if you have 
difficulty setting this up.  If you are unable to use rsh or ssh, then add 
"setenv CONV_DAEMON" to your script and run charmd (or charmd_faceless, 
which produces a log file) on every node.

You should now be able to try running NAMD as:

  charmrun namd3 +p<procs> <configfile>

If this fails or just hangs, try adding the ++verbose option to see more 
details of the startup process.  You may need to specify the full path to 
the namd3 binary.  Charmrun will start the number of processes specified 
by the +p option, cycling through the hosts in the nodelist file as many 
times as necessary.  You may list multiprocessor machines multiple times 
in the nodelist file, once for each processor.
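The cycling behavior can be sketched as (illustrative only, not charmrun's actual code):

```python
# Illustrative sketch: charmrun assigns processes to nodelist hosts
# round-robin, reusing hosts as many times as necessary.
def host_for_process(rank, hosts):
    return hosts[rank % len(hosts)]

# +p4 with hosts brutus and romeo alternates between the two machines
```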

You may specify the nodelist file with the "++nodelist" option and the 
group (which defaults to "main") with the "++nodegroup" option.  If you do 
not use "++nodelist" charmrun will first look for "nodelist" in your 
current directory and then ".nodelist" in your home directory.

Some automounters use a temporary mount directory which is prepended to 
the path returned by the pwd command.  To run on multiple machines you 
must add a "++pathfix" option to your nodelist file.  For example:

  group main ++pathfix /tmp_mnt /
  host alpha1
  host alpha2

Many other options to charmrun and for the nodelist file are listed at
https://charm.readthedocs.io/en/latest/charm++/manual.html#launching-programs-with-charmrun
and a list of options is available by running charmrun without arguments.

If your workstation cluster is controlled by a queueing system and an
MPI library is configured to interact with it then you can use the
++mpiexec options described for the verbs layer above.  Otherwise, you
need to build a nodelist file in your job script.  For example, if your
queueing system provides a $HOST_FILE environment variable:

  set NODES = `cat $HOST_FILE`
  set NODELIST = $TMPDIR/namd3.nodelist
  echo group main >! $NODELIST
  foreach node ( $NODES )
    echo host $node >> $NODELIST
  end
  @ NUMPROCS = 2 * $#NODES
  charmrun namd3 +p$NUMPROCS ++nodelist $NODELIST <configfile>

Note that $NUMPROCS is twice the number of nodes in this example. This is 
the case for dual-processor machines.  For single-processor machines you 
would not multiply $#NODES by two.

Note that these example scripts and the setenv command are for the csh or 
tcsh shells.  They must be translated to work with sh or bash.

-- Shared-Memory and Network-Based Parallelism (SMP Builds) --

The Linux-x86_64-verbs-smp and other ...-smp released binaries are
based on "smp" builds of Charm++ that can be used with multiple threads
on either a single machine like a multicore build, or across a network.
SMP builds combine multiple worker threads and an extra communication
thread into a single process.  Since one core per process is used for
the communication thread, SMP builds might be slower than non-SMP
builds.  The advantage of SMP builds is that many data structures are
shared among the threads, reducing the per-core memory footprint when
scaling large simulations to large numbers of cores.

SMP builds launched with charmrun use ++n to specify the total number of
processes (Charm++ "nodes") and ++ppn to specify the number of PEs (Charm++
worker threads) per process.  Previous versions required the use of +p to
specify the total number of PEs, but the new ++n option is now recommended.
Thus, to run one process with one communication and three worker threads
on each of four quad-core nodes one would specify:

  charmrun namd3 ++n 4 ++ppn 3 <configfile>

For mpiexec-launched builds one would specify any mpiexec options needed
for the required number of processes and pass +ppn to the NAMD binary as:

  mpiexec -n 4 namd3 +ppn 3 <configfile>
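The thread accounting in the examples above works out as follows (an illustrative sketch; the helper name is hypothetical):

```python
# Illustrative sketch of SMP thread accounting: each process runs ppn
# worker threads (PEs) plus one communication thread.
def smp_accounting(nprocs, ppn):
    total_pes = nprocs * ppn       # worker threads doing simulation work
    threads_per_process = ppn + 1  # workers plus one communication thread
    return total_pes, threads_per_process

# ++n 4 ++ppn 3: 12 PEs total, 4 threads per process (fills a quad-core node)
```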

Verbs and UCX builds tend to outperform MPI-based SMP builds, and so
are recommended when supported, particularly for GPU-accelerated builds.

See the Cray XE/XK/XC directions below for a more complex example.

-- Cray XE/XK/XC --

First load modules for the GNU compilers (XE/XK only, XC should use Intel),
topology information, huge page sizes, and the system FFTW 3 library:

  module swap PrgEnv-cray PrgEnv-gnu
  module load rca
  module load craype-hugepages8M
  module load fftw

The CUDA Toolkit module enables dynamic linking, so it should only
be loaded when building CUDA binaries and never for non-CUDA binaries:

  module load cudatoolkit

For CUDA or large simulations on XE/XK use gemini_gni-crayxe-persistent-smp
and for smaller XE simulations use gemini_gni-crayxe-persistent.  For XC
similarly use gni-crayxc-persistent-smp or gni-crayxc-persistent.

For XE/XK use CRAY-XE-gnu and (for CUDA) the "--with-cuda" config option,
the appropriate "--charm-arch" parameter, and --with-fftw3.  For XC use
instead CRAY-XC-intel with all other options the same.

Your batch job will need to load modules and set environment variables:

  module swap PrgEnv-cray PrgEnv-gnu
  module load rca
  module load craype-hugepages8M
  setenv HUGETLB_DEFAULT_PAGE_SIZE 8M
  setenv HUGETLB_MORECORE no

To run an SMP build with one process per node on 16 32-core nodes:

  aprun -n 16 -r 1 -N 1 -d 31 /path/to/namd3 +ppn 30 +pemap 1-30 +commap 0 <configfile>

or the same with 4 processes per node:

  aprun -n 64 -N 4 -d 8 /path/to/namd3 +ppn 7 +pemap 1-7,9-15,17-23,25-31 +commap 0,8,16,24 <configfile>

or non-SMP, leaving one core free for the operating system:

  aprun -n 496 -r 1 -N 31 -d 1 /path/to/namd3 +pemap 0-30 <configfile>

The explicit +pemap and +commap settings are necessary to avoid having
multiple threads assigned to the same core (or potentially all threads
assigned to the same core).  If the performance of NAMD running on a
single compute node is much worse than on a comparable non-Cray host then
it is very likely that your CPU affinity settings need to be fixed.

All Cray XE/XK/XC network layers support multi-copy algorithms (replicas).

-- Xeon Phi Processors (KNL) --

Special Linux-KNL-icc and CRAY-XC-KNL-intel builds enable vectorizable
mixed-precision kernels while preserving full alchemical and other
functionality.  Multi-host runs require multiple smp processes per host
(as many as 13 for Intel Omni-Path, 6 for Cray Aries) in order to drive
the network.  Careful attention to CPU affinity settings (see below) is
required, as is using 1 or 2 (but not 3 or 4) hyperthreads per PE core
(and only 1 per communication thread core).

There appears to be a bug in the Intel 17.0 compiler that breaks the
non-KNL-optimized NAMD kernels (used for alchemical free energy, etc.)
on KNL.  Therefore the Intel 16.0 compilers are recommended on KNL.

----------------------------------------------------------------------

CPU Affinity

NAMD may run faster on some machines if threads or processes are set to
run on (or not run on) specific processor cores (or hardware threads).
On Linux this can be done at the process level with the numactl utility,
but NAMD provides its own options for assigning threads to cores.  This
feature is enabled by adding +setcpuaffinity to the namd3 command line,
which by itself will cause NAMD (really the underlying Charm++ library)
to assign threads/processes round-robin to available cores in the order
they are numbered by the operating system.  This may not be the fastest
configuration if NAMD is running fewer threads than there are cores
available and consecutively numbered cores share resources such as
memory bandwidth or are hardware threads on the same physical core.

If needed, specific cores for the Charm++ PEs (processing elements) and
communication threads (on SMP builds) can be set by adding the +pemap
and (if needed) +commap options with lists of core sets in the form
"lower[-upper[:stride[.run]]][,...]".  A single number identifies a
particular core.  Two numbers separated by a dash identify an inclusive
range (lower bound and upper bound).  If they are followed by a colon and
another number (a stride), that range will be stepped through in increments
of the additional number.  A dot followed by a run length indicates how
many consecutive cores to use from each starting point.  For example, the
sequence 0-8:2,16,20-24 includes cores 0, 2, 4, 6, 8, 16, 20, 21, 22, 23, 24.
On a 4-way quad-core system three cores from each socket would be 0-15:4.3
if cores on the same chip are numbered consecutively.  There is no need
to repeat cores for each node in a run as they are reused in order.
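The core-set syntax can be sketched as a small parser (an illustrative reimplementation for clarity, not the actual Charm++ code):

```python
# Illustrative parser for +pemap/+commap core lists of the form
# "lower[-upper[:stride[.run]]][,...]" (not the actual Charm++ code).
def expand_map(spec):
    cores = []
    for part in spec.split(","):
        run = 1                      # cores taken at each stride step
        if "." in part:
            part, run = part.split(".")
            run = int(run)
        stride = None
        if ":" in part:
            part, stride = part.split(":")
            stride = int(stride)
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
        else:
            lo = hi = int(part)      # a single number is one core
        if stride is None:
            cores.extend(range(lo, hi + 1))
        else:
            for start in range(lo, hi + 1, stride):
                cores.extend(range(start, min(start + run, hi + 1)))
    return cores

# expand_map("0-8:2,16,20-24") -> [0, 2, 4, 6, 8, 16, 20, 21, 22, 23, 24]
# expand_map("0-15:4.3")       -> three cores from each of four sockets
```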

For example, the IBM POWER7 has four hardware threads per core and the
first thread can use all of the core's resources if the other threads are
idle; threads 0 and 1 split the core if threads 2 and 3 are idle, but
if either of threads 2 or 3 are active the core is split four ways.  The
fastest configuration of 32 threads or processes on a 128-thread,
32-core machine is therefore "+setcpuaffinity +pemap 0-127:4".  For 64
threads we need cores 0,1,4,5,8,9,... or 0-127:4.2.  Running 4 processes
with +ppn 31 would be "+setcpuaffinity +pemap 0-127:32.31 +commap 31-127:32".

For Intel processors, including KNL, where hyperthreads on the same core
are not numbered consecutively, hyperthreads may be mapped to consecutive
PEs by appending [+span] to a core set, e.g., "+pemap 0-63+64+128+192"
to use all threads on a 64-core, 256-thread KNL with threads mapped to
PEs as 0,64,128,192,1,65,129,193,...

By default, the CPU indices passed to +pemap and +commap match those used
by the operating system.  Prefixing the index set by "L" will instead use
"logical" indices (defined by the hwloc library) in which related CPUs
are numbered consecutively.  For the 64-core, 256-thread KNL example above,
"+pemap 0-63+64+128+192" and "+pemap L0-255" are equivalent.

For large shared-memory machines where the queueing system assigns cores
to jobs this information must be obtained with numactl --show and passed
to NAMD in order to set thread affinity (which will improve performance):

  namd3 +setcpuaffinity `numactl --show | awk '/^physcpubind/ {printf \
     "+p%d +pemap %d",(NF-1),$2; for(i=3;i<=NF;++i){printf ",%d",$i}}'` ...

----------------------------------------------------------------------

GPU Acceleration: GPU-offload mode

For NAMD's traditional GPU-offload mode of operation, in which force 
calculations are performed on the GPU and the remaining calculations 
are performed on the CPU, performance may be limited by the CPU. 
In general, all available CPU cores should be used, with CPU affinity 
set as described above.  Note that two programming models for GPU 
acceleration are supported, CUDA for NVIDIA GPUs and HIP for AMD GPUs.

Energy evaluation is slower than calculating forces alone, and the loss
is much greater in GPU-accelerated builds.  Therefore you should set
outputEnergies to 100 or higher in the simulation config file.  Some
features are unavailable in CUDA builds, such as the Lowe-Andersen 
thermostat and Drude force field when using NBThole. 

As this is an evolving feature you are encouraged to test all simulations
before beginning production runs.  Forces evaluated on the GPU differ
slightly from a CPU-only calculation, an effect more visible in reported
scalar pressure values than in energies.

To benefit from GPU acceleration you will need a CUDA build of NAMD 
and a recent high-end NVIDIA video card.  CUDA builds will not function
without a CUDA-capable GPU and a driver that supports CUDA 9.1.  If the
installed driver is too old NAMD will exit on startup with the error
"CUDA driver version is insufficient for CUDA runtime version".  For 
HIP builds, NAMD has been tested with ROCm 5.4.2. 

Finally, if NAMD was not statically linked against the CUDA runtime
then the libcudart.so file included with the binary (copied from
the version of CUDA it was built with) must be in a directory in your
LD_LIBRARY_PATH before any other libcudart.so libraries.  For example,
when running a multicore binary (recommended for a single machine):

  setenv LD_LIBRARY_PATH ".:$LD_LIBRARY_PATH"
  (or LD_LIBRARY_PATH=".:$LD_LIBRARY_PATH"; export LD_LIBRARY_PATH)
  ./namd3 +p8 +setcpuaffinity <configfile>

Each namd3 thread can use only one GPU.  Therefore you will need to run
at least one thread for each GPU you want to use.  Multiple threads
can share a single GPU, usually with an increase in performance.  NAMD
will automatically distribute threads equally among the GPUs on a node.
Specific GPU device IDs can be requested via the +devices argument on
the namd3 command line, for example:

  ./namd3 +p8 +setcpuaffinity +devices 0,2 <configfile>

Devices are shared by consecutive threads in a process, so in the
above example threads 0-3 will share device 0 and threads 4-7 will
share device 2.  Repeating a device will cause it to be assigned to
multiple master threads, which is allowed only for different processes
and is advised against in general but may be faster in certain cases.
When running on multiple nodes the +devices specification is applied to
each physical node separately and there is no way to provide a unique
list for each node.
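The consecutive-thread sharing rule can be sketched as (illustrative only; NAMD's internal assignment is not exposed this way):

```python
# Illustrative sketch: consecutive threads are split into contiguous
# blocks, one block per entry in the +devices list.
def device_for_thread(thread, nthreads, devices):
    return devices[thread * len(devices) // nthreads]

# +p8 +devices 0,2: threads 0-3 share device 0, threads 4-7 share device 2
```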

When running a multi-node parallel job it is recommended to have one
process per device to maximize the number of communication threads.
If the job launch system enforces device segregation such that not all
devices are visible to each process then the +ignoresharing argument
must be used to disable the shared-device error message.

When running a multi-copy simulation with both multiple replicas and
multiple devices per physical node, the +devicesperreplica <n> argument
must be used to prevent each replica from binding all of the devices.
For example, for 2 replicas per 6-device host use +devicesperreplica 3. 

When running a multi-copy simulation in GPU-resident mode using a
netlrts-smp-CUDA (or netlrts-smp-HIP) build, it is recommended to use
"+devicesperreplica 1".

GPUs of compute capability < 5.0 are no longer supported and are ignored.
GPUs with two or fewer streaming multiprocessors are ignored unless
specifically requested with +devices.

While charmrun with ++local will preserve LD_LIBRARY_PATH, normal
charmrun does not.  You can use charmrun ++runscript to add the namd3
directory to LD_LIBRARY_PATH with the following executable runscript:

  #!/bin/csh
  setenv LD_LIBRARY_PATH "${1:h}:$LD_LIBRARY_PATH"
  $*

For example:

  ./charmrun ++runscript ./runscript ++n 4 ./namd3 ++ppn 15 <configfile>

An InfiniBand network is highly recommended when running CUDA-accelerated
NAMD across multiple nodes.  You must use verbs (available for download)
or UCX (must compile) to make use of the InfiniBand network.  The use
of SMP binaries is also recommended when running on multiple nodes, with
one process per GPU and as many threads as available cores, reserving
one core per process for the communication thread.  MPI-smp-CUDA is not
recommended due to the reduced performance of MPI-based SMP builds.

The CUDA (NVIDIA's graphics processor programming platform) code in
NAMD is completely self-contained and does not use any of the CUDA
support features in Charm++.  When building NAMD with CUDA support
you should use the same Charm++ you would use for a non-CUDA build.
Do NOT add the cuda option to the Charm++ build command line.  The
only changes to the build process needed are to add --with-cuda and
possibly --cuda-prefix ... to the NAMD config command line.

----------------------------------------------------------------------

GPU Acceleration: GPU-resident mode

NAMD version 3 introduces a new GPU-resident mode of operation, in which 
almost all calculations during dynamics simulations are performed on the 
GPU.  Single-GPU simulation can achieve a 2x or more speedup versus
GPU-offload mode.  Standard dynamics simulations can be scaled across
multiple GPUs in a single node, if the GPUs have P2P (peer-to-peer) 
connectivity.  Although this can be made available over PCIe, we highly 
recommend use of NVLink or NVSwitch connections between GPUs to provide 
good scaling beyond two devices.  Scaling on AMD Instinct GPUs (MI series) 
is supported for Infinity Fabric connections.  Larger simulations, in the 
range of 1M to 10M atoms, are able to scale very well on 8-way DGX-type 
platforms, like the DGX-A100, with up to 87% parallel efficiency. 

Note that GPU-resident mode is supported only with Charm++ multicore
builds for single-node simulation and netlrts builds for multi-copy and
replica-exchange simulation.

Since the CPU has so much less work for GPU-resident mode, fewer CPU 
cores per device are needed.  In fact, using too many CPU cores per GPU 
might slow down performance, due to some extra overhead introduced for 
managing each core.  The number of CPU cores to use per device depends 
on the size of the system, where the use of more cores for larger 
systems might improve performance.  For example, 8 cores per device
works well for simulating the 1M-atom STMV system on DGX-A100.  Besides 
the number of cores per device, it is important to keep earlier advice 
in mind, especially the use of +setcpuaffinity to maintain CPU core 
affinity.  

GPU-resident mode is enabled by the configuration file option:

  CUDASOAintegrate on

Performance for this mode is even more sensitive to the frequency of 
energy evaluation and output.  When outputEnergies is left unset, it 
defaults to 100 steps; depending on the size of the system, it might be 
good to set outputEnergies as high as 500 or 1000 steps.

Another bottleneck to good performance is caused by "atom migration," 
in which the spatial decomposition of atoms in their respective patches 
is updated as atoms move.  For non-GPU-resident simulation, atom 
migration occurs every stepsPerCycle steps (default 20).  To reduce how 
often atom migration takes place, GPU-resident mode extends the patch 
margin (controlled by the "margin" parameter) and then performs atom 
migration only when necessary, rather than at regular intervals.  From a
user perspective, you should avoid setting "stepsPerCycle" in the config 
file.  The "margin" parameter defaults to 4.  Performance for larger 
systems might improve from setting margin as large as 8, and very small 
systems can get better performance from setting margin to 0. 
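Putting the options above together, a GPU-resident config fragment might
look like the following (the values shown are illustrative starting
points, not recommendations for every system):

  # illustrative GPU-resident settings; tune per system
  CUDASOAintegrate  on
  # evaluate energies less often (500-1000 for large systems)
  outputEnergies    500
  # larger margin for a large system; small systems may prefer 0
  margin            8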

There are some additional parameters that can improve performance 
through experimental functionality that is not yet enabled by default. 
It is possible to keep atom migration completely on the GPU by setting

  DeviceMigration on

Since this is quite new, DeviceMigration is off by default.  Note that 
DeviceMigration does not yet work across multiple AMD GPUs.  We have 
also been experimenting with improving performance for the short-range 
non-bonded compute kernels by direct calculation of the pairwise 
interaction forces rather than interpolation from a pre-computed force 
table, as has been done now for many years.  With PME enabled, the 
erfc() function is still too expensive to evaluate directly.  However, 
we have seen a small performance boost from direct calculation of the 
non-PME steps when multiple time stepping is enabled.  This functionality 
can be enabled by setting 

  CUDAForceTable off

since interpolation tables remain in use by default. 
Use of direct calculation on non-PME steps is still considered 
experimental because we have not yet performed extensive testing to see 
its effect on energy conservation and other conserved quantities over 
long timescales.  Note that "CUDAForceTable off" is not yet available 
for AMD GPUs. 

Scaling GPU-resident simulations to multiple GPUs is subject to load 
imbalance due to PME.  Because PME requires FFTs, which for our typical 
system sizes do not scale well across GPUs, the PME calculation is 
performed on a single GPU.  However, overloading 
one GPU means that the other GPUs will be left waiting without useful 
work during the PME force calculation steps.  We can deal with this 
load imbalance by assigning less work to the GPU that is performing 
the PME calculation.  This is implemented using the existing work 
decomposition infrastructure in NAMD, which evenly distributes patches 
and compute objects across the CPU cores.  What we have introduced is 
a command line parameter that reduces the number of PEs (CPU cores) 
assigned to the PME device, thereby reducing its overall work assignment. 
The command line parameter is "+pmepes K" where K is the number of PEs 
to assign to the PME device.  The following examples show optimal work 
distribution for the STMV (1M atom) benchmark system scaled across 
GPUs on a DGX-A100.  Note that "+p" needs to be set to the total number 
of CPU cores.

Running on 1 GPU:

  namd3 +p8 +setcpuaffinity +devices 0 stmv.namd
    # with only one GPU, there are no load balancing issues

Running on 2 GPUs:

  namd3 +p15 +pmepes 7 +setcpuaffinity +devices 0,1 stmv.namd
    # with 8 PEs per non-PME device, +p is set to 15 = 1*8 + 7

Running on 4 GPUs:

  namd3 +p29 +pmepes 5 +setcpuaffinity +devices 0,1,2,3 stmv.namd
    # with 8 PEs per non-PME device, +p is set to 29 = 3*8 + 5

Running on all 8 GPUs:

  namd3 +p57 +pmepes 1 +setcpuaffinity +devices 0,1,2,3,4,5,6,7 stmv.namd
    # with 8 PEs per non-PME device, +p is set to 57 = 7*8 + 1
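The "+p" arithmetic in these examples can be sketched as (an illustrative helper; the function name is hypothetical):

```python
# Illustrative sketch: total PE count when the PME device is assigned
# fewer PEs than the other devices via +pmepes.
def total_pes(ngpus, ppn_per_device, pme_pes):
    if ngpus == 1:
        return ppn_per_device  # single GPU: no PME load imbalance
    return (ngpus - 1) * ppn_per_device + pme_pes

# 2 GPUs: 1*8 + 7 = 15;  4 GPUs: 3*8 + 5 = 29;  8 GPUs: 7*8 + 1 = 57
```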

There are also performance considerations for very small systems.  When
relying on NAMD's standard patch decomposition, a smaller system might 
have trouble producing enough work to fully occupy all SMs on a single 
GPU.  The amount of work can be increased using:

  twoAwayZ on  # doubles work items by splitting patches along the Z-axis
  twoAwayY on  # if we need more work...
  twoAwayX on  # if we need even more work...

These take the existing patch decomposition and split it along the 
Z-axis, Y-axis, and X-axis, each one doubling the number of work 
items available, up to a factor of 8.  As an example, the DHFR 
(23.5k-atom) benchmark system was previously challenging for good 
GPU-resident performance.  However, using margin=2 and twoAwayZ=on 
(note that the benchmark uses a 9A cutoff with a 4fs timestep and 
HMR), GPU-resident NAMD achieves 1,102 ns/day on an A100. 
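For a small system like DHFR, the relevant config file additions would
look something like the following (a sketch based on the example above;
tune the values against your own benchmarks):

```
# increase padding and patch count for a small system on one GPU
margin 2
twoAwayZ on
```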

When doing ensemble sampling, aggregate throughput can be improved by 
using NVIDIA MPS (multi-process server) to schedule multiple jobs per 
GPU in order to continually keep all SMs occupied.  Although this will 
slow down any single simulation, the aggregate sampling available is 
improved.  To use MPS with NAMD, begin your launch script with:

  export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
  export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
  nvidia-cuda-mps-control -d

Then launch NAMD jobs in the background (using "&") and wait on the job 
PIDs.  One caveat is that superuser access is required to run the 
"nvidia-cuda-mps-control" server, but it should work if the computing 
center whitelists this command or when using containers. 
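A minimal launch-script sketch (the job count, config file names, and
device numbering are illustrative; assumes the MPS daemon is permitted
on your system):

```shell
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d

# run four replicas sharing one GPU, then wait for all of them
for i in 0 1 2 3; do
    namd3 +p2 +devices 0 replica$i.namd > replica$i.log &
done
wait
```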

GPU-resident mode is compatible with standard simulation protocols, 
including constant energy, thermostats, barostats, PME, rigid bond 
constraints, and multiple time stepping.  All of these are supported 
for multi-GPU scaling.  GPU-resident mode can also run replica-exchange 
MD and multi-copy simulations using the netlrts build. 

Several advanced features are supported by GPU-resident mode. 
Alchemical free energy methods (FEP & TI) are supported for single- 
and multi-GPU simulation.  Advanced features supported only for 
single-GPU simulation are REST2, harmonic restraints, external 
electric fields, and SMD.

Two additional features are available only in GPU-resident mode and 
only for single-GPU simulation.  A Monte Carlo barostat is provided 
that is faster than the Langevin piston barostat, and group position 
restraints replace a common use case of Colvars.  As a native GPU 
implementation, the group restraints provide much faster performance 
than the CPU-based Colvars implementation. 

Work continues to port more existing features to GPU-resident mode. 

----------------------------------------------------------------------

Compiling NAMD

Building a complete NAMD binary from source code requires:

- working C and C++ compilers;
- a compiled version of the Charm++/Converse library;
- a compiled version of the TCL library and its header files;
- a compiled version of the FFTW library and its header files;
- a C shell (csh/tcsh) to run the script used to configure the build.

NAMD can be compiled without TCL or FFTW, but certain features will be
disabled.  Fortunately, precompiled TCL and FFTW libraries are available
from http://www.ks.uiuc.edu/Research/namd/libraries/.  To disable these
options, specify --without-tcl --without-fftw when you run the config
script.  Some files in arch may need editing to set the paths to the
TCL and FFTW libraries correctly.

As an example, here is the build sequence for 64-bit Linux workstations:

Unpack NAMD and matching Charm++ source code:
  tar xzf NAMD_3.0.1_Source.tar.gz
  cd NAMD_3.0.1_Source
  tar xf charm-8.0.0.tar

Build and test the Charm++/Converse library (single-node multicore version):
  cd charm-8.0.0
  ./build charm++ multicore-linux-x86_64 --with-production
  cd multicore-linux-x86_64/tests/charm++/megatest
  make pgm
  ./pgm +p4   (multicore does not support multiple nodes)
  cd ../../../../..

Build and test the Charm++/Converse library (ethernet version):
  cd charm-8.0.0
  ./build charm++ netlrts-linux-x86_64 --with-production
  cd netlrts-linux-x86_64/tests/charm++/megatest
  make pgm
  ./charmrun ++local +p4 ./pgm   (forks processes on local node)
  cd ../../../../..

Build and test the Charm++/Converse library (InfiniBand verbs version):
  cd charm-8.0.0
  ./build charm++ verbs-linux-x86_64 --with-production
  cd verbs-linux-x86_64/tests/charm++/megatest
  make pgm
  ./charmrun ++mpiexec +p4 ./pgm   (uses mpiexec to launch processes)
  cd ../../../../..

Build and test the Charm++/Converse library (InfiniBand UCX OpenMPI PMIx version):
  cd charm-8.0.0
  ./build charm++ ucx-linux-x86_64 ompipmix --with-production
  cd ucx-linux-x86_64-ompipmix/tests/charm++/megatest
  make pgm
  mpiexec -n 4 ./pgm   (run as for an OpenMPI program on your cluster)
  cd ../../../../..

Build and test the Charm++/Converse library (MPI version):
  cd charm-8.0.0
  env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --with-production
  cd mpi-linux-x86_64/tests/charm++/megatest
  make pgm
  mpiexec -n 4 ./pgm   (run as any other MPI program on your cluster)
  cd ../../../../..

For years, NAMD used TCL 8.5.9.  However, in order for GPU-resident mode
to support Colvars and TCL forces, it has been necessary to update to the
stackless TCL 8.6.x releases.  The original TCL 8.5.9 archives are still
available for download and will work perfectly well for non-GPU-resident
builds.

Download and install TCL and FFTW libraries:
  (cd to NAMD_3.0.1_Source if you're not already there)
  wget http://www.ks.uiuc.edu/Research/namd/libraries/fftw-linux-x86_64.tar.gz
  tar xzf fftw-linux-x86_64.tar.gz
  mv linux-x86_64 fftw
  wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.6.13-linux-x86_64.tar.gz
  wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.6.13-linux-x86_64-threaded.tar.gz
  tar xzf tcl8.6.13-linux-x86_64.tar.gz
  tar xzf tcl8.6.13-linux-x86_64-threaded.tar.gz
  mv tcl8.6.13-linux-x86_64 tcl
  mv tcl8.6.13-linux-x86_64-threaded tcl-threaded

Optionally edit various configuration files:
  (not needed if charm-8.0.0, fftw, and tcl are in NAMD_3.0.1_Source)
  vi Make.charm  (set CHARMBASE to full path to charm)
  vi arch/Linux-x86_64.fftw     (fix library name and path to files)
  vi arch/Linux-x86_64.tcl      (fix library version and path to TCL files)

Set up build directory and compile:
  multicore version:
    ./config Linux-x86_64-g++ --charm-arch multicore-linux-x86_64
  ethernet version:
    ./config Linux-x86_64-g++ --charm-arch netlrts-linux-x86_64
  InfiniBand verbs version:
    ./config Linux-x86_64-g++ --charm-arch verbs-linux-x86_64
  InfiniBand UCX version:
    ./config Linux-x86_64-g++ --charm-arch ucx-linux-x86_64-ompipmix
  MPI version:
    ./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
  GPU-resident CUDA multicore version: 
    ./config Linux-x86_64-g++ --charm-arch multicore-linux-x86_64 \
        --with-single-node-cuda
        # you might also need --cuda-prefix CUDA_DIRECTORY
  GPU-resident CUDA ethernet version:
    ./config Linux-x86_64-g++ --charm-arch netlrts-linux-x86_64 \
        --with-single-node-cuda
        # you might also need --cuda-prefix CUDA_DIRECTORY
  GPU-resident HIP multicore version: 
    ./config Linux-x86_64-g++ --charm-arch multicore-linux-x86_64 \
        --with-single-node-hip
        # you might also need --rocm-prefix ROCM_DIRECTORY
  GPU-resident HIP ethernet version:
    ./config Linux-x86_64-g++ --charm-arch netlrts-linux-x86_64 \
        --with-single-node-hip
        # you might also need --rocm-prefix ROCM_DIRECTORY

  cd Linux-x86_64-g++
  make   (or gmake -j4, which should run faster)

Quick tests using one and two processes (ethernet version):
  (this is a 66-atom simulation so don't expect any speedup)
  ./namd3
  ./namd3 src/alanin
  ./charmrun ++local +p2 ./namd3
  ./charmrun ++local +p2 ./namd3 src/alanin
  (for MPI or UCX version, run namd3 binary as for MPI executable)

Longer test using four processes:
  wget http://www.ks.uiuc.edu/Research/namd/utilities/apoa1.tar.gz
  tar xzf apoa1.tar.gz
  ./charmrun ++local +p4 ./namd3 apoa1/apoa1.namd
  (FFT optimization will take several seconds during the first run.)

That's it.  A more complete explanation of the build process follows.
Note that you will need Cygwin to compile NAMD on Windows.

Download and unpack fftw and tcl libraries for your platform from
http://www.ks.uiuc.edu/Research/namd/libraries/.  Each tar file
contains a directory with the name of the platform.  These libraries
don't change very often, so you should find a permanent home for them.

Unpack the NAMD source code and the enclosed charm-8.0.0.tar archive.  This
version of Charm++ is the same one used to build the released binaries
and is more likely to work and be bug free than any other we know of.
Edit Make.charm to point at .rootdir/charm-8.0.0 or the full path to the
charm directory if you unpacked outside of the NAMD source directory.

Run the config script without arguments to list the available builds,
which have names like Linux-x86_64-icc.  Each build or "ARCH" is of the
form BASEARCH-compiler, where BASEARCH is the most generic name for a
platform, like Linux-x86_64.

Note that many of the options that used to require editing files can
now be set with options to the config script.  Running the config script
without arguments lists the available options as well.

Edit arch/BASEARCH.fftw and arch/BASEARCH.tcl to point to the libraries
you downloaded.  Find a line something like
"CHARMARCH = multicore-linux-x86_64-iccstatic" in arch/ARCH.arch to tell
what Charm++ platform you need to build.  The CHARMARCH name is of the
format comm-OS-cpu-options-compiler.  It is important that Charm++ and
NAMD be built with the same C++ compiler.  To change the CHARMARCH, just
edit the .arch file or use the --charm-arch config option.

Enter the charm directory and run the build script without options
to see a list of available platforms.  Only the comm-OS-cpu part will
be listed.  Any options or compiler tags are listed separately and
must be separated by spaces on the build command line.  Run the build
command for your platform as:

  ./build charm++ comm-OS-cpu options compiler --with-production

For example:

  ./build charm++ ucx-linux-x86_64 ompipmix icc --with-production

Note that for MPI builds you normally do not need to specify a compiler,
even if your mpicxx calls icc internally, but you will need to use an
icc-based NAMD architecture specification.

The README distributed with Charm++ contains a complete explanation.
You only actually need the bin, include, and lib subdirectories, so
you can copy those elsewhere and delete the whole charm directory,
but don't forget to edit Make.charm if you do this.

The CUDA (NVIDIA's graphics processor programming platform) code in
NAMD is completely self-contained and does not use any of the CUDA
support features in Charm++.  When building NAMD with CUDA support,
you should use the same Charm++ you would use for a non-CUDA build.
Do NOT add the cuda option to the Charm++ build command line.  The
only changes to the build process needed are to add --with-cuda and
possibly --cuda-prefix ... to the NAMD config command line.  Use 
--with-single-node-cuda to build GPU-resident support.  

HIP builds of NAMD likewise depend on --with-hip to be set, with 
--rocm-prefix, --hipcub-prefix, and --rocprim-prefix being optionally
set if the ROCm library is installed outside of /opt/rocm.  Use
--with-single-node-hip to build GPU-resident support.  If desired,
CUDA can be used as the backend on NVIDIA hardware for direct
performance comparisons by passing both --with-hip and --with-cuda on
the NAMD config line.  Both clang and gcc compilers have been tested 
with HIP.  Intel compilers are not recommended with HIP.

If you are building a non-smp, non-tcp version of netlrts with the
Intel icc compiler you may need to disable optimization for some
files to avoid crashes in the communication interrupt handler.  The
smp and tcp builds use polling instead of interrupts and therefore
are not affected.  Adding +netpoll to the namd3 command line also
avoids the bug, but this option reduces performance in many cases.
These commands recompile the necessary files without optimization:

  cd charm/netlrts-linux-x86_64-icc
  /bin/rm tmp/sockRoutines.o
  /bin/rm tmp/machine.o
  /bin/rm lib/libconv-machine*
  ( cd tmp; make charm++ OPTS="-O0" )

If you're building an MPI version you will probably want to build
Charm++ with env MPICXX=mpicxx preceding ./build on the command line,
since the default MPI C++ compiler is mpiCC.  You may also need to
change compiler flags or commands in the Charm++ src/arch directory.
The file charm/src/arch/mpi-linux/conv-mach.sh contains the definitions
that select the mpiCC compiler for mpi-linux, while other compiler
choices are defined by files in charm/src/arch/common/.

If you want to run NAMD on InfiniBand one option is to build a UCX
library version with the OpenMPI PMIx launcher by specifying:

  ./build charm++ ucx-linux-x86_64 ompipmix icc --with-production

You would then change "multicore-linux-x86_64-iccstatic" to "ucx-linux-x86_64-ompipmix-icc"
in your namd3/arch/Linux-x86_64-icc.arch file (or create a new .arch file).

Run make in charm/CHARMARCH/tests/charm++/megatest/ and run the
resulting binary "pgm" as you would run NAMD on your platform.  You
should try running on several processors if possible.  For example:

  cd ucx-linux-x86_64-ompipmix-icc/tests/charm++/megatest/
  make pgm
  mpiexec -n 16 ./pgm

If any of the tests fail then you will probably have problems with
NAMD as well.  You can continue and try building NAMD if you want,
but when reporting problems please mention prominently that megatest
failed, include the megatest output, and copy the Charm++ developers
at ppl@cs.uiuc.edu on your email.

Now you can run the NAMD config script to set up a build directory:

  ./config ARCH

For this specific example:

  ./config Linux-x86_64-icc --charm-arch ucx-linux-x86_64-ompipmix-icc

This will create a build directory Linux-x86_64-icc.

If you wish to create this directory elsewhere use config DIR/ARCH,
replacing DIR with the location the build directory should be created.
A symbolic link to the remote directory will be created as well.  You
can create multiple build directories for the same ARCH by adding a
suffix.  These can be combined, of course, as in:

  ./config /tmp/Linux-x86_64-icc.test1

Now cd to your build directory and type make.  The namd3 binary and
a number of utilities will be created.

If you have trouble building NAMD your compiler may be different from
ours.  The architecture-specific makefiles in the arch directory use
several options to elicit similar behavior on all platforms.  Your
compiler may conform to an earlier C++ specification than NAMD uses,
or it may enforce a later C++ rule than NAMD follows.  You may ignore
repeated warnings about new and delete matching.

The NAMD Wiki at http://www.ks.uiuc.edu/Research/namd/wiki/ has entries
on building and running NAMD at various supercomputer centers (e.g.,
NamdAtTexas) and on various architectures (e.g., NamdOnMPICH).  Please
consider adding a page on your own porting effort for others to read.

----------------------------------------------------------------------

Memory Usage

NAMD has traditionally used less than 100MB of memory even for systems
of 100,000 atoms.  With the reintroduction of pairlists in NAMD 2.5,
however, memory usage for a 100,000 atom system with a 12A cutoff can
approach 300MB, and will grow with the cube of the cutoff.  This extra
memory is distributed across processors during a parallel run, but a
single workstation may run out of physical memory with a large system.

To avoid this, NAMD now provides a pairlistMinProcs config file option
that specifies the minimum number of processors that a run must use
before pairlists will be enabled (on fewer processors small local
pairlists are generated and recycled rather than being saved, the
default is "pairlistMinProcs 1").  This is a per-simulation rather than
a compile time option because memory usage is molecule-dependent.
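For example, to keep pairlists disabled for small workstation runs while
still enabling them for larger parallel jobs, one might add a line such
as the following (the threshold value here is illustrative):

```
pairlistMinProcs 8
```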

Additional information on reducing memory usage may be found at
http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdMemoryReduction

----------------------------------------------------------------------

Improving Parallel Scaling

While NAMD is designed to be a scalable program, particularly for
simulations of 100,000 atoms or more, at some point adding additional
processors to a simulation will provide little or no extra performance.
If you are lucky enough to have access to a parallel machine you should
measure NAMD's parallel speedup for a variety of processor counts when
running your particular simulation.  The easiest and most accurate way
to do this is to look at the "Benchmark time:" lines that are printed
after 20 and 25 cycles (usually less than 500 steps).  You can monitor
performance during the entire simulation by adding "outputTiming <steps>"
to your configuration file, but be careful to look at the "wall time"
rather than "CPU time" fields on the "TIMING:" output lines produced.
For an external measure of performance, you should run simulations of
both 25 and 50 cycles (see the stepspercycle parameter) and base your
estimate on the additional time needed for the longer simulation in
order to exclude startup costs and allow for initial load balancing.

Multicore builds scale well within a single node, but may benefit from
setting CPU affinity using the +setcpuaffinity +pemap <map> +commap <map>
options described in CPU Affinity above.  Experimentation is needed.
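For example, on a hypothetical 8-core node one might try the following
(the map values depend entirely on your node's topology, so treat this
as a starting point for experimentation):

```shell
./namd3 +p8 +setcpuaffinity +pemap 0-7 apoa1/apoa1.namd
```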

We provide single-copy (multicore), multi-copy (netlrts), and InfiniBand
(verbs, supporting both single- and multi-copy simulation) precompiled
binaries for Linux clusters.  For other high-speed networks, we recommend
compiling for UCX or MPI.

In the past, it was noted that SMP builds generally do not scale as well
across nodes as single-threaded non-SMP builds because the communication
thread became a bottleneck while occupying a core that could otherwise
have been used for computation.  However, SMP builds running on modern
CPUs having large core counts are just as fast, if not faster, than
single-threaded non-SMP builds, as long as the processes (ranks) are
aligned with the NUMA domains of the CPU, with one core reserved for
the communication thread.  Depending on the underlying system,
additional cores might need to be left unscheduled for the OS or for
system-level management of a GPU device.  In our experience, NAMD runs
fastest in SMT=1 mode (i.e., we recommend to not use hyperthreading if
available).  It is advisable to run benchmarks to determine optimal
core configuration.

Extremely short cycle lengths (less than 10 steps) will limit parallel
scaling, since the atom migration at the end of each cycle sends many
more messages than a normal force evaluation.  Increasing margin from
0 to 1 while doubling stepspercycle and pairlistspercycle may help,
but it is important to benchmark.  The pairlist distance will adjust
automatically, and one pairlist per ten steps is a good ratio.
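A sketch of the corresponding config changes (the starting values are
illustrative; benchmark before adopting):

```
# starting from a short cycle, e.g. stepspercycle 10 / pairlistspercycle 2:
margin 1
stepspercycle 20
pairlistspercycle 4
```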

NAMD should scale very well when the number of patches (the product of
the patch grid dimensions) is larger than or roughly the same as the
number of processors.  If this is not the case, it may be possible
to improve scaling by adding ``twoAwayX yes'' to the config file,
which roughly doubles the number of patches.  (Similar options
twoAwayY and twoAwayZ also exist, and may be used in combination,
but this greatly increases the number of compute objects.  twoAwayX
has the unique advantage of also improving the scalability of PME.)

Additional performance tuning suggestions and options are described
at http://www.ks.uiuc.edu/Research/namd/wiki/?NamdPerformanceTuning

----------------------------------------------------------------------

Endian Issues

Some architectures write binary data (integer or floating point) with
the most significant byte first; others put the most significant byte
last.  This doesn't affect text files, but it does matter when a binary
data file that was written on a "big-endian" machine (some POWER) is
read on a "little-endian" machine (Intel) or vice versa.

NAMD generates DCD trajectory files and binary coordinate and velocity
files which are "endian-sensitive".  While VMD can now read DCD files
from any machine and NAMD reads most other-endian binary restart files,
many analysis programs (like CHARMM or X-PLOR) require same-endian DCD
files.  We provide the programs flipdcd and flipbinpdb for switching the
endianness of DCD and binary restart files, respectively.  These programs
use mmap to alter the file in-place and may therefore appear to consume
an amount of memory equal to the size of the file.
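Typical usage looks like the following (the filenames are illustrative;
both utilities are built alongside namd3 and modify the file in place):

```shell
./flipdcd trajectory.dcd      # switch endianness of a DCD file
./flipbinpdb restart.coor     # switch endianness of a binary restart file
```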

----------------------------------------------------------------------