NAMD
Introduction
http://www.ks.uiuc.edu/Development/
NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of processors on high-end parallel platforms and tens of processors on commodity clusters using gigabit ethernet.
Version 2.9
Linux-x86_64-multicore (64-bit Intel/AMD single node)
USAGE:
module load NAMD/2.9-multicore
Installation
A NAMD binary distribution need only be untarred or unzipped and can
be run directly in the resulting directory.
cd /sw/NAMD/2.9/multicore
tar -zxvf ../src/NAMD_2.9_Linux-x86_64-multicore.tar.gz
cd /sw/NAMD/2.9/multicore/NAMD_2.9_Linux-x86_64-multicore
mv * ../
cd /sw/NAMD/2.9/multicore
rmdir NAMD_2.9_Linux-x86_64-multicore
Sample PBS script
#!/bin/bash -l
#PBS -m abe
#PBS -M xxxxx@griffith.edu.au
#PBS -N NormalRunNAMD
#PBS -l walltime=01:19:00
#PBS -q workq
#PBS -l select=1:ncpus=1:mem=2gb
source $HOME/.bashrc
module load NAMD/2.9-multicore
######module load mpi/intel-4.0
echo "Starting job"
namd2 +idlepoll /export/home/s2174555/pbs/namd/apoa1.namd > apoa1.namd.log
echo "Done with job"
Version 2.8b1 (NVIDIA CUDA with InfiniBand)
Manual run:

module load mpi/intel-4.0
module load NAMD/NAMD28b1
mpirun -r ssh -f /export/home/s2594054/pbs/namd/apoa1/namd/mpd.hosts -n 1 namd2 +idlepoll apoa1.namd

cat /export/home/s2594054/pbs/namd/apoa1/namd/mpd.hosts
n020
n021
n022
n023
Please note that if you wish to run on 2 or more nodes, you need to change -n 1 above to -n 2 (or -n 3, -n 4). Currently we have 4 GPU-enabled nodes.
PBS run

Use: -l ngpus=2
sample 1
#!/bin/bash
#PBS -N cuda
#PBS -l ngpus=2
module load cuda/4.0
echo "Hello from $HOSTNAME: date = `date`"
nvcc --version
echo "Finished at `date`"
sample 2
#!/bin/bash -l
#PBS -m abe
#PBS -M emailaddress@griffith.edu.au
#PBS -N CudaJob
#PBS -q gpu
#PBS -l select=2:ncpus=2:mem=2gb:ngpus=2
source $HOME/.bashrc
module load NAMD/NAMD28b1
module load mpi/intel-4.0
echo "Starting job"
mpirun -r ssh -n 2 namd2 +idlepoll /export/home/s2594054/pbs/namd/apoa1/namd/apoa1.namd > apoa1.namd.log
echo "Done with job"
Please change the directory names in the above script to reflect your home directory.
qsub run.pbs
824.pbsserver

[s2594054@n027 namd]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
812.pbsserver    3nss             s2795116         00:00:02 R workq
813.pbsserver    1ivf_naen        s2795116         00:00:01 R workq
818.pbsserver    1ivf_apo         s2795116         00:00:01 R workq
819.pbsserver    1nn2             s2795116         00:00:00 R workq
821.pbsserver    1ivg             s2795116         00:00:00 R workq
824.pbsserver    CudaJob          s2594054         00:00:00 R gpu
Compiling NAMD

Module file
more NAMD28b1
#%Module######################################################################
##
## NAMD
##
proc ModulesHelp { } {
    puts stderr "Sets up environment to use NAMD 2.8b1"
}
set base /sw/NAMD
set vers NAMD28b1
set-alias namd2 $base/$vers/namd2
set-alias charmrun $base/$vers/charmrun
prepend-path PATH $base/$vers/
prepend-path LD_LIBRARY_PATH $base/$vers/
mpd.hosts file
cat /export/home/s2594054/pbs/namd/apoa1/namd/mpd.hosts
n020 n021 n022 n023
Release notes
+--------------------------------------------------------------------+
|                                                                    |
|                       NAMD 2.9 Release Notes                       |
|                                                                    |
+--------------------------------------------------------------------+

This file contains the following sections:
- Problems? Found a bug?
- Installing NAMD
- Running NAMD
- CPU Affinity
- CUDA GPU Acceleration
- Compiling NAMD
- Memory Usage
- Improving Parallel Scaling
- Endian Issues

----------------------------------------------------------------------

Problems? Found a bug?

1. Download and test the latest version of NAMD.

2. Please check NamdWiki, NAMD-L, and the rest of the NAMD web site resources at http://www.ks.uiuc.edu/Research/namd/.

3. For problems compiling or running NAMD please review the release notes, the Charm++ Installation and Usage Manual, and the NamdOn... pages at NamdWiki. If you do not understand the errors generated by your compiler, queueing system, ssh, or mpiexec you should seek assistance from a local expert familiar with your setup.

4. For questions about using NAMD please subscribe to NAMD-L and post your question there so that others may benefit from the discussion. Please avoid sending attachments to NAMD-L by posting any related files on your web site and including the location in your message. If you are not familiar with molecular dynamics simulations please work through the tutorials and seek assistance from a local expert.

5. Gather, in a single directory, all input and config files needed to reproduce your problem.

6. Run once, redirecting output to a file (if the problem occurs randomly do a short run and include additional log files showing the error).

7. Tar everything up (but not the namd2 or charmrun binaries) and compress it, e.g., "tar cf mydir.tar mydir; gzip mydir.tar". Please do not attach your files individually to your email message as this is error prone and tedious to extract. The only exception may be the log of your run showing any error messages.

8. For problems concerning compiling or using NAMD please consider subscribing to NAMD-L and posting your question there, and summarizing the resolution of your problem on NamdWiki so that others may benefit from your experience. Please avoid sending large attachments to NAMD-L by posting any related files on your web site or on BioCoRE (http://www.ks.uiuc.edu/Research/BioCoRE/, in BioFS for the NAMD public project) and including the location in your message.

9. For bug reports, mail namd@ks.uiuc.edu with:
   - A synopsis of the problem as the subject (not "HELP" or "URGENT").
   - The NAMD version, platform, and number of CPUs on which the problem occurs (to be very complete just copy the first 20 lines of output).
   - A description of the problematic behavior and any error messages.
   - Whether the problem is consistent or random.
   - A complete log of your run showing any error messages.
   - The URL of your compressed tar file on a web server.

10. We'll get back to you with further questions or suggestions. While we try to treat all of our users equally and respond to emails in a timely manner, other demands occasionally prevent us from doing so. Please understand that we must give our highest priority to crashes or incorrect results from the most recently released NAMD binaries.

----------------------------------------------------------------------

Installing NAMD

A NAMD binary distribution need only be untarred or unzipped and can be run directly in the resulting directory. When building from source code (see "Compiling NAMD" below), "make release" will generate a self-contained directory and .tar.gz or .zip archive that can be moved to the desired installation location. Windows and CUDA builds include Tcl .dll and CUDA .so files that must be in the dynamic library path.

----------------------------------------------------------------------

Running NAMD

NAMD runs on a variety of serial and parallel platforms.
While it is trivial to launch a serial program, a parallel program depends on a platform-specific library such as MPI to launch copies of itself on other nodes and to provide access to a high performance network such as Myrinet or InfiniBand if one is available.

For typical workstations (Windows, Linux, Mac OS X, or other Unix) with only ethernet networking (hopefully gigabit), NAMD uses the Charm++ native communications layer and the program charmrun to launch namd2 processes for parallel runs (either exclusively on the local machine with the ++local option or on other hosts as specified by a nodelist file). The namd2 binaries for these platforms can also be run directly (known as standalone mode) for single process runs.

-- Individual Windows, Linux, Mac OS X, or Other Unix Workstations --

Individual workstations use the same version of NAMD as workstation networks, but running NAMD is much easier. If your machine has only one processor core you can run the namd2 binary directly:

  namd2 <configfile>

Windows, Mac OS X (Intel), and Linux-x86_64-multicore released binaries are based on "multicore" builds of Charm++ that can run multiple threads. These multicore builds lack a network layer, so they can only be used on a single machine. For best performance use one thread per processor with the +p option:

  namd2 +p<procs> <configfile>

For other multiprocessor workstations the included charmrun program is needed to run multiple namd2 processes. The ++local option is also required to specify that only the local machine is being used:

  charmrun namd2 ++local +p<procs> <configfile>

You may need to specify the full path to the namd2 binary.

-- Windows Clusters and Workstation Networks --

The Win64-MPI version of NAMD runs on Windows HPC Server and should be launched as you would any other MPI program.
-- Linux Clusters with InfiniBand or Other High-Performance Networks --

Charm++ provides a special ibverbs network layer that uses InfiniBand networks directly through the OpenFabrics OFED ibverbs library. This avoids efficiency and portability issues associated with MPI. Look for pre-built ibverbs NAMD binaries or specify ibverbs when building Charm++.

Writing batch job scripts to run charmrun in a queueing system can be challenging. Since most clusters provide directions for using mpiexec to launch MPI jobs, charmrun provides a ++mpiexec option to use mpiexec to launch non-MPI binaries. If "mpiexec -np <procs> ..." is not sufficient to launch jobs on your cluster you will need to write an executable mympiexec script like the following from TACC:

  #!/bin/csh
  shift; shift; exec ibrun $*

The job is then launched (with full paths where needed) as:

  charmrun +p<procs> ++mpiexec ++remote-shell mympiexec namd2 <configfile>

For workstation clusters and other massively parallel machines with special high-performance networking, NAMD uses the system-provided MPI library (with a few exceptions) and standard system tools such as mpirun are used to launch jobs. Since MPI libraries are very often incompatible between versions, you will likely need to recompile NAMD and its underlying Charm++ libraries to use these machines in parallel (the provided non-MPI binaries should still work for serial runs). The provided charmrun program for these platforms is only a script that attempts to translate charmrun options into mpirun options, but due to the diversity of MPI libraries it often fails to work.

-- Linux or Other Unix Workstation Networks --

The same binaries used for individual workstations as described above (other than pure "multicore" builds and MPI builds) can be used with charmrun to run in parallel on a workstation network.
The only difference is that you must provide a "nodelist" file listing the machines where namd2 processes should run, for example:

  group main
  host brutus
  host romeo

The "group main" line defines the default machine list. Hosts brutus and romeo are the two machines on which to run the simulation. Note that charmrun may run on one of those machines, or charmrun may run on a third machine. All machines used for a simulation must be of the same type and have access to the same namd2 binary.

By default, the "rsh" command ("remsh" on HPUX) is used to start namd2 on each node specified in the nodelist file. You can change this via the CONV_RSH environment variable, i.e., to use ssh instead of rsh run "setenv CONV_RSH ssh" or add it to your login or batch script. You must be able to connect to each node via rsh/ssh without typing your password; this can be accomplished via a .rhosts file in your home directory, by an /etc/hosts.equiv file installed by your sysadmin, or by a .ssh/authorized_keys file in your home directory. You should confirm that you can run "ssh hostname pwd" (or "rsh hostname pwd") without typing a password before running NAMD. Contact your local sysadmin if you have difficulty setting this up. If you are unable to use rsh or ssh, then add "setenv CONV_DAEMON" to your script and run charmd (or charmd_faceless, which produces a log file) on every node.

You should now be able to try running NAMD as:

  charmrun namd2 +p<procs> <configfile>

If this fails or just hangs, try adding the ++verbose option to see more details of the startup process. You may need to specify the full path to the namd2 binary. Charmrun will start the number of processes specified by the +p option, cycling through the hosts in the nodelist file as many times as necessary. You may list multiprocessor machines multiple times in the nodelist file, once for each processor.
You may specify the nodelist file with the "++nodelist" option and the group (which defaults to "main") with the "++nodegroup" option. If you do not use "++nodelist" charmrun will first look for "nodelist" in your current directory and then ".nodelist" in your home directory.

Some automounters use a temporary mount directory which is prepended to the path returned by the pwd command. To run on multiple machines you must add a "++pathfix" option to your nodelist file. For example:

  group main ++pathfix /tmp_mnt /
  host alpha1
  host alpha2

There are many other options to charmrun and for the nodelist file. These are documented in the Charm++ Installation and Usage Manual available at http://charm.cs.uiuc.edu/manuals/, and a list of available charmrun options is printed by running charmrun without arguments.

If your workstation cluster is controlled by a queueing system you will need to build a nodelist file in your job script. For example, if your queueing system provides a $HOST_FILE environment variable:

  set NODES = `cat $HOST_FILE`
  set NODELIST = $TMPDIR/namd2.nodelist
  echo group main >! $NODELIST
  foreach node ( $NODES )
    echo host $node >> $NODELIST
  end
  @ NUMPROCS = 2 * $#NODES
  charmrun namd2 +p$NUMPROCS ++nodelist $NODELIST <configfile>

Note that $NUMPROCS is twice the number of nodes in this example. This is the case for dual-processor machines; for single-processor machines you would not multiply $#NODES by two. Note that these example scripts and the setenv command are for the csh or tcsh shells. They must be translated to work with sh or bash.

-- Shared-Memory and Network-Based Parallelism (SMP Builds) --

The Linux-x86_64-ibverbs-smp and Solaris-x86_64-smp released binaries are based on "smp" builds of Charm++ that can be used with multiple threads on either a single machine like a multicore build, or across a network. SMP builds combine multiple worker threads and an extra communication thread into a single process.
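Since the csh nodelist-building script above must be translated to work with sh or bash, here is a rough bash sketch. The $HOST_FILE contents are fabricated so the script is self-contained, and the final charmrun line is only echoed rather than executed:

```shell
#!/bin/bash
# Sketch: a bash translation of the csh nodelist-building script above.
# $HOST_FILE normally comes from the queueing system; here we fabricate
# one for illustration.
HOST_FILE=$(mktemp)
printf 'node01\nnode02\n' > "$HOST_FILE"

NODELIST=$(mktemp)
echo "group main" > "$NODELIST"
while read -r node; do
    echo "host $node" >> "$NODELIST"
done < "$HOST_FILE"

# Two PEs per node, as for the dual-processor machines in the example.
NUMPROCS=$(( 2 * $(wc -l < "$HOST_FILE") ))
echo "charmrun namd2 +p$NUMPROCS ++nodelist $NODELIST <configfile>"
```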
Since one core per process is used for the communication thread, SMP builds are typically slower than non-SMP builds. The advantage of SMP builds is that many data structures are shared among the threads, reducing the per-core memory footprint when scaling large simulations to large numbers of cores.

SMP builds launched with charmrun use +p to specify the total number of PEs (worker threads) and +ppn to specify the number of PEs per process. Thus, to run one process with one communication and three worker threads on each of four quad-core nodes one would specify:

  charmrun namd2 +p12 +ppn 3 <configfile>

For MPI-based SMP builds one would specify any mpiexec options needed for the required number of processes and pass +ppn to the NAMD binary as:

  mpiexec -np 4 namd2 +ppn 3 <configfile>

See the Cray XT directions below for a more complex example.

-- Cray XT --

You will need to load the GNU compiler module and build Charm++ for mpi-crayxt and NAMD for CRAY-XT-g++, then use the following command in your batch script:

  aprun -n $PBS_NNODES -cc cpu namd2

To reduce memory usage, build Charm++ mpi-crayxt-smp instead and use (assuming you have 12 cores per node on your machine):

  setenv MPICH_PTL_UNEX_EVENTS 100000
  @ NPROC = $PBS_NNODES / 12
  aprun -n $NPROC -N 1 -d 12 namd2 +ppn 11 +setcpuaffinity +pemap 1-11 +commap 0

For two processes per node (for possibly better scaling) use:

  setenv MPICH_PTL_UNEX_EVENTS 100000
  @ NPROC = $PBS_NNODES / 6
  aprun -n $NPROC -N 2 -d 6 namd2 +ppn 5 +setcpuaffinity \
    +pemap 1-8,10,11 +commap 0,9 ...

For three processes per node use:

  setenv MPICH_PTL_UNEX_EVENTS 100000
  @ NPROC = $PBS_NNODES / 4
  aprun -n $NPROC -N 3 -d 4 namd2 +ppn 3 +setcpuaffinity \
    +pemap 1-5,7,8,10,11 +commap 0,6,9 ...

The strange +pemap and +commap settings place the communication threads on cores that have been observed to have higher operating system loads.
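The +p/+ppn arithmetic above can be made explicit with a small helper (hypothetical, not part of NAMD): with one core per node reserved for the communication thread, +ppn is cores-per-node minus one and +p is nodes times +ppn.

```shell
#!/bin/bash
# Sketch: derive charmrun +p/+ppn values for an SMP run, reserving one
# core per node for the communication thread (hypothetical helper).
smp_launch_args() {   # usage: smp_launch_args <nodes> <cores_per_node>
    local nodes=$1 cores=$2
    local ppn=$(( cores - 1 ))   # worker threads (PEs) per process
    local p=$(( nodes * ppn ))   # total worker threads (PEs)
    echo "+p$p +ppn $ppn"
}
smp_launch_args 4 4    # the four quad-core nodes from the example
# -> +p12 +ppn 3
```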
-- SGI Altix UV --

Use Linux-x86_64-multicore and the following script to set CPU affinity:

  namd2 +setcpuaffinity `numactl --show | awk '/^physcpubind/ {printf \
    "+p%d +pemap %d",(NF-1),$2; for(i=3;i<=NF;++i){printf ",%d",$i}}'` ...

For runs on large numbers of cores (you will need to experiment) use the following to enable the Charm++ communication thread:

  namd2 +setcpuaffinity `numactl --show | awk '/^physcpubind/ {printf \
    "+p%d +pemap %d",(NF-2),$2; for(i=3;i<NF;++i){printf ",%d",$i}; \
    print " +commthread +commap",$NF}'`

-- IBM POWER Clusters --

Run the MPI version of NAMD as you would any POE program. The options and environment variables for poe are various and arcane, so you should consult your local documentation for recommended settings. As an example, to run on Blue Horizon one would specify:

  poe namd2 <configfile> -nodes <procs/8> -tasks_per_node 8

----------------------------------------------------------------------

CPU Affinity

NAMD may run faster on some machines if threads or processes are set to run on (or not run on) specific processor cores (or hardware threads). On Linux this can be done at the process level with the numactl utility, but NAMD provides its own options for assigning threads to cores. This feature is enabled by adding +setcpuaffinity to the namd2 command line, which by itself will cause NAMD (really the underlying Charm++ library) to assign threads/processes round-robin to available cores in the order they are numbered by the operating system. This may not be the fastest configuration if NAMD is running fewer threads than there are cores available and consecutively numbered cores share resources such as memory bandwidth or are hardware threads on the same physical core.
If needed, specific cores for the Charm++ PEs (processing elements) and communication threads (on all SMP builds, and on multicore builds when the +commthread option is specified) can be set by adding the +pemap and (if needed) +commap options with lists of core sets in the form "lower[-upper[:stride[.run]]][,...]". A single number identifies a particular core. Two numbers separated by a dash identify an inclusive range (lower bound and upper bound). If they are followed by a colon and another number (a stride), that range will be stepped through in increments of the stride. Within each stride, a dot followed by a run indicates how many consecutive cores to use from that starting point. For example, the sequence 0-8:2,16,20-24 includes cores 0, 2, 4, 6, 8, 16, 20, 21, 22, 23, 24. On a 4-way quad-core system three cores from each socket would be 0-15:4.3 if cores on the same chip are numbered consecutively. There is no need to repeat cores for each node in a run as they are reused in order.

For example, the IBM POWER7 has four hardware threads per core and the first thread can use all of the core's resources if the other threads are idle; threads 0 and 1 split the core if threads 2 and 3 are idle, but if either of threads 2 or 3 is active the core is split four ways. The fastest configuration of 32 threads or processes on a 128-thread 32-core machine is therefore "+setcpuaffinity +pemap 0-127:4". For 64 threads we need cores 0,1,4,5,8,9,... or 0-127:4.2. Running 4 processes with +ppn 31 would be "+setcpuaffinity +pemap 0-127:32.31 +commap 31-127:32".

For an Altix UV or other machines where the queueing system assigns cores to jobs this information must be obtained with numactl --show and passed to NAMD in order to set thread affinity (which will improve performance):

  namd2 +setcpuaffinity `numactl --show | awk '/^physcpubind/ {printf \
    "+p%d +pemap %d",(NF-1),$2; for(i=3;i<=NF;++i){printf ",%d",$i}}'` ...
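To sanity-check a +pemap or +commap string before submitting a job, the "lower[-upper[:stride[.run]]]" syntax described above can be expanded into an explicit core list with a small bash helper (hypothetical, not part of NAMD):

```shell
#!/bin/bash
# Sketch: expand a +pemap/+commap core-set spec into an explicit list.
# Hypothetical helper implementing "lower[-upper[:stride[.run]]][,...]".
expand_map() {
    local re='^([0-9]+)(-([0-9]+)(:([0-9]+)(\.([0-9]+))?)?)?$'
    local out="" part
    local IFS=','
    for part in $1; do
        [[ $part =~ $re ]] || { echo "bad spec: $part" >&2; return 1; }
        local lower=${BASH_REMATCH[1]}
        local upper=${BASH_REMATCH[3]:-$lower}
        local stride=${BASH_REMATCH[5]} run=${BASH_REMATCH[7]:-1}
        if [[ -z $stride ]]; then
            # no stride: use every core in the range
            stride=$(( upper - lower + 1 )); run=$stride
        fi
        local base r c
        for (( base=lower; base<=upper; base+=stride )); do
            for (( r=0; r<run; r++ )); do
                c=$(( base + r ))
                if (( c <= upper )); then out+="${out:+,}$c"; fi
            done
        done
    done
    echo "$out"
}
expand_map "0-8:2,16,20-24"   # the example from the notes above
# -> 0,2,4,6,8,16,20,21,22,23,24
```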
----------------------------------------------------------------------

CUDA GPU Acceleration

Energy evaluation is slower than calculating forces alone, and the loss is much greater in CUDA-accelerated builds. Therefore you should set outputEnergies to 100 or higher in the simulation config file. Some features are unavailable in CUDA builds, including alchemical free energy perturbation and the Lowe-Andersen thermostat. As this is a new feature you are encouraged to test all simulations before beginning production runs. Forces evaluated on the GPU differ slightly from a CPU-only calculation, an effect more visible in reported scalar pressure values than in energies.

To benefit from GPU acceleration you will need a CUDA build of NAMD and a recent high-end NVIDIA video card. CUDA builds will not function without a CUDA-capable GPU. You will also need to be running the NVIDIA Linux driver version 270.41.19 or newer (released Linux binaries are built with CUDA 4.0, but can be built with newer versions as well).

Finally, the libcudart.so.4 included with the binary (the one copied from the version of CUDA it was built with) must be in a directory in your LD_LIBRARY_PATH before any other libcudart.so libraries. For example, when running a multicore binary (recommended for a single machine):

  setenv LD_LIBRARY_PATH ".:$LD_LIBRARY_PATH"
  (or LD_LIBRARY_PATH=".:$LD_LIBRARY_PATH"; export LD_LIBRARY_PATH)
  ./namd2 +idlepoll +p4 <configfile>

When running CUDA NAMD always add +idlepoll to the command line. This is needed to poll the GPU for results rather than sleeping while idle.

Each namd2 thread can use only one GPU. Therefore you will need to run at least one thread for each GPU you want to use. Multiple threads can share a single GPU, usually with an increase in performance. NAMD will automatically distribute threads equally among the GPUs on a node.
Specific GPU device IDs can be requested via the +devices argument on the namd2 command line, for example:

  ./namd2 +idlepoll +p4 +devices 0,2 <configfile>

Devices are shared by consecutive threads in a process, so in the above example threads 0 and 1 will share device 0 and threads 2 and 3 will share device 2. Repeating a device will cause it to be assigned to multiple master threads, either in the same or different processes, which is advised against in general but may be faster in certain cases. In the above example one could specify +devices 0,2,0,2 to cause device 0 to be shared by threads 0 and 2, etc. When running on multiple nodes the +devices specification is applied to each physical node separately and there is no way to provide a unique list for each node.

GPUs of compute capability 1.0 are no longer supported and are ignored. GPUs with two or fewer multiprocessors are ignored unless specifically requested with +devices.

While charmrun with ++local will preserve LD_LIBRARY_PATH, normal charmrun does not. You can use charmrun ++runscript to add the namd2 directory to LD_LIBRARY_PATH with the following executable runscript:

  #!/bin/csh
  setenv LD_LIBRARY_PATH "${1:h}:$LD_LIBRARY_PATH"
  $*

For example:

  ./charmrun ++runscript ./runscript +p24 ./namd2 +idlepoll +ppn 3 <configfile>

An InfiniBand network is highly recommended when running CUDA-accelerated NAMD across multiple nodes. You will need either an ibverbs NAMD binary (available for download) or an MPI NAMD binary (must build Charm++ and NAMD as described below) to make use of the InfiniBand network. The use of SMP binaries is also recommended when running on multiple nodes, with one process per GPU and as many threads as available cores, reserving one core per process for the communication thread.

The CUDA (NVIDIA's graphics processor programming platform) code in NAMD is completely self-contained and does not use any of the CUDA support features in Charm++.
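The thread-to-device mapping described above can be illustrated with a small helper (hypothetical, not part of NAMD; it assumes consecutive threads split the +devices list evenly and in order, consistent with the +devices 0,2 and +devices 0,2,0,2 examples):

```shell
#!/bin/bash
# Sketch: illustrate how +p threads map onto a +devices list
# (hypothetical helper; assumes consecutive threads split the listed
# devices evenly and in order, as in the examples above).
assign_devices() {   # usage: assign_devices <nthreads> <dev,dev,...>
    local p=$1 i
    local IFS=','
    local devs=($2)
    for (( i=0; i<p; i++ )); do
        echo "thread $i -> device ${devs[i * ${#devs[@]} / p]}"
    done
}
assign_devices 4 0,2
```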
When building NAMD with CUDA support you should use the same Charm++ you would use for a non-CUDA build. Do NOT add the cuda option to the Charm++ build command line. The only changes to the build process needed are to add --with-cuda and possibly --cuda-prefix ... to the NAMD config command line.

----------------------------------------------------------------------

Compiling NAMD

Building a complete NAMD binary from source code requires working C and C++ compilers, Charm++/Converse, TCL, and FFTW. NAMD will compile without TCL or FFTW but certain features will be disabled. Fortunately, precompiled libraries are available from http://www.ks.uiuc.edu/Research/namd/libraries/. You may disable these options by specifying --without-tcl --without-fftw as options when you run the config script. Some files in arch may need editing to set the path to TCL and FFTW libraries correctly.

As an example, here is the build sequence for 64-bit Linux workstations:

Unpack NAMD and matching Charm++ source code and enter directory:

  tar xzf NAMD_2.9_Source.tar.gz
  cd NAMD_2.9_Source
  tar xf charm-6.4.0.tar
  cd charm-6.4.0

Build and test the Charm++/Converse library (multicore version):

  ./build charm++ multicore-linux64 --with-production
  cd multicore-linux64/tests/charm++/megatest
  make pgm
  ./pgm +p4    (multicore does not support multiple nodes)
  cd ../../../../..

Build and test the Charm++/Converse library (MPI version):

  env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --with-production
  cd mpi-linux-x86_64/tests/charm++/megatest
  make pgm
  mpirun -n 4 ./pgm    (run as any other MPI program on your cluster)
  cd ../../../../..
Download and install TCL and FFTW libraries:

  (cd to NAMD_2.9_Source if you're not already there)
  wget http://www.ks.uiuc.edu/Research/namd/libraries/fftw-linux-x86_64.tar.gz
  tar xzf fftw-linux-x86_64.tar.gz
  mv linux-x86_64 fftw
  wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64.tar.gz
  wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64-threaded.tar.gz
  tar xzf tcl8.5.9-linux-x86_64.tar.gz
  tar xzf tcl8.5.9-linux-x86_64-threaded.tar.gz
  mv tcl8.5.9-linux-x86_64 tcl
  mv tcl8.5.9-linux-x86_64-threaded tcl-threaded

Optionally edit various configuration files:

  (not needed if charm-6.4.0, fftw, and tcl are in NAMD_2.9_Source)
  vi Make.charm  (set CHARMBASE to full path to charm)
  vi arch/Linux-x86_64.fftw  (fix library name and path to files)
  vi arch/Linux-x86_64.tcl  (fix library version and path to TCL files)

Set up build directory and compile:

  multicore version: ./config Linux-x86_64-g++ --charm-arch multicore-linux64
  network version:   ./config Linux-x86_64-g++ --charm-arch net-linux-x86_64
  MPI version:       ./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
  cd Linux-x86_64-g++
  make   (or gmake -j4, which should run faster)

Quick tests using one and two processes (network version):

  (this is a 66-atom simulation so don't expect any speedup)
  ./namd2
  ./namd2 src/alanin
  ./charmrun ++local +p2 ./namd2
  ./charmrun ++local +p2 ./namd2 src/alanin
  (for MPI version, run namd2 binary as any other MPI executable)

Longer test using four processes:

  wget http://www.ks.uiuc.edu/Research/namd/utilities/apoa1.tar.gz
  tar xzf apoa1.tar.gz
  ./charmrun ++local +p4 ./namd2 apoa1/apoa1.namd
  (FFT optimization will take several seconds during the first run.)

That's it. A more complete explanation of the build process follows.

Note that you will need Cygwin to compile NAMD on Windows.

Download and unpack fftw and tcl libraries for your platform from http://www.ks.uiuc.edu/Research/namd/libraries/. Each tar file contains a directory with the name of the platform.
These libraries don't change very often, so you should find a permanent home for them.

Unpack the NAMD source code and the enclosed charm-6.4.0.tar archive. This version of Charm++ is the same one used to build the released binaries and is more likely to work and be bug-free than any other we know of. Edit Make.charm to point at .rootdir/charm-6.4.0 or the full path to the charm directory if you unpacked outside of the NAMD source directory.

Run the config script without arguments to list the available builds, which have names like Linux-x86_64-icc. Each build or "ARCH" is of the form BASEARCH-compiler, where BASEARCH is the most generic name for a platform, like Linux-x86_64. Note that many of the options that used to require editing files can now be set with options to the config script. Running the config script without arguments lists the available options as well.

Edit arch/BASEARCH.fftw and arch/BASEARCH.tcl to point to the libraries you downloaded.

Find a line something like "CHARMARCH = net-linux-x86_64-iccstatic" in arch/ARCH.arch to tell what Charm++ platform you need to build. The CHARMARCH name is of the format comm-OS-cpu-options-compiler. It is important that Charm++ and NAMD be built with the same C++ compiler. To change the CHARMARCH, just edit the .arch file or use the --charm-arch config option.

Enter the charm directory and run the build script without options to see a list of available platforms. Only the comm-OS-cpu part will be listed. Any options or compiler tags are listed separately and must be separated by spaces on the build command line. Run the build command for your platform as:

  ./build charm++ comm-OS-cpu options compiler --with-production

For this specific example:

  ./build charm++ net-linux-x86_64 tcp icc --with-production

Note that for MPI builds you normally do not need to specify a compiler, even if your mpicxx calls icc internally, but you will need to use an icc-based NAMD architecture specification.
The README distributed with Charm++ contains a complete explanation. You only actually need the bin, include, and lib subdirectories, so you can copy those elsewhere and delete the whole charm directory, but don't forget to edit Make.charm if you do this.

The CUDA (NVIDIA's graphics processor programming platform) code in NAMD is completely self-contained and does not use any of the CUDA support features in Charm++. When building NAMD with CUDA support you should use the same Charm++ you would use for a non-CUDA build. Do NOT add the cuda option to the Charm++ build command line. The only changes to the build process needed are to add --with-cuda and possibly --cuda-prefix ... to the NAMD config command line.

If you are building a non-smp, non-tcp version of net-linux with the Intel icc compiler you will need to disable optimization for some files to avoid crashes in the communication interrupt handler. The smp and tcp builds use polling instead of interrupts and therefore are not affected. Adding +netpoll to the namd2 command line also avoids the bug, but this option reduces performance in many cases. These commands recompile the necessary files without optimization:

  cd charm/net-linux-icc
  /bin/rm tmp/sockRoutines.o
  /bin/rm lib/libconv-cplus-*
  ( cd tmp; make charm++ OPTS="-O0" )

If you're building an MPI version you will probably want to build Charm++ with env MPICXX=mpicxx preceding ./build on the command line, since the default MPI C++ compiler is mpiCC. You may also need to change compiler flags or commands in the Charm++ src/arch directory. The file charm/src/arch/mpi-linux/conv-mach.sh contains the definitions that select the mpiCC compiler for mpi-linux, while other compiler choices are defined by files in charm/src/arch/common/.
If you want to run NAMD on InfiniBand one option is to build an ibverbs network version by specifying the "ibverbs" option, as in:

  ./build charm++ net-linux-x86_64 ibverbs icc --with-production

You would then change "net-linux-x86_64-icc" to "net-linux-x86_64-ibverbs-icc" in your namd2/arch/Linux-x86_64-icc.arch file (or create a new .arch file). Alternatively you could use an InfiniBand-aware MPI library.

Run make in charm/CHARMARCH/tests/charm++/megatest/ and run the resulting binary "pgm" as you would run NAMD on your platform. You should try running on several processors if possible. For example:

  cd net-linux-x86_64-ibverbs-icc/tests/charm++/megatest/
  make pgm
  ./charmrun +p16 ./pgm

If any of the tests fail then you will probably have problems with NAMD as well. You can continue and try building NAMD if you want, but when reporting problems please mention prominently that megatest failed, include the megatest output, and copy the Charm++ developers at ppl@cs.uiuc.edu on your email.

Now you can run the NAMD config script to set up a build directory:

  ./config ARCH

For this specific example:

  ./config Linux-x86-icc --charm-arch net-linux-tcp-icc

This will create a build directory Linux-x86-icc. If you wish to create this directory elsewhere use config DIR/ARCH, replacing DIR with the location the build directory should be created. A symbolic link to the remote directory will be created as well. You can create multiple build directories for the same ARCH by adding a suffix. These can be combined, of course, as in:

  ./config tcl fftw /tmp/Linux-x86-icc.test1

Now cd to your build directory and type make. The namd2 binary and a number of utilities will be created.

If you have trouble building NAMD your compiler may be different from ours. The architecture-specific makefiles in the arch directory use several options to elicit similar behavior on all platforms. Your compiler may conform to an earlier C++ specification than NAMD uses.
Your compiler may also enforce a later C++ rule than NAMD follows. You may ignore repeated warnings about new and delete matching.

The NAMD Wiki at http://www.ks.uiuc.edu/Research/namd/wiki/ has entries on building and running NAMD at various supercomputer centers (e.g., NamdAtTexas) and on various architectures (e.g., NamdOnMPICH). Please consider adding a page on your own porting effort for others to read.

----------------------------------------------------------------------

Memory Usage

NAMD has traditionally used less than 100MB of memory even for systems of 100,000 atoms. With the reintroduction of pairlists in NAMD 2.5, however, memory usage for a 100,000 atom system with a 12A cutoff can approach 300MB, and will grow with the cube of the cutoff. This extra memory is distributed across processors during a parallel run, but a single workstation may run out of physical memory with a large system.

To avoid this, NAMD now provides a pairlistMinProcs config file option that specifies the minimum number of processors that a run must use before pairlists will be enabled (on fewer processors small local pairlists are generated and recycled rather than being saved; the default is "pairlistMinProcs 1"). This is a per-simulation rather than a compile-time option because memory usage is molecule-dependent.

Additional information on reducing memory usage may be found at http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdMemoryReduction

----------------------------------------------------------------------

Improving Parallel Scaling

While NAMD is designed to be a scalable program, particularly for simulations of 100,000 atoms or more, at some point adding additional processors to a simulation will provide little or no extra performance. If you are lucky enough to have access to a parallel machine you should measure NAMD's parallel speedup for a variety of processor counts when running your particular simulation.
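For example, to keep saved pairlists disabled on small workstation runs while still enabling them for larger parallel runs, a NAMD configuration file could include a line like the following (the threshold of 8 is an arbitrary illustration, not a recommendation):

```
# enable saved pairlists only when running on 8 or more processors
pairlistMinProcs  8
```

Because the right threshold depends on the molecule and the cutoff, it is worth checking per-process memory use at a few processor counts before settling on a value.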
The easiest and most accurate way to do this is to look at the "Benchmark time:" lines that are printed after 20 and 25 cycles (usually less than 500 steps). You can monitor performance during the entire simulation by adding "outputTiming <steps>" to your configuration file, but be careful to look at the "wall time" rather than "CPU time" fields on the "TIMING:" output lines produced. For an external measure of performance, you should run simulations of both 25 and 50 cycles (see the stepspercycle parameter) and base your estimate on the additional time needed for the longer simulation in order to exclude startup costs and allow for initial load balancing.

Multicore builds scale well within a single node. On machines with more than 32 cores it may be necessary to add a communication thread and run on one fewer core than the machine has. On a 48-core machine this would be run as "namd2 +p47 +commthread". Performance may also benefit from setting CPU affinity using the +setcpuaffinity +pemap <map> +commap <map> options described in CPU Affinity above. Experimentation is needed.

We provide standard (UDP), TCP, and ibverbs (InfiniBand) precompiled binaries for Linux clusters. The TCP version may be faster on some networks but the UDP version now performs well on gigabit ethernet. The ibverbs version should be used on any cluster with InfiniBand, and for any other high-speed network you should compile an MPI version.

SMP builds generally do not scale as well across nodes as single-threaded non-SMP builds because the communication thread is both a bottleneck and occupies a core that could otherwise be used for computation. As such they should only be used to reduce memory consumption or if for scaling reasons you are not using all of the cores on a node anyway, and you should run benchmarks to determine the optimal configuration.
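The "one fewer core than the machine has" rule above can be scripted when launching a multicore build. In this sketch, nproc supplies the core count and the resulting namd2 command line is printed rather than executed; the config file name apoa1.namd is a placeholder:

```shell
# Reserve one core for the communication thread on large multicore nodes.
CORES=$(nproc)        # total cores on this node
PES=$((CORES - 1))    # worker threads: all cores but one
# Print the launch command (apoa1.namd is a placeholder config file).
echo "namd2 +p${PES} +commthread apoa1.namd > apoa1.log"
```

On a 48-core node this prints the "namd2 +p47 +commthread" invocation mentioned above; drop the echo to actually launch the run.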
Extremely short cycle lengths (less than 10 steps) will limit parallel scaling, since the atom migration at the end of each cycle sends many more messages than a normal force evaluation. Increasing margin from 0 to 1 while doubling stepspercycle and pairlistspercycle may help, but it is important to benchmark. The pairlist distance will adjust automatically, and one pairlist per ten steps is a good ratio.

NAMD should scale very well when the number of patches (the product of the patch grid dimensions) is larger than or roughly the same as the number of processors. If this is not the case, it may be possible to improve scaling by adding "twoAwayX yes" to the config file, which roughly doubles the number of patches. (Similar options twoAwayY and twoAwayZ also exist, and may be used in combination, but this greatly increases the number of compute objects. twoAwayX has the unique advantage of also improving the scalability of PME.)

Additional performance tuning suggestions and options are described at http://www.ks.uiuc.edu/Research/namd/wiki/?NamdPerformanceTuning

----------------------------------------------------------------------

Endian Issues

Some architectures write binary data (integer or floating point) with the most significant byte first; others put the most significant byte last. This doesn't affect text files but it does matter when a binary data file that was written on a "big-endian" machine (POWER, PowerPC) is read on a "little-endian" machine (Intel) or vice versa.

NAMD generates DCD trajectory files and binary coordinate and velocity files which are endian-sensitive. While VMD can now read DCD files from any machine and NAMD reads most other-endian binary restart files, many analysis programs (like CHARMM or X-PLOR) require same-endian DCD files. We provide the programs flipdcd and flipbinpdb for switching the endianness of DCD and binary restart files, respectively.
These programs use mmap to alter the file in-place and may therefore appear to consume an amount of memory equal to the size of the file. ----------------------------------------------------------------------
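Before reaching for flipdcd, a quick way to tell which endianness a DCD file was written with is to inspect its first 32-bit word: a DCD file begins with a Fortran record marker whose value is the integer 84, so reading a byte-swapped value indicates an other-endian file. This sketch fabricates the four header bytes rather than reading a real trajectory, so it is self-contained:

```shell
# A native little-endian DCD header starts with the 32-bit integer 84.
# Seeing 1409286144 (84 with its bytes swapped) instead means the file
# came from an opposite-endian machine and needs flipdcd.
printf '\x54\x00\x00\x00' > header.bin   # simulated little-endian header word
od -An -t d4 header.bin | tr -d ' '      # prints 84 on a little-endian host
```

To check a real file, replace header.bin with your trajectory and read only the first four bytes (e.g. via head -c 4).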