The containers can currently be found at /containers/dgx/Containers/lammps (referenced as ${SINGULARITY_BASE}/lammps in the script below).
The sample tests are all located at /groups/ORC-VAST/app-tests/lammps.
$ tree -rf /groups/ORC-VAST/app-tests/lammps/dgx/containerized
├── /groups/ORC-VAST/app-tests/lammps/dgx/containerized/10Feb2021
└── /groups/ORC-VAST/app-tests/lammps/dgx/containerized/29Oct2020
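To try one of these examples, copy the test directory for the container version you want into your own work area and run from the copy (the destination path below is just an example):

$ cp -r /groups/ORC-VAST/app-tests/lammps/dgx/containerized/10Feb2021 ~/lammps-test
$ cd ~/lammps-test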
Batch submission file
A typical batch submission file (named run.slurm) would look like this:
run.slurm
#!/bin/bash
#SBATCH --partition=gpuq               # the DGX only belongs in the 'gpu' partition
#SBATCH --qos=gpu                      # need to select 'gpu' QoS
#SBATCH --job-name=jlammps
#SBATCH --output=%x.%j
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1            # up to 128, but make sure ntasks x cpus-per-task <= 128
#SBATCH --cpus-per-task=1              # up to 128, but make sure ntasks x cpus-per-task <= 128
#SBATCH --gres=gpu:A100.40gb:1         # up to 8; only request what you need
#SBATCH --mem-per-cpu=35000M           # memory per CORE; total memory is 1 TB (1,000,000 MB)
# SBATCH --mail-user=user@inst.edu
# SBATCH --mail-type=ALL
#SBATCH --export=ALL
#SBATCH --time=0-04:00:00              # set to 4 hr; please choose carefully

set echo

#-----------------------------------------------------
# Example run from NVIDIA NGC
# https://ngc.nvidia.com/catalog/containers/hpc:lammps
# Please feel free to download and run it as follows
#-----------------------------------------------------

#-----------------------------------------------------
# Determine GPU and CPU resources to use
#-----------------------------------------------------
# parse out number of GPUs and CPU cores reserved for your job
env | grep -i slurm
GPU_COUNT=`echo $SLURM_JOB_GPUS | tr "," " " | wc -w`
N_CORES=${SLURM_NTASKS}

# Set OMP_NUM_THREADS
# please note that ntasks x cpus-per-task <= 128
if [ -n "$SLURM_CPUS_PER_TASK" ]; then
    OMP_THREADS=$SLURM_CPUS_PER_TASK
else
    OMP_THREADS=1
fi
export OMP_NUM_THREADS=$OMP_THREADS

#-----------------------------------------------------
# Set up MPI launching
#-----------------------------------------------------
# If parallel, launch with MPI
if [[ "${GPU_COUNT}" -gt 1 ]] || [[ ${SLURM_NTASKS} -gt 1 ]]; then
    MPI_LAUNCH="prun"
else
    MPI_LAUNCH=""
fi

#-----------------------------------------------------
# Set up container
#-----------------------------------------------------
SINGULARITY_BASE=/containers/dgx/Containers
CONTAINER=${SINGULARITY_BASE}/lammps/lammps_10Feb2021.sif
# Singularity will mount the host PWD to /host_pwd in the container
SINGULARITY_RUN="singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd"

#-----------------------------------------------------
# Run container
#-----------------------------------------------------
# Define input file and run
LMP_INPUT=in.lj.txt
LMP_OUTPUT=log-${GPU_COUNT}gpus-${SLURM_NTASKS}cores-${OMP_NUM_THREADS}thr_percore.lammps

echo "Running Lennard Jones 16x8x16 example on ${GPU_COUNT} GPUS..."
${MPI_LAUNCH} ${SINGULARITY_RUN} ${CONTAINER} lmp \
    -k on g ${GPU_COUNT} \
    -sf kk \
    -pk kokkos cuda/aware on neigh full comm device binsize 2.8 \
    -var x 16 \
    -var y 8 \
    -var z 16 \
    -in ${LMP_INPUT} \
    -log ${LMP_OUTPUT}
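With run.slurm and the input file in the same directory, the job is submitted and monitored with the standard Slurm commands; the output file name follows the --output=%x.%j pattern set above (job name plus job ID):

$ sbatch run.slurm
$ squeue -u $USER
$ tail -f jlammps.<jobid>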
Input files
You would need this input file to run the test. You can copy it from the sample-test directories listed above, or create it yourself with the contents below.
in.lj.txt
# 3d Lennard-Jones melt

variable     x index 1
variable     y index 1
variable     z index 1

variable     xx equal 20*$x
variable     yy equal 20*$y
variable     zz equal 20*$z

units        lj
atom_style   atomic

lattice      fcc 0.8442
region       box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box   1 box
create_atoms 1 box
mass         1 1.0

velocity     all create 1.44 87287 loop geom

pair_style   lj/cut 2.5
pair_coeff   1 1 1.0 1.0 2.5

neighbor     0.3 bin
neigh_modify delay 0 every 20 check no

fix          1 all nve

run          100
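The x, y, and z index variables in this input are overridden at run time by the -var flags in run.slurm (16, 8, and 16 above); each one multiplies the 20-unit-cell base dimensions xx, yy, and zz. To benchmark a different problem size or GPU count, only the batch script needs to change. As a sketch (the values here are purely illustrative), a 4-GPU run of a larger box could use:

#SBATCH --gres=gpu:A100.40gb:4         # request 4 GPUs instead of 1
#SBATCH --ntasks-per-node=4            # with Kokkos, one MPI task per GPU is the usual choice

${MPI_LAUNCH} ${SINGULARITY_RUN} ${CONTAINER} lmp \
    -k on g ${GPU_COUNT} \
    -sf kk \
    -pk kokkos cuda/aware on neigh full comm device binsize 2.8 \
    -var x 32 -var y 16 -var z 32 \
    -in ${LMP_INPUT} -log ${LMP_OUTPUT}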
Benchmarks
For this particular example, the benchmarks indicate that the code scales well up to 4 GPUs. The performance of the latest container (10Feb2021) is also marginally better than that of the previous one (29Oct2020).
It is always informative to see how much GPU acceleration speeds up calculations. For that purpose, we compared the above benchmarks with runs on CPU-only nodes. In summary:
GPU-optimized LAMMPS container runs very well on our DGX A100
The GPU code scales well with the number of GPUs used, but the scaling depends heavily on the size of the simulation
The two GPU-accelerated containers we tested perform about the same
1 NVIDIA A100 GPU performs as well as 9-10 nodes (dual Intel Cascade Lake CPUs with 48 cores per server) combined
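The throughput numbers behind comparisons like these come straight from the LAMMPS log files written by each run (the LMP_OUTPUT files defined in run.slurm). LAMMPS prints a Performance line (tau/day and timesteps/s for Lennard-Jones units) at the end of every run, so the results can be tabulated with a simple grep:

$ grep Performance log-*gpus-*cores-*thr_percore.lammps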
Native applications built with the GNU9+OpenMPI4 and Intel20+IMPI20 toolchains perform equally well. However, the way those jobs are launched differs slightly, so users are encouraged to see the examples at /groups/ORC-VAST/app-tests/lammps.
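For orientation only, a native (non-containerized) run usually loads the toolchain and LAMMPS modules and hands the binary to the MPI launcher directly; the module names below are placeholders, so please check the examples in the directory above for the exact ones used on this system:

$ module load gnu9 openmpi4 lammps     # placeholder module names
$ prun lmp -in in.lj.txt -log log.lammps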