High performance containers
Overview
Teaching: 10 min
Exercises: 15 min
Questions
How can I run an MPI-enabled application in a container using a bind approach?
How can I run a GPU-enabled application in a container?
Objectives
Discuss the bind approach for containerised MPI applications, including its performance
Get started with containerised GPU applications
Run real-world MPI/GPU examples using OpenFoam and Gromacs
NOTE: the following hands-on session focuses on Singularity only.
Configure the MPI/interconnect bind approach
Before we start, let us cd to the openfoam example directory:
cd ~/sc20-tutorial/exercises/openfoam
Now, suppose you have an MPI installation on your host and a containerised MPI application built upon MPI libraries that are ABI compatible with the former.
For this tutorial, we do have MPICH installed on the host machine:
which mpirun
/usr/local/packages/e4s/spack/opt/spack/linux-centos7-x86_64/gcc-7.3.0/mpich-3.2.1-cvou3wu4yikwet64n2ddhkywl6pnsb4l/bin/mpirun
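Before setting up any binding, it can be worth double-checking the host MPI flavour and version, to confirm ABI compatibility with the container’s MPICH. A minimal check (the exact output format depends on the MPI installation):
mpirun --version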
and we’re going to pull an OpenFoam container, which was built on top of MPICH as well:
singularity pull library://marcodelapierre/beta/openfoam:v1812
OpenFoam comes with a collection of executables, one of which is simpleFoam. We can use the Linux command ldd to investigate the libraries that this executable links to. As simpleFoam links to a few tens of libraries, let’s specifically look for MPI (libmpi*) libraries in the command output:
singularity exec openfoam_v1812.sif bash -c 'ldd $(which simpleFoam) |grep libmpi'
libmpi.so.12 => /usr/lib/libmpi.so.12 (0x00007f73a729b000)
This is the container MPI installation that was used to build OpenFoam.
How do we setup a bind approach to make use of the host MPI installation?
We can make use of Singularity-specific environment variables to make these host libraries available in the container (see the location of MPICH from which mpirun above):
export SINGULARITY_BINDPATH=/usr/local/packages/e4s/spack
export SINGULARITYENV_LD_LIBRARY_PATH=/usr/local/packages/e4s/spack/opt/spack/linux-centos7-x86_64/gcc-7.3.0/mpich-3.2.1-cvou3wu4yikwet64n2ddhkywl6pnsb4l/lib
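Equivalently, for a one-off command, the bind path can be passed with the -B flag and the library path as an inline environment variable; a sketch using the same host paths as above:
SINGULARITYENV_LD_LIBRARY_PATH=/usr/local/packages/e4s/spack/opt/spack/linux-centos7-x86_64/gcc-7.3.0/mpich-3.2.1-cvou3wu4yikwet64n2ddhkywl6pnsb4l/lib \
  singularity exec -B /usr/local/packages/e4s/spack openfoam_v1812.sif bash -c 'ldd $(which simpleFoam) |grep libmpi'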
Now, if we inspect the simpleFoam dynamic linking again:
singularity exec openfoam_v1812.sif bash -c 'ldd $(which simpleFoam) |grep libmpi'
libmpi.so.12 => /usr/local/packages/e4s/spack/opt/spack/linux-centos7-x86_64/gcc-7.3.0/mpich-3.2.1-cvou3wu4yikwet64n2ddhkywl6pnsb4l/lib/libmpi.so.12 (0x00007fc608b0d000)
Now OpenFoam is picking up the host MPI libraries!
Note that, on an HPC cluster, the same mechanism can be used to expose the host interconnect libraries in the container, to achieve maximum communication performance.
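For instance, a hypothetical configuration might look like the following, where /opt/interconnect is a made-up path standing in for the actual location of the interconnect libraries on your system (your sys admins can provide the real paths):
# hypothetical paths: the interconnect library location is system-specific
export SINGULARITY_BINDPATH="/usr/local/packages/e4s/spack,/opt/interconnect"
export SINGULARITYENV_LD_LIBRARY_PATH="$SINGULARITYENV_LD_LIBRARY_PATH:/opt/interconnect/lib"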
Let’s run OpenFoam in a container!
To get a real feel for running an MPI application in a container, let’s run a practical example.
We’re using OpenFoam, a widely used package for Computational Fluid Dynamics simulations, which can scale on parallel architectures up to thousands of processes by leveraging an MPI library.
The sample inputs come straight from the OpenFoam installation tree, namely $FOAM_TUTORIALS/incompressible/pimpleFoam/LES/periodicHill/steadyState/.
Let’s execute the script in the current directory:
./mpirun.sh
This will take a few minutes to run. In the end, you will get the following output files/directories:
ls -ltr
total 1121572
-rwxr-xr-x. 1 tutorial livetau 1148433339 Nov 4 21:40 openfoam_v1812.sif
drwxr-xr-x. 2 tutorial livetau 59 Nov 4 21:57 0
-rw-r--r--. 1 tutorial livetau 798 Nov 4 21:57 slurm_pawsey.sh
-rwxr-xr-x. 1 tutorial livetau 843 Nov 4 21:57 mpirun.sh
-rwxr-xr-x. 1 tutorial livetau 197 Nov 4 21:57 clean-outputs.sh
-rwxr-xr-x. 1 tutorial livetau 1167 Nov 4 21:57 update-settings.sh
drwxr-xr-x. 2 tutorial livetau 141 Nov 4 21:57 system
drwxr-xr-x. 4 tutorial livetau 72 Nov 4 22:02 dynamicCode
drwxr-xr-x. 3 tutorial livetau 77 Nov 4 22:02 constant
-rw-r--r--. 1 tutorial livetau 3497 Nov 4 22:02 log.blockMesh
-rw-r--r--. 1 tutorial livetau 1941 Nov 4 22:03 log.topoSet
-rw-r--r--. 1 tutorial livetau 2304 Nov 4 22:03 log.decomposePar
drwxr-xr-x. 8 tutorial livetau 70 Nov 4 22:05 processor1
drwxr-xr-x. 8 tutorial livetau 70 Nov 4 22:05 processor0
-rw-r--r--. 1 tutorial livetau 18583 Nov 4 22:05 log.simpleFoam
drwxr-xr-x. 3 tutorial livetau 76 Nov 4 22:06 20
-rw-r--r--. 1 tutorial livetau 1533 Nov 4 22:06 log.reconstructPar
We ran using 2 MPI processes, which created outputs in the directories processor0 and processor1, respectively.
The final reconstruction creates results in the directory 20 (which stands for the 20th and last simulation step in this very short demo run), as well as the output file log.reconstructPar.
While execution proceeds, let’s ask ourselves: what does running Singularity with MPI look like in the script? Here’s the script we’re executing:
#!/bin/bash
NTASKS="2"
image="library://marcodelapierre/beta/openfoam:v1812"
# this configuration depends on the host
export MPICH_ROOT="/usr/local/packages/e4s/spack"
export MPICH_LIBS="$( which mpirun )"
export MPICH_LIBS="${MPICH_LIBS%/bin/mpirun*}/lib"
export SINGULARITY_BINDPATH="$MPICH_ROOT"
export SINGULARITYENV_LD_LIBRARY_PATH="$MPICH_LIBS"
# pre-processing
singularity exec $image \
blockMesh | tee log.blockMesh
singularity exec $image \
topoSet | tee log.topoSet
singularity exec $image \
decomposePar -fileHandler uncollated | tee log.decomposePar
# run OpenFoam with MPI
mpirun -n $NTASKS \
singularity exec $image \
simpleFoam -fileHandler uncollated -parallel | tee log.simpleFoam
# post-processing
singularity exec $image \
reconstructPar -latestTime -fileHandler uncollated | tee log.reconstructPar
In the beginning, the Singularity variables SINGULARITY_BINDPATH and SINGULARITYENV_LD_LIBRARY_PATH are defined to set up the bind approach for MPI.
Then, a series of OpenFoam commands is executed, only one of which runs in parallel:
mpirun -n $NTASKS \
singularity exec $image \
simpleFoam -fileHandler uncollated -parallel | tee log.simpleFoam
That’s as simple as prepending mpirun to the singularity command line, as for any other MPI application.
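In other words, the general pattern is:
mpirun -n <NTASKS> singularity exec <image> <mpi-application> <arguments>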
Singularity interface to Slurm
Now, have a look at the script variant for the Slurm scheduler, slurm_pawsey.sh:
srun -n $SLURM_NTASKS \
singularity exec $image \
simpleFoam -fileHandler uncollated -parallel | tee log.simpleFoam
The key difference is that every OpenFoam command is executed via srun, i.e. the Slurm wrapper for the MPI launcher mpirun. Other schedulers will require a different command.
In practice, all we had to do was to replace mpirun with srun, as for any other MPI application.
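For example, on a PBS system the same step might look like the following sketch (hypothetical: the exact launcher and the way to count tasks depend on the site configuration):
# hypothetical PBS variant: task count derived from the PBS node file
NTASKS="$(wc -l < "$PBS_NODEFILE")"
mpirun -n $NTASKS \
  singularity exec $image \
  simpleFoam -fileHandler uncollated -parallel | tee log.simpleFoam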
DEMO: Container vs bare-metal MPI performance
NOTE: this part was executed on a Pawsey HPC cluster. You can follow the outputs here.
The Pawsey Centre provides a set of MPI base images, which also ship with the OSU Benchmark Suite. Let’s use one to get a feel for the performance difference between using and not using the high-speed interconnect.
We’re going to run a small bandwidth benchmark using the image pawsey/mpich-base:3.1.4_ubuntu18.04. All of the required commands are in the script benchmark_pawsey.sh:
#!/bin/bash -l
#SBATCH --job-name=mpi
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:20:00
image="docker://pawsey/mpich-base:3.1.4_ubuntu18.04"
osu_dir="/usr/local/libexec/osu-micro-benchmarks/mpi"
# this configuration depends on the host
module unload xalt
module load singularity
# see that SINGULARITYENV_LD_LIBRARY_PATH is defined (host MPI/interconnect libraries)
echo $SINGULARITYENV_LD_LIBRARY_PATH
# 1st test, with host MPI/interconnect libraries
srun singularity exec $image \
$osu_dir/pt2pt/osu_bw -m 1024:1048576
# unset SINGULARITYENV_LD_LIBRARY_PATH
unset SINGULARITYENV_LD_LIBRARY_PATH
# 2nd test, without host MPI/interconnect libraries
srun singularity exec $image \
$osu_dir/pt2pt/osu_bw -m 1024:1048576
Basically, we’re running the test twice: the first time using the full bind approach configuration as provided by the singularity module on the cluster, and the second time after unsetting the variable that makes the host MPI/interconnect libraries available in containers.
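On the cluster, the script is submitted to Slurm in the usual way:
sbatch benchmark_pawsey.sh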
Here is the first output (using the interconnect):
# OSU MPI Bandwidth Test v5.4.1
# Size Bandwidth (MB/s)
1024 893.71
2048 1393.48
4096 2044.01
8192 2739.81
16384 2872.01
32768 2993.75
65536 3032.19
131072 3037.76
262144 4324.86
524288 6472.44
1048576 8372.80
And here is the second one:
# OSU MPI Bandwidth Test v5.4.1
# Size Bandwidth (MB/s)
1024 87.56
2048 101.33
4096 107.77
8192 112.01
16384 113.98
32768 116.21
65536 116.82
131072 116.95
262144 117.25
524288 117.39
1048576 117.46
You can see that for a 1 MB message the bandwidth is about 8.4 GB/s with the interconnect versus about 117 MB/s without, roughly a 70-fold difference in performance!
DEMO: Run a molecular dynamics simulation on a GPU with containers
NOTE: this part was executed on a Pawsey cluster with Nvidia GPUs. You can follow the outputs here.
For our example we are going to use Gromacs, a very popular molecular dynamics package, and one of those that have been optimised to run on GPUs through Nvidia containers.
To start, let us cd to the gromacs example directory:
cd ~/sc20-tutorial/exercises/gromacs
This directory contains sample input files picked from the collection of Gromacs benchmark examples. In particular, we’re going to use the subset water-cut1.0_GMX50_bare/1536/.
Now, from a Singularity perspective, all we need to do to run a GPU application on Nvidia GPUs from a container is to add the runtime flag --nv. This will make Singularity look for the Nvidia drivers on the host and mount them inside the container.
Then, on the host system side, the only requirement for running GPU applications through Singularity is the Nvidia driver for the relevant GPU card (the corresponding file is typically called libcuda.so.<VERSION> and is located in some library subdirectory of /usr).
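A quick sanity check for this setup is to run nvidia-smi from inside a container with the --nv flag; a minimal sketch, using a plain Ubuntu image for illustration (the output will depend on your GPU and driver):
singularity exec --nv docker://ubuntu:18.04 nvidia-smi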
Finally, GPU resources are usually made available on HPC systems through schedulers, to which Singularity natively and transparently interfaces. So, for instance, let us have a look at the Slurm batch script in the current directory, called gpu_pawsey.sh:
#!/bin/bash -l
#SBATCH --job-name=gpu
#SBATCH --partition=gpuq
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
image="docker://nvcr.io/hpc/gromacs:2018.2"
module load singularity
# uncompress configuration input file
if [ -e conf.gro.gz ] ; then
gunzip conf.gro.gz
fi
# run Gromacs preliminary step with container
srun singularity exec --nv $image \
gmx grompp -f pme.mdp
# Run Gromacs MD with container
srun singularity exec --nv $image \
gmx mdrun -ntmpi 1 -nb gpu -pin on -v -noconfout -nsteps 5000 -s topol.tpr -ntomp 1
Here, there are two key execution lines, which run a preliminary Gromacs job and the actual production job, respectively.
See how we have simply combined the Slurm command srun with singularity exec --nv <..> (similar to what we did in the episode on MPI):
srun singularity exec --nv $image gmx <..>
We can submit the script with:
sbatch gpu_pawsey.sh
A few files are produced, including the main output of the molecular dynamics run, md.log:
ls -ltr
total 139600
-rw-rw----+ 1 mdelapierre pawsey0001 664 Nov 5 14:07 topol.top
-rw-rw----+ 1 mdelapierre pawsey0001 950 Nov 5 14:07 rf.mdp
-rw-rw----+ 1 mdelapierre pawsey0001 939 Nov 5 14:07 pme.mdp
-rw-rw----+ 1 mdelapierre pawsey0001 556 Nov 5 14:07 gpu_pawsey.sh
-rw-rw----+ 1 mdelapierre pawsey0001 105984045 Nov 5 14:07 conf.gro
-rw-rw----+ 1 mdelapierre pawsey0001 11713 Nov 5 14:12 mdout.mdp
-rw-rw----+ 1 mdelapierre pawsey0001 36880760 Nov 5 14:12 topol.tpr
-rw-rw----+ 1 mdelapierre pawsey0001 9247 Nov 5 14:17 slurm-101713.out
-rw-rw----+ 1 mdelapierre pawsey0001 22768 Nov 5 14:17 md.log
-rw-rw----+ 1 mdelapierre pawsey0001 1152 Nov 5 14:17 ener.edr
Key Points
Appropriate Singularity environment variables can be used to configure the bind approach for MPI containers (sys admins can help); Shifter achieves this via a configuration file
Singularity and Shifter interface almost transparently with HPC schedulers such as Slurm
MPI performance of containerised applications almost coincides with that of a native run
You can run containerised GPU applications with Singularity using the flags --nv or --rocm for Nvidia or AMD GPUs, respectively