Parallel computing
Parallel computing is what HPC is really all about: processing things on more than one processor at once. By now, you should have read all of the previous tutorials.
Parallel programming models
Parallel programming is used to create programs that can execute instructions on multiple processors at the same time. Most of our users who run their programs in parallel rely on parallel execution features already built into those programs, and thus do not need to learn how to write parallel programs themselves. But even when one only runs existing programs in parallel, it is important to understand the different models of parallel execution.
The two main models are:
Shared memory (or multithreaded/multiprocess) programs run multiple independent workers on the same machine. As the name suggests, all of the computer’s memory has to be accessible to all of the processes. Thus programs that utilize this model should request one node, one task and multiple CPUs (see the minimal job script sketch below). The maximum number of workers is usually limited by the number of CPU cores available on the computational node. The code is easier to implement, and the same code can still be run in serial mode. Example applications that utilize this model: Matlab, R, Python multithreading/multiprocessing, OpenMP applications, BLAS libraries, FFTW libraries, typical multithreaded/multiprocess parallel desktop programs.
Message passing programs (using MPI, the Message Passing Interface) can run on multiple nodes interconnected with the network by passing data through MPI software libraries. Almost all large-scale scientific programs utilize MPI. MPI can scale to thousands of CPU cores, but depending on the case it can be harder to implement from the programmer’s point of view. Programs that utilize this model should request a single node or multiple nodes with multiple tasks each. You should not request multiple CPUs per task. Example applications that utilize this model: CP2K, GPAW, LAMMPS, OpenFoam.
Both models, MPI and shared memory, can be combined in one application; in this case we are talking about a hybrid parallel programming model. Programs that utilize this model can require both multiple tasks and multiple CPUs per task.
Most historical scientific code is MPI, but these days more and more people are using shared memory models.
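As a point of reference, a minimal shared-memory job script could look like the following sketch (the program name and the requested resources are placeholders; adapt them to your application):
#!/bin/bash
#SBATCH --time=00:10:00          # example runtime
#SBATCH --mem-per-cpu=500M       # example memory per CPU
#SBATCH --nodes=1                # shared memory: a single node
#SBATCH --ntasks=1               # a single task...
#SBATCH --cpus-per-task=4        # ...with multiple CPU cores

# OpenMP programs read the thread count from OMP_NUM_THREADS;
# slurm sets SLURM_CPUS_PER_TASK to the value requested above.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./my_threaded_program       # placeholder program name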
Important
Normal serial code can’t just be run in parallel without modifications. As a user, it is your responsibility to understand which parallel model, if any, your code implements.
When deciding whether using parallel programming is worth the effort, one should be mindful of Amdahl’s law and Gustafson’s law. All programs have some parts that can only be executed in serial and thus the theoretical speedup that one can get from using parallel programming depends on two factors:
How much of the program’s execution can be done in parallel?
What would be the speedup for that parallel part?
Thus, if your program runs mainly in serial but has a small parallel part, running it in parallel might not be worth the effort. Sometimes data parallelism with e.g. array jobs is a much more fruitful approach.
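As a rough, purely illustrative application of Amdahl’s law (the numbers are invented for this example): if a fraction p of the runtime can be parallelized and that part is sped up by a factor s, the overall speedup is
S = \frac{1}{(1 - p) + p / s}
so with, say, p = 0.9 and s = 16, the whole program speeds up only by a factor of about 6.4, not 16.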
Another important note regarding parallelism is that every application scales well only up to some upper limit, which depends on the application’s implementation, the size and type of the problem you solve, and other factors. The best practice is to benchmark your code on different numbers of CPU cores before you start actual production runs.
If you want to run a program in parallel, you have to know something about it - does it use shared memory or MPI? A program doesn’t magically get faster when you request more processors if it’s not designed to use them.
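If you are unsure about a compiled binary, one rough heuristic (only a sketch, and it works only for dynamically linked executables; interpreted programs such as Python or R will not reveal anything this way) is to check which runtime libraries it links against. The name your_program below is a placeholder:
# Heuristic only: MPI programs typically link against libmpi,
# OpenMP programs against libgomp (GCC) or libiomp5 (Intel).
ldd ./your_program | grep -Ei 'libmpi|libgomp|libiomp'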
Message passing programs: MPI
For compiling/running an MPI job one has to pick one of the MPI library suites. There are various MPI libraries that all implement the MPI standard. We recommend that you use either:
OpenMPI (e.g. openmpi/3.1.4)
Intel’s MPI (e.g. intel-parallel-studio/cluster.2020.0-intelmpi)
Some libraries/programs might already require a certain MPI version. If so, use that version, or ask the administrators to create a version of the library that depends on the MPI version you require.
Warning
Different versions of MPI are not compatible with each other. Code built with a certain MPI version will run correctly only with that same version. Thus, if you compile your code with a certain version of the library, you will need to load the same version when you run the code.
Also, the MPI libraries are usually linked against slurm and the network drivers. Thus, when slurm or driver versions are updated, some older versions of MPI might break. If you are still using such a version, let us know. If you are starting a new project, it is recommended to use one of the MPI libraries recommended above.
For basic use of MPI programs, you will need to use the -n N/--ntasks=N option to specify the number of MPI workers.
Running a typical MPI program
The following examples use hpc-examples from the previous exercises.
Loading the modules:
# GCC + OpenMPI
module load gcc/9.2.0 # GCC
module load openmpi/3.1.4 # OpenMPI
# Intel compilers + Intel's MPI
module load intel-parallel-studio/cluster.2019.3-intelmpi
Compiling the code (depending on module and language):
# OpenMPI
mpicc -O2 -g hello_mpi.c -o hello_mpi # C code
mpifort -O2 -g hello_mpi_fortran.f90 -o hello_mpi_fortran # Fortran code
# Intel MPI
mpiicc -O2 -g hello_mpi.c -o hello_mpi # C code
mpiifort -O2 -g hello_mpi_fortran.f90 -o hello_mpi_fortran # Fortran code
Running the program with srun (for testing):
srun --time=00:05:00 --mem-per-cpu=200M --ntasks=4 ./hello_mpi
Running the MPI code in batch mode:
#!/bin/bash
#SBATCH --time=00:05:00 # takes 5 minutes all together
#SBATCH --mem-per-cpu=200M # 200MB per process
#SBATCH --ntasks=4 # 4 processes
#SBATCH --constraint=avx # set constraint for processor architecture
module load openmpi/3.1.4 # NOTE: should be the same as you used to compile the code
srun ./hello_mpi
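To submit the job to the queue, save the script to a file (the name hello_mpi.sh below is only an example) and run:
sbatch hello_mpi.sh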
Triton has multiple CPU architectures (12, 20, 24, or 40 CPU cores per node). Even though Slurm optimizes resource usage and allocates CPUs within one node when possible, which gives better performance for the application, it still makes sense to set the architecture constraint explicitly.
Important
It is important to use srun when you launch your program. This allows the MPI libraries to obtain task placement information (nodes, number of tasks per node, etc.) from the slurm queue.
Spreading MPI workers evenly
In many cases you might require more than one node during your job’s runtime.
When this is the case, it is usually recommended to split the number of
workers somewhat evenly among the nodes. To do this, one can use
-N N/--nodes=N and --ntasks-per-node=n. For example, the previous example could be written as:
#!/bin/bash
#SBATCH --time=00:05:00 # takes 5 minutes all together
#SBATCH --mem-per-cpu=200M # 200MB per process
#SBATCH --nodes=2 # 2 nodes
#SBATCH --ntasks-per-node=2 # 2 processes per node * 2 nodes = 4 processes in total
#SBATCH --constraint=avx # set constraint for processor architecture
module load openmpi/3.1.4 # NOTE: should be the same as you used to compile the code
srun ./hello_mpi
This way the number of workers is distributed more evenly, which in turn reduces communication overhead between workers.
Monitoring performance
You can use the seff program (with a job ID) to list what percentage of the available processors and memory you used. If your processor usage is far below 100%, your code may not be working correctly in a parallel environment. If your memory usage is far below 100%, you are probably requesting more memory than the job needs.
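For example, with a made-up job ID:
seff 12345678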
Important
When making job reservations it is important to distinguish between requirements for the whole job (such as --mem) and requirements for each individual task/cpu (such as --mem-per-cpu). E.g. requesting --mem-per-cpu=2G with --ntasks=2 and --cpus-per-task=4 will create a total memory reservation of (2 tasks)*(4 cpus/task)*(2GB/cpu) = 16GB.
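As a sketch, the corresponding lines in a batch script would look like this (the total reservation being 2*4*2GB = 16GB):
#SBATCH --ntasks=2          # 2 tasks
#SBATCH --cpus-per-task=4   # 4 CPUs per task
#SBATCH --mem-per-cpu=2G    # 2GB per CPU -> 2*4*2GB = 16GB in total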
Exercises
Run srun --cpus-per-task=4 hostname, srun --ntasks=4 hostname, and srun --nodes=4 hostname. What’s the difference and why?
The following exercises use hpc-examples from the previous exercises:
Find the files hpc-examples/openmp/hello_omp/hello_omp.c and hpc-examples/hello_omp/hello_omp.slrm that have a short example of OpenMP. Compile and run it - a slurm script is included.
Find the files in hpc-examples/python/python_openmp. Try running the example with a few different --constraint=X and --cpus-per-task=C. In your opinion, what architecture / CPU number combination would provide the best efficiency? Use seff to verify.
Find the files hpc-examples/mpi/hello_mpi/hello_mpi.c and hpc-examples/mpi/hello_mpi/hello_mpi.slrm that have a short example of MPI. Compile and run it - a slurm script is included.
Next steps
See the next pages:
You can check the Running programs on Triton page for general reference information on running jobs.