Shared memory parallelism: multithreading & multiprocessing

Videos

Videos covering this topic may be available in one of our kickstart course playlists: 2024, 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

Abstract

  • Verify that your program can use multiple CPUs.

  • Use --cpus-per-task=C to reserve C CPUs for your job.

  • In your program, set the number of CPUs used to match the number of CPUs requested. You can use the $SLURM_CPUS_PER_TASK environment variable to get this number dynamically.

  • You must always monitor jobs to make sure they are using all the resources you request (seff JOBID).

  • If you aren’t fully sure of how to scale up, contact our Research Software Engineers early.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

Shared-memory parallelism and multiprocessing let you scale up to the size of one node. For purposes of illustration, the picture isn’t true to life: we call the whole stack one node, but in reality each node is one of the rows.

What is shared memory parallelism?

In shared memory parallelism a program will launch multiple processes or threads so that it can leverage multiple CPUs available in the machine.

Slurm reservations for both methods behave similarly. This document will talk about processes, but everything mentioned applies to threads as well. See the section on multithreaded vs. multiprocess execution for more information on how processes and threads differ.

Communication between processes happens via shared memory. This means that all processes need to run on the same machine.

[Image: parallel-shared.svg]

Depending on the program, you might have multiple processes (Matlab parallel pool, R's parallel library, Python's multiprocessing) or multiple threads (e.g. OpenMP threads in the BLAS libraries that R/NumPy use).

Running a typical multiprocess program

Reserving resources for shared memory programs

Reserving resources for shared memory programs is easy: you only need to specify how many CPUs you want via the --cpus-per-task=C flag.

For most programs using --mem=M is the correct way to reserve memory, but in some cases the amount of memory needed scales with the number of processors. This might happen, for example, if each process opens a different dataset or runs its own simulation that needs extra memory. In these cases you can use --mem-per-cpu=M to specify a memory allocation that scales with the number of CPUs. We recommend starting with --mem=M if you do not know how your program's memory use scales.
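
For illustration, here is a minimal sketch of the two styles of memory request in an #SBATCH header (the CPU and memory numbers are placeholders, not recommendations):

# Alternative 1: fixed total memory for the whole job (a good starting point)
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G            # 8 GB in total, regardless of the CPU count

# Alternative 2: memory that scales with the number of CPUs
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G    # 4 CPUs * 2 GB/CPU = 8 GB in total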

Running an example shared memory parallel program

For this example, let’s again use the slurm/pi.py script, which estimates pi with a Monte Carlo method.

To utilize the parallelism written into the script, you can run it with the arguments python3 slurm/pi.py --nprocs=C N, where N is the number of iterations to be done by the algorithm and C is the number of processors to be used for the parallel calculation.

Let’s run the program with two processes using srun:

$ srun --cpus-per-task=2 --time=00:10:00 --mem=1G python3 slurm/pi.py --nprocs=2 1000000

It is vitally important to notice that the program needs to be told the number of processes it should use. The program does not obtain this information from the queue system automatically. If the program does not know how many CPUs to use, it might try to use too many or too few. For more information, see the section on CPU over- and undersubscription.

When using a Slurm script, giving the number of CPUs to the program becomes easier:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=1G
#SBATCH --output=pi.out
#SBATCH --cpus-per-task=2

srun python3 slurm/pi.py --nprocs=$SLURM_CPUS_PER_TASK 1000000

Let’s call this script pi-sharedmemory.sh. You can submit it with:

$ sbatch pi-sharedmemory.sh

The environment variable $SLURM_CPUS_PER_TASK is set at runtime based on the number of CPUs requested with --cpus-per-task. For more tricks on how to set the number of processors, see the section on using it effectively.

Special cases and common pitfalls

Monitoring CPU utilization for parallel programs

You can use seff JOBID to see what percentage of the requested CPUs and RAM was utilized. Example output is given below:

$ seff 60985042
Job ID: 60985042
Cluster: triton
User/Group: tuomiss1/tuomiss1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:29
CPU Efficiency: 90.62% of 00:00:32 core-walltime
Job Wall-clock time: 00:00:16
Memory Utilized: 1.59 MB
Memory Efficiency: 0.08% of 2.00 GB

If your processor usage is far below 100%, your code may not be using all the CPUs you requested. If your memory usage is far below 100% or above 100%, you might have a problem with your memory request. You should set the requested memory to be a bit above the amount of memory the job actually used.

You can also monitor individual job steps by calling seff with the syntax seff JOBID.JOBSTEP.

Important

When making job reservations it is important to distinguish between requirements for the whole job (such as --mem) and requirements for each individual task/CPU (such as --mem-per-cpu). For example, requesting --mem-per-cpu=2G with --ntasks=2 and --cpus-per-task=4 will create a total memory reservation of (2 tasks) * (4 CPUs/task) * (2 GB/CPU) = 16 GB.
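
As a concrete sketch of this calculation, a job script header like the following (placeholder values) would reserve 16 GB in total:

#!/bin/bash
#SBATCH --ntasks=2          # 2 tasks
#SBATCH --cpus-per-task=4   # 4 CPUs per task
#SBATCH --mem-per-cpu=2G    # 2 GB per CPU: 2 * 4 * 2 GB = 16 GB in total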

Multithreaded vs. multiprocess and double-booking

Processes are individual program executions, while threads are smaller units of execution within a process. Processes have their own memory allocations and can work independently of the main process. Threads, on the other hand, run within the main process and share its memory. Processes are slower to launch, but due to their independent nature they won’t block each other’s execution as easily as threads can.

Some programs can utilize both multithread and multiprocess parallelism. For example, R has the parallel library for running multiple processes, while the BLAS libraries that R uses can utilize multiple threads.

When running a program that can parallelise both through processes and through threads, you should check carefully that only one method of parallelisation is in effect.

Using both can result in double-booked parallelism where you launch \(N\) processes and each process launches \(N\) threads, which results in \(N^2\) threads. This will usually tank the performance of the code as the CPUs are overbooked.

Threading is often done in a lower-level library that has been implemented using OpenMP. If you encounter bad performance or see a huge number of threads appearing when you use parallel processes, try setting export OMP_NUM_THREADS=1 in your Slurm script.
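
A minimal sketch of where such a setting could go in a job script (reusing the pi.py example from above purely for illustration):

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=1G
#SBATCH --cpus-per-task=4

# Limit lower-level OpenMP/BLAS threading so that only the
# process-level parallelism uses the reserved CPUs
export OMP_NUM_THREADS=1

srun python3 slurm/pi.py --nprocs=$SLURM_CPUS_PER_TASK 1000000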

Over- and undersubscription of CPUs

The number of threads/processes you launch should match the number of requested processors. If you launch fewer, you will not utilize all the CPUs. If you launch more, you will oversubscribe the CPUs and the code will run slower, as the different threads/processes will have to swap in and out of the CPUs and compete for the same resources.

Using threads and processes at the same time can also result in double-booking.

Using $SLURM_CPUS_PER_TASK is the best way of letting your program know how many CPUs it should use. See the section on using it effectively for more information.

Using SLURM_CPUS_PER_TASK effectively

The environment variable $SLURM_CPUS_PER_TASK can be utilized in multiple ways in your scripts. Below are a few examples:

  • Setting a number of workers when $SLURM_CPUS_PER_TASK is not set:

    $SLURM_CPUS_PER_TASK is only set when --cpus-per-task has been specified. If you want to run the same code both on your own machine and on the cluster, it can be useful to set a variable like export NCORES=${SLURM_CPUS_PER_TASK:-4} and use that in your scripts (a job-script sketch using this trick is shown after this list).

    Here $NCORES is set to the number specified by $SLURM_CPUS_PER_TASK if it has been set. Otherwise, it will be set to 4 via Bash’s syntax for setting default values for unset variables.

  • In Python you can use the following for obtaining the environment variable:

    import os
    
    ncpus=int(os.environ.get("SLURM_CPUS_PER_TASK", 1))
    

    For more information on parallelisation in Python see our Python documentation.

  • In R you can use the following for obtaining the environment variable:

    ncpus <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset=1))
    

    For more information on parallelisation in R see our R documentation.

  • In Matlab you can use the following for obtaining the environment variable:

    ncpus=str2num(getenv("SLURM_CPUS_PER_TASK"))
    

    For more information on parallelisation in Matlab see our Matlab documentation.
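
To illustrate the default-value trick from the first bullet above, here is a minimal job-script sketch (the resource values and the fallback of 4 workers are arbitrary placeholders):

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=1G
#SBATCH --cpus-per-task=2

# Use the CPU count given by Slurm if it is set; otherwise (e.g. on
# your own machine) fall back to an arbitrary default of 4
export NCORES=${SLURM_CPUS_PER_TASK:-4}

python3 slurm/pi.py --nprocs=$NCORES 1000000

When the same script is run directly on your own machine, the #SBATCH lines are ignored as ordinary comments and the fallback value is used.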

Asking for multiple tasks when code does not use MPI

Normally you should not use --ntasks=n when you want to run shared memory codes. The number of tasks is only relevant for MPI codes, and by specifying it you might launch multiple copies of your program that all compete for the reserved CPUs.

Only hybrid parallelization codes should have both --ntasks=n and --cpus-per-task=C set to be greater than one.
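
To make the distinction concrete, here is a hedged sketch of the two kinds of resource requests (the numbers are placeholders):

# Shared-memory (multithreaded/multiprocess) program: one task, many CPUs
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# Hybrid MPI + shared-memory program: several tasks, each with several CPUs
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2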

Exercises

The scripts you need for the following exercises can be found in our hpc-examples repository, which we discussed in Using the cluster from a command line (section Copy your code to the cluster). You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. Doing this creates a local copy of the repository in your current working directory. This repository will be used for most of the tutorial exercises.

Shared memory parallelism 1: Test the example’s scaling

Run the example with a bigger number of trials (100000000 or \(10^{8}\)) and with 1, 2 and 4 CPUs. Check the running time and CPU utilization for each run.
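
One possible way to do this (the time and memory requests are rough guesses that should be more than enough for this example):

$ srun --cpus-per-task=1 --time=00:30:00 --mem=1G python3 slurm/pi.py --nprocs=1 100000000
$ srun --cpus-per-task=2 --time=00:30:00 --mem=1G python3 slurm/pi.py --nprocs=2 100000000
$ srun --cpus-per-task=4 --time=00:30:00 --mem=1G python3 slurm/pi.py --nprocs=4 100000000

Afterwards, you can check each run with seff JOBID.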

Shared memory parallelism 2: Test scaling for a program that has a serial part

pi.py can be called with an argument --serial=0.1 to run a fraction of the trials in a serial fashion (here, 10%).

Run the example with a bigger number of trials (100000000 or \(10^{8}\)), 4 CPUs and a varying serial fraction (0.1, 0.5, 0.8). Check the running time and CPU utilization for each run.

Shared memory parallelism 3: More parallel \(\neq\) fastest solution

pi.py can be called with an argument --optimized to run an optimized version of the code that utilizes NumPy for vectorized calculations.

Run the example with a bigger number of trials (100000000 or \(10^{8}\)) and with 4 CPUs. Now run the optimized example with the same number of trials and with 1 CPU. Check the CPU utilization and running time for each run.

Shared memory parallelism 4: Your program

Think of your program. Do you think it can use shared-memory parallelism?

If you do not know, you can check the program’s documentation for words such as:

  • nprocs

  • nworkers

  • num_workers

  • njobs

  • OpenMP

These usually point towards some method of shared-memory parallel execution.

What’s next?

The next tutorial is about MPI parallelism.