GPU computing

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

Abstract

Request a GPU with the Slurm option --gres=gpu:1 or --gpus=1 (some clusters need -p gpu or similar).
Select a certain type of GPU with e.g. --constraint='volta' (see the quick reference for names).
Monitor GPU performance with sacct -j JOBID -o comment -p.
For development, run jobs of 4 hours or less, and they can run quickly in the gpushort queue.
If you aren’t fully sure of how to scale up, contact us Research Software Engineers early.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson. — GPU nodes allow specialized types of work to be done massively in parallel.

What are GPUs and how do they parallelise calculations?

GPUs, short for graphical processing unit, are massively-parallel processors that are optimized to perform numerical calculations in parallel. Due to this specialisation GPUs can be substantially faster than CPUs when solving suitable problems.

GPUs are especially handy when dealing with matrices and vectors. This has allowed GPUs to become an indispensable tool in many research fields such as deep learning, where most of the calculations involve matrices.

The programs we normally write in common programming languages, e.g. C++ are executed by the CPU. To run a part of that program in a GPU the program must do the following:

Specify a piece of code called a kernel, which contains the GPU part of the program and is compiled for the specific GPU architecture in use.
Transfer the data needed by the program from the RAM to GPU VRAM.
Execute the kernel on the GPU.
Transfer the results from GPU VRAM to RAM.

To help with this procedure special APIs (application programming interfaces) have been created. An example of such an API is CUDA toolkit, which is the native programming interface for NVIDIA GPUs.

On Triton, we have a large number of NVIDIA GPU cards from different generations and a single machine with AMD GPU cards. Triton GPUs are not the typical desktop GPUs, but specialized research-grade server GPUs with large memory, high bandwidth and specialized instructions. For scientific purposes, they generally outperform the best desktop GPUs.

Running a typical GPU program

Reserving resources for GPU programs

Slurm keeps track of the GPU resources as generic resources (GRES) or trackable resources (TRES). They are basically limited resources that you can request in addition to normal resources such as CPUs and RAM.

To request GPUs on Slurm, you should use the --gres=gpu:1 or --gpus=1 -flags.

You can also use syntax --gres=gpu:GPU_TYPE:1, where GPU_TYPE is a name chosen by the admins for the GPU. For example, --gres=gpu:v100:1 would give you a V100 card. See section on reserving specific GPU architectures for more information.

You can request more than one GPU with --gres=gpu:G, where G is the number of the requested GPUs.

Some GPUs are placed in a quick debugging queue. See section on reserving quick debugging resources for more information.

Note

Most GPU programs cannot utilize more than one GPU at a time. Before trying to reserve multiple GPUs you should verify that your code can utilize them.

Running an example program that utilizes GPU

The scripts you need for the following exercises can be found in our hpc-examples, which we discussed in Using the cluster from a shell. You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. Doing this creates you a local copy of the repository in your current working directory. This repository will be used for most of the tutorial exercises.

For this example, let’s consider pi-gpu.cu in the slurm-folder. It estimates pi with Monte Carlo methods and can utilize a GPU for calculating the trials.

The script is in the slurm-folder. This example is written in C++ and CUDA. Thus it needs to be compiled before it can be run.

To compile CUDA-based code for GPUs, lets load a cuda-module and a newer compiler:

module load gcc/8.4.0 cuda

Now we should have a compiler and a CUDA toolkit loaded. After this we can compile the code with:

nvcc -arch=sm_60 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 -o pi-gpu pi-gpu.cu

This monstrosity of a command is written like this because we want our code to be able run on multiple different GPU architectures. For more information, see section on setting compilation flags for GPU architectures.

Now we can run the program using srun:

srun --time=00:10:00 --mem=500M --gres=gpu:1 ./pi-gpu 1000000

This worked because we had the correct modules already loaded. Using a slurm script setting the requirements and loading the correct modules becomes easier:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=500M
#SBATCH --output=pi-gpu.out
#SBATCH --gres=gpu:1

module load gcc/8.4.0 cuda
./pi-gpu 1000000

Note

If you encounter problems with CUDA libraries, see the section on missing CUDA libraries.

Special cases and common pitfalls

Monitoring efficient use of GPUs

When running a GPU job, you should check that the GPU is being fully utilized.

When your job has started, you can ssh to the node and run nvidia-smi. It should be close to 100%.

Once the job has finished, you can use slurm history to obtain the jobID and run:

$ sacct -j JOBID -o comment -p
{"gpu_util": 99.0, "gpu_mem_max": 1279.0, "gpu_power": 204.26, "ncpu": 1, "ngpu": 1}|

This also shows the GPU utilization.

If the GPU utilization of your job is low, you should check whether its CPU utilization is close to 100% with seff JOBID. Having a high CPU utilization and a low GPU utilization can indicate that the CPUs are trying to keep the GPU occupied with calculations, but the workload is too much for the CPUs and thus GPUs are not constantly working.

Increasing the number of CPUs you request can help, especially in tasks that involve data loading or preprocessing, but your program must know how to utilize the CPUs.

However, you shouldn’t request too many CPUs: There wouldn’t be enough CPUs for everyone to use the GPUs and they would go to waste (all of our nodes have 4-12 CPUs for each GPU).

Reserving specific GPU types

You can restrict yourself to a certain type of GPU card by using using the --constraint option. For example, to restrict the submission to Pascal generation GPUs only you can use --constraint='pascal'.

For choosing between multiple generations, you can use the |-character between generations. For example, if you want to restrict the submission Volta or Ampere generations you can use --constraint='volta|ampere'. Remember to use the quotes since | is the shell pipe.

To see what GPU resources are available, run slurm features or sinfo -o '%50N %18F %26f %30G'.

Alternative way is to use syntax --gres=gpu:GPU_TYPE:1, where GPU_TYPE is a name chosen by the admins for the GPU. For example, --gres=gpu:v100:1 would give you a V100 card.

Reserving resources from the short job queue for quick debugging

There is a gpushort partition with a time limit of 4 hours that often has space (like with other partitions, this is automatically selected for short jobs). As of early 2022, it has four Tesla P100 cards in it (view with slurm partitions | grep gpushort). If you are doing testing and development and these GPUs meet your needs, you may be able to test much faster here. Use -p gpushort for this.

CUDA libraries not found

If you ever get libcuda.so.1: cannot open shared object file: No such file or directory, this means you are attempting to use a CUDA program on a node without a GPU. This especially happens if you try to test a GPU code on the login node.

Another problem that might occur is when a program will try to use pre-compiled kernels, but the corresponding CUDA toolkit is not available.

This might happen in you have used a cuda-module to compile the code and it is not loaded when you try to run the code.

If you’re using Python, see the section on CUDA libraries and Python.

CUDA libraries and Python deep learning frameworks

When using a Python deep learning frameworks such as Tensorflow or PyTorch you usually need to create a conda environment that contains both the framework and CUDA framework that the framework needs.

We recommend that you either use our centrally installed module that contains both frameworks (more info here) or install your own using environment using instructions presented here. These instructions make certain that the installed framework has a corresponding CUDA toolkit available. See the application list for more details on specific frameworks.

Please note that pre-installed software either has CUDA already present or it loads the needed modules. Thus you do not need to explicitly load CUDA from the module system when loading these.

Setting CUDA architecture flags when compiling GPU codes

Many GPU codes come with precompiled kernels, but in some cases you might need to compile your own kernels. When this is the case you’ll want to give the compiler flags that make it possible to run the code on multiple different GPU architectures.

For GPUs in Triton these flags are:

-arch=sm_60 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80

Here architectures (compute_XX/sm_XX) number 60, 70 and 80 correspond to GPU cards P100, V100 and A100 respectively.

For more information, you can check this excellent article or CUDA documentation on the subject.

Keeping GPUs occupied when doing deep learning

Many problems such as deep learning training are data-hungry. If you are loading large amounts of data you should make certain that the data loading is done in an efficient manner or the GPU will not be fully utilized.

All deep learning frameworks have their own guides on how to optimize the data loading, but they all are some variation of:

Store your data in multiple big files.
Create code that loads data from these big files.
Run optional pre-processing functions on the data.
Create a batch of data out of individual data samples.

Tasks 2 and 3 are usually parallelized across multiple CPUs. Using pipelines such as these can dramatically speed up the training procedure.

If your data consists of individual files that are not too big, it is a good idea to have the data stored in one file, which is then copied to nodes ramdisk /dev/shm or temporary disk /tmp.

Avoiding small files is in general a good rule to follow. Please refer to the small files page for more detailed information.

If your data is too big to fit in the disk, we recommend that you contact us for efficient data handling models.

For more information on suggested data loading procedures for different frameworks, see Tensorflow’s and PyTorch’s guides on efficient data loading.

Profiling GPU usage with nvprof

When using NVIDIA’s GPUs you can try to use a profiling tool called nvprof to monitor what took most of the GPU’s time during the code’s execution.

Sample output might look something like this:

==30251== NVPROF is profiling process 30251, command: ./pi-gpu 1000000000
==30251== Profiling application: ./pi-gpu 1000000000
==30251== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   84.82%  11.442ms         1  11.442ms  11.442ms  11.442ms  throw_dart(curandStateXORWOW*, int*, unsigned long*)
                   14.70%  1.9833ms         1  1.9833ms  1.9833ms  1.9833ms  setup_rng(curandStateXORWOW*, unsigned long)
                    0.30%  40.704us         1  40.704us  40.704us  40.704us  [CUDA memcpy DtoH]
                    0.17%  23.328us         1  23.328us  23.328us  23.328us  [CUDA memcpy HtoD]
      API calls:   89.52%  122.81ms         3  40.936ms  3.6360us  122.70ms  cudaMalloc
                   10.05%  13.794ms         2  6.8969ms  68.246us  13.726ms  cudaMemcpy
                    0.20%  269.55us         3  89.851us  11.283us  130.45us  cudaFree
                    0.14%  196.08us       101  1.9410us     122ns  83.854us  cuDeviceGetAttribute
                    0.04%  57.228us         2  28.614us  6.3760us  50.852us  cudaLaunchKernel
                    0.02%  32.426us         1  32.426us  32.426us  32.426us  cuDeviceGetName
                    0.01%  13.677us         1  13.677us  13.677us  13.677us  cuDeviceGetPCIBusId
                    0.01%  10.998us         1  10.998us  10.998us  10.998us  cudaGetDevice
                    0.00%  2.3540us         1  2.3540us  2.3540us  2.3540us  cudaGetDeviceCount
                    0.00%  1.2690us         3     423ns     207ns     850ns  cuDeviceGetCount
                    0.00%     663ns         2     331ns     170ns     493ns  cuDeviceGet
                    0.00%     656ns         1     656ns     656ns     656ns  cuDeviceTotalMem
                    0.00%     396ns         1     396ns     396ns     396ns  cuModuleGetLoadingMode
                    0.00%     234ns         1     234ns     234ns     234ns  cuDeviceGetUuid

This output shows that most of the computing time was caused by calling the throw_dart-kernel. It is important to note that in this example memory allocation cudaMalloc and memory copying cudaMemcpy used more time than the actual computation. Memory operations are time consuming operations and thus best codes try to minimize the need for doing them.

To see a chronological order of different GPU operations one can also run nprof --print-gpu-trace. The output will look something like this:

==31050== NVPROF is profiling process 31050, command: ./pi-gpu 1000000000
==31050== Profiling application: ./pi-gpu 1000000000
==31050== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
182.84ms  23.136us                    -               -         -         -         -  256.00KB  10.552GB/s    Pageable      Device  Tesla P100-PCIE         1         7  [CUDA memcpy HtoD]
182.89ms  1.9769ms            (512 1 1)       (128 1 1)        31        0B        0B         -           -           -           -  Tesla P100-PCIE         1         7  setup_rng(curandStateXORWOW*, unsigned long) [118]
184.87ms  11.450ms            (512 1 1)       (128 1 1)        19        0B        0B         -           -           -           -  Tesla P100-PCIE         1         7  throw_dart(curandStateXORWOW*, int*, unsigned long*) [119]
196.33ms  40.704us                    -               -         -         -         -  512.00KB  11.996GB/s      Device    Pageable  Tesla P100-PCIE         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

Here we see that the sample code did a memory copy to the device, ran kernel setup_rng, ran kernel throw_dart and did a memory copy back to the host memory.

For more information on nvprof, see NVIDIA’s documentation on it.

Available GPUs and architectures

Card	Slurm feature name (`--constraint=`)	Slurm gres name (`--gres=gpu:NAME:n`)	total amount	nodes	architecture	compute threads per GPU	memory per card	CUDA compute capability
Tesla K80*	`kepler`	`teslak80`	12	gpu[20-22]	Kepler	2x2496	2x12GB	3.7
Tesla P100	`pascal`	`teslap100`	20	gpu[23-27]	Pascal	3854	16GB	6.0
Tesla V100	`volta`	`v100`	40	gpu[1-10]	Volta	5120	32GB	7.0
Tesla V100	`volta`	`v100`	40	gpu[28-37]	Volta	5120	32GB	7.0
Tesla V100	`volta`	`v100`	16	dgx[1-7]	Volta	5120	16GB	7.0
Tesla A100	`ampere`	`a100`	56	gpu[11-17,38-44]	Ampere	7936	80GB	8.0
AMD MI100 (testing)	`mi100`	Use `-p gpu-amd` only, no `--gres`		gpuamd[1]

Exercises

The scripts you need for the following exercises can be found in our hpc-examples, which we discussed in Using the cluster from a shell. You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. Doing this creates you a local copy of the repository in your current working directory. This repository will be used for most of the tutorial exercises.

GPU 1: Test nvidia-smi

Run nvidia-smi on a GPU node with srun. Use slurm history to check which GPU node you ended up on.

GPU 2: Running the example

Run the example given above with larger number of trials (10000000000 or \(10^{10}\)).

Try using sbatch and Slurm script as well.

GPU 3: Run the script and do basic profiling with nvprof

nvprof is part of NVIDIA’s profiling tools and it can be used to monitor which parts of the GPU code use up most time.

Run the program as before, but add nvprof before it.

Try running the program with chronological trace mode (nvprof --print-gpu-trace) as well.

Solution

With srun you can run the profiling as follows:

srun --time=00:10:00 --mem=500M --gres=gpu:1 nvprof ./pi-gpu 10000000000

To get the trace output, you need to add the --print-gpu-trace-flag:

srun --time=00:10:00 --mem=500M --gres=gpu:1 nvprof --print-gpu-trace ./pi-gpu 10000000000

You should see output similar to ones shown in the section profiling GPU usage with nvprof.

GPU 4: Your program

Think of your program. Do you think it can utilize GPUs?

If you do not know, you can check the program’s documentation for words such as:

GPU
CUDA
ROCm
OpenMP offloading
OpenACC
OpenCL
…

What’s next?

You have now seen the basics - but applying these in practice is still a difficult challenge! There is plenty to figure out while combining your own software, the Linux environment, and Slurm.

Your time is the most valuable thing you have. If you aren’t fully sure of how to use the tools, it is much better to ask that struggle forever. Contact us the Research Software Engineers early - for example in our daily garage, and we can help you get set up well. Then, you can continue your learning while your projects are progressing.