GPU computing¶
Video
Watch this in our courses: 2022 February, 2021 January
Abstract
- Request a GPU with the Slurm option --gres=gpu:1 (some clusters need -p gpu or similar).
- If you use Python, generally don't load your own CUDA module unless you know you need it. Instead, install what you need through anaconda.
- Select a certain type of GPU with e.g. --constraint='kepler' (see the quick reference for names).
- Monitor GPU performance with sacct -j JOBID -o comment -p.
- For development, run jobs of 4 hours or less; they can run quickly in the gpushort queue.
- If you aren't fully sure of how to scale up, contact us, the Research Software Engineers, early.
Introduction¶
GPUs, short for graphics processing units, are massively parallel processors optimized to perform parallel operations. Computations that might take days to run on CPUs take substantially less time on GPUs. This speed-up is especially handy when dealing with large amounts of data, e.g. in machine learning/deep learning tasks, which is why GPUs have become an indispensable tool in the research community.
The programs we normally write in common programming languages, e.g. C++, are executed by the CPU. If we want the GPU to execute a program, we need to communicate with it explicitly: upload the program and the input data to the GPU, and transfer the result from the GPU back to main memory. This procedure is enabled by programming environments designed to communicate with GPUs in such a manner. An example of such an API is CUDA, which is the native programming interface for NVIDIA GPUs.
On Triton, we have a large number of NVIDIA GPU cards from different generations, and we currently only support CUDA. Triton GPUs are not typical desktop GPUs, but specialized research-grade server GPUs with large memory, high bandwidth and specialized instructions, and their number is constantly increasing. For scientific purposes, they generally outperform the best desktop GPUs.
See also
Please ensure you have read Interactive jobs and Serial Jobs before you proceed with this tutorial.
GPU jobs¶
To request GPUs on Slurm, you should use the --gres option, either in your batch script or as a command-line argument to your interactive job. Used as an SBATCH directive in a batch script, exactly one GPU is requested as follows:
#SBATCH --gres=gpu:1
You can request as many GPUs as you'd like using #SBATCH --gres=gpu:N, where N denotes the number of requested GPUs.
Note
Most of the time, using more than one GPU isn't worth it unless you specifically optimize for it, because communication takes too much time. It's better to parallelize by splitting tasks into different jobs.
You can restrict yourself to a certain type of GPU card by using the --constraint option. For example, to restrict to the Kepler generation (K80s), use --constraint='kepler'; to allow only the Pascal or Volta generations, use --constraint='pascal|volta' (remember the quotes, since | is the shell pipe).
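As an illustration, the relevant directives of a batch script that accepts either a Pascal or a Volta card could look like this (a minimal sketch; combine it with your other SBATCH options):
#SBATCH --gres=gpu:1
#SBATCH --constraint='pascal|volta'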
There is a gpushort partition with a time limit of 4 hours that often has space (like with other partitions, this is automatically selected for short jobs). As of early 2022, it has four Tesla P100 cards in it (view with slurm partitions | grep gpushort). If you are doing testing and development and these GPUs meet your needs, you may be able to test much faster here.
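You can also request the partition explicitly; for instance, a short interactive test could be started roughly like this (a sketch only, the time limit and command are placeholders):
srun -p gpushort --gres=gpu:1 --time=00:30:00 nvidia-smi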
Available machine learning frameworks¶
We support many common machine learning frameworks out of the box:
- Tensorflow: module load anaconda. See the Tensorflow page for info on older versions.
- Keras: module load anaconda
- PyTorch: module load anaconda
Please note that most of the pre-installed software already comes with CUDA included. Thus you do not need to load CUDA as a separate module when loading these. See the application list for more details.
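As a quick sanity check that the framework actually sees a GPU, you can run a one-liner from inside a GPU job (a sketch; run it within srun --gres=gpu:1 ..., not on the login node):
python -c 'import torch; print(torch.cuda.is_available())'
python -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'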
Compiling CUDA-based code¶
To compile CUDA-based code for GPUs, you need to load the relevant cuda module. You can see what versions of CUDA are available using module spider:
$ module spider cuda
When submitting a batch script, you need to load the cuda module, compile your code, and subsequently run the executable. An example of such a submission script is shown below, wherein the output of the code is written to a file named helloworld.out in the current directory:
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --job-name=helloworld
#SBATCH --mem-per-cpu=500M
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --output=helloworld.out
module load cuda
nvcc helloworld.cu -o helloworld
./helloworld
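The helloworld.cu source itself is not shown on this page; a minimal CUDA hello-world that could be built with the script above might look like the following (a sketch only, not the exact file from our examples):
#include <cstdio>

// Trivial kernel: every GPU thread prints its own index.
__global__ void hello()
{
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main()
{
    // Launch one block of four threads and wait for the GPU to finish.
    hello<<<1, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}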
Note
If you ever get libcuda.so.1: cannot open shared object file: No such file or directory, this means you are attempting to use a CUDA program on a node without a GPU. This especially happens if you try to test GPU code on the login node, and happens (for example) even if you try to import the GPU tensorflow module in Python on the login node.
Examples¶
Simple Tensorflow/Keras model¶
Let’s run the MNIST example from Tensorflow’s tutorials:
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(512, activation=tf.nn.relu),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
The full code for the example is in tensorflow_mnist.py. One can run this example with srun:
wget https://raw.githubusercontent.com/AaltoSciComp/scicomp-docs/master/triton/examples/tensorflow/tensorflow_mnist.py
module load anaconda
srun --time=00:15:00 --gres=gpu:1 python tensorflow_mnist.py
or with sbatch by submitting tensorflow_mnist.sh:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=00:15:00
module load anaconda
python tensorflow_mnist.py
Do note that by default Keras downloads datasets to $HOME/.keras/datasets.
Simple PyTorch model¶
Let’s run the MNIST example from PyTorch’s tutorials:
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv1 = nn.Conv2d(1, 20, 5, 1)
self.conv2 = nn.Conv2d(20, 50, 5, 1)
self.fc1 = nn.Linear(4*4*50, 500)
self.fc2 = nn.Linear(500, 10)
def forward(self, x):
x = F.relu(self.conv1(x))
x = F.max_pool2d(x, 2, 2)
x = F.relu(self.conv2(x))
x = F.max_pool2d(x, 2, 2)
x = x.view(-1, 4*4*50)
x = F.relu(self.fc1(x))
x = self.fc2(x)
return F.log_softmax(x, dim=1)
The full code for the example is in pytorch_mnist.py. One can run this example with srun:
wget https://raw.githubusercontent.com/AaltoSciComp/scicomp-docs/master/triton/examples/pytorch/pytorch_mnist.py
module load anaconda
srun --time=00:15:00 --gres=gpu:1 python pytorch_mnist.py
or with sbatch by submitting pytorch_mnist.sh:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=00:15:00
module load anaconda
python pytorch_mnist.py
The Python script will download the MNIST dataset to the data folder.
Monitoring efficient use of GPUs¶
When running a GPU job, you should check that the GPU is being fully utilized.
When your job has started, you can ssh to the node and run nvidia-smi. You can find your process by e.g. using htop, and inspect the GPU-Util column in the nvidia-smi output. It should be close to 100%.
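For example (a sketch; replace gpu23 with the node your job is actually running on, which you can see with squeue -u $USER):
ssh gpu23
watch -n 10 nvidia-smi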
Once the job has finished, you can use slurm history to obtain the jobID and run:
$ sacct -j JOBID -o comment -p
This also shows the GPU utilization.
Note
There are several factors to consider for efficient use of GPUs. For instance, is your code itself efficient enough? Are you using the framework's pipelines in the intended fashion? Is the GPU used for only a small portion of the entire task? Amdahl's law of parallelization speedup is relevant here.
If the GPU utilization of your job is low, you should check whether its CPU utilization is close to 100% with seff JOBID. CPUs running at full capacity can indicate that they cannot keep the GPU supplied with data fast enough, so the lack of CPU performance becomes the bottleneck for GPU utilization.
Please keep in mind that when using a GPU, you also need to request enough CPUs to supply data to it. You can increase the number of CPUs you request so that enough data is provided for the GPU. However, you shouldn't request too many: there wouldn't be enough CPUs left for everyone else to use the GPUs, and they would go to waste (all of our nodes have 4-6 CPUs for each GPU).
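For example, the following directives request one GPU together with a few CPU cores to feed it (a sketch; adjust the number of CPUs to your workload, within the 4-6 CPUs per GPU available on our nodes):
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4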
Input/output¶
Deep learning work is intrinsically very data-hungry. Remember what we said about storage and input/output being important before (Data storage)? This matter becomes very important when working with GPUs. In fact, faster memory bandwidth is the main improvement of our server-grade GPUs compared to desktop models.
If you are loading big amounts of data, you should package the data into a container format first; lots of small files are your worst enemy. Each framework has a way to do this efficiently in a whole pipeline.
See also
Please refer to the small files page for more detailed information.
If your data consists of individual files that are not too big, it is a good idea to have the data stored in one file, which is then copied to the node's ramdisk /dev/shm or temporary disk /tmp.
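A sketch of this pattern inside a batch script might look like the following (dataset.tar and train.py are hypothetical names; $TMPDIR points to the node's temporary disk):
# Copy the packaged dataset to node-local storage and unpack it there
cp dataset.tar $TMPDIR/
tar xf $TMPDIR/dataset.tar -C $TMPDIR/
# Run the training script against the fast local copy
python train.py --data-dir=$TMPDIR/dataset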
If your data is too big to fit on the disk, we recommend that you contact us for efficient data handling models.
Available GPUs and architectures¶
Card | Slurm feature name (--constraint) | Slurm gres name (--gres) | total amount | nodes | architecture | compute threads per GPU | memory per card | CUDA compute capability |
---|---|---|---|---|---|---|---|---|
Tesla K80* | | | 12 | gpu[20-22] | Kepler | 2x2496 | 2x12GB | 3.7 |
Tesla P100 | | | 20 | gpu[23-27] | Pascal | 3854 | 16GB | 6.0 |
Tesla V100 | | | 40 | gpu[1-10] | Volta | 5120 | 32GB | 7.0 |
Tesla V100 | | | 40 | gpu[28-37] | Volta | 5120 | 32GB | 7.0 |
Tesla V100 | | | 16 | dgx[1-7] | Volta | 5120 | 16GB | 7.0 |
Tesla A100 | | | 28 | gpu[11-17] | Ampere | 7936 | 80GB | 8.0 |
AMD MI100 (testing) | | Use | | gpuamd[1] | | | | |
Exercises¶
The scripts you need for the following exercises can be found in this git repository: hpc-examples. You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. This repository will be used for most of the tutorial exercises.
GPU-1: Test nvidia-smi
Run nvidia-smi on a GPU node with srun. Use slurm history to check which GPU node you ended up on. Try setting a constraint to force a different GPU architecture.
GPU-2: Running a script
Run one of the samples given above. Try using sbatch as well.
GPU-3: Test compiling CUDA
Load the cuda and gcc (version less than 9) modules and compile the gpu/pi.cu example using nvcc. Run it. Does it say zero? Try running it with a GPU and see what happens.
(advanced) GPU-4: Local job files
(Advanced) The PyTorch example will try to load datasets from a folder called data in a local folder. Modify the Slurm script so that the script:
- Creates a unique folder in /dev/shm or $TMPDIR before running the Python code.
- Moves to this folder while the job is running.
- Runs the PyTorch example from this location. Verify that the datasets are stored on the local disk.
HINT: Check out mktemp --help, the command output substitutions section from our Linux shell tutorial, and the API page for Python's os.environ.
See also¶
If you aren't fully sure of how to scale up, contact us, the Research Software Engineers, early.
What’s next?¶
We go on to Parallel computing.