GPU computing

See also

This tutorial assumes you have read Interactive jobs.

Main article: GPU Computing

GPUs and accelerators are essentially very specialized parallel processors: they can apply the same instructions to a big chunk of data at the same time. The speedup can be 100x or more… but only in the specific cases where your code fits this model. Machine learning and deep learning methods happen to fit this kind of parallelism very well, which is why GPUs are now the standard tool for that type of research.

On Triton, we have a large number of NVIDIA GPU cards from different generations, and are constantly getting more. Our GPUs are not your typical desktop GPUs, but specialized research-grade server GPUs with large memory, high bandwidth and specialized instructions. For scientific purposes they generally exceed the best desktop GPUs.

Some nomenclature: a GPU is a graphics processing unit, and CUDA is the software interface for NVIDIA GPUs. Currently we only support CUDA.

Getting started

GPUs are, just like anything else, resources which are scheduled by slurm. So in addition to time, memory, and CPUs, you have to specify how many GPUs you want. This is done with the --gres (generic resources) option:

srun --gres=gpu:1 $my_code

This means you request the gpu resource type, and one unit of it (the :1). Combining this with the other required slurm options:

srun --gres=gpu:1 -t 2:00:00 --mem=10G -c 3

… and you’ve got yourself the basics. Of course, once you are ready for serious runs, you should put your code into slurm scripts.
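Once you get there, a minimal batch script for the request above could look like the sketch below ($my_code is just a placeholder for whatever program you actually run):

#!/bin/bash
#SBATCH --gres=gpu:1            # one GPU
#SBATCH --time=02:00:00
#SBATCH --mem=10G
#SBATCH --cpus-per-task=3

$my_code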

If you want to restrict yourself to a certain type of card, use the --constraint option. For example, to restrict to the Kepler generation (the K80s), use --constraint=kepler; for all newer cards, use --constraint='pascal|volta' (note the quotes - they are very important, because | is a shell pipe symbol!).
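For example, a full request for any Pascal or Volta card might look like this (the time, memory and CPU values are just the ones from above; adjust them to your job):

srun --gres=gpu:1 --constraint='pascal|volta' -t 2:00:00 --mem=10G -c 3 $my_code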

Our available GPUs and architectures:

Card       | Total amount | Nodes      | Architecture | Compute threads per GPU | Memory per card | CUDA compute capability | Slurm feature name | Slurm gres name
Tesla K80* | 12           | gpu[20-22] | Kepler       | 2x2496                  | 2x12GB          | 3.7                     | kepler             | teslak80
Tesla P100 | 20           | gpu[23-27] | Pascal       | 3584                    | 16GB            | 6.0                     | pascal             | teslap100
Tesla V100 | 40           | gpu[28-37] | Volta        | 5120                    | 32GB            | 7.0                     | volta              | v100
Tesla V100 | 16           | dgx[01-02] | Volta        | 5120                    | 16GB            | 7.0                     | volta              | v100

Ready software

We support common machine learning packages, such as TensorFlow, Keras, PyTorch and CNTK, out of the box (see the examples below).

Do note that most of the pre-installed software already includes CUDA, so you do not need to load a CUDA module when using it. See the application list or the GPU computing reference for more details.
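If you want to double-check that a framework actually sees a GPU from inside a job, a quick sanity check is a one-liner like the one below (PyTorch is only an example here; the module name and the import depend on which framework you use):

module load anaconda3/latest
srun --gres=gpu:1 python -c 'import torch; print(torch.cuda.is_available())'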

Compiling code yourself

To compile things for GPUs, you need to load the relevant CUDA modules:

module avail cuda                      # list the available CUDA versions
module load gcc                        # a compiler to go with CUDA
module load cuda                       # load the default CUDA toolkit

nvcc cuda_code.cu -o cuda_code         # compile your CUDA code
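The compiled binary needs a GPU to run, so launch it through slurm with a GPU reserved (cuda_code is just the binary produced by the compile step above; adjust the time and memory to your needs):

srun --gres=gpu:1 -t 00:10:00 --mem=500M ./cuda_code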

More information is in the reference, but most people will use pre-built software through channels such as Anaconda for Python.

Making efficient use of GPUs

When running a job, you want to check that the GPU is being fully utilized. To do this, ssh to your node while the job is running, run nvidia-smi, find your process (which might take some work), and check the GPU-Util column. It should be close to 100%; if it is not, see the tips below.
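For example, if slurm history or squeue shows your job running on node gpu28 (substitute whatever node your job actually got):

ssh gpu28
nvidia-smi                # one snapshot of GPU usage
watch -n 2 nvidia-smi     # or keep it refreshing every two seconds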

After the job has finished, you can use slurm history to obtain the JobID and run:

sacct -j INSERT_JOBID_HERE -o comment -p

This will show the GPU utilization.

Input/output

Deep learning work is intrinsically very data-hungry. Remember what we said about storage and input/output being important before (in the storage tutorial)? Now it’s really important. In fact, faster memory bandwidth is the main improvement of our server-grade GPUs compared to desktop models.

If you are loading lots of data, package the data into a container format first: lots of small files are your worst enemy, and we have a dedicated page on small files.

If your dataset consists of individual files and is not too big, it is a good idea to pack the data into a single file, which is then copied to the node's ramdisk /dev/shm or local temporary disk /tmp.
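A sketch of that pattern inside a job, assuming your data has been packed into a tar file (my_dataset.tar is a placeholder name):

DATADIR=$(mktemp -d -p /dev/shm)       # unique folder on the node's ramdisk
cp my_dataset.tar $DATADIR/
tar xf $DATADIR/my_dataset.tar -C $DATADIR
# ... run your code here, reading the data from $DATADIR ...
rm -rf $DATADIR                        # clean up when done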

If your data is too big to fit on the local disk, we recommend that you contact us to work out an efficient data handling strategy.

Enough CPUs

When using a GPU, you also need to request enough CPUs to feed data to it, so increase the number of CPUs you request until the GPU can be kept busy. However, don't request too many: if you do, there aren't enough CPUs left for everyone else to use the remaining GPUs, and those GPUs go to waste! (The K80 nodes have only about 1.5 CPUs per GPU, but all other GPU nodes have 4-6 CPUs per GPU.)
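For example, a request for one GPU with a few CPUs to feed it might look like this (4 CPUs is just a reasonable starting point on the non-K80 nodes; tune it for your own workload):

srun --gres=gpu:1 -c 4 -t 2:00:00 --mem=10G $my_code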

Other

Most of the time, using more than one GPU isn't worth it unless you specifically optimize for it, because the communication between GPUs takes too much time. It's usually better to parallelize by splitting tasks into separate single-GPU jobs.
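For instance, a slurm array job gives you several independent single-GPU runs from one script. The sketch below assumes a hypothetical train.py that takes a run index as an argument:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --array=0-3                    # four independent single-GPU tasks

module load anaconda3/latest
python train.py --run-id $SLURM_ARRAY_TASK_ID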

FAQ

If you ever get libcuda.so.1: cannot open shared object file: No such file or directory, it means you are attempting to run a CUDA program on a node without a GPU. This typically happens when you try to test GPU code on the login node; for example, even just importing the GPU TensorFlow module in Python there is enough to trigger it.
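A quick way to confirm whether the node you are on actually has a GPU is to run nvidia-smi there; inside a GPU job it lists the card(s) you reserved, while on the login node it will fail or find no devices:

srun --gres=gpu:1 nvidia-smi     # inside a job: lists the reserved GPU
nvidia-smi                       # on the login node: no GPU available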

Examples

Simple Tensorflow/Keras model

Let’s run the MNIST example from Tensorflow’s tutorials:

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

The full code for the example is in tensorflow_mnist.py. One can run this example with srun:

wget https://raw.githubusercontent.com/AaltoScienceIT/scicomp-docs/master/triton/examples/tensorflow/tensorflow_mnist.py
module load anaconda3/latest
srun -t 00:15:00 --gres=gpu:1 python tensorflow_mnist.py

or with sbatch by submitting tensorflow_mnist.sh:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=00:15:00

module load anaconda3/latest

python tensorflow_mnist.py

Do note that by default Keras downloads datasets to $HOME/.keras/datasets.

Simple PyTorch model

Let’s run the MNIST example from PyTorch’s tutorials:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

The full code for the example is in pytorch_mnist.py. One can run this example with srun:

wget https://raw.githubusercontent.com/AaltoScienceIT/scicomp-docs/master/triton/examples/pytorch/pytorch_mnist.py
module load anaconda3/latest
srun -t 00:15:00 --gres=gpu:1 python pytorch_mnist.py

or with sbatch by submitting pytorch_mnist.sh:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=00:15:00

module load anaconda3/latest

python pytorch_mnist.py

The Python script will download the MNIST dataset to the data folder.

Simple CNTK model

Let’s run the MNIST example from CNTK’s tutorials:

    # Instantiate the feedforward classification model
    scaled_input = element_times(constant(0.00390625), feature)

    z = Sequential([For(range(num_hidden_layers), lambda i: Dense(hidden_layers_dim, activation=relu)),
                    Dense(num_output_classes)])(scaled_input)

    ce = cross_entropy_with_softmax(z, label)
    pe = classification_error(z, label)

The full code for the example is in cntk_mnist.py. One can run this example with srun:

wget https://raw.githubusercontent.com/AaltoScienceIT/scicomp-docs/master/triton/examples/cntk/cntk_mnist.py
module load nvidia-cntk
srun -t 00:15:00 --gres=gpu:1 singularity_wrapper exec python cntk_mnist.py

or with sbatch by submitting cntk_mnist.sh:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=00:15:00

module load nvidia-cntk

singularity_wrapper exec python cntk_mnist.py

Do note that the datasets in the code are read from /scratch/scip/data/cntk/MNIST, so the model won't run outside of Triton. Check the CNTK GitHub repo for the whole example.

Exercises

  1. Run nvidia-smi on a GPU node with srun. Use slurm history to check which GPU node you ended up on. Try setting a constraint to force a different GPU architecture.

  2. Copy /scratch/scip/examples/gpu/pi.cu to your work directory. Compile it using the cuda module and nvcc. Run it. Does it say zero? Try running it with a GPU and see what happens.

  3. Run one of the samples given above. Try using sbatch as well.

  4. Modify the CNTK sample slurm script so that it copies the datasets to a unique folder in /dev/shm or $TMPDIR before running the Python code. Modify the CNTK sample so that it loads the data from the new location.

    HINT: Check out mktemp --help, the command output substitution section of our Linux shell tutorial, and the API page for Python's os.environ.

    Solution to ex. 4: cntk_mnist_ex4.py cntk_mnist_ex4.sh.

Next steps

Check out our reference information about GPU computing, including examples for different machine learning frameworks.

If you came straight to this page, you should also read Interactive jobs and Serial Jobs (actually you should have read them first, but don’t worry).

This guide assumes you are using pre-existing GPU programs. If you need to write your own, that’s a whole other story, and you can find some hints on the reference page.