GPUs and accelerators are basically very special parallel processors: they can apply the same instructions to lots of different data at the same time. You can get a speedup of 100x or more… but only in the specific cases where your code fits the model. It happens that machine learning/deep learning methods are able to use this type of parallelism, so now these are the standard for this type of research.
On Triton, we have a large number of Nvidia GPU cards of different generations, and are constantly getting more. Our GPUs are not desktop cards but specialized research-grade server GPUs with large memory and high bandwidth, which for scientific purposes generally exceed the best desktop GPUs.
Some nomenclature: a GPU is a graphics processing unit, and CUDA is the software interface for Nvidia GPUs (we only support CUDA).
GPUs are, just like anything else, resources which are scheduled by slurm.
So in addition to time, memory, and CPUs, you have to specify how many
GPUs you want. This is done with the --gres (generic resources) option:
srun --gres=gpu:1 $my_code
This means you request the gpu resources, and one of them (1).
Combining this with the other required slurm options:
srun --gres=gpu:1 -t 2:00:00 --mem=10G -c 3 $my_code
… and you’ve got yourself the basics. Of course, once you are ready for serious runs, you should put your code into slurm scripts.
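As a minimal sketch of what such a slurm script could look like, the block below writes the same options as the srun line above into a batch script. The file name gpu_job.sh and the program my_code.py are placeholders, not names used on Triton.

```shell
# Write a small batch script (gpu_job.sh is a hypothetical name).
cat > gpu_job.sh <<'EOF'
#!/bin/bash
# One GPU
#SBATCH --gres=gpu:1
# Run time limit, same as -t 2:00:00
#SBATCH --time=02:00:00
# Memory, same as --mem=10G
#SBATCH --mem=10G
# CPUs, same as -c 3
#SBATCH --cpus-per-task=3

# my_code.py is a placeholder for your own program
srun python my_code.py
EOF

# Check the script for syntax errors without running it.
bash -n gpu_job.sh && echo "gpu_job.sh: syntax OK"

# On the cluster, you would then submit it with:
#   sbatch gpu_job.sh
```

The options are the long forms of the ones used with srun, so a command line that works interactively can be moved into a script without changing the requested resources.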
If you want to restrict yourself to a certain type of card, you should
use the --constraint option. For example, to restrict to Kepler
generation (K80s), use --constraint=kepler or, for all new cards,
--constraint='kepler|pascal' (note the quotes - this is very
important, because | is a shell pipe symbol!).
Old ways of specifying things
Note: before summer 2016, you also had to specify a GPU partition
(-p gpu or -p gpushort). Now, this is automatically detected,
and the recommendation is to leave this off.
Note: before summer 2018, a different way of specifying the GPU type
was recommended. Now, --constraint= is preferred since
you can specify more than one type.
Our available GPUs and architectures:
|Card||total amount||nodes||architecture||compute threads per GPU||memory per card||CUDA compute capability||Slurm feature name||Slurm gres name|
We support these machine learning packages out of the box:
- keras: same as tensorflow
- pytorch: same module as tensorflow
- Detectron: via singularity images
- Torch: currently possible but not easy, in the future through singularity
Compiling code yourself
To compile things for GPUs, you need to load the relevant CUDA module:
module avail CUDA                 # list the available CUDA versions
module load CUDA                  # load the default version
nvcc cuda_code.cu -o cuda_code    # compile your CUDA code
More information is in the reference, but most people will use pre-built software through channels such as Anaconda for Python.
Making efficient use of GPUs
When running a job, you want to check that the GPU is being fully
utilized. To do this, ssh to your node (while the job is running),
run nvidia-smi, find your process (which might take some work)
and check the GPU-Util column. It should be close to 100%,
otherwise see below.
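If the full nvidia-smi table is hard to read, the same utilization figure can be requested as CSV. This is only a sketch: the function name gpu_status is made up for illustration, and it is meant to be run on the compute node while your job is running (squeue -u $USER shows which node that is).

```shell
# Print one CSV line per GPU with its current utilization and memory use.
# gpu_status is a hypothetical helper name, not a Triton command.
gpu_status() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # utilization.gpu is the value shown in the GPU-Util column
        nvidia-smi \
            --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
            --format=csv
    else
        echo "nvidia-smi not found: is this a GPU node?" >&2
        return 1
    fi
}

# Run it once; on a node without a GPU this only prints the warning.
gpu_status || true
```

Adding -l 5 to the nvidia-smi call repeats the query every five seconds, which makes it easier to see whether utilization stays high or only spikes between data loads.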
Deep learning work is intrinsically very data-hungry. Remember what we said about storage and input/output being important before (in the storage tutorial)? Now it’s really important. In fact, faster memory bandwidth is the main improvement of our server-grade GPUs compared to desktop models.
If you are loading lots of data, package the data into a container format first: lots of small files are your worst enemy, and we have a dedicated page on small files.
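As a rough illustration of packaging, the sketch below packs a directory of small files into a single tar archive. tar is used only because it is available everywhere; it is an assumption, not a format this guide prescribes. The dataset here is a stand-in created in a temporary directory so the example can be run anywhere.

```shell
# Stand-in for a real dataset: a temporary directory with a few small files.
workdir=$(mktemp -d)
mkdir -p "$workdir/dataset"
for i in 1 2 3; do
    echo "sample $i" > "$workdir/dataset/sample_$i.txt"
done

# Pack the whole directory into one archive file.
tar -cf "$workdir/dataset.tar" -C "$workdir" dataset

# List the archive contents to confirm what went in.
n_packed=$(tar -tf "$workdir/dataset.tar" | grep -c 'sample_')
echo "packed $n_packed files into $workdir/dataset.tar"
```

The filesystem then handles one large file instead of many small ones, which is the point of the advice above.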
When using a GPU, you need to also request enough CPUs to supply the data to the process. So, increase the number of CPUs you request so that you can provide the GPU with enough data. However, don’t request too many: then, there aren’t enough CPUs for everyone to use the GPUs, and they go to waste! (For the K80 nodes, we have only 1.5 CPUs per GPU, but on all others we have 4-6 CPUs/GPU)
Most of the time, using more than one GPU isn’t worth it, unless you specially optimize, because communication takes too much time. It’s better to parallelize by splitting tasks into different jobs.
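One way to split work into separate single-GPU jobs is a slurm array job. The sketch below is an assumption about how that might look, not a Triton recommendation: gpu_array.sh, my_code.py and its --part argument are hypothetical names.

```shell
# Write an array job script (gpu_array.sh is a hypothetical name).
cat > gpu_array.sh <<'EOF'
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --mem=10G
#SBATCH --cpus-per-task=3
# Four independent tasks, numbered 0 to 3
#SBATCH --array=0-3

# Each array task is scheduled as its own job with its own GPU.
# SLURM_ARRAY_TASK_ID tells the (hypothetical) program which part to process.
srun python my_code.py --part "$SLURM_ARRAY_TASK_ID"
EOF

# Check the script for syntax errors without running it.
bash -n gpu_array.sh && echo "gpu_array.sh: syntax OK"
```

Because the tasks never talk to each other, there is no communication between GPUs to slow things down.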
If you ever get
libcuda.so.1: cannot open shared object file: No such file or directory,
this means you are attempting to use a CUDA
program on a node without a GPU. This especially happens if you try
to test GPU code on the login node, and happens (for example) even if
you try to import the GPU tensorflow module in Python on the login node.
In our examples collection (in /scratch/scip/examples and also on
github), you find some examples:
- Compile and run the gpu/pi.cu example. Load the CUDA module and compile it with nvcc, then try running the program. Does it say zero? Try running it with a GPU and see what happens.
Check out our reference information about GPU computing, including examples of different machine learning languages.
This guide assumes you are using pre-existing GPU programs. If you need to write your own, that’s a whole other story, and you can find some hints on the reference page.