Nvidia DGX machines¶
Triton currently has two Nvidia DGX-1 machines, each containing 8 V100 GPUs and optimized for deep learning.
DGX usage via Slurm, and this page in general, are under development and testing. For the latest changes, you can check the git history using the link in the top right corner.
Access and prerequisites¶
The DGX machines have been specifically bought by several groups and these groups have priority access.
General access: you should use the
dgx-common partition. This has
preemption enabled, which means that if a higher-priority job comes,
your job can be cancelled at any time, even if it is already running. The
job will then be added back to the queue and possibly run again; design
your code to take this into account. Furthermore, your job will only
start running when the priority partition is empty, so in effect
jobs progress very slowly. If you are using general access, all of the
-p dgx options in the examples below need to be changed to -p dgx-common.
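For example, the interactive-shell command shown further below would, for general access, become (identical apart from the partition name; a sketch only):

srun -p dgx-common --gres=gpu:v100:1 --export=HOME,USER,TERM,WRKDIR --pty /bin/bash -l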
Dedicated group access: You can check the groups which may access
it by running
grep PartitionName=dgx /etc/slurm/slurm.conf and looking at
AllowGroups=. Check your own groups with the
groups command, and check all members of a group with
getent group $groupname. If you
should have access but don’t, email our support alias
with a CC to your group leader, and we will fix this.
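For example, on the login node you could run the following (output will vary; $groupname is a placeholder for one of the listed groups):

# Which groups are allowed into the dgx partition (look for AllowGroups=)
grep PartitionName=dgx /etc/slurm/slurm.conf
# Which groups your own account belongs to
groups
# All members of a particular group
getent group $groupname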
You also need a Triton account.
The DGX machines have a special operating system from Nvidia based on Ubuntu 16.04, and are thus a very special kind of Triton node, because the rest of Triton is CentOS. We have done work to make them work together, but it will require special effort to make code run on both halves. You may find some problems, so please be aggressive about filing issues (but also aggressive about checking the background yourself and giving us good information).
Basic reading: Connecting to Triton.
Unlike before, direct access is not available: connect to the login node and submit jobs via Slurm rather than logging in to the DGX nodes and running things interactively there.
Software and modules¶
Basic reading: Software Modules.
You should load software using the
module command, just like the
rest of Triton. However, since the base operating system is
different, modules are not automatically compatible. So, you can’t
automatically reuse the modules you use on the rest of Triton.
The current available modules are:
----------------------- /share/apps/anaconda-ci/modules ------------------------
 anaconda2/5.1.0-cpu          modules/anaconda2/5.1.0-cpu
 anaconda2/5.1.0-gpu (D)      modules/anaconda2/5.1.0-gpu (D)
 anaconda3/5.1.0-cpu          modules/anaconda3/5.1.0-cpu
 anaconda3/5.1.0-gpu (D)      modules/anaconda3/5.1.0-gpu (D)

-------------------- /share/apps/singularity-ci/dgx/modules --------------------
 nvidia-caffe/18.02-py2       nvidia-tensorflow/18.02-py2
 nvidia-cntk/18.02-py3        nvidia-tensorflow/18.02-py3 (D)
 nvidia-mxnet/18.02-py2       nvidia-theano/18.02
 nvidia-mxnet/18.02-py3 (D)   nvidia-torch/18.02-py2
 nvidia-pytorch/18.02-py3

------------------------- /share/apps/modulefiles/dgx --------------------------
 matlab/r2012a   matlab/r2015b   matlab/r2016b
 matlab/r2014a   matlab/r2016a   matlab/r2017b (D)

-------------------- /usr/share/lmod/lmod/modulefiles/Core ---------------------
 lmod/5.8   settarg/5.8
Unlike the rest of Triton, you can’t see which modules are available
from the login node: for now, refer to the listing above (which might go
out of date), or get an interactive shell on a DGX node (see below) and run
module avail yourself.
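As a quick sketch, once on a DGX node (for example in an interactive shell, see below), you could list and load one of the modules from the listing above:

module avail                      # list what is actually available on the node
module load anaconda3/5.1.0-gpu   # load one of the modules shown in the listing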
All runs on the DGX machines go via Slurm. For an introduction to Slurm, see the tutorials linked above and, in general, the rest of the Triton user guide. Slurm is a cluster scheduling system which takes job requests (code, plus CPU/memory/time/hardware requirements) and distributes them to nodes. You basically need to declare what your jobs require and tell Slurm to run them on the DGX nodes.
Basic required slurm options¶
The necessary Slurm parameters are:
-p dgx (dedicated group access) or
-p dgx-common (general access, jobs may be killed at any time, see above) to indicate that we want to run in the DGX partitions.
--gres=gpu:v100:1 to request GPUs (Slurm also manages GPUs and limits you to the proper devices).
- To request more than one graphics card, use --gres=gpu:v100:2 and so on.
--export=HOME,USER,TERM,WRKDIR to limit the environment exported. Because these are a different operating system, you need to clear most environment variables. If there are extra environment variables you need, add them here.
/bin/bash -l: you need to give the full path to
bash and request a login shell, or else the environment won’t be properly set by Slurm.
- To set the run time, use
--time=HH:MM:SS. If you want more CPUs, add
-c N. If you want more (system) memory, use
--mem=5GB and so on. (These are completely generic Slurm options; see the combined sketch after this list.)
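Putting these options together, a hedged sketch of a complete request (the resource numbers are placeholders, and nvidia-smi is used only as a trivial test command):

srun -p dgx --gres=gpu:v100:1 -c 2 --mem=8G --time=00:30:00 \
     --export=HOME,USER,TERM,WRKDIR \
     /bin/bash -l -c 'nvidia-smi'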
To check running and queued jobs:
squeue -p dgx,dgx-common (whole cluster) or
slurm q (your own jobs only).
Getting an interactive shell for your own work¶
For example, to get an interactive shell, run:
srun -p dgx --gres=gpu:v100:1 --export=HOME,USER,TERM,WRKDIR --pty /bin/bash -l
From here, you can do whatever you want interactively with your dedicated resources almost as if you logged in directly. Remember to log out when done, otherwise your resources stay dedicated to you and no one else can use them!
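Once inside, you can, for example, check that the GPU you requested is really visible (a simple sanity check, not required):

nvidia-smi    # should show the V100 GPU(s) allocated to this job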
Similarly to the rest of Triton, you can make batch scripts:
#!/bin/bash -l
#SBATCH -p dgx
#SBATCH --gres=gpu:1
#SBATCH --mem=5G --time=5:00
#SBATCH --export=HOME,USER,TERM,WRKDIR

your shell commands here
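Assuming the script above is saved as, say, dgx-job.sh (the filename is only an illustration), submit and monitor it like this:

sbatch dgx-job.sh    # submit the batch script
slurm q              # check the status of your own jobs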
Some of the Nvidia containers designed for the DGX machines are available as modules - see above. They are integrated with our Triton singularity setup, so you can use those same procedures:
module load nvidia-tensorflow

# Get a shell within the image:
singularity_wrapper shell

# Execute Python within the image
singularity_wrapper exec python3 code.py
singularity_wrapper sets the image file (from the module you
loaded), important options (to bind-mount things), and starts it.
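As another sketch, using the PyTorch module from the listing above (that the container provides the torch Python package is our assumption):

module load nvidia-pytorch
# Print the framework version from inside the container
singularity_wrapper exec python3 -c 'import torch; print(torch.__version__)'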
This is a minimal Slurm script (submit with
sbatch; see the Slurm
info above and the tutorials for more):
#!/bin/bash -l
#SBATCH -p dgx
#SBATCH --gres=gpu:1
#SBATCH --mem=5G --time=5:00
#SBATCH --export=HOME,USER,TERM,WRKDIR

module load nvidia-tensorflow
singularity_wrapper exec python -V
Note: if you are using tensorboard, just have it write data to the scratch filesystem, mount that on your workstation, and follow it that way. See the data storage tutorial.
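A rough sketch of that workflow; all paths here are illustrative assumptions, not fixed cluster paths:

# In your job: point TensorBoard's log directory somewhere under scratch
# (for example a project subdirectory you already use).
# On your own workstation, with scratch mounted locally:
tensorboard --logdir /path/to/mounted/scratch/my-project/tb-logs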
Within jobs, use
/tmp for temporary local files. This is
bind-mounted per user (not per job: make sure that you prefix filenames by job
ID or something similar to avoid conflicts) to the
/raid SSD area.
(note: see below, this doesn’t work yet)
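Once this works, a sketch of keeping your own jobs from colliding could look like this ($SLURM_JOB_ID is set by Slurm inside jobs):

# Make a job-specific subdirectory under /tmp and clean it up at the end
TMPDIR=/tmp/$SLURM_JOB_ID
mkdir -p "$TMPDIR"
# ... write intermediate files under $TMPDIR ...
rm -rf "$TMPDIR"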
- You have to give the full path to
/bin/bash and give the
-l option to make a login shell that reads the necessary shell initialization.
- You have to limit the environment variables you export, because they
are different on the DGX side. But you have to export at least
HOME, and possibly more (see above).
- You can’t figure out which modules are available without getting an interactive shell there.
- The /tmp directory is not automatically mapped to a per-user tmpdir (or to
/raid). For large amounts of intermediate data, use a per-user subdirectory of
/raid for your work (see the sketch after this list).
- /scratch isn’t automatically mounted for some reason. For now, we manually mount it on each reboot, but this needs fixing.
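For the /raid point above, a hypothetical sketch (the directory layout under /raid is our assumption, not a documented convention):

# Keep large intermediates on the local /raid SSDs, in a per-user, per-job directory
WORKTMP=/raid/$USER/$SLURM_JOB_ID
mkdir -p "$WORKTMP"
# ... write intermediate files under $WORKTMP ...
rm -rf "$WORKTMP"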