Submitting jobs on Triton

Prerequisites

Ideally, before submitting a job, run enough tests to have a rough idea of how long your job takes, how much memory it needs, and how many CPUs or GPUs it needs.

Required Reading:

Required Setup:

Types of jobs:

Triton uses the Slurm scheduling system to allocate resources, such as compute nodes, memory on the nodes, and GPUs, to the submitted jobs. For more details on Slurm, have a look here. In this quickstart guide, we introduce only the most important parameters and skip over a lot of details. There are several types of jobs available on Triton. Here we focus on the most commonly used ones.

  • Interactive jobs (commonly to test things or run graphical platforms with cluster resources)

  • Batch jobs (normal jobs submitted to the cluster without direct user input)

To run an interactive job, connect to Triton and simply run

sinteractive

from the command line. You will then be connected to a free node and can run your interactive session. More details can be found in the tutorial for interactive jobs. If you have a specific command that you want to run, you can also use:

srun your_command
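As a rough sketch of how resource requests can be attached to a single command, the snippet below wraps srun in a small helper. The helper name run_on_node, the values 00:10:00 and 500M, and the command hostname are illustrative assumptions, not recommendations; --time and --mem are the same Slurm options used in the batch script in this guide. The command is only executed if srun is found on the machine; otherwise it is just printed.

```shell
#!/bin/bash
# Hypothetical helper: run a command through srun with a time and memory limit.
# Usage: run_on_node TIME MEMORY COMMAND [ARGS...]
run_on_node() {
    limit=$1
    mem=$2
    shift 2
    if command -v srun >/dev/null 2>&1; then
        # On the cluster: run the command on an allocated node.
        srun --time="$limit" --mem="$mem" "$@"
    else
        # Elsewhere: only show what would be run.
        echo "srun not found; would run: srun --time=$limit --mem=$mem $*"
    fi
}

# Illustrative values; adjust them to what your own tests suggest.
run_on_node 00:10:00 500M hostname
```

Replace hostname with your own command once the small test works.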

The most common job to run is a batch job, i.e. you submit a script that runs your code on the cluster. To run this kind of job, you need a small script in which you set the parameters for the job and which you then submit to the cluster. Setting the parameters in a script has the advantage that they are easier to modify and reuse than parameters passed on the command line. A basic script (e.g. in the file BatchScript.slurm) for a Slurm batch job could look as follows:

#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --mem=2G
#SBATCH --output=ScriptOutput.log

module load anaconda
srun python /path/to/script.py

To submit this script, use the command sbatch BatchScript.slurm.
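Put together, creating and submitting the example could look like the sketch below. It writes the script shown above to BatchScript.slurm and submits it only if sbatch is found, so it can also be tried on a machine without Slurm. The squeue call, which lists your own jobs, is an addition that the text above does not cover.

```shell
#!/bin/bash
# Write the example batch script to a file.
# The quoted 'EOF' keeps the text literal (no variable expansion).
cat > BatchScript.slurm <<'EOF'
#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --mem=2G
#SBATCH --output=ScriptOutput.log

module load anaconda
srun python /path/to/script.py
EOF

if command -v sbatch >/dev/null 2>&1; then
    sbatch BatchScript.slurm    # submits the job and reports its job ID
    squeue -u "$USER"           # lists your pending and running jobs
else
    echo "sbatch not found; BatchScript.slurm was written but not submitted"
fi
```

The path /path/to/script.py is the placeholder from the example and has to be replaced with the location of your own script.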

So, let us go through this script:

#SBATCH --time=04:00:00 asks for a 4 hour time slot, after which the job will be stopped.
#SBATCH --mem=2G asks for 2 GB of memory for your job.
#SBATCH --output=ScriptOutput.log writes the terminal output of the job to the specified file.
module load anaconda tells the node you run on to load the anaconda module.
srun python /path/to/script.py tells the cluster to run the command python /path/to/script.py.
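The --time value above is written as hours:minutes:seconds. As a small sketch, assuming you timed a test run in seconds and want to add a safety margin, the hypothetical helper to_slurm_time below formats a number of seconds in that form. The measured value of 5400 seconds and the factor of 2 are arbitrary assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: format a duration in seconds as HH:MM:SS.
to_slurm_time() {
    total=$1
    printf '%02d:%02d:%02d\n' \
        $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

measured=5400                      # assumed test run: 90 minutes
to_slurm_time $((measured * 2))    # prints 03:00:00
```

The printed value can then be used as #SBATCH --time=03:00:00 in the batch script.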

Most programming languages and tools have their own modules that need to be loaded before they can be run. You can get a list of available modules by running module spider. If you need a specific version of a module, you can check the available versions by running module spider MODULENAME (e.g. module spider r for R). To load a specific version, specify it in the load command (e.g. module load matlab/r2018b for the 2018b release of MATLAB). For further details, please have a look at the instructions for the specific application.
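The module commands above can be combined as in the following sketch. It uses the matlab/r2018b example from the text; whether that version is installed is not checked here. The final module list call, which shows the currently loaded modules, is an addition to the commands mentioned above. On a machine without the module command, the snippet only prints what it would do.

```shell
#!/bin/bash
name="matlab"
version="r2018b"
spec="$name/$version"          # module load expects NAME/VERSION

if type module >/dev/null 2>&1; then
    module spider "$name"      # list the available versions
    module load "$spec"        # load the chosen version
    module list                # show what is currently loaded
else
    echo "module command not found; would run: module load $spec"
fi
```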

There are plenty more parameters that you can set for the Slurm scheduler (a detailed list can be found here), but we do not discuss them in detail, since they are likely not necessary for your first job.