Triton quick reference

Quick reference guide for the Triton cluster at Aalto University; much of it is also useful for other Slurm clusters. See also the printable Triton cheatsheet, as well as other cheatsheets.

Connecting

See also: Connecting to Triton.

ssh from Aalto networks
    Standard way of connecting via the command line. Hostname is triton.aalto.fi. More SSH info.
    Linux/Mac/Windows from the command line: ssh USERNAME@triton.aalto.fi
    Windows: same; see Connecting via ssh for detailed options.
    From where: VPN and Aalto networks (that is VPN, most wired networks, internal servers, eduroam, and aalto only on an Aalto-managed laptop, but not aalto open). The simplest SSH option if you can use the VPN.

ssh (from rest of Internet)
    Use the Aalto VPN and the row above. If needed: same as above, but you must set up an SSH key and then ssh -J USERNAME@kosh.aalto.fi USERNAME@triton.aalto.fi.
    From where: whole Internet, if you first set up an SSH key and also use your password (since 2023).

VDI
    "Virtual desktop interface", https://vdi.aalto.fi; from there you can ssh to Triton or access OOD. More info.
    From where: whole Internet.

Jupyter
    Since April 2024 Jupyter is part of Open OnDemand, see below. More info.
    From where: see the corresponding OOD section.

Open OnDemand
    https://ondemand.triton.aalto.fi, a web-based interface to the cluster, also known as OOD. Includes shell access, a GUI, data transfer, Jupyter, and a number of GUI applications such as Matlab. More info.
    From where: VPN and Aalto networks, or through VDI.
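If you connect from outside Aalto regularly, an SSH client configuration can save typing. This is a minimal sketch of a possible ~/.ssh/config (the host aliases kosh and triton are arbitrary names chosen here; replace USERNAME with your Aalto username):

# Jump through kosh.aalto.fi to reach Triton from outside Aalto networks
Host kosh
    HostName kosh.aalto.fi
    User USERNAME
Host triton
    HostName triton.aalto.fi
    User USERNAME
    ProxyJump kosh

After this, ssh triton works like the ssh -J command above (SSH key setup is still required from outside Aalto).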

Modules

See also: Software modules.

Command                     Description
module load NAME            load a module
module avail                list all modules
module spider PATTERN       search modules
module spider NAME/ver      show prerequisite modules for this one
module list                 list currently loaded modules
module show NAME            details on a module
module help NAME            help text for a module
module unload NAME          unload a module
module save ALIAS           save the current module collection under this alias (saved in ~/.lmod.d/)
module savelist             list all saved collections
module describe ALIAS       details on a collection
module restore ALIAS        load a saved module collection (faster than loading individually)
module purge                unload all loaded modules (faster than unloading individually)
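For example, the commands above can be combined into a typical workflow (my-env is an arbitrary alias name chosen here):

module load matlab           # load a module
module list                  # check what is currently loaded
module save my-env           # save this set of modules under an alias
module purge                 # unload everything
module restore my-env        # later: bring the saved set back in one step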

Common software

See also: Applications.

  • Python: module load scicomp-python-env for an Aalto Scientific Computing managed Python environment with various packages. More info.

  • R: module load r for a basic R installation. More info.

  • Matlab: module load matlab for the latest Matlab version. More info.
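For example, to use the managed Python environment in an interactive shell (the version check simply confirms which Python the module provides):

module load scicomp-python-env
python3 --version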

Storage

See also: Data storage

Home
    Path: $HOME or /home/USERNAME/
    Quota: hard quota 10GB.  Backup: nightly.  Locality: all nodes.
    Purpose: small user-specific files, no calculation data.

Work
    Path: $WRKDIR or /scratch/work/USERNAME/
    Quota: 200GB and 1 million files (can be increased on request).  Backup: no.  Locality: all nodes.
    Purpose: personal working space for every user; calculation data etc.

Scratch
    Path: /scratch/DEPT/PROJECT/
    Quota: on request.  Backup: no.  Locality: all nodes.
    Purpose: department/group-specific project directories.

Local temp
    Path: /tmp/
    Quota: limited by disk size.  Backup: no.  Locality: single node.
    Purpose: primary (and usually fastest) place for single-node calculation data. Removed once the user's jobs on the node are finished.

Local persistent
    Path: /l/
    Quota: varies.  Backup: no.  Locality: dedicated group servers only.
    Purpose: persistent local-disk storage on servers purchased for a specific group.

ramfs (login nodes only)
    Path: $XDG_RUNTIME_DIR
    Quota: limited by memory.  Backup: no.  Locality: login node only.
    Purpose: in-memory filesystem (ramfs) on the login node.
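A common pattern following the table above is to keep calculation data under $WRKDIR instead of $HOME, for example (myproject is a placeholder name):

mkdir -p $WRKDIR/myproject
cd $WRKDIR/myproject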

Remote data access

See also: Remote access to data.

rsync transfers
    Transfer files back and forth via the command line. Set up ssh first.
    rsync triton.aalto.fi:/path/to/file.txt file.txt
    rsync file.txt triton.aalto.fi:/path/to/file.txt

SFTP transfers
    Operates over SSH. Use sftp://triton.aalto.fi in file browsers (Linux at least), or FileZilla (connect to triton.aalto.fi).

SMB mounting
    Mount the remote filesystem (make it viewable locally) on your own computer.
    Linux: file browser, smb://data.triton.aalto.fi/scratch/
    MacOS: file browser, same URL as Linux
    Windows: \\data.triton.aalto.fi\scratch\
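For example, whole directories can be synced with rsync's archive mode (the paths here are placeholders; -a preserves attributes, -v is verbose, -z compresses during transfer):

rsync -avz ./results/ triton.aalto.fi:/scratch/work/USERNAME/results/
rsync -avz triton.aalto.fi:/scratch/work/USERNAME/results/ ./results/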

Partitions

<default>
    If you leave the partition off, all suitable partitions are considered (chosen based on your time and memory requests).

Use slurm partitions to see the full list of partitions with their limits (maximum job size, memory per core, total memory, cores per node) and intended use.

Job submission

See also: Serial Jobs, Array jobs: embarrassingly parallel execution, Parallel computing: different methods explained.

sbatch
    Submit a job to the queue (see standard options below).

srun
    Within a running job script/environment: run code using the allocated resources (see options below).

srun
    On the front-end node: submit to the queue, wait until done, and show the output (see options below).

sinteractive
    Submit a job, wait, and get a shell on a node for interactive work (X forwarding works; the default partition is interactive). Exit the shell when done. (See options below and the example after this table.)

srun --pty bash
    (advanced) Another way to run interactive jobs; no X forwarding, but simpler. Exit the shell when done.

scancel JOBID
    Cancel a job in the queue.

salloc
    (advanced) Allocate resources from the front-end node. Use srun to run with those resources; exit to close the shell when done (see options below).

scontrol
    View/modify job and Slurm configuration.
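For example, a short interactive session can be requested like this (a sketch, assuming sinteractive accepts the standard resource options listed below):

sinteractive --time=01:00:00 --mem=4G
# ... work interactively on the compute node ...
exit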

Options for sbatch, srun, etc.:

-t, --time=HH:MM:SS
    Time limit.

-t, --time=DD-HH
    Time limit, days-hours.

-p, --partition=PARTITION
    Job partition. Usually leave this off; a suitable partition is chosen automatically.

--mem-per-cpu=N
    Request N MB of memory per core.

--mem=N
    Request N MB of memory per node.

-c, --cpus-per-task=N
    Allocate N CPUs for each task; for multithreaded jobs. (Compare --ntasks: -c sets the number of cores for each started process.)

-N, --nodes=N-M
    Allocate a minimum of N and a maximum of M nodes.

-n, --ntasks=N
    Allocate resources for, and start, N tasks (one task = one started process; it is up to you to make them communicate). The main script runs only on the first node; commands started with srun are run this many times.

-J, --job-name=NAME
    Short job name.

-o OUTPUTFILE
    Print output into OUTPUTFILE.

-e ERRORFILE
    Print errors into ERRORFILE.

--exclusive
    Allocate exclusive access to nodes. For large parallel jobs.

--constraint=FEATURE
    Request a feature (see slurm features for the current list of configured features, or Arch in the hardware list). Request multiple with --constraint="hsw|skl".

--array=0-5,7,10-15
    Run the job multiple times; use the variable $SLURM_ARRAY_TASK_ID to adjust parameters.

--gres=gpu
    Request a GPU, or --gres=gpu:N for multiple.

--gres=spindle
    Request nodes that have local disks; spindle:N for a certain number of RAID0 disks.

--mail-type=TYPE
    Notify of events: BEGIN, END, FAIL, REQUEUE (not on Triton), or ALL. Must be used together with --mail-user=.

--mail-user=YOUR@EMAIL
    Whom to send the email to.

srun -N N_NODES hostname
    Print the allocated nodes (from within a script).
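As a sketch of how several of these options fit together in a batch script (the module name and my_script.py are placeholders; adjust to your own software):

#!/bin/bash -l
#SBATCH --time=02:00:00
#SBATCH --mem-per-cpu=2G
#SBATCH --cpus-per-task=4
#SBATCH --job-name=myjob
#SBATCH --output=myjob.out

module load scicomp-python-env
python3 my_script.py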

Job and queue status commands:

Command                           Description
slurm q ; slurm qq                status of your queued jobs (long/short)
slurm partitions                  overview of partitions (A/I/O/T = active/idle/other/total)
slurm cpus PARTITION              list free CPUs in a partition
slurm history [1day,2hour,…]      show the status of recent jobs
seff JOBID                        show the percentage of memory/CPU used by a job (see Monitoring)
sacct -o comment -p -j JOBID      show GPU efficiency
slurm j JOBID                     job details (only while running)
slurm s ; slurm ss PARTITION      show the status of all jobs
sacct                             full history information (advanced, needs arguments)

Full slurm command help:

$ slurm

Show or watch job queue:
 slurm [watch] queue     show own jobs
 slurm [watch] q   show user's jobs
 slurm [watch] quick     show quick overview of own jobs
 slurm [watch] shorter   sort and compact entire queue by job size
 slurm [watch] short     sort and compact entire queue by priority
 slurm [watch] full      show everything
 slurm [w] [q|qq|ss|s|f] shorthands for above!
 slurm qos               show job service classes
 slurm top [queue|all]   show summary of active users
Show detailed information about jobs:
 slurm prio [all|short]  show priority components
 slurm j|job      show everything else
 slurm steps      show memory usage of running srun job steps
Show usage and fair-share values from accounting database:
 slurm h|history   show jobs finished since, e.g. "1day" (default)
 slurm shares
Show nodes and resources in the cluster:
 slurm p|partitions      all partitions
 slurm n|nodes           all cluster nodes
 slurm c|cpus            total cpu cores in use
 slurm cpus   cores available to partition, allocated and free
 slurm cpus jobs         cores/memory reserved by running jobs
 slurm cpus queue        cores/memory required by pending jobs
 slurm features          List features and GRES

Examples:
 slurm q
 slurm watch shorter
 slurm cpus batch
 slurm history 3hours

Other advanced commands (many require lots of parameters to be useful):

Command          Description
squeue           full info on queues
sinfo            advanced info on partitions
slurm nodes      list all nodes

Slurm examples

See also: Serial Jobs, Array jobs: embarrassingly parallel execution.

Simple batch script, submit with sbatch the_script.sh:

#!/bin/bash -l
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=1G

module load anaconda
python my_script.py

Simple batch script with array (can also submit with sbatch --array=1-10 the_script.sh):

#!/bin/bash -l
#SBATCH --array=1-10

python my_script.py --seed=$SLURM_ARRAY_TASK_ID
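Simple batch script sketch that starts several tasks (srun runs the given command once per task, as described in the job submission options above):

#!/bin/bash -l
#SBATCH --time=00:10:00
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=500M

srun hostname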

Hardware

See also: Cluster technical overview.

pe[1-48,65-81]: 65 nodes, Dell PowerEdge C4130 (2016)
    Arch (--constraint): hsw avx avx2
    CPU: 2x12 core Xeon E5 2680 v3 2.50GHz.  Memory: 128GB DDR4-2133.  Infiniband: FDR.  Disks: 900GB HDD.

pe[49-64,82]: 17 nodes, Dell PowerEdge C4130 (2016)
    Arch (--constraint): hsw avx avx2
    CPU: 2x12 core Xeon E5 2680 v3 2.50GHz.  Memory: 256GB DDR4-2133.  Infiniband: FDR.  Disks: 900GB HDD.

pe[83-91]: 8 nodes, Dell PowerEdge C4130 (2017)
    Arch (--constraint): bdw avx avx2
    CPU: 2x14 core Xeon E5 2680 v4 2.40GHz.  Memory: 128GB DDR4-2400.  Infiniband: FDR.  Disks: 900GB HDD.

skl[1-48]: 48 nodes, Dell PowerEdge C6420 (2019)
    Arch (--constraint): skl avx avx2 avx512
    CPU: 2x20 core Xeon Gold 6148 2.40GHz.  Memory: 192GB DDR4-2667.  Infiniband: EDR.  Disks: none.

csl[1-48]: 48 nodes, Dell PowerEdge C6420 (2020)
    Arch (--constraint): csl avx avx2 avx512
    CPU: 2x20 core Xeon Gold 6248 2.50GHz.  Memory: 192GB DDR4-2667.  Infiniband: EDR.  Disks: none.

milan[1-32]: 32 nodes, Dell PowerEdge C6525 (2023)
    Arch (--constraint): milan avx avx2
    CPU: 2x64 core AMD EPYC 7713 @ 2.0GHz.  Memory: 512GB DDR4-3200.  Infiniband: HDR-100.  Disks: none.

fn3: 1 node, Dell PowerEdge R940 (2020)
    Arch (--constraint): avx avx2 avx512
    CPU: 4x20 core Xeon Gold 6148 2.40GHz.  Memory: 2TB DDR4-2666.  Infiniband: EDR.  Disks: none.

gpu[1-10]: 10 nodes, Dell PowerEdge C4140 (2020)
    Arch (--constraint): skl avx avx2 avx512 volta
    CPU: 2x8 core Intel Xeon Gold 6134 @ 3.2GHz.  Memory: 384GB DDR4-2667.  Infiniband: EDR.
    GPUs: 4x V100 32GB.  Disks: 1.5 TB SSD.

gpu[11-17,38-44]: 14 nodes, Dell PowerEdge XE8545 (2021, 2023)
    Arch (--constraint): milan avx avx2 ampere a100
    CPU: 2x24 core AMD EPYC 7413 @ 2.65GHz.  Memory: 503GB DDR4-3200.  Infiniband: EDR.
    GPUs: 4x A100 80GB.  Disks: 440 GB SSD.

gpu[20-22]: 3 nodes, Dell PowerEdge C4130 (2016)
    Arch (--constraint): hsw avx avx2 kepler
    CPU: 2x6 core Xeon E5 2620 v3 2.50GHz.  Memory: 128GB DDR4-2133.  Infiniband: EDR.
    GPUs: 4x K80 (2 GPUs per card).  Disks: 440 GB SSD.

gpu[23-27]: 5 nodes, Dell PowerEdge C4130 (2017)
    Arch (--constraint): hsw avx avx2 pascal
    CPU: 2x12 core Xeon E5-2680 v3 @ 2.5GHz.  Memory: 256GB DDR4-2400.  Infiniband: EDR.
    GPUs: 4x P100.  Disks: 720 GB SSD.

gpu[28-37]: 10 nodes, Dell PowerEdge C4140 (2019)
    Arch (--constraint): skl avx avx2 avx512 volta
    CPU: 2x8 core Intel Xeon Gold 6134 @ 3.2GHz.  Memory: 384GB DDR4-2667.  Infiniband: EDR.
    GPUs: 4x V100 32GB.  Disks: 1.5 TB SSD.

dgx[1-2]: 2 nodes, Nvidia DGX-1 (2018)
    Arch (--constraint): bdw avx avx2 volta
    CPU: 2x20 core Xeon E5-2698 v4 @ 2.2GHz.  Memory: 512GB DDR4-2133.  Infiniband: EDR.
    GPUs: 8x V100 16GB.  Disks: 7 TB SSD.

dgx[3-7]: 5 nodes, Nvidia DGX-1 (2018)
    Arch (--constraint): bdw avx avx2 volta
    CPU: 2x20 core Xeon E5-2698 v4 @ 2.2GHz.  Memory: 512GB DDR4-2133.  Infiniband: EDR.
    GPUs: 8x V100 32GB.  Disks: 7 TB SSD.

gpuamd1: 1 node, Dell PowerEdge R7525 (2021)
    Arch (--constraint): rome avx avx2 mi100
    CPU: 2x8 core AMD EPYC 7262 @ 3.2GHz.  Memory: 250GB DDR4-3200.  Infiniband: EDR.
    GPUs: 3x MI100.  Disks: 32GB SSD.

GPUs

See also: GPU computing.

Tesla K80*
    Slurm feature (--constraint=): kepler.  Slurm gres (--gres=gpu:NAME:n): teslak80.
    Total amount: 12.  Nodes: gpu[20-22].  Architecture: Kepler.
    Compute threads per GPU: 2x2496.  Memory per card: 2x12GB.  CUDA compute capability: 3.7.

Tesla P100
    Slurm feature: pascal.  Slurm gres: teslap100.
    Total amount: 20.  Nodes: gpu[23-27].  Architecture: Pascal.
    Compute threads per GPU: 3854.  Memory per card: 16GB.  CUDA compute capability: 6.0.

Tesla V100
    Slurm feature: volta.  Slurm gres: v100.
    Total amount: 40.  Nodes: gpu[1-10].  Architecture: Volta.
    Compute threads per GPU: 5120.  Memory per card: 32GB.  CUDA compute capability: 7.0.

Tesla V100
    Slurm feature: volta.  Slurm gres: v100.
    Total amount: 40.  Nodes: gpu[28-37].  Architecture: Volta.
    Compute threads per GPU: 5120.  Memory per card: 32GB.  CUDA compute capability: 7.0.

Tesla V100
    Slurm feature: volta.  Slurm gres: v100.
    Total amount: 16.  Nodes: dgx[1-7].  Architecture: Volta.
    Compute threads per GPU: 5120.  Memory per card: 16GB.  CUDA compute capability: 7.0.

Tesla A100
    Slurm feature: ampere.  Slurm gres: a100.
    Total amount: 56.  Nodes: gpu[11-17,38-44].  Architecture: Ampere.
    Compute threads per GPU: 7936.  Memory per card: 80GB.  CUDA compute capability: 8.0.

AMD MI100 (testing)
    Slurm feature: mi100.  Slurm gres: none; use -p gpu-amd only, no --gres.
    Nodes: gpuamd[1].
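A minimal GPU job sketch using the feature and gres names from the table above (the module name and train_model.py are placeholders; the constraint line is optional and just picks a GPU generation):

#!/bin/bash -l
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1
#SBATCH --constraint=volta
#SBATCH --mem=8G

module load scicomp-python-env
python3 train_model.py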

Command line

See also: Linux shell crash course.

General notes

The command line has many small programs that, when connected together, allow you to do many things. Only a little bit of this is shown here.

Programs are generally silent if everything worked, and only print an error if something goes wrong.

ls [DIR]
    List the current directory (or DIR if given).

pwd
    Print the current directory.

cd DIR
    Change directory. .. is the parent directory, / is the root directory, and / also separates directory names when chaining them, e.g. dir1/dir2 or ../../

nano FILE
    Edit a file (there are many other editors, but nano is common, nice, and simple).

mkdir DIR-NAME
    Make a new directory.

cat FILE
    Print the entire contents of a file to standard output (the terminal).

less FILE
    less is a "pager" that lets you scroll through a file (up/down/pageup/pagedown). q quits, / searches.

mv SOURCE DEST
    Move (=rename) a file. mv SOURCE1 SOURCE2 DEST-DIRECTORY/ moves multiple files into a directory.

cp SOURCE DEST
    Copy a file. The DEST-DIRECTORY/ syntax of mv works as well.

rm FILE ...
    Remove a file. Note: from the command line there is no recovery, so always pause and check before running this command! The -i option makes it confirm before removing each file. Add -r to remove whole directories recursively.

head [FILE]
    Print the first 10 lines (or N lines with -n N) of a file. Can take input from standard input instead of FILE. tail is similar but prints the end of the file.

tail [FILE]
    See above.

grep PATTERN [FILE]
    Print lines matching a pattern in a file; useful as a primitive find feature or for quickly searching output. Can also read standard input instead of FILE.

du [-ash] [DIR]
    Print the disk usage of a directory. The default unit is KiB, rounded up to block sizes (1 or 4 KiB); -h means "human readable" (MB, GB, etc.), -s means "only DIR itself, not every subdirectory", and -a means "all files, not only directories". A common pattern is du -h DIR | sort -h to print all directories and their sizes, sorted by size.

stat
    Show detailed information on a file's properties.

find [DIR]
    find can do almost anything, which means it is really hard to use well. Practically: with only a directory argument it prints all files and directories recursively, which can be useful by itself. Many of us do find DIR | grep NAME to grep for the name we want (even though this isn't the "right way"; there are find options that do the same thing more efficiently).

| (pipe): COMMAND1 | COMMAND2
    The output of COMMAND1 is sent to the input of COMMAND2. Useful for combining simple commands into complex operations; a core part of the Unix philosophy.

> (output redirection): COMMAND > FILE
    Write the standard output of COMMAND to FILE. Any existing content is lost.

>> (appending output redirection): COMMAND >> FILE
    Like above, but does not lose content: it appends.

< (input redirection): COMMAND < FILE
    Opposite of >: input to COMMAND comes from FILE.

type COMMAND or which COMMAND
    Show exactly what will be run for a given command (e.g. type python3).

man COMMAND-NAME
    Browse the on-line help for a command. q exits, / searches (it uses less as its pager by default).

-h and --help
    Common command-line options to print help on a command, but each command has to implement them.
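For example, combining a few of the pieces above (the directory and the pattern result are placeholders): list everything under a project directory, keep only the paths containing result, and save that list to a file:

find $WRKDIR/myproject | grep result > result_files.txt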