Triton quick reference

This page collects all the important reference information in one place.

Quick reference guide for the Triton cluster at Aalto University, but also useful for many other Slurm clusters. See also this printable Triton cheatsheet, as well as other cheatsheets.

Connecting

See also: Connecting to Triton.

ssh from Aalto networks
    Standard way of connecting via the command line. Hostname is triton.aalto.fi. More SSH info.
    Linux/Mac/Windows from the command line: ssh USERNAME@triton.aalto.fi
    Windows: same; see Connecting via ssh for detailed options.
    From where: VPN and Aalto networks (that is VPN, most wired networks, internal servers, eduroam, and "aalto" only on an Aalto-managed laptop, but not "aalto open"). Simplest SSH option if you can use VPN.

ssh (from rest of Internet)
    Use the Aalto VPN and the row above. If needed: same as above, but you must set up an SSH key first and then ssh -J USERNAME@kosh.aalto.fi USERNAME@triton.aalto.fi (see the sketch after this table).
    From where: whole Internet, if you first set up an SSH key AND also use passwords (since 2023).

Open OnDemand
    https://ondemand.triton.aalto.fi, a web-based interface to the cluster, also known as OOD. Includes shell access, GUI, data transfer, Jupyter, and a number of GUI applications such as Matlab. More info.
    From where: whole Internet.

Jupyter
    Since April 2024 Jupyter is part of Open OnDemand, see above. Use the “Jupyter” app to get the same environment as before. More info.
    From where: see Open OnDemand above.

VS Code / Codium desktop
    With the “Remote-SSH” extension it can provide shell access and file transfer. See the VS Code page for some important usage warnings.
    From where: same as the SSH options above.
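If you connect from outside Aalto often, a ProxyJump entry in ~/.ssh/config saves typing. A minimal sketch, assuming you have set up SSH keys as described above (the host alias "triton" is just an example):

# ~/.ssh/config: jump through kosh.aalto.fi to reach Triton from anywhere
Host triton
    HostName triton.aalto.fi
    User USERNAME
    ProxyJump USERNAME@kosh.aalto.fi

After this, ssh triton also works from outside Aalto.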

Modules

See also: Software modules.

Command

Description

module load NAME

load module

module avail

list all modules

module spider PATTERN

search modules

module spider NAME/ver

show prerequisite modules to this one

module list

list currently loaded modules

module show NAME

details on a module

module help NAME

help text for a module

module unload NAME

unload a module

module save ALIAS

save module collection to this alias (saved in ~/.lmod.d/)

module savelist

list all saved collections

module describe ALIAS

details on a collection

module restore ALIAS

load saved module collection (faster than loading individually)

module purge

unload all loaded modules (faster than unloading individually)
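For example, a typical workflow is to search for a module, load what you need, and save the set so it can be restored quickly in later sessions or job scripts (the module names here are only illustrative):

# search, load, and save a set of modules
module spider gcc           # find available versions
module load gcc openmpi     # load what you need
module save my-project      # save the collection to ~/.lmod.d/
# later, e.g. at the top of a batch script:
module restore my-project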

Common software

See also: Applications.

  • Python: module load scicomp-python-env for an Aalto Scientific Computing managed Python environment with common packages. More info.

    • module load mamba for mamba/conda for making your own environments (see below)

  • R: module load r for a basic R installation. More info.

    • module load scicomp-r-env for an R module with various packages pre-installed

  • Matlab: module load matlab for the latest Matlab version. More info.

Storage

See also: Data storage

Home: $HOME or /home/USERNAME/
    Hard quota 10GB; nightly backup; shared across all nodes. Small user-specific files, no calculation data.

Work: $WRKDIR or /scratch/work/USERNAME/
    Quota 200GB and 1 million files; no backup; shared across all nodes. Personal working space for every user: calculation data etc. Quota can be increased on request.

Scratch: /scratch/DEPT/PROJECT/
    Quota on request; no backup; shared across all nodes. Department/group-specific project directories.

Local temp: /tmp/ (nodes with disks only)
    Quota is the local disk size; no backup; single node only. (Usually fastest) place for single-node calculation data. Removed once the user’s jobs are finished on the node. Request with --constraint=localdisk (see the sketch after this table).

ramfs: /dev/shm/ (and /tmp/ on diskless nodes)
    Limited by memory; no backup; single node only. Very fast but small in-memory filesystem.
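A sketch of using the node-local disk in a batch job, assuming the input lives in $WRKDIR (the program and file names are illustrative):

#!/bin/bash -l
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=2G
#SBATCH --constraint=localdisk    # only nodes that have a local /tmp disk

# copy input to fast node-local storage, compute there, copy results back
cp $WRKDIR/input.dat /tmp/
cd /tmp
my_program input.dat > output.dat
cp output.dat $WRKDIR/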

Remote data access

See also: Remote access to data.

Method

Description

rsync transfers

Transfer back and forth via command line. Set up ssh first.

rsync triton.aalto.fi:/path/to/file.txt file.txt

rsync file.txt triton.aalto.fi:/path/to/file.txt

SFTP transfers

Operates over SSH. sftp://triton.aalto.fi in file browsers (Linux at least), FileZilla (to triton.aalto.fi).

SMB mounting

Mount (make the remote filesystem viewable locally) on your own computer.

Linux: File browser, smb://data.triton.aalto.fi/scratch/

MacOS: File browser, same URL as Linux

Windows: \\data.triton.aalto.fi\scratch\
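A common pattern is syncing a whole directory with rsync; a sketch using standard options (-a preserves attributes and recurses, -v is verbose, -z compresses; the directory names are illustrative):

# local project directory -> your Triton work directory
rsync -avz myproject/ triton.aalto.fi:/scratch/work/USERNAME/myproject/
# and results back again
rsync -avz triton.aalto.fi:/scratch/work/USERNAME/myproject/results/ results/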

Partitions

<default>: if you leave the partition off, all possible partitions will be used (chosen based on time and memory requirements).

Use slurm partitions to see the full, current list with details (max job size, memory per core, total memory, cores per node, and limits).

Job submission

See also: Serial Jobs, Array jobs: embarrassingly parallel execution, Parallel computing: different methods explained.

Command

Description

sbatch

submit a job to queue (see standard options below)

srun

Within a running job script/environment: Run code using the allocated resources (see options below)

srun

On frontend: submit to queue, wait until done, show output. (see options below)

sinteractive

Submit job, wait, provide shell on node for interactive playing (X forwarding works, default partition interactive). Exit shell when done. (see options below)

srun --pty bash

(advanced) Another way to run interactive jobs, no X forwarding but simpler. Exit shell when done.

scancel JOBID

Cancel a job in queue

salloc

(advanced) Allocate resources from frontend node. Use srun to run using those resources, exit to close shell when done (see options below)

scontrol

View/modify job and slurm configuration

Options for sbatch, srun, sinteractive, etc.:

-t, --time=HH:MM:SS
    time limit

-t, --time=DD-HH
    time limit, days-hours

-p, --partition=PARTITION
    job partition. Usually leave this off and it is auto-detected.

--mem-per-cpu=N
    request N MB of memory per core

--mem=N
    request N MB of memory per node

-c, --cpus-per-task=N
    allocate N CPUs for each task, for multithreaded jobs (compare --ntasks: -c is the number of cores for each process started)

-N, --nodes=N-M
    allocate a minimum of N and a maximum of M nodes

-n, --ntasks=N
    allocate resources for and start N tasks (one task = one process started; it is up to you to make them communicate). The main script runs only on the first node, but sub-processes started with srun run this many times (see the sketch after this list).

-J, --job-name=NAME
    short job name

-o OUTPUTFILE
    print output into OUTPUTFILE

-e ERRORFILE
    print errors into ERRORFILE

--exclusive
    allocate exclusive access to nodes, for large parallel jobs

--constraint=FEATURE
    request a feature (see slurm features for the current list of configured features, or Arch in the hardware list). Multiple with --constraint="hsw|skl".

--constraint=localdisk
    request nodes that have local disks

--array=0-5,7,10-15
    run the job multiple times; use the variable $SLURM_ARRAY_TASK_ID to adjust parameters

--gres=gpu
    request a GPU, or --gres=gpu:N for multiple

--mail-type=TYPE
    notify of events: BEGIN, END, FAIL, REQUEUE (not on Triton), or ALL. Must be used together with --mail-user=.

--mail-user=YOUR@EMAIL
    whom to send the email to

srun -N N_NODES hostname
    print the allocated nodes (from within a job script)
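A sketch of how --ntasks and srun interact inside a batch script: the script itself runs once, but each srun step is launched once per task (the resource numbers are just examples):

#!/bin/bash -l
#SBATCH --time=00:10:00
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=500M

srun hostname    # runs 4 times, printing the node(s) the tasks landed on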

Command

Description

slurm q ; slurm qq

Status of your queued jobs (long/short)

slurm partitions

Overview of partitions (A/I/O/T=active,idle,other,total)

slurm cpus PARTITION

list free CPUs in a partition

slurm history [1day,2hour,…]

Show status of recent jobs

seff JOBID

Show percent of mem/CPU used in job. See Monitoring.

sacct -o comment -p -j JOBID

Show GPU efficiency

slurm j JOBID

Job details (only while running)

slurm s ; slurm ss PARTITION

Show status of all jobs

sacct

Full history information (advanced, needs args)
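After a job finishes, a quick check of how much of the request was actually used might look like this (JOBID is a placeholder; the sacct format fields shown are standard Slurm ones, pick whichever you need):

seff JOBID                       # summary of CPU and memory efficiency
slurm history 2hour              # recent jobs and their state
sacct -j JOBID --format=JobID,JobName,Elapsed,MaxRSS,State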

Full slurm command help:

$ slurm

Show or watch job queue:
 slurm [watch] queue     show own jobs
 slurm [watch] q   show user's jobs
 slurm [watch] quick     show quick overview of own jobs
 slurm [watch] shorter   sort and compact entire queue by job size
 slurm [watch] short     sort and compact entire queue by priority
 slurm [watch] full      show everything
 slurm [w] [q|qq|ss|s|f] shorthands for above!
 slurm qos               show job service classes
 slurm top [queue|all]   show summary of active users
Show detailed information about jobs:
 slurm prio [all|short]  show priority components
 slurm j|job      show everything else
 slurm steps      show memory usage of running srun job steps
Show usage and fair-share values from accounting database:
 slurm h|history   show jobs finished since, e.g. "1day" (default)
 slurm shares
Show nodes and resources in the cluster:
 slurm p|partitions      all partitions
 slurm n|nodes           all cluster nodes
 slurm c|cpus            total cpu cores in use
 slurm cpus   cores available to partition, allocated and free
 slurm cpus jobs         cores/memory reserved by running jobs
 slurm cpus queue        cores/memory required by pending jobs
 slurm features          List features and GRES

Examples:
 slurm q
 slurm watch shorter
 slurm cpus batch
 slurm history 3hours

Other advanced commands (many require lots of parameters to be useful):

Command

Description

squeue

Full info on queues

sinfo

Advanced info on partitions

slurm nodes

List all nodes

Slurm examples

See also: Serial Jobs, Array jobs: embarrassingly parallel execution.

Simple batch script, submit with sbatch the_script.sh:

#!/bin/bash -l
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=1G

module load scicomp-python-env
python my_script.py

Simple batch script with array (can also submit with sbatch --array=1-10 the_script.sh):

#!/bin/bash -l
#SBATCH --array=1-10

python my_script.py --seed=$SLURM_ARRAY_TASK_ID
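Two more sketches combining options from the tables above. First, a single-GPU job (the script name is illustrative):

#!/bin/bash -l
#SBATCH --time=04:00:00
#SBATCH --mem=8G
#SBATCH --gres=gpu:1

module load scicomp-python-env
python train.py

And a multithreaded single-node job, where the allocated CPU count is passed to the program via OMP_NUM_THREADS (assuming the program uses OpenMP-style threading; the program name is illustrative):

#!/bin/bash -l
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1G

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_program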

Hardware

See also: Cluster technical overview.

For each node group: number of nodes and model (year); Arch / --constraint features; CPU type; memory configuration; Infiniband; GPUs (if any); local disks.

pe[1-48,65-81]: 65x Dell PowerEdge C4130 (2016); hsw avx2; 2x12 core Xeon E5 2680 v3 2.50GHz; 128GB DDR4-2133; FDR; 900GB HDD

pe[49-64,82]: 17x Dell PowerEdge C4130 (2016); hsw avx2; 2x12 core Xeon E5 2680 v3 2.50GHz; 256GB DDR4-2133; FDR; 900GB HDD

pe[83-91]: 8x Dell PowerEdge C4130 (2017); bdw avx2; 2x14 core Xeon E5 2680 v4 2.40GHz; 128GB DDR4-2400; FDR; 900GB HDD

skl[1-48]: 48x Dell PowerEdge C6420 (2019); skl avx2 avx512; 2x20 core Xeon Gold 6148 2.40GHz; 192GB DDR4-2667; EDR; no disk

csl[1-48]: 48x Dell PowerEdge C6420 (2020); csl avx2 avx512; 2x20 core Xeon Gold 6248 2.50GHz; 192GB DDR4-2667; EDR; no disk

milan[1-32]: 32x Dell PowerEdge C6525 (2023); milan avx2; 2x64 core AMD EPYC 7713 @ 2.0GHz; 512GB DDR4-3200; HDR-100; no disk

fn3: 1x Dell PowerEdge R940 (2020); avx2 avx512; 4x20 core Xeon Gold 6148 2.40GHz; 2TB DDR4-2666; EDR; no disk

gpu[1-10]: 10x Dell PowerEdge C4140 (2020); skl avx2 avx512 volta; 2x8 core Intel Xeon Gold 6134 @ 3.2GHz; 384GB DDR4-2667; EDR; 4x V100 32GB; 1.5 TB SSD

gpu[11-17,38-44]: 14x Dell PowerEdge XE8545 (2021, 2023); milan avx2 ampere a100; 2x24 core AMD EPYC 7413 @ 2.65GHz; 503GB DDR4-3200; EDR; 4x A100 80GB; 440 GB SSD

gpu[20-22]: 3x Dell PowerEdge C4130 (2016); hsw avx2 kepler; 2x6 core Xeon E5 2620 v3 2.50GHz; 128GB DDR4-2133; EDR; 4x2 GPU K80; 440 GB SSD

gpu[23-27]: 5x Dell PowerEdge C4130 (2017); hsw avx2 pascal; 2x12 core Xeon E5-2680 v3 @ 2.5GHz; 256GB DDR4-2400; EDR; 4x P100; 720 GB SSD

gpu[28-37]: 10x Dell PowerEdge C4140 (2019); skl avx2 avx512 volta; 2x8 core Intel Xeon Gold 6134 @ 3.2GHz; 384GB DDR4-2667; EDR; 4x V100 32GB; 1.5 TB SSD

dgx[1-2]: 2x Nvidia DGX-1 (2018); bdw avx2 volta; 2x20 core Xeon E5-2698 v4 @ 2.2GHz; 512GB DDR4-2133; EDR; 8x V100 16GB; 7 TB SSD

dgx[3-7]: 5x Nvidia DGX-1 (2018); bdw avx2 volta; 2x20 core Xeon E5-2698 v4 @ 2.2GHz; 512GB DDR4-2133; EDR; 8x V100 32GB; 7 TB SSD

gpuamd1: 1x Dell PowerEdge R7525 (2021); rome avx2 mi100; 2x8 core AMD EPYC 7262 @ 3.2GHz; 250GB DDR4-3200; EDR; 3x MI100; 32GB SSD

gpu[45-48]: 4x Dell PowerEdge XE8640 (2024); saphr avx2 h100 hopper; 2x48 core Xeon Platinum 8468 2.1GHz; 1024GB DDR5-4800; HDR; 4x H100 SXM 80GB; 21 TB SSD

GPUs

See also: GPU computing.

For each card type: Slurm partition (--partition=); Slurm feature name (--constraint=); Slurm gres name (--gres=gpu:NAME:n); total number of cards; nodes; architecture; compute threads per GPU; memory per card; CUDA compute capability.

Tesla K80*: partition not available; constraint kepler; gres teslak80; 12 cards; nodes gpu[20-22]; Kepler; 2x2496 threads; 2x12GB; CUDA cc 3.7

Tesla P100: partition gpu-p100-16g; constraint pascal; gres teslap100; 20 cards; nodes gpu[23-27]; Pascal; 3854 threads; 16GB; CUDA cc 6.0

Tesla V100: partition gpu-v100-32g; constraint volta; gres v100; 40 cards; nodes gpu[1-10]; Volta; 5120 threads; 32GB; CUDA cc 7.0

Tesla V100: partition gpu-v100-32g; constraint volta; gres v100; 40 cards; nodes gpu[28-37]; Volta; 5120 threads; 32GB; CUDA cc 7.0

Tesla V100: partition gpu-v100-16g; constraint volta; gres v100; 16 cards; nodes dgx[1-2]; Volta; 5120 threads; 16GB; CUDA cc 7.0

Tesla V100: partition gpu-v100-32g; constraint volta; gres v100; 16 cards; nodes dgx[3-7]; Volta; 5120 threads; 32GB; CUDA cc 7.0

Tesla A100: partition gpu-a100-80g; constraint ampere; gres a100; 56 cards; nodes gpu[11-17,38-44]; Ampere; 7936 threads; 80GB; CUDA cc 8.0

Tesla H100: partition gpu-h100-80g; constraint hopper; gres h100; 16 cards; nodes gpu[45-48]; Hopper; 16896 threads; 80GB; CUDA cc 9.0

AMD MI100 (testing): not yet installed; constraint mi100; use -p gpu-amd only, no --gres; nodes gpuamd[1]

Conda Environments (Mamba)

See also: Python Environments with Conda. Note that mamba is a drop-in replacement for conda.

Command

Description

module load mamba

Load the module that provides conda/mamba on Triton, for making your own environments. mamba is a faster drop-in replacement for conda.

First time setup

See the link for six commands to run once per user account on Triton (to avoid filling up all the space in your home directory).

name: conda-example
channels:
  - conda-forge
dependencies:
  - numpy
  - pandas

Minimal environment.yml example. By defining our requirements in one place, our environment becomes reproducible and we can solve problems by re-creating it. “Dependencies” lists packages that will be installed.

Environment management:

Creating, activating, removing:

mamba env create --file environment.yml

Create an environment from a yaml file. Use -n NAME to set or override the name from the .yml file. Environments created with -n are stored in the directories listed by conda config --show envs_dirs.

source activate NAME

Activate environment of name NAME. Note we use this and not conda init/conda activate to avoid changing Python for your whole account. HPC Cluster specific.

source deactivate

Deactivate conda from this session. HPC Cluster specific.

mamba env list

List all environments.

mamba env remove -n NAME

Remove the environment of that name.
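Putting the pieces together, a typical first use on Triton might look like this sketch (conda-example is the name from the environment.yml example above):

module load mamba
mamba env create --file environment.yml    # creates the "conda-example" environment
source activate conda-example
python -c "import numpy; print(numpy.__version__)"
source deactivate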

Package management:

Inside the activated environment:

mamba list

List packages in currently active environment.

mamba env update --file environment.yml

Update an environment based on updated environment.yml

mamba install --freeze-installed --channel CHANNEL PACKAGE_NAME

Install packages into an environment with minimal changes to what is already installed. If the package is a real dependency, it is usually better to add it to environment.yml and update the environment as in the previous line.

mamba env export

Export an environment.yml that describes the current environment. Add --no-builds to make it more portable across operating systems. Add --from-history to list only what you have explicitly requested in the past.

mamba search [--channel conda-forge] NAME

Search for a package. List includes name, version, build version (often including linked libraries like Python/CUDA), and channel.

Other notes:

mamba ...

Use mamba instead of conda for faster operations. mamba is a drop-in replacement; it needs to be available in the environment you use (the mamba module provides it).

mamba clean -a

Clean up cached files to free up space (not environments or packages in them).

CONDA_OVERRIDE_CUDA="11.2" mamba ..

Used when creating a CUDA environment on a login node (choose the right CUDA version for you). Use it with mamba env create or mamba install to indicate that CUDA will be available when the program actually runs.

TensorFlow with CUDA
    Channel conda-forge, package selection tensorflow=*=*cuda*. The first * can be replaced with a TensorFlow version specification.

PyTorch with CUDA
    Channels pytorch and conda-forge, package selection pytorch=*=*cuda*. The first * can be replaced with a PyTorch version specification.

CUDA
    In the conda-forge channel, selected automatically based on the software you need. For manual compilation, use the cudatoolkit package from conda-forge.
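For example, a sketch of creating a GPU-enabled PyTorch environment on a login node, combining CONDA_OVERRIDE_CUDA and the package selection above (the environment name and CUDA version are placeholders; pick the versions you actually need):

module load mamba
CONDA_OVERRIDE_CUDA="11.2" mamba create -n my-torch-env \
    --channel pytorch --channel conda-forge \
    'pytorch=*=*cuda*'
source activate my-torch-env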

Command line

See also: Linux shell crash course.

General notes

The command line has many small programs that, when connected together, allow you to do many things. Only a little bit of this is shown here.

Programs are generally silent if everything worked, and only print an error if something goes wrong.

ls [DIR]

List current directory (or DIR if given).

pwd

Print current directory.

cd DIR

Change directory. .. is the parent directory, / is the root directory, and / also chains directories together, e.g. dir1/dir2 or ../../

nano FILE

Edit a file (there are many other editors, but nano is common, nice, and simple).

mkdir DIR-NAME

Make a new directory.

cat FILE

Print entire contents of file to standard output (the terminal).

less FILE

Less is a “pager”, and lets you scroll through a file (up/down/pageup/pagedown). q to quit, / to search.

mv SOURCE DEST

Move (=rename) a file. mv SOURCE1 SOURCE2 DEST-DIRECTORY/ moves multiple files into a directory.

cp SOURCE DEST

Copy a file. The DEST-DIRECTORY/ syntax of mv works as well.

rm FILE ...

Remove a file. Note, from the command line there is no recovery, so always pause and check before running this command! The -i option will make it confirm before removing each file. Add -r to remove whole directories recursively.

head [FILE]

Print the first 10 lines (or N lines with -n N) of a file. Can take input from standard input instead of FILE. tail is similar but prints the end of the file.

tail [FILE]

See above.

grep PATTERN [FILE]

Print lines matching a pattern in a file, suitable as a primitive find feature, or quickly searching for output. Can also use standard input instead of FILE.

du [-ash] [DIR]

Print disk usage of a directory. Default is KiB, rounded up to block sizes (1 or 4 KiB), -h means “human readable” (MB, GB, etc), -s means “only of DIR, not all subdirectories also”. -a means “all files, not only directories”. A common pattern is du -h DIR | sort -h to print all directories and their sizes, sorted by size.

stat

Show detailed information on a file’s properties.

find [DIR]

find can do almost anything, but that means it’s really hard to use it well. Let’s be practical: with only a directory argument, it prints all files and directories recursively, which might be useful itself. Many of us do find DIR | grep NAME to grep for the name we want (even though this isn’t the “right way”, there are find options which do this same thing more efficiently).

| (pipe): COMMAND1 | COMMAND2

The output of COMMAND1 is sent to the input of COMMAND2. Useful for combining simple commands together into complex operations - a core part of the unix philosophy.

> (output redirection): COMMAND > FILE

Write standard output of COMMAND to FILE. Any existing content is lost.

>> (appending output redirection): COMMAND >> FILE

Like above, but doesn’t lose content: it appends.

< (input redirection): COMMAND < FILE

Opposite of >, input to COMMAND comes from FILE.

type COMMAND or which COMMAND

Show exactly what will be run, for a given command (e.g. type python3).

man COMMAND-NAME

Browse on-line help for a command. q will exit, / will search (it uses less as its pager by default).

-h and --help

Common command line options to print help on a command. But, it has to be implemented by each command.
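For instance, a few of these pieces combined (the file names are illustrative):

# save all lines mentioning "error" from a log file, then look at the first few
grep error slurm-output.txt > errors.txt
head errors.txt

# size of each item in the current directory, human-readable, sorted by size
du -sh * | sort -h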