Frequently asked questions
Job status and submission
Why are my jobs waiting in the queue with reason AssocGrpMemRunMinutes/AssocGrpCPURunMinutes or such?
Accounts are limited in how much they can run at a time, in order to prevent a single user or a few users from hogging the entire cluster with long-running jobs if it happens to be idle (e.g. after a service break). The limit applies to the total remaining runtime of all the running jobs of a user. So the way to run more jobs concurrently is to run shorter and/or smaller (fewer CPUs, less memory) jobs. For an in-depth explanation see http://tech.ryancox.net/2014/04/scheduler-limit-remaining-cputime-per.html and for a graphical simulator you can play around with: https://rc.byu.edu/simulation/grpcpurunmins.php . You can see the exact limits of your account with
sacctmgr -s show user $USER format=user,account,grptresrunmins%70
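As a rough illustration of how such a limit adds up, the sketch below assumes that the limit counts allocated CPUs multiplied by remaining minutes, summed over running jobs. The function name and the job sizes are hypothetical; compare the results with the grptresrunmins value reported for your own account.

```shell
# Hypothetical helper: CPU-minutes that a set of identical jobs holds
# against the limit at the moment they start.
# Arguments: number of jobs, CPUs per job, requested hours per job
# (plain integers without leading zeros).
cpu_run_minutes() {
    echo $(( $1 * $2 * $3 * 60 ))
}

cpu_run_minutes 10 4 120   # ten 4-CPU jobs of 5 days each
cpu_run_minutes 10 4 24    # the same jobs split into 1-day pieces
```

Under this assumption, cutting the requested time to a fifth lets five times as many such jobs run at once under the same limit.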
Why are my jobs in state "launch failed requeued held"?
Slurm is configured such that if a job fails due to some outside reason (e.g. the node where it’s running fails rather than the job itself crashing due to a bug in the job) the job is requeued in a held state. If you’re sure that everything is ok again you can release the job for scheduling with “scontrol release JOBID”. If you don’t want this behavior (i.e. you’d prefer that such failed jobs would just disappear) then you can prevent the requeuing with
Why are my jobs in state "PENDING" with "BadConstraints" when it seems constraints are OK?
This happens when a job is submitted to multiple partitions (this
is the default: it tries to go to partitions of all node types) and
the constraints cannot be satisfied in some of them. Slurm then reports
the BadConstraints reason for the whole job, even though it will
eventually run. (If the constraints are bad in all partitions, the job
will usually fail right when you try to submit it, with something like
sbatch: error: Batch job submission failed: Requested node
configuration is not available).
You don’t need to do anything, but if you want a clean status: you
can get rid of this message by limiting to partitions that
actually satisfy the constraints. For example, if you request 96
CPUs, you can limit to the Milan nodes with
since those are the only nodes with more than 40 CPUs. This
example is valid as of 2023; if you are reading this later, you need
to figure out what the current state is (or ask us).
How can I find out the remaining runtime of my job/allocation?
You can find out the remaining time of any job that is running with
squeue -h -j JOBID -o %L
Inside a job script or sinteractive session you can use the environment variable SLURM_JOB_ID to refer to the current job ID.
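The value printed by squeue uses Slurm's days-hours:minutes:seconds notation, which is awkward to compare in a script. The sketch below converts it to seconds so that a job script can decide whether there is time for another step. The function name slurm_time_to_seconds and the 600-second margin are assumptions made for illustration, not part of Slurm.

```shell
# Hypothetical helper: convert a Slurm time string such as 59:30,
# 4:00:00 or 1-02:03:04 into seconds. Returns non-zero for values
# that are not times (for example UNLIMITED).
slurm_time_to_seconds() {
    echo "$1" | awk -F'[-:]' '
        /[^0-9:-]/ { exit 1 }
        NF == 4 { print (($1 * 24 + $2) * 60 + $3) * 60 + $4; next }
        NF == 3 { print ($1 * 60 + $2) * 60 + $3; next }
        NF == 2 { print $1 * 60 + $2; next }
        { exit 1 }'
}

# Only meaningful inside a job script or sinteractive session,
# so the check is skipped when Slurm is not available.
if [ -n "${SLURM_JOB_ID:-}" ] && command -v squeue >/dev/null 2>&1; then
    if left=$(slurm_time_to_seconds "$(squeue -h -j "$SLURM_JOB_ID" -o %L)") &&
        [ "$left" -lt 600 ]; then
        echo "less than 10 minutes left, skipping the next step"
    fi
fi
```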
There seem to be a lot of jobs running in the short queue that have gone on for longer than 4 hours. Should that be possible?
SLURM kills jobs based on the partition’s TimeLimit + OverTimeLimit
parameter. The latter in our case is 60 minutes. If, for instance, the queue
time limit is 4 hours, SLURM will let a job run there for 4 hours plus 1
hour, thus no longer than 5 hours. However, OverTimeLimit may vary, so don’t
rely on it. The partition’s (aka queue’s) TimeLimit is the one that you
should take into account when submitting a job. You can check the time
limits per partition with the
slurm p command.
To set an exact time after which you want your job to be
killed anyway, set the
--time parameter when submitting the job. When
the time limit is reached, each task in each job step is sent SIGTERM
followed by SIGKILL. If you run a parallel job, set
--time with srun as well. See ‘
man srun’ and ‘
man sbatch’ for details.
srun --time=1:00:00 ...
``srun: error: Unable to allocate resources: Requested node configuration is not available``
You have requested a combination of Slurm options that no node
satisfies (for example, asking for a GPU with
--gres=gpu and a
partition without GPUs). Figure out what the problem is and adjust
your Slurm options.
``srun: Required node not available (down, drained or reserved)``
This error usually occurs when a requested node is down, drained or reserved, which can happen if the cluster is undergoing some work, and is more likely if there are very few default nodes for Slurm to choose from. If this error occurs, the shell will usually hang after the job has been submitted while the job is still waiting for allocation. To find out which nodes are available for running jobs, we can use
sinfo and under the
STATE column you will see for each partition the states of the nodes.
To fix this we can either wait for the node to be available or choose a different partition with the
--partition= option, using one of the partitions from
sinfo which has free and available (
Accounts and Access to Triton
How can I access my Triton files from outside?
The scratch filesystem can be mounted from inside the Aalto networks at
smb://data.triton.aalto.fi/scratch/. For example, from
Nautilus (the file manager) on Ubuntu, use “File” -> “Connect to
server”. Outside Aalto networks, use the Aalto VPN. If it is not an
Aalto computer, you may need to use
AALTO\username as the username,
and your Aalto password.
Or you can use
sshfs, a filesystem client based on SSH. Most Linux workstations
have it installed by default; if not, install it or ask your local IT
support to do it for you. To set up an SSHFS mount from your
local workstation, create a local directory and mount the remote directory:
$ mkdir /LOCALDIR/triton
$ sshfs user1@triton.aalto.fi:/triton/PATH/TO/DIR /LOCALDIR/triton
Replace user1 with your real username and /LOCALDIR with
a real directory on your local drive. After a successful mount, use your
/LOCALDIR/triton directory as if it were local. To unmount it, run
fusermount -u /LOCALDIR/triton.
PHYS users example, assuming that Triton and PHYS accounts are the same:
$ mkdir /localwrk/$USER/triton
$ sshfs triton.aalto.fi:/triton/tfy/work/$USER /localwrk/$USER/triton
$ cd /localwrk/$USER/triton
... (do what you need, and then unmount when there is no need any more)
$ fusermount -u /localwrk/$USER/triton
Easy access with Nautilus
The SSHFS method described above works from any console. On
Linux desktops with a GUI like Gnome or Unity (that is, all
Ubuntu users), you can also use Nautilus, the default file manager, to mount
a remote SSH directory. Click
File -> Connect to Server, choose
SSH, enter triton.aalto.fi as the server and the directory
/triton/PATH/TO/DIR you’d like to mount, and type your username. Leave the
password field empty if you use an SSH key. As soon as Nautilus
establishes the connection, it will appear on the left-hand side below the Network
header. Now you can access it as if it were a local directory. To
keep it as a bookmark, click on the mount point and bookmark it; it
will then appear below the Bookmark header in the same menu.
If your workstation has no NFS mounts from Triton (CS and NBE have them; consult your local admins for exact paths), you can always use SSH. For example, copy your files from Triton to a local directory on your workstation:
$ sftp username@triton.aalto.fi:/triton/path/to/dir/* .
I need to connect to some server on a node
Let’s say you have some server (e.g. debugging server, notebook server, …) running on a node. As usual, you can do this with ssh using port forwarding. It is the same principle as in several of the above questions.
For example, you want to connect from your own computer to port
AAAA on node nnnNNN. You run this command:
ssh -L BBBB:nnnNNN:AAAA username@triton.aalto.fi
Then, when you connect to port
BBBB on your own computer
localhost, it gets forwarded straight to port
AAAA on node
nnnNNN. Thus only one ssh connection gets us to any node.
BBBB can be the same as
AAAA. By the way, this works
with any type of connection. The node has to be listening on any
interface, not just the local interface. To connect to
localhost:AAAA on a node, you need to repeat the above steps twice
to forward from workstation->login and login->node, with the second
Graphical programs don't work (X11, -X)
In order for graphical programs on Linux to work, a file
~/.Xauthority has to be written. If your home directory quota (check with
quota) is exceeded, then this file can’t be written and
graphical programs can’t open. If your quota is exceeded, clean up
some files, close connections, and log in again. You can find where
most of your space goes with
du -h $HOME | sort -hr | less.
This is often the case if you get
X11 connection rejected because of
Storage, file transfer and quota
``Disk quota exceeded`` error but I have plenty of space
Main article: Triton Quotas
Everyone should have a group quota, but no user quota. All files need to be in a proper group (either a shared group with quota, or your “user private group”). First of all, use the ‘quota’ command to make sure that neither disk space nor number of files are exceeded. Also, make sure that you use $WRKDIR for data and not $HOME. If you actually need more quota, ask us.
Solution: add your main directory and all your subdirectories to
the right group, and make sure all directories have the group s-bit set
(SETGID bit, see
man chmod). This means “any files created within
this directory get the directory’s group”. Since your default group is
“domain users”, which has no quota, if the s-bit is not set you get an
immediate quota exceeded error by default.
# Fix everything
# (only for $WRKDIR or group directories, still in testing):
/share/apps/bin/quotafix -sg --fix /path/to/dir/
# Manual fixing:
# Fix setgid bit:
lfs find $WRKDIR -type d --print0 | xargs -0 chmod g+s
# Fix group:
lfs find /path/to/dir ! --group $GROUP -print0 | xargs -0 chgrp $GROUP
Why this happens: the $WRKDIR directory is owned by the user and by the user’s own group, which has the same name as the user and a GID equal to the UID. Quota is set per group, not per user. That is how it has been implemented since 2011, when we took Lustre into use. Since spring 2015 Triton has used Aalto AD for authentication, which sets everyone’s default group to ‘domain users’. If you copy anything to a $WRKDIR subdirectory that has no g+s bit, you copy it as a ‘domain users’ member and the file system refuses to do so because that group has no quota. If the g+s bit is set, all the directories and files you copy or create get the directory’s group ownership instead of the default group ‘domain users’. There can be very confusing interactions between this and user/shared directories.
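The effect of the g+s bit can be tried out safely in a throwaway directory. The sketch below is a local experiment, not a Triton-specific tool: the directory and file names are made up, and it assumes GNU stat (the -c option), as found on typical Linux systems.

```shell
# Create a throwaway directory tree to experiment in.
demo=$(mktemp -d)
mkdir "$demo/project"

# Set the setgid bit (g+s) on the directory, then create a file in it.
chmod g+s "$demo/project"
touch "$demo/project/data.txt"

# Record whether the bit is set and compare numeric group IDs
# (stat -c is GNU coreutils syntax, assumed here).
if [ -g "$demo/project" ]; then setgid_set=yes; else setgid_set=no; fi
dir_gid=$(stat -c %g "$demo/project")
file_gid=$(stat -c %g "$demo/project/data.txt")
echo "setgid bit set: $setgid_set"
echo "directory gid: $dir_gid, new file gid: $file_gid"

# Clean up the experiment.
rm -rf "$demo"
```

On Triton the interesting case is a directory whose group differs from your default group. In this local experiment the two usually coincide, so the output only shows the mechanics.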
My $WRKDIR is not visible on my department computer
How can I copy Triton files from outside of Aalto?
It is an extension of the previous question. Suppose you are outside of Aalto and have neither direct access to Triton nor access to NFS mounted directories on your department servers, and you want to copy your Triton files to your home workstation. This can be done by setting up an SSH tunnel through your department SSH server. There are a few steps: set up a tunnel from your client to your department server, then from your department server to Triton, and then run any rsync/sftp/ssh command you want from your client using that tunnel. The tunnel has to stay up during the whole session.
client: ssh -L9509:localhost:9509 department.ssh.server
department server: ssh -L9509:localhost:22 triton.aalto.fi
client: sftp -P 9509 localhost:/triton/own/dir/* /local/dir
Note that port 9509 is only an example; any other available port can be used. Alternatively, if you have a Linux or Mac OS X machine, you can set up a “proxy command”, so you don’t have to do the steps above manually every time. On your home machine/laptop, in the file ~/.ssh/config put the lines
Host triton
    ProxyCommand /usr/bin/ssh DEPARTMENTUSERNAME@department.ssh.server "/usr/bin/nc -w 10 triton.aalto.fi 22"
This creates a host alias “triton” that is proxied via the department server. So you can copy a file from your home machine/laptop to triton with a command like:
rsync filename triton:remote_filename
I can't save anything to my ``$HOME`` directory, get some fsync error.
Most probably your quota has been exceeded; check it with the
quota command. quota is a wrapper at
/usr/local/bin/quota on the front end which
merges output from the classic quota utility that supports NFS and Lustre’s
lfs quota. The NFS
$HOME directory is limited to 10GB for everyone
and is intended mainly for initialization files. The grace period is set to 7
days and the “hard” quota is set to 11GB, which means you may exceed your
10GB quota by up to 1GB and have 7 days to go below 10GB again. However, no one
can exceed the 11GB limit.
Note: Lustre mounted under
/triton is the right place for your
simulation files. It is fast and has large quotas.
Can you recover some files from my ``$HOME`` or ``$WRKDIR`` directory?
Short answer: yes for $HOME directory and no for $WRKDIR.
$WRKDIR (/triton) is fast Lustre, has a large quota, and is mounted
through InfiniBand. No backups are made of
/triton. The DDN
storage system as such is a secure and safe place for your data, but
you can always lose your data by deleting it by mistake. Every user
must take care of their own work files. We provide as much
disk space to every user as needed, and the amount of data is
growing rapidly. That is why users should manage their
important data themselves. Consider backing up your valuable data on
DVDs, USB drives or other resources outside of Triton.
Command line interface
Can I change zsh to bash?
Yes. Change the shell of your Aalto account and log in to Triton again for
the newly changed shell to take effect. To change your Aalto account,
log in to kosh.aalto.fi, run
kinit first and then run
type /bin/bash. To find out what is your current shell, run
For the record: your default shell is not set by Triton environment but by your Aalto account.
Why are all of the files on the Triton cluster shown in one color? How can I make them colorful, like green for executable files and blue for folders?
That is intentional, due to the high load on the Lustre filesystem. Although it is
a high performance filesystem, Lustre still has its own bottlenecks, and
among the common Lustre troublemakers are
ls -lr or
which generate lots of requests to the Lustre metadata servers; regular
use by all users may bring the whole system to a standstill. Please follow the
recommendations given at the last section at Data storage on the Lustre
When ssh:ing, I get some LC_ALL error all the time
This happens because your computer is sending the “locale”
information (language, number format, etc) to the other computer
(Triton), but Triton doesn’t know the one on your computer. You can
unset/adjust all the
variables, or in your
.ssh/config, try setting the following in
your Triton section (see SSH for info on how this
works, you need more than you see here):
env | grep LC_ and
env | grep LANG might give you hints
about exactly what environment variables are being sent from your
computer (and thus you should override in the ssh config file).
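The two grep commands can be combined into one helper that lists every locale-related variable in the current environment. This is a small sketch; the function name show_locale_vars is made up for this example.

```shell
# Hypothetical helper: print the locale-related variables that are set
# in the current environment, one per line.
show_locale_vars() {
    env | grep -E '^(LANG|LANGUAGE|LC_[A-Z_]+)=' | sort
}

show_locale_vars
```

Run it on your own computer before connecting, since that is where the values originate.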
Modules and environment settings
Job fails due to missing module environment variables.
You have included ‘module load module/name’ but the job still fails due to
missing shared libraries, or because it cannot find some binary, etc. That
is a known ZSH related issue. In your sbatch script, please run bash as a
login shell (--login), which forces bash to read all the
initialization files at /etc/profile.
Alternatively, you can change your shell from ZSH to BASH to avoid these hacks; see the question above.
Can I use a more up-to-date version of git on triton?
Indeed the default git with Triton OS system (CentOS) is quite old (v 1.8.x).
To get a more modern git you can run
module load git (version 2.28.0 when this is being written).
Coding and Compiling
libcuda.so.1: cannot open shared object file: No such file or directory
You are trying to run a GPU program (using CUDA) on a node without a
GPU (and thus no
libcuda.so.1). Remember to specify that
you need GPUs.
What is a good scaling factor for parallel applications? What is the recommended number of processors for parallel jobs?
A few recommendations about the number of CPUs:
benchmark your application on different numbers of CPU cores: 1, 2, 12, 24, 36, and larger. Check with the developers: your application may have ready-made scalability benchmarks and recommendations for the choice of compiler and MPI libraries.
benchmark on shared memory, i.e. up to 12 CPU cores within one node, and then on different nodes (distributed memory): involving the interconnect may make a huge difference
if you are not sure about the program’s scalability and you have no time for testing, don’t run on more than 12 CPU cores within one node
be considerate! It is not you against others! Do not try to fill up the cluster just for being cool
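To compare benchmark runs, speedup and parallel efficiency are the usual figures: speedup is the baseline time divided by the measured time, and efficiency is the speedup divided by the increase in core count. The sketch below computes both from a list of timings. The function name scaling_report and the timings are invented for illustration.

```shell
# Hypothetical helper: read "cores seconds" pairs on stdin, one run per
# line, and report speedup and efficiency relative to the first line.
scaling_report() {
    awk 'NR == 1 { base_cores = $1; base_time = $2 }
         {
             speedup = base_time / $2
             efficiency = speedup * base_cores / $1
             printf "%d cores: speedup %.2f, efficiency %.0f%%\n", $1, speedup, efficiency * 100
         }'
}

# Made-up wall-clock times for runs on 1, 12 and 24 cores.
printf '1 1200\n12 120\n24 75\n' | scaling_report
```

With these invented numbers the efficiency drops from 83% on 12 cores to 67% on 24 cores, which is the kind of signal that tells you where adding cores stops paying off.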
The cluster has a few compiler sets. Which one am I supposed to use? What are the limits for commercial compilers?
Currently there are two different sets of compilers: (i) GNU compilers, native for Linux, installed by default, (ii) Intel compilers plus MKL, a commercial suite, often the fastest compiler on Xeons.
FGI provides all FGI sites with 7 Intel licenses, thus only 7 users can compile/link with Intel at once.
``version GLIBC_2.29 not found`` (or ``GLIBCXX_3.4.26``, or ``LIBCSTDCXX_version``) when running some program.
Background: Compiled code has dynamic libraries. When a program
runs, it needs to load that code. The code embeds the name of the
library, such as libc.so.6, and then when it runs, it uses built-in
paths (/etc/ld.so.conf) and the LD_LIBRARY_PATH
variable to find it. It takes the first thing it finds and loads it.
All of these cases sit on the fine line between the operating system, software we have installed, and software you have installed. Have a very low threshold to ask for help by coming to our daily garage with your problem. We might have a much easier solution, much faster than you can figure it out.
Problem 1: Library not found: In this case, something expects a certain library, but it can’t be found. Possible solutions could include:
Loading a module that provides the library (did you have a module loaded when you compiled the code? Are you installing a Python/R extension that needs a library from outside?)
Setting the LD_LIBRARY_PATH variable to point to the library. If you have self-compiled things this might be appropriate, but it might also be a sign that something else is wrong.
Problem 2: library version not found (such as
version GLIBC_2.29 not found): This usually means that it’s finding a library and loading
it, but the version is too old. This especially happens on
clusters, where the operating system can’t change that often.
If it’s about
GLIBCXX_version, then you can
module load gcc of a proper version, or if you are in a conda environment, install the
gcc package to bring it in.
If it’s about
GLIBC, then it’s about the base C library
libc, and that is very hard to solve, since this is intrinsically connected to the operating system. Likely, the program was compiled on an operating system too new for the cluster, and you should think about re-compiling it on the cluster or putting it in a container.
Setting LD_LIBRARY_PATH might help to direct it to a proper version. Again, this probably indicates some other problem.
Problem 3: you think you have the newer library loaded by a
module or something, but it’s still giving a version error: This
has sometimes happened with programs that use extensions. The base
program uses an older version of the library, but an extension
needs a newer version. Since the base program has already loaded
an older version, even specifying the new version via
LD_LIBRARY_PATH doesn’t help much.
Solution: this is tricky, since the program should be using the newer version if it’s on
LD_LIBRARY_PATH already. Maybe it’s hard-coded to use a particular older version? In this case, since it’s hard-coded to an old version, maybe you need a newer version of the base program itself (an example of this was an R extension that expected a newer
GLIBCXX_version: the answer was to build Triton’s R module with a newer
gcc compiler version). If you get this case, you should be asking us to take a look.
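For the version-not-found cases, it can help to compare the highest version a program asks for with the highest version a library provides. The sketch below extracts that number from any text containing GLIBC_x.y or GLIBCXX_x.y.z strings. The function name max_symbol_version is hypothetical, and the paths in the usage comments are examples that differ between systems.

```shell
# Hypothetical helper: read text on stdin and print the highest version
# found for the given symbol prefix (GLIBC or GLIBCXX).
max_symbol_version() {
    grep -oE "${1}_[0-9]+(\.[0-9]+)*" | sed "s/^${1}_//" |
        awk -F. '{ key = $1 * 1000000 + $2 * 1000 + $3 }
                 key > best { best = key; version = $0 }
                 END { if (version != "") print version }'
}

# Typical use (example paths, adjust to your system):
#   objdump -T ./myprogram | max_symbol_version GLIBC      # what the program needs
#   strings /lib64/libc.so.6 | max_symbol_version GLIBC    # what the system offers
printf 'GLIBC_2.17\nGLIBC_2.29 GLIBCXX_3.4.26\nGLIBC_2.4\n' | max_symbol_version GLIBC
```

If the version the program needs is higher than the one the library offers, you are in the Problem 2 situation described above.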
While compiling should I use static or shared version of some library?
One can use both, though for shared libs all your linked libs must be
either in your
/shared/apps or must be installed by
default on all the compute nodes, like the vast majority of GCC and other
default Linux libs.
I've got a binary file, may I find out somehow whether it is 32-bit or 64-bit compiled?
# file /usr/bin/gcc
/usr/bin/gcc: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV),
for GNU/Linux 2.4.0, dynamically linked (uses shared libs), not stripped
it displays the type of an executable or object file.
How can I print my text file to a local department printer?
We don’t have local department printers configured anywhere on Triton. But one can use SSH magic to send a file or command output to a remote printer. Run from your local workstation, insert the target printer name:
... printing a text file
$ ssh username@triton.aalto.fi "cat file.txt" | enscript -P printer_name
... printing a PostScript file
$ ssh username@triton.aalto.fi "cat file.ps" | lp -d printer_name -
... printing a man page
$ ssh username@triton.aalto.fi "man -t sbatch" | lp -d printer_name -
How do I subscribe to the triton-users mailing list?
Having a user account on Triton also means being on the triton-users at aalto.fi mailing list. That is where the support team sends all the Triton related announcements. All Triton users MUST be subscribed to the list. It is automatically kept up to date these days, but just in case you are not yet on it, please send an email to your local team member and ask them to add your email.
How to unsubscribe? You will be removed from the mailing list as soon as your Triton account is deleted from the system. Otherwise it is not possible, since we would not be able to notify you about urgent things that affect data integrity or other issues.
What do node names like cn[01-224] mean?
All the hardware delivered by the vendor has been labeled with some short name. In particular, every single compute node has a label like Cn01 or GPU001 etc. We used this notation to name compute nodes, that is, cn01 is just a hostname for Cn01, gpu001 is a hostname for GPU001, etc. Shorthands like cn[01-224] mean all the hostnames in the range cn01, cn02, cn03 .. cn224. The same goes for gpu[001-008], tb[003-008], fn[01-02]. Similar notation can be used with SLURM commands like:
$ scontrol show node cn[01-12]
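To see what such a shorthand stands for, the range can be expanded by hand. The sketch below is a plain shell illustration with a hypothetical function name, expand_nodes. It does not parse the bracket syntax itself, so the prefix, the range and the digit count are passed separately.

```shell
# Hypothetical helper: print every hostname in a range.
# Arguments: prefix, first number, last number, minimum digit count
# (numbers are given without leading zeros).
expand_nodes() {
    i=$2
    while [ "$i" -le "$3" ]; do
        printf "%s%0${4}d\n" "$1" "$i"
        i=$((i + 1))
    done
}

expand_nodes cn 1 3 2   # prints cn01, cn02 and cn03, one per line
```

On a system with Slurm installed, scontrol show hostnames 'cn[01-12]' performs the same expansion directly from the bracket notation.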
Can't run graphical applications on nodes and "Warning: untrusted X11 forwarding setup failed: xauth key data not generated"
.bashrc and other startup files. Some modules bring
in so many dependencies that it can interfere with standard
operating system functions: in this case, SSH setting up X11
forwarding for graphical applications.