Triton is a large system that combines many individual computers. Hundreds of people use it at the same time, so there are no machines just sitting around for you to run on directly. Resources are shared among everyone: you apply for them through the queuing system, Slurm. As you will see, this is very fast and lets you get basically whatever you need.
This page discusses what is necessary to use Triton interactively: no scripts, no overhead, you "just add srun!". For the small jobs you would use this for, you will almost always get your resources right away. Still, you have to request the resources you need (time/cores/memory), and if you don't do this carefully it is inefficient, because you end up requesting more than you need. You should start here, but once you need more, move on to more advanced usage.
Advantages of interactive running: it's good for getting started quickly and scaling up ("just add srun!"), and it's good when the task is so small that scripting isn't worth it.
Downsides include: you have to be there and wait for things to run, and if your shell connection gets interrupted, you lose the process. If you don't stop interactive shells, they keep running, and this counts against your fairshare quota, making your jobs run slower in the future.
The simplest way is to use srun. Let’s say you run some program like this:
python3 -c 'import os; print("hi from", os.uname().nodename)'
To run it on a compute node, you prefix it with srun. All input/output still goes to your terminal (but note that X forwarding for graphical applications doesn't work - see below for that).
srun --mem=50G --time=5:00:00 python3 -c 'import os; print("hi from", os.uname().nodename)'
This has some possible problems: the job is tied to your shell, so if your shell quits, the process is lost. It also runs only one single command: if you need to do several things in a row, each one has to wait in the queue before it starts (see the sketch below). Note: srun takes a command to run directly, unlike sbatch, which takes a batch script (though of course you could run a shell script with srun). srun does not look at the #SBATCH options inside of scripts.
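For example, each of these steps waits in the queue separately (the script names here are just placeholders for illustration):

srun --mem=2G --time=0:15:00 python3 preprocess.py   # waits for resources, then runs
srun --mem=2G --time=0:15:00 python3 analyze.py      # goes back to the queue and waits again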
How do you find the right time/CPU/memory requirements? Slurm (the queuing system) has extensive reporting. For example, slurm history will show you the actual run time and actual memory used by your job. You generally make a guess and adjust based on what you see. There is a little bit about this below, and more in the next tutorial.
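For example, after a job finishes you might check it like this (the jobid is a placeholder; sacct is the standard Slurm accounting command that this kind of reporting is based on):

slurm history                                    # recent jobs: actual run time and memory
sacct -j 12345678 --format=JobID,Elapsed,MaxRSS  # the same information from Slurm directly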
So, let's say you need to do something a bit fancier: what if you want an actual shell on a compute node to do things interactively? You just need the extra --pty option and to run bash:

srun -p interactive --time=HH:MM:SS --mem=nnG --pty bash

Here, -p interactive says "put me in the partition dedicated to interactive usage" (more on this below).
Now you have a shell… do whatever you need to do, and close the shell when you are done! If you don't, the job keeps running until your time limit, and all of that time is counted against your usage. It doesn't cost money, but it does mean your priority will go down in the future.
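A complete session might look like this (the resource values are only examples, adjust them to your needs):

srun -p interactive --time=1:00:00 --mem=2G --pty bash
hostname    # confirms you are on a compute node, not the login node
# ... do your interactive work ...
exit        # closes the shell, ending the job so it stops counting against you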
Interactive shell with graphics
sinteractive is very similar to srun, but it is more clever and thus allows X forwarding. In the background, it starts the job, starts a screen session on the node, then sshes there and connects to the screen session. You can also ssh to that node yourself later and reconnect to the same session.
sinteractive --time=HH:MM:SS --mem=nnG
Just like with srun --pty, remember to close the session when done. Here it takes a bit more care than before: since a separate screen session is running on the node, just closing your terminal isn't enough. Exit all shells in the screen session on the node (C-d or exit), or cancel the job (see below).
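For example, a sketch of reconnecting to or cleaning up a sinteractive session (the node name and jobid are placeholders you would read from slurm q output):

slurm q            # find the node and jobid of your running job
ssh node123        # ssh to that node (placeholder name)
screen -rd         # reattach to the screen session; exit all its shells to end it
scancel 12345678   # or simply cancel the whole job from the login node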
Partitions: this used to be important to set, but is now mostly automatic. Partitions trade off different resource limits against how quickly resources become available. For example, -p interactive tells Slurm to use the interactive partition, which should always be available for quick tests. -p debug is a short partition for debugging. See a bit more in the serial tutorial.
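For example, a quick sanity test in the debug partition (the time limit is just an example):

srun -p debug --time=0:05:00 hostname   # should start almost immediately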
time/CPU/memory requirements: the commands srun/sinteractive have many more options that let you specify resources. The most important for interactive running are probably --time, --mem, and --cpus-per-task.
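A sketch combining these options (the values are placeholders, not recommendations):

srun --time=0:30:00 --mem=4G --cpus-per-task=2 --pty bash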
How much time/memory/CPU should you request? The less you request, the sooner you are likely to start running. Ask for all you need, but not ridiculously large amounts. If you request something slightly less than a node size (note that we have different node sizes) or a partition limit, you are more likely to fit into a spare spot. We have many nodes with 12 cores, and some with 20 or 24: if you request 24 cores, you have very limited options, while with 10 or 18 you have a lot more. The same goes for memory: the most common cutoffs are 48, 64, 128, and 256GB. Use smaller values when interactive testing, then more for batch runs overnight.
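If you want to see what node sizes actually exist, you can ask Slurm itself using standard sinfo output formatting:

sinfo -o "%20P %5c %10m"   # partition, CPUs per node, and memory (MB) per node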
Configure your program well: requesting more CPUs doesn't magically make your program parallel. Make sure parallelism is actually enabled in your code, and tell it how many CPUs to use (matching the number you request from Slurm). If you don't get an entire node, your program might still try to use all of the node's CPUs, and the operating system will limit it to your share (with cgroups, if you are interested). This leads to inefficiency.
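One common pattern, assuming your program uses an OpenMP-style thread pool (my_program is a placeholder), is to match the thread count to your Slurm request:

srun --time=0:30:00 --mem=4G --cpus-per-task=4 \
    bash -c 'export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK; ./my_program'
# SLURM_CPUS_PER_TASK is set inside the job, so the single quotes matter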
Monitoring your usage
When you start running in the queue, you need to be able to see how much time, memory, etc. is actually being used. Without this, you won't know how much time and memory to request. You always want to request the least possible, so that your jobs can start faster. The next tutorial (about batch jobs) goes into this in more detail, but you probably want to check things like slurm history even if you aren't running batch jobs, to see how many resources you are actually using.
slurm q will show your currently queued and running jobs (a good way to make sure you have stopped everything).
slurm history will tell you about recent jobs, including how much total memory they used and their total CPU time.
scancel will cancel a job by its jobid (useful if something keeps running after you don't need it anymore).
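Putting these together, a typical monitoring sequence might look like this (the jobid is a placeholder):

slurm q           # list your queued and running jobs
slurm history     # check time and memory use of recent jobs
scancel 12345678  # cancel a job you no longer need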
Check out some of these commands:
- slurm job $jobid on some running job - does anything look interesting?
- srun -p debug hostname - what do you see and why? Then run slurm history, find that job's jobid, and run sacct -j $jobid.
- scontrol show node wsm1 - what is this?