Monitoring job progress and job efficiency¶
When running jobs, one usually wants to do monitoring at various stages:
Firstly, when a job is submitted, one wants to monitor its position in the queue and the expected start time of the job.
Secondly, when the job is running, one wants to monitor the job's state and how the simulation is performing.
Thirdly, once the job has finished, one wants to monitor the job's performance and resource usage.
There are various tools available for each of these steps.
Monitoring job queue state after it has been submitted¶
slurm queue (or
squeue -u $USER) can be used
to monitor the status of your jobs in the queue. An example output is given below:
$ slurm q
JOBID    PARTITION NAME         TIME START_TIME       STATE   NODELIST(REASON)
60984785 interacti _interactive 0:29 2021-06-06T20:41 RUNNING pe6
60984796 batch-csl hostname     0:00 N/A              PENDING (Priority)
Here the output fields are as follows:
JOBID shows the ID number that Slurm has assigned for your job.
PARTITION shows the partition(s) that the job has been assigned to.
NAME shows the name of the submission script / job step / command.
TIME shows the amount of time the job has run so far.
START_TIME shows the start time of the job. If the job isn't running yet, Slurm will try to estimate when it will run.
STATE shows the state of the job. Usually it is RUNNING or PENDING.
NODELIST(REASON) shows the names of the nodes where the program is running. If the job isn't running, Slurm tries to give a reason why the job is not running.
When submitting a job, one often wants to see whether it starts successfully.
This can be made easier by running
slurm w q /
slurm watch queue
(or watch -n 15 squeue -u $USER).
This opens a watcher that prints the output of
slurm queue every 15
seconds. The watcher can be closed with
<CTRL> + C. Do remember to
close the watcher when you're not watching the output interactively.
To see all of the information that Slurm sees, one can use the command
scontrol show -d jobid <jobid>.
slurm queue is a wrapper built around the
squeue command. One can also
use squeue directly to get more information on the job's status. See
squeue's documentation for more information.
There are other subcommands of
slurm that you can use to monitor the
cluster status, job history etc. A list of examples is given below:
slurm q / slurm qq        Status of your queued jobs (long/short)
slurm partitions          Overview of partitions (A/I/O/T=active,idle,other,total)
slurm cpus <partition>    List free CPUs in a partition
slurm history             Show status of recent jobs
seff <jobid>              Show percent of mem/CPU used in job
slurm j <jobid>           Job details (only while running)
slurm s / slurm ss        Show status of all jobs
sacct                     Full history information (advanced, needs args)
Full slurm command help:
$ slurm
Show or watch job queue:
 slurm [watch] queue      show own jobs
 slurm [watch] q          show user's jobs
 slurm [watch] quick      show quick overview of own jobs
 slurm [watch] shorter    sort and compact entire queue by job size
 slurm [watch] short      sort and compact entire queue by priority
 slurm [watch] full       show everything
 slurm [w] [q|qq|ss|s|f]  shorthands for above!
 slurm qos                show job service classes
 slurm top [queue|all]    show summary of active users
Show detailed information about jobs:
 slurm prio [all|short]   show priority components
 slurm j|job              show everything else
 slurm steps              show memory usage of running srun job steps
Show usage and fair-share values from accounting database:
 slurm h|history          show jobs finished since, e.g. "1day" (default)
 slurm shares
Show nodes and resources in the cluster:
 slurm p|partitions       all partitions
 slurm n|nodes            all cluster nodes
 slurm c|cpus             total cpu cores in use
 slurm cpus               cores available to partition, allocated and free
 slurm cpus jobs          cores/memory reserved by running jobs
 slurm cpus queue         cores/memory required by pending jobs
 slurm features           List features and GRES
Examples:
 slurm q
 slurm watch shorter
 slurm cpus batch
 slurm history 3hours
Other advanced commands (many require lots of parameters to be useful):
squeue                    Full info on queues
sinfo                     Advanced info on partitions
sinfo -N                  List all nodes
Monitoring job while it is running¶
As the most common way of using HPC resources is to run non-interactive jobs, it is usually a good idea to make certain that the program being run produces output that can be used to monitor the job's progress.
A typical way of monitoring progress is to add print statements that write to standard output. This output is then redirected to the Slurm output file, where the user can read it.
It is important to differentiate between different types of output:
Monitoring output usually consists of print statements that describe what the program is doing (e.g. "Loading data", "Running iteration 31"), what the state of the simulation is (e.g. "Total energy is 4.232 MeV", "Loss is 0.432"), and timing information (e.g. "Iteration 31 took 182s"). This output can be used to see whether the program works, whether the simulation converges, and how long different calculations take.
Debugging output is similar to monitoring output, but it is usually more verbose and records the internal state of the program (e.g. values of variables). This is typically needed during the development stage of a program, but once the program works and longer simulations are run, printing debugging output is not recommended.
Checkpoint output can be used to resume the simulation from its current state after unexpected situations such as bugs, network problems or hardware failures. Checkpoints should be written in a binary format, as this keeps the accuracy of the floating-point numbers intact. In big simulations checkpoints can be large, so they should not be taken too frequently. In iterative processes, e.g. Markov chains, taking a checkpoint can be very quick and can be done more often. In smaller applications it is usually good to take a checkpoint when the program starts a different phase of the simulation (e.g. plotting after simulation). This minimizes the simulation time lost to programming bugs.
Simulation output is what the program writes when the simulation is done. When doing long simulations, it is important to consider which quantities you want to save. Include all parameters that might be needed, so that the simulations do not have to be run again. With time-series output this is even more important, as e.g. averages and statistical moments cannot necessarily be recalculated after the simulation has ended. It is usually a good idea to save a checkpoint at the end as well.
When creating monitoring output, it is usually best to write it in a human-readable format and with human-readable quantities. This makes it easy to see the state of the program.
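The monitoring and checkpoint styles described above can be sketched in Python. This is a minimal illustration, not code from the course material: the toy workload, the checkpoint file name and the checkpoint interval are all assumptions.

```python
# Minimal sketch of monitoring and checkpoint output for a toy iterative
# workload; "checkpoint.pkl" and the 100-iteration interval are illustrative.
import pickle
import time

def run_simulation(n_iters, checkpoint_file="checkpoint.pkl"):
    # Resume from an earlier checkpoint if one exists.
    try:
        with open(checkpoint_file, "rb") as f:
            state = pickle.load(f)
        print(f"Resuming from iteration {state['iteration']}", flush=True)
    except FileNotFoundError:
        state = {"iteration": 0, "total": 0.0}

    for i in range(state["iteration"], n_iters):
        start = time.perf_counter()
        state["total"] += i * i          # stand-in for the real computation
        state["iteration"] = i + 1

        # Monitoring output: human readable and flushed, so it appears in
        # the Slurm output file immediately instead of sitting in a buffer.
        print(f"Iteration {i} took {time.perf_counter() - start:.4f}s",
              flush=True)

        # Checkpoint output: binary (keeps floating-point accuracy), written
        # only every 100 iterations so checkpointing stays cheap.
        if state["iteration"] % 100 == 0:
            with open(checkpoint_file, "wb") as f:
                pickle.dump(state, f)

    return state

result = run_simulation(10)
```

The flush=True matters here: when standard output is redirected to a file, Python buffers it, and without flushing the monitoring lines may appear in the Slurm output file only long after they were printed.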
Checking job history after it has finished¶
slurm history can be used to check the history
of your jobs. Example output is given below:
$ slurm h
JobID     JobName      Start          ReqMem MaxRSS TotalCPUTime WallTime Tasks CPU Ns Exit State Nodes
60984785  _interactive 06-06 20:41:31 500Mc  -      00:01.739    00:07:36 none  1   1  0:0  CANC  pe6
 └─ batch *            06-06 20:41:31 500Mc  6M     00:01.737    00:07:36 1     1   1  0:0  COMP  pe6
 └─ extern *           06-06 20:41:31 500Mc  1M     00:00.001    00:07:36 1     1   1  0:0  COMP  pe6
60984796  hostname     06-06 20:49:36 500Mc  -      00:00.016    00:00:00 none  10  10 0:0  CANC  csl[3-6,9,14,17-18,20,23]
 └─ extern *           06-06 20:49:36 500Mc  1M     00:00.016    00:00:01 10    10  10 0:0  COMP  csl[3-6,9,14,17-18,20,23]
Here the output fields are as follows:
JobID shows the ID number that Slurm has assigned for your job.
JobName shows the name of the submission script / job step / command.
Start shows the start time of the job.
ReqMem shows the amount of memory requested by the job. The format is an amount in megabytes or gigabytes followed by c for memory per core or n for memory per node.
MaxRSS shows the maximum memory usage of the job as measured by Slurm. This is sampled at set intervals.
TotalCPUTime shows the total CPU time used by the job, i.e. the amount of time the CPUs were at full utilization. For single-CPU jobs, this should be close to the WallTime. For jobs that use multiple CPUs, it should be close to the number of CPUs reserved times the WallTime.
WallTime shows the runtime of the job.
Tasks shows the number of MPI tasks reserved for the job.
CPU shows the number of CPUs reserved for the job.
Ns shows the number of nodes reserved for the job.
Exit State shows the exit code of the command. A successful run of the program should return 0 as the exit code.
Nodes shows the names of the nodes where the program ran.
The slurm history command is a wrapper built around the
sacct command. One
can also use sacct directly to get more information on the job's status. See
sacct's documentation for more information.
For example, the command
sacct --format=jobid,elapsed,ncpus,ntasks,state,MaxRss --jobs=<jobid>
will show the information indicated by the
--format option (job id,
elapsed time, number of reserved CPUs, etc.). You can specify any field of
interest to be shown using --format.
Monitoring job’s CPU and RAM usage efficiency after it has finished¶
You can use
seff <jobid> to see what percent of available CPUs and RAM was
utilized. Example output is given below:
$ seff 60985042
Job ID: 60985042
Cluster: triton
User/Group: tuomiss1/tuomiss1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:29
CPU Efficiency: 90.62% of 00:00:32 core-walltime
Job Wall-clock time: 00:00:16
Memory Utilized: 1.59 MB
Memory Efficiency: 0.08% of 2.00 GB
If your CPU usage is far below 100%, your code may not be using the reserved CPUs efficiently. If your memory usage is far below 100% or above 100%, you might have a problem with your RAM requirements: set the RAM request a bit above the RAM you actually utilized.
You can also monitor individual job steps by calling
seff with the syntax
seff <jobid>.<job step>.
When making job reservations it is important to distinguish
between requirements for the whole job (such as --mem) and
requirements for each individual task/CPU (such as --mem-per-cpu).
For example, requesting --ntasks=2 --cpus-per-task=4 --mem-per-cpu=2G
will create a total memory reservation of
(2 tasks)*(4 cpus / task)*(2GB / cpu)=16GB.
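As a sketch, a hypothetical batch script with those per-cpu options would look like the following (the program name is a placeholder; the options themselves are standard Slurm):

```shell
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G     # per-cpu request: 2 * 4 * 2G = 16G in total
# The equivalent whole-job form would be a single: #SBATCH --mem=16G

srun ./my_program            # placeholder for the actual program
```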
Monitoring job’s GPU utilization¶
When running a GPU job, you should check that the GPU is being fully utilized.
When your job has started, you can
ssh to the node and run
nvidia-smi. You can find your process in the output and inspect the
GPU-Util column. It should be close to 100%.
Once the job has finished, you can use
slurm history to obtain the
jobID and run:
$ sacct -j <jobID> -o comment -p
This also shows the GPU utilization.
There are several factors to consider for efficient use of GPUs. For instance, is your code itself efficient enough? Are you using the framework's pipelines in the intended fashion? Is the GPU used for only a small portion of the entire task? Amdahl's law of parallelization speedup is relevant here.
If the GPU utilization of your job is low, you should check whether
its CPU utilization is close to 100% with
seff <jobid>. High CPU utilization can
indicate that the CPUs are trying to keep the GPU occupied with calculations,
but the lack of CPU performance creates a bottleneck for the GPU.
Please keep in mind that when using a GPU, you also need to request enough CPUs to supply data to the process. So you can increase the number of CPUs you request so that enough data is provided for the GPU. However, you shouldn't request too many: there wouldn't be enough CPUs left for everyone to use the GPUs, and they would go to waste (all of our nodes have 4-6 CPUs for each GPU).
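A hypothetical GPU job script following this advice might look like the sketch below. The script name is a placeholder, and the exact GRES name, partition and time limit depend on your cluster's configuration:

```shell
#!/bin/bash
#SBATCH --gres=gpu:1         # one GPU
#SBATCH --cpus-per-task=4    # enough CPUs to feed data to the GPU
#SBATCH --time=01:00:00      # assumed time limit, adjust as needed

srun python train.py         # placeholder for the actual GPU program
```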
The scripts you need for the following exercises can be found in this git
repository.
You can clone the repository by running
git clone https://github.com/AaltoSciComp/hpc-examples.git. This repository
will be used for most of the tutorial exercises.
In slurm/pi.py there is a pi-estimation algorithm that uses the Monte Carlo method to estimate the value of pi. You can run the script with
python pi.py <n>, where
<n> is the number of iterations to be done by the algorithm.
Create a Slurm script that runs the algorithm with 100000000 (\(10^8\)) iterations. Submit it to the queue and use
seff to monitor the job's performance.
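A minimal submission script for this exercise might look like the following sketch. The time and memory values are assumptions you should adapt, and the path assumes you run from the directory containing the cloned hpc-examples repository:

```shell
#!/bin/bash
#SBATCH --time=00:10:00      # assumed to be plenty for 10^8 iterations
#SBATCH --mem=500M           # assumed memory request
#SBATCH --output=pi.out

srun python hpc-examples/slurm/pi.py 100000000
```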
Add multiple job steps (separate
srun lines), each of which runs the algorithm
pi.py with an increasing number of iterations (from 100 up to 10000000 (\(10^7\))). How does this appear in
slurm history? Use
seff to check the performance of individual job steps. Can you explain why the CPU utilization numbers change between steps?
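One possible shape for the multi-step script is sketched below; the time and memory values are assumptions, and the path assumes the cloned hpc-examples repository:

```shell
#!/bin/bash
#SBATCH --time=00:15:00      # assumed time limit
#SBATCH --mem=500M           # assumed memory request

# Each srun line becomes its own job step, which slurm history and
# seff <jobid>.<job step> report separately.
srun python hpc-examples/slurm/pi.py 100
srun python hpc-examples/slurm/pi.py 10000
srun python hpc-examples/slurm/pi.py 1000000
srun python hpc-examples/slurm/pi.py 10000000
```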
pi.py has been written so that it can be run using multiple threads. Run the script with multiple threads and \(10^8\) iterations with:
srun --cpus-per-task=2 python pi.py --threads=2 100000000
After you have run the script, do the following:
Use slurm history to check the TotalCPUTime and
WallTime. Compare them to the timings for the single-CPU run with \(10^8\) iterations.
Use seff to check the CPU performance of the job.