Monitoring job progress and job efficiency

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

Abstract

  • You must always monitor jobs to make sure they are using all the resources you request.

  • Test scaling: double the resources; if the job doesn’t run almost twice as fast, it’s not worth it.

  • seff JOBID shows the efficiency and performance of a single job

  • slurm queue shows waiting and running jobs (this is a custom command)

  • slurm history shows completed jobs (also custom command)

  • GPU efficiency: A job’s comment field shows GPU performance info (custom setup at Aalto); sacct -j JOBID -o comment -p shows this.

Introduction

When running jobs, one usually wants to monitor them at several stages:

  • Firstly, when a job is submitted, one wants to monitor its position in the queue and its expected start time.

  • Secondly, when the job is running, one wants to monitor the job’s state and how the simulation is performing.

  • Thirdly, once the job has finished, one wants to monitor the job’s performance and resource usage.

There are various tools available for each of these steps.

See also

Please ensure you have read Interactive jobs and Serial Jobs before you proceed with this tutorial.

Monitoring during queueing

The command slurm q/slurm queue (or squeue -u $USER) can be used to monitor the status of your jobs in the queue. An example output is given below:

$ slurm q
JOBID              PARTITION NAME                  TIME       START_TIME    STATE NODELIST(REASON)
60984785           interacti _interactive          0:29 2021-06-06T20:41  RUNNING pe6
60984796           batch-csl hostname              0:00              N/A  PENDING (Priority)

The output columns are as follows:

  • JOBID shows the id number that Slurm has assigned for your job.

  • PARTITION shows the partition(s) that the job has been assigned to.

  • NAME shows the name of the submission script / job step / command.

  • TIME shows the amount of time the job has run so far.

  • START_TIME shows the start time of the job. If the job isn’t currently running, Slurm will try to estimate when it will start.

  • STATE shows the state of the job. Usually it is RUNNING or PENDING.

  • NODELIST(REASON) shows the names of the nodes where the job is running. If the job isn’t running, Slurm gives a reason why it is not running.

When submitting a job one often wants to see whether it starts successfully. This can be made easier by running slurm w q/slurm watch queue (or watch -n 15 squeue -u $USER). This opens a watcher that prints the output of slurm queue every 15 seconds. The watcher can be closed with <CTRL> + C. Do remember to close the watcher when you’re not watching the output interactively.
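
For example, either of these equivalent commands opens such a watcher:

$ slurm watch queue
$ watch -n 15 squeue -u $USER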

To see all of the information that Slurm sees, one can use the command scontrol show -d jobid JOBID.

The slurm queue command is a wrapper around the squeue command. One can also use squeue directly to get more information on the job’s status. See squeue’s documentation for more information.
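
For example, a direct squeue call with a custom output format (the format string below is only an illustration) shows roughly the same fields as slurm queue:

$ squeue -u $USER --format="%.10i %.9P %.20j %.8T %.10M %.20S %R"

Here %T is the job state, %M the time used so far and %S the expected start time.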

There are other slurm subcommands that you can use to monitor the cluster status, job history, etc. A list of examples is given below:

Slurm status info reference

Command                           Description
slurm q ; slurm qq                Status of your queued jobs (long/short)
slurm partitions                  Overview of partitions (A/I/O/T = active/idle/other/total)
slurm cpus PARTITION              List free CPUs in a partition
slurm history [1day,2hour,…]      Show status of recent jobs
seff JOBID                        Show percent of mem/CPU used in a job. See Monitoring.
sacct -o comment -p -j JOBID      Show GPU efficiency
slurm j JOBID                     Job details (only while running)
slurm s ; slurm ss PARTITION      Show status of all jobs
sacct                             Full history information (advanced, needs args)

Full slurm command help:

$ slurm

Show or watch job queue:
 slurm [watch] queue     show own jobs
 slurm [watch] q   show user's jobs
 slurm [watch] quick     show quick overview of own jobs
 slurm [watch] shorter   sort and compact entire queue by job size
 slurm [watch] short     sort and compact entire queue by priority
 slurm [watch] full      show everything
 slurm [w] [q|qq|ss|s|f] shorthands for above!
 slurm qos               show job service classes
 slurm top [queue|all]   show summary of active users
Show detailed information about jobs:
 slurm prio [all|short]  show priority components
 slurm j|job      show everything else
 slurm steps      show memory usage of running srun job steps
Show usage and fair-share values from accounting database:
 slurm h|history   show jobs finished since, e.g. "1day" (default)
 slurm shares
Show nodes and resources in the cluster:
 slurm p|partitions      all partitions
 slurm n|nodes           all cluster nodes
 slurm c|cpus            total cpu cores in use
 slurm cpus   cores available to partition, allocated and free
 slurm cpus jobs         cores/memory reserved by running jobs
 slurm cpus queue        cores/memory required by pending jobs
 slurm features          List features and GRES

Examples:
 slurm q
 slurm watch shorter
 slurm cpus batch
 slurm history 3hours

Other advanced commands (many require lots of parameters to be useful):

Command          Description
squeue           Full info on queues
sinfo            Advanced info on partitions
slurm nodes      List all nodes

Monitoring a job while it is running

As the most common way of using HPC resources is to run non-interactive jobs, it is usually a good idea to make certain that the program being run produces output that can be used to monitor the job’s progress.

The typical way of monitoring progress is to add print statements that write to the standard output. This output is then redirected to the Slurm output file (-o FILE, by default slurm-JOBID.out), where it can be read by the user. The file is updated while the job is running, but with some delay (every few KB written) because of buffering.
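
As a minimal sketch (using the pi.py example from hpc-examples; the time, memory and output-file name are only illustrative), a job script that produces timestamped monitoring output and avoids Python’s output buffering could look like this:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=500M
#SBATCH --output=pi_monitor.out

# Timestamped progress messages end up in the Slurm output file
# (pi_monitor.out above; the default name would be slurm-JOBID.out).
echo "$(date +%T) starting calculation"

# 'python -u' disables output buffering, so print statements show up in the
# output file as soon as they are executed (PYTHONUNBUFFERED=1 has the same effect).
srun python -u pi.py 1000000

echo "$(date +%T) calculation finished"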

It is important to differentiate between different types of output:

  • Monitoring output usually consists of print statements that describe what the program is doing (e.g. “Loading data”, “Running iteration 31”), report the state of the simulation (e.g. “Total energy is 4.232 MeV”, “Loss is 0.432”) and give timing information (e.g. “Iteration 31 took 182s”). This output can then be used to see whether the program works, whether the simulation converges and how long different calculations take.

  • Debugging output is similar to monitoring output, but it is usually more verbose and writes out the internal state of the program (e.g. values of variables). This is usually required during the development stage of a program, but once the program works and longer simulations are needed, printing debugging output is not recommended.

  • Checkpoint output can be used to resume the simulation from its current state after unexpected situations such as bugs, network problems or hardware failures. Checkpoints should be written in a binary format, as this keeps the accuracy of the floating-point numbers intact. In big simulations checkpoints can be large, so they should not be taken too frequently. In iterative processes, e.g. Markov chains, taking a checkpoint can be very quick and can be done more often. In smaller applications it is usually good to take a checkpoint when the program starts a different phase of the simulation (e.g. plotting after the simulation). This minimizes the simulation time lost to programming bugs.

  • Simulation output is what the program writes out once the simulation is done. When doing long simulations it is important to consider which quantities you want to output. One should include everything that might be needed, so that the simulations do not have to be run again. For time-series output this is even more important, as e.g. averages and statistical moments cannot necessarily be recalculated after the simulation has ended. It is usually a good idea to save a checkpoint at the end as well.

When creating monitoring output it is usually best to write it in a human-readable format with human-readable quantities. This makes it easy to see the state of the program.

Checking job history after completion

The command slurm h/slurm history can be used to check the history of your jobs. Example output is given below:

$ slurm h
JobID         JobName              Start            ReqMem  MaxRSS TotalCPUTime    WallTime Tasks CPU Ns Exit State Nodes
60984785      _interactive         06-06 20:41:31    500Mc       -    00:01.739    00:07:36  none   1 1   0:0 CANC  pe6
  └─ batch    *                    06-06 20:41:31    500Mc      6M    00:01.737    00:07:36     1   1 1   0:0 COMP  pe6
  └─ extern   *                    06-06 20:41:31    500Mc      1M    00:00.001    00:07:36     1   1 1   0:0 COMP  pe6
60984796      hostname             06-06 20:49:36    500Mc       -    00:00.016    00:00:00  none  10 10  0:0 CANC  csl[3-6,9,14,17-18,20,23]
  └─ extern   *                    06-06 20:49:36    500Mc      1M    00:00.016    00:00:01    10  10 10  0:0 COMP  csl[3-6,9,14,17-18,20,23]

The output columns are as follows:

  • JobID shows the id number that Slurm has assigned for your job.

  • JobName shows the name of the submission script / job step / command.

  • Start shows the start time of the job.

  • ReqMem shows the amount of memory requested by the job. The format is an amount in megabytes or gigabytes followed by c or n for memory per core or memory per node respectively.

  • MaxRSS shows the maximum memory usage of the job as calculated by Slurm. This is sampled at set intervals.

  • TotalCPUTime shows the total CPU time used by the job, i.e. the number of seconds the CPUs were at full utilization. For single-CPU jobs this should be close to the WallTime. For jobs that use multiple CPUs it should be close to the number of reserved CPUs times the WallTime.

  • WallTime shows the elapsed (wall-clock) runtime of the job.

  • Tasks shows the number of MPI tasks reserved for the job.

  • CPU shows the number of CPUs reserved for the job.

  • Ns shows the number of nodes reserved for the job.

  • Exit State shows the exit code of the command. A successful run of the program should return 0 as the exit code.

  • Nodes shows the names of the nodes where the program ran.

The slurm history command is a wrapper around the sacct command. One can also use sacct directly to get more information on the job’s status. See sacct’s documentation for more information.

For example, the command sacct --format=jobid,elapsed,ncpus,ntasks,state,MaxRss --jobs=JOBID will show the information indicated in the --format option (job id, elapsed time, number of reserved CPUs, etc.). You can specify any field of interest using --format.

Checking CPU and RAM efficiency after completion

You can use seff JOBID to see what percent of available CPUs and RAM was utilized. Example output is given below:

$ seff 60985042
Job ID: 60985042
Cluster: triton
User/Group: tuomiss1/tuomiss1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:29
CPU Efficiency: 90.62% of 00:00:32 core-walltime
Job Wall-clock time: 00:00:16
Memory Utilized: 1.59 MB
Memory Efficiency: 0.08% of 2.00 GB

If your processor usage is far below 100%, your code may not be running efficiently or may not be using all the CPUs you requested. If your memory usage is far below 100% or above 100%, you might have a problem with your RAM request. You should set the RAM request a bit above the amount of RAM you have actually utilized.

You can also monitor individual job steps by calling seff with the syntax seff JOBID.JOBSTEP.
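
For example, to look at only the batch step of the job shown above (the step name here is just an illustration):

$ seff 60985042.batch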

Important

When making job reservations it is important to distinguish between requirements for the whole job (such as --mem) and requirements for each individual task/cpu (such as --mem-per-cpu). E.g. requesting --mem-per-cpu=2G with --ntasks=2 and --cpus-per-task=4 will create a total memory reservation of (2 tasks)*(4 cpus / task)*(2GB / cpu)=16GB.
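
As a minimal sketch of the difference (the program name is hypothetical), the reservation described above could be written like this:

#!/bin/bash
# Per-CPU memory request: 2 tasks * 4 CPUs/task * 2 GB/CPU = 16 GB in total.
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G

# The same total could be requested for the whole job with
# "#SBATCH --mem=16G" instead (use one or the other, not both).

srun my_program    # hypothetical program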

Monitoring a job’s GPU utilization

See also

GPU computing. We will talk about how to request GPUs later, but it’s kept here for clarity.

When running a GPU job, you should check that the GPU is being fully utilized.

When your job has started, you can ssh to the node and run nvidia-smi. The GPU utilization it reports should be close to 100%.

Once the job has finished, you can use slurm history to obtain the jobID and run:

$ sacct -j JOBID -o comment -p
{"gpu_util": 99.0, "gpu_mem_max": 1279.0, "gpu_power": 204.26, "ncpu": 1, "ngpu": 1}|

This also shows the GPU utilization (the gpu_util field, as a percentage).

If the GPU utilization of your job is low, you should check whether its CPU utilization is close to 100% with seff JOBID. High CPU utilization together with low GPU utilization can indicate that the CPUs are trying to keep the GPU occupied with calculations but cannot keep up, and thus the GPU is not constantly working.

Increasing the number of CPUs you request can help, especially in tasks that involve data loading or preprocessing, but your program must know how to utilize the CPUs.

However, you shouldn’t request too many CPUs either: there wouldn’t be enough CPUs left for everyone to use the GPUs, and the GPUs would go to waste (all of our GPU nodes have 4-12 CPUs for each GPU).
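
As a minimal sketch (the CPU count, memory, time and script name are only illustrative; requesting GPUs is covered in GPU computing), a job that reserves a few CPUs to feed a single GPU might look like this:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00
#SBATCH --mem=8G

# One GPU plus a few CPUs for data loading/preprocessing. train.py is a
# hypothetical script; it must itself use multiple worker processes or
# threads to benefit from the extra CPUs.
srun python -u train.py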

Exercises

The scripts you need for the following exercises can be found in our hpc-examples repository, which we discussed in Using the cluster from a shell. You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. Doing this creates a local copy of the repository in your current working directory. This repository will be used for most of the tutorial exercises.

Monitoring-1: Adding more verbosity into your scripts

echo is a shell command which prints something - the equivalent of “print debugging”.

date is a shell command that prints the current date and time. It is useful for getting timestamps.

Modify one of the scripts from Serial Jobs by adding plenty of echo MY LINE OF TEXT commands so that you can verify what it’s doing. Check the output.

Now change the script and add a date command below each echo command. Run the script and check the output. What do you see?

Now change the script, remove the echo commands, and add “set -x” below the #SBATCH comments. Run the script again. What do you see?

Monitoring-2: Basic monitoring example

Using our standard pi.py example,

  1. Create a slurm script that runs the algorithm with 100000000 (10^8) iterations. Submit it to the queue and use slurm queue, slurm history and seff to monitor the job’s performance.

  2. Add multiple job steps (separate srun lines), each of which runs pi.py with an increasing number of iterations (in the range 100 - 10000000 (10^7)). How does this appear in slurm history?

Monitoring-3: Using seff

Continuing from the example above,

  1. Use seff to check performance of individual job steps. Can you explain why the CPU utilization numbers change between steps?

This is really one of the most important take-aways from this lesson.

Monitoring-4: Multiple processors

The script pi.py has been written so that it can be run using multiple processors. Run the script with multiple processors and 10^8 iterations with:

$ srun --cpus-per-task=2 python pi.py --nprocs=2 100000000

After you have run the script, do the following:

  1. Use slurm history to check the TotalCPUTime and WallTime. Compare them to the timings for the single CPU run with 10^8 iterations.

  2. Use seff to check CPU performance of the job.

Monitoring-5: No output

You submit a job, and it should be writing some stuff to the output. But nothing is appearing in the output file. What’s wrong?

What’s next?

The next tutorial is about different ways of doing parallel computing.