Pi exercises

This series of exercises uses a simple Python program that calculates pi.

Using the cluster from a command line

Shell-4: Clone the hpc-examples repository

(Part of a series: pi, ngrams)

Follow the steps above to clone the hpc-examples repository. List the directory from the command line and verify that it matches what you see on the GitHub repository page.

Is your home directory the right place to store a cloned git repository?

Solution

The steps are listed above. You can also check that everything is correct with git status. The output should look something like this:
$ ls
io/    mpi/     postgres/  R/          scip/      gpu/
misc/  openmp/  python/    README.rst  slurm/

$ git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
Normally, large projects you are working on should be in your work directory. This repository is small enough that we can ignore that for now (which also lets the exercises work on different clusters).

In triton/tut/cluster-shell.rst:

Shell-7: Try the --help option

(Part of a series: pi)

Many programs have a --help option which gives a reminder of the options of the program. (Note that this has to be explicitly programmed - it’s a convention, not magic.) Try giving this option to pi.py and see what happens.

Solution

pi.py does have a --help option. Libraries that handle command line arguments for you can auto-generate this help, which is useful even if you wrote the program yourself. In this case, the help output is automatically generated by the Python standard library module argparse.

$ python slurm/pi.py --help
usage: pi.py [-h] [--nprocs NPROCS] [--seed SEED] [--sleep SLEEP]
             [--optimized] [--serial SERIAL]
             iters

positional arguments:
  iters            Number of iterations

optional arguments:
  -h, --help       show this help message and exit
  --nprocs NPROCS  Number of nprocs, using multiprocessing
  --seed SEED      Random seed
  --sleep SLEEP    Sleep this many seconds
  --optimized      Run an optimized vectorized version of the code
  --serial SERIAL  This fraction [0.0--1.0] of iterations to be run serial.
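As a rough illustration of how argparse generates this help text, here is a minimal sketch. The parser below is hypothetical: it mirrors only a few of the options shown above and is not the actual source of pi.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring a few of the options shown above;
    # this is not the actual source of pi.py.
    parser = argparse.ArgumentParser(prog="pi.py")
    parser.add_argument("iters", type=int, help="Number of iterations")
    parser.add_argument("--nprocs", type=int, default=1,
                        help="Number of nprocs, using multiprocessing")
    parser.add_argument("--seed", type=int, default=None, help="Random seed")
    return parser


if __name__ == "__main__":
    # argparse adds -h/--help automatically; format_help() returns the
    # text that `--help` would print.
    print(build_parser().format_help())
```

Note that no help option is defined explicitly: argparse adds -h/--help on its own and builds the text from the help= strings.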

Interactive jobs

In triton/tut/interactive.rst:

Interactive-3: Time scaling

The program hpc-examples/slurm/pi.py calculates pi using a simple stochastic algorithm. The program takes one positional argument: the number of trials.
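To make the idea of a stochastic algorithm concrete, here is a minimal sketch of one common approach: sample random points in the unit square and count how many fall inside the quarter circle. This is an assumption about the method, not the actual pi.py source, although the outputs in the solution below are consistent with pi_estimate = 4 * successes / iterations. The function name estimate_pi is made up for this illustration.

```python
import json
import random
from typing import Optional


def estimate_pi(iterations: int, seed: Optional[int] = None) -> dict:
    """Estimate pi by sampling random points in the unit square."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(iterations):
        x, y = rng.random(), rng.random()
        # A point lands inside the quarter circle of radius 1
        # with probability pi/4.
        if x * x + y * y < 1.0:
            successes += 1
    return {
        "pi_estimate": 4.0 * successes / iterations,
        "iterations": iterations,
        "successes": successes,
    }


if __name__ == "__main__":
    print(json.dumps(estimate_pi(5000, seed=13)))
```

Because every trial is one pass through a Python loop, the running time of such a program grows roughly linearly with the number of trials.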

The time program allows you to time any program, e.g. you can time python x.py to print the amount of time it takes.

Run the program, timing it with time, a few times, increasing the number of trials, until it takes about 10 seconds: time python hpc-examples/slurm/pi.py 500, then 5000, then 50000, and so on.
Add srun in front (srun python ...). Use the seff JOBID command to see how much time the program took to run. (If you’d like to use the time command, you can run srun --mem=MEM --time=TIME time python hpc-examples/slurm/pi.py ITERS)
Look at the job history using slurm history - can you see how much time each process used? What’s the relation between TotalCPUTime and WallTime?

Solution

$ time python3 slurm/pi.py 5000
Calculating pi via 5000 stochastic trials
{"pi_estimate": 3.1384, "iterations": 5000, "successes": 3923}

real   0m0.095s
user   0m0.082s
sys    0m0.014s
$ time python3 slurm/pi.py 50000
Calculating pi via 50000 stochastic trials
{"pi_estimate": 3.13464, "iterations": 50000, "successes": 39183}

real   0m0.154s
user   0m0.134s
sys    0m0.020s
$ time python3 slurm/pi.py 500000
Calculating pi via 500000 stochastic trials
{"pi_estimate": 3.141776, "iterations": 500000, "successes": 392722}

real   0m0.792s
user   0m0.766s
sys    0m0.023s
$ time python3 slurm/pi.py 5000000
Calculating pi via 5000000 stochastic trials
{"pi_estimate": 3.1424752, "iterations": 5000000, "successes": 3928094}

real   0m6.287s
user   0m6.262s
sys    0m0.026s

$ srun python3 slurm/pi.py 5000000
srun: job 19201873 queued and waiting for resources
srun: job 19201873 has been allocated resources
Calculating pi via 5000000 stochastic trials
{"pi_estimate": 3.1424752, "iterations": 5000000, "successes": 3928094}
$ srun python3 slurm/pi.py 50000000
srun: job 19201880 queued and waiting for resources
srun: job 19201880 has been allocated resources
Calculating pi via 50000000 stochastic trials
{"pi_estimate": 3.14153752, "iterations": 50000000, "successes": 39269219}
$ srun python3 slurm/pi.py 500000000
srun: job 19201910 queued and waiting for resources
srun: job 19201910 has been allocated resources
Calculating pi via 500000000 stochastic trials
{"pi_estimate": 3.14152692, "iterations": 500000000, "successes": 392690865}

$ seff 19201873
Job ID: 19201873
Cluster: triton
User/Group: darstr1/darstr1
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:04
CPU Efficiency: 100.00% of 00:00:04 core-walltime
Job Wall-clock time: 00:00:04
Memory Utilized: 1.21 MB
Memory Efficiency: 0.24% of 500.00 MB
$ seff 19201880
...
CPU Utilized: 00:00:44
CPU Efficiency: 97.78% of 00:00:45 core-walltime
Job Wall-clock time: 00:00:45
...
$ seff 19201910
...
CPU Utilized: 00:07:51
CPU Efficiency: 99.58% of 00:07:53 core-walltime
Job Wall-clock time: 00:07:53
...

Each process should be visible as a separate step, indexed from 0. For larger iteration counts, TotalCPUTime should be similar to WallTime, since TotalCPUTime shows the amount of time the CPUs were at full utilization, multiplied by the number of CPUs (here, one). Note that TotalCPUTime has a precision of milliseconds, whereas WallTime has a precision of seconds.

JobID         JobName              Start            ReqMem  MaxRSS TotalCPUTime    WallTime Tasks CPU Ns Exit State Nodes
19201873      python3              06-06 23:18:21     500M       -    00:04.044    00:00:04  none   1 1   0:0 COMP  csl48
  └─ extern   *                    06-06 23:18:21               0M    00:00.001    00:00:04     1   1 1   0:0 COMP  csl48
  └─ 0        python3              06-06 23:18:21               1M    00:04.043    00:00:04     1   1 1   0:0 COMP  csl48
19201880      python3              06-06 23:18:35     500M       -    00:44.417    00:00:45  none   1 1   0:0 COMP  csl48
  └─ extern   *                    06-06 23:18:35               1M    00:00.001    00:00:45     1   1 1   0:0 COMP  csl48
  └─ 0        python3              06-06 23:18:35               1M    00:44.415    00:00:45     1   1 1   0:0 COMP  csl48
19201910      python3              06-06 23:19:25     500M       -    07:51.107    00:07:53  none   1 1   0:0 COMP  csl48
  └─ extern   *                    06-06 23:19:25               1M    00:00.001    00:07:53     1   1 1   0:0 COMP  csl48
  └─ 0        python3              06-06 23:19:25              10M    07:51.106    00:07:53     1   1 1   0:0 COMP  csl48
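The relation between CPU time and wall-clock time can be checked by hand. The sketch below assumes that the efficiency reported by seff is CPU time divided by (cores × wall-clock time), which matches the numbers shown above. The helper name cpu_efficiency is made up for this illustration.

```python
def cpu_efficiency(cpu_seconds: float, wall_seconds: float, cores: int = 1) -> float:
    """Return CPU efficiency in percent: CPU time / (cores * wall time)."""
    core_walltime = cores * wall_seconds
    return 100.0 * cpu_seconds / core_walltime


if __name__ == "__main__":
    # (CPU Utilized, Job Wall-clock time) in seconds, from the seff output above.
    for cpu, wall in [(4, 4), (44, 45), (7 * 60 + 51, 7 * 60 + 53)]:
        print(f"{cpu_efficiency(cpu, wall):.2f}%")
    # prints 100.00%, 97.78% and 99.58%
```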

Serial Jobs

In triton/tut/serial.rst:

Serial-3: Submitting and cancelling a job

Create a batch script which does nothing (or some pointless operation for a while), for example sleep 300 (this shell command does nothing for 300 seconds). Check the queue to see when it starts running. Then, cancel the job. What output is produced?

Solution

#!/bin/bash

echo "We are waiting"
sleep 300
echo "We are done waiting"
srun python3 slurm/pi.py 1000000

You can check when your job starts running with slurm q. Then you can cancel it with scancel JOBID, where JOBID can be found in the slurm q output. After the job has been cancelled, it should still produce an output file (named either slurm-JOBID.out or whatever you defined in the sbatch file). The output file also says that the job was cancelled.

In triton/tut/serial.rst:

Serial-4: Modifying the Slurm script while it is running

Modifying scripts after a job has been submitted is bad practice.

Add sleep 120 into the Slurm script that runs pi.py. Submit the script and while it is running, open the Slurm script with an editor of your choice and add the following line near the end of the script.

echo 'Modified'

Use slurm q to check when the job finishes and check the output. What can you interpret from this?

Remove the created line after you have finished.

Solution

In this case we modified a script after we had submitted it. These modifications do not affect the script that is in the queue as that script has already been given to Slurm.

The Slurm script is locked in place when you submit a script. Modifications to the script do not affect the script that is being run.

You should always make certain that you do not modify the Slurm script being run or you cannot replicate your run.

In triton/tut/serial.rst:

Serial-5: Modify script while it is running

Modifying scripts after a job has been submitted is bad practice.

Add sleep 180 into the Slurm script that runs pi.py. Submit the script and while it is running, open the pi.py with an editor of your choice and add the following line near the start of the script.

raise Exception()

Use slurm q to check when the job finishes and check the output. What can you interpret from this?

Remove the created line after you have finished. You can also use git checkout -- pi.py (remember to give a proper relative path, depending on your current working directory!)

Solution

In this case we modified the Python code before it had begun executing (we added a line that raised an error while the sleep 180 was being executed).

The code that the Slurm script executes will be determined when the script is running. It is not locked in place when you submit a script.

You should always make certain that you do not modify the code that the Slurm script will execute while it is queued or while it is being run.

Otherwise you can get errors and you cannot replicate your run.

Array jobs: embarrassingly parallel execution

In triton/tut/array.rst:

Array-2: Array jobs and different random seeds

Create a job array that uses slurm/pi.py to calculate a combination of different iterations and seed values and save them all to different files. Keep the standard output (#SBATCH --output=FILE) separate from the standard error (#SBATCH --error=FILE).
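One way to turn an array index into a combination of parameters is sketched below. It relies on the SLURM_ARRAY_TASK_ID environment variable that Slurm sets inside each array task. The parameter grid and the helper name params_for_task are hypothetical, and this is only one of several possible approaches.

```python
import itertools
import os

# Hypothetical parameter grid; choose your own values.
ITERATIONS = [500000, 1000000]
SEEDS = [0, 1, 2]


def params_for_task(task_id: int) -> tuple:
    """Map an array index to one (iterations, seed) combination."""
    combos = list(itertools.product(ITERATIONS, SEEDS))
    return combos[task_id]


if __name__ == "__main__":
    # Slurm sets SLURM_ARRAY_TASK_ID inside each array task; default to 0
    # so that the sketch also runs outside a job.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    iters, seed = params_for_task(task_id)
    # Print the command this array task would run.
    print(f"python3 slurm/pi.py --seed={seed} {iters}")
```

With this grid there are six combinations, so the array indices would run from 0 to 5.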

In triton/tut/array.rst:

Array-3: Combine the outputs of the previous exercise.

You can find the slurm/pi_aggregation.py program in hpc-examples. Run it and give all the output files as arguments. It will combine all the statistics and give a more accurate value of $\pi$.
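The idea behind combining the outputs can be sketched as follows. This is not the actual pi_aggregation.py: it is a minimal illustration that assumes each result is a JSON line in the format pi.py prints (with "iterations" and "successes" fields) and that the runs used different seeds, so that they are independent.

```python
import json


def aggregate(json_lines) -> dict:
    """Pool the trials of several runs into one estimate of pi."""
    total_iterations = 0
    total_successes = 0
    for line in json_lines:
        result = json.loads(line)
        total_iterations += result["iterations"]
        total_successes += result["successes"]
    return {
        "pi_estimate": 4.0 * total_successes / total_iterations,
        "iterations": total_iterations,
        "successes": total_successes,
    }


if __name__ == "__main__":
    # Two result lines copied from the timing exercise earlier on this page.
    lines = [
        '{"pi_estimate": 3.1384, "iterations": 5000, "successes": 3923}',
        '{"pi_estimate": 3.13464, "iterations": 50000, "successes": 39183}',
    ]
    print(json.dumps(aggregate(lines)))
```

Summing successes and iterations weights each run by its number of trials, which is preferable to averaging the individual estimates when the runs differ in size.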

Shared memory parallelism: multithreading & multiprocessing

In triton/tut/parallel-shared.rst:

Shared memory parallelism 1: Test the example’s scaling

Run the example with a bigger number of trials (100000000 or $10^{8}$) and with 1, 2 and 4 CPUs. Check the running time and CPU utilization for each run.

Solution

You can run the program without parallelization with:

srun --time=00:10:00 --mem=1G python3 slurm/pi.py 100000000

Afterwards you can use seff JOBID to get the utilization. You can run the program with multiple CPUs with:

srun --cpus-per-task=2 --time=00:10:00 --mem=1G python3 slurm/pi.py --nprocs=2 100000000
srun --cpus-per-task=4 --time=00:10:00 --mem=1G python3 slurm/pi.py --nprocs=4 100000000

You should see that the time needed to run the program (“Job Wall-clock time”) is basically divided by the number of processors, while the CPU utilization time (“CPU Utilized”) remains the same.

In triton/tut/parallel-shared.rst:

Shared memory parallelism 2: Test scaling for a program that has a serial part

pi.py can be called with an argument --serial=0.1 to run a fraction of the trials in a serial fashion (here, 10%).

Run the example with a bigger number of trials (100000000 or $10^{8}$), 4 CPUs and a varying serial fraction (0.1, 0.5, 0.8). Check the running time and CPU utilization for each run.

Solution

You can run the program with 10% serial execution using the following:

srun --cpus-per-task=4 --time=00:10:00 --mem=1G python3 slurm/pi.py --serial=0.1 --nprocs=4 100000000

Afterwards you can use seff JOBID to get the utilization.

Doing the run with different serial fractions should show that the bigger the serial portion, the less benefit the parallelization gives.
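This behaviour is commonly described by Amdahl's law. The sketch below computes the idealized speedup for the serial fractions used in this exercise. It is a theoretical upper bound that ignores process start-up and other overheads, so measured timings will differ.

```python
def amdahl_speedup(serial_fraction: float, nprocs: int) -> float:
    """Idealized speedup when only the non-serial part is parallelized."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nprocs)


if __name__ == "__main__":
    for serial in (0.1, 0.5, 0.8):
        print(f"serial={serial}: speedup with 4 CPUs = {amdahl_speedup(serial, 4):.2f}")
    # prints speedups of 3.08, 1.60 and 1.18
```

Even with 4 CPUs, a run that is 80% serial can be at most about 1.18 times faster than a single-CPU run.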

In triton/tut/parallel-shared.rst:

Shared memory parallelism 3: More parallel $\neq$ fastest solution

pi.py can be called with an argument --optimized to run an optimized version of the code that utilizes NumPy for vectorized calculations.

Run the example with a bigger number of trials (100000000 or $10^{8}$) and with 4 CPUs. Now run the optimized example with the same amount of trials and with 1 CPU. Check the CPU utilization and running time for each run.

Solution

You can run the program with 4 CPUs using the following:

srun --cpus-per-task=4 --time=00:10:00 --mem=1G python3 slurm/pi.py --nprocs=4 100000000

You can run the optimized version with the following:

srun --time=00:10:00 --mem=1G python3 slurm/pi.py --optimized 100000000

Afterwards you can use seff JOBID to get the utilization.

The optimized version uses NumPy to create a big batch of random numbers at a time and calculates the hits for all of them at the same time, so it should be significantly faster. NumPy itself uses libraries written in C and Fortran that make the calculations a lot faster than pure Python would.

Using libraries and coding practices that are better suited for the task can provide a bigger performance boost than using multiple CPUs.
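A vectorized calculation of this kind might look like the sketch below. It is an illustration of the batching idea, not the actual --optimized code path of pi.py. The function name estimate_pi_vectorized and the batch_size parameter are assumptions made for this example.

```python
import numpy as np


def estimate_pi_vectorized(iterations: int, seed=None, batch_size: int = 1_000_000) -> dict:
    """Estimate pi by testing whole batches of random points at once."""
    rng = np.random.default_rng(seed)
    successes = 0
    remaining = iterations
    while remaining > 0:
        n = min(batch_size, remaining)
        # One call generates n random (x, y) points in the unit square.
        xy = rng.random((n, 2))
        # Count the points inside the quarter circle without a Python loop.
        successes += int(np.count_nonzero((xy * xy).sum(axis=1) < 1.0))
        remaining -= n
    return {
        "pi_estimate": 4.0 * successes / iterations,
        "iterations": iterations,
        "successes": successes,
    }


if __name__ == "__main__":
    print(estimate_pi_vectorized(1_000_000, seed=0))
```

Working in batches keeps memory use bounded, since only batch_size points are held in memory at any time.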