Pi exercises
This series of exercises uses a simple Python code that calculates pi.
Using the cluster from a command line
In triton/tut/cluster-shell.rst:
Shell-4: Clone the hpc-examples repository
(Part of a series: pi, ngrams)
Do the steps above to clone the hpc-examples repository. List the directory from the command line and verify that it matches what you see on the repository's GitHub page.
Is your home directory the right place to store a cloned git repository?
Solution
The steps are listed above. You also can check that everything is correct with
git status. Output should be something like this:
$ ls
io/ mpi/ postgres/ R/ scip/ gpu/ misc/ openmp/ python/ README.rst slurm/
$ git status
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean
Normally, large projects you are working on should be in your work directory. This is small enough we can ignore that for now (and make our exercises work on different clusters).
In triton/tut/cluster-shell.rst:
Shell-7: Try the --help option
(Part of a series: pi)
Many programs have a --help option which gives a reminder of the
options of the program. (Note that this has to be explicitly
programmed - it’s a convention, not magic.) Try giving this option
to pi.py and see what happens.
Solution
pi.py does have a --help option. Libraries that handle
command line arguments for you can auto-generate this help, which
is useful even if you wrote the program yourself. In this case,
the help output is automatically generated by the Python standard
library module argparse.
$ python slurm/pi.py --help
usage: pi.py [-h] [--nprocs NPROCS] [--seed SEED] [--sleep SLEEP]
             [--optimized] [--serial SERIAL]
             iters

positional arguments:
  iters            Number of iterations

optional arguments:
  -h, --help       show this help message and exit
  --nprocs NPROCS  Number of nprocs, using multiprocessing
  --seed SEED      Random seed
  --sleep SLEEP    Sleep this many seconds
  --optimized      Run an optimized vectorized version of the code
  --serial SERIAL  This fraction [0.0--1.0] of iterations to be run serial.
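To see how argparse produces help text like this without any hand-written formatting, here is a minimal sketch of a parser with the same option names. It is not the actual source of pi.py: the argument types and defaults are assumptions chosen for illustration, and only the option names and help strings come from the output above.

```python
import argparse


def build_parser():
    """Build a parser whose options mirror the help text shown above."""
    parser = argparse.ArgumentParser(prog="pi.py")
    # Types and defaults below are assumptions, not taken from pi.py.
    parser.add_argument("iters", type=int, help="Number of iterations")
    parser.add_argument("--nprocs", type=int, default=None,
                        help="Number of nprocs, using multiprocessing")
    parser.add_argument("--seed", type=int, default=None, help="Random seed")
    parser.add_argument("--sleep", type=float, default=None,
                        help="Sleep this many seconds")
    parser.add_argument("--optimized", action="store_true",
                        help="Run an optimized vectorized version of the code")
    parser.add_argument("--serial", type=float, default=0.0,
                        help="This fraction [0.0--1.0] of iterations to be run serial.")
    return parser


parser = build_parser()
# argparse adds -h/--help by itself; format_help() returns the same text
# that --help would print.
print(parser.format_help())
# Parsing an explicit argument list instead of the real command line.
print(parser.parse_args(["5000", "--seed", "13"]))
```

Note that no `--help` argument is declared anywhere: argparse registers it automatically, which is why the convention costs the programmer nothing here.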
Interactive jobs
In triton/tut/interactive.rst:
Interactive-3: Time scaling
The program hpc-examples/slurm/pi.py
calculates pi using a simple stochastic algorithm. The program takes
one positional argument: the number of trials.
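As a rough picture of what a stochastic pi calculation can look like, the sketch below uses the common quarter-circle method: draw random points in the unit square and count how many fall inside the circle of radius 1. This is an assumption about the general approach, not a copy of pi.py; the function name `estimate_pi` is hypothetical.

```python
import random


def estimate_pi(iterations, seed=None):
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(iterations):
        x, y = rng.random(), rng.random()
        # A "success" is a point inside the quarter circle of radius 1.
        if x * x + y * y <= 1.0:
            successes += 1
    # The quarter circle covers pi/4 of the square, so scale the hit
    # fraction by 4.
    return 4.0 * successes / iterations, successes


estimate, hits = estimate_pi(100_000, seed=1)
print(estimate, hits)
```

Because every trial is a separate loop iteration, the run time of this kind of algorithm grows roughly linearly with the number of trials, which is what the timing exercise below explores.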
The time program allows you to time any program, e.g. you can
time python x.py to print the amount of time it takes.
Run the program, timing it with time, a few times, increasing the number of trials, until it takes about 10 seconds: time python hpc-examples/slurm/pi.py 500, then 5000, then 50000, and so on.
Add srun in front (srun python ...). Use the seff JOBID command to see how much time the program took to run. (If you’d like to use the time command, you can run srun --mem=MEM --time=TIME time python hpc-examples/slurm/pi.py ITERS)
Look at the job history using slurm history - can you see how much time each process used? What’s the relation between TotalCPUTime and WallTime?
Solution
$ time python3 slurm/pi.py 5000
Calculating pi via 5000 stochastic trials
{"pi_estimate": 3.1384, "iterations": 5000, "successes": 3923}
real 0m0.095s
user 0m0.082s
sys 0m0.014s
$ time python3 slurm/pi.py 50000
Calculating pi via 50000 stochastic trials
{"pi_estimate": 3.13464, "iterations": 50000, "successes": 39183}
real 0m0.154s
user 0m0.134s
sys 0m0.020s
$ time python3 slurm/pi.py 500000
Calculating pi via 500000 stochastic trials
{"pi_estimate": 3.141776, "iterations": 500000, "successes": 392722}
real 0m0.792s
user 0m0.766s
sys 0m0.023s
$ time python3 slurm/pi.py 5000000
Calculating pi via 5000000 stochastic trials
{"pi_estimate": 3.1424752, "iterations": 5000000, "successes": 3928094}
real 0m6.287s
user 0m6.262s
sys 0m0.026s

$ srun python3 slurm/pi.py 5000000
srun: job 19201873 queued and waiting for resources
srun: job 19201873 has been allocated resources
Calculating pi via 5000000 stochastic trials
{"pi_estimate": 3.1424752, "iterations": 5000000, "successes": 3928094}
$ srun python3 slurm/pi.py 50000000
srun: job 19201880 queued and waiting for resources
srun: job 19201880 has been allocated resources
Calculating pi via 50000000 stochastic trials
{"pi_estimate": 3.14153752, "iterations": 50000000, "successes": 39269219}
$ srun python3 slurm/pi.py 500000000
srun: job 19201910 queued and waiting for resources
srun: job 19201910 has been allocated resources
Calculating pi via 500000000 stochastic trials
{"pi_estimate": 3.14152692, "iterations": 500000000, "successes": 392690865}

$ seff 19201873
Job ID: 19201873
Cluster: triton
User/Group: darstr1/darstr1
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:04
CPU Efficiency: 100.00% of 00:00:04 core-walltime
Job Wall-clock time: 00:00:04
Memory Utilized: 1.21 MB
Memory Efficiency: 0.24% of 500.00 MB
$ seff 19201880
...
CPU Utilized: 00:00:44
CPU Efficiency: 97.78% of 00:00:45 core-walltime
Job Wall-clock time: 00:00:45
...
$ seff 19201910
...
CPU Utilized: 00:07:51
CPU Efficiency: 99.58% of 00:07:53 core-walltime
Job Wall-clock time: 00:07:53
...
Each process should be visible as a separate step, indexed from 0. For larger iteration counts, TotalCPUTime should be similar to WallTime, since TotalCPUTime shows the amount of time the CPUs were at full utilization, multiplied by the number of CPUs. Note that TotalCPUTime has a precision of milliseconds, whereas WallTime has a precision of seconds.
JobID        JobName  Start           ReqMem  MaxRSS  TotalCPUTime  WallTime  Tasks  CPU  Ns  Exit  State  Nodes
19201873     python3  06-06 23:18:21  500M    -       00:04.044     00:00:04  none   1    1   0:0   COMP   csl48
└─ extern    *        06-06 23:18:21          0M      00:00.001     00:00:04  1      1    1   0:0   COMP   csl48
└─ 0         python3  06-06 23:18:21          1M      00:04.043     00:00:04  1      1    1   0:0   COMP   csl48
19201880     python3  06-06 23:18:35  500M    -       00:44.417     00:00:45  none   1    1   0:0   COMP   csl48
└─ extern    *        06-06 23:18:35          1M      00:00.001     00:00:45  1      1    1   0:0   COMP   csl48
└─ 0         python3  06-06 23:18:35          1M      00:44.415     00:00:45  1      1    1   0:0   COMP   csl48
19201910     python3  06-06 23:19:25  500M    -       07:51.107     00:07:53  none   1    1   0:0   COMP   csl48
└─ extern    *        06-06 23:19:25          1M      00:00.001     00:07:53  1      1    1   0:0   COMP   csl48
└─ 0         python3  06-06 23:19:25          10M     07:51.106     00:07:53  1      1    1   0:0   COMP   csl48
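The CPU Efficiency figure that seff reports can be reproduced by hand: it is the CPU time divided by the core-walltime (wall-clock time multiplied by the number of cores). The sketch below recomputes it from the seff values shown above; the helper names `to_seconds` and `cpu_efficiency` are hypothetical.

```python
def to_seconds(hhmmss):
    """Convert an HH:MM:SS string, as printed by seff, to seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_utilized, wall_clock, cores=1):
    """Percentage of the allocated core-time that was spent computing."""
    core_walltime = to_seconds(wall_clock) * cores
    return 100.0 * to_seconds(cpu_utilized) / core_walltime


# Values copied from the seff output above (every job used 1 core).
for job_id, cpu, wall in [
    ("19201873", "00:00:04", "00:00:04"),
    ("19201880", "00:00:44", "00:00:45"),
    ("19201910", "00:07:51", "00:07:53"),
]:
    print(job_id, f"{cpu_efficiency(cpu, wall):.2f}%")
# prints:
# 19201873 100.00%
# 19201880 97.78%
# 19201910 99.58%
```

For a single-core job that is busy the whole time, the ratio stays close to 100%, which is why TotalCPUTime and WallTime nearly coincide in the job history.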
Serial Jobs
Serial-3: Submitting and cancelling a job
Create a batch script which does nothing (or some pointless
operation for a while), for example sleep 300 (this shell
command does nothing for 300 seconds). Check the queue to see when
it starts running. Then, cancel the job. What output is produced?
Solution
#!/bin/bash
echo "We are waiting"
sleep 300
echo "We are done waiting"
srun python3 slurm/pi.py 1000000
You can check when your job starts running with slurm q. Then
you can cancel it with scancel JOBID, where JOBID can be found
in the slurm q output. After cancelling the job, it should still produce
an output file (named either slurm-JOBID.out or whatever you defined in the
sbatch file). The output file also says that the job was cancelled.
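If you want to script the submit-then-cancel cycle instead of typing the commands, one possible approach is sketched below. It assumes the standard sbatch confirmation message ("Submitted batch job N") and the plain scancel JOBID form used above; the function names are hypothetical, and `submit_and_cancel` is only defined here, not called, because it needs a working Slurm installation.

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Extract the job ID from sbatch's 'Submitted batch job N' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return match.group(1)


def submit_and_cancel(script_path):
    """Submit a batch script with sbatch, then cancel it with scancel."""
    result = subprocess.run(["sbatch", script_path], check=True,
                            capture_output=True, text=True)
    job_id = parse_job_id(result.stdout)
    subprocess.run(["scancel", job_id], check=True)
    return job_id


# Parsing can be tried without a cluster:
print(parse_job_id("Submitted batch job 19201873"))  # prints 19201873
```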
Serial-4: Modifying Slurm script while it's running
Modifying scripts while a job has been submitted is bad practice.
Add sleep 120 into the Slurm script that runs pi.py. Submit the
script and while it is running, open the Slurm script with an editor of your
choice and add the following line near the end of the script.
echo 'Modified'
Use slurm q to check when the job finishes and check the output. What
can you conclude from this?
Remove the added line after you have finished.
Solution
In this case we modified a script after we had submitted it. These modifications do not affect the job in the queue, because the script was already given to Slurm at submission time and is locked in place from then on.
You should still make certain that you do not modify a Slurm script after submitting it; otherwise the file on disk no longer matches what was run, and you cannot replicate your run.
Serial-5: Modify script while it is running
Modifying scripts while a job has been submitted is bad practice.
Add sleep 180 into the Slurm script that runs pi.py. Submit the
script and, while it is running, open pi.py with an editor of your
choice and add the following line near the start of the script.
raise Exception()
Use slurm q to check when the job finishes and check the output. What
can you conclude from this?
Remove the added line after you have finished. You can also use
git checkout -- pi.py (remember to give a proper relative path,
depending on your current working directory!)
Solution
In this case we modified the Python code before it had begun executing
(we added a line that raised an error while the sleep 180 was being
executed).
The code that the Slurm script executes is read only when the script reaches that command at run time. It is not locked in place when you submit the script.
You should always make certain that you do not modify the code that the Slurm script will execute while the job is queued or while it's being run.
Otherwise you can get errors and you cannot replicate your run.
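The difference between the two exercises above can be imitated without a cluster. The sketch below is only an analogy, not how Slurm is implemented: "submission" is modelled as reading the batch script into memory, while the program it calls is read from disk at "run time". The file names are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    batch_script = Path(tmp) / "job.sh"    # hypothetical file names
    program = Path(tmp) / "program.py"
    batch_script.write_text("sleep 120\necho original\n")
    program.write_text("print('original')\n")

    # "Submission": the scheduler keeps its own copy of the batch script.
    submitted_copy = batch_script.read_text()

    # Edits made while the job is queued or still sleeping.
    batch_script.write_text("sleep 120\necho original\necho 'Modified'\n")
    program.write_text("raise Exception()\n")

    # "Execution": the stored copy of the batch script runs, but the
    # program is opened from disk only now, so the edit is picked up.
    script_changed = "Modified" in submitted_copy
    program_changed = "raise Exception()" in program.read_text()

print("batch script affected by edit:", script_changed)   # False
print("program affected by edit:", program_changed)       # True
```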
Array jobs: embarrassingly parallel execution
Array-2: Array jobs and different random seeds
Create a job array that uses slurm/pi.py to run a
combination of different iteration counts and seed values and saves
each result to a different file. Keep the standard output (#SBATCH
--output=FILE) separate from the standard error (#SBATCH
--error=FILE).
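One way to turn a single array index into a combination of parameters is to enumerate all combinations and pick one by position. The sketch below shows that mapping; the parameter values, the function name and the output file naming are all illustrative assumptions. SLURM_ARRAY_TASK_ID is the environment variable Slurm sets inside each array task.

```python
import itertools
import os

ITERATIONS = [50_000, 500_000, 5_000_000]  # illustrative values
SEEDS = [1, 2, 3, 4]                        # illustrative values


def parameters_for_task(task_id):
    """Return the (iterations, seed) pair for a zero-based array index."""
    combinations = list(itertools.product(ITERATIONS, SEEDS))
    return combinations[task_id]


# Slurm sets SLURM_ARRAY_TASK_ID in each array task; default to 0 elsewhere.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
iters, seed = parameters_for_task(task_id)
output_file = f"pi_{iters}_{seed}.json"  # hypothetical naming scheme
print(f"python slurm/pi.py --seed {seed} {iters} > {output_file}")
```

With 3 iteration counts and 4 seeds there are 12 combinations, so this mapping would pair with an array range of 0-11.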
Array-3: Combine the outputs of the previous exercise.
You can find the slurm/pi_aggregation.py program in hpc-examples. Run it
and give all the output files as arguments. It will combine all
the statistics and give a more accurate value of \(\pi\).
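To illustrate why combining runs helps, the sketch below pools several results by summing their trial and success counts and recomputing the estimate from the totals. It is not the source of pi_aggregation.py; it only assumes the JSON format that pi.py prints in the solution output earlier, and the function name `combine` is hypothetical.

```python
import json


def combine(results):
    """Pool several runs by summing their trial and success counts."""
    iterations = sum(r["iterations"] for r in results)
    successes = sum(r["successes"] for r in results)
    return {
        "pi_estimate": 4.0 * successes / iterations,
        "iterations": iterations,
        "successes": successes,
    }


# Two result lines copied from the solution output earlier on this page.
lines = [
    '{"pi_estimate": 3.1384, "iterations": 5000, "successes": 3923}',
    '{"pi_estimate": 3.13464, "iterations": 50000, "successes": 39183}',
]
combined = combine([json.loads(line) for line in lines])
print(combined)
```

Summing the raw counts weights each run by its number of trials, which a plain average of the individual estimates would not do. Note that runs sharing the same seed may repeat the same random numbers and then add no new information.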