Ngrams exercises

This series of exercises computes word or character ngrams based on fiction books in Project Gutenberg. This page collects all of the ngrams exercises in one place, with links to their original locations. You will have to refer to the main pages to figure out how to do the exercises. The code is on GitHub.

The general outline:

Look at data storage and the dataset that is already on the cluster
Copy over your own personal copy of the data (make a duplicate dataset)
Copy over the code (clone a Git repository)
Run the code on the cluster’s login node
Run the code on the cluster itself
Run the code in parallel

Using the cluster from a command line

In triton/tut/cluster-shell.rst:

Shell-2: Look at the Ngrams dataset

(Part of a series: ngrams)

First, let’s look at the Gutenberg-Fiction dataset. There is already a copy on the cluster (this is different from the personal copy you will make later).

Use the command line to look at this directory with some books in it: /scratch/shareddata/teaching/gutenberg-fiction/. Try to find:

What files are they?
How big are they?

Solution

$ ls /scratch/shareddata/teaching/gutenberg-fiction/
$ du -sh /scratch/shareddata/teaching/gutenberg-fiction/*

In triton/tut/cluster-shell.rst:

Shell-4: Clone the hpc-examples repository

(Part of a series: pi, ngrams)

Do the steps above to clone the hpc-examples repository. List the directory from the command line and verify that it matches what you see on the GitHub repository page.

Is your home directory the right place to store a cloned git repository?

Solution

The steps are listed above. You can also check that everything is correct with git status. The output should be something like this:
$ ls
io/    mpi/     postgres/  R/          scip/      gpu/
misc/  openmp/  python/    README.rst  slurm/

$ git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
Normally, large projects you are working on should be in your work directory. This is small enough we can ignore that for now (and make our exercises work on different clusters).

In triton/tut/cluster-shell.rst:

Shell-9: Practice looking at README files

(Part of a series: ngrams)

“README” files are plain-text documentation, and it is good practice to include one with your data. When you get new data, look at its README files first to get oriented. Check out the README files within the Gutenberg-Fiction dataset located at /scratch/shareddata/teaching/gutenberg-fiction/ (again: you’ll download your own copy of this later).

Solution

We haven’t given you the exact file paths, so you need to use ls a few times to find them. (You can also push TAB twice to tab-complete, which lists what is in a directory. Then, type a few characters and push TAB again and it’ll fill in the path for you. This is a fast way to explore directories.)

$ ls /scratch/shareddata/teaching/gutenberg-fiction/
$ ls /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction/
$ less /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction/README.md
$ less /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100/README

Data storage

In triton/tut/storage.rst:

Storage-1: Create a place to store our Gutenberg (ngrams) data.

We looked at the Gutenberg-Fiction dataset before, where it’s already on Triton. In the next few exercises, let’s pretend we didn’t already have it on the cluster, and practice copying the data to the cluster yourself.

Background: we will do a recurring example using public domain Project Gutenberg books. We will compute ngrams: tuples of words that occur in sequence (for example, (this, book, is) is a 3-gram). After computing a list of all n-grams and how often they occur, we can understand something about the books (and sometimes generate some text).
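As a rough illustration of the idea, word n-grams can be counted in a few lines of Python. This is a minimal sketch and not the code from the hpc-examples repository: the function name word_ngrams and the simple regular-expression tokenizer are assumptions made here.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Yield each run of n consecutive words in text as a tuple."""
    # Assumption: a "word" is a run of ASCII letters; the real code may differ.
    words = re.findall(r"[a-z]+", text.lower())
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])


text = "this book is good and this book is long"
counts = Counter(word_ngrams(text, 3))

# Print the n-grams from most to least frequent.
for ngram, count in counts.most_common():
    print(count, ngram)
```

Here the 3-gram (this, book, is) occurs twice and every other 3-gram occurs once.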

In the next step you will use data that is a 2.6 GB zipfile (6.3 GB uncompressed). Where would you store this data?

Solution

6.3 GB is small enough to fit in your home directory, but would use up most of the space.
Also, the data is downloadable from a public archive, so you don’t need to worry about backups.
You aren’t currently working with other people, so the data doesn’t need to be in a shared location.

So your personal Triton work directory ($WRKDIR, /scratch/work/USER/) seems appropriate. You would make a subdirectory within here:

$ mkdir $WRKDIR/gutenberg-fiction/

Since it’s a common dataset, it also makes sense to make one copy for everyone. We have done this in /scratch/shareddata/:

/scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction.zip
/scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip
/scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip

Remote access to data

In triton/tut/remotedata.rst:

RemoteData-1: Copy the ngrams data over to the cluster

Download one of the following archives and upload it to Triton. The data is the same, just different numbers of books, so choose a file small enough for your internet connection to be happy:

19 MB repack of the original: https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip (recommended)
152 MB repack of the original: https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first1000.zip
Full 2.6 GB (original): https://zenodo.org/records/5783256

Then, upload the data to Triton from your computer, into the location decided in the previous step (a project directory within your work directory).

Interactive jobs

In triton/tut/interactive.rst:

Interactive-2: Compute ngrams via batch jobs

Let’s compute some ngrams. This uses the code from the hpc-examples repository AND the data we have transferred, even though they are in two different directories. We want to go to the code directory and point the code to the data directory.

This and following examples use the data that is already downloaded to the cluster, stored under /scratch/shareddata/teaching/gutenberg-fiction/ (so that the examples will just work, without having to do previous steps). You can also give it the path to the copy of the data you downloaded.

Some things about this code: If you run with --words, it computes word ngrams ([“the”, “lake”]). Otherwise, it computes character ngrams ([“t”, “h”]). The option -n specifies the n in ngrams (like -n 2; the default is 1-grams, which are simple character/word frequencies). The -o option says where an output file should be saved; otherwise, the output is printed to the screen. The --help option tells you more, or you can check the code on GitHub.

First, we try running on the login node. This is just a quick test to make sure that nothing is really wrong. We don’t want to do real computing here. These commands save the output to a file named ngrams2, which you might want to change between examples:

$ cd hpc-examples
$ python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o ngrams2 -n 2

Now we do the same, but with srun to run on the cluster:

## character ngrams
$ srun python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o ngrams2 -n 2

## word ngrams
$ srun python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o ngrams2 -n 2 --words

If we compute word-ngrams for the 1000-book dataset, we see that we run out of memory. Thus, we try again with the --mem=5G option to see that it then works.

$ srun python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip --words -o ngrams2 -n 2
srun: slurm_job_submit: Automatically setting partition to: batch-hsw,batch-bdw,batch-csl,batch-skl,batch-milan
srun: job 5766825 queued and waiting for resources
srun: job 5766825 has been allocated resources
Found 1000 files in /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip
slurmstepd: error: Detected 1 oom_kill event in StepId=5766825.0. Some of the step tasks have been OOM Killed.

$ srun --mem=5G --time=0-1 python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip --words -o ngrams2 -n 2

Serial Jobs

In triton/tut/serial.rst:

Serial-2: Compute ngrams via a batch job

Create a batch job that computes our ngrams.

Solution

Create a script file ngrams.sh with the nano editor:

$ nano ngrams.sh

#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --mem=1G

srun python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o ngrams2 -n 2

We submit with:

$ sbatch ngrams.sh

Array jobs: embarrassingly parallel execution

In triton/tut/array.rst:

Array-1: Compute ngrams via an array job

Computing the n-grams over the whole Gutenberg-Fiction dataset could take a while. Using array jobs can make this calculation faster by splitting the calculation across multiple array tasks where each task does the n-gram calculation for a subset of books in the dataset.

Before jumping to the full dataset, let’s start with the smaller subset of the first 100 books and test our code with that.

Type along with the following:

The following batch job computes 3-grams on the dataset in 20 batches, and saves each batch to its own file (the \ at the end of a line continues the command onto the next line).

count-3grams-array.sh:

#!/bin/bash
#SBATCH --mem=2G
#SBATCH --array=0-19
#SBATCH --time=00:10:00

mkdir -p ngrams-output

python3 ngrams/count.py /scratch/work/$USER/gutenberg-fiction/Gutenberg-Fiction-first100.zip \
  -n 3 --words \
  --start=$SLURM_ARRAY_TASK_ID --step=20 \
  --output=ngrams-output/ngrams3-words-array_$SLURM_ARRAY_TASK_ID.out

Submit the script with sbatch count-3grams-array.sh. It will run very fast.
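The --start and --step options are what divide the books among the array tasks. One plausible way to implement this (an assumption; check ngrams/count.py for the actual logic) is for each task to take every 20th file, beginning at its own task ID. A sketch with hypothetical file names:

```python
def select_files(files, start, step):
    """Return the subset of files handled by one array task."""
    # Task `start` takes files start, start+step, start+2*step, ...
    return files[start::step]


# Hypothetical file names standing in for the 100 books in the archive.
files = [f"book{i:03d}.txt" for i in range(100)]
step = 20

subsets = [select_files(files, task_id, step) for task_id in range(step)]

print(subsets[0])                # files handled by array task 0
print([len(s) for s in subsets])  # number of files per task
```

With this scheme, every file is handled by exactly one task and the tasks get equal shares (5 books each for 100 books and 20 tasks).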

We can then see there are 20 outputs:

$ ls ngrams-output/
ngrams3-words-array_0.out   ngrams3-words-array_14.out  ngrams3-words-array_2.out   ngrams3-words-array_7.out
ngrams3-words-array_1.out   ngrams3-words-array_15.out  ngrams3-words-array_20.out  ngrams3-words-array_8.out
...

$ head -5 ngrams-output/ngrams3-words-array_0.out
521 ["i", "don", "t"]
189 ["don", "t", "know"]
166 ["one", "of", "the"]
156 ["it", "was", "a"]
153 ["you", "don", "t"]

We can then combine the individual output files into one:

$ srun --mem=6G --time=00:10:00 python3 ngrams/combine-counts.py ngrams-output/ngrams3-words-array_* -o ngrams-output/ngrams3-words.out

This file has ngrams from all of them:

$ head -5 ngrams-output/ngrams3-words.out
["i", "don", "t"]
["one", "of", "the"]
["it", "was", "a"]
["out", "of", "the"]
["there", "was", "a"]
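Conceptually, combining is just summing the per-task counts. The sketch below is not combine-counts.py itself: it assumes each input line has the form COUNT JSON-LIST, as in the per-task output shown earlier, and the function names and the second task's numbers are made up for illustration.

```python
import json
from collections import Counter


def parse_counts(lines):
    """Parse lines such as '521 ["i", "don", "t"]' into a Counter."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        count, ngram = line.split(" ", 1)
        counts[tuple(json.loads(ngram))] += int(count)
    return counts


def combine(*partials):
    """Sum several Counters into one."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # update() adds counts rather than replacing them
    return total


# Two tiny per-task results; the task1 numbers are invented for this example.
task0 = parse_counts(['521 ["i", "don", "t"]', '166 ["one", "of", "the"]'])
task1 = parse_counts(['300 ["one", "of", "the"]', '12 ["i", "don", "t"]'])

total = combine(task0, task1)
for ngram, count in total.most_common():
    print(count, json.dumps(list(ngram)))
```

Because the whole combined counter is held in memory at once, this step can need more memory than any single array task, which is consistent with the larger --mem request in the srun command above.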

Shared memory parallelism: multithreading & multiprocessing

In triton/tut/parallel-shared.rst:

Shared memory parallelism-1: Test scaling

Test scaling of the ngrams code. How many processors should be used? (This is involved enough you might want to follow the steps in the solution).

To get good stats, compute with -n 2 --words (word 2-grams), and constrain to a single processor architecture (like the Slurm option --constraint=skl).

Solution

First, let’s run the scaling tests for character ngrams just for comparison. Note we don’t save the output anywhere so that we don’t measure the time of disk writing, and we constrain to the same type of node (processor architecture) to make sure our results are comparable:

## 100 books
$ srun --constraint=skl -c 1 python3 ngrams/count.py -n 2 /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o /dev/null
$ srun --constraint=skl -c 1 python3 ngrams/count-multi.py -n 2 -t auto /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o /dev/null
$ srun --constraint=skl -c 2 python3 ngrams/count-multi.py -n 2 -t auto /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o /dev/null
$ srun --constraint=skl -c 4 python3 ngrams/count-multi.py -n 2 -t auto /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o /dev/null
$ srun --constraint=skl -c 8 python3 ngrams/count-multi.py -n 2 -t auto /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o /dev/null

Summary table for character ngrams: Speedup is (single core time)/(multi-core time). “Total core time” is (time × number of processors) and is a measure of how much computing resources you actually used:

N processors	time	speedup	total core time
single-core version	24.7 s		24.7 s
1	23.8 s	1.04	23.8 s
2	12.5 s	1.98	25.0 s
4	7.2 s	3.4	28.8 s
8	5.0 s	4.9	40.0 s

For these, it seems reasonable to go up to 4 cores, since that’s how far you can go with a reasonable speedup and you aren’t wasting too many resources.
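The derived columns of the table can be recomputed from the measured times. The sketch below uses the character-ngram timings from the table above; the "efficiency" figure (speedup divided by number of processors) is an extra quantity added here and is not part of the original table.

```python
single_core_time = 24.7  # seconds, single-core version
multi_core_times = {1: 23.8, 2: 12.5, 4: 7.2, 8: 5.0}  # processors -> seconds

rows = []
for n_procs, t in multi_core_times.items():
    speedup = single_core_time / t   # (single core time) / (multi-core time)
    core_time = n_procs * t          # time x number of processors
    efficiency = speedup / n_procs   # fraction of ideal (linear) speedup
    rows.append((n_procs, speedup, core_time, efficiency))
    print(f"{n_procs:2d} processors: speedup {speedup:.2f}, "
          f"core time {core_time:.1f} s, efficiency {efficiency:.0%}")
```

Efficiency drops noticeably between 4 and 8 processors, which is another way of seeing why 4 cores is a reasonable stopping point here.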

Second, let’s compute the same thing but with words (--words). We see that the speedup is much worse, and it almost doesn’t make sense to use multi-core at all. This is because word ngrams have much more output data, and the programs spend more time moving that data around than computing anything. The table below covers both 100 and 1000 books; the commands shown are for 1000 books:

## 1000 books:
$ srun --constraint=skl --time=0-1 --mem=20G -c 1 python3 ngrams/count.py -n 2 --words /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip -o /dev/null
$ srun --constraint=skl --time=0-1 --mem=20G -c 1 python3 ngrams/count-multi.py -n 2 --words -t auto /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip -o /dev/null
$ srun --constraint=skl --time=0-1 --mem=20G -c 2 python3 ngrams/count-multi.py -n 2 --words -t auto /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip -o /dev/null
$ srun --constraint=skl --time=0-1 --mem=20G -c 4 python3 ngrams/count-multi.py -n 2 --words -t auto /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip -o /dev/null
$ srun --constraint=skl --time=0-1 --mem=20G -c 8 python3 ngrams/count-multi.py -n 2 --words -t auto /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip -o /dev/null

N processors	time (100 books)	speedup	core time used	time (1000 books)	speedup	core time used
single-core code	22.5 s		22.5 s	170 s		170 s
1	29.6 s	0.76	29.6 s	201 s	0.84	201 s
2	17.7 s	1.27	35.4 s	116 s	1.5	232 s
4	14.4 s	1.56	57.6 s	82 s	2.1	328 s
8	12.1 s	1.86	96.8 s	79 s	2.2	632 s

For word ngrams, it doesn’t even seem justifiable to use two processes, because the speedup there is at most about 1.5. In this case, the problem is that the code isn’t very efficient and spends too much time passing data around. An array job avoids this: every array task writes its output separately, and then one single-core job accumulates the counts. Better yet would be to rewrite the code to remove this inefficiency.

From our experience, some of the main slow points of the code are:

Reading all the individual files (even from the zip file) is slow.
Writing the results in plain text + JSON is slow: using a binary format of some form would be better.
The Python multiprocessing module has to inefficiently move data between processes. Since this is a heavily data-based computation (and it produces a large number of ngrams), it is slow.
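To get a feel for the last point: multiprocessing serializes objects with pickle to pass them between processes, so the pickled size of a result is a rough proxy for how much data each worker has to send back. The sketch below compares character and word 2-grams on synthetic text. The vocabulary and text are randomly generated stand-ins (real books have a very different word distribution), so the numbers only illustrate the trend and are not measurements of the actual code.

```python
import pickle
import random
from collections import Counter


def ngrams(items, n):
    """Count runs of n consecutive items (characters of a str, or words of a list)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


# Synthetic stand-in for a book: 20,000 words drawn from a made-up vocabulary.
rng = random.Random(0)
letters = "abcdefghijklmnopqrstuvwxyz"
vocabulary = ["".join(rng.choice(letters) for _ in range(5)) for _ in range(2000)]
words = [rng.choice(vocabulary) for _ in range(20000)]
text = " ".join(words)

char_counts = ngrams(text, 2)   # at most 27 * 27 distinct pairs (letters + space)
word_counts = ngrams(words, 2)  # most word pairs are distinct

# Pickled size approximates what a worker process would have to ship back.
char_bytes = len(pickle.dumps(char_counts))
word_bytes = len(pickle.dumps(word_counts))
print(f"character 2-grams: {len(char_counts)} distinct, {char_bytes} bytes pickled")
print(f"word 2-grams:      {len(word_counts)} distinct, {word_bytes} bytes pickled")
```

The character counter is bounded by the size of the alphabet, while the word counter grows with the amount of text, which is why the word case spends so much more time on data movement.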