Ngrams exercises

This series of exercises computes word or character ngrams from fiction books in Project Gutenberg. This page collects all of the ngrams exercises in one place, with links to their original locations. You will need to refer to the main pages to figure out how to do the exercises. The code is on GitHub.

The general outline:

  • Look at data storage and the dataset that is already on the cluster

  • Copy over your own personal copy of the data (make a duplicate dataset)

  • Copy over the code (clone a Git repository)

  • Run the code on the cluster’s login node

  • Run the code on the cluster itself

  • Run the code in parallel

Using the cluster from a command line

In triton/tut/cluster-shell.rst:

Shell-2: Look at the Ngrams dataset

(Part of a series: ngrams)

First, let’s look at the Gutenberg-Fiction dataset. There is already a copy on the cluster (this is different from the personal copy you will make later).

Use the command line to look at this directory with some books in it: /scratch/shareddata/teaching/gutenberg-fiction/. Try to find:

  • What files are they?

  • How big are they?

In triton/tut/cluster-shell.rst:

Shell-4: Clone the hpc-examples repository

(Part of a series: pi, ngrams)

Do the steps above to clone the hpc-examples repository. List the directory from the command line and verify that it matches what you see on the GitHub repository page.

Is your home directory the right place to store a cloned git repository?

In triton/tut/cluster-shell.rst:

Shell-9: Practice looking at README files

(Part of a series: ngrams)

“README” files are simple text files that document a dataset or project, and they are good to include with your data. When you get new data, it’s good to look at the README files to get oriented. Check out the README files within the Gutenberg-Fiction dataset located at /scratch/shareddata/teaching/gutenberg-fiction/ (again: you’ll download your own copy of this later).

Data storage

In triton/tut/storage.rst:

Storage-1: Create a place to store our Gutenberg (ngrams) data

We looked at the Gutenberg-Fiction dataset before, where it’s already on Triton. In the next few exercises, let’s pretend we don’t already have it on the cluster, and practice copying the data to the cluster ourselves.

Background: we will do a recurring example using public domain Project Gutenberg books. We will compute ngrams: tuples of words that occur in a sequence (for example, (this, book, is) is a 3-gram). After computing a list of all n-grams and how often they occur, we can understand something about the books (and sometimes generate some text).
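To make the definition concrete, here is a minimal sketch of counting word 3-grams in a short made-up sentence. The function name `ngrams` and the example sentence are only for illustration; this is not the code from the hpc-examples repository.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# A made-up sentence, already split into words.
words = "this book is a book and this book is short".split()

# Count how often each 3-gram occurs.
counts = Counter(ngrams(words, 3))

# Print the most frequent 3-gram and its count.
gram, count = counts.most_common(1)[0]
print(count, gram)
```

A sequence of k tokens contains k - n + 1 n-grams, so the table of distinct n-grams can grow quickly as the number of books increases.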

In the next step you will use data that is a 2.6 GB zipfile (6.3 GB uncompressed). Where would you store this data?

Remote access to data

In triton/tut/remotedata.rst:

RemoteData-1: Copy the ngrams data over to the cluster

Download one of the following archives and upload it to Triton. The data is the same, just different numbers of books, so choose a file small enough for your internet connection to be happy:

Then, upload the data to Triton from your computer, into the location decided in the previous step (some project directory within your work directory).

Interactive jobs

In triton/tut/interactive.rst:

Interactive-2: Compute ngrams via batch jobs

Let’s compute some ngrams. This uses the code from the hpc-examples repository and the data we have transferred, even though they are in two different directories. We want to go to the code directory and point the code to the data directory.

This and following examples use the data that is already downloaded to the cluster, stored under /scratch/shareddata/teaching/gutenberg-fiction/ (so that the examples will just work, without having to do previous steps). You can also give it the path to the copy of the data you downloaded.

Some things about this code: if you run it with --words, it computes word ngrams ([“the”, “lake”]); otherwise, it computes character ngrams ([“t”, “h”]). The option -n specifies the n in ngrams (like -n 2; the default is 1-grams, which are simple character or word frequencies). The -o option says where the output file should be saved; without it, the output is printed to the screen. The --help option tells you more, or you can check the code on GitHub.
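The difference between word and character ngrams can be sketched in a few lines of Python. This is an illustration only, not the implementation in ngrams/count.py: the function `count_ngrams`, its `words` parameter, and the tokenization rule (lowercase the text, treat runs of letters as words) are assumptions made for this example.

```python
import re
from collections import Counter


def count_ngrams(text, n=1, words=False):
    """Count n-grams of words (words=True) or of single characters."""
    text = text.lower()
    # Assumed tokenization: a word is a run of ASCII letters.
    units = re.findall(r"[a-z]+", text) if words else list(text)
    # Zip n shifted copies of the sequence to get sliding windows.
    return Counter(zip(*(units[i:] for i in range(n))))


text = "The lake, the lake"
word_counts = count_ngrams(text, n=2, words=True)
char_counts = count_ngrams(text, n=2)

print(word_counts.most_common(1))
print(char_counts.most_common(1))
```

With n=1 the same function reduces to plain word or character frequencies, which matches the default described above.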

First, we try running on the login node. This is just a quick test to make sure that nothing is really wrong. We don’t want to do real computing here. The commands below save the output to a file named ngrams2, which you might want to change between examples:

$ cd hpc-examples
$ python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o ngrams2 -n 2

Now we do the same, but with srun to run on the cluster:

## character ngrams
$ srun python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o ngrams2 -n 2

## word ngrams
$ srun python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o ngrams2 -n 2 --words

If we compute word ngrams for the 1000-book dataset, the job runs out of memory. We then try again with the --mem=5G option and see that it works.

$ srun python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip --words -o ngrams2 -n 2
srun: slurm_job_submit: Automatically setting partition to: batch-hsw,batch-bdw,batch-csl,batch-skl,batch-milan
srun: job 5766825 queued and waiting for resources
srun: job 5766825 has been allocated resources
Found 1000 files in /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip
slurmstepd: error: Detected 1 oom_kill event in StepId=5766825.0. Some of the step tasks have been OOM Killed.

$ srun --mem=5G --time=0-1 python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip --words -o ngrams2 -n 2

Serial Jobs

In triton/tut/serial.rst:

Serial-2: Compute ngrams via a batch job

Create a batch job that computes our ngrams.

Array jobs: embarrassingly parallel execution

In triton/tut/array.rst:

Array-1: Compute ngrams via an array job

Computing the n-grams over the whole Gutenberg-Fiction dataset could take a while. Using array jobs can make this calculation faster by splitting the calculation across multiple array tasks where each task does the n-gram calculation for a subset of books in the dataset.

Before jumping to the full dataset, let’s start with the smaller subset of the first 100 books and test our code with that.

Type along with the following:

The following batch job computes 3-grams on the dataset in 20 batches and saves each batch to its own file. (The \ at the end of a line continues the command on the following line.)

count-3grams-array.sh:

#!/bin/bash
#SBATCH --mem=2G
#SBATCH --array=0-19
#SBATCH --time=00:10:00

mkdir -p ngrams-output

python3 ngrams/count.py /scratch/work/$USER/gutenberg-fiction/Gutenberg-Fiction-first100.zip \
  -n 3 --words \
  --start=$SLURM_ARRAY_TASK_ID --step=20 \
  --output=ngrams-output/ngrams3-words-array_$SLURM_ARRAY_TASK_ID.out

Submit the script with sbatch count-3grams-array.sh. It will run very fast.
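One way to picture how --start and --step could divide the work is a strided slice: task k takes books k, k+20, k+40, and so on. The sketch below assumes that interpretation; the helper `select_for_task` and the file names are hypothetical, so check ngrams/count.py --help for the actual behavior.

```python
def select_for_task(items, start, step):
    """Every step-th item, beginning at index start (assumed semantics)."""
    return items[start::step]


# Hypothetical file names standing in for the 100 books in the archive.
books = [f"book{i:03d}.txt" for i in range(100)]

step = 20  # one slice per array task, matching --array=0-19
parts = [select_for_task(books, task_id, step) for task_id in range(step)]

# Show the books that task 0 would process.
print(parts[0])
```

Under this assumption each of the 20 tasks processes 5 of the 100 books, and every book is handled by exactly one task.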

We can then see there are 20 outputs:

$ ls ngrams-output/
ngrams3-words-array_0.out   ngrams3-words-array_14.out  ngrams3-words-array_2.out   ngrams3-words-array_7.out
ngrams3-words-array_1.out   ngrams3-words-array_15.out  ngrams3-words-array_20.out  ngrams3-words-array_8.out
...

$ head -5 ngrams-output/ngrams3-words-array_0.out
521 ["i", "don", "t"]
189 ["don", "t", "know"]
166 ["one", "of", "the"]
156 ["it", "was", "a"]
153 ["you", "don", "t"]

We can then combine the individual output files to one:

$ srun --mem=6G --time=00:10:00 python3 ngrams/combine-counts.py ngrams-output/ngrams3-words-array_* -o ngrams-output/ngrams3-words.out

The combined file has the ngram counts from all of the array tasks:

$ head -5 ngrams-output/ngrams3-words.out
5265 ["i", "don", "t"]
2848 ["one", "of", "the"]
2535 ["it", "was", "a"]
2428 ["out", "of", "the"]
2389 ["there", "was", "a"]
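Combining the outputs amounts to summing the counts of identical ngrams across files. The following is a minimal sketch of that idea, assuming each line has the form shown above (a count, a space, then a JSON list). It is not the repository's combine-counts.py; `parse_counts` is a hypothetical helper, and the lines in `task1` are made up for illustration.

```python
import json
from collections import Counter


def parse_counts(lines):
    """Parse lines of the form '<count> <JSON list>' into a Counter."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        count, gram = line.split(" ", 1)
        counts[tuple(json.loads(gram))] += int(count)
    return counts


# task0 reuses two lines from the sample output; task1 is invented.
task0 = ['521 ["i", "don", "t"]', '166 ["one", "of", "the"]']
task1 = ['300 ["i", "don", "t"]', '12 ["out", "of", "the"]']

# Adding Counters sums the counts of matching n-grams.
total = parse_counts(task0) + parse_counts(task1)

for gram, count in total.most_common():
    print(count, json.dumps(list(gram)))
```

Because the combined table holds the distinct ngrams of all tasks at once, the combining step can need more memory than any single array task.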

Shared memory parallelism: multithreading & multiprocessing

In triton/tut/parallel-shared.rst:

Shared memory parallelism 1: Test scaling

Test scaling of the ngrams code. How many processors should be used? (This is involved enough that you might want to follow the steps in the solution.)

To get good stats, compute with -n 2 --words (word 2-grams), and constrain to a single processor architecture (like the Slurm option --constraint=skl).
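Once you have wall-clock times for several processor counts, a common way to judge scaling is to compute the speedup relative to one processor and the parallel efficiency (speedup divided by processor count). The sketch below shows that arithmetic; the function `scaling_table` is a hypothetical helper and the timings are invented placeholders, not measurements of the ngrams code.

```python
def scaling_table(timings):
    """Return (cpus, speedup, efficiency) rows from {cpus: seconds}."""
    base = timings[1]  # single-processor time is the reference
    rows = []
    for cpus in sorted(timings):
        speedup = base / timings[cpus]
        rows.append((cpus, speedup, speedup / cpus))
    return rows


# Hypothetical wall-clock times in seconds; replace with your own runs.
timings = {1: 120.0, 2: 66.0, 4: 40.0, 8: 32.0}

rows = scaling_table(timings)
for cpus, speedup, efficiency in rows:
    print(f"{cpus:>2} cpus  speedup {speedup:.2f}  efficiency {efficiency:.2f}")
```

When the efficiency drops well below 1, adding more processors mostly wastes resources, which is one way to answer the question of how many processors to use.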