Ngrams exercises
This series of exercises computes word or character ngrams based on fiction books in Project Gutenberg. This page has all of the exercises about ngrams inserted here, with links to their original locations. You will have to refer to the main pages to figure out how to do the exercises. The code is on GitHub.
The general outline:
Look at data storage and the dataset that is already on the cluster
Copy over your own personal copy of the data (make a duplicate dataset)
Copy over the code (clone a Git repository)
Run the code on the cluster’s login node
Run the code on the cluster itself
Run the code in parallel
Using the cluster from a command line
In triton/tut/cluster-shell.rst:
Shell-2: Look at the Ngrams dataset
(Part of a series: ngrams)
First, let’s look at the Gutenberg-Fiction dataset. There is already a copy on the cluster (this is separate from the personal copy you will make later).
Use the command line to look at this directory with some books in
it: /scratch/shareddata/teaching/gutenberg-fiction/. Try to
find:
What files are they?
How big are they?
Solution
$ ls /scratch/shareddata/teaching/gutenberg-fiction/
$ du -sh /scratch/shareddata/teaching/gutenberg-fiction/*
In triton/tut/cluster-shell.rst:
Shell-4: Clone the hpc-examples repository
(Part of a series: pi, ngrams)
Do the steps above to clone the hpc-examples repository. List the directory from the command line and verify that it matches what you see on the GitHub repository page.
Is your home directory the right place to store a cloned git repository?
Solution
The steps are listed above. You also can check that everything is correct with
git status. Output should be something like this:
$ ls
io/ mpi/ postgres/ R/ scip/ gpu/ misc/ openmp/ python/ README.rst slurm/
$ git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
Normally, large projects you are working on should be in your work directory. This is small enough we can ignore that for now (and make our exercises work on different clusters).
In triton/tut/cluster-shell.rst:
Shell-9: Practice looking at README files
(Part of a series: ngrams)
“README” files are simple text file documentation, which is good
to include with your data. Once you get new data, it’s good to
look at README files to get oriented. Check out the README files
within the Gutenberg Fiction dataset located at
/scratch/shareddata/teaching/gutenberg-fiction/ (again: you’ll
download your own copy of this later).
Solution
We haven’t given you the exact file paths, so you need to explore
a bit with ls to find them. (You can also push TAB twice to
tab-complete, which lists what is in a directory. Then, type a
few characters and push TAB again and it’ll fill in the path for
you. This is a fast way to explore directories.)
$ ls /scratch/shareddata/teaching/gutenberg-fiction/
$ ls /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction/
$ less /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction/README.md
$ less /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100/README
Data storage
Storage-1: Create a place to store our Gutenberg (ngrams) data.
We looked at the Gutenberg-Fiction dataset before, where it’s already on Triton. In the next few exercises, let’s pretend we didn’t already have it on the cluster, and practice copying the data to the cluster yourself.
Background: we will do a recurring example using public domain Project
Gutenberg books. We will compute ngrams, which are tuples of words that
occur in sequence (for example, (this, book, is) is a
3-gram). After computing a list of all n-grams and how often they
occur, we can understand something about the books (and sometimes
generate some text).
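As a small illustration of the definition above, the following sketch counts word 3-grams in a single sentence. The function name word_ngrams and the lowercase, letters-only tokenization are assumptions made for this example; the hpc-examples code may split words differently.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Return a Counter of word n-grams (as tuples) found in text."""
    # Assumption: a "word" is a run of letters, compared case-insensitively.
    words = re.findall(r"[a-z]+", text.lower())
    # Slide a window of length n over the word list.
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


counts = word_ngrams("This book is short. This book is old.", 3)
for ngram, count in counts.most_common(3):
    print(count, ngram)
```

Here (this, book, is) occurs twice and every other 3-gram occurs once.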
In the next step you will use data that is a 2.6 GB zip file (6.3 GB uncompressed). Where would you store this data?
Solution
6.3 GB is small enough to fit in your home directory, but would use up most of the space.
Also, the data is downloadable from a public archive, so you don’t need to worry about backups.
You aren’t currently working with other people.
So your personal Triton work directory ($WRKDIR,
/scratch/work/USER/) seems appropriate. You would make a
subdirectory within here:
$ mkdir $WRKDIR/gutenberg-fiction/
Since it’s a common dataset, it also makes sense to make one
copy for everyone. We have done this in
/scratch/shareddata/:
/scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction.zip
/scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip
/scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip
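When deciding where an archive belongs, it can help to know its uncompressed size without extracting it. The sketch below reads the sizes recorded in a zip file's index using Python's standard zipfile module. The helper name archive_sizes is hypothetical, and the demo runs on a tiny in-memory archive rather than the real dataset.

```python
import io
import zipfile


def archive_sizes(zip_file):
    """Return (compressed_bytes, uncompressed_bytes) summed over all members."""
    with zipfile.ZipFile(zip_file) as archive:
        infos = archive.infolist()
        compressed = sum(info.compress_size for info in infos)
        uncompressed = sum(info.file_size for info in infos)
    return compressed, uncompressed


# Demo on a small in-memory archive; on the cluster you would instead pass
# the path to one of the Gutenberg-Fiction zip files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("book1.txt", "the lake " * 1000)
    archive.writestr("book2.txt", "one of the " * 1000)

compressed, uncompressed = archive_sizes(buffer)
print(f"compressed: {compressed} bytes, uncompressed: {uncompressed} bytes")
```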
Remote access to data
RemoteData-1: Copy the ngrams data over to the cluster
Download one of the following archives and upload it to Triton. The data is the same, just different numbers of books, so choose a file small enough for your internet connection to be happy:
19 MB repack of the original: https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip (recommended)
152 MB repack of the original: https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first1000.zip
Full 2.6 GB (original): https://zenodo.org/records/5783256
Then, upload the data to Triton from your computer, into the location decided in the previous step (some project directory within your work directory.)
Interactive jobs
In triton/tut/interactive.rst:
Interactive-2: Compute ngrams via batch jobs
Let’s compute some ngrams. This uses the code from the hpc-examples repository AND the data we have transferred, even though they are in two different directories. We want to go to the code directory and point the code to the data directory.
This and following examples use the data that is already downloaded
to the cluster, stored under
/scratch/shareddata/teaching/gutenberg-fiction/ (so that the
examples will just work, without having to do previous steps). You
can also give it the path to the copy of the data you downloaded.
Some things about this code: If you run with --words, it
computes word-ngrams ([“the”, “lake”]). Otherwise, it computes
character ngrams ([“t”, “h”]). The option -n specifies the n
in ngrams (like -n 2; the default is 1-gram which is simple
character/word frequencies). The -o option says where an
output file should be saved, otherwise it prints it to the screen.
The --help option tells you more, or you can check the code on GitHub.
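To make the character mode concrete, here is a minimal sketch of counting character n-grams directly from the text files inside a zip archive. It is not the actual count.py: the function name count_char_ngrams, the lowercasing, and the choice to count spaces as characters are all assumptions for illustration.

```python
import io
import zipfile
from collections import Counter


def count_char_ngrams(zip_file, n):
    """Count character n-grams over every member of a zip archive."""
    counts = Counter()
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            # Assumption: members are UTF-8 text; undecodable bytes are replaced.
            text = archive.read(name).decode("utf-8", errors="replace").lower()
            # Every substring of length n is one character n-gram.
            counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return counts


# Build a tiny in-memory archive standing in for the real dataset.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "the lake")
    archive.writestr("b.txt", "the hat")

bigrams = count_char_ngrams(buffer, 2)    # comparable in spirit to -n 2
unigrams = count_char_ngrams(buffer, 1)   # plain character frequencies
print(bigrams.most_common(2))
```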
First, we try running on the login node. This is just a quick test
to make sure that nothing is really wrong. We don’t want to do
real computing here. These commands save the output to a file named
ngrams2, which you might want to change between examples:
$ cd hpc-examples
$ python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o ngrams2 -n 2
Now we do the same, but with srun to run on the cluster:
## character ngrams
$ srun python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o ngrams2 -n 2
## word ngrams
$ srun python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o ngrams2 -n 2 --words
If we compute word-ngrams for the 1000-book dataset, the job is
killed because it runs out of memory (the oom_kill error below). We
then try again with the --mem=5G option and see that it works
(--time=0-1 requests a time limit of one hour).
$ srun python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip --words -o ngrams2 -n 2
srun: slurm_job_submit: Automatically setting partition to: batch-hsw,batch-bdw,batch-csl,batch-skl,batch-milan
srun: job 5766825 queued and waiting for resources
srun: job 5766825 has been allocated resources
Found 1000 files in /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip
slurmstepd: error: Detected 1 oom_kill event in StepId=5766825.0. Some of the step tasks have been OOM Killed.
$ srun --mem=5G --time=0-1 python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first1000.zip --words -o ngrams2 -n 2
Serial Jobs
Serial-2: Compute ngrams via a batch job
Create a batch job that computes our ngrams.
Solution
Create a script file ngrams.sh with the nano editor:
$ nano ngrams.sh
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --mem=1G
srun python3 ngrams/count.py /scratch/shareddata/teaching/gutenberg-fiction/Gutenberg-Fiction-first100.zip -o ngrams2 -n 2
We submit with:
$ sbatch ngrams.sh
Array jobs: embarrassingly parallel execution
Array-1: Compute ngrams via an array job
Computing the n-grams over the whole Gutenberg-Fiction dataset could take a while. Using array jobs can make this calculation faster by splitting the calculation across multiple array tasks where each task does the n-gram calculation for a subset of books in the dataset.
Before jumping to the full dataset, let’s start with the smaller subset of the first 100 books and test our code with that.
Type along with the following:
The following batch job computes 3-grams on the dataset in 20
batches and saves each batch to its own file (the \ at the end
of a line continues the command on the following line).
count-3grams-array.sh:
#!/bin/bash
#SBATCH --mem=2G
#SBATCH --array=0-19
#SBATCH --time=00:10:00
mkdir -p ngrams-output
python3 ngrams/count.py /scratch/work/$USER/gutenberg-fiction/Gutenberg-Fiction-first100.zip \
-n 3 --words \
--start=$SLURM_ARRAY_TASK_ID --step=20 \
--output=ngrams-output/ngrams3-words-array_$SLURM_ARRAY_TASK_ID.out
Submit the script with sbatch count-3grams-array.sh. It will run very fast.
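One plausible reading of the --start and --step options (an assumption; check --help or the source to confirm) is that each task takes every 20th book, beginning at its own array index. The sketch below shows why such a split gives each task a disjoint subset while all tasks together cover the whole dataset. The function select_files and the book file names are hypothetical.

```python
def select_files(files, start, step):
    """Return every step-th file, beginning at index start."""
    return files[start::step]


# Stand-in for the 100 books in the small dataset.
books = [f"book{i:03d}.txt" for i in range(100)]
step = 20

# One subset per array task ID (0..19), as SLURM_ARRAY_TASK_ID would supply.
subsets = [select_files(books, task_id, step) for task_id in range(step)]

print(subsets[0])
print(subsets[19])
```

With 100 books and 20 tasks, each task handles 5 books, and no book is processed twice.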
We can then see there are 20 outputs:
$ ls ngrams-output/
ngrams3-words-array_0.out ngrams3-words-array_13.out ngrams3-words-array_18.out ngrams3-words-array_5.out
ngrams3-words-array_1.out ngrams3-words-array_14.out ngrams3-words-array_19.out ngrams3-words-array_6.out
...
$ head -5 ngrams-output/ngrams3-words-array_0.out
521 ["i", "don", "t"]
189 ["don", "t", "know"]
166 ["one", "of", "the"]
156 ["it", "was", "a"]
153 ["you", "don", "t"]
We can then combine the individual output files to one:
$ srun --mem=6G --time=00:10:00 python3 ngrams/combine-counts.py ngrams-output/ngrams3-words-array_* -o ngrams-output/ngrams3-words.out
The combined file has the n-gram counts summed over all of the array tasks:
$ head -5 ngrams-output/ngrams3-words.out
5265 ["i", "don", "t"]
2848 ["one", "of", "the"]
2535 ["it", "was", "a"]
2428 ["out", "of", "the"]
2389 ["there", "was", "a"]
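The combining step amounts to summing the counts of identical n-grams across the per-task files. The following sketch shows that idea for the output format seen above (a count followed by a JSON list). It is not the actual combine-counts.py: the function parse_counts is hypothetical, and the second batch of sample lines is invented for illustration.

```python
import json
from collections import Counter


def parse_counts(lines):
    """Parse lines of the form '<count> <JSON list>' into a Counter."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split only on the first space; the JSON list itself contains spaces.
        count, ngram = line.split(" ", 1)
        counts[tuple(json.loads(ngram))] += int(count)
    return counts


# Lines as two array tasks might have written them (task1 values are made up).
task0 = ['521 ["i", "don", "t"]', '166 ["one", "of", "the"]']
task1 = ['480 ["i", "don", "t"]', '12 ["out", "of", "the"]']

# Adding Counters sums the counts of n-grams that appear in both.
combined = parse_counts(task0) + parse_counts(task1)
for ngram, count in combined.most_common():
    print(count, json.dumps(list(ngram)))
```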