In language parallelization

In language parallelization assumes, that either your code (or a function you call) already knows how to parallelize its work, or that you want to use tools from within the language to run parts of your code in parallel.

We will handle both instances in this example. First, we will use functions from the language that already do parallelisation to improve runtime efficiency of an existing code. This is dependent on the actual code you use and you have to check the documentation to see, whether the libraries, packages or toolboxes you use offer this option. After this, we will show you simple ways to use general in language parallelization options and point you to the relevant documentation for your language.

Important notes

There are multiple ways to load python and python modules on triton. We strongly recommend that you have a look at the application page for python which gives a lot of information on different environments and best practices.

Important notes

When using R it is often important to specify the R version used to run. There are multiple versions of R installed on triton. to get a list use:
module spider r
then load your preferred version using:
module load r/version_of_your_choice
In addition, when using libraries, you should install them manually before running your script, as R commonly requests user confirmation before installing packages from source, and thus can get stuck if the package is not already installed. That is run an interactive job ( sinteractive , load your selected R version and install the package):
module load r/version_of_your_choice
R
> install.packages('packagename')
This will guide you to selecting a download mirror and offer you the option to install the packages in your home directory. This normally works fine, but if you need to install a lot of packages, you might run out of home-folder quota. In this case move your package directory to your work directory and replace the R directory with a symlink that points to the new location of your R package directory:
mv ~/R $WRKDIR/R
ln -s $WRKDIR/R ~/R

Important notes

Matlab writes session data, compiled code and additional toolboxes to ~/.matlab. This can quicky fill up your $HOME quota. To fix this we recommend that you replace the folder with a symlink that points to a directory in your working directory.

rsync -lrt ~/.matlab/ $WRKDIR/matlab-config/ && rm -r ~/.matlab
ln -sT $WRKDIR/matlab-config ~/.matlab
quotafix -gs --fix $WRKDIR/matlab-config

If you run parallel code in matlab, keep in mind, that matlab uses your home folder as storage for the worker files, so if you run multiple jobs you have to keep the worker folders seperate To address this, you need to specify the worker location ( the JobStorageLocation field of the parallel cluster) to a location unique to the job

% Initialize the parallel pool
c=parcluster();

% Create a temporary folder for the workers working on this job,
% in order not to conflict with other jobs.
t=tempname();
mkdir(t);

% set the worker storage location of the cluster
c.JobStorageLocation=t;

To address the latter, the number of parallel workers needs to explicitly be provided when initializing the parallel pool:

% get the number of workers based on the available CPUS from SLURM
num_workers = str2double(getenv('SLURM_CPUS_PER_TASK'));

% start the parallel pool
parpool(c,num_workers);

Here we provide a small script, that does all those steps for you.

Using existing parallelisation

To allow code to access multiple cores, we need the -c or --cpus_per_task parameter in slurm. A typical slurm script for code that is parallelized looks as follows:

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=500M
#SBATCH --cpus-per-task=4
#SBATCH --output=ParallelOut

module load anaconda # use the normal anaconda environment for python
srun python parallel_fun.py

#!/bin/bash
#SBATCH --time=00:5:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=500M
#SBATCH --output=r_parallel.out

# Set the number of OpenMP-threads to 1,
# as we're using parallel for parallelization
export OMP_NUM_THREADS=1

# Load the version of R you want to use
module load r/4.0.3-python3

# Run your R script
srun Rscript parallel_fun.R

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=500M
#SBATCH --cpus-per-task=4
#SBATCH --output=ParallelOut

module load matlab

srun matlab_multithread -nodisplay -r parallel_fun

To make use of this code, you need to provide the respective functions with the relevant information.

import torch 
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
from torch.utils.data import DataLoader
import os

maxprocesses = int(os.getenv('SLURM_CPUS_PER_TASK'))

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

train_dataloader = DataLoader(training_data, num_workers=maxprocesses, batch_size=64, shuffle=True)

for train_feature, train_label in train_dataloader:    
    print(train_label)

# Adapted from caret-parallel-train.R in Caret examples
# to Triton by Simo Tuomisto, 2017
# further adapted by Thomas Pfau, 2021

# Load relevant libraries
library(caret); 
library(doParallel)

# load data
data(BloodBrain); 

# get the number of available cores
cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK"))

# start the parallel cluster
cl <- makeCluster(cores)
registerDoParallel(cl) 

# train the data
fit1 <- train(bbbDescr, logBBB, "knn")

# clean up
stopCluster(cl)
registerDoSEQ()

%First, we will initialize the parallel pool using our helper function.
initParPool()

% This is derived from the ga example from the mathworks website.
% Create a function to optimize
ras = @(x, y) 20 + x.^2 + y.^2 - 10*(cos(2*pi*x) + cos(2*pi*y));

% Create optimization variables x and y. Specify that the variables are bounded by .
x = optimvar("x","LowerBound",-100,"UpperBound",100);
y = optimvar("y","LowerBound",-100,"UpperBound",100);

% Create an optimization problem with the created function
prob = optimproblem("Objective",ras(x,y));

% create the options set
options1 = optimoptions("ga","PlotFcn","gaplotbestf");

rng default % For reproducibility
tic
% Solve the problem using ga as the solver.
[sol,fval] = solve(prob,"Solver","ga","Options",options1)
toc
% create new options with useParallel = true
options2 = optimoptions("ga","PlotFcn","gaplotbestf","UseParallel",true);
rng default % For reproducibility
tic
% Solve the problem using ga as the solver.
[sol2,fval2] = solve(prob,"Solver","ga","Options",options2)
toc
exit(0)

Now your code can run on the specified number of cores, speeding up the individual computation.

Parallelizing your code

Parallel processing lets you run independent similar processes simultaneously. This allows you to make use of multiple processors, potentially reducing run-times by a factor up to the number of processors used, for the parallelized parts. However, keep in mind, that parallelization also creates an overhead. So if you have very fast methods, and don’t need to repeat them often trying to parallelize them might actually make our code slower.

Lets assume we have a function, that inverts a set of matrices, calculates means of the resulting marices, always leaving one inverted matrix out and then inverts the resulting mean matrix again.

The code looks as follows:

import multiprocessing
import numpy as np
import os
from functools import partial

def invertPart(index,matrix):  
    return np.linalg.inv(matrix[:,:,index])

def means(index, matrix):
    return np.linalg.inv(np.mean(np.delete(matrix,index,0),0))
  

def main():
    # Create random set of matrices
    randMat = np.random.rand(1000,1000,6)
    invMatrices = []
    # invert the matrices
    for i in range(6):
    	invMatrices.append(invertPart(i,randMat),)
    squaredRes = []
    # calc mean and invert again
    for i in range(6):
	squaredRes = means(i,invMatrices)   
    
    print(squaredRes)

main()    

library(pracma)

ar <- array(runif(1000*1000*6), c(1000, 1000, 6));  

invertRandom <- function(index,cmat) {
  A<-cmat[,,index];
  B<-inv(t(A));
  return(B);
}

invertMeanLOO <- function(index,cmat) {
  A<-cmat[,,-index];
  A<-apply(A,c(1,2),mean)
  B<-inv(t(A));
  return(B);
}

res = lapply(1:6,invertRandom, cmat=ar)
res = array(unlist(res),c(1000,1000,6))
res = lapply(1:6,invertMeanLOO, cmat=res)

% Create matrices to invert
mat = rand(1000,1000,6);
for i=1:size(mat,3)
    invMats(:,:,i) = inv(mat(:,:,i));
end
% And now, we proceed to build the averages of each set of inverted matrices
% each time leaving out one.
for i=1:size(invMats,3)
    usedelements = true(size(invMats,3),1);
    usedelements(i) = false;
    res(:,:,i) = inv(mean(invMats(:,:,usedelements),3));
end
% end the program
exit(0)

We can easily parallelize the following comparatively expensive steps: 1. The first matrix inversions 2. The second matrix inversions (along with the mean calculation)

Lets start with the required slurm script. Here, we will request 4 cpus, along with 500Mb of memory:

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=500M
#SBATCH --cpus-per-task=4
#SBATCH --output=ParallelOut

module load anaconda # use the normal anaconda environment for python
srun python parallel.py

#!/bin/bash
#SBATCH --time=00:5:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=500M
#SBATCH --output=r_parallel.out

# Set the number of OpenMP-threads to 1,
# as we're using parallel for parallelization
export OMP_NUM_THREADS=1

# Load the version of R you want to use
module load r

# Run your R script
srun Rscript parallel.r

#!/bin/bash 
#SBATCH --time=00:15:00
#SBATCH --mem=500M
#SBATCH --cpus-per-task=4
#SBATCH --output=matlab_parallel.out

module load matlab

srun matlab_multithread -nodisplay -r parallel

Then, we need to modify this code to run in parallel.

import multiprocessing
import numpy as np
import os
from functools import partial

def invertPart(index,matrix):  
    return np.linalg.inv(matrix[:,:,index])

def means(index, matrix):
    return np.linalg.inv(np.mean(np.delete(matrix,index,0),0))
  

def main():
    # Create random set of matrices
    randMat = np.random.rand(1000,1000,6)
    # get the number of available CPUs from the SLURM environment, need to cast from string to int
    maxprocesses = int(os.getenv('SLURM_CPUS_PER_TASK'))
    # start the respective Pool
    p = multiprocessing.Pool(processes=maxprocesses)    
    # this reshapes the dimensions of the matrix, as it generates an array of the results, i.e. the 3rd dimension gets is now the first.
    invMatrices = p.map(partial(invertPart,matrix=randMat),range(6))
    
    squaredRes = p.map(partial(means, matrix=invMatrices), range(6))    
    
    print(squaredRes)
    
main()

library(pracma)
library(parallel)

ar <- array(runif(1000*1000*6), c(1000, 1000, 6));  

invertRandom <- function(index,cmat) {
  A<-cmat[,,index];
  B<-inv(t(A));
  return(B);
}

invertMeanLOO <- function(index,cmat) {
  A<-cmat[,,-index];
  A<-apply(A,c(1,2),mean)
  B<-inv(t(A));
  return(B);
}

res = mclapply(1:6,invertRandom, cmat=ar,mc.cores=Sys.getenv('SLURM_CPUS_PER_TASK'))
res = array(unlist(res),c(1000,1000,6))
res = mclapply(1:6,invertMeanLOO, cmat=res,mc.cores=Sys.getenv('SLURM_CPUS_PER_TASK'))

initParPool()
% Create matrices to invert
mat = rand(1000,1000,6);

parfor i=1:size(mat,3)
    invMats(:,:,i) = inv(mat(:,:,i))
end
% And now, we proceed to build the averages of each set of inverted matrices
% each time leaving out one.

parfor i=1:size(invMats,3)
    usedelements = true(size(invMats,3),1)
    usedelements(i) = false
    res(:,:,i) = inv(mean(invMats(:,:,usedelements),3));
end
% end the program
exit(0)