In-language parallelization

In-language parallelization assumes that either your code (or a function you call) already knows how to parallelize its work, or that you want to use tools from within the language to run parts of your code in parallel.

We will handle both cases in this example. First, we will use functions from the language that already implement parallelization to improve the runtime efficiency of existing code. This depends on the actual code you use, and you have to check the documentation to see whether the libraries, packages or toolboxes you use offer this option. After that, we will show you simple ways to use general in-language parallelization options and point you to the relevant documentation for your language.

Important notes

There are multiple ways to load Python and Python modules on Triton. We strongly recommend that you have a look at the application page for Python, which gives a lot of information on different environments and best practices.

Using existing parallelization

To allow code to access multiple cores, we need the -c or --cpus-per-task parameter in Slurm. A typical Slurm script for parallelized code looks as follows:

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=500M
#SBATCH --cpus-per-task=4
#SBATCH --output=ParallelOut

module load anaconda # use the normal anaconda environment for python
srun python parallel_fun.py
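
If you save this script under a name of your choice, for example parallel_fun.sh (the name here is arbitrary), you can submit it to the queue with sbatch parallel_fun.sh.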

To make use of these cores, you need to pass the requested CPU count on to the functions that do the parallel work. In the following example, we read the allocated CPU count from the SLURM_CPUS_PER_TASK environment variable and hand it to PyTorch's DataLoader, which then uses that many worker processes to load the data:

import os

from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# get the number of CPUs allocated by Slurm; the environment variable
# is a string, so it has to be cast to int
maxprocesses = int(os.getenv('SLURM_CPUS_PER_TASK'))

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

# num_workers tells the DataLoader how many worker processes to use
# for loading the data in parallel
train_dataloader = DataLoader(training_data, num_workers=maxprocesses, batch_size=64, shuffle=True)

for train_feature, train_label in train_dataloader:
    print(train_label)

Now your code can make use of the specified number of cores, speeding up this part of the computation.
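
Data loading is only one place where an existing library can parallelize for you. PyTorch, for instance, also runs its tensor operations on an internal CPU thread pool, which you can size from the same Slurm variable. Below is a minimal sketch; torch.set_num_threads and torch.get_num_threads are standard PyTorch calls, while the fallback value of 1 is our own addition so the snippet also runs outside of Slurm.

import os

import torch

# size PyTorch's internal CPU thread pool from the Slurm allocation;
# the fallback '1' only matters when running outside of Slurm
maxprocesses = int(os.getenv('SLURM_CPUS_PER_TASK', '1'))
torch.set_num_threads(maxprocesses)

# tensor operations below now use at most maxprocesses CPU threads
x = torch.rand(2000, 2000)
y = x @ x
print(torch.get_num_threads())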

Parallelizing your code

Parallel processing lets you run independent, similar tasks simultaneously. This allows you to make use of multiple processors, potentially reducing the run-time of the parallelized parts by a factor of up to the number of processors used. However, keep in mind that parallelization also creates overhead. So if you have very fast methods that don't need to be repeated often, trying to parallelize them might actually make your code slower.
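
To see this overhead in practice, the following sketch times a deliberately cheap task both serially and through a pool of four worker processes; on most machines the parallel version comes out slower, because the work per call is much smaller than the cost of shipping it to another process (the exact numbers will vary).

import time
from multiprocessing import Pool

def square(x):
    # a deliberately cheap function: sending it to another process
    # costs far more than the computation itself
    return x * x

if __name__ == "__main__":
    data = range(100000)

    start = time.perf_counter()
    serial = [square(x) for x in data]
    print("serial:  ", time.perf_counter() - start)

    start = time.perf_counter()
    with Pool(processes=4) as p:
        parallel = p.map(square, data)
    print("parallel:", time.perf_counter() - start)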

Let's assume we have a function that inverts a set of matrices, calculates means of the resulting matrices (always leaving one inverted matrix out), and then inverts the resulting mean matrices again.

The code looks as follows:

import numpy as np

def invertPart(index, matrix):
    # invert the matrix stored at the given index of the 3rd dimension
    return np.linalg.inv(matrix[:, :, index])

def means(index, matrix):
    # average all inverted matrices except the one at the given index,
    # then invert the resulting mean matrix
    return np.linalg.inv(np.mean(np.delete(matrix, index, 0), 0))

def main():
    # create a random set of matrices
    randMat = np.random.rand(1000, 1000, 6)
    invMatrices = []
    # invert the matrices
    for i in range(6):
        invMatrices.append(invertPart(i, randMat))
    squaredRes = []
    # calculate the means and invert again
    for i in range(6):
        squaredRes.append(means(i, invMatrices))

    print(squaredRes)

main()

We can easily parallelize the following comparatively expensive steps:

1. The first matrix inversions
2. The second matrix inversions (along with the mean calculation)

Let's start with the required Slurm script. Here, we will request 4 CPUs, along with 500 MB of memory:

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=500M
#SBATCH --cpus-per-task=4
#SBATCH --output=ParallelOut

module load anaconda # use the normal anaconda environment for python
srun python parallel.py

Then, we need to modify this code to run in parallel.


import multiprocessing
import os
from functools import partial

import numpy as np

def invertPart(index, matrix):
    # invert the matrix stored at the given index of the 3rd dimension
    return np.linalg.inv(matrix[:, :, index])

def means(index, matrix):
    # average all inverted matrices except the one at the given index,
    # then invert the resulting mean matrix
    return np.linalg.inv(np.mean(np.delete(matrix, index, 0), 0))

def main():
    # create a random set of matrices
    randMat = np.random.rand(1000, 1000, 6)
    # get the number of available CPUs from the SLURM environment;
    # the value is a string and needs to be cast to int
    maxprocesses = int(os.getenv('SLURM_CPUS_PER_TASK'))
    # start a pool with the requested number of worker processes;
    # the with-block makes sure the pool is cleaned up afterwards
    with multiprocessing.Pool(processes=maxprocesses) as p:
        # map() distributes the six inversions over the workers; note that
        # the result is a list of the individual matrices, i.e. what was
        # the 3rd dimension of randMat is now the first index
        invMatrices = p.map(partial(invertPart, matrix=randMat), range(6))

        squaredRes = p.map(partial(means, matrix=invMatrices), range(6))

    print(squaredRes)

# the __main__ guard is required for multiprocessing on platforms
# that start workers by spawning a fresh interpreter
if __name__ == "__main__":
    main()
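
The same pipeline can also be written with the standard library's concurrent.futures module, which offers a slightly more modern interface on top of the same process-based parallelism. Below is a minimal sketch; it reuses the invertPart and means functions from above and is functionally equivalent to the Pool version.

import os
from concurrent.futures import ProcessPoolExecutor
from functools import partial

import numpy as np

def invertPart(index, matrix):
    return np.linalg.inv(matrix[:, :, index])

def means(index, matrix):
    return np.linalg.inv(np.mean(np.delete(matrix, index, 0), 0))

def main():
    randMat = np.random.rand(1000, 1000, 6)
    maxprocesses = int(os.getenv('SLURM_CPUS_PER_TASK'))

    # executor.map hands the calls to the worker processes, just like
    # Pool.map above; it returns an iterator, hence the list() around it
    with ProcessPoolExecutor(max_workers=maxprocesses) as executor:
        invMatrices = list(executor.map(partial(invertPart, matrix=randMat), range(6)))
        squaredRes = list(executor.map(partial(means, matrix=invMatrices), range(6)))

    print(squaredRes)

if __name__ == "__main__":
    main()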