Running serial jobs in parallel

Very often you need to run a given script with multiple different sets of parameters. This is what is commonly called an embarrassingly parallel problem: the individual runs are independent of each other, which makes the problem very easy to parallelize, and there are usually many more runs than available processors. Here, we will show you how to adapt your code to this kind of situation, based on a non-parallel version of a problem. We will assume an individual script with fixed parameter values and modify it so that it chooses its input based on the array task ID that Slurm assigns to the job.

The unparallelized version

Let's assume you have a genetic algorithm optimization pipeline with a few fixed parameters.

import pygad
import numpy as np

function_inputs = np.array([4,-2,3.5,5,-11,-4.7])
desired_output = 44

def fitness_func(solution, solution_idx):
    output = np.sum(solution*function_inputs)
    fitness = 1.0 / np.abs(output - desired_output)
    return fitness

# define the parameters

fitness_function = fitness_func

num_generations = 200000
num_parents_mating = 4

sol_per_pop = 100
num_genes = len(function_inputs)

mutation_percent_genes = 10

stop_criteria="saturate_50"

ga_instance = pygad.GA(num_generations=num_generations,
                       num_parents_mating=num_parents_mating,
                       fitness_func=fitness_function,
                       sol_per_pop=sol_per_pop,
                       num_genes=num_genes,
                       mutation_percent_genes=mutation_percent_genes,
                       stop_criteria=stop_criteria)

ga_instance.run()
solution, solution_fitness, solution_idx = ga_instance.best_solution()
print("Parameters of the best solution : {solution}".format(solution=solution))
print("Fitness value of the best solution = {solution_fitness}".format(solution_fitness=solution_fitness))
prediction = np.sum(np.array(function_inputs)*solution)
print("Predicted output based on the best solution : {prediction}".format(prediction=prediction))

The parameters set in this example are: the maximum number of generations, the population size, the maximum number of stalled generations (i.e. for how many non-improving generations the algorithm should continue before stopping), and the mutation rate. Let's assume we want to test how different mutation rates change the outcome of the algorithm and its runtime. We could loop over the percentages from 1 to 100% within the script itself, as in the sketch below, but the runs would then execute one after another and take quite some time. Alternatively, we can run 100 jobs, each testing one percentage.
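A minimal sketch of that serial loop, reusing the setup from above (printing the rate and achieved fitness of each run is just for illustration):

import pygad
import numpy as np

function_inputs = np.array([4,-2,3.5,5,-11,-4.7])
desired_output = 44

def fitness_func(solution, solution_idx):
    output = np.sum(solution*function_inputs)
    return 1.0 / np.abs(output - desired_output)

# run the same GA once for every mutation rate, one after the other
for mutation_percent_genes in range(1, 101):
    ga_instance = pygad.GA(num_generations=200000,
                           num_parents_mating=4,
                           fitness_func=fitness_func,
                           sol_per_pop=100,
                           num_genes=len(function_inputs),
                           mutation_percent_genes=mutation_percent_genes,
                           stop_criteria="saturate_50")
    ga_instance.run()
    solution, solution_fitness, solution_idx = ga_instance.best_solution()
    print("mutation rate {}%: fitness {}".format(mutation_percent_genes, solution_fitness))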

Running a Slurm array job

Array jobs are defined in Slurm by the parameter --array=XXX-YYY, where XXX is the lowest and YYY the highest index. Each job in the array has access to its own SLURM_ARRAY_TASK_ID environment variable. There are two ways to incorporate this into a job: either directly in the submission script, or by retrieving it in your code. The former allows, e.g., the selection of input file names based on the array ID:

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=500M
#SBATCH --array=1-4

srun ./my_application -input input_data_${SLURM_ARRAY_TASK_ID}

In our case, however, we would like to use the task ID directly within the script we run, so we set up the following Slurm script (the %a in the --output option expands to the array task ID, so each task writes its own log file):

#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --array=1-100
#SBATCH --mem=500M
#SBATCH --output=python_array_%a.out


module load anaconda # use the normal anaconda environment for python

srun python serial.py

Then we modify the script as follows:

import pygad
import numpy as np
import os

function_inputs = np.array([4,-2,3.5,5,-11,-4.7])
desired_output = 44

def fitness_func(solution, solution_idx):
    output = np.sum(solution*function_inputs)
    fitness = 1.0 / np.abs(output - desired_output)
    return fitness

# define the parameters

fitness_function = fitness_func

num_generations = 200000
num_parents_mating = 4

sol_per_pop = 100
num_genes = len(function_inputs)

mutation_percent_genes = int(os.getenv('SLURM_ARRAY_TASK_ID'))  # 1-100, set by Slurm for each array task

stop_criteria="saturate_50"

ga_instance = pygad.GA(num_generations=num_generations,
                       num_parents_mating=num_parents_mating,
                       fitness_func=fitness_function,
                       sol_per_pop=sol_per_pop,
                       num_genes=num_genes,
                       mutation_percent_genes=mutation_percent_genes,
                       stop_criteria=stop_criteria)

ga_instance.run()
solution, solution_fitness, solution_idx = ga_instance.best_solution()
print("Parameters of the best solution : {solution}".format(solution=solution))
print("Fitness value of the best solution = {solution_fitness}".format(solution_fitness=solution_fitness))
prediction = np.sum(np.array(function_inputs)*solution)
print("Predicted output based on the best solution : {prediction}".format(prediction=prediction))

Now the mutation rate is set from the SLURM_ARRAY_TASK_ID environment variable, and each of the 100 array tasks tests a different value.
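If you also want to run the same script outside of Slurm, for example to test it on your own machine where SLURM_ARRAY_TASK_ID is not set, you can give os.getenv a default; the fallback value of 10 below is an arbitrary choice for local testing:

mutation_percent_genes = int(os.getenv('SLURM_ARRAY_TASK_ID', '10'))  # falls back to 10 when not running under Slurm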

Best Practices

In general, you should try not to create too many jobs at once, as this puts unnecessary stress on the scheduler. This is particularly important if your individual array tasks only take a very short time (less than about 30 minutes). If you have a large number of very short array tasks, it is a good idea to group them into batches.
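Note that Slurm can also throttle an array for you: appending %N to the range limits how many of the tasks run simultaneously. For example, the following directive still submits all 100 tasks, but lets at most 10 of them run at any one time:

#SBATCH --array=1-100%10

This limits the load on the compute nodes, but the scheduler still has to track 100 individual tasks; grouping, shown next, reduces the number of tasks themselves. In our example it works as follows.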

Grouping array jobs

To group jobs without extensive modification of your script, you can create a loop in the submission script that repeatedly calls your program and either changes the input parameters it is given, or exports the loop variable so that it can be accessed within the script. For the genetic algorithm example, the code needs to be modified as follows. First, we introduce a for loop in the Slurm script that runs the program once for every index in the batch.

#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --array=1-10
#SBATCH --mem=500M
#SBATCH --output=python_array_%a.out


module load anaconda # use the normal anaconda environment for python

# size of each batch
BATCHSIZE=10
n=$SLURM_ARRAY_TASK_ID

# generate the sequence of indices handled by this task:
# task n covers (n-1)*BATCHSIZE + 1 .. n*BATCHSIZE, so tasks 1-10 cover 1-100
indexes=$(seq $(((n - 1)*BATCHSIZE + 1)) $((n*BATCHSIZE)))

# run your program for each value
for i in $indexes
do
   export i  # export the loop variable so that the Python interpreter can read it
   srun python serial_array.py
done

Then we change the environment variable that is read in the script:

import pygad
import numpy as np
import os

function_inputs = np.array([4,-2,3.5,5,-11,-4.7])
desired_output = 44

def fitness_func(solution, solution_idx):
    output = np.sum(solution*function_inputs)
    fitness = 1.0 / np.abs(output - desired_output)
    return fitness

# define the parameters

fitness_function = fitness_func

num_generations = 200000
num_parents_mating = 4

sol_per_pop = 100
num_genes = len(function_inputs)

mutation_percent_genes = int(os.getenv('i'))  # set by the export in the submission script's loop

stop_criteria="saturate_50"

ga_instance = pygad.GA(num_generations=num_generations,
                       num_parents_mating=num_parents_mating,
                       fitness_func=fitness_function,
                       sol_per_pop=sol_per_pop,
                       num_genes=num_genes,
                       mutation_percent_genes=mutation_percent_genes,
                       stop_criteria=stop_criteria)

ga_instance.run()
solution, solution_fitness, solution_idx = ga_instance.best_solution()
print("Parameters of the best solution : {solution}".format(solution=solution))
print("Fitness value of the best solution = {solution_fitness}".format(solution_fitness=solution_fitness))
prediction = np.sum(np.array(function_inputs)*solution)
print("Predicted output based on the best solution : {prediction}".format(prediction=prediction))