Python

Python is a widely used programming language, and we have installed all the basic packages on every node. Yet Python develops quickly, and the system-provided packages are often incomplete or outdated.

Python distributions

Use case                                                                   | Python to use                                 | How to install own packages
---------------------------------------------------------------------------|-----------------------------------------------|------------------------------
I don’t really care, I just want recent stuff and to not worry.            | Anaconda: module load anaconda                | (nothing: most packages are already included)
Simple programs with common packages, not switching between Pythons often  | Anaconda: module load anaconda                | pip install --user
Your own conda environment                                                 | Miniconda: module load miniconda              | conda environment + conda
Your own virtual environment                                               | virtualenv module: module load py-virtualenv  | virtualenv + pip + setuptools

The main version of modern Python is 3; support for the old Python 2 ended at the end of 2019. There are also different distributions: the “regular” CPython, Anaconda (a package containing CPython + a lot of other scientific software, all bundled together), and PyPy (a just-in-time compiler, which can be much faster for some use cases). Triton supports all of these.

  • For general scientific/data science use, we suggest that you use Anaconda. It comes with the most common scientific software included, and is reasonably optimized.

  • There are many other “regular” CPython versions in the module system. These are compiled and optimized for Triton, and are highly recommended. The default system Python is old and won’t be updated.

Make sure your environments are reproducible, i.e. you can recreate them from scratch. History shows you will probably have to do this eventually, and it also ensures that others can always use your code. We recommend a minimal requirements.txt (pip) or environment.yml (conda), hand-created with the minimal dependencies listed in there.
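As a sketch, a minimal hand-written environment.yml could look like this (the package names and version pin are only illustrative):

name: myproject
channels:
  - conda-forge
dependencies:
  - python=3.9
  - numpy
  - pandas

Such a file recreates the environment with conda env create -f environment.yml; the pip equivalent, requirements.txt, simply lists one requirement per line.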

Quickstart

Use module load anaconda to get our Python installation.

If you have simple needs, use pip install --user to install packages. For complex needs, use Anaconda + conda environments to isolate your projects.
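In short (PACKAGE_NAME is a placeholder):

$ module load anaconda
$ pip install --user PACKAGE_NAME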

Install your own packages easily

Warning

pip install --user can result in incompatibilities

If you do this, the packages will be shared among all your projects. It is quite likely that you will eventually get incompatibilities between the Python you are using and the packages installed this way. In that case, you are on your own (the simple fix is to remove all packages from ~/.local/lib/pythonN.N and reinstall them). If you get incompatible-module errors, our first recommendation will be to remove everything installed this way and use conda/virtual environments instead. It’s not a bad idea to do this when you switch to environments anyway.

If you encounter problems, remove all your user packages:

$ rm -r ~/.local/lib/python*.*/

and reinstall everything after loading the environment you want.

Installing your own packages with plain pip install won’t work, since it tries to install globally for all users. Instead, add --user to install the package in your home directory (~/.local/lib/pythonN.N/):

$ pip install --user $package_name

This is quick and effective, but best used for leaf packages without many dependencies, and only if you don’t switch Python modules often.

Note

Example of dangers of pip install --user

Someone did pip install --user tensorflow. Some time later, they noticed that they couldn’t use TensorFlow with GPUs. We couldn’t reproduce the problem, but in the end we found this local install, which was hiding the TensorFlow in every module (forcing a CPU-only version on them).
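If you suspect this kind of shadowing, a generic diagnostic (not specific to Triton) is to check which copy of a package Python actually imports:

$ python -c 'import tensorflow; print(tensorflow.__file__)'

If the printed path starts with ~/.local/lib, the user-installed copy is shadowing the one from the module.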

Note: pip installs from the Python Package Index.

Anaconda and conda environments

Anaconda is a Python distribution by Continuum Analytics (open source, of course). It is nothing fancy: they take a lot of useful scientific packages and their dependencies, put them all together, make sure they work, and do some optimization. It includes most of the common computing and data science packages, as well as non-Python compiled software and libraries. It is all open source and packaged nicely, so it can easily be installed on any major OS.

To load anaconda, use the module system (you can also load specific versions):

$ module load anaconda     # python3
$ module load anaconda2    # python2

Note

Before 2020, Python 3 was provided via the anaconda3 module (note the 3 on the end). That module is still there, but in 2020 we completely revised our Anaconda installation system and dropped active maintenance of Python 2. In the future, all updates go into the anaconda module only.

Conda environments

See also

Watch a Research Software Hour episode on conda for an introduction + demo.

If you need to create your own environment, we recommend that you use conda environments. When you create your own environment, the packages from the base environment (the default environment installed by us) are not used; instead, you choose which packages to install.

We nowadays recommend the miniconda module for installing these environments. Miniconda is basically a minimal Anaconda installation that can be used to create your own environments.

By default, conda tries to install packages into your home folder, which can make you run out of quota. To fix this, run the following commands once:

$ module load miniconda

## Create package and environment directories in your working directory,
## which has much more quota than your home directory
$ mkdir $WRKDIR/.conda_pkgs
$ mkdir $WRKDIR/.conda_envs

## Keep the default home-directory locations as a fallback...
$ conda config --append pkgs_dirs ~/.conda/pkgs
$ conda config --append envs_dirs ~/.conda/envs
## ...but prefer the working-directory locations
$ conda config --prepend pkgs_dirs $WRKDIR/.conda_pkgs
$ conda config --prepend envs_dirs $WRKDIR/.conda_envs
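To verify that the settings took effect, you can print the relevant configuration keys:

$ conda config --show pkgs_dirs
$ conda config --show envs_dirs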

virtualenv does not work with Anaconda; use conda instead.

  • Load the miniconda module. You should look up the version and load the same version each time you use the environment:

    ## Load miniconda first.  This must always be done before activating the env!
    $ module load miniconda
    
  • Create an environment. This needs to be done once:

    ## create environment with the packages you require
    $ conda create -n ENV_NAME python pip ipython tensorflow-gpu pandas ...
    
  • Activate the environment. This needs to be done every time you load the environment:

    ## This must be run in each shell to set up the environment variables properly.
    ## make sure module is loaded first.
    $ source activate ENV_NAME
    
  • Once the environment is active, you can install more packages using either conda install or pip install:

    ## Install more packages, either conda or pip
    $ conda search PACKAGE_NAME
    $ conda install PACKAGE_NAME
    $ pip install PACKAGE_NAME
    
  • Leaving the environment when done (optional):

    ## Deactivate the environment
    $ source deactivate
    
  • To activate an environment from a Slurm script:

    #!/bin/bash
    #SBATCH --time=00:05:00
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=1G
    
    source activate ENV_NAME
    
    srun echo "This step is run inside the activated conda environment!"
    
    source deactivate
    
  • In the worst case, you have incompatibility problems: remove everything, including anything installed with pip install --user. If you’ve mixed your personal packages in with this, you will have to separate them out:

    ## Remove anything installed with pip install --user.
    $ rm -r ~/.local/lib/python*.*/
    

A few notes about conda environments:

  • Once you use a conda environment, everything goes into it. Don’t mix versions with, for example, local packages in your home directory installed with pip install --user. Things installed (even previously) with pip install --user will be visible in the conda environment and can make your life hard! Eventually you’ll get dependency problems.

  • The same often goes for other Python-based modules: many of the modules we provide use Anaconda as a backend, so mixing them with your own installations may work, but only if you know what you are doing.

conda init, conda activate, and source activate

We don’t recommend running conda init, even though many sources suggest it: it permanently modifies your .bashrc file and can cause hard-to-debug problems later. The main points of conda init are to a) automatically activate an environment (not good on a cluster: make activation explicit so that it can be more easily debugged) and b) make conda a shell function (not a command) so that conda activate works (source activate works just as well in all cases, and causes no confusion when others haven’t run conda init).

  • If you activate an environment after loading an anaconda module, do source activate ENV_NAME as shown above (a conda installation inside the environment is not needed).

  • If you make your own standalone conda environments, install the conda package in them, then activate them as described in the next point.

  • Activate a standalone environment that has conda installed in it with source PATH/TO/ENV_DIRECTORY/bin/activate (which, incidentally, activates conda for just that one session). A sketch follows this list.
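A minimal sketch of the standalone workflow (the environment path and name are illustrative):

## Create a standalone environment in your working directory, with the conda package in it
$ conda create -p $WRKDIR/envs/standalone python conda

## Activate it directly through its own activate script
$ source $WRKDIR/envs/standalone/bin/activate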

Python: virtualenv

Virtualenv is the default Python way of making environments, but it does not work with Anaconda. We generally recommend using Anaconda, since it includes a lot more by default, but virtualenv works easily on other systems, so it’s good to know about.

## Load the virtualenv module
$ module load py-virtualenv

## Create environment
$ virtualenv DIR

## activate it (in each shell that uses it)
$ source DIR/bin/activate

## install more things (e.g. ipython, etc.)
$ pip install PACKAGE_NAME

## deactivate the virtualenv
$ deactivate

Anaconda/virtual environments in Jupyter

If you make a conda environment / virtual environment, you can use it from Triton’s JupyterHub (or your own Jupyter). See Installing kernels from virtualenvs or Anaconda environments.
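The usual pattern (a generic sketch; see the linked instructions for the Triton-specific steps) is to install ipykernel inside the activated environment and register it as a kernel, where my-env is a placeholder name:

$ pip install ipykernel
$ python -m ipykernel install --user --name my-env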

IPython Parallel

ipyparallel is a tool for running embarrassingly parallel code using Python. The basic idea is that a controller coordinates a set of engines, which do the work, while a client process runs your own code and submits work to them.

Preliminary notes: ipyparallel is installed in the anaconda{2,3}/latest modules.

Let’s say that you are doing some basic interactive work:

  • Controller: this can run on the frontend node, or you can put it in a script. To start: ipcontroller --ip="*"

  • Engines: srun -N4 ipengine runs four engines via Slurm interactively. You don’t need to interact with them once they are running, but remember to stop the process when you are done, because it is using resources. You can start/stop this as needed.

  • Start your Python process and use things like normal:

    import os
    import ipyparallel

    # Connect to the running controller
    client = ipyparallel.Client()
    # Run os.getpid asynchronously on every engine
    result = client[:].apply_async(os.getpid)
    # Collect the results as an {engine_id: pid} dictionary
    pid_map = result.get_dict()
    print(pid_map)
    

This method lets you turn on/off the engines as needed. This isn’t the most advanced way to use ipyparallel, but works for interactive use.

See also: IPython parallel for a version that goes in a Slurm script.

Background: pip vs python vs anaconda vs conda vs virtualenv

Virtual environments are self-contained Python environments with all of their own modules, separate from the system packages. They are great for research, where you need to be agile and install whatever versions and packages you need. We highly recommend virtual environments or conda environments (see below):

  • Anaconda: use conda, see below

  • Normal Python: virtualenv + pip install, see below

You often need to install your own packages. Python has its own package management system that can do this for you. There are three important related concepts:

  • pip: the Python package installer. Installs Python packages globally, in a user’s directory (--user), or anywhere. Installs from the Python Package Index.

  • virtualenv: Creates a directory containing a self-contained set of packages, manageable by the user themselves. When the virtualenv is activated, the operating system’s global packages are no longer used; instead, you install only the packages you want. This is important if you need to install specific versions of software, and it also provides isolation from the rest of the system (so that your work can continue uninterrupted). It also allows different projects to have different versions of things installed. virtualenv isn’t magic; it could almost be seen as just manipulating PYTHONPATH, PATH, and the like. Docs: https://docs.python-guide.org/dev/virtualenvs/

  • conda: Sort of a combination of package manager and virtual environment. However, it only installs packages into environments, and it is not limited to Python packages. It can also install other libraries (C, Fortran, etc.) into the environment. This is extremely useful for scientific computing, and the reason it was created. Docs for envs: https://conda.io/projects/conda/en/latest/user-guide/concepts/environments.html

So, to install packages, there are pip and conda. To make virtual environments, there are virtualenv and conda.
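For comparison, here are the two workflows side by side (the environment and package names are illustrative):

## virtualenv + pip
$ virtualenv myenv
$ source myenv/bin/activate
$ pip install numpy

## conda
$ conda create -n myenv python
$ source activate myenv
$ conda install numpy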

Advanced users can see this rosetta stone for reference.

On Triton we have added some packages on top of the Anaconda installation, so cloning the entire Anaconda environment to a local conda environment will not work (it’s not a good idea in the first place, but some users try this every now and then).

Examples

Running Python with OpenMP parallelization

Various Python packages, such as NumPy, SciPy, and pandas, can utilize OpenMP to run on multiple CPUs. As an example, let’s run the Python script python_openmp.py, which calculates the multiplicative inverses of five symmetric 2000x2000 matrices:

from time import time

import numpy as np

nrounds = 5

t_start = time()

for i in range(nrounds):
    # Create a random symmetric 2000x2000 matrix
    a = np.random.random([2000, 2000])
    a = a + a.T
    # The pseudo-inverse calls threaded LAPACK/BLAS routines under the hood
    b = np.linalg.pinv(a)

t_delta = time() - t_start

print('Seconds taken to invert %d symmetric 2000x2000 matrices: %f' % (nrounds, t_delta))

The full code for the example is in the HPC examples repository. One can run this example with srun:

wget https://raw.githubusercontent.com/AaltoSciComp/hpc-examples/master/python/python_openmp/python_openmp.py
module load anaconda/2022-01
export OMP_PROC_BIND=true
srun --cpus-per-task=2 --mem=2G --time=00:15:00 python python_openmp.py

or with sbatch by submitting python_openmp.sh:

#!/bin/bash -l
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1G
#SBATCH -o python_openmp.out

module load anaconda/2022-01

export OMP_PROC_BIND=true

echo 'Running on: '$HOSTNAME

srun python python_openmp.py

Important

Python has a global interpreter lock (GIL), which forces some operations to be executed on only one thread; while these operations are occurring, the other threads will be idle. Such operations include reading files and printing output. Thus, one should be extra careful with multithreaded code, as it is easy to create seemingly parallel code that does not actually utilize multiple CPUs.

There are ways to minimize the effects of the GIL on your Python code, and if you’re writing your own multithreaded code, we recommend that you take this into account; one common workaround is sketched below.
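A minimal sketch of one such workaround (using separate processes instead of threads, so each process has its own interpreter and GIL; the code below is illustrative and not part of the HPC examples repository):

import multiprocessing as mp

def square(x):
    # Pure-Python work like this would be serialized by the GIL in threads,
    # but separate processes run truly in parallel
    return x * x

if __name__ == '__main__':
    # Two worker processes, matching e.g. --cpus-per-task=2
    with mp.Pool(processes=2) as pool:
        print(pool.map(square, range(10)))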

Running MPI parallelized Python with mpi4py

MPI-parallelized Python requires a valid MPI installation that supports our Slurm scheduler, so Anaconda is not the best option here. We have installed MPI-supporting Python versions in different toolchains.

Using mpi4py is quite easy. An example is provided below.

Python MPI4py

A simple script hello_mpi.py that utilizes mpi4py (don’t name the file mpi4py.py, since it would shadow the mpi4py package on import):

#!/usr/bin/env python
"""
Parallel Hello World
"""
from mpi4py import MPI
import sys
size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()
sys.stdout.write(
    "Hello, World! I am process %d of %d on %s.\n"
    % (rank, size, name))

Running hello_mpi.py interactively using only srun:

$ module load Python/2.7.11-goolf-triton-2016b
$ srun --time=00:10:00 --ntasks=4 python hello_mpi.py

Example sbatch script hello_mpi.sh for running hello_mpi.py through sbatch:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --ntasks=4

module load Python/2.7.11-goolf-triton-2016b
mpiexec -n $SLURM_NTASKS python hello_mpi.py