Python¶
Python is a widely used programming language, and we have installed the basic packages on every node. However, Python develops quickly, and the system-provided packages are often incomplete or outdated.
Python distributions¶
| Your needs | Python to use | How to install own packages |
|---|---|---|
| I don’t really care, I just want recent stuff and to not worry. | Anaconda: module load anaconda | pip install --user |
| Simple programs with common packages, not switching between Pythons often | Anaconda: module load anaconda | pip install --user |
| Your own conda environment | Miniconda: module load miniconda | conda environment + conda |
| Your own virtual environment | Module virtualenv: module load py-virtualenv | virtualenv + pip + setuptools |
The main version of modern Python is 3. Support for old Python 2 ended at the end of 2019. There are also different distributions: the “regular” CPython, Anaconda (a package containing CPython plus a lot of other scientific software all bundled together), and PyPy (a just-in-time compiler, which can be much faster for some use cases). Triton supports all of these.
For general scientific/data science use, we suggest that you use Anaconda. It comes with the most common scientific software included, and is reasonably optimized.
There are many other “regular” CPython versions in the module system. These are compiled and optimized for Triton, and are highly recommended. The default system Python is old and won’t be updated.
Make sure your environments are reproducible: you can recreate them from scratch. History shows you will probably have to do this eventually, and it also ensures that others can always use your code. We recommend a minimal requirements.txt (pip) or environment.yml (conda), hand-created with the minimal dependencies in there.
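For example, a minimal hand-written environment.yml might look like this (a sketch; the project name, channel, and packages are illustrative):

name: myproject
channels:
  - conda-forge
dependencies:
  - python=3.9
  - numpy
  - pandas

The pip equivalent, requirements.txt, is simply one package per line (numpy, pandas, ...).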
Quickstart¶
Use module load anaconda to get our Python installation.
If you have simple needs, use pip install --user to install packages. For complex needs, use anaconda + conda environments to isolate your projects.
Install your own packages easily¶
Warning
pip install --user can result in incompatibilities
If you do this, the packages will be shared among all your projects. It is quite likely that you will eventually get incompatibilities between the Python you are using and the packages installed this way. In that case you are on your own (the simple fix is to remove everything from ~/.local/lib/pythonN.N and reinstall). If you get incompatible-module errors, our first recommendation will be to remove everything installed this way and use conda/virtual environments instead. It’s not a bad idea to do this when you switch to environments anyway.
If you encounter problems, remove all your user packages:
rm -r ~/.local/lib/python*.*/
and reinstall everything after loading the environment you want.
Installing your own packages with plain pip install won’t work, since it tries to install globally for all users. Instead, add --user to install the package into your home directory (~/.local/lib/pythonN.N/):
pip install --user $package_name
This is quick and effective, but best used for leaf packages without many dependencies, and only if you don’t switch Python modules often.
Note
Example of the dangers of pip install --user
Someone did pip install --user tensorflow. Some time later, they noticed that they couldn’t use TensorFlow + GPUs. We couldn’t reproduce the problem, but in the end found that this local install was shadowing the TensorFlow in every module, forcing a CPU-only version on them.
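If you suspect a user install is shadowing a module-provided package, one quick check (a sketch; tensorflow is just the example package here) is to print where the package is imported from:

python -c 'import tensorflow; print(tensorflow.__file__)'

A path under ~/.local/lib means the pip install --user copy is the one being used.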
Note: pip installs from the Python Package Index.
Anaconda and conda environments¶
Anaconda is an open-source Python distribution by Continuum Analytics. It is nothing fancy: they take a lot of useful scientific packages and their dependencies, put them all together, make sure they work, and do some optimization. It includes the most common computing and data science packages as well as non-Python compiled software and libraries, and it is packaged nicely so that it can easily be installed on any major OS.
To load anaconda, use the module system (you can also load specific versions):
module load anaconda # python3
module load anaconda2 # python2
Note
Before 2020, Python 3 was available via the anaconda3 module (note the 3 on the end). That module is still there, but in 2020 we completely revised our Anaconda installation system and dropped active maintenance of Python 2. All future updates go into the anaconda module only.
Conda environments¶
See also
Watch a Research Software Hour episode on conda for an introduction + demo.
If you encounter a situation where you need to create your own environment, we recommend that you use conda environments. When you create your own environment, the packages from the base environment (the default environment installed by us) are not used; instead, you choose which packages you want to install.
We nowadays recommend that you use the miniconda module for installing these environments. Miniconda is basically a minimal Anaconda installation that can be used to create your own environments.
By default conda tries to install packages into your home folder, which can result in running out of quota. To fix this, you should run the following commands once:
module load miniconda
mkdir $WRKDIR/.conda_pkgs
mkdir $WRKDIR/.conda_envs
conda config --append pkgs_dirs ~/.conda/pkgs
conda config --append envs_dirs ~/.conda/envs
conda config --prepend pkgs_dirs $WRKDIR/.conda_pkgs
conda config --prepend envs_dirs $WRKDIR/.conda_envs
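You can verify that the configuration took effect with (this prints the directory lists conda will search, $WRKDIR first):

conda config --show pkgs_dirs envs_dirs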
virtualenv does not work with Anaconda; use conda instead.
Load the miniconda module. You should look up the version and load the same version each time you source the environment:
# Load miniconda first. This must always be done before activating the env!
module load miniconda
Create an environment. This needs to be done once:
# create an environment with the packages you require
conda create -n ENV_NAME python pip ipython tensorflow-gpu pandas ...
Activate the environment. This needs to be done every time you load the environment:
# This must be run in each shell to set up the environment variables properly.
# Make sure the module is loaded first.
source activate ENV_NAME
Once the environment is activated, more packages can be installed with either conda install or pip install:

# Install more packages, either with conda or pip
conda search PACKAGE_NAME
conda install PACKAGE_NAME
pip install PACKAGE_NAME
Leaving the environment when done (optional):

# Deactivate the environment
source deactivate
To activate an environment from a Slurm script:

#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G

# Load the module first, then activate the environment.
module load miniconda
source activate ENV_NAME

srun echo "This step is run inside the activated conda environment!"

source deactivate
Worst case, if you have incompatibility problems, remove everything, including the stuff installed with pip install --user. If you’ve mixed your personal stuff in with this, then you will have to separate it out:

# Remove anything installed with pip install --user.
rm -r ~/.local/lib/python*.*/
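If you first want to see what would be removed, pip can list only the user-installed packages (those under ~/.local):

pip list --user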
A few notes about conda environments:

- Once you use a conda environment, everything goes into it. Don’t mix versions with, for example, local packages in your home directory installed with pip install --user. Things installed (even previously) with pip install --user will be visible in the conda environment and can make your life hard: eventually you’ll get dependency problems (see the quick check below).
- Often the same goes for other Python-based modules. We have set up many modules that use Anaconda as a backend, so mixing them with your own environments might work, but only if you know what you are doing.
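When in doubt about which Python an activated environment is actually using, a quick sanity check is:

which python
python -c 'import sys; print(sys.prefix)'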
conda init, conda activate, and source activate
We don’t recommend running conda init as many sources suggest: it permanently modifies your .bashrc file and causes hard-to-debug problems later. The main points of conda init are a) automatically activating an environment (not good on a cluster: make activation explicit so it can be more easily debugged) and b) making conda a shell function (rather than a command) so that conda activate works. source activate works just as well in all cases, and causes no confusion for others who haven’t run conda init.
- If you activate one environment from another, for example after loading an anaconda module, do source activate ENV_NAME as shown above (a conda installation in the environment is not needed).
- If you make your own standalone conda environments, install the conda package in them, then activate the environment with source PATH/TO/ENV_DIRECTORY/bin/activate (which incidentally activates just that one session for conda). See the sketch below.
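As a sketch of that standalone workflow (the environment path here is illustrative):

module load miniconda
conda create -p $WRKDIR/conda-envs/standalone python conda
source $WRKDIR/conda-envs/standalone/bin/activate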
Python: virtualenv¶
Virtualenv is the default-Python way of making environments, but it does not work with Anaconda. We generally recommend using Anaconda, since it includes a lot more by default, but virtualenv works easily on other systems, so it’s good to know about.
# Load the virtualenv module
module load py-virtualenv
# Create environment
virtualenv DIR
# activate it (in each shell that uses it)
source DIR/bin/activate
# install more things (e.g. ipython, etc.)
pip install PACKAGE_NAME
# deactivate the virtualenv
deactivate
Anaconda/virtualenvironments in Jupyter¶
If you make a conda environment / virtual environment, you can use it from Triton’s JupyterHub (or your own Jupyter). See Installing kernels from virtualenvs or Anaconda environments.
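For reference, the usual mechanism is the ipykernel package: installed inside the activated environment, it registers the environment as a Jupyter kernel (a sketch; ENV_NAME is a placeholder):

# run inside the activated environment
pip install ipykernel
python -m ipykernel install --user --name ENV_NAME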
IPython Parallel¶
ipyparallel is a tool for running embarrassingly parallel code using Python. The basic idea is that a controller coordinates a set of engines that execute the work, while a client process (running your own code) submits tasks to them.
Preliminary notes: ipyparallel is installed in the anaconda{2,3}/latest modules.
Let’s say that you are doing some basic interactive work:
Controller: this can run on the frontend node, or you can put it in a script. To start:

ipcontroller --ip="*"

Engines: srun -N4 ipengine runs four engines (one per node) via Slurm interactively. You don’t need to interact with this once it is running, but remember to stop the process once it is done, because it is using resources. You can start/stop this as needed.

Start your Python process and use things as normal:
import os
import ipyparallel

client = ipyparallel.Client()
result = client[:].apply_async(os.getpid)
pid_map = result.get_dict()
print(pid_map)
This method lets you turn on/off the engines as needed. This isn’t the most advanced way to use ipyparallel, but works for interactive use.
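Once the engines are running, you can also distribute work over all of them with the same client (a sketch continuing the code above):

# Square a range of numbers in parallel across all engines
squares = client[:].map_sync(lambda x: x**2, range(16))
print(squares)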
See also: IPython parallel for a version which goes in a slurm script.
Background: pip vs python vs anaconda vs conda vs virtualenv¶
Virtual environments are self-contained Python environments with all of their own modules, separate from the system packages. They are great for research where you need to be agile and install whatever versions and packages you need. We highly recommend virtual environments or conda environments (below):

- Anaconda: use conda, see below
- Normal Python: virtualenv + pip install, see below
You often need to install your own packages. Python has its own package manager system that can do this for you. There are three important related concepts:
- pip: the Python package installer. Installs Python packages globally, in a user’s directory (--user), or anywhere. Installs from the Python Package Index.
- virtualenv: creates a directory containing a self-contained set of packages that is manageable by the user themself. When the virtualenv is activated, the operating-system global packages are no longer used; instead, you install only the packages you want. This is important if you need to install specific versions of software, and it also provides isolation from the rest of the system (so that your work can continue uninterrupted). It also allows different projects to have different versions of things installed. virtualenv isn’t magic: it could almost be seen as just manipulating PYTHONPATH, PATH, and the like. Docs: https://docs.python-guide.org/dev/virtualenvs/
- conda: sort of a combination of package manager and virtual environment. However, it only installs packages into environments, and it is not limited to Python packages: it can also install other libraries (C, Fortran, etc.) into the environment. This is extremely useful for scientific computing, and it is the reason conda was created. Docs for envs: https://conda.io/projects/conda/en/latest/user-guide/concepts/environments.html
So, to install packages, there is pip and conda. To make virtual environments, there is venv and conda.
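As a side-by-side sketch of the two workflows (environment and package names are illustrative):

# venv + pip
python3 -m venv myenv
source myenv/bin/activate
pip install numpy

# conda
conda create -n myenv numpy
source activate myenv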
Advanced users can see this rosetta stone for reference.
On Triton we have added some packages on top of the Anaconda installation, so cloning the entire Anaconda environment to a local conda environment will not work (it is not a good idea in the first place, but some users try it every now and then).
Examples¶
Running Python with OpenMP parallelization¶
Various Python packages, such as NumPy, SciPy and pandas, can utilize OpenMP to run on multiple CPUs. As an example, let’s run the Python script python_openmp.py, which calculates the multiplicative inverses of five symmetric matrices of size 2000x2000:
import numpy as np
from time import time

nrounds = 5

t_start = time()
for i in range(nrounds):
    # Create a random symmetric matrix
    a = np.random.random([2000, 2000])
    a = a + a.T
    # Invert it (pinv uses OpenMP-parallelized linear algebra)
    b = np.linalg.pinv(a)
t_delta = time() - t_start
print('Seconds taken to invert %d symmetric 2000x2000 matrices: %f' % (nrounds, t_delta))
The full code for the example is in the HPC examples repository. One can run this example with srun:
wget https://raw.githubusercontent.com/AaltoSciComp/hpc-examples/master/python/python_openmp/python_openmp.py
module load anaconda/2022-01
export OMP_PROC_BIND=true
srun --cpus-per-task=2 --mem=2G --time=00:15:00 python python_openmp.py
or with sbatch by submitting python_openmp.sh:
#!/bin/bash -l
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1G
#SBATCH -o python_openmp.out
module load anaconda/2022-01
export OMP_PROC_BIND=true
echo 'Running on: '$HOSTNAME
srun python python_openmp.py
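By default OpenMP may use as many threads as the node has cores, so it is common (a general OpenMP convention, not Triton-specific) to also pin the thread count to the Slurm allocation:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK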
Important
Python has a global interpreter lock (GIL), which forces some operations to be executed on only one thread; while these operations are occurring, other threads will be idle. Such operations include reading files and print statements. Thus one should be extra careful with multithreaded code, as it is easy to create seemingly parallel code that does not actually utilize multiple CPUs.
There are ways to minimize the effects of the GIL on your Python code, and if you’re creating your own multithreaded code, we recommend that you take this into account.
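One common way around the GIL for CPU-bound pure-Python work is process-based parallelism with the standard library; a minimal sketch:

import multiprocessing

def square(x):
    # Each call runs in a separate process with its own interpreter and GIL
    return x * x

if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:
        print(pool.map(square, range(16)))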
Running MPI parallelized Python with mpi4py¶
MPI-parallelized Python requires a valid MPI installation that supports our Slurm scheduler; thus Anaconda is not the best option. We have installed MPI-supporting Python versions in different toolchains.
Using mpi4py is quite easy. An example is provided below.
Python MPI4py¶
A simple script, here called hello_mpi.py, that utilizes mpi4py (note: naming the script mpi4py.py would shadow the mpi4py package itself and break the import):
#!/usr/bin/env python
"""
Parallel Hello World
"""
from mpi4py import MPI
import sys
size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()
sys.stdout.write(
    "Hello, World! I am process %d of %d on %s.\n"
    % (rank, size, name))
Running hello_mpi.py using only srun:
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --ntasks=4
module load Python/2.7.11-goolf-triton-2016b
srun python hello_mpi.py
Example sbatch script mpi4py.sh for running hello_mpi.py through sbatch:
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --ntasks=4
module load Python/2.7.11-goolf-triton-2016b
mpiexec -n $SLURM_NTASKS python hello_mpi.py