Environments with Conda¶
Conda is a package manager that is especially popular in the data science and machine learning communities.
It is commonly used to handle the complex requirements of Python and R packages.
Quick usage guide¶
First time setup¶
You can get conda by loading the miniconda module:
module load miniconda
By default Conda stores installed packages and environments in your home directory. However, your home directory has a small quota, so it is a good idea to tell conda to install packages and environments into your work directory instead:
mkdir $WRKDIR/.conda_pkgs
mkdir $WRKDIR/.conda_envs
conda config --append pkgs_dirs ~/.conda/pkgs
conda config --append envs_dirs ~/.conda/envs
conda config --prepend pkgs_dirs $WRKDIR/.conda_pkgs
conda config --prepend envs_dirs $WRKDIR/.conda_envs
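To check that the new locations are picked up, you can print the configured directories; the work directory paths should now be listed first:
conda config --show pkgs_dirs
conda config --show envs_dirs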
Now you’re all set up to create your first environment.
Creating a simple environment with conda¶
One can create environments directly from the command line, but a better idea is to write an environment.yml file that describes the environment. Below is a simple environment.yml:
name: conda-example
channels:
- conda-forge
dependencies:
- numpy
- pandas
Now we can use the conda command to create the environment:
module load miniconda
conda env create --file environment.yml
Once the environment is installed, you can activate it with:
source activate conda-example
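As a quick check after activation, you can verify that the environment's own Python interpreter is the one being used:
# The path should point into the environment's directory
which python
# Leave the environment when you are done (the counterpart of source activate)
source deactivate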
conda init, conda activate, and source activate
We don't recommend running conda init like many sources recommend: it permanently modifies your .bashrc file and can cause hard-to-debug problems later. The main points of conda init are to a) automatically activate an environment (not good on a cluster: make activation explicit so it can be more easily debugged) and b) make conda a shell function (not a command) so that conda activate works (source activate works in all cases, so there is no confusion if conda activate does not work for someone).
- If you activate one environment from another, for example after loading an anaconda module, do source activate ENV_NAME as shown above (a conda installation in the environment is not needed).
- If you make your own standalone conda environments, install the conda package in them, then…
- Activate a standalone environment with conda installed in it by source PATH/TO/ENV_DIRECTORY/bin/activate (which incidentally activates conda for just that one session), as sketched below.
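For example, assuming a standalone environment directory under $WRKDIR/.conda_envs (the name my-standalone-env below is just an illustration), activation could look like:
# Activate a standalone environment directly via its activate script
source $WRKDIR/.conda_envs/my-standalone-env/bin/activate
# ... work inside the environment ...
# Deactivate the session when done
source deactivate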
Resetting conda
Sometimes it is necessary to reset your Conda configuration. Here are instructions on how to wipe all of your conda settings and existing environments. To do this, first activate conda. On Triton, load the miniconda module:
module load miniconda
First, check where conda stores your environments:
conda config --show envs_dirs
conda config --show pkgs_dirs
Delete the directories that are listed and start with /home/USERNAME (e.g. /home/USERNAME/.conda/envs) or /scratch/ (e.g. /scratch/work/USERNAME/conda_envs). You would delete these with rm -r DIRNAME, but be careful that you use the right paths because there is no going back.
This will clean up all packages and environments you have installed.
Next, clean up your .bashrc, .zshrc, .kshrc and .cshrc (whichever ones exist for you). Open these files in an editor (e.g. nano .bashrc), search for the line # >>> conda initialize >>> and delete everything between it and the line # <<< conda initialize <<<. These lines automatically initialize conda upon login, which can cause a lot of trouble on a cluster.
Finally, delete the file .condarc from your home folder (rm ~/.condarc) to reset your conda configuration.
After this, close the current connection to Triton and reconnect in a new session.
Now you should have a system without any remnants of conda, and you can follow the initial setup steps as detailed here.
Understanding the environment file¶
Conda environment files are written using YAML syntax. In an environment file one usually defines the following:
- name: Name of the desired environment.
- channels: Which channels to use for packages.
- dependencies: Which conda and pip packages to install.
Choosing conda channels¶
When an environment file is used to create an environment, conda goes through the list of channels (in descending priority) and tries to find the needed packages from them.
Some of the most popular channels are:
- conda-forge: An open-source channel with over 18k packages. Highly recommended for new environments. Most packages in anaconda modules come from here.
- defaults: A channel maintained by Anaconda Inc. Free for non-commercial use. Default for the Anaconda distribution.
- r: A channel of R packages maintained by Anaconda Inc. Free for non-commercial use.
- bioconda: A community-maintained channel of bioinformatics packages.
- pytorch: Official channel for PyTorch, a popular machine learning framework.
One can have multiple channels defined like in the following example:
name: pytorch-env
channels:
- nvidia
- pytorch
- conda-forge
dependencies:
- pytorch
- pytorch-cuda=11.7
- torchvision
- torchaudio
Setting package dependencies¶
Packages in environment.yml can have version constraints and version wildcards. One can also specify pip packages to install after the conda packages have been installed.
For example, the following dependency-env.yml would install numpy with a version greater than or equal to 1.10 using conda, and scipy via pip:
name: dependency-env
channels:
- conda-forge
dependencies:
- numpy>=1.10
- pip
- pip:
  - scipy
Listing packages in an environment¶
To list packages installed in an environment, one can use:
conda list
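You can also list the packages of a specific environment without activating it, or export the full specification of the active environment for reproducibility (the environment name below is the example from above):
# List packages of a named environment without activating it
conda list --name conda-example
# Export the active environment's exact package list to a file
conda env export > environment-full.yml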
Removing an environment¶
To remove an environment, one can use:
conda env remove --name environment_name
Do remember to deactivate the environment before trying to remove it.
Cleaning up conda cache¶
Conda uses a cache for downloaded and installed packages. This cache can get large or it can be corrupted by failed downloads.
In these situations one can use conda clean to clean up the cache.
- conda clean -i cleans up the index cache that conda uses to find the packages.
- conda clean -t cleans up downloaded package installers.
- conda clean -p cleans up unused packages.
- conda clean -a cleans up all of the above.
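For example, to see what a full cleanup would remove before deleting anything, one can do something like this (the --dry-run flag is available in recent conda versions):
# Preview what would be removed
conda clean --all --dry-run
# Remove index cache, package tarballs and unused packages
conda clean --all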
Installing new packages into an environment¶
Installing new packages into an existing environment can be done with the conda install command. The following command would install matplotlib from conda-forge into an environment:
conda install --freeze-installed --channel conda-forge matplotlib
Installing packages into an existing environment can be risky: conda uses the channels given on the command line when it determines where the new packages should come from.
This can cause a situation where installing a new package results in the removal and reinstallation of multiple packages. Adding the --freeze-installed flag keeps already installed packages untouched, and by explicitly giving the channels to use, one can make certain that the new packages come from the same source.
It is usually a better option to create a new environment with the new package set as an additional dependency in the environment.yml.
This keeps the environment reproducible.
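For example, if the environment was created from an environment.yml, one possible workflow (sketched here, assuming the file is in the current directory) is to add the new package to the file and then update the environment from it:
# After adding e.g. matplotlib to the dependencies in environment.yml
conda env update --file environment.yml --prune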
If you do intend to install packages into an existing environment, adding default channels for the environment can also make installing packages easier.
Setting default channels for an environment¶
It is a good idea to save the channels used when creating the environment in a configuration file that is stored within the environment. This makes it easier to install any missing packages later.
For example, one could add conda-forge into the list of default channels with:
conda config --env --add channels conda-forge
We can check the contents of the configuration file with:
cat $CONDA_PREFIX/.condarc
Doing everything faster with mamba¶
mamba is a drop-in replacement for conda that builds and solves environments much faster than conda.
To use it, you either need to install the mamba package from the conda-forge channel or use the miniconda module.
If you have mamba, you can just switch from the conda command to mamba and it should work in the same way, but faster.
For example, one could create an environment with:
mamba env create --file environment.yml
Motivation for using conda¶
When should you use conda?¶
If you need basic Python packages, you can use the pre-installed anaconda modules. See the Python page for more information.
You should use conda when you need to create your own custom environment.
Why use conda? What are its advantages?¶
Quite often Python packages are installed with Pip from the Python Package Index (PyPI). These packages contain Python code and in many cases some compiled code as well.
However, there are three problems pip cannot solve without additional tools:
How do you install multiple separate suites of packages for different use cases?
How do you handle packages that depend on some external libraries?
How do you make sure that all of the packages are compatible with each other?
Conda tries to solve these problems in the following ways:
Conda creates environments where packages are installed. Each environment can be activated separately.
Conda installs library dependencies to the environment with the Python packages.
Conda uses a solver engine to figure out whether packages are compatible with each other.
Conda also caches installed packages, so creating copies of similar environments does not use additional space.
One can also use the environment files to make the installation procedure more reproducible.
Creating an environment with CUDA toolkit¶
NVIDIA’s CUDA-toolkit is needed for working with NVIDIA’s GPUs. Many Python frameworks that work on GPUs need to have a supported CUDA toolkit installed.
Conda is often used to provide the CUDA toolkit and additional libraries such as cuDNN. However, one should choose the version of the CUDA toolkit based on what the software requires.
If the package is installed from a conda channel such as conda-forge, conda will automatically retrieve the correct version of the CUDA toolkit.
If the code requires manual compilation with CUDA, one should check the advanced documentation on Compiling CUDA code while using conda environments.
In other cases one can use an environment file like this cuda-env.yml:
name: cuda-env
channels:
- conda-forge
dependencies:
- cudatoolkit
Hint
During installation conda will try to verify the maximum CUDA version that the installed graphics cards support, and by default it will install non-CUDA-enabled versions if no GPUs are found (as is the case on the login node, where environments are normally built). This can usually be overcome by explicitly specifying that the packages should be the CUDA-enabled ones. It might however happen that the environment creation process aborts with a message similar to:
nothing provides __cuda needed by tensorflow-2.9.1-cuda112py310he87a039_0
In this instance it might be necessary to override the CUDA settings used by conda/mamba. To do this, prefix your environment creation command with CONDA_OVERRIDE_CUDA=CUDAVERSION, where CUDAVERSION is the CUDA toolkit version you intend to use, as in:
CONDA_OVERRIDE_CUDA="11.2" mamba env create -f cuda-env.yml
This will allow conda to assume that the respective CUDA libraries will be present at a later point and so it will skip those requirements during installation.
For more information, see this helpful post in Conda-Forge’s documentation.
Creating an environment with GPU enabled Tensorflow¶
To create an environment with GPU-enabled Tensorflow you can use an environment file like this tensorflow-env.yml:
name: tensorflow-env
channels:
- conda-forge
dependencies:
- tensorflow=*=*cuda*
Here we install the latest tensorflow from the conda-forge channel with an additional requirement that the build version of the tensorflow package must contain a reference to a CUDA toolkit. For a specific version, replace the =*=*cuda* with e.g. =2.8.1=*cuda* for version 2.8.1.
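If you are unsure which CUDA-enabled builds are available, you can first list them with a search (the pattern below is just an illustration):
# List tensorflow builds in conda-forge whose build string mentions CUDA
mamba search --channel conda-forge 'tensorflow=*=*cuda*'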
If you encounter errors related to CUDA while creating the environment, do note this hint on overriding CUDA during installation.
Creating an environment with GPU enabled PyTorch¶
To create an environment with GPU-enabled PyTorch you can use an environment file like this pytorch-env.yml:
name: pytorch-env
channels:
- nvidia
- pytorch
- conda-forge
dependencies:
- pytorch
- pytorch-cuda=11.7
- torchvision
- torchaudio
Here we install the latest pytorch version from the pytorch channel together with the pytorch-cuda metapackage, which makes certain that the CUDA-enabled build of pytorch is installed. Additional packages required by pytorch are installed from the conda-forge channel.
If you encounter errors related to CUDA while creating the environment, do note this hint on overriding CUDA during installation.
Installing numpy with Intel MKL enabled BLAS¶
NumPy and other mathematical libraries utilize a BLAS (Basic Linear Algebra Subprograms) implementation for speeding up many operations. Intel provides its own fast BLAS implementation in Intel MKL (Math Kernel Library). When using Intel CPUs, this library can give a significant performance boost to mathematical calculations.
One can install this library as the default BLAS by specifying blas * mkl as a requirement in the dependencies, like in this mkl-env.yml:
name: mkl-env
channels:
- conda-forge
dependencies:
- blas * mkl
- numpy
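After creating and activating the environment, you can check which BLAS implementation NumPy was linked against, for example:
# numpy.show_config() prints the BLAS/LAPACK libraries in use
python -c "import numpy; numpy.show_config()"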
Advanced usage¶
Finding available packages¶
Because conda tries to make certain that all packages in an environment are compatible with each other, there are usually tens of different versions of a single package.
One can search for a package from a channel with the following command:
mamba search --channel conda-forge tensorflow
This will return a long list of packages where each line looks something like this:
tensorflow 2.8.1 cuda112py39h01bd6f0_0 conda-forge
Here we have:
- The package name (tensorflow).
- Version of the package (2.8.1).
- Package build version. This version often contains information on:
  - Python version needed by the package (py39, or Python 3.9).
  - Other libraries used by the package (cuda112, or CUDA 11.2).
- Channel where the package comes from (conda-forge).
Checking package dependencies¶
One can check package dependencies by adding the --info flag to the search command. This can give a lot of output, so it is a good idea to limit the search to one specific package:
mamba search --info --channel conda-forge tensorflow=2.8.1=cuda112py39h01bd6f0_0
The output looks something like this:
tensorflow 2.8.1 cuda112py39h01bd6f0_0
--------------------------------------
file name : tensorflow-2.8.1-cuda112py39h01bd6f0_0.tar.bz2
name : tensorflow
version : 2.8.1
build : cuda112py39h01bd6f0_0
build number: 0
size : 26 KB
license : Apache-2.0
subdir : linux-64
url : https://conda.anaconda.org/conda-forge/linux-64/tensorflow-2.8.1-cuda112py39h01bd6f0_0.tar.bz2
md5 : 35716504c8ce6f685ae66a1d9b084fc7
timestamp : 2022-05-21 09:09:53 UTC
dependencies:
- __cuda
- python >=3.9,<3.10.0a0
- python_abi 3.9.* *_cp39
- tensorflow-base 2.8.1 cuda112py39he716a45_0
- tensorflow-estimator 2.8.1 cuda112py39hd320b7a_0
Packages with underscores are meta-packages that should not be added to conda environment specifications. They will be solved by conda automatically.
Here we can see more info on the package, including its dependencies.
When using mamba, one can also use mamba repoquery depends to see the dependencies:
mamba repoquery depends --channel conda-forge tensorflow=2.8.1=cuda112py39h01bd6f0_0
Output looks something like this:
Name Version Build Channel
─────────────────────────────────────────────────────────────────────────────
tensorflow 2.8.1 cuda112py39h01bd6f0_0 conda-forge/linux-64
__cuda >>> NOT FOUND <<<
python 3.9.9 h62f1059_0_cpython conda-forge/linux-64
python_abi 3.9 2_cp39 conda-forge/linux-64
tensorflow-base 2.8.1 cuda112py39he716a45_0 conda-forge/linux-64
tensorflow-estimator 2.8.1 cuda112py39hd320b7a_0 conda-forge/linux-64
One can also print the full dependency list with mamba repoquery depends --tree. This will produce a really long output.
mamba repoquery depends --tree --channel conda-forge tensorflow=2.8.1=cuda112py39h01bd6f0_0
Fixing conflicts between packages¶
Usually the first step in fixing conflicts between packages is to write a new environment file and list all required packages in it as dependencies. A fresh solve of the environment can often result in a working environment.
Sometimes a single package does not support a specific version of Python or a specific version of the CUDA toolkit. In these cases it is usually beneficial to give the solver more flexibility by limiting the number of pinned versions.
One can also use the search commands provided by mamba to see what dependencies individual packages have.
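One can also test whether a set of packages is solvable at all before writing it into an environment.yml, for example with a dry-run solve (the package set below is only an illustration):
# Try to solve the environment without actually creating it
conda create --dry-run --name test-solve --channel conda-forge python=3.10 numpy scipy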