3. Tool and package management

Here we tackle the first item on the list toward better reproducibility:

Keeping track of the tools used and their versions.

This can be achieved by using an appropriate package manager. As an example we are going to use conda with the Bioconda software channel.

3.1. Installing the Conda package manager

We will use the package and tool management system conda to install the programs used during the course. It is not installed by default, so we need to install it first.

# download latest conda installer (-L follows redirects)
$ curl -L -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

# run the installer
$ bash Miniconda3-latest-Linux-x86_64.sh

# delete the installer after successful run
$ rm Miniconda3-latest-Linux-x86_64.sh

Note

Should the conda installer download fail, please find links to alternative locations on the Downloads page.

3.1.1. Update .bashrc and .zshrc config-files

Before we are able to use conda we need to tell our shell where it can find the program. We add the right path to the conda installation to our shell config files:

# $HOME is used instead of ~ because ~ is not expanded inside double quotes
$ echo 'export PATH="$HOME/miniconda3/bin:$PATH"' >> ~/.bashrc
$ echo 'export PATH="$HOME/miniconda3/bin:$PATH"' >> ~/.zshrc

So what is actually happening here? We are appending a line to a file (either .bashrc or .zshrc). Whenever a new command-line shell starts, it executes one of these files first (.bashrc for bash, .zshrc for zsh). The appended line permanently puts the directory ~/miniconda3/bin first in your PATH variable. The PATH variable contains the directories in which the shell looks for installed programs, one directory after the other, until the requested program is found (if it is not found, the shell complains). By adding the above line, we make sure that the program conda can be found every time we open a new shell.
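The lookup order described above can be inspected directly. The following is a minimal sketch that needs no conda installation; the command name ls is only an example, and the variable name ls_location is chosen for illustration.

```shell
# Print each PATH entry on its own line, in the order the shell searches them.
printf '%s\n' "$PATH" | tr ':' '\n'

# Ask the shell which file it would run for a given command name.
# "ls" is only an example; any installed program works.
ls_location=$(command -v ls)
echo "ls resolves to: $ls_location"
```

The first directory in the list that contains a matching program wins, which is why the conda directory is placed at the front.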

Close the shell/terminal and open a new one. Now we should be able to use the conda command:

$ conda update conda
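If the conda command is not found, it can help to check whether the shell is able to locate it at all. A small sketch, assuming the default install location used in this tutorial (the variable name conda_path is chosen for illustration):

```shell
# Check whether the shell can locate conda after the PATH change.
if conda_path=$(command -v conda); then
    echo "conda found at: $conda_path"
    conda --version
else
    conda_path=""
    echo "conda not found; check that ~/miniconda3/bin is on PATH" >&2
fi
```

If the second branch is taken, the usual causes are a shell that was not restarted or an installation directory other than ~/miniconda3.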

3.1.2. Installing conda channels to make tools available

Different tools are packaged in what conda calls channels. We need to add some channels to make the bioinformatics and genomics tools available for installation. In particular, we need the Bioconda channel, which pre-packages many bioinformatics tools.

# Install some conda channels
# A channel is where conda looks for packages
$ conda config --add channels defaults
$ conda config --add channels conda-forge
$ conda config --add channels bioconda
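To confirm the result, the configured channel list can be printed. This is a sketch that runs only when conda is on the PATH; the variable channel_check exists purely for illustration.

```shell
# --add places a channel at the top of the list, so the channel added
# last (here bioconda) ends up with the highest priority.
if command -v conda >/dev/null 2>&1; then
    conda config --show channels
    channel_check="shown"
else
    echo "conda is not on PATH; open a new shell and try again" >&2
    channel_check="skipped"
fi
```

The order of the printed list is the order in which conda prefers the channels when the same package is available from more than one.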

3.2. Using conda to search and install tools

Let us first look for a tool, e.g. the aligner BWA:

# Look for available tools/packages
$ conda search bwa
Loading channels: done
# Name                  Version           Build  Channel
bwa                       0.5.9               0  bioconda
bwa                       0.5.9               1  bioconda
bwa                       0.6.2               0  bioconda
bwa                       0.6.2               1  bioconda
bwa                      0.7.3a               0  bioconda
...

We can see that the tool is available and several versions can be installed. To install software (here BWA) using conda, one uses the command conda install:

# install a tool into the environment
$ conda install bwa
# to install a particular version of a tool do
$ conda install bwa=0.6.2
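Before pinning a version, it can be handy to condense the search output to the distinct versions on offer. The sketch below works on the (truncated) search listing shown earlier, copied into a variable so that it runs without conda; the variable names are chosen for illustration.

```shell
# Search output copied from the listing above (truncated there with "...").
search_output='Loading channels: done
# Name                  Version           Build  Channel
bwa                       0.5.9               0  bioconda
bwa                       0.5.9               1  bioconda
bwa                       0.6.2               0  bioconda
bwa                       0.6.2               1  bioconda
bwa                      0.7.3a               0  bioconda'

# Keep rows whose first column is the package name and print each
# version once, in the order it first appears.
versions=$(printf '%s\n' "$search_output" | awk '$1 == "bwa" && !seen[$2]++ {print $2}')
echo "$versions"
```

In practice the same awk filter could be fed directly from conda search bwa instead of a copied listing.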

Note

Without a version number conda tries to install the latest version for you.

Conda was not originally developed for bioinformatics/genomics tools, but clever people took the system and packaged bioinformatics tools for it. To keep related software together and separate from the original conda packages, people distribute it through “channels”. We already made three software “channels” available to our conda installation: conda-forge, defaults, and bioconda. The Bioconda channel is of particular importance to us, as it makes ~3000 bioinformatics packages available [GRUENING2017].

3.3. Create isolated environments

While having one software manager for all your bioinformatics tools is great already, conda has one particular strength that we are going to exploit often during the course of this tutorial. Conda can create isolated environments for sets of user-defined tools. The tools and their version numbers within environments, once created, can be easily saved in a file. Using these files we can easily re-create an environment from scratch with the same tool-set with the same version numbers. Awesome!

# Create a new environment named "tutorial"
$ conda create -n tutorial python=3
# Activate the environment
$ conda activate tutorial

So what happens when you type conda activate tutorial in a shell? The PATH variable (mentioned above) is temporarily changed to:

$ conda activate tutorial
# Let's look at the content of the PATH variable
(tutorial) $ echo $PATH
~/miniconda3/envs/tutorial/bin:~/miniconda3/bin:/usr/local/bin: ...

Now the shell looks first in your environment’s bin-directory (here ~/miniconda3/envs/tutorial/bin) and only afterwards in the general conda bin-directory (~/miniconda3/bin). So everything you install with conda install outside of an environment is still available to you, but it is shadowed if a program of the same name exists in ~/miniconda3/envs/tutorial/bin and you have activated the tutorial environment.
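The shadowing effect can be reproduced without conda. The following sketch builds two throwaway directories that stand in for the general bin-directory and the environment's bin-directory; the program name hello and all variable names are hypothetical.

```shell
# Build two throwaway "bin" directories that both contain a program
# called "hello" (a hypothetical stand-in for a tool such as bwa).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/base/bin" "$demo_dir/env/bin"
printf '#!/bin/sh\necho "hello from base"\n' > "$demo_dir/base/bin/hello"
printf '#!/bin/sh\necho "hello from env"\n'  > "$demo_dir/env/bin/hello"
chmod +x "$demo_dir/base/bin/hello" "$demo_dir/env/bin/hello"

# Only the base directory is on PATH: the base copy runs.
# (The command substitution is a subshell, so PATH is changed only inside it.)
before=$(export PATH="$demo_dir/base/bin:$PATH"; hello)

# "Activating" prepends the environment directory, so its copy wins.
after=$(export PATH="$demo_dir/env/bin:$demo_dir/base/bin:$PATH"; hello)

echo "before activation: $before"
echo "after activation:  $after"
rm -rf "$demo_dir"
```

Both copies remain installed throughout; only the search order decides which one runs.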

Note

To tell if you are in the correct conda environment, look at the command-prompt. Do you see the name of the environment in round brackets at the very beginning of the prompt, e.g. (tutorial)? If not, activate the tutorial environment with conda activate tutorial before installing the tools.

To leave an environment just type:

(tutorial) $ conda deactivate
# Let's look at the content of the PATH variable
$ echo $PATH
~/miniconda3/bin:/usr/local/bin: ...

The command conda list will show you the packages that are installed within the environment:

$ conda activate tutorial
# list all installed packages
(tutorial) $ conda list

The tool bwa appears in this list if it was installed while the tutorial environment was active.

Ok, now we want to get a snapshot of the current environment so that we could recreate it either here or on another machine running the same operating system.

# Let's export the environment into a yaml-file
(tutorial) $ conda env export > tutorial.yaml

Let's have a look at the tutorial.yaml file.

(tutorial) $ cat tutorial.yaml
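An exported file lists each package as name=version=build, which makes it easy to look up what exactly was installed. The sketch below uses a hypothetical sample file rather than a real export: the python version and build string are made up for illustration, and the file is written to a temporary location.

```shell
# Hypothetical sample of what an exported file can look like; the python
# version and build string are made up for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
name: tutorial
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - bwa=0.6.2=0
  - python=3.9.0=example_build
prefix: /home/user/miniconda3/envs/tutorial
EOF

# Each dependency line has the form "- name=version=build".
# Pull out the pinned version of one package (here bwa).
bwa_version=$(awk -F'=' '$1 ~ /^ *- bwa$/ {print $2}' "$sample")
echo "bwa is pinned to version $bwa_version"
rm -f "$sample"
```

Note the prefix line at the end of the sample: it records where the environment lives on the exporting machine and is therefore specific to that machine.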

To deactivate the environment again type:

# Deactivate environment
(tutorial) $ conda deactivate

Now we delete the environment, specifying the name again with -n tutorial:

# Delete original "tutorial" environment
$ conda env remove -n tutorial

Now we can use the created yaml-file to recreate the former tutorial environment. We pass the file with --file to conda env create.

# Let's recreate an environment using the tutorial.yaml file
$ conda env create -n tutorial --file tutorial.yaml

# Activate environment
$ conda activate tutorial
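One way to check that a recreated environment matches the original is to export it again and compare the two files. The sketch below shows the comparison step on two hypothetical files; same_env is a helper name invented here, and the machine-specific prefix line is ignored.

```shell
# Compare two environment files while ignoring the machine-specific
# "prefix:" line. same_env is a hypothetical helper name.
same_env() {
    [ "$(grep -v '^prefix:' "$1")" = "$(grep -v '^prefix:' "$2")" ]
}

# Two hypothetical files that differ only in their prefix line.
file_a=$(mktemp)
file_b=$(mktemp)
printf 'name: tutorial\ndependencies:\n  - bwa=0.6.2=0\nprefix: /home/a/miniconda3/envs/tutorial\n' > "$file_a"
printf 'name: tutorial\ndependencies:\n  - bwa=0.6.2=0\nprefix: /home/b/miniconda3/envs/tutorial\n' > "$file_b"

if same_env "$file_a" "$file_b"; then
    result="identical"
else
    result="different"
fi
echo "environments are $result"
rm -f "$file_a" "$file_b"
```

With a real environment, the second file would come from running conda env export again after recreating the environment.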

Done! So we learned that we can create conda environments for a certain tool or toolset/packages and store the installed tools and their installed version numbers in a yaml-file that can be used to recreate the environment. This enables us in a very easy way to keep track of the tools and versions used in our analysis.

Note

It is good practice to include a yaml file of your environment in your analysis directory and submit it together with the rest of your code.

3.4. General Conda commands

# to search for packages
conda search PACKAGE

# Install
conda install PACKAGE

# To update all packages
conda update --all --yes

# List all packages installed
conda list [-n ENV]

# List all environments
conda env list

# create new environment with packages
conda create -n ENV PACKAGE [PACKAGE] ...

# activate environment
conda activate ENV

# deactivate environment
conda deactivate

# export env
conda env export > env.yaml

# recreate env from file
conda env create -n ENV -f env.yaml