Conda Workshop Additional Resources

Using the binder

What happens if I get a 502, 503, or 504 error?

Try clicking on the launch button again. The binder or internet connection may have timed out.

How do I save work from a binder?

Be sure to save any work/notes you took in the binder to your computer. Any new files/changes are not available when the binder session is closed!

For example, select a file, click "More", click "Export", to save individual files to your computer:

I can't copy/paste code to the RStudio Terminal. How do I run commands?

On some versions of Windows, the copy/paste function may not work in the Terminal panel.

Instead, change the code language in the Source panel to Shell. Then, copy/paste code and run line by line from Source.

For the workshop, you should be able to run commands in the Source panel with the workshop_commands.sh file.

How do I make a binder like the one used in the workshop?

We used a Pangeo Binder for the workshop: https://binder.pangeo.io/. Full steps here.

To make a binder, you need a GitHub account and a GitHub repository to hold the files you want to use in the binder.

The repo should have a folder called binder. In this folder, you add files that specify the software environment for the binder. For example, the workshop's binder used this environment.yml file (with conda!) to install R so we could use the RStudio interface:

channels:
    - conda-forge
    - bioconda
    - defaults
dependencies:
    - r-base
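
As a minimal sketch, the folder and file described above could be created from a terminal inside a local copy of the repo. The file content is copied from the workshop's environment.yml; running it in your own repo's top-level directory is an assumption.

```shell
# Create the folder Binder looks in for environment files.
mkdir -p binder

# Write the same environment.yml used by the workshop's binder.
cat > binder/environment.yml <<'EOF'
channels:
    - conda-forge
    - bioconda
    - defaults
dependencies:
    - r-base
EOF

# Show the result so it can be checked before committing.
cat binder/environment.yml
```

After this, the binder folder still needs to be committed and pushed to GitHub like any other file in the repo.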

After adding all the files you want in the binder, go to the Pangeo Binder website and enter the GitHub repo URL. Optionally, enter the branch Pangeo should build from if it is not the main branch. If you want an RStudio interface, change the "Path to a notebook file (optional)" dropdown from "File" to "URL" and type rstudio so the binder opens RStudio. Then click "launch" and wait for the binder to build. Copy the binder badge text to create a clickable launch button.
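
The form on the website builds a launch link for you. As a rough sketch of what that link looks like, assuming the Pangeo hub follows the usual BinderHub URL pattern, it can also be composed by hand. The user, repo, and branch names here are hypothetical placeholders.

```shell
# Hypothetical placeholders: replace with your own account, repository, and branch.
gh_user="your-username"
gh_repo="your-repo"
branch="main"

# Assumed BinderHub pattern: <hub>/v2/gh/<user>/<repo>/<branch>.
# The urlpath query asks the binder to open RStudio instead of the default interface.
launch_url="https://binder.pangeo.io/v2/gh/${gh_user}/${gh_repo}/${branch}?urlpath=rstudio"

echo "$launch_url"
```

If the link printed here differs from the one the website shows for your repo, trust the website.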

Here are some examples of RStudio binders: https://github.com/binder-examples/r-conda

You can also make binders that open with Jupyter notebooks or terminal interfaces!

Conda channels

conda-forge and bioconda are channels that contain community-contributed software
Bioconda specializes in bioinformatics software (supports only 64-bit Linux and macOS)
- package list: https://anaconda.org/bioconda/repo
conda-forge contains many dependency packages
- package list: https://anaconda.org/conda-forge/repo
In the absence of other channels, conda searches the default repository, which consists of ten official repositories.
You can even install R packages with conda!
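
A minimal sketch of adding these channels from a terminal follows. It assumes you want the same priority order as the workshop's YAML files (conda-forge first), and it only touches your conda configuration if conda is actually installed.

```shell
# Channels in the order used by the workshop's YAML files (highest priority first).
channels="conda-forge bioconda defaults"

# `conda config --add` puts each new channel at the top of the list,
# so add them in reverse order to end up with conda-forge on top.
reversed=""
for ch in $channels; do
    reversed="$ch $reversed"
done

if command -v conda >/dev/null 2>&1; then
    for ch in $reversed; do
        conda config --add channels "$ch"
    done
    # Print the resulting channel list to confirm the order.
    conda config --show channels
else
    echo "conda not found; would add (in this order): $reversed"
fi
```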

Getting started on your system

Whether you're installing conda on your own computer, a cloud instance, or a high-performance computing server, you'll need to consider the following:

Conda installer: Miniconda vs Anaconda
- Both are free, with Miniconda being the lightweight version
OS: Windows, macOS, Linux
Bit-count: 32 vs 64-bit
- macOS is 64-bit only
Python version for root environment (2.x vs 3.x)
- version 3.x is the default option since it is newer
- choose version 2.7 if most of your code targets 2.7 or you use packages that do not have a 3.x version (but keep in mind that Python 2 reached end of life on January 1, 2020: https://www.python.org/doc/sunset-python-2/)

We have a lesson on installing Miniconda on macOS: Click here for lesson. You can also find installation guidelines in the conda user guide: Click here for user guide.

The installer sets up two things: conda and the root environment. The root environment contains the selected Python version and some basic packages.

Image credit: Gergely Szerovay

Conda distributions

Conda package distributions

How can I make sure the latest version of a chosen software package is always installed?

To ensure conda installs the latest version of a package from any listed channel, set conda's channel priority to false:

conda config --set channel_priority false

Doing so allows the package version to take precedence over channel priority followed by package build number.

Package version > channel priority > package build number
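
To make that ordering concrete, here is a toy sketch that ranks a few made-up package candidates the same way: newest version first, then channel rank, then highest build number. It is only an illustration of the ordering, not how conda's solver is implemented, and the version sort flag assumes GNU sort or a compatible sort.

```shell
# Toy candidates: version, channel rank (0 = highest-priority channel), build number, channel.
# All values are made up for illustration.
candidates="1.2.0 2 0 defaults
1.1.0 0 3 conda-forge
1.2.0 1 1 bioconda
1.2.0 1 0 bioconda"

# Sort by version (newest first), then channel rank (lowest first),
# then build number (highest first), and keep the top candidate.
best=$(printf '%s\n' "$candidates" | sort -k1,1Vr -k2,2n -k3,3nr | head -n 1)

echo "$best"
```

Here the bioconda build of version 1.2.0 wins even though conda-forge ranks higher, because version is compared first. With channel priority left on, the older conda-forge package would be preferred instead.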

How do I recreate an environment with exactly the same versions of all software packages?

Continuing from the 3 methods of creating and installing software in conda environments in the workshop, here is a 4th method.

For this approach, we export a list of the exact software package versions installed in a given environment and use it to set up new environments. This setup method won't necessarily install the latest version of a given program, but it will replicate the exact environment you exported from.

Method 4: Install exact environment

conda activate fqc
conda list --export > packages.txt
conda deactivate

Two options -

install the exact package list into an existing environment:

conda install --file=packages.txt

set up a new environment with the exact package list:

conda env create --name qc_file --file packages.txt
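
One way to check that a recreated environment really matches is to compare two exported package lists. The sketch below uses a small hypothetical helper, same_packages, and two made-up sample files standing in for real `conda list --export` output.

```shell
# Hypothetical helper: succeed if two export files list the same packages,
# ignoring the '#' comment header lines and the order of entries.
same_packages() {
    a=$(grep -v '^#' "$1" | sort)
    b=$(grep -v '^#' "$2" | sort)
    [ "$a" = "$b" ]
}

# Made-up sample files in the name=version=build format of `conda list --export`.
printf '%s\n' '# platform: linux-64' 'fastqc=0.11.9=0' 'trimmomatic=0.39=1' > original.txt
printf '%s\n' 'trimmomatic=0.39=1' 'fastqc=0.11.9=0' > copy.txt

if same_packages original.txt copy.txt; then
    echo "environments match"
else
    echo "environments differ"
fi
```

With real environments, the two files would come from running `conda list --export` in each environment, as in the export step above.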

How do I install R packages with conda?

Why would you install R packages with conda rather than install.packages()? Here's one use case (a comment from this blog):

Everytime I start a new project, I create a new conda environment with the R packages I am going to need. I can add more if I need them later on. It saves me from the headaches of having to deal with one unique installation of R and all the dependencies I need for the diferent projects I am working on.

This is a fantastic blog/tutorial on this topic: https://alexanderlabwhoi.github.io/post/anaconda-r-sarah/

As an example, you can open the workshop binder:

click the Terminal tab
set up/initialize conda and add channels as we did in the workshop
create a directory for our test files:

mkdir test
cd test

then create a .yml file. You can use the RStudio Source panel to make a new file and save it as renv.yml (make sure it is saved as a .yml file and not as an R script). The YAML file should have the channels and dependencies below. We are only installing the R package ggplot2 for this example, but experiment with adding others! Note that conda will automatically install any dependencies that ggplot2 needs.

name: r-test
channels:
    - conda-forge
    - bioconda
    - defaults
dependencies:
    - r-ggplot2
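
If the Source panel keeps saving the file as an R script, an alternative is to write renv.yml from the Terminal. This is a minimal sketch with the same content as the YAML above; it assumes you are already in the test directory.

```shell
# Write renv.yml directly from the terminal.
cat > renv.yml <<'EOF'
name: r-test
channels:
    - conda-forge
    - bioconda
    - defaults
dependencies:
    - r-ggplot2
EOF

# Check the file before creating the environment from it.
cat renv.yml
```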

create and activate the environment

conda env create -f renv.yml
conda activate r-test

open R in the terminal

in the R shell, type the following

library(ggplot2)
df <- data.frame(x = 1:10, y = 41:50)
# It's OK if you get an X11 error here: that's the terminal saying it can't display the plot.
# We save the plot and then view the file, so this isn't an issue!
ggplot(df, aes(x, y)) + geom_point()
ggsave('testplot.jpg', height = 5, width = 5, units = 'in', dpi = 400)

on the binder file panel, open the jpg file!

quit() #to exit the R interface
# type 'n' - do not need to save the workspace image

Beyond Conda...Mamba!

Conda is very popular and well documented, but it can be slow when resolving compatibility among many software packages. Mamba is a faster reimplementation of conda.

This command installs mamba into the base environment, so it's available across environments:

conda install -n base -c conda-forge mamba

If using mamba, the commands usually replace conda with mamba, for example:

mamba env create -f environment.yml

However, environments are still activated using conda activate.
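
For scripts that may run on machines with or without mamba, one option is to pick the command at run time. A minimal sketch follows; the variable name conda_cmd is hypothetical, and the command is printed rather than run so it is safe to try anywhere.

```shell
# Prefer mamba when it is installed; otherwise fall back to conda.
if command -v mamba >/dev/null 2>&1; then
    conda_cmd=mamba
else
    conda_cmd=conda
fi

# Print the command that would be run to create the environment.
echo "$conda_cmd env create -f environment.yml"
```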

Conda & Snakemake

Snakemake is a Python-based workflow management system for creating reproducible and scalable pipelines. One neat feature of Snakemake is its integration with Conda. You can easily set up multiple Conda environments for each Snakemake file. Here are two examples of this combination:

Isolated conda environments for each analysis step

A good way to make a workflow reproducible is to use a Snakemake file that documents the different steps as rules and creates isolated environments with the necessary software for those analysis steps. This example workflow for variant calling runs from downloading data files all the way through generating a VCF file with variants. Note the different environment files in YAML format used for the different steps: https://github.com/ngs-docs/2021-ggg-201b-lab-hw1/blob/main/Snakefile
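
The pattern of pairing a rule with its own environment file can be sketched as follows. This is not taken from the linked Snakefile: the rule name, file paths, and the choice of FastQC are all hypothetical, and the shell block only writes the two files so you can inspect them.

```shell
mkdir -p envs

# Hypothetical environment file for a single quality-control step.
cat > envs/fastqc.yml <<'EOF'
channels:
    - conda-forge
    - bioconda
    - defaults
dependencies:
    - fastqc
EOF

# Hypothetical Snakefile with one rule that runs inside that environment.
cat > Snakefile <<'EOF'
rule fastqc:
    input:
        "data/sample.fastq.gz"
    output:
        "qc/sample_fastqc.html"
    conda:
        "envs/fastqc.yml"
    shell:
        "fastqc {input} --outdir qc"
EOF

cat Snakefile
```

Running `snakemake --use-conda --cores 1` in that directory would then ask Snakemake to build the environment from envs/fastqc.yml before running the rule, provided the input file exists.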

Benchmarking a software across different versions

The sourmash software package identifies DNA sequence similarity between very large datasets and supports k-mer based taxonomic exploration and classification. Here, one of the developers of the software uses a Snakemake file with rules for the benchmarking functions, along with multiple environments in YAML format for the different versions of sourmash: https://github.com/luizirber/sourmash_resources
