
Introduction to Workflows: Snakemake Part I & II May 12 & 14th, 2021


Day 1

Wed, May 12, 2021, from 09:00am - 11:00am PDT

Instructor: Dr. C Titus Brown

Moderator: Jose Sanchez

Resource Links Discussed

Description

Learn to create reproducible and scalable data analyses using the Snakemake workflow management system. Snakemake is a text-based workflow system in which workflows are defined in terms of rules that connect input and output files. Snakemake offers many desired features such as flexibility to scale without modification, integration with package manager Conda and container engine Singularity, and compatibility with cloud computing.

This free two-part workshop series will cover the basics of Snakemake including setup, installation, building, and running an example workflow along with a discussion on advanced features such as customizing computational threads, use of config file, and use of arbitrary parameters.

💻 Open a web browser - e.g., Firefox, Chrome, or Safari. Open the binder by clicking this button: Binder

Hello!

We are part of the training and engagement team for the NIH Common Fund Data Ecosystem, a project supported by the NIH to increase data reuse and cloud computing for biomedical research.

Workshop structure and plan

Day 1:

  • Introduction to workflows
  • Introduction to snakemake
  • Structure of the Snakefile
  • Linking snakemake rules (decorating the Snakefile)

Day 2:

  • Continue decorating the Snakefile
  • Running lots of rules all at once
  • Open Q&A time

How to ask questions

If you have questions at any point,

  • Drop them in the chat, or
  • Direct messages to the moderator or helpers are welcome, or
  • Unmute yourself and ask during the workshop

We're going to use the "raise hand" reaction in Zoom to make sure people are on board during the hands-on activities.

Some background

  • What is a workflow?
  • Why use workflows?
  • What and why Snakemake?

The Setup

  • Open a web browser - e.g., Firefox, Chrome, or Safari. Open the binder by clicking this button: Binder

(Orientation to the RStudio panels; this is just a user interface, and is not critical for running Snakemake!)

  • Click on "Terminal"
  • Shorten the command prompt to $ by typing this:
    echo "PS1='\w $ '" >> .bashrc
  • Open a new terminal to see the results

Link to tutorial with a slightly different binder experience

  • This binder comes with Conda installed. To use Conda on your local machine, you will need to install it if you don't already have it.
  • Let's look at the contents of the environment.yml file which specifies all the necessary dependencies installed in the binder. Use the point and click interface.

Note that the base conda environment in the binder is defined by the environment.yml specifications. On your local machine, you can create a separate conda environment on top of your base environment with the same specification:

# create a virtual environment snaketest 
conda env create -n snaketest -f environment.yml

# activate the virtual environment
conda activate snaketest
  • Check if the setup was successful:

    samtools --version

The Snakefile

In this workshop, we will walk through the basic steps for creating a variant calling workflow with the Snakemake workflow management system.

Open the Snakefile by clicking on it. It will open up in your text editor tab. The Snakefile is a collection of rules. Each rule has this structure:

rule rule_name:
    shell:
        # a single-line command is enclosed in quotes, "
        # multi-line commands can go in triple quotes """
        "command"
  • Let's run a rule:
snakemake -j 2 -p map_reads

Did it run? Why or why not?

Now let's try another rule, the very first rule:

snakemake -j 2 -p download_data

In this case, the shell command downloads the raw read file from a public repository on osf.io. The -p flag tells Snakemake to print the shell commands that it runs. The -j 2 tells Snakemake to run up to two tasks at the same time - this will not become important until Friday, but we still have to give Snakemake a number!

Did it work? Let's ls to see what we downloaded, or look at the Files tab in RStudio.

Next, let's run more rules sequentially:

snakemake -j 2 -p download_data

snakemake -j 2 -p download_genome

snakemake -j 2 -p uncompress_genome

snakemake -j 2 -p index_genome_bwa

snakemake -j 2 -p map_reads

Check the working directory again. The directory is now populated with many output files, including the reference genome (.fa), the genome index (.fa.sa, .fa.amb, etc.), and the mapped reads (.sam).

Link to Tutorial

Decorating the Snakefile

It gets tedious to run each command individually, and we can do that already without Snakemake! What if we want to run all the commands at once?

By defining the inputs and outputs for each rule's command(s), Snakemake can figure out how the rules are linked together. The rule structure will now look something like this, where input:, output:, and shell: are Snakemake directives:

rule rule_name:

    input:
        # input file names must be enclosed in quotes
        # multiple inputs should be separated by commas
        # the new line for each input is optional
        "input file 1",
        "input file 2",
        "input file 3"
    output:
        # output file names must be enclosed in quotes
        # multiple outputs should be separated by commas
        "output file 1",
        "output file 2"
    shell:
        # for multi-line commands
        # commands must be enclosed in triple quotes
        """
        command_1
        command_2
        """

Here, Snakemake interprets the input: and output: sections as Python code, and the shell: section as the bash code that gets run on the command line.
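Because the input: and output: sections are Python, you can use Python expressions in them. As a small illustration (a variation on the index_genome_bwa rule, shown only as an example and not something you need to type in), a variable and a call to Snakemake's expand() function - which we will meet again on Friday - are both valid:

# illustration only: Python expressions are allowed in input/output
GENOME = "ecoli-rel606.fa"

rule index_genome_bwa:
    input: GENOME
    output: expand(GENOME + ".{ext}", ext=["pac", "ann", "amb", "bwt", "sa"])
    shell:
        "bwa index {input}"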

Let's add inputs and outputs to the Snakefile!

(follow Titus' lead and delete all the output files from before)

Decorating the first snakemake rule

  • Add output to rule download_data and remove redundancy:
rule download_data:
    output: "SRR2584857_1.fastq.gz"
    shell:
        "wget https://osf.io/4rdza/download -O {output}"

What practical impact does this have? Hint: try running download_data twice with

snakemake -j 2 -p download_data

Now let's add an output: block to the download_genome rule, too!
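For reference, the decorated download_genome rule ends up looking like this (it also appears in the Day 1 Snakefile at the end of these notes):

rule download_genome:
    output: "ecoli-rel606.fa.gz"
    shell:
        "wget https://osf.io/8sm92/download -O {output}"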

Question: why don't these have inputs?


Exercise 1

Add an input and output to the uncompress_genome rule

Hint: The uncompress step gets rid of the .gz extension, so the output filename should look identical to the input, just without the .gz.
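One possible solution (it matches the uncompress_genome rule in the Day 1 Snakefile at the end of these notes - try it yourself before peeking!):

rule uncompress_genome:
    input: "ecoli-rel606.fa.gz"
    output: "ecoli-rel606.fa"
    shell:
        "gunzip {input}"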


Let's continue decorating the Snakefile with inputs and outputs.

Inputs for Rule map_reads:

rule map_reads:
    input:
        "ecoli-rel606.fa",
        "SRR2584857_1.fastq.gz",
        "ecoli-rel606.fa.pac",
        "ecoli-rel606.fa.ann",
        "ecoli-rel606.fa.amb",
        "ecoli-rel606.fa.bwt",
        "ecoli-rel606.fa.sa"
    output: "SRR2584857.sam"
    shell:
        "bwa mem -t 4 ecoli-rel606.fa SRR2584857_1.fastq.gz > {output}"

Discussion points for Friday:

  • Adding inputs and outputs to all the rules.
  • How are the rules linked?
  • How to link the rules to run them together?

End of Day 1

For a recap of what you learned today, please watch the 14-minute vidlet linked on our tutorial page before the next session.

For the next session, we would like you to pre-watch these two short (~16 min) vidlets.

Warning!

Be sure to save any work/notes you took in the binder to your computer. Any new files/changes are not available after the binder session is closed!

For example, select a file, click "More", then click "Export".

See you on Friday!

Day 1 Snakefile

# download data from https://osf.io/vzfc6/
rule download_data:
    output: "SRR2584857_1.fastq.gz"
    shell:
        "wget https://osf.io/4rdza/download -O {output}"

# this rule downloads the genome file
rule download_genome:
    output: "ecoli-rel606.fa.gz"
    shell:
        "wget https://osf.io/8sm92/download -O {output}"

rule uncompress_genome:
    input: "ecoli-rel606.fa.gz"
    output: "ecoli-rel606.fa"
    shell:
        "gunzip {input}"

rule index_genome_bwa:
    input: "ecoli-rel606.fa"
    output: "ecoli-rel606.fa.pac",
        "ecoli-rel606.fa.ann",
        "ecoli-rel606.fa.amb",
        "ecoli-rel606.fa.bwt",
        "ecoli-rel606.fa.sa"
    shell:
        "bwa index {input}"

rule map_reads:
    input:
        genome="ecoli-rel606.fa",
        reads="SRR2584857_1.fastq.gz",
        pac="ecoli-rel606.fa.pac",
        ann="ecoli-rel606.fa.ann",
        amb="ecoli-rel606.fa.amb",
        bwt="ecoli-rel606.fa.bwt",
        sa="ecoli-rel606.fa.sa"
    output: sam="SRR2584857.sam"
    shell:
        "bwa mem -t 4 {input.genome} {input.reads} > {output.sam}"

rule index_genome_samtools:
    shell:
        "samtools faidx ecoli-rel606.fa"

rule samtools_import:
    shell:
        "samtools import ecoli-rel606.fa.fai SRR2584857.sam SRR2584857.bam"

rule samtools_sort:
    shell:
        "samtools sort SRR2584857.bam -o SRR2584857.sorted.bam"

rule samtools_index_sorted:
    shell: "samtools index SRR2584857.sorted.bam"

rule samtools_mpileup:
    shell:
        """samtools mpileup -u -t DP -f ecoli-rel606.fa SRR2584857.sorted.bam | \
    bcftools call -mv -Ob -o - > variants.raw.bcf"""

rule make_vcf:
    shell: "bcftools view variants.raw.bcf > variants.vcf"

# at end, run:
## samtools tview -p ecoli:4202391 SRR2584857.sorted.bam ecoli-rel606.fa

Bonus: Visualizing Workflow as a DAG

Snakemake can also visualize the workflow as a DAG, i.e., a directed acyclic graph. You can read more about the syntax in the Snakemake documentation.

To generate a png DAG file, we need Graphviz, a graph visualization software. Since the workshop binder is preloaded with conda, we can create a new conda environment with the required packages.

If you are unfamiliar with Conda or would like a refresher, see our materials on Conda.

# initialize and reset the bash with conda
conda init bash
source .bashrc

#create a new environment and install Graphviz
conda create -y -n graphs graphviz

# activate conda environment
conda activate graphs

Graphviz uses dot to create directed graphs. At the end of Day 1, we completed the map_reads rule. Let's create a DAG of all the rules from Day 1.

We can point Snakemake at any Snakefile using the -s or --snakefile flag:

--snakefile, -s: The workflow definition in form of a snakefile. Usually, you should not need to specify this. By default, Snakemake will search for ‘Snakefile’, ‘snakefile’, ‘workflow/Snakefile’, ‘workflow/snakefile’ beneath the current working directory, in this order. Only if you definitely want a different layout, you need to use this parameter.

We have added the Day 1 Snakefile modifications to the binder as day1_Snakefile.txt, which we will use to generate the DAG:

snakemake -s day1_Snakefile.txt --dag SRR2584857.sam | dot -Tpng > dag_map_reads.png

You can now click on dag_map_reads.png in the Files/Plots/Packages pane to visualize the steps. The DAG contains a node for each job, with the edges connecting them representing the dependencies.

You can run the above command to generate the DAG even without creating the actual input and output files from the rules. If you generate the DAG before and after running all the rules up to map_reads, the difference shows up in the node outlines: dashed lines mark jobs whose output is already up to date, while solid lines mark jobs that still need to run.
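If the full job-level DAG gets busy, Snakemake also has a --rulegraph option that draws a condensed graph with one node per rule instead of one per job. For example, with the same day1_Snakefile.txt (the output filename here is just a suggestion):

snakemake -s day1_Snakefile.txt --rulegraph SRR2584857.sam | dot -Tpng > rulegraph_map_reads.png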



Day 2

Friday, May 14, 2021, from 09:00am - 11:00am PDT

Instructor: Dr. C Titus Brown

Moderator: Jose Sanchez

Continue decorating the Snakefile

Note that you can use day1_Snakefile.txt for this session, but you would then have to update the snakemake commands accordingly by adding the -s flag:

snakemake -s day1_Snakefile.txt <rule_name> -j 2

Alternatively, you can update the existing Snakefile by copying in all the rules up to map_reads from Day 1, and then run snakemake without specifying a Snakefile.

Perhaps simplest:

cp day1_Snakefile.txt Snakefile

Getting set up for Day 2, a quickstart

  • reset your prompt
  • copy our day 1 Snakefile over with cp day1_Snakefile.txt Snakefile
  • and then run snakemake -j 2 map_reads

and everything should be "green" with five jobs run.

Running lots of rules all at once

  • Run the rules one at a time to figure out what the output files are. Check the file time stamps (ls -lht) to see which output files are most recent. In the undecorated rules, the inputs appear only in the shell command section.

Snakemake Dry Run

You can do a dry run (--dry-run or -n) of a rule with

snakemake -n <rule name>

to check how Snakemake is interpreting the rule input(s), output(s), and shell command(s), without actually running the command(s) or creating any output(s).
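For example, a dry run of the map_reads rule (keeping the -p and -j 2 flags used elsewhere in this tutorial) would look something like:

snakemake -n -p map_reads -j 2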

  • Let's discuss inputs and outputs for rule map_reads
    • What happens when you have more than one input?
    • What do you do when you have invisible input files (i.e., inputs that are not mentioned in the shell command)?
rule map_reads:

    input:
        genome = "ecoli-rel606.fa",
        reads = "SRR2584857_1.fastq.gz",
        idxfile = expand("ecoli-rel606.fa.{ext}", ext=['sa', 'amb', 'ann', 'pac', 'bwt'])
    output: "SRR2584857.sam"
    shell:
        "bwa mem -t 4 {input.genome} {input.reads} > {output}"
  • Once you've fixed the rules index_genome_bwa and map_reads, you should be able to run everything up to the rule index_genome_samtools by running
snakemake -p index_genome_samtools -j 2
  • Snakemake also has an option to delete all the output files for a particular rule (including the outputs of the preceding rules it depends on), shown here for the index_genome_samtools rule:
snakemake --delete-all-output index_genome_samtools -j 2

Using filenames instead of rule names

You don't actually need to use the rule names (this will be important later on!). Instead of a rule name, you can give Snakemake the required output file, which will trigger execution of all the upstream rules necessary to produce it. So the command below also runs the rule map_reads, and you don't have to remember the rule name (which can be arbitrary).

snakemake -p SRR2584857.sam -j 2

Let's update two rules together, and then update a rule on your own.

We'll work through index_genome_samtools and samtools_import rules together.
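For reference, here are the decorated versions of these two rules as they appear in the Day 2 Snakefile at the end of these notes (the finished file uses a {sample} wildcard in place of the concrete SRR2584857 filenames):

rule index_genome_samtools:
    input: "ecoli-rel606.fa"
    output: "ecoli-rel606.fa.fai"
    shell:
        "samtools faidx ecoli-rel606.fa"

rule samtools_import:
    input:
        fai = "ecoli-rel606.fa.fai",
        sam = "{sample}.sam"
    output: "{sample}.bam"
    shell:
        "samtools view -bt {input.fai} -o {output} {input.sam}"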

Exercise 2

Add an input and output to the samtools_sort rule
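One possible solution (it matches the samtools_sort rule in the Day 2 Snakefile at the end of these notes - try it yourself before peeking!):

rule samtools_sort:
    input: "{sample}.bam"
    output: "{sample}.sorted.bam"
    shell:
        "samtools sort {input} -o {output}"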

Next, we'll finish off the Snakefile together!

(adventure goes here)

Running the pipeline

Once you've decorated the entire Snakefile, you should be able to run through the whole workflow using either the rule name:

snakemake -p make_vcf -j 2

or by using the file name:

snakemake -p variants.vcf -j 2

Rule all

Recall that Snakemake will execute the first rule in the Snakefile by default. We can use this feature by creating a default rule called all at the top of the Snakefile:

rule all:

    input: "variants.vcf"

Run it like this:

snakemake -p all -j 2

This rule runs through the entire workflow with a single command! This is much better than running each command one by one!

Output

Let's look at the vcf file!

less variants.vcf

We can look at the alignment by running the alignment viewer in samtools:

samtools tview -p ecoli:4202391 SRR2584857.sorted.bam ecoli-rel606.fa

If you run samtools tview as in the last comment in the Snakefile, you can use ‘q’ to quit (and the arrow keys to navigate).

Extras!

edit to make_vcf rule:

shell: "bcftools view --include 'INFO/DP>=20' variants.raw.bcf > variants.vcf"

specifying temp output files:

rule samtools_index_sorted:
    input: "SRR2584857.sorted.bam"
    output: temp("SRR2584857.sorted.bam.bai")
    shell: "samtools index {input}"

specifying protected file:

output: protected("variants.raw.bcf")
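In context, the protected output goes on the samtools_mpileup rule, as in the Day 2 Snakefile at the end of these notes. Marking the raw BCF as protected write-protects it once it has been generated, guarding against accidental deletion or overwriting:

rule samtools_mpileup:
    input:
        genome = "ecoli-rel606.fa",
        sorted_bam = "{sample}.sorted.bam",
        bai = "{sample}.sorted.bam.bai"
    output: protected("{sample}.variants.raw.bcf")
    shell:
        """samtools mpileup -u -t DP -f {input.genome} {input.sorted_bam} | \
        bcftools call -mv -Ob -o - > {output}"""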

Day 2 Snakefile

data = ['SRR_TITUS_1', 'SRR2584857_1']

rule first_rule_of_titus:
    input:
        # fill in the expected files from the 'data' list
        expand("{s}.sorted.bam.bai", s=data),
        expand("{s}.variants.vcf", s=data),
        expand("{s}.variants.hidepth.vcf", s=data)

rule download_data:
    output: "SRR2584857.fastq.gz"
    shell:
        "wget https://osf.io/4rdza/download -O {output}"

# this rule downloads the genome file
rule download_genome:
    output: "ecoli-rel606.fa.gz"
    shell:
        "wget https://osf.io/8sm92/download -O {output}"

rule uncompress_genome:
    input: "ecoli-rel606.fa.gz"
    output: "ecoli-rel606.fa"
    shell:
        "gunzip {input}"

rule index_genome_bwa:
    input: "ecoli-rel606.fa"
    output: "ecoli-rel606.fa.pac",
        "ecoli-rel606.fa.ann",
        "ecoli-rel606.fa.amb",
        "ecoli-rel606.fa.bwt",
        "ecoli-rel606.fa.sa"
    shell:
        "bwa index {input}"

rule map_reads:
    input:
        genome="ecoli-rel606.fa",
        reads="{sample}.fastq.gz",
        pac="ecoli-rel606.fa.pac",
        ann="ecoli-rel606.fa.ann",
        amb="ecoli-rel606.fa.amb",
        bwt="ecoli-rel606.fa.bwt",
        sa="ecoli-rel606.fa.sa"
    output: sam="{sample}.sam"
    shell:
        "bwa mem -t 4 {input.genome} {input.reads} > {output.sam}"

rule index_genome_samtools:
    input: "ecoli-rel606.fa"
    output: "ecoli-rel606.fa.fai"
    shell:
        "samtools faidx ecoli-rel606.fa"

rule samtools_import:
    input:
        fai = "ecoli-rel606.fa.fai",
        sam = "{sample}.sam"
    output: "{sample}.bam"
    shell:
        "samtools view -bt {input.fai} -o {output} {input.sam}"

# TODO: add 'input' and 'output' to samtools_sort
rule samtools_sort:
    input: "{sample}.bam"
    output: "{sample}.sorted.bam"
    shell:
        "samtools sort {input} -o {output}"

rule samtools_index_sorted:
    input: "{sample}.sorted.bam"
    output: temp("{sample}.sorted.bam.bai")
    shell: "samtools index {input}"

rule samtools_mpileup:
    input:
        genome = "ecoli-rel606.fa",
        sorted_bam = "{sample}.sorted.bam",
        bai = "{sample}.sorted.bam.bai"
    output: protected("{sample}.variants.raw.bcf")
    shell:
        """samtools mpileup -u -t DP -f {input.genome} {input.sorted_bam} | \
    bcftools call -mv -Ob -o - > {output}"""

rule make_vcf:
    input: "{sample}.variants.raw.bcf"
    output: "{sample}.variants.vcf"
    shell: "bcftools view {input} > {output}"

rule make_highdepth_vcf:
    input: "{sample}.variants.raw.bcf"
    output: "{sample}.variants.hidepth.vcf"
    shell: "bcftools view --include 'INFO/DP>=20' {input} > {output}"

# at end, run:
## samtools tview -p ecoli:4202391 SRR2584857.sorted.bam ecoli-rel606.fa