Merge pull request #8 from MPUSP/dev

feat: various small improvements for next release
MPUSP · Sep 27, 2024 · 36f1580
2 parents c378d28 + b5b609f
commit 36f1580
Showing 25 changed files with 519 additions and 51 deletions.
diff --git a/.gitignore b/.gitignore
@@ -13,4 +13,7 @@ resources/**
 Notes.md
 .vscode/*
 .snakemake-workflow-catalog.yml
-.test/results/*
+.test/results/*
+Dockerfile
+singularity.def
+snakemake-bacterial-riboseq.sif
diff --git a/.test/config/samples.tsv b/.test/config/samples.tsv
@@ -1,3 +1,3 @@
-sample	condition	replicate	lib_prep 	data_folder	fq1
-RPF-RTP1	RPF-RTP	1	McGlincy	data	RPF-RTP1_R1_001.fastq.gz
-RPF-RTP2	RPF-RTP	2	McGlincy	data	RPF-RTP2_R1_001.fastq.gz
+sample	condition	replicate	data_folder	fq1
+RPF-RTP1	RPF-RTP	1	data	RPF-RTP1_R1_001.fastq.gz
+RPF-RTP2	RPF-RTP	2	data	RPF-RTP2_R1_001.fastq.gz
diff --git a/README.md b/README.md
@@ -2,8 +2,10 @@
 
 ![Platform](https://img.shields.io/badge/platform-all-green)
 [![Snakemake](https://img.shields.io/badge/snakemake-≥7.0.0-brightgreen.svg)](https://snakemake.github.io)
-[![GitHub actions status](https://github.com/MPUSP/snakemake-bacterial-riboseq/workflows/Tests/badge.svg?branch=main)](https://github.com/MPUSP/snakemake-bacterial-riboseq/actions?query=branch%3Amain+workflow%3ATests)
+[![Tests](https://github.com/MPUSP/snakemake-bacterial-riboseq/actions/workflows/main.yml/badge.svg)](https://github.com/MPUSP/snakemake-bacterial-riboseq/actions/workflows/main.yml)
 [![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
+[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1D355C.svg?labelColor=000000)](https://sylabs.io/docs/)
+[![workflow catalog](https://img.shields.io/badge/Snakemake%20workflow%20catalog-darkgreen)](https://snakemake.github.io/snakemake-workflow-catalog)
 
 ---
 
@@ -13,7 +15,6 @@ A Snakemake workflow for the analysis of bacterial riboseq data.
   - [Usage](#usage)
   - [Workflow overview](#workflow-overview)
   - [Installation](#installation)
-    - [Additional tools](#additional-tools)
   - [Running the workflow](#running-the-workflow)
     - [Input data](#input-data)
       - [Reference genome](#reference-genome)
@@ -89,9 +90,7 @@ mamba create -c conda-forge -c bioconda -n snakemake-bacterial-riboseq snakemake
 conda activate snakemake-bacterial-riboseq
 ```
 
-### Additional tools
-
-**Important note:**
+**Note:**
 
 All other dependencies for the workflow are **automatically pulled as `conda` environments** by snakemake, when running the workflow with the `--use-conda` parameter (recommended).
 
@@ -115,16 +114,18 @@ Important requirements when using custom `*.fasta` and `*.gff` files:
 
 Ribosome footprint sequencing data in `*.fastq.gz` format. The currently supported input data are **single-end, strand-specific reads**. Input data files are supplied via a mandatory table, whose location is indicated in the `config.yml` file (default: `samples.tsv`). The sample sheet has the following layout:
 
-| sample   | condition | replicate | lib_prep | data_folder | fq1                      |
-| -------- | --------- | --------- | -------- | ----------- | ------------------------ |
-| RPF-RTP1 | RPF-RTP   | 1         | McGlincy | data        | RPF-RTP1_R1_001.fastq.gz |
-| RPF-RTP2 | RPF-RTP   | 2         | McGlincy | data        | RPF-RTP2_R1_001.fastq.gz |
+| sample   | condition | replicate | data_folder | fq1                      |
+| -------- | --------- | --------- | ----------- | ------------------------ |
+| RPF-RTP1 | RPF-RTP   | 1         | data        | RPF-RTP1_R1_001.fastq.gz |
+| RPF-RTP2 | RPF-RTP   | 2         | data        | RPF-RTP2_R1_001.fastq.gz |
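
Since this change drops the `lib_prep` column, older sample sheets no longer match the documented layout. The following is a minimal sketch, not part of the workflow, of how a sheet could be checked against the five columns shown above. The helper name `read_samples` and the strict header comparison are assumptions made for illustration.

```python
import csv
import io

# Column layout of the sample sheet after this change (taken from the table above).
EXPECTED_COLUMNS = ["sample", "condition", "replicate", "data_folder", "fq1"]


def read_samples(handle):
    """Parse a tab-separated sample sheet and verify its header (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = [name.strip() for name in (reader.fieldnames or [])]
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {header}")
    reader.fieldnames = header  # use the whitespace-stripped names as row keys
    return list(reader)


# Contents of the updated test sample sheet, inlined for the example.
SHEET = (
    "sample\tcondition\treplicate\tdata_folder\tfq1\n"
    "RPF-RTP1\tRPF-RTP\t1\tdata\tRPF-RTP1_R1_001.fastq.gz\n"
    "RPF-RTP2\tRPF-RTP\t2\tdata\tRPF-RTP2_R1_001.fastq.gz\n"
)

rows = read_samples(io.StringIO(SHEET))
for row in rows:
    print(row["sample"], row["data_folder"], row["fq1"])
```

A sheet that still carries the `lib_prep` column would raise a `ValueError` in this sketch, which makes the layout change visible early.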
 
 Some configuration parameters of the pipeline may be specific for your data and library preparation protocol. The options should be adjusted in the `config.yml` file. For example:
 
 - Minimum and maximum read length after adapter removal (see option `cutadapt: default`). Here, the test data has a minimum read length of 15 + 7 = 22 (2 nt on 5'end + 5 nt on 3'end), and a maximum of 45 + 7 = 52.
 - Unique molecular identifiers (UMIs). For example, the protocol by [McGlincy & Ingolia, 2017](https://doi.org/10.1016/J.YMETH.2017.05.028) creates a UMI that is located on both the 5'-end (2 nt) and the 3'-end (5 nt). These UMIs are extracted with `umi_tools` (see options `umi_extraction: method` and `pattern`).
 
+Example configuration files for different sequencing protocols can be found in `resources/protocols/`.
+
 ### Execution
 
 To run the workflow from command line, change the working directory.
@@ -133,7 +134,7 @@ To run the workflow from command line, change the working directory.
 cd path/to/snakemake-bacterial-riboseq
 ```
 
-Adjust the global and module-specific options in the default config file `config/config.yml`.
+Adjust options in the default config file `config/config.yml`.
 Before running the entire workflow, you can perform a dry run using:
 
 ```bash
@@ -146,8 +147,66 @@ To run the complete workflow with test files using **`conda`**, execute the foll
 snakemake --cores 10 --use-conda --directory .test
 ```
 
+To run the workflow with **singularity**, use:
+
+```bash
+snakemake --cores 10 --use-singularity --use-conda --directory .test
+```
+
 ### Parameters
 
+This table lists all parameters that can be used to run the workflow.
+
+| parameter              | type | details                                     | default                                      |
+| ---------------------- | ---- | ------------------------------------------- | -------------------------------------------- |
+| **samplesheet**        |      |                                             |                                              |
+| path                   | str  | path to samplesheet, mandatory              | "config/samples.tsv"                         |
+| **get_genome**         |      |                                             |                                              |
+| database               | str  | one of `manual`, `ncbi`                     | `ncbi`                                       |
+| assembly               | str  | RefSeq ID                                   | `GCF_000006785.2`                            |
+| fasta                  | str  | optional path to fasta file                 | Null                                         |
+| gff                    | str  | optional path to gff file                   | Null                                         |
+| gff_source_type        | str  | list of name/value pairs for GFF source     | see config file                              |
+| **cutadapt**           |      |                                             |                                              |
+| fivep_adapter          | str  | sequence of the 5' adapter                  | Null                                         |
+| threep_adapter         | str  | sequence of the 3' adapter                  | `ATCGTAGATCGGAAGAGCACACGTCTGAA`              |
+| default                | str  | additional options passed to `cutadapt`     | [`-q 10 `, `-m 22 `, `-M 52`, `--overlap=3`] |
+| **umi_extraction**     |      |                                             |                                              |
+| method                 | str  | one of `string` or `regex`, see manual      | `regex`                                      |
+| pattern                | str  | string or regular expression                | `^(?P<umi_0>.{5}).*(?P<umi_1>.{2})$`         |
+| **umi_dedup**          |      |                                             |                                              |
+| options                | str  | default options for deduplication           | see config file                              |
+| **star**               |      |                                             |                                              |
+| index                  | str  | location of genome index; if Null, is made  | Null                                         |
+| genomeSAindexNbases    | num  | length of pre-indexing string, see STAR man | 9                                            |
+| multi                  | num  | max number of loci a read is allowed to map | 10                                           |
+| sam_multi              | num  | max number of alignments reported for read  | 1                                            |
+| intron_max             | num  | max length of intron; 0 = automatic choice  | 1                                            |
+| default                | str  | default options for STAR aligner            | see config file                              |
+| **extract_features**   |      |                                             |                                              |
+| biotypes               | str  | biotypes to exclude from mapping            | [`rRNA`, `tRNA`]                             |
+| CDS                    | str  | CDS type to include for mapping             | [`protein_coding`]                           |
+| **bedtools_intersect** |      |                                             |                                              |
+| defaults               | str  | remove hits, sense strand, min overlap 20%  | [`-v `, `-s `, `-f 0.2`]                     |
+| **annotate_orfs**      |      |                                             |                                              |
+| window_size            | num  | size of 5'-UTR added to CDS                 | 30                                           |
+| **shift_reads**        |      |                                             |                                              |
+| window_size            | num  | start codon window to determine shift       | 30                                           |
+| read_length            | num  | size range of reads to use for shifting     | [27, 45]                                     |
+| end_alignment          | str  | end used for alignment of RiboSeq reads     | `3prime`                                     |
+| shift_table            | str  | optional table with offsets per read length | Null                                         |
+| export_bigwig          | bool | export shifted reads as bigwig file         | True                                         |
+| export_ofst            | bool | export shifted reads as ofst file           | False                                        |
+| skip_shifting          | bool | skip read shifting entirely                 | False                                        |
+| skip_length_filter     | bool | skip filtering reads by length              | False                                        |
+| **multiqc**            |      |                                             |                                              |
+| config                 | str  | path to multiqc config                      | `config/multiqc_config.yml`                  |
+| **report**             |      |                                             |                                              |
+| export_figures         | bool | export figures as `.svg` and `.png`         | True                                         |
+| export_dir             | str  | sub-directory for figure export             | `figures/`                                   |
+| figure_width           | num  | standard figure width in px                 | 875                                          |
+| figure_height          | num  | standard figure height in px                | 500                                          |
+| figure_resolution      | num  | standard figure resolution in dpi           | 125                                          |
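
To illustrate the default `umi_extraction: pattern` listed in the table, here is a minimal sketch that applies the same regular expression with Python's `re` module. It only mimics the idea of collecting the `umi_*` groups and removing them from the read. It does not call `umi_tools`, and the function name `split_umi` and the example read are made up for illustration. As written, the pattern captures 5 characters at the start and 2 characters at the end of the read string.

```python
import re

# Default pattern from the parameter table (umi_extraction: pattern).
UMI_PATTERN = re.compile(r"^(?P<umi_0>.{5}).*(?P<umi_1>.{2})$")


def split_umi(read):
    """Return (umi, remaining_read), or None if the read is shorter than 7 nt.

    Hypothetical helper: concatenates the two named groups into one UMI and
    keeps the sequence between them.
    """
    match = UMI_PATTERN.match(read)
    if match is None:
        return None
    umi = match.group("umi_0") + match.group("umi_1")
    insert = read[match.end("umi_0"):match.start("umi_1")]
    return umi, insert


# Made-up 22 nt read: 5 nt + 15 nt + 2 nt.
read = "AAAAA" + "C" * 15 + "GG"
umi, insert = split_umi(read)
print(umi, len(insert))  # AAAAAGG 15
```

The 22 nt example read leaves 15 nt after the 7 UMI nucleotides are removed, which matches the `15 + 7 = 22` arithmetic used in the README for the minimum read length (`-m 22`).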
 
 ## Authors