Merge pull request #8 from MPUSP/dev
feat: various small improvements for next release
rabioinf authored Sep 27, 2024
2 parents c378d28 + b5b609f commit 36f1580
Showing 25 changed files with 519 additions and 51 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -13,4 +13,7 @@ resources/**
Notes.md
.vscode/*
.snakemake-workflow-catalog.yml
.test/results/*
.test/results/*
Dockerfile
singularity.def
snakemake-bacterial-riboseq.sif
6 changes: 3 additions & 3 deletions .test/config/samples.tsv
@@ -1,3 +1,3 @@
sample condition replicate lib_prep data_folder fq1
RPF-RTP1 RPF-RTP 1 McGlincy data RPF-RTP1_R1_001.fastq.gz
RPF-RTP2 RPF-RTP 2 McGlincy data RPF-RTP2_R1_001.fastq.gz
sample condition replicate data_folder fq1
RPF-RTP1 RPF-RTP 1 data RPF-RTP1_R1_001.fastq.gz
RPF-RTP2 RPF-RTP 2 data RPF-RTP2_R1_001.fastq.gz
79 changes: 69 additions & 10 deletions README.md
@@ -2,8 +2,10 @@

![Platform](https://img.shields.io/badge/platform-all-green)
[![Snakemake](https://img.shields.io/badge/snakemake-≥7.0.0-brightgreen.svg)](https://snakemake.github.io)
[![GitHub actions status](https://github.com/MPUSP/snakemake-bacterial-riboseq/workflows/Tests/badge.svg?branch=main)](https://github.com/MPUSP/snakemake-bacterial-riboseq/actions?query=branch%3Amain+workflow%3ATests)
[![Tests](https://github.com/MPUSP/snakemake-bacterial-riboseq/actions/workflows/main.yml/badge.svg)](https://github.com/MPUSP/snakemake-bacterial-riboseq/actions/workflows/main.yml)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1D355C.svg?labelColor=000000)](https://sylabs.io/docs/)
[![workflow catalog](https://img.shields.io/badge/Snakemake%20workflow%20catalog-darkgreen)](https://snakemake.github.io/snakemake-workflow-catalog)

---

@@ -13,7 +15,6 @@ A Snakemake workflow for the analysis of bacterial riboseq data.
- [Usage](#usage)
- [Workflow overview](#workflow-overview)
- [Installation](#installation)
- [Additional tools](#additional-tools)
- [Running the workflow](#running-the-workflow)
- [Input data](#input-data)
- [Reference genome](#reference-genome)
@@ -89,9 +90,7 @@ mamba create -c conda-forge -c bioconda -n snakemake-bacterial-riboseq snakemake
conda activate snakemake-bacterial-riboseq
```

### Additional tools

**Important note:**
**Note:**

All other dependencies for the workflow are **automatically pulled as `conda` environments** by snakemake when the workflow is run with the `--use-conda` parameter (recommended).

@@ -115,16 +114,18 @@ Important requirements when using custom `*.fasta` and `*.gff` files:

Ribosome footprint sequencing data in `*.fastq.gz` format. The currently supported input data are **single-end, strand-specific reads**. Input data files are supplied via a mandatory table, whose location is indicated in the `config.yml` file (default: `samples.tsv`). The sample sheet has the following layout:

| sample | condition | replicate | lib_prep | data_folder | fq1 |
| -------- | --------- | --------- | -------- | ----------- | ------------------------ |
| RPF-RTP1 | RPF-RTP | 1 | McGlincy | data | RPF-RTP1_R1_001.fastq.gz |
| RPF-RTP2 | RPF-RTP | 2 | McGlincy | data | RPF-RTP2_R1_001.fastq.gz |
| sample | condition | replicate | data_folder | fq1 |
| -------- | --------- | --------- | ----------- | ------------------------ |
| RPF-RTP1 | RPF-RTP | 1 | data | RPF-RTP1_R1_001.fastq.gz |
| RPF-RTP2 | RPF-RTP | 2 | data | RPF-RTP2_R1_001.fastq.gz |
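
The sample sheet layout above can be checked before a run with a few lines of Python. This is a minimal sketch, not part of the workflow: `read_samplesheet` is a hypothetical helper, the file is assumed to be tab-separated (as the `.tsv` extension suggests), and joining `data_folder` and `fq1` into one path is an assumption about how the two columns relate.

```python
import csv
import io

# Columns of the sample sheet as listed in the table above.
REQUIRED_COLUMNS = ["sample", "condition", "replicate", "data_folder", "fq1"]


def read_samplesheet(handle):
    """Parse a tab-separated sample sheet and check its header (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"sample sheet is missing columns: {missing}")
    return list(reader)


# Inline copy of the two test rows; a real run would open the file instead.
EXAMPLE = (
    "sample\tcondition\treplicate\tdata_folder\tfq1\n"
    "RPF-RTP1\tRPF-RTP\t1\tdata\tRPF-RTP1_R1_001.fastq.gz\n"
    "RPF-RTP2\tRPF-RTP\t2\tdata\tRPF-RTP2_R1_001.fastq.gz\n"
)

samples = read_samplesheet(io.StringIO(EXAMPLE))
for row in samples:
    # Assumption: the read file lives at <data_folder>/<fq1>.
    print(row["sample"], f'{row["data_folder"]}/{row["fq1"]}')
```

A sheet that still carries extra columns passes this check, since only the presence of the five listed columns is tested.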

Some configuration parameters of the pipeline may be specific for your data and library preparation protocol. The options should be adjusted in the `config.yml` file. For example:

- Minimum and maximum read length after adapter removal (see option `cutadapt: default`). Here, the test data has a minimum read length of 15 + 7 = 22 nt (7 nt of UMI: 2 nt on the 5'-end + 5 nt on the 3'-end), and a maximum of 45 + 7 = 52 nt.
- Unique molecular identifiers (UMIs). For example, the protocol by [McGlincy & Ingolia, 2017](https://doi.org/10.1016/J.YMETH.2017.05.028) creates a UMI that is located on both the 5'-end (2 nt) and the 3'-end (5 nt). These UMIs are extracted with `umi_tools` (see options `umi_extraction: method` and `pattern`).
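
The read length arithmetic in the first bullet can be written out as a small calculation. The sketch below is illustrative only: `cutadapt_length_bounds` is a hypothetical helper, and the values 15 and 45 are taken from the sums quoted above. The result corresponds to the `-m` (minimum length) and `-M` (maximum length) options of `cutadapt`.

```python
def cutadapt_length_bounds(insert_min, insert_max, umi_5p, umi_3p):
    """Return (min, max) read length after adapter removal, UMIs still attached.

    Hypothetical helper: both UMI segments are still part of the read when
    cutadapt filters by length, so their total length is added to both bounds.
    """
    umi_total = umi_5p + umi_3p
    return insert_min + umi_total, insert_max + umi_total


# Values from the text: 15 to 45 nt plus a 2 nt (5'-end) and 5 nt (3'-end) UMI.
min_len, max_len = cutadapt_length_bounds(15, 45, umi_5p=2, umi_3p=5)
print(f"-m {min_len} -M {max_len}")  # prints "-m 22 -M 52"
```

For a protocol with different UMI lengths, only the two `umi_*` arguments need to change.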

Example configuration files for different sequencing protocols can be found in `resources/protocols/`.

### Execution

To run the workflow from the command line, change the working directory.
@@ -133,7 +134,7 @@ To run the workflow from command line, change the working directory.
cd path/to/snakemake-bacterial-riboseq
```

Adjust the global and module-specific options in the default config file `config/config.yml`.
Adjust options in the default config file `config/config.yml`.
Before running the entire workflow, you can perform a dry run using:

```bash
@@ -146,8 +147,66 @@ To run the complete workflow with test files using **`conda`**, execute the foll
snakemake --cores 10 --use-conda --directory .test
```

To run the workflow with **singularity**, use:

```bash
snakemake --cores 10 --use-singularity --use-conda --directory .test
```

### Parameters

This table lists all parameters that can be used to run the workflow.

| parameter | type | details | default |
| ---------------------- | ---- | ------------------------------------------- | -------------------------------------------- |
| **samplesheet** | | | |
| path | str | path to samplesheet, mandatory | "config/samples.tsv" |
| **get_genome** | | | |
| database | str | one of `manual`, `ncbi` | `ncbi` |
| assembly | str | RefSeq ID | `GCF_000006785.2` |
| fasta | str | optional path to fasta file | Null |
| gff | str | optional path to gff file | Null |
| gff_source_type | str | list of name/value pairs for GFF source | see config file |
| **cutadapt** | | | |
| fivep_adapter | str | sequence of the 5' adapter | Null |
| threep_adapter | str | sequence of the 3' adapter | `ATCGTAGATCGGAAGAGCACACGTCTGAA` |
| default | str | additional options passed to `cutadapt` | [`-q 10 `, `-m 22 `, `-M 52`, `--overlap=3`] |
| **umi_extraction** | | | |
| method | str | one of `string` or `regex`, see manual | `regex` |
| pattern | str | string or regular expression | `^(?P<umi_0>.{5}).*(?P<umi_1>.{2})$` |
| **umi_dedup** | | | |
| options | str | default options for deduplication | see config file |
| **star** | | | |
| index | str | location of genome index; if Null, is made | Null |
| genomeSAindexNbases | num | length of pre-indexing string, see STAR man | 9 |
| multi | num | max number of loci read is allowed to map | 10 |
| sam_multi | num | max number of alignments reported for read | 1 |
| intron_max | num | max length of intron; 0 = automatic choice | 1 |
| default | str | default options for STAR aligner | see config file |
| **extract_features** | | | |
| biotypes | str | biotypes to exclude from mapping | [`rRNA`, `tRNA`] |
| CDS | str | CDS type to include for mapping | [`protein_coding`] |
| **bedtools_intersect** | | | |
| defaults | str | remove hits, sense strand, min overlap 20% | [`-v `, `-s `, `-f 0.2`] |
| **annotate_orfs** | | | |
| window_size | num | size of 5'-UTR added to CDS | 30 |
| **shift_reads** | | | |
| window_size | num | start codon window to determine shift | 30 |
| read_length | num | size range of reads to use for shifting | [27, 45] |
| end_alignment | str | end used for alignment of RiboSeq reads | `3prime` |
| shift_table | str | optional table with offsets per read length | Null |
| export_bigwig | bool | export shifted reads as bigwig file | True |
| export_ofst | bool | export shifted reads as ofst file | False |
| skip_shifting | bool | skip read shifting entirely | False |
| skip_length_filter | bool | skip filtering reads by length | False |
| **multiqc** | | | |
| config | str | path to multiqc config | `config/multiqc_config.yml` |
| **report** | | | |
| export_figures | bool | export figures as `.svg` and `.png` | True |
| export_dir | str | sub-directory for figure export | `figures/` |
| figure_width | num | standard figure width in px | 875 |
| figure_height | num | standard figure height in px | 500 |
| figure_resolution | num | standard figure resolution in dpi | 125 |
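
To see what the default `umi_extraction` pattern from the table captures, it can be tried on a made-up read with Python's built-in `re` module. This is a sketch under assumptions: `umi_tools` does its own regex matching, `split_umi` is a hypothetical helper, the read sequence is invented, and concatenating the two groups in name order is an assumption about how the UMI pieces are combined.

```python
import re

# Default pattern from the parameter table: 5 characters captured at the
# start of the read (umi_0) and 2 characters at the end (umi_1).
PATTERN = re.compile(r"^(?P<umi_0>.{5}).*(?P<umi_1>.{2})$")


def split_umi(read):
    """Split a read into (umi, remaining sequence); None if the read is too short."""
    match = PATTERN.match(read)
    if match is None:
        return None
    # Assumption: the UMI is the concatenation of the named groups.
    umi = match.group("umi_0") + match.group("umi_1")
    insert = read[match.end("umi_0"):match.start("umi_1")]
    return umi, insert


# Invented read: 5 nt + 21 nt + 2 nt.
read = "AAAAA" + "GATTACAGATTACAGATTACA" + "CC"
print(split_umi(read))  # ('AAAAACC', 'GATTACAGATTACAGATTACA')
```

Reads shorter than 7 characters do not match the pattern, so `split_umi` returns `None` for them.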

## Authors
