Skip to content

An efficient, documented, reproducible Snakemake methylation analysis pipeline for BS-seq and EM-seq samples, including cfDNA.

License

Notifications You must be signed in to change notification settings

semenko/serpent-methylation-pipeline

Repository files navigation

Serpent Methylation Pipeline (for Snakemake)

Snakemake License: MIT Code style: black Platform: Linux Super-Linter Documentation

Serpent Pipeline Logo

A standardized, reproducible pipeline to process WGBS bisulfite & EM-seq data. This goes from .fastq to methylation calls (via bwameth with bwa-mem2 and biscuit) and includes extensive QC and plotting, using a Snakemake pipeline.

📖 Documentation

View the complete documentation

The documentation includes:

  • Detailed installation instructions
  • Configuration guide
  • Usage examples
  • Pipeline technical details
  • Troubleshooting guide
  • API reference

Quick Start

This pipeline is designed to be straightforward:

  1. Clone this repository and open the directory:
    git clone https://github.com/semenko/serpent-methylation-pipeline.git
    cd serpent-methylation-pipeline
    
  2. Install Snakemake via mamba (or conda)
    mamba install -c bioconda -c conda-forge snakemake snakemake-storage-plugin-http
    
  3. (Optional) Create a separate conda environment for pipeline dependencies:
    mamba env create -n serpent_pipeline_env -f workflow/envs/env.yaml
    conda activate serpent_pipeline_env
    
  4. Test the pipeline:
    snakemake --cores 4 --use-conda --dry-run
    

For detailed instructions, see the Installation Guide.

Features

At a high level, this pipeline reproducibly:

Support

Contributing

We welcome contributions! Please see the Contributing Guide in our documentation. ├── goleft/ # goleft coverage plots ├── logs/ # runlogs from each pipeline component ├── methyldackel/ # mbias plots ├── raw/ │ ├── ...fastq.gz # Raw reads | ├── ...md5.txt # Checksums and validation ├── samtools/ # samtools statistics SAMPLE_02/ ... ... multiqc/ # A project-level multiqc stats across all data

Note each project also has a _subsampled directory with identical structure, which is the result of the pipeline run on only 10M reads/sample.

Production Runs

Pipeline Details

This pipeline was designed for highly reproducible, explainable alignments and analysis of epigenetic sequencing data.

Reference Genome

I chose GRCh38, with these specifics:

You can see a good explanation of the rationale for some of these components at this NCBI explainer.

Requirements

All software requirements are specified in env.yaml.

Most are relatively common, but a few are semi-unique:

biscuit was chosen after a comparison with bwa-meth and bismark — its latest version was the most flexible with extremely well annotated .bams (some critical tags are missing from bwa-meth for identifying read level methylation, and would require patching MethylDackel to extract data).

I briefly experimented with wgbs_tools (which defines nice .pat/.beta formats) but its licensing is too restrictive to use.

Trimming Approach

I chose a relatively conservative approach to trimming -- which is needed due to end-repair bias, adaptase bias, and more.

For EMseq, I trim 10 bp everywhere, after personal QC and offline discussions with NEB. See my notes here.

For BSseq, I trim 15 bp 5' R2, and 10 bp everywhere else due to adaptase bias.

For all reads, I set --trim_poly_g (due to two color bias) and set a --length_required (minimum read length) of 10 bp.

No Quality Filtering

Notably I do NOT do quality filtering here (I set --disable_quality_filtering), and save this for downstream analyses as desired.

I experimented with more stringent quality filtering early on, and found it had little yield / performance benefit.

Background & Inspiration

I strongly suggest reading work from Felix Krueger (author of Bismark) as background. In particular:

For similar pipelines and inspiration, see:

Pipeline Graph

Here's a high-level overview of the Snakemake pipeline (generated via snakemake --rulegraph | dot -Tpng > rules.png)

image

About

An efficient, documented, reproducible Snakemake methylation analysis pipeline for BS-seq and EM-seq samples, including cfDNA.

Topics

Resources

License

Stars

Watchers

Forks