Serpent Methylation Pipeline (for Snakemake)

A standardized, reproducible Snakemake pipeline for processing WGBS (bisulfite) and EM-seq data. It goes from .fastq files to methylation calls (aligning with bwameth on the bwa-mem2 backend, then calling with biscuit) and includes extensive QC and plotting.

📖 Documentation

View the complete documentation

The documentation includes:

Detailed installation instructions
Configuration guide
Usage examples
Pipeline technical details
Troubleshooting guide
API reference

Quick Start

This pipeline is designed to be straightforward:

Clone this repository and open the directory:

git clone https://github.com/semenko/serpent-methylation-pipeline.git
cd serpent-methylation-pipeline

Install Snakemake via mamba (or conda):

mamba install -c bioconda -c conda-forge snakemake snakemake-storage-plugin-http

(Optional) Create a separate conda environment for pipeline dependencies:

mamba env create -n serpent_pipeline_env -f workflow/envs/env.yaml
conda activate serpent_pipeline_env

Test the pipeline:

snakemake --cores 4 --use-conda --dry-run

For detailed instructions, see the Installation Guide.

Features

At a high level, this pipeline reproducibly:

Builds a reference genome (GRCh38 with hs38d1 decoy, U2AF1 and ENCODE DAC masking)
Trims & filters reads using fastp
Aligns using bwameth with bwa-mem2 backend
Marks non-converted reads using mark-nonconverted-reads
Calls methylation using biscuit pileup
Generates standardized outputs & QC including:
- FastQC
- fastp statistics
- Biscuit QC
- samtools stats
- MethylDackel mbias plots
- Goleft indexcov plots
- wgbs_tools pat/beta files
- Compressed bed files and epibeds
Runs multiqc across entire projects

Support

Documentation: https://semenko.github.io/serpent-methylation-pipeline/
Issues: GitHub Issues
Discussions: GitHub Discussions

Contributing

We welcome contributions! Please see the Contributing Guide in our documentation.

├── goleft/ # goleft coverage plots
├── logs/ # runlogs from each pipeline component
├── methyldackel/ # mbias plots
├── raw/
│   ├── ...fastq.gz # Raw reads
│   ├── ...md5.txt # Checksums and validation
├── samtools/ # samtools statistics
SAMPLE_02/
...
...
multiqc/ # A project-level multiqc stats across all data

Note that each project also has a _subsampled directory with an identical structure, containing the results of running the pipeline on only 10M reads per sample.
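The text above does not say how the 10M-read subsample is drawn. As a minimal sketch only, assuming the simplest approach of keeping the first N records of each gzipped FASTQ, something like the following would work. The function name `first_n_reads` and the file names are hypothetical and are not taken from the pipeline.

```python
import gzip
import os
import tempfile
from itertools import islice


def first_n_reads(fastq_in: str, fastq_out: str, n_reads: int) -> int:
    """Copy the first n_reads records (4 lines each) of a gzipped FASTQ.

    Returns the number of complete records written.
    """
    lines_written = 0
    with gzip.open(fastq_in, "rt") as src, gzip.open(fastq_out, "wt") as dst:
        # A FASTQ record is exactly 4 lines: header, sequence, '+', qualities.
        for line in islice(src, 4 * n_reads):
            dst.write(line)
            lines_written += 1
    return lines_written // 4


# Tiny demonstration on a synthetic 5-read file (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "in.fastq.gz")
    dst_path = os.path.join(tmp, "out.fastq.gz")
    with gzip.open(src_path, "wt") as fh:
        for i in range(5):
            fh.write(f"@read{i}\nACGT\n+\nIIII\n")
    kept = first_n_reads(src_path, dst_path, 3)
    with gzip.open(dst_path, "rt") as fh:
        names = [line.strip() for line in fh if line.startswith("@read")]

print(kept, names)
```

For paired-end data, applying the same `n_reads` to both mate files keeps pairs in sync, provided the two files list reads in the same order.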

Production Runs

Pipeline Details

This pipeline was designed for highly reproducible, explainable alignments and analysis of epigenetic sequencing data.

Reference Genome

I chose GRCh38, with these specifics:

No patches
Includes the hs38d1 decoy
Includes Alt chromosomes
Applies the U2AF1 masking file
Applies the ENCODE DAC exclusion list

You can see a good explanation of the rationale for some of these components at this NCBI explainer.
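To make the masking steps concrete, here is a minimal sketch of hard-masking, assuming the mask is supplied as BED-style intervals (0-based, half-open) and that masked bases are replaced with N. The helper names `parse_bed` and `hard_mask` are hypothetical. This illustrates the general technique and is not the pipeline's actual implementation.

```python
from typing import Iterable, Iterator, Tuple


def parse_bed(lines: Iterable[str], contig: str) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for BED rows on `contig`, skipping comments and blanks."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if fields[0] == contig:
            yield int(fields[1]), int(fields[2])


def hard_mask(sequence: str, intervals: Iterable[Tuple[int, int]]) -> str:
    """Replace every base inside each [start, end) interval with 'N'."""
    masked = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range rows are harmless.
        start = max(0, start)
        end = min(len(masked), end)
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


# Toy example: only the chr21 row applies to this sequence.
bed_rows = ["chr21\t2\t5", "chr1\t0\t3", "# comment"]
masked = hard_mask("ACGTACGTAC", parse_bed(bed_rows, "chr21"))
print(masked)  # ACNNNCGTAC
```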

Requirements

All software requirements are specified in env.yaml.

Most are relatively common, but a few are less widely used:

biscuit (for alignment)
NEB's mark-nonconverted-reads (to mark partially converted reads)

biscuit was chosen after a comparison with bwa-meth and bismark: its latest version was the most flexible and produced extremely well-annotated .bam files. (bwa-meth output lacks some tags that are critical for identifying read-level methylation, and extracting that data would require patching MethylDackel.)

I briefly experimented with wgbs_tools (which defines nice .pat/.beta formats) but its licensing is too restrictive to use.

Trimming Approach

I chose a relatively conservative approach to trimming, which is needed due to end-repair bias, adaptase bias, and more.

For EM-seq, I trim 10 bp from every read end, after personal QC and offline discussions with NEB. See my notes here.

For BS-seq, I trim 15 bp from the 5' end of R2, and 10 bp from every other read end, due to adaptase bias.

For all reads, I set --trim_poly_g (due to two-color chemistry bias) and set --length_required (the minimum read length) to 10 bp.

No Quality Filtering

Notably, I do NOT apply quality filtering here (I set --disable_quality_filtering), and leave it to downstream analyses as desired.

I experimented with more stringent quality filtering early on, and found it gave little benefit in yield or performance.
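The trimming and filtering settings described above can be sketched as a fastp argument builder. This is an illustration only: the function name `fastp_args`, the library labels "emseq" and "bsseq", and the file names are hypothetical, and the pipeline's real rule may pass additional options. The flags themselves are standard fastp options.

```python
import shlex
from typing import List


def fastp_args(library_type: str, r1_in: str, r2_in: str,
               r1_out: str, r2_out: str) -> List[str]:
    """Build a fastp command line for a paired-end library.

    library_type is a hypothetical label: "emseq" or "bsseq".
    """
    # 10 bp from every read end by default (the EM-seq setting).
    trims = {"trim_front1": 10, "trim_tail1": 10,
             "trim_front2": 10, "trim_tail2": 10}
    if library_type == "bsseq":
        # BS-seq: 15 bp from the 5' end of R2 because of adaptase bias.
        trims["trim_front2"] = 15
    elif library_type != "emseq":
        raise ValueError(f"unknown library type: {library_type!r}")

    args = ["fastp", "--in1", r1_in, "--in2", r2_in,
            "--out1", r1_out, "--out2", r2_out]
    for flag, bp in trims.items():
        args += [f"--{flag}", str(bp)]
    # Settings applied to all reads: poly-G trimming, a 10 bp minimum
    # read length, and no quality filtering.
    args += ["--trim_poly_g", "--length_required", "10",
             "--disable_quality_filtering"]
    return args


# Print the command for a BS-seq sample (file names are placeholders).
print(shlex.join(fastp_args("bsseq", "R1.fastq.gz", "R2.fastq.gz",
                            "R1.trimmed.fastq.gz", "R2.trimmed.fastq.gz")))
```

Returning a list rather than a single string means the arguments can be passed directly to `subprocess.run` without shell quoting issues.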

Background & Inspiration

I strongly suggest reading work from Felix Krueger (author of Bismark) as background. In particular:

TrimGalore's RRBS guide
The Babraham WGBS/RRBS tutorials

For similar pipelines and inspiration, see:

NEB's EM-seq pipeline
Felix Krueger's Nextflow WGBS Pipeline
The Snakepipes WGBS pipeline

Pipeline Graph

Here's a high-level overview of the Snakemake pipeline (generated via snakemake --rulegraph | dot -Tpng > rules.png).
