NANOME pipeline (Nanopore sequencing consensus DNA methylation detection method and pipeline)


Highlights of NANOME pipeline

NANOME offers several firsts among ONT methylation pipelines:

Figure_pipe_comp

  • Enables users to process terabase-scale Oxford Nanopore sequencing datasets.
  • Provides a single command-line/web-based UI for end-to-end analysis of Nanopore sequencing methylation calls.
  • Supports execution on various platforms: local, HPC, and CloudOS, without the need to install tools (NANOME supports Docker and Singularity).
  • First standardized whole genome-wide evaluation framework, considering per-read and per-site performance for singletons/non-singletons, genic and intergenic regions, CpG islands/shores/shelves, regions of different CG density, and repetitive regions.
  • The first Nextflow-based DNA methylation-calling pipeline for ONT data. For more on Nextflow-based workflow technology, see these Nature Biotechnology articles: https://doi.org/10.1038/s41587-020-0439-x and https://doi.org/10.1038/nbt.3820.
  • Allows adding new modules/tools via a simple config text file, without touching the main pipeline code, supporting rapid development and evaluation.
  • Consensus of top performers via an XGBoost model that tolerates NA values.
  • Multiple modification types: 5mC and 5hmC.
  • Haplotype-aware phasing and allele-specific methylation detection.
  • Supports Dorado basecalling and methylation calling.
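To illustrate the NA-tolerant consensus idea, here is a minimal sketch in plain Python. NANOME's actual consensus is a trained XGBoost model (which handles missing feature values natively); averaging the available per-read scores, as done below, is a simplified stand-in, and the function name and score semantics are illustrative assumptions, not NANOME's API.

```python
def consensus_score(nanopolish=None, megalodon=None, deepsignal=None):
    """Combine per-read methylation scores from up to three tools.

    Each argument is a score such as a log-likelihood ratio (positive
    favors methylated), or None when the tool made no call (NA value).
    NANOME's real consensus is a trained XGBoost model; averaging the
    available scores is a simplified stand-in for illustration only.
    """
    scores = [s for s in (nanopolish, megalodon, deepsignal) if s is not None]
    if not scores:
        return None  # no tool called this read/site at all
    return sum(scores) / len(scores)

# A read scored by two tools, with deepsignal missing (NA):
print(consensus_score(nanopolish=2.5, megalodon=1.5))  # 2.0
```

The key point is that a missing tool call does not discard the read; the consensus is computed from whatever scores are present.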

Background

Figure1A

Survey of methylation calling tools: timeline of publication and technological developments of Oxford Nanopore Technologies (ONT) methylation-calling tools for detecting DNA cytosine modifications.

Figure1B

Workflow for 5-methylcytosine (5mC) detection for nanopore sequencing.

CI/CD automation features

We use CI automation tools to run automated tests on every commit and pull request, ensuring updates do not introduce bugs. Please check the automated test results on GitHub.

System Requirements

Hardware requirements

The NANOME pipeline can be easily configured with different RAM and CPU/GPU resource schemas to run methylation-calling tools in parallel. For optimal performance, we recommend running NANOME on an HPC or cloud computing platform, e.g., Google Cloud Platform (GCP). The basic hardware requirements are:

  • GPU or CPU with 2+ cores.
  • RAM: 7+ GB per CPU.
  • Storage: HDD or SSD. Please ensure sufficient storage before running the pipeline.

Software requirements

The NANOME pipeline uses Nextflow technology. Users only need to install Nextflow (see the installation guide at https://nf-co.re/usage/installation) and have one of the commonly used environment tools below.

We provide conda, docker, and singularity environments that depend on the well-known open-source packages below for basecalling, methylation calling, and phasing of nanopore sequencing data:

nanopolish >=0.13.2
megalodon >=2.2.9
deepsignal >=0.1.8
ont-tombo >=1.5.1
deepmod >=0.1.3
METEORE >=1.0.0
ont-pyguppy-client-lib >=4.2.2
fast5mod >=1.0.5
Clair3 >=v0.1-r11
Whatshap >=1.0
NanomethPhase bam2bis >= 1.0
GNU Parallel >=20170422

Guppy software >= 4.2.2 from the ONT (Oxford Nanopore Technologies) website

Installation

Users only need to install Nextflow (https://nf-co.re/usage/installation). The NANOME execution environment will be automatically configured with the support of conda, docker, or singularity containers. Below are the steps for installing Nextflow:

# Install nextflow
conda install -c conda-forge -c bioconda nextflow
nextflow -v

The NANOME pipeline supports running in various ways on different platforms:

  • Docker
  • Singularity
  • Conda
  • Local execution: running directly on default platform
  • HPC clusters with SLURM support
  • Cloud computing platform, e.g., Google Cloud Platform(GCP) with google-lifesciences support

Simple usage

Please refer to Usage, Specific Usage, and NANOME options for how to use the NANOME pipeline. For running on the CloudOS platform (e.g., Google Cloud), please check Usage on CloudOS. We provide a tutorial video for running the NANOME pipeline:


Once Nextflow is installed, the NANOME pipeline can be executed directly without any additional installation steps:

# Run NANOME via docker
nextflow run LabShengLi/nanome \
    -profile test,docker

# Run NANOME via singularity
nextflow run LabShengLi/nanome \
    -profile test,singularity

# Run NANOME for human data
nextflow run LabShengLi/nanome \
    -profile test_human,[docker/singularity]

# Run NANOME for Dorado calls
nextflow run LabShengLi/nanome \
    -profile test_dorado,singularity

Please note that the above commands are integrated into our CI/CD test cases. GitHub will automatically test and report results on every commit and PR (https://github.com/LabShengLi/nanome/actions).

We were the first to propose a standardized whole genome-wide evaluation package; check the standardized evaluation tool usage for more detail. We do not recommend evaluating on only a portion of CpGs for performance comparisons.

Train and test script for consensus model in NANOME

We train an XGBoost model on the top performers: Nanopolish, DeepSignal, and Megalodon. For the detailed input/output format of consensus model training and prediction, check the consensus model format.

The training script usage is below:

cs_train.py  -h
usage: cs_train (NANOME) [-h] --train TRAIN [TRAIN ...] --train-chr TRAIN_CHR
                         [TRAIN_CHR ...] --test TEST [TEST ...] --test-chr
                         TEST_CHR [TEST_CHR ...]
                         [--input-tools INPUT_TOOLS [INPUT_TOOLS ...]]
                         [--dsname DSNAME] [--model-name MODEL_NAME]
                         [--base-model BASE_MODEL] -o O [--niter NITER]
                         [--cv CV] [--scoring SCORING]
                         [--random-state RANDOM_STATE]
                         [--processors PROCESSORS] [--test-lines TEST_LINES]
                         [--show-confusion-matrix] [--apply-cutoff]
                         [--apply-cutoff-train] [--verbose]

Consensus model train on data

optional arguments:
  -h, --help            show this help message and exit
  --train TRAIN [TRAIN ...]
                        train data file
  --train-chr TRAIN_CHR [TRAIN_CHR ...]
                        train chr file
  --test TEST [TEST ...]
                        test data file
  --test-chr TEST_CHR [TEST_CHR ...]
                        test chr file
  --input-tools INPUT_TOOLS [INPUT_TOOLS ...]
                        input features for train, default is megalodon,
                        nanopolish, and deepsignal
  --dsname DSNAME       dataset name, default is NA12878
  --model-name MODEL_NAME
                        model name: basic, etc.
  --base-model BASE_MODEL
                        base model name: rf, xgboost, etc.
  -o O                  output file dir
  --niter NITER         number of iterations for random CV, default is 20
  --cv CV               number of CV, default is 3
  --scoring SCORING     optimized score name, i.e., f1, roc_auc, etc., default
                        is f1
  --random-state RANDOM_STATE
                        random state 42
  --processors PROCESSORS
                        number of processors, default is 1
  --test-lines TEST_LINES
                        test top N rows, such as 10000, default is None
  --show-confusion-matrix
                        if output verbose info
  --apply-cutoff        if apply default cutoff of tools
  --apply-cutoff-train  if apply default cutoff of tools before train
  --verbose             if output verbose info
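The `--train-chr`/`--test-chr` options above split data by chromosome, so train and test reads never share genomic regions. A minimal sketch of that split in plain Python is below; the function name and row layout are illustrative assumptions (the real `cs_train.py` also loads tool features and runs a randomized CV search over an XGBoost base model, which is omitted here).

```python
def split_by_chromosome(rows, train_chrs, test_chrs):
    """Partition per-read call rows into train/test sets by chromosome.

    `rows` is a list of dicts with at least a 'chr' key. Splitting by
    chromosome (rather than randomly by row) prevents train/test leakage
    from reads covering the same genomic positions.
    """
    train_set, test_set = set(train_chrs), set(test_chrs)
    train = [r for r in rows if r["chr"] in train_set]
    test = [r for r in rows if r["chr"] in test_set]
    return train, test

rows = [{"chr": "chr1", "label": 1}, {"chr": "chr2", "label": 0},
        {"chr": "chr22", "label": 1}]
train, test = split_by_chromosome(rows, ["chr1", "chr2"], ["chr22"])
print(len(train), len(test))  # 2 1
```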

Prediction script usage for the XGBoost model is below:

cs_predict.py -h
usage: cs_predict (NANOME) [-h] [-v] [-i I [I ...]] [--nanopolish NANOPOLISH]
                           [--megalodon MEGALODON] [--deepsignal DEEPSIGNAL]
                           [--feature FEATURE]
                           [--feature-readids-col FEATURE_READIDS_COL [FEATURE_READIDS_COL ...]]
                           [--feature-readids-col-order FEATURE_READIDS_COL_ORDER [FEATURE_READIDS_COL_ORDER ...]]
                           [--feature-seq-col FEATURE_SEQ_COL]
                           [--model_specific MODEL_SPECIFIC] -m M --dsname
                           DSNAME -o O [-t T [T ...]]
                           [--random-state RANDOM_STATE]
                           [--processors PROCESSORS] [--chunksize CHUNKSIZE]
                           [--inner-join] [--chrs CHRS [CHRS ...]]
                           [--interactive] [--verbose]

Consensus model predict for data

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i I [I ...]          input tsv combined data for predicting
  --nanopolish NANOPOLISH
                        input nanopolish unified read-level file
  --megalodon MEGALODON
                        input megalodon unified read-level file
  --deepsignal DEEPSIGNAL
                        input deepsignal unified read-level file
  --feature FEATURE     input feature file for DNAseq
  --feature-readids-col FEATURE_READIDS_COL [FEATURE_READIDS_COL ...]
                        column index for ID, Chr, Pos and Strand
  --feature-readids-col-order FEATURE_READIDS_COL_ORDER [FEATURE_READIDS_COL_ORDER ...]
                        column index order for ID, Chr, Pos and Strand
  --feature-seq-col FEATURE_SEQ_COL
                        column index for DNA seq feature
  --model_specific MODEL_SPECIFIC
                        specific model info
  -m M                  model file, existing model list: NANOME2T,NANOME3T,xgb
                        oost_basic,xgboost_basic_w,xgboost_basic_w_seq
  --dsname DSNAME       dataset name
  -o O                  output file name
  -t T [T ...]          tools used for prediction, default is None
  --random-state RANDOM_STATE
                        random state, default is 42
  --processors PROCESSORS
                        num of processors, default is 8
  --chunksize CHUNKSIZE
                        chunk size for load large data, default is 500000
  --inner-join          if inner join for merge data, default is outer join
  --chrs CHRS [CHRS ...]
                        chromosomes used
  --interactive         if output to console as interactive mode, quit use q/Q
  --verbose             if output verbose info
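Before prediction, the per-tool read-level files are merged on the read identity columns (ID, Chr, Pos, Strand). The sketch below illustrates the outer-versus-inner-join distinction behind the `--inner-join` flag, using plain Python dicts; the function and data layout are illustrative assumptions, not the script's internals.

```python
def join_tool_calls(tool_scores, inner=False):
    """Merge per-tool read-level scores keyed by (read_id, chr, pos, strand).

    `tool_scores` maps a tool name to a {key: score} dict. The default
    outer join keeps keys called by any tool (missing scores become None,
    i.e., NA values the consensus model tolerates); inner=True keeps only
    keys called by every tool, mirroring the --inner-join flag.
    """
    keysets = [set(d) for d in tool_scores.values()]
    keys = set.intersection(*keysets) if inner else set.union(*keysets)
    return {k: {t: d.get(k) for t, d in tool_scores.items()} for k in keys}

calls = {
    "nanopolish": {("r1", "chr1", 100, "+"): 2.1},
    "megalodon": {("r1", "chr1", 100, "+"): 1.8,
                  ("r2", "chr1", 200, "+"): -0.5},
}
print(len(join_tool_calls(calls)))              # 2 (outer join)
print(len(join_tool_calls(calls, inner=True)))  # 1
```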

Script for read-level performance comparison (accuracy, F1-score, etc.) on joined predictions by all tools:

cs_eval_read.py  -h

                        [--model-name MODEL_NAME [MODEL_NAME ...]]
                        [--model-file MODEL_FILE [MODEL_FILE ...]] -o O
                        [--processors PROCESSORS] [--bs-cov BS_COV]
                        [--tool-cov TOOL_COV] [--eval-type EVAL_TYPE]
                        [--model-base-dir MODEL_BASE_DIR]
                        [--test-lines TEST_LINES] [--chunksize CHUNKSIZE]
                        [--force-llr2] [--verbose]

Consensus model train on data

optional arguments:
  -h, --help            show this help message and exit
  -i I [I ...]          input data file
  --dsname DSNAME       dataset name, default is NA12878
  --model-name MODEL_NAME [MODEL_NAME ...]
                        model name: rf, xgboost, etc.
  --model-file MODEL_FILE [MODEL_FILE ...]
                        model file
  -o O                  output file dir
  --processors PROCESSORS
                        number of processors, default is 1
  --bs-cov BS_COV       bs-seq coverage cutoff, default is 5
  --tool-cov TOOL_COV   ONT tool coverage cutoff, default is 1
  --eval-type EVAL_TYPE
                        evaluation type, read-level or site-level
  --model-base-dir MODEL_BASE_DIR
                        model file's base dir
  --test-lines TEST_LINES
                        test top N rows, such as 10000, default is None
  --chunksize CHUNKSIZE
                        chunk size for load large data, default is 500000
  --force-llr2          if convert megalodon llr to llr2
  --verbose             if output verbose info
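The read-level metrics named above (accuracy, F1-score) reduce to simple counts over binary calls. A self-contained sketch of the arithmetic is below; in NANOME's evaluation the truth labels come from high-coverage BS-seq after the coverage cutoffs, whereas the toy lists here are purely illustrative.

```python
def read_level_metrics(pred_labels, true_labels):
    """Accuracy and F1 for binary per-read methylation calls (1 = 5mC)."""
    pairs = list(zip(pred_labels, true_labels))
    tp = sum(p == 1 and t == 1 for p, t in pairs)  # true positives
    fp = sum(p == 1 and t == 0 for p, t in pairs)  # false positives
    fn = sum(p == 0 and t == 1 for p, t in pairs)  # false negatives
    acc = sum(p == t for p, t in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return acc, f1

acc, f1 = read_level_metrics([1, 1, 0, 0], [1, 0, 0, 0])
print(round(acc, 2), round(f1, 2))  # 0.75 0.67
```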

Script for site-level performance comparison (MSE, PCC) on joined predictions by all tools:

cs_eval_site.py -h

                        [--processors PROCESSORS] [--bs-cov BS_COV]
                        [--tool-cov TOOL_COV] [--eval-type EVAL_TYPE]
                        [--model-base-dir MODEL_BASE_DIR]
                        [--test-lines TEST_LINES] [--chunksize CHUNKSIZE]
                        [--save-data SAVE_DATA] [--force-llr2] [--verbose]

Consensus model train on data

optional arguments:
  -h, --help            show this help message and exit
  -i I [I ...]          input data file
  --dsname DSNAME       dataset name, default is NA12878
  --model-name MODEL_NAME [MODEL_NAME ...]
                        model name: rf, xgboost, etc.
  --model-file MODEL_FILE [MODEL_FILE ...]
                        model file
  -o O                  output file dir
  --processors PROCESSORS
                        number of processors, default is 1
  --bs-cov BS_COV       bs-seq coverage cutoff, default is 5
  --tool-cov TOOL_COV   ONT tool coverage cutoff, default is 1
  --eval-type EVAL_TYPE
                        evaluation type, i.e., site-level
  --model-base-dir MODEL_BASE_DIR
                        model file's base dir
  --test-lines TEST_LINES
                        test top N rows, such as 10000, default is None
  --chunksize CHUNKSIZE
                        chunk size for load large data, default is 500000
  --save-data SAVE_DATA
                        if save prediction outputs
  --force-llr2          if convert megalodon llr to llr2
  --verbose             if output verbose info
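The site-level metrics (MSE, PCC) compare per-site methylation frequencies between the ONT tool and BS-seq. A minimal stdlib sketch of both statistics is below; in the real evaluation, sites are first filtered by the `--bs-cov` and `--tool-cov` coverage cutoffs, which this toy example skips.

```python
import math

def site_level_metrics(ont_freq, bs_freq):
    """MSE and Pearson correlation between per-site methylation frequencies.

    `ont_freq` and `bs_freq` are parallel lists of frequencies in [0, 1],
    one entry per CpG site passing the coverage cutoffs.
    """
    n = len(ont_freq)
    mse = sum((a - b) ** 2 for a, b in zip(ont_freq, bs_freq)) / n
    ma, mb = sum(ont_freq) / n, sum(bs_freq) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(ont_freq, bs_freq))
    sd = math.sqrt(sum((a - ma) ** 2 for a in ont_freq) *
                   sum((b - mb) ** 2 for b in bs_freq))
    pcc = cov / sd if sd else float("nan")
    return mse, pcc

mse, pcc = site_level_metrics([0.1, 0.5, 0.9], [0.0, 0.5, 1.0])
print(round(mse, 4), round(pcc, 4))  # 0.0067 1.0
```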

Pipeline reports for NANOME

Benchmarking reports on our HPC using Nextflow

We constructed a set of benchmarking datasets containing from 800 to about 7,200 reads each for NA19240, and monitored job running timelines and resource usage on our HPC. The reports generated by the Nextflow workflow are: Trace file, Report, and Timeline.

Our HPC hardware specifications are as follows:

  • CPU: Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
  • GPU: Tesla V100-SXM2-32GB
  • RAM: 300 GB
  • Slurm manager version: 19.05.5

The timeline figure for the benchmarking experiments is below: Bench-timeline

Pipeline DAG

NanomeDag

NANOME report

Please check NANOME report for a sample report produced by the NANOME pipeline.

NanomeReportHtml

Haplotype-aware consensus methylations

Please check phasing usage. PhasingDemo

Lifebit CloudOS report

We now support running NANOME on cloud computing platforms. Lifebit is a web-based cloud computing platform; below are the running reports:

Revision History

For release history, please visit here. For details, please go here.

Contact

If you have any questions, issues, or bugs, please post them on GitHub. We will continuously update the repository to support widely used methylation-calling tools for Oxford Nanopore sequencing.