FinaleMe Snakemake Workflow

This Snakemake workflow automates the prediction of DNA methylation from cell-free DNA (cfDNA) whole-genome sequencing (WGS) data using FinaleMe. Each sample is processed independently through methylation prediction, and an optional downstream tissue-of-origin (TOO) analysis can then be run across all samples.

Key Features

FinaleMe Integration: Implements the core steps of the FinaleMe analysis pipeline:
1. Feature extraction from BAM files.
2. HMM model training.
3. CpG methylation prediction (decoding).
4. Conversion of predictions to BigWig format.
5. Optional: Tissue-of-Origin (TOO) analysis using predicted methylation levels.
Input Compatibility: Designed for coordinate-sorted and indexed BAM files.
Configurability: Workflow behavior and parameters are controlled via a YAML configuration file.
Parallelization: Snakemake enables multi-core processing for suitable steps.
SLURM Integration: Can be adapted for job submission to SLURM clusters (requires a SLURM profile for Snakemake).

Installation and Setup

Clone the Repository:

git clone https://github.com/epifluidlab/FinaleMe_workflow
cd FinaleMe_workflow

Create Conda Environment: Set up a Conda environment with the necessary software dependencies.

# Install dependencies into a new Conda environment
conda env create -f environment.yml
conda activate finaleme_workflow

Download FinaleMe JARs: Download the main FinaleMe JAR (FinaleMe-VERSION-jar-with-dependencies.jar). The auxiliary JARs ship in the lib/ directory of this repository, so they are already present after cloning.
- Main JAR: FinaleMe v0.58.1 Release

Dependencies

This workflow relies on the following tools. It's recommended to install them via Conda (see environment.yml).

Java: Tested with OpenJDK 1.8.0.
Snakemake: Workflow management system.
Perl: Required for the bedpredict2bw.b37.pl script (tested with v5.26.3).
UCSC Tools: Specifically bedGraphToBigWig (tested with v4).
- Install via Bioconda: conda install -c bioconda ucsc-bedgraphtobigwig
Bedtools: Tested with v2.29.2.
- Install via Bioconda: conda install -c bioconda bedtools
Samtools: For preparing reference genome files (e.g., faidx).

Required Data

Before running the workflow, make sure you have the following data and that each path is correctly specified in the configuration file:

Input BAM files: Coordinate-sorted and indexed (.bai) BAM files for each sample. An example is provided in the input directory.
Reference Genome (2bit format): E.g., hg19.2bit. Can be downloaded from UCSC or converted using faToTwoBit.
CpG Motif BedGraph: E.g., CG_motif.hg19.common_chr.pos_only.bedgraph. Available from .
Exclude Regions BED: Regions to mask (dark regions), e.g., wgEncodeDukeMapabilityRegionsExcludable_wgEncodeDacMapabilityConsensusExcludable.hg19.bed.
Methylation Prior BigWig: E.g., wgbs_buffyCoat_jensen2015GB.methy.hg19.bw. Available from .
Chromosome Sizes File: E.g., hg19.chrom.sizes. Can be generated using samtools faidx and awk.
FinaleMe Scripts:
- bedpredict2bw.b37.pl (for tissue of origin analysis only) - see the scripts directory
- TissueOfOriginExampleScript.R (for tissue of origin analysis only) - see the scripts directory
(Optional - For Tissue of Origin Analysis):
- autosome_1kb_intervals.UCSC.cpgIsland_plus_shore.b37.bed: BED file of 1 kb intervals. To build it, download the UCSC.cpgIsland annotation file from the UCSC Genome Browser, keep the autosomes, and generate non-overlapping 1 kb windows.
- Reference methylomes for TOO analysis. See reference_panel.bash to create the reference panel methylomes.
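
The chromosome sizes file listed above can be derived from a FASTA index, since the first two columns of a .fai file are the sequence name and its length. The following is a minimal sketch of that step; the helper name fai_to_chrom_sizes and the file name hg19.fa are assumptions for illustration and are not part of this repository.

```bash
# Hypothetical helper (not part of the repository): print "name<TAB>length"
# for every sequence listed in a FASTA index (.fai) file.
fai_to_chrom_sizes() {
    local fai="$1"
    # Columns 1 and 2 of a .fai file are the sequence name and its length.
    awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 }' "$fai"
}

# Example usage, assuming samtools is installed and hg19.fa is a local FASTA:
#   samtools faidx hg19.fa                        # writes hg19.fa.fai
#   fai_to_chrom_sizes hg19.fa.fai > hg19.chrom.sizes
```

If your BAM files use a different contig naming scheme than the FASTA (for example "1" versus "chr1"), the resulting file will need the same names as the BAM headers.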

Configuration (`params.yaml`)

The workflow is controlled by a params.yaml file. See the params.yaml in this repository for an example configuration.
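
Before launching a run, it can help to confirm that the configuration file defines the settings this README refers to (input_dir, supplement_dir, output_dir, too_enabled). The sketch below is a hypothetical pre-flight check, not part of the workflow; it assumes these settings are written as top-level keys, which you should verify against the params.yaml shipped in the repository.

```bash
# Hypothetical pre-flight check (not part of the repository): print each of
# the given keys that does not appear as a top-level entry in a YAML file.
missing_config_keys() {
    local config="$1"; shift
    local key
    for key in "$@"; do
        # A top-level key starts in column 0 and is followed by a colon.
        grep -q -E "^${key}[[:space:]]*:" "$config" || echo "$key"
    done
}

# Example usage (prints nothing when every key is present):
#   missing_config_keys params.yaml input_dir supplement_dir output_dir too_enabled
```

This is a plain text match rather than a YAML parse, so it will not detect keys nested under another section.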

Quick Start

Prepare Data and Config:
- Organize your input BAMs, supplementary files, and FinaleMe JARs/scripts as per the paths in your params.yaml.
- Ensure your params.yaml is correctly filled out.
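
Because the workflow expects indexed BAM files, a quick check for missing indexes can save a failed run. The following is a minimal sketch; the function name is hypothetical, and accepting both sample.bam.bai and sample.bai as index names is an assumption, since this README does not state which naming the Snakefile expects.

```bash
# Hypothetical pre-flight check (not part of the repository): list BAM files
# in a directory that have no index file next to them.
bams_missing_index() {
    local dir="$1" bam
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue    # the directory contains no BAM files
        # Accept both common index names: sample.bam.bai and sample.bai
        if [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
            echo "$bam"
        fi
    done
}

# Example usage:
#   bams_missing_index input
# A missing index can be created with:
#   samtools index input/sample.bam
```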

Run Snakemake: Navigate to the directory containing the Snakefile and params.yaml.

# Activate the conda environment
conda activate finaleme_workflow

# Dry-run to check the workflow plan
snakemake -n --configfile params.yaml

# Execute the workflow (adjust --cores and --jobs as needed)
# --cores: Total number of CPU cores Snakemake can use.
# --jobs: Maximum number of concurrent jobs (rules) to run.

# Note: FinaleMe is not multithreaded yet, so --jobs should ideally equal --cores.
snakemake --configfile params.yaml --cores <number_of_cores> --jobs <number_of_jobs>

SLURM Execution (Optional): In an HPC environment, the workflow can submit its steps as SLURM jobs through a Snakemake profile.

# Using the sample SLURM profile provided in this repository
snakemake --configfile params.yaml --profile slurm_profile > snakemake.log 2>&1 &

Workflow Structure and Output

Input BAMs: Located in the directory specified by input_dir.
Supplementary Files: Reference genomes, annotations, etc., are in supplement_dir.
Main Output: Processed files for each sample are written to a per-sample subdirectory, output_dir/{sample_name}/.
- {sample_name}.CpgMultiMetricsStats.details.bed.gz: Extracted features (Step 1).
- {sample_name}.finaleme.model: Trained HMM model (Step 2).
- {sample_name}.finaleme.prediction.bed.gz: Raw methylation predictions (Step 3).
- {sample_name}.finaleme.cov.b37.bw: Coverage BigWig (Step 4).
- {sample_name}.finaleme.methy_count.b37.bw: Methylation count BigWig (Step 4).
Tissue of Origin Output (if too_enabled: True): Files related to the TOO analysis are placed in output_dir/tissue_of_origin/.
- tissue_of_origin_results.tsv: Final results from the R script.
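
The per-sample file names listed above make it straightforward to confirm that a sample finished all four steps. The sketch below is a hypothetical helper, not part of the workflow; it assumes the layout output_dir/{sample_name}/{sample_name}.SUFFIX described in this section.

```bash
# Hypothetical helper (not part of the repository): print the expected
# per-sample output files that are missing from output_dir/{sample_name}/.
missing_sample_outputs() {
    local output_dir="$1" sample="$2" suffix path
    for suffix in \
        CpgMultiMetricsStats.details.bed.gz \
        finaleme.model \
        finaleme.prediction.bed.gz \
        finaleme.cov.b37.bw \
        finaleme.methy_count.b37.bw
    do
        path="$output_dir/$sample/$sample.$suffix"
        [ -e "$path" ] || echo "$path"
    done
}

# Example usage (prints nothing when the sample is complete):
#   missing_sample_outputs output sample1
```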

Notes

Ensure sufficient memory (-Xmx Java options in params.yaml) is allocated for each Java step, especially for HMM training and decoding, based on your dataset size and system resources.
Log files for each step are written to output_dir/{sample_name}/logs/ or output_dir/tissue_of_origin/logs/.
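
When a step fails, for example because the Java heap was too small, the per-step logs are the first place to look. The following is a minimal sketch for finding logs that mention a problem; the function name and the search terms "error" and "exception" are assumptions, since the exact log file names and message formats are not documented here.

```bash
# Hypothetical helper (not part of the repository): list log files under a
# logs directory that mention "error" or "exception", ignoring case.
logs_with_errors() {
    local log_dir="$1"
    # -r: recurse, -l: print file names only, -i: ignore case, -E: extended regex
    grep -r -l -i -E 'error|exception' "$log_dir" | sort
}

# Example usage:
#   logs_with_errors output/sample1/logs
```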

Citation

Liu Y# et al. (2024) FinaleMe: Predicting DNA methylation by the fragmentation patterns of plasma cell-free DNA. Nature Communications doi: https://doi.org/10.1038/s41467-024-47196-6

Contact

Kundan Baliga: [email protected]
Ravi Bandaru: [email protected]
Yaping Liu: [email protected]

License

This project is released under the MIT License. See the included LICENSE file for details.
