Skip to content

A Snakemake pipeline for copy number variant calling without normal tissue samples

Notifications You must be signed in to change notification settings

pedricolino/cnakepit

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

cnakepit 🐍

Definition

A snake pit is, in a literal sense, a hole filled with snakes. In idiomatic speech, "snake pits" are places of horror, torture and death in European legends and fairy tales. (Wikipedia)

In the field of bioinformatics, a cnakepit is now a Copy Number Alterations detection snaKEmake PIpeline for tumor-only Targeted sequencing data. In the truest sense of the word, this horrific pit is filled with snakes such as pythons, mambas and anacondas.

What is it?

This is a pipeline to (attempt to) call CNAs / CNVs by addressing the following challenges:

  • requires no matched normal sample
  • ideally, requires no reference sample or panel of normals
  • suited for short reads (Illumina)
  • detects (and filters for) somatic CNVs
  • suitable for panels (/targeted sequencing)
  • maintained
  • publicly used (>50 citations if not too new)
  • ideally infers tumor structure and estimates tumor characteristics such as tumor ploidy & purity

This pipeline was originally forked (and later unforked) from this private repo named FFPE-panel-pipeline which pre-processes raw reads and later maps them.

Installation

  1. Install conda (usually already installed on clusters)
  2. Install a conda environment with SnakeMake: conda create -c conda-forge -c bioconda -n snakemake-vanilla snakemake=7.32.3
    Note that I specified an older version of SnakeMake because, as of January '24, the newest version is incompatible with the current SnakeMake profile of the CUBI cluster and our pipeline_job.sh.
    All other required tools and dependencies will be installed automatically by SnakeMake during the first run of the pipeline.

Usage

  1. Set your configuration in the config/config.yaml file. Most options are documented or self-explanatory.
  2. In the resources directory, add your panel design BED file and your samples (sheet) to the data subdirectory.
  3. Activate your conda environment with SnakeMake installed: conda activate snakemake-vanilla.
  4. Before actually running the pipeline, test your configuration with a dry run by adding -n to the SnakeMake command. The pipeline can be run either by submitting the job script with sbatch pipeline_job.sh to SLURM or by calling SnakeMake directly with a command like snakemake --use-conda or snakemake --use-conda -c 8 (preferably on a compute node).

General workflow

  • Read preprocessing: QC of the raw data, trimming, QC of the trimmed data
  • Mapping and QC
  • Variant detection and filtering with Mutect2 / GATK
  • CNV calling by CNVkit with different segmentation methods
  • Tumor analysis and CNV correction by PureCN:
    • either hierarchical clustering of CNVs or
    • minimal re-segmentation with PSCBS
  • For testing purposes, there is now an option to include a premade panel of normals file (RDS format) for PureCN, which will skip the coverage computation and segmentation by CNVkit and consequently any additional round of CNV calling with any other panel of normals.
flowchart TD
    subgraph Main part
    A[raw reads] -->|QC & trimming| B[trimmed reads];
    B -->|QC & mapping| C[mapped reads];
    C -->|variant calling| D[variants];
    D -->|filtering| E[germline variants];
    D -->|filtering| X[somatic variants];
    C -->|compute positional read depth| I[coverage data];
    I -->|CNV calling by CNVkit| F[provisional CNVs];
    E -->|B-allele analysis by CNVkit| F[provisional CNVs];
    F -->|CNV calls correction by PureCN| G[final CNVs];
    X -->|Tumor structure inference by PureCN| G[final CNVs];
    end
    subgraph Optional 2nd round of CNV calling w/ panel of normals
    H -->|bias correction by CNVkit| J[provisional CNVs];
    I -->|CNV calling by CNVkit| J[provisional CNVs];
    E -->|B-allele analysis by CNVkit| J[provisional CNVs];
    G -->|choose appropriate samples among cohort| H[panel of normals];
    J -->|CNV calls correction by PureCN by PureCN| K[final CNVs];
    X -->|Tumor structure inference by PureCN| K[final CNVs];
    end
Loading