
# Full pipeline and tutorial

We provide an example BASH script that implements the entire pipeline of HATCHet. The script and its usage are described in detail in a guided tutorial. The user can reuse the script for every execution of HATCHet on different data by copying it into the running directory and updating the paths to the required data and dependencies at the beginning of the script, as described in the guided tutorial; a sketch of this configuration block is shown below.
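For orientation, here is a minimal sketch of what such a configuration block looks like; the variable names and paths below are illustrative assumptions, not the exact names used in the repository's script.

```bash
#!/usr/bin/env bash
set -e  # stop at the first failing step instead of continuing silently

# --- paths to be edited by the user (hypothetical names) ---
REF="/path/to/reference/hg19.fa"                 # human reference genome
NORMAL="/path/to/normal.bam"                     # matched-normal sample
BAMS="/path/to/tumor1.bam /path/to/tumor2.bam"   # one BAM per tumor sample
NAMES="Tumor1 Tumor2"                            # sample names, same order as BAMS
SAM="/path/to/samtools/bin/"                     # SAMtools dependency
BCF="/path/to/bcftools/bin/"                     # BCFtools dependency
XDIR="/path/to/running-dir/"                     # directory for all output
```

Everything after this block stays the same across runs; only these paths change from dataset to dataset.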

## Demos

Each demo is a guided, executable example of HATCHet on a dataset included in the corresponding demo's folder of this repository (inside examples). The demos illustrate how to apply HATCHet to datasets characterized by different features, noise levels, and kinds of data. The default parameters of HATCHet successfully handle most datasets, but some datasets may have special features or higher-than-expected variance. Understanding how HATCHet works, assessing the quality of the results, and tuning the few parameters needed to fit the unique features of the data at hand are thus crucial to always obtain the best-quality results. These are the goals of these demos.

More specifically, each demo is simultaneously a guided description of the entire example and a BASH script that can be directly executed to run the complete demo after setting the few required paths at the beginning of the file. As such, the user can both read the guided description as a web page and run the same script to execute the demo (see the usage sketch after the table). At this time the following demos are available (more will be added in the near future):

| Name | Demo | Description |
|------|------|-------------|
| demo-WGS-sim | demo-wgs-sim | A demo on a typical WGS (whole-genome sequencing) multi-sample dataset with standard noise and variance of the data |
| demo-WGS-cancer | demo-wgs-cancer | A demo on a cancer WGS (whole-genome sequencing) multi-sample dataset with high noise and variance of the data |
| demo-WES | demo-wes | A demo on a cancer WES (whole-exome sequencing) multi-sample dataset, which is typically characterized by very high variance of RDR |
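Running a demo follows the same pattern for every dataset; the sketch below assumes the demo script is named after its folder, which is an assumption to verify against the actual file names under examples.

```bash
# Hypothetical invocation of the WES demo; the script name and the paths it
# needs are documented at the top of the file in the repository.
cd examples/demo-wes/
# 1. open the demo script and set the few required paths at the beginning
# 2. then execute it directly:
bash demo-wes.sh
```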

## Custom pipelines

The repository includes custom pipelines designed to adapt the complete HATCHet pipeline to special conditions or to integrate processed data produced by other pipelines. Each custom pipeline is a variation of HATCHet's main pipeline; we thus recommend that the user first carefully understand the main BASH script through the corresponding guided tutorial, and study the provided demos, in order to apply HATCHet properly and obtain best-quality results. Each custom pipeline also includes a specific demo, which is a guided and executable example on example data.

| Name | Description | Script | Demo | Variations |
|------|-------------|--------|------|------------|
| GATK4-CNV | Custom pipeline for segmented files from the GATK4 CNV pipeline | custom-gatk4-cnv.sh | demo-gatk4-cnv.sh | This custom pipeline takes as input segmented files which already contain the estimated RDR and BAF. As such, the first pre-processing steps of HATCHet (binBAM, deBAF, and comBBo) are not needed; for this reason, the dependencies SAMtools and BCFtools and the required data (human reference genome, matched-normal sample, and BAM files) are not needed in this case. |
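As with the demos, the custom pipeline is driven by the paths set at the top of its script; a minimal sketch of a run, with illustrative (assumed) file names for the GATK4 segmented inputs:

```bash
# Hypothetical usage of the GATK4-CNV custom pipeline; sample1.seg and
# sample2.seg are placeholders for the segmented outputs of GATK4 CNV.
# 1. edit the paths at the beginning of custom-gatk4-cnv.sh so that it
#    points to the segmented files (RDR and BAF already included);
#    no reference genome, matched-normal sample, or BAM files are needed
# 2. then execute it from the running directory:
bash custom-gatk4-cnv.sh
```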

## Detailed steps

The full pipeline of HATCHet is composed of 7 sequential steps, starting from the required input data. The description of each step also includes the details of the corresponding input/output, which are especially useful when one wants to replace or change some of the steps in the pipeline while guaranteeing the correct functioning of HATCHet. A data-flow sketch of the 7 steps follows the table.

| Order | Step | Description |
|-------|------|-------------|
| (1) | binBAM | This step splits the human reference genome into bins, i.e. fixed-size small genomic regions, and computes the number of sequencing reads aligned to each bin from every given tumor sample and from the matched-normal sample. |
| (2) | deBAF | This step calls heterozygous germline SNPs from the matched-normal sample and counts the number of reads covering both alleles of each identified heterozygous SNP in every tumor sample. |
| (3) | comBBo | This step combines the read counts and the allele counts for the identified germline SNPs to compute the read-depth ratio (RDR) and B-allele frequency (BAF) of every genomic bin. |
| (4) | cluBB | This step globally clusters genomic bins along the entire genome and jointly across tumor samples, and estimates the corresponding values of RDR and BAF for every cluster in every sample. |
| (5) | BBot | This step produces informative plots concerning the computed RDRs, BAFs, and clusters. The information produced by this step is important to validate the computed clusters of genomic regions. |
| (6) | hatchet | This step computes allele-specific fractional copy numbers, solves a constrained distance-based simultaneous factorization to compute allele- and clone-specific copy numbers and clone proportions, and applies a model-selection criterion to select the number of clones by explicitly considering the trade-off between subclonal copy-number aberrations and whole-genome duplication. |
| (7) | BBeval | This step analyzes the inferred copy-number states and clone proportions and produces informative plots jointly considering all samples from the same patient. In addition, this step can also combine results obtained for different patients and perform integrative analyses. |
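The sketch below summarizes how the output of each step feeds the next; the step names come from the table above, while the angle-bracket arguments and file roles are illustrative assumptions rather than the commands' actual interfaces.

```bash
# Hypothetical data-flow sketch of the 7 sequential steps; consult each
# command's own documentation for the real flags and file formats.
binBAM  <reference> <normal.bam> <tumor.bams>  # -> read counts per genomic bin
deBAF   <reference> <normal.bam> <tumor.bams>  # -> SNP calls + allele counts
comBBo  <read-counts> <allele-counts>          # -> RDR and BAF per bin
cluBB   <RDR/BAF-per-bin>                      # -> global clusters across samples
BBot    <clusters>                             # -> diagnostic plots (quality control)
hatchet <clusters>                             # -> copy numbers + clone proportions
BBeval  <results>                              # -> final plots and integrative analyses
```

Because each step consumes only the files produced by the previous ones, any step can be replaced by a custom component as long as it reads and writes the same formats; this is the property exploited by the custom pipelines above.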

## Recommendations and quality control

All the components of HATCHet's pipeline expose some basic parameters that allow the user to deal with data characterized by different features. The default values of these parameters allow one to successfully apply HATCHet to most datasets, but special or noisy datasets may require tuning some parameters. The user can handle these cases by following the recommendations reported here, reading the descriptions of the various steps, and using the informative plots to verify the results. In the following guides and recommendations, we guide the user in the interpretation of HATCHet's inference, explain how to perform quality control to guarantee the best-quality results, and describe how the user can control and tune some of the parameters to obtain the best-fitting results. We thus split the recommendations into distinct topics with dedicated descriptions.

| Recommendation | Description |
|----------------|-------------|
| Analyze HATCHet inference | Interpret HATCHet's inference, perform quality and error control, and investigate alternative solutions |
| Analyze global clustering | Interpret the global clustering, perform quality and error control, and tune the related parameters |
| Analyze different types of data | Tune parameters to better analyze different types of data, such as those from WES |
| Improve running time | Tips for improving the running time of the whole pipeline |