
# Full pipeline and tutorial

We provide an example BASH script that implements the entire pipeline of HATCHet. The script and its usage are described in detail in a guided tutorial. The user can reuse the script for every execution of HATCHet on different data by copying it into the running directory and updating the paths to the required data and dependencies at the beginning of the script, as described in the guided tutorial; a sketch of this configuration block is shown below.
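For orientation, here is a minimal sketch of what such a configuration block looks like; the variable names and paths below are illustrative assumptions, not the exact names used in the repository's script.

```bash
#!/usr/bin/env bash
set -e  # stop at the first failing step instead of continuing silently

# --- paths to be edited by the user (hypothetical names) ---
REF="/path/to/reference/hg19.fa"                 # human reference genome
NORMAL="/path/to/normal.bam"                     # matched-normal sample
BAMS="/path/to/tumor1.bam /path/to/tumor2.bam"   # one BAM per tumor sample
NAMES="Tumor1 Tumor2"                            # sample names, same order as BAMS
SAM="/path/to/samtools/bin/"                     # SAMtools dependency
BCF="/path/to/bcftools/bin/"                     # BCFtools dependency
XDIR="/path/to/running-dir/"                     # directory for all output
```

Everything after this block stays the same across runs; only these paths change from dataset to dataset.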

## Demos

Each demo is a guided, executable example of HATCHet on a dataset included in the corresponding demo's folder of this repository (inside examples). The demos illustrate how to apply HATCHet to datasets characterized by different features, noise levels, and kinds of data. The default parameters of HATCHet successfully handle most datasets, but some datasets may have special features or higher-than-expected variance. Understanding how HATCHet works, assessing the quality of the results, and tuning the few parameters needed to fit the unique features of the data at hand are thus crucial to always obtain the best-quality results. These are the goals of these demos.

More specifically, each demo is simultaneously a guided description of the entire example and a BASH script that can be directly executed to run the complete demo after setting the few required paths at the beginning of the file. As such, the user can both read the guided description as a web page and run the same script to execute the demo (see the usage sketch after the table). At this time the following demos are available (more will be added in the near future):

| Name | Demo | Description |
|------|------|-------------|
| demo-WGS-sim | demo-wgs-sim | A demo on a typical WGS (whole-genome sequencing) multi-sample dataset with standard noise and variance of the data |
| demo-WGS-cancer | demo-wgs-cancer | A demo on a cancer WGS (whole-genome sequencing) multi-sample dataset with high noise and variance of the data |
| demo-WES | demo-wes | A demo on a cancer WES (whole-exome sequencing) multi-sample dataset, which is typically characterized by very high variance of RDR |
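Running a demo follows the same pattern for every dataset; the sketch below assumes the demo script is named after its folder, which is an assumption to verify against the actual file names under examples.

```bash
# Hypothetical invocation of the WES demo; the script name and the paths it
# needs are documented at the top of the file in the repository.
cd examples/demo-wes/
# 1. open the demo script and set the few required paths at the beginning
# 2. then execute it directly:
bash demo-wes.sh
```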

## Custom pipelines

The repository includes custom pipelines designed to adapt the complete HATCHet pipeline to special conditions or to integrate processed data produced by other pipelines. Each custom pipeline is a variation of HATCHet's main pipeline; we thus recommend that the user first carefully understand the main BASH script through the corresponding guided tutorial, and study the provided demos, in order to apply HATCHet properly and obtain best-quality results. Each custom pipeline also includes a specific demo, which is a guided and executable example on example data.

| Name | Description | Script | Demo | Variations |
|------|-------------|--------|------|------------|
| GATK4-CNV | Custom pipeline for segmented files from the GATK4 CNV pipeline | custom-gatk4-cnv.sh | demo-gatk4-cnv.sh | This custom pipeline takes as input segmented files which already contain the estimated RDR and BAF. As such, the first pre-processing steps of HATCHet (binBAM, deBAF, and comBBo) are not needed; for this reason, the dependencies SAMtools and BCFtools and the required data (human reference genome, matched-normal sample, and BAM files) are not needed in this case. |
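As with the demos, the custom pipeline is driven by the paths set at the top of its script; a minimal sketch of a run, with illustrative (assumed) file names for the GATK4 segmented inputs:

```bash
# Hypothetical usage of the GATK4-CNV custom pipeline; sample1.seg and
# sample2.seg are placeholders for the segmented outputs of GATK4 CNV.
# 1. edit the paths at the beginning of custom-gatk4-cnv.sh so that it
#    points to the segmented files (RDR and BAF already included);
#    no reference genome, matched-normal sample, or BAM files are needed
# 2. then execute it from the running directory:
bash custom-gatk4-cnv.sh
```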

## Detailed steps

The full pipeline of HATCHet is composed of 7 sequential steps, starting from the required input data. The description of each step also includes the details of the corresponding input/output, which are especially useful when one wants to replace or change some of the steps in the pipeline while guaranteeing the correct functioning of HATCHet. A data-flow sketch of the 7 steps follows the table.

| Order | Step | Description |
|-------|------|-------------|
| (1) | binBAM | This step splits the human reference genome into bins, i.e. fixed-size small genomic regions, and computes the number of sequencing reads aligned to each bin from every given tumor sample and from the matched-normal sample. |
| (2) | deBAF | This step calls heterozygous germline SNPs from the matched-normal sample and counts the number of reads covering both alleles of each identified heterozygous SNP in every tumor sample. |
| (3) | comBBo | This step combines the read counts and the allele counts for the identified germline SNPs to compute the read-depth ratio (RDR) and B-allele frequency (BAF) of every genomic bin. |
| (4) | cluBB | This step globally clusters genomic bins along the entire genome and jointly across tumor samples, and estimates the corresponding values of RDR and BAF for every cluster in every sample. |
| (5) | BBot | This step produces informative plots concerning the computed RDRs, BAFs, and clusters. The information produced by this step is important to validate the computed clusters of genomic regions. |
| (6) | hatchet | This step computes allele-specific fractional copy numbers, solves a constrained distance-based simultaneous factorization to compute allele- and clone-specific copy numbers and clone proportions, and applies a model-selection criterion to select the number of clones by explicitly considering the trade-off between subclonal copy-number aberrations and whole-genome duplication. |
| (7) | BBeval | This step analyzes the inferred copy-number states and clone proportions and produces informative plots jointly considering all samples from the same patient. In addition, this step can also combine results obtained for different patients and perform integrative analyses. |
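The sketch below summarizes how the output of each step feeds the next; the step names come from the table above, while the angle-bracket arguments and file roles are illustrative assumptions rather than the commands' actual interfaces.

```bash
# Hypothetical data-flow sketch of the 7 sequential steps; consult each
# command's own documentation for the real flags and file formats.
binBAM  <reference> <normal.bam> <tumor.bams>  # -> read counts per genomic bin
deBAF   <reference> <normal.bam> <tumor.bams>  # -> SNP calls + allele counts
comBBo  <read-counts> <allele-counts>          # -> RDR and BAF per bin
cluBB   <RDR/BAF-per-bin>                      # -> global clusters across samples
BBot    <clusters>                             # -> diagnostic plots (quality control)
hatchet <clusters>                             # -> copy numbers + clone proportions
BBeval  <results>                              # -> final plots and integrative analyses
```

Because each step consumes only the files produced by the previous ones, any step can be replaced by a custom component as long as it reads and writes the same formats; this is the property exploited by the custom pipelines above.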

## Recommendations and quality control

All the components of HATCHet's pipeline expose some basic parameters that allow the user to deal with data characterized by different features. The default values of these parameters allow one to successfully apply HATCHet to most datasets, but special or noisy datasets may require tuning some parameters. The user can handle these cases by following the recommendations reported here, reading the descriptions of the various steps, and using the informative plots to verify the results. In the following guides and recommendations, we guide the user in the interpretation of HATCHet's inference, explain how to perform quality control to guarantee the best-quality results, and describe how the user can control and tune some of the parameters to obtain the best-fitting results. We thus split the recommendations into distinct topics with dedicated descriptions.

| Recommendation | Description |
|----------------|-------------|
| Analyze HATCHet inference | Interpret HATCHet's inference, perform quality and error control, and investigate alternative solutions |
| Analyze global clustering | Interpret the global clustering, perform quality and error control, and tune the related parameters |
| Analyze different types of data | Tune parameters to better analyze different types of data, such as those from WES |
| Improve running time | Tips for improving the running time of the whole pipeline |