HuGenVarDetective (HgVD) contains the code and workflow for analysing human genes using the Genome Analysis Toolkit (GATK) version 4. The project has two goals: [i] to use GATK4's variant discovery tools (SNPs/MNPs and INDELs) to conduct comprehensive genomic analyses, and [ii] to annotate the discovered variants within our gene of interest using SnpEff and SnpSift.
The Genome Analysis Toolkit (GATK) is a software package developed by the Broad Institute that focuses on variant discovery and genotyping. It aims to bring together well-established tools from the GATK and Picard codebases under a streamlined framework, and to enable selected tools to be run in a massively parallel way on local clusters or in the cloud using Apache Spark. The toolkit is widely used in the field of bioinformatics, especially for analysing high-throughput sequencing data.
Documentation on the requirements, installation/building and running/usage of the software package can be found at the Broad Institute's GATK page.
For the purpose of this project, GATK4 was installed.
The bash script in this project follows the GATK authors' Best Practices workflows. The project was conducted using both Nanopore and Illumina sequencing data.
A copy of the original directory containing the reads (Illumina/Nanopore) is used. Before this analysis, quality checks on the sequenced reads and removal of adapters, primers and poor-quality reads are recommended, using QC tools such as FastQC, fastp, Trimmomatic, Porechop, NanoPlot and MinIONQC.
The overall directory structure for this GATK analysis is illustrated below:
../
    experiment_name/
        aligned_data/
        data/
        reads/
        results/
        supporting_files/
Within the copied experiment folder experiment_name, the above directories are created to facilitate the pipeline analysis process.
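A minimal sketch of how these directories could be created in one step. The function name make_experiment_dirs is hypothetical and not part of the project's script; only the directory names are taken from the layout above.

```shell
#!/usr/bin/env bash
# Sketch: create the working directories inside a copied experiment folder.
# make_experiment_dirs is a hypothetical helper, not part of gatk_pipeline.sh.
make_experiment_dirs() {
    local experiment_dir="${1:?usage: make_experiment_dirs <experiment_dir>}"
    local sub
    for sub in aligned_data data reads results supporting_files; do
        # -p creates missing parents and does not fail if the directory exists
        mkdir -p "${experiment_dir}/${sub}"
    done
}
```

Calling make_experiment_dirs experiment_name from the parent directory would leave any existing reads/ folder untouched and add the remaining directories beside it.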
For Nanopore sequenced data, the file structure of the input directory set up for this GATK analysis looks like:
../
    experiment_name/
        reads/
            A1.fastq.gz
            A2.fastq.gz
            A3.fastq.gz
            A4.fastq.gz
For Illumina sequenced data, the file structure of the input directory set up for this GATK analysis looks like:
../
    experiment_name/
        reads/
            A1_R1.fastq.gz
            A1_R2.fastq.gz
            A2_R1.fastq.gz
            A2_R2.fastq.gz
            A3_R1.fastq.gz
            A3_R2.fastq.gz
            A4_R1.fastq.gz
            A4_R2.fastq.gz
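Because the Illumina layout relies on the _R1/_R2 naming convention, it can help to confirm that every forward read file has a mate before starting. The following is a sketch only; list_paired_samples is a hypothetical helper and assumes the file names follow the pattern shown above.

```shell
#!/usr/bin/env bash
# Sketch: print one sample name per line for each R1/R2 pair found in a reads
# directory, and report any R1 file that lacks its R2 mate.
# list_paired_samples is a hypothetical helper, not part of gatk_pipeline.sh.
list_paired_samples() {
    local reads_dir="$1"
    local status=0
    local r1 sample
    for r1 in "$reads_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue    # glob matched nothing
        sample=$(basename "$r1" _R1.fastq.gz)
        if [ -f "$reads_dir/${sample}_R2.fastq.gz" ]; then
            printf '%s\n' "$sample"
        else
            printf 'missing R2 mate for sample %s\n' "$sample" >&2
            status=1
        fi
    done
    return "$status"
}
```

For the example directory above, list_paired_samples experiment_name/reads would print A1 to A4, one per line, and return a non-zero status only if a mate file were missing.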
Note: The reads used for the analysis must be quality checked and trimmed so that only accurate, high-quality reads enter the pipeline. For easy identification, the trimmed files may be labelled as .trimmed.fastq.gz for Nanopore data and R1.trimmed.fastq.gz | R2.trimmed.fastq.gz for Illumina data.
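As an illustration of this labelling convention, the sketch below derives the trimmed file name from a raw read file name. The helper trimmed_name is hypothetical; how the trimmed files are actually produced depends on the QC tool chosen.

```shell
#!/usr/bin/env bash
# Sketch: map a raw read file name to its ".trimmed.fastq.gz" counterpart.
# trimmed_name is a hypothetical helper, not part of gatk_pipeline.sh.
trimmed_name() {
    case "$1" in
        *.trimmed.fastq.gz)
            # already labelled as trimmed: leave unchanged
            printf '%s\n' "$1" ;;
        *.fastq.gz)
            printf '%s\n' "${1%.fastq.gz}.trimmed.fastq.gz" ;;
        *)
            printf 'unexpected read file name: %s\n' "$1" >&2
            return 1 ;;
    esac
}

trimmed_name A1.fastq.gz       # prints A1.trimmed.fastq.gz
trimmed_name A1_R1.fastq.gz    # prints A1_R1.trimmed.fastq.gz
```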
Supporting files are required for this project. They include the prefix.reference.fasta file, a sequence dictionary created from the reference file, and the prefix.reference.fasta indices produced for alignment and variant calling. These are placed in the supporting_files directory; an example is shown below.
mmp3.nc_000011.10.reference.dict
mmp3.nc_000011.10.reference.fasta
mmp3.nc_000011.10.reference.fasta.amb
mmp3.nc_000011.10.reference.fasta.ann
mmp3.nc_000011.10.reference.fasta.bwt
mmp3.nc_000011.10.reference.fasta.fai
mmp3.nc_000011.10.reference.fasta.pac
mmp3.nc_000011.10.reference.fasta.sa
These files were generated from the MMP3 reference file mmp3.nc_000011.10.reference.fasta.
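The source does not state which commands produced these files. Files with these extensions are commonly produced by bwa index (.amb, .ann, .bwt, .pac, .sa), samtools faidx (.fai) and gatk CreateSequenceDictionary (.dict), so treat that mapping as an assumption about this project. The sketch below only checks that the expected files are present; missing_supporting_files is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: print the path of every expected supporting file that is missing.
# Returns 0 when all eight files are present, 1 otherwise.
# missing_supporting_files is a hypothetical helper, not part of gatk_pipeline.sh.
missing_supporting_files() {
    local dir="$1"    # e.g. experiment_name/supporting_files
    local ref="$2"    # e.g. mmp3.nc_000011.10.reference.fasta
    local status=0
    local name
    # The dictionary replaces the .fasta extension; the indices append to it.
    for name in "${ref%.fasta}.dict" "$ref" \
                "$ref.amb" "$ref.ann" "$ref.bwt" "$ref.fai" "$ref.pac" "$ref.sa"; do
        if [ ! -f "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            status=1
        fi
    done
    return "$status"
}
```

Running missing_supporting_files supporting_files mmp3.nc_000011.10.reference.fasta before the pipeline gives an early, readable failure instead of an error part-way through alignment or variant calling.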
Check the bash script and make any user-specific changes to it (e.g. the path names assigned to its variables).
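The kind of user-specific block to look for might resemble the sketch below. All variable names here are hypothetical and chosen only for illustration; check them against the names actually used in gatk_pipeline.sh.

```shell
#!/usr/bin/env bash
# Sketch of user-specific path settings (hypothetical variable names).
# EXPERIMENT_DIR can be overridden from the environment; otherwise it
# defaults to an experiment_name folder in the current directory.
EXPERIMENT_DIR="${EXPERIMENT_DIR:-$PWD/experiment_name}"

# Sub-directories follow the layout described earlier in this document.
READS_DIR="${EXPERIMENT_DIR}/reads"
ALIGNED_DIR="${EXPERIMENT_DIR}/aligned_data"
RESULTS_DIR="${EXPERIMENT_DIR}/results"
SUPPORTING_DIR="${EXPERIMENT_DIR}/supporting_files"

# Reference file name taken from the MMP3 example above.
REFERENCE="${SUPPORTING_DIR}/mmp3.nc_000011.10.reference.fasta"
```

Deriving every path from a single base variable means only one line needs editing when the experiment folder moves.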
To run the script, navigate to the directory that contains it and run the bash script:
cd <path/to/directory/containing/gatk_pipeline.sh>
bash ./gatk_pipeline.sh