HuGenVarDetective (HgVD) contains code and workflow for analysing human genes using the Genome Analysis Toolkit (GATK) version 4. The goal of this project was to leverage GATK4's powerful tools for variant discovery (SNP/MNP and INDELs) to conduct comprehensive genomic analyses.

Bkwame/HuGenVarDetective

HuGenVarDetective

HuGenVarDetective (HgVD) contains code and a workflow for analysing human genes using the Genome Analysis Toolkit (GATK) version 4. The project has two aims: (i) to use GATK4's variant discovery tools (SNP/MNP and INDEL calling) to conduct comprehensive genomic analyses, and (ii) to annotate the discovered variants within our gene of interest using SnpEff and SnpSift.

Content

Aim I

About GATK4

The Genome Analysis Toolkit (GATK) is a software package developed by the Broad Institute that focuses on variant discovery and genotyping. It aims to bring together well-established tools from the GATK and Picard codebases under a streamlined framework, and to enable selected tools to be run in a massively parallel way on local clusters or in the cloud using Apache Spark. The toolkit is widely used in the field of bioinformatics, especially for analysing high-throughput sequencing data.

Installation and Usage

Documentation on the requirements, installation/building, and usage of the software package can be found on the Broad Institute's GATK page.

For the purpose of this project, GATK4 was installed

Getting Started (Project)

The bash script in this repository follows the best-practice workflows published by the GATK authors. The project was conducted using both Nanopore and Illumina sequencing data.

Pre-requisites Prior to Analysis

The analysis is run on a copy of the original directory containing the reads (Illumina or Nanopore), so the raw data stay untouched. Before the analysis, we recommend quality-checking the reads and removing adapters, primers, and poor-quality sequence with QC tools such as FastQC, fastp, Trimmomatic, Porechop, NanoPlot, or MinIONQC.

Directory Structure and Configuration Data

The overall directory structure for this GATK analysis is illustrated below:

../
    experiment_name/
        aligned_data/
        data/
        reads/
        results/
        supporting_files/

Within the copied experiment folder experiment_name, the above directories are created to facilitate the pipeline analysis process.
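The layout above can be created in one step. The following is a minimal sketch, not taken from the pipeline script itself; the `EXPERIMENT_DIR` variable and its default value are assumptions introduced for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical experiment root; override by exporting EXPERIMENT_DIR.
EXPERIMENT_DIR="${EXPERIMENT_DIR:-experiment_name}"

# Create the five working directories listed in the layout above.
# mkdir -p leaves directories that already exist untouched.
for sub in aligned_data data reads results supporting_files; do
    mkdir -p "${EXPERIMENT_DIR}/${sub}"
done

# List the top-level directories, one per line.
find "${EXPERIMENT_DIR}" -mindepth 1 -maxdepth 1 -type d | sort
```

Because `mkdir -p` is idempotent, the snippet can be re-run safely on an experiment folder that is already partly set up.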

Input Data

For Nanopore sequenced data, the file structure of the input directory set up for this GATK analysis looks like:

../
    experiment_name/
        reads/
            A1.fastq.gz
            A2.fastq.gz
            A3.fastq.gz
            A4.fastq.gz

For Illumina sequenced data, the file structure of the input directory set up for this GATK analysis looks like:

../
    experiment_name/
        reads/
            A1_R1.fastq.gz
            A1_R2.fastq.gz
            A2_R1.fastq.gz
            A2_R2.fastq.gz
            A3_R1.fastq.gz
            A3_R2.fastq.gz
            A4_R1.fastq.gz
            A4_R2.fastq.gz

Note: The reads must be quality-checked and trimmed before they enter the pipeline, so that only accurate, high-quality reads are analysed. For easy identification, the trimmed files may be named with the suffix .trimmed.fastq.gz for Nanopore data and R1.trimmed.fastq.gz | R2.trimmed.fastq.gz for Illumina data.
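Since Illumina data come as R1/R2 pairs, it can help to confirm that every sample has both mates before starting. The sketch below assumes the untrimmed `_R1.fastq.gz` / `_R2.fastq.gz` naming shown above; the helper names `sample_id` and `mate_path` and the `READS_DIR` variable are hypothetical and not part of the pipeline script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the sample ID for a forward-read file,
# e.g. reads/A1_R1.fastq.gz -> A1
sample_id() {
    local name
    name="$(basename "$1")"
    printf '%s\n' "${name%_R1.fastq.gz}"
}

# Print the reverse-read path expected next to a forward-read path,
# e.g. reads/A1_R1.fastq.gz -> reads/A1_R2.fastq.gz
mate_path() {
    printf '%s\n' "${1%_R1.fastq.gz}_R2.fastq.gz"
}

# Hypothetical default location of the reads directory.
READS_DIR="${READS_DIR:-experiment_name/reads}"

for r1 in "${READS_DIR}"/*_R1.fastq.gz; do
    [ -e "${r1}" ] || continue   # the glob matched no files
    r2="$(mate_path "${r1}")"
    if [ -e "${r2}" ]; then
        echo "$(sample_id "${r1}"): paired"
    else
        echo "$(sample_id "${r1}"): missing ${r2}" >&2
    fi
done
```

If the trimmed naming is used instead, the `_R1.fastq.gz` and `_R2.fastq.gz` suffixes in the two helpers would need to be adjusted to match.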

Supporting Files

Supporting files are required for this project. They include the prefix.reference.fasta file, a sequence dictionary built from the reference, and the prefix.reference.fasta index files used for alignment and variant calling. These are placed in the supporting_files directory; an example is shown below.

mmp3.nc_000011.10.reference.dict
mmp3.nc_000011.10.reference.fasta
mmp3.nc_000011.10.reference.fasta.amb
mmp3.nc_000011.10.reference.fasta.ann
mmp3.nc_000011.10.reference.fasta.bwt
mmp3.nc_000011.10.reference.fasta.fai
mmp3.nc_000011.10.reference.fasta.pac
mmp3.nc_000011.10.reference.fasta.sa

These files were generated from the MMP3 reference file mmp3.nc_000011.10.reference.fasta.
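The README does not say which commands produced these files. The extensions match what `bwa index`, `samtools faidx`, and GATK's `CreateSequenceDictionary` write, so the sketch below assumes those three tools. The `REF` and `DRY_RUN` variables and the `run` helper are illustrative only; by default the commands are printed rather than executed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed location of the reference; adjust to your own layout.
REF="${REF:-supporting_files/mmp3.nc_000011.10.reference.fasta}"

# Assumption for this sketch: DRY_RUN=1 prints each command,
# DRY_RUN=0 actually runs it (the tools must then be installed).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "${DRY_RUN}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run bwa index "${REF}"                         # .amb .ann .bwt .pac .sa
run samtools faidx "${REF}"                    # .fai
run gatk CreateSequenceDictionary -R "${REF}"  # .dict next to the FASTA
```

Set `DRY_RUN=0` once the printed commands look right for your reference file.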

The Script

Review the bash script and edit the user-specific settings (e.g. the path names assigned to its variables) before running it.

To run the script, navigate to its path and run the bash script:

cd <path/to/directory/containing/gatk_pipeline.sh>
bash ./gatk_pipeline.sh
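Before launching the pipeline, it may be worth checking that the command-line tools it depends on are on `PATH`. This is a sketch only: the `require_tools` helper is hypothetical, and the tool list (`gatk`, `bwa`, `samtools`) is an assumption that should be adjusted to whatever `gatk_pipeline.sh` actually calls.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return 0 if every named command is on PATH; otherwise report the
# missing ones on stderr and return 1.
require_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "${tool}" >/dev/null 2>&1; then
            echo "missing tool: ${tool}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Assumed tool list; edit to match the pipeline script.
if require_tools gatk bwa samtools; then
    echo "all tools found"
else
    echo "install the missing tools before running gatk_pipeline.sh" >&2
fi
```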

Aim II

About SnpEff & SnpSift
