Association studies, for instance GWAS, traditionally use genotyping arrays to genotype large sets of individuals. They identify SNPs that are significantly overrepresented in cases compared with controls and are therefore associated with disease. Genotyping arrays are cheaper than sequencing but can only measure a subset of tag SNPs from the >300 million SNPs that are available. To increase the number of SNPs that can be used in association studies, genotype imputation is performed. Genotype imputation refers to the statistical inference of unobserved genotypes.
Some regions of the human genome, such as the MHC (known in humans as HLA), are highly variable and may be difficult to impute. The HLA region has been associated with autoimmune diseases such as rheumatoid arthritis and infectious diseases such as HIV/AIDS. Accurate imputation of this region is key, as it would increase the chances of identifying the causal variants of some autoimmune and immune-mediated diseases.
Genotype imputation is a statistical process and thus needs to be assessed to ensure that the predicted genotypes are accurate.
The project focused on assessing the accuracy of imputing HLA Class I alleles in selected African populations.
Imputation accuracy was assessed using the SNP2HLA and HIBAG imputation tools; the 1kg-All, 1kg-Gwd, 1kg-Afr, H3Africa and prebuilt EUR reference panels; and the Illumina Omni 2.5 and H3Africa genotyping arrays.
- Nextflow (the pipeline runs using Nextflow 21.10.6)
- Docker
- Singularity
N/B You do not need to install any other tool, as the singularity profile will download the Singularity image from https://quay.io/nanjalaruth/impute-hla
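Before launching the pipeline, it can help to confirm that the prerequisites listed above are on your PATH. The snippet below is a minimal sketch, not part of the pipeline: the helper name missing_tools is hypothetical, and it assumes you are using Singularity as the container engine (replace singularity with docker if that is what you use).

```shell
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not something shipped with the pipeline.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check for Nextflow plus one container engine (assumed here to be Singularity).
missing=$(missing_tools nextflow singularity)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All required tools found"
fi
```

If Nextflow is present, running nextflow -version shows which release is installed, so you can compare it with the version noted above.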
The input file must be a VCF file. As the work focuses on the HLA region, you are required to use only SNPs in chr6:29-34Mb, so you can prepare an input file that contains only the SNPs in that region. SNP2HLA uses hg18 data while HIBAG uses hg19 data. You are therefore required to provide an input file in each of the two builds.
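The region restriction described above can be sketched with standard command-line tools. This is only an illustration under stated assumptions: the function name filter_hla_region and the file names are hypothetical, the VCF is assumed to be uncompressed, the chromosome is assumed to be labelled either 6 or chr6, and the 29-34 Mb window is applied as the same numeric range to whichever build the file is in.

```shell
# Keep the VCF header plus records on chromosome 6 between 29 Mb and 34 Mb.
# filter_hla_region is a hypothetical helper; it reads the named file,
# or standard input when no file is given.
filter_hla_region() {
    awk -F'\t' '
        # Header and metadata lines start with "#": always keep them.
        /^#/ { print; next }
        # Assumption: the contig is named "6" or "chr6".
        ($1 == "6" || $1 == "chr6") && $2 >= 29000000 && $2 <= 34000000 { print }
    ' "$@"
}

# Hypothetical usage, once per genome build:
# filter_hla_region study_hg18.vcf > study_hg18.hla.vcf
# filter_hla_region study_hg19.vcf > study_hg19.hla.vcf
# For a gzip-compressed file: gzip -dc study_hg19.vcf.gz | filter_hla_region > study_hg19.hla.vcf
```

A dedicated VCF tool is likely a better choice for large or indexed files; the plain awk version is used here only because it has no extra dependencies.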
The workflow runs the data against a prebuilt European reference panel and against reference panels that are built while the pipeline runs. Building a custom reference panel requires an HLA type file and a SNP genotype file, so the pipeline needs both as input.
I focused on 4 custom-made reference panels: 1kg-All, 1kg-Gwd, 1kg-Afr and H3Africa. The 1kg-Gwd and 1kg-Afr panels are subsets of 1kg-All. If you have a similar dataset, assign the path to the file with the subpopulation rsids to the flag named subpop_ids. This bypasses the need to obtain a VCF file that matches the subsetted population.
If you are working with populations that are not linked in this way (that is, not subsets of one dataset), provide the paths to the SNP genotype VCF files as demonstrated by the genotype_files flag in the test.config file. If you have multiple files, you can assign them to the same flag by adding more lists, as is done for the sample dataset. Also provide the path to the HLA type file as shown by the hlatype_files flag in the test.config file.
N/B HLA typing was done using the OptiType tool.
There are 2 ways to run the pipeline:
- Download the pipeline from GitHub
git clone git@github.com:nanjalaruth/MHC-Imputation-Accuracy.git
- Move to that folder
cd MHC-Imputation-Accuracy
- Edit conf/test.config so that its paths point to where your datasets are stored.
- Run the command below
nextflow run main.nf -c conf/test.config -profile singularity
N/B If you are using a job scheduler, be sure to include it in the profile. Profile names are separated by a comma with no space. For example:
nextflow run main.nf -c conf/test.config -profile singularity,slurm
Alternatively, Nextflow will automatically fetch the pipeline from GitHub, so you don't need to download or install it yourself.
- Create your own config file locally so that its paths point to your datasets. Use conf/test.config as a template.
- Run the command below
nextflow run nanjalaruth/MHC-Imputation-Accuracy -c <path to your config file> -profile singularity