VNTR-eQTL

Scripts for analyzing VNTR-eQTLs in GTEx cohort (link to the paper) and genotyping performance comparisons

Inputs of the pipeline

VNTR genotypes of the entire GTEx cohort as outputted by adVNTR in the default format (one text file for each individual) using WGS data aligned to GRCh38 (available from dbGaP phs000424.v7.p2).
Phenotype file for GTEx cohort (ethnicity and sex are used). accession: pht002742.v7.p2 and exact version used in paper: phs000424.v7.pht002742.v7.p2.c1.GTEx_Subject_Phenotypes.GRU.txt
SNP genotypes file for GTEx cohort (phg001219.v1.GTEx_v8_WGS.genotype-calls-vcf.c1). This file is used to compare effect of VNTR variants and SNPs, and identifying population structure.
RNA-expression data for GTEx cohort in different tissues as a table (phe000020.v1.GTEx_RNAseq.expression-data-matrixfmt.c1).
SRA run table for GTEx expreiments downloaded from dbGaP. (used to identify corresponding tissue for each expreiment id)

Output of the pipeline

Linear regression results showing association of the length of each VNTR with the expression level of its nearest gene in each of the 46 tissues. The pipeline will create a directory called regression_results for the output. It contains 46 subdirectories (one for each tissue) and each of them have a tab separated file for each VNTR showing linear test for the VNTR in the corresponding tissue.

regression_results
└───Whole-Blood
│   │   100374.txt
│   │   100436.txt
│   │   ...
│   
└───Lung
│   │   100374.txt
│   │   100436.txt
│   │   ...
│   
└───etc

Each text file contains 6 tab separated values:

Genename  Chromosome  VNTR Coordinate Start   Effect Size (B)   P-value   Standard Error (Bse)

For example, regression_results/Whole-Blood/423956.txt (result of the association test for VNTR 423956 in Whole Blood) will have the following content:

POMC    chr2    25161573        0.21217799507799368     1.1501572764235368e-05  0.047746755235986996

Requirements

adVNTR: this can be installed with conda config --add channels bioconda; conda install advntr
eigensoft: this can be installed with apt install eigensoft on linux or conda install -c bioconda/label/cf201901 eigensoft using conda package manager on supported systems.
PEER: to correct for non genetic factors affecting gene expression level.
Python libraries: statsmodels, pandas, numpy
These can be installed with pip install pandas numpy statsmodels using pip or with conda install -c conda-forge statsmodels using conda.
TRTools: for filtering target loci
CAVIAR: for fine-mapping the causal variant.

How to run

Preprocessing

From sra run table, we first extract the entries related to RNA-seq experiments:

zcat SraRunTable.txt.gz | grep "RNA-Seq" >  Sra_table_RNA-Seq_only

Using Sra_table_RNA-Seq_only, we then break up the rpkm expression matrix (phe000020.v1.GTEx_RNAseq.expression-data-matrixfmt.c1/GTEx_Data_20160115_v7_RNAseq_RNASeQCv1.1.8_gene_rpkm.gct) for different tissues:

python extract_expression_by_tissue.py Sra_table_RNA-Seq_only rpkm_file.gct Expression_by_Subtissue/

Expression_by_Subtissue will contain Whole Blood.rpkm, Brain - Cortex.rpkm, etc. This command usually takes an hour to complete.

To convert microarray genotypes to plink format and keep common variants to infer population structure, run following:

vcftools --gzvcf GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_866Indiv.vcf.gz --recode --maf 0.05 --remove-filtered-all --out filtered_snps
plink --vcf filtered_snps.vcf --biallelic-only --maf 0.05 --recode --out plink_866Ind_filtered_biallelic_snps_0.05maf

Since the vcf files are large, this step may take up to one day depending on the system.

Finding population structure

set GTEXDIR to parent directory of GTEx dataset in the principal_component_identification.sh scripts and run PCA and store results in PCA_results directory:

./principal_component_identification.sh PCA_results

Computing PEER factors

Run PEER factor identification scripts for each tissue (Expression_by_Subtissue directory was specified in preprocessing-part2).

python peer_factor_identification.py Expression_by_Subtissue PEER_results

This step may take 1-2 days to finish.

Running association test

This step will take less than a day to generate the regression_results

python run_regression.py VNTR_genotypes/ Expression_by_Subtissue/

VNTR_genotypes should contain 652 files (one per individual) as outputted by adVNTR for each sample.

Identifying significance threshold (5% FDR)

Run following script to read regression_results and identify tissue-specific significance thresholds.

compute_significance_cutoff.py

Name		Name	Last commit message	Last commit date
Latest commit History 95 Commits
performance_comparison		performance_comparison
.gitignore		.gitignore
README.md		README.md
compute_significance_cutoff.py		compute_significance_cutoff.py
eVNTR_plots.py		eVNTR_plots.py
extract_expression_by_tissue.py		extract_expression_by_tissue.py
gene_locations.py		gene_locations.py
generate_ind_file.py		generate_ind_file.py
load_confounders.py		load_confounders.py
peer_factor_identification.py		peer_factor_identification.py
principal_component_identification.sh		principal_component_identification.sh
requirement-linux.txt		requirement-linux.txt
run_regression.py		run_regression.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VNTR-eQTL

Inputs of the pipeline

Output of the pipeline

Requirements

How to run

Preprocessing

Finding population structure

Computing PEER factors

Running association test

Identifying significance threshold (5% FDR)

About

Releases 1

Packages

Contributors 2

Languages

mehrdadbakhtiari/VNTR-eQTL

Folders and files

Latest commit

History

Repository files navigation

VNTR-eQTL

Inputs of the pipeline

Output of the pipeline

Requirements

How to run

Preprocessing

Finding population structure

Computing PEER factors

Running association test

Identifying significance threshold (5% FDR)

About

Resources

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Languages

Packages