Skip to content

Implementation codes for "A flexible Bayesian regression approach for accurate polygenic risk prediction"

License

Notifications You must be signed in to change notification settings

shuangsong0110/NeuPred

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NeuPred

NeuPred is an R implementation of a unified Bayesian framework for polygenic risk scores construction.

Song, S., Hou, L., & Liu, J. A flexible Bayesian regression approach for accurate polygenic risk prediction. Bioinformatics, 2022.

2021.11.22 Update:

  • add option scale/no scale for calculating scores with standardized/no standardized (0,1,2) genotypes.

  • add a data-adaptive estimation on global shrinkage parameter

Table of contents

Getting Started

NeuPred is an R package which can be installed using the command:

devtools::install_github('shuangsong0110/NeuPred')

Download the LD reference panel:

  • Download the 1000 Genome Project reference panel:

wget https://zenodo.org/record/7768714/files/1000G_Phase3_plinkfiles.tgz?download=1

tar -xvzf 1000G_Phase3_plinkfiles.tgz

  • Users could also specify their own LD reference files with plink bfile format (.bim, .fam, .bed).
  • Note that the LD reference plink bfile data should either be a single file (merge 22 chromosomes with plink), or by each chromosome with the names ended by the number of chromosomes, e.g., 1000G.EUR.QC.1.bim (,.fam, .bed), ..., 1000G.EUR.QC.22.bim (,.fam, .bed).

Other dependencies

  • Install plink (could be ignored if is already installed)
  • We have attached ldetect files for European, African, and Asian ancestries in our software. (Berisa T, Pickrell J K. Approximately independent linkage disequilibrium blocks in human populations[J]. Bioinformatics, 2016, 32(2): 283.)

Using NeuPred / NeuPred-I

NeuPred (GWAS summary statistics based PRS approach)

library(NeuPred)
NeuPred.run(summs=SUMMARY_STATISTICS (required),
	    LDpath=PATH_TO_LD_REFERENCE,
	    LDpath_chr=PATH_TO_LD_REFERENCE_FOR_EACH_CHROMOSOME (either LDpath or LDpath_chr is required),
	    n=GWAS_SAMPLE_SIZE (required),
	    external.ld=EXTERNAL_LD_REFERENCE (T / F , required),
	    ethnic=ETHNIC ('EUR'/'AFR'/'ASN', optional, default='EUR'),
	    plinkpath=PATH_TO_PLINK (required),
	    path=OUTPUT_DIR (required),
            [prior=PRIOR_TO_BE_USED ('auto' / 'L' / 'C' / 'HS', optional, default='auto'),
            testpath=PATH_TO_TEST_DATA (optional),
  	    MCMC=MCMC_ITERATIONS (optional),
   	    BURN=BURNING_TIMES (optional),
    	    parallel=T / F (optional, default=T),
     	    cores=NUMBER_OF_RUNNING_CORES (optional),
     	    chr=CHROM (optionall, default=1:22),
	    plot=ROC_PLOT (T / F, optional),
	    tmp=TEMP_FILES (T / F, optional)])                    
  • SUMMARY_STATISTICS (required): Prepare the summary statistics in the following format (including the header line):
   CHR        SNP      A1    A2      BETA        P
    1      rs4040617    G     A      0.013     8.42e-01
    1      rs4075116    C     T     -0.308     5.18e-01
    1      rs9442385    T     G      0.001     9.87e-01
    ...

Or:

   CHR        SNP      A1    A2         OR        P
    1      rs4040617    G     A      1.013     8.42e-01
    1      rs4075116    C     T      0.970     5.18e-01
    1      rs9442385    T     G      1.001     9.87e-01
    ...

where SNP is the rs ID, A1 is the effect allele, A2 is the alternative allele, BETA/OR is the effect/odds ratio of the A1 allele, P is the p-value of the effect. In fact, BETA/OR is used to determine the direction of an association.

  • PATH_TO_LD_REFERENCE (required): Either LDpath or LDpath_chr is required. If the data are prepared as a single file, the LDpath should include the file name (without .bim/.fam/.bed) and the LDpath_chr is not required; if the data are split into chromosomes, the LDpath_chr should include the file name but not the exact number of chromosome (e.g., LDpath_chr='path/1000G.EUR.QC.').

  • GWAS_SAMPLE_SIZE (required): Sample size of the GWAS.

  • EXTERNAL_LD_REFERENCE (required): T / F, which indicated whether the LD reference is from an external LD panel (T) or from in-sample LD (F).

  • ETHNIC (optional): 'EUR' / 'AFR' / 'ASN', EUR: European ancestry; AFR: African ancestry; ASN: Asian ancestry.

  • PATH_TO_PLINK (required): Full path to the plink software.

  • OUTPUT_DIR (required): Full path for output files.

  • PRIOR_TO_BE_USED (optional): chooose from 'auto' / 'L' / 'C' / 'HS'. If not specified, when external.ld=T, the prior would be set to 'L'; when external.ld=F, meaning that the LD is accurately estimated, the prior would be set to 'auto', and the algorithm will use a CV strategy to automatically select the best-performing prior for each chromosome using only training data (details are provided in the paper).

  • PATH_TO_TEST_DATA (optional): Full path to the test data (plink bfile format). If the test data is provided, the algorithm will calculate the overlapped SNPs between test data and training data, and then present the predictive r^2 and AUC (for binary traits), and give a ROC plot. If the test data is not provided, the algorithm will derive the posterior effect sizes for all SNPs in training dataset.

  • MCMC_ITERATIONS (optional): The number of MCMC iteration times, default is 1e4.

  • BURNING_TIMES (optional): The number of MCMC burning times, default is 2000.

  • PARALLEL (optional): T / F. To decide whether to run the algorithm in parallel to speed up, default=T. Note that parallel=T will improve the computational effiency, but pay attention to the limits of memory.

  • NUMBER_OF_CORES (optional): Number of cores for running MCMC when parallel=T, the optional is 5.

  • CHROM (optional): The chromosome(s) on which the model is fitted, e.g., chr=c(1, 3, 5). The default is chr=1:22.

  • PLOT (optional): T / F. If plot=T and the outcome is binary, the software will provide an ROC plot.

  • TEMP_FILES (optional): T / F. If tmp=T, the temp files will be kept, including the LD blocks, test files, etc. If tmp=F, the temp files will be deleted after the results have been generated.

  • SCALE (optional): T / F. If the effect sizes are used for standardized genotypes, please set scale=T; if the effect sizes are used for raw genotypes (0,1,2), such as calculating scores with PLINK, please set scale=F.

NeuPred-I (Individual-level data based PRS method)

The individual-level data version, NeuPred-I can be run using the following command:

library(NeuPred)
NeuPred.I.run(trainpath=PATH_TO_TRAINING_DATA (required),
              plinkpath=PATH_TO_PLINK (required),
	      path=OUTPUT_DIR (required),
	      [prior=PRIOR_TO_BE_USED ('auto' / 'L' / 'C' / 'HS', optional),
              testpath=PATH_TO_TEST_DATA (optional),
	      MCMC=MCMC_ITERATIONS (optional),
   	      BURN=BURNING_TIMES (optional),
    	      parallel=T / F (optional),
     	      cores=NUMBER_OF_RUNNING_CORES (optional),
     	      chr=CHROM (optional),
	      plot=ROC_PLOT (T / F, optional),
	      tmp=TEMP_FILES (T / F, optional) ]),
	      scale=SCALE (T / F, optional) ])
             
  • PATH_TO_TRAINING_DATA (required): Individual-level plink bfile as the training dataset.

  • PATH_TO_PLINK (required): Full path to the plink software.

  • OUTPUT_DIR (required): Full path for output files.

  • PRIOR_TO_BE_USED (optional): chooose from 'auto' / 'L' / 'C' / 'HS'. If not specified, when external.ld=T, the prior would be set to 'L'; when external.ld=F, meaning that the LD is accurately estimated, the prior would be set to 'auto', and the algorithm will use a CV strategy to automatically select the best-performing prior for each chromosome using only training data (details are provided in the paper).

  • PATH_TO_TEST_DATA (optional): Full path to the test data (plink bfile format). If the test data is provided, the algorithm will calculate the overlapped SNPs between test data and training data, and then present the predictive r^2 and AUC (for binary traits), and give a ROC plot. If the test data is not provided, the algorithm will derive the posterior effect sizes for all SNPs in training dataset.

  • MCMC_ITERATIONS (optional): The number of MCMC iteration times, default is 1e4.

  • BURNING_TIMES (optional): The number of MCMC burning times, default is 2000.

  • PARALLEL (optional): T / F. To decide whether to run the algorithm in parallel to speed up, default=T.

  • NUMBER_OF_CORES (optional): Number of cores for running MCMC when parallel=T, the optional is 5.

  • CHROM (optional): The chromosome(s) on which the model is fitted, e.g., chr=c(1, 3, 5). The default is chr=1:22.

  • PLOT (optional): T / F. If plot=T and the outcome is binary, the software will provide an ROC plot.

  • TEMP_FILES (optional): T / F. If tmp=T, the temp files will be kept, including the LD blocks, test files, etc. If tmp=F, the temp files will be deleted after the results have been generated.

  • SCALE (optional): T / F. If the effect sizes are used for standardized genotypes, please set scale=T; if the effect sizes are used for raw genotypes (0,1,2), such as calculating scores with PLINK, please set scale=F.

Output

An R list containing:

$res: predictive r2 and the AUC (for qualitative traits)

$S: derived polygenic risk scores

$select.prior: the selected prior

The effect sizes for each variant are saved as files in folder ./posterior

A toy example

Clone this repository using the following git command:

$ git clone https://github.com/shuangsong0110/NeuPred.git

$ cd ./NeuPred

Run with R:

library(NeuPred)
path0 <- getwd()
plinkpath <- 'PATH_TO_PLINK_SOFTWARE'
path <- paste0(path0, '/example/')
dir.create(path, recursive=T)
LDpath <- paste0(path0, '/test_dat/test')
summs <- fread(paste0(path0, '/test_dat/train.txt'))
testpath <- paste0(path0, '/test_dat/test')
n <- 3000
res <- NeuPred.run(summs=summs, LDpath=LDpath, n=n, prior='L', external.ld = F, ethnic='EUR',
                    plinkpath=plinkpath, path=path, testpath=testpath,
                    chr=22, parallel=T, cores=5, MCMC=1e4, BURN=2e3, plot=T)

Maintainer

Please contact Shuang Song ([email protected]) if there are any problems or questions.

About

Implementation codes for "A flexible Bayesian regression approach for accurate polygenic risk prediction"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages