-: from , If user provided the --stat information, PRSice will now error out instead of trying to look for BETA or OR in the file. update log for previous release can be found here Caution We have now fixed window problem. But was unable to access the computer that is used for compilation due to COVID. Will try to compile it when we regain access. Caution PRSet are currently under open beta - results output are reliable but please report any specific problems to our google group (see Support below)3 R Packages Requirements To plot graphs, PRSice requires R ( version 3.2.3+ ) installed. Additional steps might be required for Mac and Windows users. Installing required R packages PRSice can automatically download all required packages, even without administrative right. You can specify the install directory using --dir . For example Rscript PRSice.R --dir . will install all required packages under the local directory. Quick Start For Quick start use, please refer to Quick Start List user options You can also type ./PRSice to view all available parameters unrelated to plotting, or Rscript PRSice.R -h to view all available parameters, including those used for plotting Output of Results You can see the expected output of PRSice here Detailed Guide You can find a more detailed document explaining the input and output of PRSice in this page Full command line options You can find all command line options of PRSice under the section Details of PRSice/PRSet Citation If you use PRSice, then please cite: Citation Choi SW, and O\u2019Reilly PF. \"PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.\" GigaScience 8, no. 7 (July 1, 2019). https://doi.org/10.1093/gigascience/giz082 . Support This wiki should contain all the basic instruction for the use of PRSice. Shall you have any problems, please feel free to start an issue here or visit our google group . You can help us to speed up the debug process by including the log file generated by PRSice. 
In addition, you can use the search bar in this webpage to search for specific functions. Authors For more details on the authors, see: Dr Shing Wan Choi Dr Paul O'Reilly PRSice-2 and all new functionalities are coded by: Dr Shing Wan Choi Acknowledgement PRSice is a software package written in C++ (main) and R (plotting). The code relies partially on those written in PLINK by Christopher Chang . Management of BGEN file is based on BGEN lib written by Gavin Band . We also utilize the Eigen C++ library, the gzstream library.","title":"Home"},{"location":"#executable-downloads","text":"Operating System Link Linux 64-bit v2.3.5 OS X 64-bit v2.3.5 Windows 32-bit Not available Windows 64-bit v2.3.5 Latest Update","title":"Executable downloads"},{"location":"#2021-09-20-v235","text":"This is a temporary fix for known issues. We are currently re-building PRSice with best practices and try to make sure robustness and extensibility with unit tests. Some fixes are Fixed --perm Fixed --prevalence Reduced memory usage for bgen in multi-threaded mode Increased speed of file generation when there's enough memory Some chr-id fix, though it is still rather buggy","title":"2021-09-20 (v2.3.5)"},{"location":"#2020-08-05-v233","text":"Thanks to report from @charlisech, we were able to pinpoint a bug related to sample selection when using bgen data.","title":"2020-08-05 (v2.3.3)"},{"location":"#2020-05-18-v230","text":"We now support multi-threaded clumping (separated by chromosome) Genotypes will be stored to memory during clumping (increase memory usage, significantly speed up clumping) Will only generate one .prsice file for all phenotypes .prsice file now has additional column call \"Pheno\" Introduced --chr-id which generate rs id based on user provided formula (see detail for more info) Format of --base-maf and --base-info are now changed to : from , If user provided the --stat information, PRSice will now error out instead of trying to look for BETA or OR in the file. 
Update log for previous releases can be found here Caution We have now fixed the Windows problem, but were unable to access the computer that is used for compilation due to COVID. We will try to compile it when we regain access. Caution PRSet is currently under open beta - results output are reliable but please report any specific problems to our google group (see Support below)","title":"2020-05-18 (v2.3.0)"},{"location":"#r-packages-requirements","text":"To plot graphs, PRSice requires R ( version 3.2.3+ ) to be installed. Additional steps might be required for Mac and Windows users. Installing required R packages PRSice can automatically download all required packages, even without administrative rights. You can specify the install directory using --dir . For example, Rscript PRSice.R --dir . will install all required packages under the local directory.","title":"R Packages Requirements"},{"location":"#quick-start","text":"For quick start use, please refer to Quick Start List user options You can also type ./PRSice to view all available parameters unrelated to plotting, or Rscript PRSice.R -h to view all available parameters, including those used for plotting","title":"Quick Start"},{"location":"#output-of-results","text":"You can see the expected output of PRSice here","title":"Output of Results"},{"location":"#detailed-guide","text":"You can find a more detailed document explaining the input and output of PRSice in this page","title":"Detailed Guide"},{"location":"#full-command-line-options","text":"You can find all command line options of PRSice under the section Details of PRSice/PRSet","title":"Full command line options"},{"location":"#citation","text":"If you use PRSice, then please cite: Citation Choi SW, and O\u2019Reilly PF. \"PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.\" GigaScience 8, no. 7 (July 1, 2019). 
https://doi.org/10.1093/gigascience/giz082 .","title":"Citation"},{"location":"#support","text":"This wiki should contain all the basic instructions for the use of PRSice. Should you have any problems, please feel free to start an issue here or visit our google group . You can help us speed up the debugging process by including the log file generated by PRSice. In addition, you can use the search bar in this webpage to search for specific functions.","title":"Support"},{"location":"#authors","text":"For more details on the authors, see: Dr Shing Wan Choi Dr Paul O'Reilly PRSice-2 and all new functionalities are coded by: Dr Shing Wan Choi","title":"Authors"},{"location":"#acknowledgement","text":"PRSice is a software package written in C++ (main) and R (plotting). The code relies partially on code written in PLINK by Christopher Chang . Management of BGEN files is based on the BGEN library written by Gavin Band . We also utilize the Eigen C++ library and the gzstream library.","title":"Acknowledgement"},{"location":"archive/","text":"You can download previous versions of PRSice here Warning We no longer support PRSice-1.x. Please use PRSice-2 unless you need specific features in PRSice-1 that aren't implemented in PRSice-2. Bugs and errors within PRSice-1.x will not be fixed nor will the script be updated. 
Version Software Manual Vignette 1.25 download download download 1.23 download download download 1.22 download download download 1.21 download download download 1.2 download download download Citation If you use PRSice-1, then please cite: Citation PRSice: Polygenic Risk Score software, Euesden, Lewis, O'Reilly, Bioinformatics (2015) 31 (9):1466-1468 Authors Authors of PRSice-1 are as follow: Dr Jack Euesden Professor Cathryn Lewis Dr Paul O'Reilly","title":"Archive"},{"location":"archive/#citation","text":"If you use PRSice-1, then please cite: Citation PRSice: Polygenic Risk Score software, Euesden, Lewis, O'Reilly, Bioinformatics (2015) 31 (9):1466-1468","title":"Citation"},{"location":"archive/#authors","text":"Authors of PRSice-1 are as follow: Dr Jack Euesden Professor Cathryn Lewis Dr Paul O'Reilly","title":"Authors"},{"location":"command_detail/","text":"Available Commands This page contains all command available in PRSice. Tips When constructing new parameters, we follow the following rule: if the command has effect on any file that is not the target, it will have a prefix of the file name. For example, --base-info applies INFO score filtering on the base file, --ld-info perform INFO score filtering on the LD reference file and --info applies the INFO score filtering on the target file. Base File --a1 Column header containing the effective allele . There isn't any standardized label for the effective allele, therefore extra care must be taken to ensure the correct label is provided, otherwise, the effect will be flipped. --a2 Column header containing non-effective allele . --base | -b Base (i.e. GWAS) association file. This is a whitespace delimited file containing association results for SNPs on the base phenotype. This file can be gzipped (must have the .gz suffix). For PRSice to run, the base file must contain the effective allele ( --A1 ), effect size estimates ( --stat ), p-value for association ( --pvalue ), and the SNP ID ( --snp ). 
--beta This flag is used to indicate if the test statistic is in the form of BETA. If set, PRSice assume the statistic is in the form BETA. Mutually exclusive from --or --bp Column header containing the coordinate of SNPs. When provided, the coordinate of the SNPs will be scrutinized between the base and target file. SNPs with mismatched coordinate will be excluded. --chr Column header containing the chromosome information. When provided, the chromosome information of the SNPs will be scrutinized between the base and target file. SNPs with mismatched chromosome information will be automatically excluded. --index If set, assume the base columns are INDEX instead of the name of the corresponding columns. Index should be 0-based (start counting from 0) --base-info Base INFO score filtering. Format should be : . SNPs with info score less than will be ignored. It is useful to perform INFO score filtering to remove SNPs with low imputation confidence score. By default, PRSice will search for the INFO column in your base file and perform info score filtering with threshold of 0.9 . You can disable this behaviour by using --no-default --base-maf Base minor allele frequency (MAF) filtering. Format should be ,: . SNPs with MAF less than will be ignored. Additional column can be provided (e.g. different filtering threshold for case and control), using the following format: :,: --no-default Remove all default options. If set, PRSice will not set any defaults. --or This flag is used to indicate if the test statistic is in the form of odd ratios. If set, PRSice assume the statistic is in the form OR. Mutually exclusive from --beta --pvalue | -p Column header containing the p-value. The p-value information must be provided --snp Column header containing the SNP ID. This is required to allow SNP matching between the base and target file. 
Note While it is possible to implement a feature to allow SNP matching purely based on the chromosome number and coordinate of a variant, the possibility of flipping and multi-allelic input complicates the matter. Therefore this feature will not be implemented until an elegant solution can be provided. --stat Column header containing the summary statistic. If --beta is set, default to BETA ; likewise, if --or is set, default to OR . Otherwise, PRSice will try to search for OR or BETA in the header of the base file. If both OR and BETA are present in the header, PRSice will terminate. Target File --binary-target Indicate whether the target phenotype is binary or not. Either T or F should be provided, where T represents a binary phenotype. For multiple phenotypes, the input should be separated by commas without spaces. Default: F if --beta is set and T if --or is set --geno Filter SNPs based on genotype missingness. Must be a value between 0.0 and 1.0 . --info Filter SNPs based on info score. Only used for imputed target data. The INFO score is calculated as the MaCH imputation r-squared value, represented by the following pseudo code: m=Mean of expected genotype v=variance of expected genotype p=m/2 p_a = 2p(1-p) INFO = v/p_a --keep File containing the sample(s) to be extracted from the target file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID. Mutually exclusive with --remove --maf Filter SNPs based on minor allele frequency (MAF). MAF is calculated using only the founder samples Note When performing MAF filtering on dosage data, the MAF is calculated using the hard-coded genotype --nonfounders By default, PRSice will exclude all non-founders from the analysis. When this flag is set, non-founders will be included in the regression model but will still be excluded from LD estimation. --pheno | -f Tab or space delimited phenotype file containing the phenotype(s). 
First column must be FID of the samples and the second column must be IID of the samples. When --ignore-fid is set, first column must be the IID of the samples. Must contain a header if --pheno-col is specified --pheno-col Headers of phenotypes to be included from the phenotype file. When multiple phenotypes are provided, the phenotype name will be used as part of the file output prefix --prevalence | -k Prevalence of all binary trait. If provided, PRSice will adjust the ascertainment bias of the R2. Note When multiple binary trait is found, prevalence information must be provided for all of them (Either adjust all binary traits, or don't adjust at all). For example, if there are 3 traits A, B and C, where A and C are binary traits with population prevalence of 0.1 and 0.2 respectively. The correct input should be --binary-target T,F,T --prevalence 0.1,0.2 --remove File containing the sample(s) to be removed from the target file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID Mutually exclusive from --keep --target | -t Target genotype file. Currently support both BGEN and binary PLINK format. For multiple chromosome input, simply substitute the chromosome number with #. PRSice will automatically replace # with 1-22. A separate fam/sample file can be specified by --target , --target-list File containing prefix of target genotype files. Similar to --target but allow for more flexibility. A separate fam/sample file can be specified by --target-list , --type File type of the target file. Support bed (binary plink) and bgen format. Default: bed Dosage --allow-inter Allow the generate of intermediate file. This will speed up PRSice when using dosage data as clumping reference and for hard coding PRS calculation --dose-thres Translate any SNPs with highest genotype probability less than this threshold to missing call. 
For example, with --dose-thres 0.9 , sample with genotype probability of \\(P(0/0)=0.2\\) , \\(P(0/1)=0.52\\) , \\(P(1/1)=0.28\\) will be set to missing --hard-thres A hardcall is saved when the distance to the nearest hardcall is less than the hardcall threshold. Otherwise a missing code is saved. Default is: 0.1 The distance ( \\(D\\) ) to the nearest hardcall is calculated as: \\[ P(Ref) = 2 \\times P(HomRef) + P(Het) \\\\ P(Alt) = 2 \\times P(HomAlt) + P(Het) \\\\ D = 0.5 \\times \\left(|P\\left(Ref\\right)- round\\left(P\\left(Ref\\right)\\right)| + |P\\left(Alt\\right)- round\\left(P\\left(Alt\\right)\\right)|\\right) \\] Note If dosage data is used as a LD reference, it will always be hard coded to calculate the LD Default: 0.9 --hard When set, will use hard thresholding instead of dosage for PRS construction. Default is to use dosage. Clumping --clump-kb The distance for clumping in kb. For example, if --clump-kb 250 is provided, PRSice will clump any SNPs that is within 250kb to both end of the index SNP (therefore a 500kb window with the index SNP at the center). Now also support distance with a unit. e.g. --clump-kb 1M is a valid input. Default: 250kb for PRSice, 1mb for PRSet --clump-r2 The r 2 threshold for clumping. Default: 0.1 --clump-p The p-value threshold use for clumping. Default: 1. --ld | -L LD reference file. Use for estimation of LD during clumping. If not provided, will use the post-filtered target genotype for LD calculation. Support multiple chromosome input. Please see --target for more information. When the target sample is small (e.g. < 500) and external panel of the same population is available (e.g. 1000 genome), an external reference panel might be used to improve the LD estimation for clumping. --ld-dose-thres Translate any SNPs with highest genotype probability less than this threshold to missing call. 
For example, with --ld-dose-thres 0.9 , sample with genotype probability of \\(P(0/0)=0.2\\) , \\(P(0/1)=0.52\\) , \\(P(1/1)=0.28\\) will be set to missing --ld-geno Filter SNPs based on genotype missingness. Must be a value between 0.0 and 1.0 . --ld-info Filter SNPs based on info score. Only used for imputed LD reference. The INFO score is calculated as the MaCH imputation r-squared value, represented by the following pseudo code m=Mean of expected genotype v=variance of expected genotype p=m/2 p_a = 2p(1-p) INFO = v/p_a --ld-hard-thres A hardcall is saved when the distance to the nearest hardcall is less than the hardcall threshold. Otherwise a missing code is saved. Default is: 0.1 The distance ( \\(D\\) ) to the nearest hardcall is calculated as: \\[ P(Ref) = 2 \\times P(HomRef) + P(Het) \\\\ P(Alt) = 2 \\times P(HomAlt) + P(Het) \\\\ D = 0.5 \\times \\left(|P\\left(Ref\\right)- round\\left(P\\left(Ref\\right)\\right)| + |P\\left(Alt\\right)- round\\left(P\\left(Alt\\right)\\right)|\\right) \\] Note If dosage data is used as a LD reference, it will always be hard coded to calculate the LD Default: 0.9 --ld-keep File containing the sample(s) to be extracted from the LD reference file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID. Mutually exclusive from --ld-remove . No effect if --ld was not provided --ld-list File containing prefix of multiple LD reference files. Similar to --ld but allow more flexibility. A separate fam/sample file can be specified by --ld-list , --ld-maf Filter SNPs based on minor allele frequency (MAF) Note When perform MAF filtering on dosage data, MAF is calculated using the hard-coded genotype --ld-remove File containing the sample(s) to be removed from the LD reference file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID. Mutually exclusive from --ld-keep --ld-type File type of the LD file. 
Supports bed (binary plink) and bgen format. Default: bed --no-clump When set, PRSice will not perform clumping. This is useful when a pre-clumped list of SNPs is available. --proxy Proxy threshold for index SNP to be considered as part of the region represented by the clumped SNP(s). e.g. --proxy 0.8 means the index SNP will represent the region of any clumped SNP(s) that has \\(r^2\\)=0.8 even if the index SNP is not physically located within the region Covariate --cov | -C Covariate file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID --cov-col | -c Header of covariates. If not provided, will use all variables in the covariate file. By adding @ in front of the string, any numbers within [ and ] will be parsed. E.g. @PC[1-3] will be read as PC1,PC2,PC3 . Discontinuous input is also supported: @cov[1.3-5] will be parsed as cov1,cov3,cov4,cov5 --cov-factor Header of categorical covariate(s). Dummy variables will be automatically generated. Any items in --cov-factor must also be found in --cov-col . Also accepts continuous input (starting with @ ). P-value Thresholding --bar-levels Levels of the barchart to be plotted. When --fastscore is set, PRSice will only calculate the PRS for thresholds within the bar levels. Levels should be comma separated without spaces --fastscore Only calculate thresholds stated in --bar-levels --no-full By default, PRSice will include the full model, i.e. p-value threshold = 1. Setting this flag will disable that behaviour --interval | -i The step size of the threshold. Default: 5e-05 --lower | -l The starting p-value threshold. Default: 5e-08 --model Genetic model used for regression. 
The genetic encoding is based on the base data where the encoding represent number of the effective allele Available models include: add - Additive model, code as 0/1/2 (default) dom - Dominant model, code as 0/1/1 rec - Recessive model, code as 0/0/1 het - Heterozygous only model, code as 0/1/0 --missing Method to handle missing genotypes. Available methods include: MEAN_IMPUTE - Missing genotypes contribute an amount proportional to imputed allele frequency (default) SET_ZERO - To throw out missing observations instead CENTER - shift all scores to mean zero. --no-regress Do not perform the regression analysis and simply output all PRS. --score Method to calculate the polygenic score. Available methods include: avg - Take the average effect size (default) std - Standardize the effect size con-std - Standardize the effect size using mean and sd derived from control samples sum - Direct summation of the effect size --upper | -u The final p-value threshold. Default: 0.5 PRSet --background String to indicate a background file. This string should have the format of Name:Type where type can be bed - 0-based range with 3 column. Chr Start End range - 1-based range with 3 column. Chr Start End gene - A file contain a column of gene name As the name suggest, the background file inform PRSet of the background signal to be used for competitive p-value calculation. When a background file is not provided, PRSet will construct the background using the GTF file. However, if both the background and the GTF file isn't provided, PRSet cannot perform the set base permutation. In this case, you can use --full-back to indicate that you'd like to use the whole genome as the background set --bed | -B Bed file containing the selected regions. Name of bed file will be used as the region identifier. Warning Bed file is 0-based. --feature Feature(s) to be included from the gtf file. 
Default: exon,CDS,gene,protein_coding --full-back Use the whole genome as background set for competitive p-value calculation --gtf | -g GTF file containing gene boundaries. Required when --msigdb is used Tip Human Genome build GRCh38 can be downloaded from here . --msigdb | -m MSigDB file containing the pathway information. Requires the gtf file. The GMT file format used by MSigDB is a simple tab/space delimited text file where each line corresponds to a single gene set followed by its Gene IDs: [Set A] [Gene 1] [Gene 2] ... [Set B] [Gene 1] [Gene 2] ... Tip Curated MSigDB files can be downloaded from here after registration here --set-perm The number of set-based permutations to perform. This is only used for calculating the competitive p-value. 10,000 permutations should generally be enough. --snp-set Provide gene sets using SNP ID. Two different formats are allowed: SNP Set list format: A file containing a single column of SNP ID. Name of the set will be the file name or can be provided using --snp-set File:Name MSigDB format: Each row represents a single SNP set with the first column containing the name of the SNP set. --wind-3 Add N base(s) to the 3' region of each gene region. Unit suffixes are allowed e.g. --wind-3 1M --wind-5 Add N base(s) to the 5' region of each gene region. Unit suffixes are allowed e.g. --wind-5 1M R specific commands --prsice Location of the PRSice executable. --dir Location to install required R packages. Only required if the required packages are not installed. We require the following packages: optparse , methods , tools , ggplot2 , data.table , grDevices , RColorBrewer Plotting --bar-col-high Colour of the most predictive threshold. Default: firebrick --bar-col-low Colour of the least predictive threshold. Default: dodgerblue --bar-col-p When set, will change the colour of the bars to reflect the p-value threshold instead of the p-value from the association with phenotype --bar-palatte Colour palette to be used for bar plotting when --bar-col-p is set. 
Default: YlOrRd --device Select different plotting devices. You can choose any plotting devices supported by base R. Default: png --multi-plot Plot the top N target phenotypes / gene sets in a summary plot --plot When set, will only perform plotting using existing PRSice result files. All other parameters are still required such that PRSice can correctly locate the required input files for plotting. --plot-set The default behaviour of PRSet is to plot the bar-chart, high-resolution plot and quantile plot of the \"Base\" gene set, which consider all SNPs within the genome. By using the --plot-set option, you can plot the specific set of interest. --quantile | -q Number of quantiles to plot. No quantile plot will be generated when this is not provided. --quant-break Parameter to indicate an uneven distribution of quantile. Values represent the upperbound of each quantile group. e.g. With --quantile 10 --quant-break 1,5,10 , the quantiles will be grouped into \\(0\\lt Q \\le 1\\) , \\(1\\lt Q \\le 5\\) , \\(5\\lt Q \\le 10\\) Note To use --quant-break , you must set the correct amount of quantiles. For example, if the largest value in --quant-break is 100, then you must use --quantile 100 --quant-extract | -e File containing sample ID to be plot on a separated quantile e.g. extra quantile containing only schizophrenia samples. Must contain IID. Should contain FID if --ignore-fid isn't set. Note This will only work if the base and target has a different phenotype or if the target phenotype is quantitative --quant-ref Reference quantile for quantile plot. Default is number of quantiles divided by 2 Or in the event where --quant-break is used, represent the upper bound of the reference quantile --scatter-r2 When set, will change the y-axis of the high resolution scatter plot to R2 instead Miscellaneous --all-score Output PRS for ALL threshold. Warning This will generate a huge file --exclude File contains SNPs to be excluded from the analysis. 
Mutually exclusive with --extract --chr-id Try to construct an RS ID for SNP based on its chromosome, coordinate, effective allele and non-effective allele. For example, c:L-aBd is translated to: :-d This ID will always be used to represent SNPs on the target file, whereas for the base file, we will still prefer to use the column provided in the --snp parameter. SNPs in base file will only be represented by the --chr-id if the RS ID is not provided. --extract File containing SNPs to be included in the analysis. Mutually exclusive with --exclude --id-delim Delimiter used to concatenate FID and IID when performing ID matching. Especially useful for BGEN file processing --ignore-fid Ignore FID for all input. When this is set, the first column of all files will be assumed to be IID instead of FID --keep-ambig Keep ambiguous SNPs. PRSice will only perform dosage flipping but not strand flipping on ambiguous SNPs. e.g. If your base data contain A/T, with effective allele being A, and your target data is T/A with dosage of T, then PRSice will change the dosage in target to A/T with dosage of A. Only use this option when you are certain your base and target are on the same strand. --logit-perm When performing permutation on binary phenotypes, use logistic regression instead of linear regression. This will substantially slow down PRSice. Note One problem with using --logit-perm is that some of the permuted phenotypes might suffer from perfect separation. This leads to the GLM logistic model not being able to converge (thus terminating PRSice). If you encounter such a problem, you might want to exclude the --logit-perm option. In most cases, the p-value of the linear model should be similar to that of the logistic model --memory Maximum memory usage allowed. PRSice will try its best to honor this setting. For example, --memory 10Gb will restrict PRSice to use no more than 10Gb of memory. 
However, as we are not using a memory pool like PLINK, it is possible for PRSice to use more than the allowed amount. PRSice will mainly check the memory usage when performing clumping, permutation analysis, or set-based permutation --non-cumulate Calculate non-cumulative PRS. PRS will be reset to 0 for each new P-value threshold instead of adding up --out | -o Prefix for all file output. Note If multiple target phenotypes are included (e.g. using --pheno-col ), the phenotype will be appended to the output prefix. If multiple gene sets are included, the name of the set will be appended to the output prefix (after the phenotype (if any)) --perm Number of permutations to perform. This will generate the empirical p-value. We recommend using a value larger than or equal to 10,000 Note When permutation is required, PRSice will perform the following operations: (1) Perform normal PRSice across all thresholds and obtain the p-value of the most significant threshold. (2) Repeat the PRSice analysis N times with permuted phenotype. (3) Count the number of times where the p-value of the most significant threshold for the permuted --print-snp Print all SNPs that remain in the analysis after clumping is performed. For PRSet, 1 indicates the SNP falls within the gene set of interest and 0 otherwise. If only PRSice is performed, a single \"gene set\" called \"Base\" will be indicated with all entries marked as 1 --seed | -s Seed used for permutation. If not provided, system time will be used as seed. Providing a seed allows the same results to be generated when the same seed and input are used --thread | -n Number of threads to use Tip Maximum number of threads can be specified by using --thread max Note PRSice will limit the maximum number of threads used to the number of cores available on the system as detected by PRSice. --ultra Ultra aggressive memory management. Will store all genotypes in memory after clumping is performed. 
This will significant speed up PRSice and PRSet at the expense of increased memory usage. --x-range Range of SNPs to be excluded from the whole analysis. It can either be a single bed file or a comma seperated list of range. Range must be in the format of chr:start-end or chr:coordinate --help | -h Display the help messages","title":"Available Commands"},{"location":"command_detail/#available-commands","text":"This page contains all command available in PRSice. Tips When constructing new parameters, we follow the following rule: if the command has effect on any file that is not the target, it will have a prefix of the file name. For example, --base-info applies INFO score filtering on the base file, --ld-info perform INFO score filtering on the LD reference file and --info applies the INFO score filtering on the target file.","title":"Available Commands"},{"location":"command_detail/#base-file","text":"--a1 Column header containing the effective allele . There isn't any standardized label for the effective allele, therefore extra care must be taken to ensure the correct label is provided, otherwise, the effect will be flipped. --a2 Column header containing non-effective allele . --base | -b Base (i.e. GWAS) association file. This is a whitespace delimited file containing association results for SNPs on the base phenotype. This file can be gzipped (must have the .gz suffix). For PRSice to run, the base file must contain the effective allele ( --A1 ), effect size estimates ( --stat ), p-value for association ( --pvalue ), and the SNP ID ( --snp ). --beta This flag is used to indicate if the test statistic is in the form of BETA. If set, PRSice assume the statistic is in the form BETA. Mutually exclusive from --or --bp Column header containing the coordinate of SNPs. When provided, the coordinate of the SNPs will be scrutinized between the base and target file. SNPs with mismatched coordinate will be excluded. --chr Column header containing the chromosome information. 
When provided, the chromosome information of the SNPs will be compared between the base and target file. SNPs with mismatched chromosome information will be automatically excluded. --index If set, assume the base columns are given as INDEX instead of the name of the corresponding columns. Index should be 0-based (start counting from 0) --base-info Base INFO score filtering. Format should be the column name and the threshold separated by a colon (e.g. INFO:0.9). SNPs with INFO score less than the threshold will be ignored. It is useful to perform INFO score filtering to remove SNPs with low imputation confidence score. By default, PRSice will search for the INFO column in your base file and perform INFO score filtering with a threshold of 0.9 . You can disable this behaviour by using --no-default --base-maf Base minor allele frequency (MAF) filtering. Format should be the column name and the threshold separated by a colon. SNPs with MAF less than the threshold will be ignored. Additional columns can be provided (e.g. different filtering thresholds for cases and controls) as a comma-separated list of column:threshold pairs --no-default Remove all default options. If set, PRSice will not set any defaults. --or This flag is used to indicate if the test statistic is in the form of odds ratios. If set, PRSice assumes the statistic is in the form OR. Mutually exclusive with --beta --pvalue | -p Column header containing the p-value. The p-value information must be provided --snp Column header containing the SNP ID. This is required to allow SNP matching between the base and target file. Note While it is possible to implement a feature to allow SNP matching purely based on the chromosome number and coordinate of a variant, the possibility of flipping and multi-allelic input complicates the matter. Therefore this feature will not be implemented until an elegant solution can be provided. --stat Column header containing the summary statistic. If --beta is set, default to BETA ; likewise, if --or is set, default to OR . Otherwise, will try and search for OR or BETA from the header of the base file.
If both OR and BETA are present in the header, PRSice will terminate.","title":"Base File"},{"location":"command_detail/#target-file","text":"--binary-target Indicate whether the target phenotype is binary or not. Either T or F should be provided where T represents a binary phenotype. For multiple phenotypes, the input should be separated by commas without spaces. Default: F if --beta is set and T if --or is set --geno Filter SNPs based on genotype missingness. Must be a value between 0.0 and 1.0 . --info Filter SNPs based on INFO score. Only used for imputed target data. The INFO score is calculated as the MaCH imputation r-squared value, represented by the following pseudo code: m=Mean of expected genotype v=variance of expected genotype p=m/2 p_a = 2p(1-p) INFO = v/p_a --keep File containing the sample(s) to be extracted from the target file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID. Mutually exclusive with --remove --maf Filter SNPs based on minor allele frequency (MAF). MAF is calculated using only the founder samples Note When performing MAF filtering on dosage data, the MAF is calculated using the hard-coded genotype --nonfounders By default, PRSice will exclude all non-founders from the analysis. When this flag is set, non-founders will be included in the regression model but will still be excluded from LD estimation. --pheno | -f Tab or space delimited phenotype file containing the phenotype(s). First column must be FID of the samples and the second column must be IID of the samples. When --ignore-fid is set, first column must be the IID of the samples. Must contain a header if --pheno-col is specified --pheno-col Headers of phenotypes to be included from the phenotype file. When multiple phenotypes are provided, the phenotype name will be used as part of the file output prefix --prevalence | -k Prevalence of all binary traits.
If provided, PRSice will adjust the R2 for ascertainment bias. Note When multiple binary traits are found, prevalence information must be provided for all of them (either adjust all binary traits, or don't adjust at all). For example, if there are 3 traits A, B and C, where A and C are binary traits with population prevalence of 0.1 and 0.2 respectively, the correct input is --binary-target T,F,T --prevalence 0.1,0.2 --remove File containing the sample(s) to be removed from the target file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID. Mutually exclusive with --keep --target | -t Target genotype file. Currently supports both BGEN and binary PLINK format. For multiple chromosome input, simply substitute the chromosome number with #. PRSice will automatically replace # with 1-22. A separate fam/sample file can be specified after the --target prefix, separated by a comma. --target-list File containing prefixes of target genotype files. Similar to --target but allows for more flexibility. A separate fam/sample file can be specified after the --target-list file, separated by a comma. --type File type of the target file. Supports bed (binary plink) and bgen format. Default: bed","title":"Target File"},{"location":"command_detail/#dosage","text":"--allow-inter Allow the generation of intermediate files. This will speed up PRSice when using dosage data as clumping reference and for hard coding PRS calculation --dose-thres Translate any SNPs with highest genotype probability less than this threshold to missing call. For example, with --dose-thres 0.9 , a sample with genotype probability of \\(P(0/0)=0.2\\) , \\(P(0/1)=0.52\\) , \\(P(1/1)=0.28\\) will be set to missing --hard-thres A hardcall is saved when the distance to the nearest hardcall is less than the hardcall threshold. Otherwise a missing code is saved.
Default is: 0.1 The distance ( \\(D\\) ) to the nearest hardcall is calculated as: \\[ P(Ref) = 2 \\times P(HomRef) + P(Het) \\\\ P(Alt) = 2 \\times P(HomAlt) + P(Het) \\\\ D = 0.5 \\times \\left(|P\\left(Ref\\right)- round\\left(P\\left(Ref\\right)\\right)| + |P\\left(Alt\\right)- round\\left(P\\left(Alt\\right)\\right)|\\right) \\] Note If dosage data is used as a LD reference, it will always be hard coded to calculate the LD Default: 0.9 --hard When set, will use hard thresholding instead of dosage for PRS construction. Default is to use dosage.","title":"Dosage"},{"location":"command_detail/#clumping","text":"--clump-kb The distance for clumping in kb. For example, if --clump-kb 250 is provided, PRSice will clump any SNPs within 250kb of either side of the index SNP (therefore a 500kb window with the index SNP at the center). A distance with a unit is also supported, e.g. --clump-kb 1M is a valid input. Default: 250kb for PRSice, 1mb for PRSet --clump-r2 The r² threshold for clumping. Default: 0.1 --clump-p The p-value threshold used for clumping. Default: 1. --ld | -L LD reference file. Used for estimation of LD during clumping. If not provided, will use the post-filtered target genotype for LD calculation. Supports multiple chromosome input. Please see --target for more information. When the target sample is small (e.g. < 500) and an external panel of the same population is available (e.g. 1000 Genomes), the external reference panel might be used to improve the LD estimation for clumping. --ld-dose-thres Translate any SNPs with highest genotype probability less than this threshold to missing call. For example, with --ld-dose-thres 0.9 , a sample with genotype probability of \\(P(0/0)=0.2\\) , \\(P(0/1)=0.52\\) , \\(P(1/1)=0.28\\) will be set to missing --ld-geno Filter SNPs based on genotype missingness. Must be a value between 0.0 and 1.0 . --ld-info Filter SNPs based on INFO score. Only used for imputed LD reference.
The INFO score is calculated as the MaCH imputation r-squared value, represented by the following pseudo code: m=Mean of expected genotype v=variance of expected genotype p=m/2 p_a = 2p(1-p) INFO = v/p_a --ld-hard-thres A hardcall is saved when the distance to the nearest hardcall is less than the hardcall threshold. Otherwise a missing code is saved. Default is: 0.1 The distance ( \\(D\\) ) to the nearest hardcall is calculated as: \\[ P(Ref) = 2 \\times P(HomRef) + P(Het) \\\\ P(Alt) = 2 \\times P(HomAlt) + P(Het) \\\\ D = 0.5 \\times \\left(|P\\left(Ref\\right)- round\\left(P\\left(Ref\\right)\\right)| + |P\\left(Alt\\right)- round\\left(P\\left(Alt\\right)\\right)|\\right) \\] Note If dosage data is used as a LD reference, it will always be hard coded to calculate the LD Default: 0.9 --ld-keep File containing the sample(s) to be extracted from the LD reference file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID. Mutually exclusive with --ld-remove . No effect if --ld was not provided --ld-list File containing prefixes of multiple LD reference files. Similar to --ld but allows more flexibility. A separate fam/sample file can be specified after the --ld-list file, separated by a comma. --ld-maf Filter SNPs based on minor allele frequency (MAF) Note When performing MAF filtering on dosage data, MAF is calculated using the hard-coded genotype --ld-remove File containing the sample(s) to be removed from the LD reference file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID. Mutually exclusive with --ld-keep --ld-type File type of the LD file. Supports bed (binary plink) and bgen format. Default: bed --no-clump When set, PRSice will not perform clumping. This is useful when a pre-clumped list of SNPs is available. --proxy Proxy threshold for index SNP to be considered as part of the region represented by the clumped SNP(s). e.g.
--proxy 0.8 means the index SNP will represent the region of any clumped SNP(s) that has r² = 0.8 even if the index SNP is not physically located within the region","title":"Clumping"},{"location":"command_detail/#covariate","text":"--cov | -C Covariate file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID --cov-col | -c Header of covariates. If not provided, will use all variables in the covariate file. By adding @ in front of the string, any numbers within [ and ] will be parsed. E.g. @PC[1-3] will be read as PC1,PC2,PC3 . Discontinuous input is also supported: @cov[1.3-5] will be parsed as cov1,cov3,cov4,cov5 --cov-factor Header of categorical covariate(s). Dummy variables will be automatically generated. Any items in --cov-factor must also be found in --cov-col . Also accepts continuous input (starting with @ ).","title":"Covariate"},{"location":"command_detail/#p-value-thresholding","text":"--bar-levels Levels of the bar chart to be plotted. When --fastscore is set, PRSice will only calculate the PRS for thresholds within the bar levels. Levels should be comma separated without spaces --fastscore Only calculate thresholds stated in --bar-levels --no-full By default, PRSice will include the full model, i.e. p-value threshold = 1. Setting this flag will disable that behaviour --interval | -i The step size of the threshold. Default: 5e-05 --lower | -l The starting p-value threshold. Default: 5e-08 --model Genetic model used for regression. The genetic encoding is based on the base data, where the encoding represents the number of the effective allele. Available models include: add - Additive model, code as 0/1/2 (default) dom - Dominant model, code as 0/1/1 rec - Recessive model, code as 0/0/1 het - Heterozygous only model, code as 0/1/0 --missing Method to handle missing genotypes.
Available methods include: MEAN_IMPUTE - Missing genotypes contribute an amount proportional to imputed allele frequency (default) SET_ZERO - To throw out missing observations instead CENTER - shift all scores to mean zero. --no-regress Do not perform the regression analysis and simply output all PRS. --score Method to calculate the polygenic score. Available methods include: avg - Take the average effect size (default) std - Standardize the effect size con-std - Standardize the effect size using mean and sd derived from control samples sum - Direct summation of the effect size --upper | -u The final p-value threshold. Default: 0.5","title":"P-value Thresholding"},{"location":"command_detail/#prset","text":"--background String to indicate a background file. This string should have the format of Name:Type where type can be bed - 0-based range with 3 columns. Chr Start End range - 1-based range with 3 columns. Chr Start End gene - A file containing a column of gene names As the name suggests, the background file informs PRSet of the background signal to be used for competitive p-value calculation. When a background file is not provided, PRSet will construct the background using the GTF file. However, if neither the background nor the GTF file is provided, PRSet cannot perform the set-based permutation. In this case, you can use --full-back to indicate that you'd like to use the whole genome as the background set --bed | -B Bed file containing the selected regions. Name of bed file will be used as the region identifier. Warning Bed file is 0-based. --feature Feature(s) to be included from the gtf file. Default: exon,CDS,gene,protein_coding --full-back Use the whole genome as background set for competitive p-value calculation --gtf | -g GTF file containing gene boundaries. Required when --msigdb is used Tip Human Genome build GRCh38 can be downloaded from here . --msigdb | -m MSigDB file containing the pathway information. Requires the gtf file.
The GMT file format used by MSigDB is a simple tab/space delimited text file where each line corresponds to a single gene set: the set name followed by gene IDs: [Set A] [Gene 1] [Gene 2] ... [Set B] [Gene 1] [Gene 2] ... Tip Curated MSigDB files can be downloaded from here after registration in here --set-perm The number of set-based permutations to perform. This is only used for calculating the competitive p-value. 10,000 permutations should generally be enough. --snp-set Provide gene sets using SNP ID. Two different formats are allowed: SNP Set list format: A file containing a single column of SNP ID. Name of the set will be the file name or can be provided using --snp-set File:Name MSigDB format: Each row represents a single SNP set with the first column containing the name of the SNP set. --wind-3 Add N base(s) to the 3' region of each gene region. Unit suffixes are allowed e.g. --wind-3 1M --wind-5 Add N base(s) to the 5' region of each gene region. Unit suffixes are allowed e.g. --wind-5 1M","title":"PRSet"},{"location":"command_detail/#r-specific-commands","text":"--prsice Location of the PRSice executable. --dir Location to install the required R packages. Only required if the required packages are not installed. We require the following packages: optparse , methods , tools , ggplot2 , data.table , grDevices , RColorBrewer","title":"R specific commands"},{"location":"command_detail/#plotting","text":"--bar-col-high Colour of the most predictive threshold. Default: firebrick --bar-col-low Colour of the least predictive threshold. Default: dodgerblue --bar-col-p When set, will change the colour of bar to p-value threshold instead of the p-value from the association with phenotype --bar-palatte Colour palette to be used for bar plotting when --bar-col-p is set. Default: YlOrRd --device Select different plotting devices. You can choose any plotting devices supported by base R.
Default: png --multi-plot Plot the top N target phenotypes / gene sets in a summary plot --plot When set, will only perform plotting using existing PRSice result files. All other parameters are still required such that PRSice can correctly locate the required input files for plotting. --plot-set The default behaviour of PRSet is to plot the bar-chart, high-resolution plot and quantile plot of the \"Base\" gene set, which considers all SNPs within the genome. By using the --plot-set option, you can plot the specific set of interest. --quantile | -q Number of quantiles to plot. No quantile plot will be generated when this is not provided. --quant-break Parameter to indicate an uneven distribution of quantiles. Values represent the upper bound of each quantile group. e.g. With --quantile 10 --quant-break 1,5,10 , the quantiles will be grouped into \\(0\\lt Q \\le 1\\) , \\(1\\lt Q \\le 5\\) , \\(5\\lt Q \\le 10\\) Note To use --quant-break , you must set the correct number of quantiles. For example, if the largest value in --quant-break is 100, then you must use --quantile 100 --quant-extract | -e File containing sample IDs to be plotted in a separate quantile e.g. extra quantile containing only schizophrenia samples. Must contain IID. Should contain FID if --ignore-fid isn't set. Note This will only work if the base and target have different phenotypes or if the target phenotype is quantitative --quant-ref Reference quantile for quantile plot. Default is the number of quantiles divided by 2. Or, in the event where --quant-break is used, represents the upper bound of the reference quantile --scatter-r2 When set, will change the y-axis of the high resolution scatter plot to R2 instead","title":"Plotting"},{"location":"command_detail/#miscellaneous","text":"--all-score Output PRS for ALL thresholds. Warning This will generate a huge file --exclude File containing SNPs to be excluded from the analysis.
Mutually exclusive with --extract --chr-id Try to construct an RS ID for a SNP based on its chromosome, coordinate, effective allele and non-effective allele. For example, c:L-aBd is translated to the chromosome, a colon, the coordinate, a hyphen, the effective allele, the non-effective allele and a literal d. This ID will always be used to represent SNPs on the target file, whereas for the base file, we will still prefer to use the column provided in the --snp parameter. SNPs in the base file will only be represented by the --chr-id if the RS ID is not provided. --extract File containing SNPs to be included in the analysis. Mutually exclusive with --exclude --id-delim Delimiter used to concatenate FID and IID when performing ID matching. Especially useful for BGEN file processing --ignore-fid Ignore FID for all input. When this is set, the first column of all files will be assumed to be IID instead of FID --keep-ambig Keep ambiguous SNPs. PRSice will only perform dosage flipping but not strand flipping on ambiguous SNPs. e.g. If your base data contain A/T, with effective allele being A, and your target data is T/A with dosage of T, then PRSice will change the dosage in target to A/T with dosage of A. Only use this option when you are certain your base and target are on the same strand. --logit-perm When performing permutation on binary phenotypes, use logistic regression instead of linear regression. This will substantially slow down PRSice. Note One problem with using --logit-perm is that some of the permuted phenotypes might suffer from perfect separation. This leads to the GLM logistic model not being able to converge (thus terminating PRSice). If you encounter such a problem, you might want to exclude the --logit-perm option. In most cases, the p-value of the linear model should be similar to that of the logistic model --memory Maximum memory usage allowed. PRSice will try its best to honor this setting. For example, --memory 10Gb will restrict PRSice to use no more than 10Gb of memory.
However, as we are not using a memory pool like PLINK, it is possible for PRSice to use more than the allowed amount. PRSice will mainly check the memory usage when: Perform Clumping Perform permutation analysis Perform set-based permutation --non-cumulate Calculate non-cumulative PRS. PRS will be reset to 0 for each new P-value threshold instead of adding up --out | -o Prefix for all file output. Note If multiple target phenotypes are included (e.g. using --pheno-col ), the phenotype will be appended to the output prefix If multiple gene sets are included, the name of the set will be appended to the output prefix (after the phenotype (if any)) --perm Number of permutations to perform. This will generate the empirical p-value. We recommend using a value larger than or equal to 10,000 Note When permutation is required, PRSice will perform the following operation Perform normal PRSice across all thresholds and obtain p-value of the most significant threshold Repeat PRSice analysis N times with permuted phenotype. Count the number of times where the p-value of the most significant threshold for the permuted --print-snp Print all SNPs that remain in the analysis after clumping is performed. For PRSet, 1 indicates the SNP falls within the gene set of interest and 0 otherwise. If only PRSice is performed, a single \"gene set\" called \"Base\" will be indicated with all entries marked as 1 --seed | -s Seed used for permutation. If not provided, system time will be used as seed. This will allow the same results to be generated when the same seed and input are used --thread | -n Number of threads to use Tip Maximum number of threads can be specified by using --thread max Note PRSice will limit the maximum number of threads used to the number of cores available on the system as detected by PRSice. --ultra Ultra aggressive memory management. Will store all genotypes in memory after clumping is performed.
This will significantly speed up PRSice and PRSet at the expense of increased memory usage. --x-range Range of SNPs to be excluded from the whole analysis. It can either be a single bed file or a comma-separated list of ranges. Each range must be in the format of chr:start-end or chr:coordinate --help | -h Display the help messages","title":"Miscellaneous"},{"location":"compilation/","text":"Introduction Here are the guidelines for anyone who might want to compile PRSice from source. Prerequisites For the C++ executable 1. GCC version 7 or higher (for C++17 support) 2. CMake version 3.1 or higher (Optional) 3. Git (Optional) Note Only the C++ executable needs to be built Using CMake With CMake, you can simply do the following: git clone https://github.com/choishingwan/PRSice.git cd PRSice mkdir build cd build cmake ../ make Then the PRSice executable will be located within PRSice/bin If you don't have git installed, you can still do (remember to download eigen to lib ) curl https://codeload.github.com/choishingwan/PRSice/tar.gz/2.3.3 > PRSice.tar.gz tar -xvf PRSice.tar.gz cd PRSice-2.3.3 mkdir build cd build cmake ../ make Note The above procedure was not tested on Windows Without CMake Without CMake, you will need to first download the eigen library. You can then do the following: git clone https://github.com/choishingwan/PRSice.git cd PRSice g++ -std=c++17 -O3 -DNDEBUG -march=native -isystem lib -isystem ${PATH_TO_EIGEN} -I inc src/*.cpp -lpthread -lz -o PRSice Then PRSice will be located in the current directory Alternatively, if you don't have git installed, you can still do curl https://codeload.github.com/choishingwan/PRSice/tar.gz/2.3.3 > PRSice.tar.gz tar -xvf PRSice.tar.gz cd PRSice-2.3.3 g++ -std=c++17 -O3 -DNDEBUG -march=native -isystem lib -isystem ${PATH_TO_EIGEN} -I inc src/*.cpp -lpthread -lz -o PRSice Intel MKL If you know how to set up the Intel \\(\\circledR\\) MKL library, you can compile PRSice with it to speed up processing.
You can use this to help you with the linking.","title":"Compile from Source"},{"location":"compilation/#introduction","text":"Here are the guidelines for anyone who might want to compile PRSice from source.","title":"Introduction"},{"location":"compilation/#prerequisites","text":"For the C++ executable 1. GCC version 7 or higher (for C++17 support) 2. CMake version 3.1 or higher (Optional) 3. Git (Optional) Note Only the C++ executable needs to be built","title":"Prerequisites"},{"location":"compilation/#using-cmake","text":"With CMake, you can simply do the following: git clone https://github.com/choishingwan/PRSice.git cd PRSice mkdir build cd build cmake ../ make Then the PRSice executable will be located within PRSice/bin If you don't have git installed, you can still do (remember to download eigen to lib ) curl https://codeload.github.com/choishingwan/PRSice/tar.gz/2.3.3 > PRSice.tar.gz tar -xvf PRSice.tar.gz cd PRSice-2.3.3 mkdir build cd build cmake ../ make Note The above procedure was not tested on Windows","title":"Using CMake"},{"location":"compilation/#without-cmake","text":"Without CMake, you will need to first download the eigen library. You can then do the following: git clone https://github.com/choishingwan/PRSice.git cd PRSice g++ -std=c++17 -O3 -DNDEBUG -march=native -isystem lib -isystem ${PATH_TO_EIGEN} -I inc src/*.cpp -lpthread -lz -o PRSice Then PRSice will be located in the current directory Alternatively, if you don't have git installed, you can still do curl https://codeload.github.com/choishingwan/PRSice/tar.gz/2.3.3 > PRSice.tar.gz tar -xvf PRSice.tar.gz cd PRSice-2.3.3 g++ -std=c++17 -O3 -DNDEBUG -march=native -isystem lib -isystem ${PATH_TO_EIGEN} -I inc src/*.cpp -lpthread -lz -o PRSice","title":"Without CMake"},{"location":"compilation/#intel-mkl","text":"If you know how to set up the Intel \\(\\circledR\\) MKL library, you can compile PRSice with it to speed up processing.
You can use this to help you with the linking.","title":"Intel MKL"},{"location":"decisions/","text":"Introduction Here, we detail some of the decisions we made during the implementation of PRSice Support of BGEN v1.3 We have purposefully disabled support for BGEN v1.3 to avoid the inclusion of the zstd library. This is because - UKBB is v1.2 - We are not familiar with the licensing of the zstd library (developed by Facebook) Removal of PCA calculation The main goal of PRSice 2 is to support polygenic score analysis on large scale data. With such data, it is very time consuming to perform PCA, which requires specific algorithms such as those implemented in flashPCA . In order to support the in-place PCA calculation, not only would we have to implement the flashPCA algorithm, we would also need to implement pruning, which is required prior to PCA calculation. Due to the lack of manpower, we therefore decided that we will not implement the PCA calculation. Another reason is that we believe users should first examine the PCA results before directly applying them to the PRS analysis.","title":"Development Decisions"},{"location":"decisions/#introduction","text":"Here, we detail some of the decisions we made during the implementation of PRSice","title":"Introduction"},{"location":"decisions/#support-of-bgen-v13","text":"We have purposefully disabled support for BGEN v1.3 to avoid the inclusion of the zstd library. This is because - UKBB is v1.2 - We are not familiar with the licensing of the zstd library (developed by Facebook)","title":"Support of BGEN v1.3"},{"location":"decisions/#removal-of-pca-calculation","text":"The main goal of PRSice 2 is to support polygenic score analysis on large scale data. With such data, it is very time consuming to perform PCA, which requires specific algorithms such as those implemented in flashPCA .
In order to support the in-place PCA calculation, not only would we have to implement the flashPCA algorithm, we would also need to implement pruning, which is required prior to PCA calculation. Due to the lack of manpower, we therefore decided that we will not implement the PCA calculation. Another reason is that we believe users should first examine the PCA results before directly applying them to the PRS analysis.","title":"Removal of PCA calculation"},{"location":"extra_steps/","text":"Introduction After installation of R, additional steps might be required for Mac and Windows users. Below are the instructions MAC Users Download and install the latest XQuartz This is because Mac no longer ships the X11 package which is required by R to perform plotting. Run xcode-select --install on your terminal This will install the required zlib package on your system, which is required by PRSice (for decompressing bgen files) Note For anyone with older Mac versions (e.g. Mountain Lion or before), you should follow the guide here to install the required Command Line Tools . Window Users As installation of R does not automatically add it to the system path, one will need to type the full path of the R.exe and Rscript.exe in order to use PRSice. To avoid this complication, we can manually add the folder containing the R binary to the system path: For Windows 8 and 10 In Search, search for and then select: System (Control Panel) Click the Advanced system settings link Click the Advanced tab Click Environment Variables Under System Variables , select path (If you cannot find path, you can click new to make it) Click Edit Click Browse and select the location of the executable of R. If you use the default installation path, you can add C:\\Program Files\\R\\R-3.3.2\\bin , where (e.g.) 3.3.2 is the version number. Some installations might also have an i386 and an x64 version and either one of those will work. For Windows 7 From the desktop, right click the Computer icon.
Choose Properties from the context menu. Click the Advanced system settings link. Click Environment Variables . In the section System Variables , find the PATH environment variable and select it. Click Edit. In the Edit System Variable (or New System Variable ) window, add the location of the executable of R. If you use the default installation path, you can add C:\\Program Files\\R\\R-3.3.2\\bin , where (e.g.) 3.3.2 is the version number. Some installations might also have an i386 and an x64 version and either one of those will work.","title":"Additional Steps for MAC and Window users"},{"location":"extra_steps/#introduction","text":"After installation of R, additional steps might be required for Mac and Windows users. Below are the instructions","title":"Introduction"},{"location":"extra_steps/#mac-users","text":"Download and install the latest XQuartz This is because Mac no longer ships the X11 package which is required by R to perform plotting. Run xcode-select --install on your terminal This will install the required zlib package on your system, which is required by PRSice (for decompressing bgen files) Note For anyone with older Mac versions (e.g. Mountain Lion or before), you should follow the guide here to install the required Command Line Tools .","title":"MAC Users"},{"location":"extra_steps/#window-users","text":"As installation of R does not automatically add it to the system path, one will need to type the full path of the R.exe and Rscript.exe in order to use PRSice. To avoid this complication, we can manually add the folder containing the R binary to the system path:","title":"Window Users"},{"location":"extra_steps/#for-windows-8-and-10","text":"In Search, search for and then select: System (Control Panel) Click the Advanced system settings link Click the Advanced tab Click Environment Variables Under System Variables , select path (If you cannot find path, you can click new to make it) Click Edit Click Browse and select the location of the executable of R.
If you use the default installation path, you can add C:\\Program Files\\R\\R-3.3.2\\bin , where (e.g.) 3.3.2 is the version number. Some installations might also have an i386 and an x64 version and either one of those will work.","title":"For Windows 8 and 10"},{"location":"extra_steps/#for-windows-7","text":"From the desktop, right click the Computer icon. Choose Properties from the context menu. Click the Advanced system settings link. Click Environment Variables . In the section System Variables , find the PATH environment variable and select it. Click Edit. In the Edit System Variable (or New System Variable ) window, add the location of the executable of R. If you use the default installation path, you can add C:\\Program Files\\R\\R-3.3.2\\bin , where (e.g.) 3.3.2 is the version number. Some installations might also have an i386 and an x64 version and either one of those will work.","title":"For Windows 7"},{"location":"faq/","text":"Frequently Asked Questions We will continue to update this list to address the more common questions. I've received the following error message, what should I do? GLM model did not converge! Please send me the DEBUG files This error message means that the logistic regression model cannot converge. This is usually caused by a small sample size or by a problem in the input file. You should first check the DEBUG file and see if it contains any NaN or Inf . These are likely caused by input that has not been quality controlled, which can contain complete missingness. If that isn't the case, then you can load the DEBUG and DEBUG.y file into R and see if you can perform the logistic regression on the data (DEBUG.y is the y, whereas DEBUG contains the independent variables, including the intercept). My base/ target data do not contain the RS ID. Can I still use PRSice? As of version 2.3.x, PRSice supports chromosome ID via the parameter --chr-id .
--chr-id will automatically generate an ID for each of your SNPs based on a user-provided string, in which some characters are reserved: c = chromosome, l = coordinates, a = effective allele, b = non-effective allele. So --chr-id c:l-ab will construct the SNP ID as [chromosome]:[coordinates]-[effective allele][non-effective allele]. Note It is not case sensitive","title":"Frequently Asked Questions"},{"location":"faq/#frequently-asked-questions","text":"We will continue to update this list to address the more common questions. I've received the following error message, what should I do? GLM model did not converge! Please send me the DEBUG files This error message means that the logistic regression model cannot converge. This is usually caused by a small sample size or by a problem in the input file. You should first check the DEBUG file and see if it contains any NaN or Inf. These are likely caused by input that has not been quality controlled, which can contain complete missingness. If that isn't the case, then you can load the DEBUG and DEBUG.y files into R and see if you can perform the logistic regression on the data (DEBUG.y is the y, whereas DEBUG contains the independent variables, including the intercept). My base/target data do not contain the RS ID. Can I still use PRSice? As of version 2.3.x, PRSice supports chromosome IDs via the parameter --chr-id. --chr-id will automatically generate an ID for each of your SNPs based on a user-provided string, in which some characters are reserved: c = chromosome, l = coordinates, a = effective allele, b = non-effective allele. So --chr-id c:l-ab will construct the SNP ID as [chromosome]:[coordinates]-[effective allele][non-effective allele]. Note It is not case sensitive","title":"Frequently Asked Questions"},{"location":"prset_detail/","text":"Input Data MSigDB One simple way to obtain gene sets or pathways is through the MSigDB. After registering here, you can download different gene sets curated by the Broad Institute. Alternatively, you can also generate your own gene sets in the GMT format: [Set A] [Gene 1] [Gene 2] ... [Set B] [Gene 1] [Gene 2] ...
Sometimes, the MSigDB file might store the URL for the gene set in the second column: [Set A] [url for set A] [Gene 1] [Gene 2] ... PRSet can handle that properly. Gene GTF As the MSigDB file does not contain the genomic boundaries of the genes within the gene set, one must also provide a GTF file. A GTF file contains the genomic boundaries of the genetic elements within the human genome, and PRSet can use the information from the GTF file to determine whether a SNP falls within a specific gene. One can download the GTF file for Human (genome build GRCh38.p7) here. PRSet will look for any regions with a feature of exon, gene, protein_coding or CDS (case sensitive). Any genomic regions without these features will be ignored. Alternatively, you can specify the features using the --feature command. Note For those who are unfamiliar, different versions of the genome might differ slightly in their coordinates. Therefore it is vital to ensure that all the files originate from the same genome build. Bed Files In addition, PRSet also accepts bed file(s) as input. Important A bed file must contain a minimum of 3 columns: chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671). chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99. PRSet will read in any number of bed files (comma separated) and use the file names as the names of the gene sets. Note An annoying feature of the bed format is that its coordinates start at 0 whereas, for example, the PLINK formats start their coordinates at 1. So do remember to subtract 1 from the region start when you build your own bed file from scratch.
SNP Set Files Finally, PRSet also allow SNP sets, input via the --snp-set option. Two different formats are allowed SNP list format, a file containing a single column of SNP ID. Name of the set will be the file name or can be provided using --snp-set File:Name MSigDB format: Each row represent a single SNP set with the first column containing the name of the SNP set. Clumping in PRSet In PRSice-2, clumping is performed to account for linkage disequilibrium (LD) between SNPs. However, when performing set based analysis, special care are required to perform clumping. Take the following as an example: Assume that: Light Blue fragments are the intergenic regions Dark Blue fragments are the genic regions Red fragments are the gene set regions SNPs are represented as thunder bolt, with the \"index\" SNP in clumping denoted by the green thunderbolt If we simply perform a genome wide clumping, we might remove all SNPs residing within the gene set of interest, reducing the signal: Therefore, to maximize signal within each gene set, we must perform clumping for each gene sets separately: this can be a tedious process and are prone to error. To speed up clumping, PRSice-2 adopt a \" capture the flag \" system. Each SNPs contains a flag to represent their gene set membership. If a SNP is a member for the set, it will have a flag of 1, otherwise it will have a flag of 0. For example: SNP Set A Set B Set C Set D SNP 1 1 0 1 1 SNP 2 0 0 1 1 SNP 3 1 1 0 1 If we use SNP 1 as the index SNP, then after clumping, we will have SNP Set A Set B Set C Set D SNP 1 1 0 1 1 SNP 2 0 0 0 0 SNP 3 0 1 0 0 which removes SNP 2, but will retain SNP 3. This allow us to achieve set based clumping by only performing a single pass genome wide clumping. P-value Threshold and Proxy Clumping Options Proxy PRSet One complication in PRSet is the definition of SNP membership. The default option of PRSet is to only include SNPs that are physically within the target region. 
However, it is also likely for SNPs outside the region to influence functions of the set. Therefore we provide the --proxy option. Essentially, this provide a soft cutoff to SNP membership. For example, when user define --proxy 0.8 , if LD between SNP A and SNP B is more than 0.8, then SNP A will be considered to be within the same regions as SNP B and vice versa. P-value thresholding By default, PRSet do not perform p-value thresholding and will simply calculate the set based PRS at P-value threshold of 1. This is because it is unclear whether the set is associated with the phenotype when the best-threshold contained only a small portion of SNPs within the gene sets. If you wish to perform p-value thresholding with PRSet, you will need to specify any of the parameters related to p-value thresholding, i.e. --interval , --lower , --upper , --fastscore or --bar-levels . Competitive P-value Calculation A challenge in Set base analysis is to obtain a competitive p-value, which indicates the level of enrichment, as opposed to the self-contained p-value which indicates the level of association. To obtain a competitive p-value, PRSet can perform a permutation analysis as follow Allocate SNPs to each gene sets Allocate SNPs to a background gene set if --full-back is specified, use the whole genome as the background if a background file is provided via the --background command, it will be used to construct the background set otherwise, will try to use the GTF file provided from --gtf command as the background (with feature filtering w.r.t --feature ) Perform set based clumping on all sets (including the background set) Obtain the p-value of association for the best threshold for each sets ( \\(P_{observed}\\) ) While PRSet allow one to perform p-value thresholding on the set scores, we recommend against it as it is difficult to interpret the result. 
Using an extreme example, if only one SNP is included in the best threshold for a set, should we really consider this single SNP as representative of the gene set? For each gene set with \\(N\\) post-clump SNPs: (1) Randomly select \\(N\\) post-clump SNPs from the background set and construct a null PRS. (2) Calculate the p-value of association of the null PRS to obtain a null P-value ( \\(P_{null}\\) ). (3) Repeat steps 1-2 \\(M\\) times, where \\(M\\) can be set via --set-perm. The competitive P-value is calculated as $$ \\text{Competitive-}P = \\frac{\\sum_{m=1}^{M}I(P_{null,m}\\lt P_{observed})+1}{M+1} $$ where \\(P_{null,m}\\) is the null P-value from the \\(m\\)-th permutation and \\(I(.)\\) is the indicator function. Computation Algorithm Due to the number of operations required, the set based permutation is extremely time consuming. To speed up the set based permutation, we note that in regression, $$ Y\\sim X\\beta+C+\\epsilon $$ and \\[ X\\sim Y\\beta+C+\\epsilon \\] will generate the same t-statistic for \\(X\\) in the first equation and \\(Y\\) in the second equation. Based on this observation, we can then do the following: (1) Generate a matrix \\(A\\) containing the phenotype of interest and the covariates. (2) Decompose matrix \\(A\\). (3) For each new PRS calculated, solve \\(PRS=A\\beta+\\epsilon\\) and obtain the t-statistic. These t-statistics are then used to construct the null distribution, allowing us to obtain the competitive p-value. As we only need to do the decomposition once, this should significantly increase the speed of set based permutation.
In our test, for the TOY data, with --set-perm 5000, we can speed up the set-based permutation by around 20-25%. Note With binary traits, unless --logit-perm is set, we will still perform linear regression, as we assume linear regression and logistic regression should produce similar t-statistics. Output Data PRS model-fit A file containing the PRS model fit across thresholds is named [Name].prsice, where [Name] is the output prefix name as specified by --out. This is stored as Name of Set, Threshold, R2, P-value, Coefficient, Standard Error, and Number of SNPs at this threshold. Scores for each individual A file containing the PRS for each individual at the best-fit PRS, named [Name].best, is provided. This file has the format of: FID, IID, In Regression, PRS at best threshold for Set 1, PRS at best threshold for Set 2, ... where the In Regression column indicates whether the sample contains all the required phenotypes for PRSice analysis (e.g. samples with a missing phenotype/covariate will not be included in the regression. These samples will be indicated as \"No\" under the In Regression column). If the --all option is used, a file named [Name].all.score is also generated. Please note, if the --all option is used, the PRS for each individual at all thresholds will be given. In the event where the target sample size is large and a lot of thresholds are tested, this file can be large. This is especially true when a large number of gene sets are provided. Note PRSice also supports multiple phenotypes for target data. All output prefixes will change to [Name].[Pheno], where [Pheno] is the name of the phenotype. For more details on the options used to implement this, see here. Summary Information Information on the best model fit of each phenotype and gene set is stored in [Name].summary. The summary file contains the following fields: Phenotype - Name of Phenotype Set - Name of Gene Set Threshold - Best P-value Threshold PRS.R2 - Variance explained by the PRS.
If prevalence is provided, this will be adjusted for ascertainment Full.R2 - Variance explained by the full model (including the covariates). If prevalence is provided, this will be adjusted for ascertainment Null.R2 - Variance explained by the covariates. If prevalence is provided, this will be adjusted for ascertainment Prevalence - Population prevalence as indicated by the user. \"-\" if not provided. Coefficient - Regression coefficient of the model. Can provide insight of the direction of effect. P - P value of the model fit Num_SNP - Number of SNPs included in the model Empirical-P - Only provided if permutation is performed. This is the empirical p-value and should account for multiple testing and over-fitting Competitive-P - Only provided if set permutation is performed. This is the competitive p-value and should measure the enrichment of signal of the gene set Multi-Set Plot When the --multi-plot option is set, the results of the top N gene sets will be plotted. An example of the multi-set plot is: Other Figures The default behaviour of PRSet is to only plot the High-resolution plot, bar-plot and the quantile plot for the \"Base\" data. You can change this behaviour by using the --plot-set option. Log File We value reproducible research. Therefore we try our best to make replicating PRSice run easier. For every PRSice run, a log file named [Name].log is generated which contain the all the commands used for the analysis and information regarding filtering, field selected etc. This also allow users to quickly identify problems in the input dataset.","title":"PRSet"},{"location":"prset_detail/#input-data","text":"","title":"Input Data"},{"location":"prset_detail/#msigdb","text":"One simple way to obtain gene sets or pathway is through the MSigDB . After registration in here , you can download different gene sets curated by the Broad Institute. Alternatively, you can also generate your own gene sets in the GMT format: [Set A] [Gene 1] [Gene 2] ... 
[Set B] [Gene 1] [Gene 2] ... Sometimes, the MSigDB file might store the URL for the gene set in the second column: [Set A] [url for set A] [Gene 1] [Gene 2] ... PRSet can handle that properly.","title":"MSigDB"},{"location":"prset_detail/#gene-gtf","text":"As the MSigDB file does not contain the genomic boundaries of the genes within the gene set, one must also provide a GTF file. A GTF file contains the genomic boundaries of the genetic elements within the human genome, and PRSet can use the information from the GTF file to determine whether a SNP falls within a specific gene. One can download the GTF file for Human (genome build GRCh38.p7) here. PRSet will look for any regions with a feature of exon, gene, protein_coding or CDS (case sensitive). Any genomic regions without these features will be ignored. Alternatively, you can specify the features using the --feature command. Note For those who are unfamiliar, different versions of the genome might differ slightly in their coordinates. Therefore it is vital to ensure that all the files originate from the same genome build.","title":"Gene GTF"},{"location":"prset_detail/#bed-files","text":"In addition, PRSet also accepts bed file(s) as input. Important A bed file must contain a minimum of 3 columns: chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671). chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99. PRSet will read in any number of bed files (comma separated) and use the file names as the names of the gene sets. Note An annoying feature of the bed format is that its coordinates start at 0 whereas, for example, the PLINK formats start their coordinates at 1.
So do remember to -1 from the region start when you build your own bed file from scratch.","title":"Bed Files"},{"location":"prset_detail/#snp-set-files","text":"Finally, PRSet also allow SNP sets, input via the --snp-set option. Two different formats are allowed SNP list format, a file containing a single column of SNP ID. Name of the set will be the file name or can be provided using --snp-set File:Name MSigDB format: Each row represent a single SNP set with the first column containing the name of the SNP set.","title":"SNP Set Files"},{"location":"prset_detail/#clumping-in-prset","text":"In PRSice-2, clumping is performed to account for linkage disequilibrium (LD) between SNPs. However, when performing set based analysis, special care are required to perform clumping. Take the following as an example: Assume that: Light Blue fragments are the intergenic regions Dark Blue fragments are the genic regions Red fragments are the gene set regions SNPs are represented as thunder bolt, with the \"index\" SNP in clumping denoted by the green thunderbolt If we simply perform a genome wide clumping, we might remove all SNPs residing within the gene set of interest, reducing the signal: Therefore, to maximize signal within each gene set, we must perform clumping for each gene sets separately: this can be a tedious process and are prone to error. To speed up clumping, PRSice-2 adopt a \" capture the flag \" system. Each SNPs contains a flag to represent their gene set membership. If a SNP is a member for the set, it will have a flag of 1, otherwise it will have a flag of 0. For example: SNP Set A Set B Set C Set D SNP 1 1 0 1 1 SNP 2 0 0 1 1 SNP 3 1 1 0 1 If we use SNP 1 as the index SNP, then after clumping, we will have SNP Set A Set B Set C Set D SNP 1 1 0 1 1 SNP 2 0 0 0 0 SNP 3 0 1 0 0 which removes SNP 2, but will retain SNP 3. 
This allow us to achieve set based clumping by only performing a single pass genome wide clumping.","title":"Clumping in PRSet"},{"location":"prset_detail/#p-value-threshold-and-proxy-clumping","text":"","title":"P-value Threshold and Proxy Clumping"},{"location":"prset_detail/#options","text":"","title":"Options"},{"location":"prset_detail/#proxy-prset","text":"One complication in PRSet is the definition of SNP membership. The default option of PRSet is to only include SNPs that are physically within the target region. However, it is also likely for SNPs outside the region to influence functions of the set. Therefore we provide the --proxy option. Essentially, this provide a soft cutoff to SNP membership. For example, when user define --proxy 0.8 , if LD between SNP A and SNP B is more than 0.8, then SNP A will be considered to be within the same regions as SNP B and vice versa.","title":"Proxy PRSet"},{"location":"prset_detail/#p-value-thresholding","text":"By default, PRSet do not perform p-value thresholding and will simply calculate the set based PRS at P-value threshold of 1. This is because it is unclear whether the set is associated with the phenotype when the best-threshold contained only a small portion of SNPs within the gene sets. If you wish to perform p-value thresholding with PRSet, you will need to specify any of the parameters related to p-value thresholding, i.e. --interval , --lower , --upper , --fastscore or --bar-levels .","title":"P-value thresholding"},{"location":"prset_detail/#competitive-p-value-calculation","text":"A challenge in Set base analysis is to obtain a competitive p-value, which indicates the level of enrichment, as opposed to the self-contained p-value which indicates the level of association. 
To obtain a competitive p-value, PRSet can perform a permutation analysis as follows: (1) Allocate SNPs to each gene set. (2) Allocate SNPs to a background gene set: if --full-back is specified, use the whole genome as the background; if a background file is provided via the --background command, it will be used to construct the background set; otherwise, PRSet will try to use the GTF file provided via the --gtf command as the background (with feature filtering w.r.t. --feature). (3) Perform set based clumping on all sets (including the background set). (4) Obtain the p-value of association for the best threshold for each set ( \\(P_{observed}\\) ). While PRSet allows one to perform p-value thresholding on the set scores, we recommend against it as it is difficult to interpret the result. Using an extreme example, if only one SNP is included in the best threshold for a set, should we really consider this single SNP as representative of the gene set? (5) For each gene set with \\(N\\) post-clump SNPs: (a) Randomly select \\(N\\) post-clump SNPs from the background set and construct a null PRS. (b) Calculate the p-value of association of the null PRS to obtain a null P-value ( \\(P_{null}\\) ). (c) Repeat steps a-b \\(M\\) times, where \\(M\\) can be set via --set-perm. The competitive P-value is calculated as $$ \\text{Competitive-}P = \\frac{\\sum_{m=1}^{M}I(P_{null,m}\\lt P_{observed})+1}{M+1} $$ where \\(P_{null,m}\\) is the null P-value from the \\(m\\)-th permutation and \\(I(.)\\) is the indicator function.","title":"Competitive P-value Calculation"},{"location":"prset_detail/#computation-algorithm","text":"Due to the number of operations required, the set based permutation is extremely time consuming. To speed up the set based permutation, we note that in regression, $$ Y\\sim X\\beta+C+\\epsilon $$ and \\[ X\\sim Y\\beta+C+\\epsilon \\] will generate the same t-statistic for \\(X\\) in the first equation and \\(Y\\) in the second equation.
Based on this observation, we can then do the following: (1) Generate a matrix \\(A\\) containing the phenotype of interest and the covariates. (2) Decompose matrix \\(A\\). (3) For each new PRS calculated, solve \\(PRS=A\\beta+\\epsilon\\) and obtain the t-statistic. These t-statistics are then used to construct the null distribution, allowing us to obtain the competitive p-value. As we only need to do the decomposition once, this should significantly increase the speed of set based permutation. In our test, for the TOY data, with --set-perm 5000, we can speed up the set-based permutation by around 20-25%. Note With binary traits, unless --logit-perm is set, we will still perform linear regression, as we assume linear regression and logistic regression should produce similar t-statistics","title":"Computation Algorithm"},{"location":"prset_detail/#output-data","text":"","title":"Output Data"},{"location":"prset_detail/#prs-model-fit","text":"A file containing the PRS model fit across thresholds is named [Name].prsice, where [Name] is the output prefix name as specified by --out. This is stored as Name of Set, Threshold, R2, P-value, Coefficient, Standard Error, and Number of SNPs at this threshold","title":"PRS model-fit"},{"location":"prset_detail/#scores-for-each-individual","text":"A file containing the PRS for each individual at the best-fit PRS, named [Name].best, is provided. This file has the format of: FID, IID, In Regression, PRS at best threshold for Set 1, PRS at best threshold for Set 2, ... where the In Regression column indicates whether the sample contains all the required phenotypes for PRSice analysis (e.g. samples with a missing phenotype/covariate will not be included in the regression. These samples will be indicated as \"No\" under the In Regression column). If the --all option is used, a file named [Name].all.score is also generated. Please note, if the --all option is used, the PRS for each individual at all thresholds will be given.
In the event where the target sample size is large and a lot of threshold are tested, this file can be large. This is especially true when large number of gene sets were provided. Note PRSice also supports multiple phenotypes for target data. All output prefix will change to [Name].[Pheno] where [Pheno] is the name of the phenotype. For more details on the options used to implement this, see here .","title":"Scores for each individual"},{"location":"prset_detail/#summary-information","text":"Information of the best model fit of each phenotype and gene set is stored in [Name].summary. The summary file contain the following fields: Phenotype - Name of Phenotype Set - Name of Gene Set Threshold - Best P-value Threshold PRS.R2 - Variance explained by the PRS. If prevalence is provided, this will be adjusted for ascertainment Full.R2 - Variance explained by the full model (including the covariates). If prevalence is provided, this will be adjusted for ascertainment Null.R2 - Variance explained by the covariates. If prevalence is provided, this will be adjusted for ascertainment Prevalence - Population prevalence as indicated by the user. \"-\" if not provided. Coefficient - Regression coefficient of the model. Can provide insight of the direction of effect. P - P value of the model fit Num_SNP - Number of SNPs included in the model Empirical-P - Only provided if permutation is performed. This is the empirical p-value and should account for multiple testing and over-fitting Competitive-P - Only provided if set permutation is performed. This is the competitive p-value and should measure the enrichment of signal of the gene set","title":"Summary Information"},{"location":"prset_detail/#multi-set-plot","text":"When the --multi-plot option is set, the results of the top N gene sets will be plotted. 
An example of the multi-set plot is:","title":"Multi-Set Plot"},{"location":"prset_detail/#other-figures","text":"The default behaviour of PRSet is to only plot the High-resolution plot, bar-plot and the quantile plot for the \"Base\" data. You can change this behaviour by using the --plot-set option.","title":"Other Figures"},{"location":"prset_detail/#log-file","text":"We value reproducible research. Therefore we try our best to make replicating PRSice run easier. For every PRSice run, a log file named [Name].log is generated which contain the all the commands used for the analysis and information regarding filtering, field selected etc. This also allow users to quickly identify problems in the input dataset.","title":"Log File"},{"location":"quick_start/","text":"Preparation Before performing PRSice, quality control should be performed on the target samples. See here for an example. Input PRSice.R file: A wrapper for the PRSice executable and for plotting PRSice executable file: Perform all analysis except plotting Base data set: GWAS summary results, which the PRS is based on Target data set: Raw genotype data of target phenotype . Can be in the form of PLINK binary or BGEN Running PRSice In most case, assuming the PRSice executable is located in ($HOME)/PRSice/ and the working directory is ($HOME)/PRSice , you can run PRSice with the following commands: Note For window users, please use Rscript.exe instead of Rscript Important Do not copy codes to Microsoft Word. Word has a tendency to change characters from codes into special characters that cannot be recognized by the terminal Binary Traits For binary traits, the following command can be used (commands specific to binary traits are highlighted in yellow) Unix Rscript PRSice.R --dir . \\ --prsice ./PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --thread 1 \\ --stat OR \\ --binary-target T Windows Rscript.exe PRSice.R --dir . 
^ --prsice ./PRSice.exe ^ --base TOY_BASE_GWAS.assoc ^ --target TOY_TARGET_DATA ^ --thread 1 ^ --stat OR ^ --binary-target T Quantitative Traits For quantitative traits, the following can be used instead (commands specific to quantitative traits are highlighted in yellow) Unix Rscript PRSice.R --dir . \\ --prsice ./PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --thread 1 \\ --stat BETA \\ --beta \\ --binary-target F Windows Rscript.exe PRSice.R --dir . ^ --prsice ./PRSice.exe ^ --base TOY_BASE_GWAS.assoc ^ --target TOY_TARGET_DATA ^ --thread 1 ^ --stat BETA ^ --beta ^ --binary-target F Note If the type of effect ( --stat ) or data type ( --binary-target ) is not specified, PRSice will try to determine this information based on the header of the base file: When BETA (case insensitive) is found in the header and --stat was not provided, --beta will be added to the command, and if --binary-target was not provided, --binary-target F will be added to the command. When OR (case insensitive) is found in the header and --stat was not provided, --or will be added to the command, and if --binary-target was not provided, --binary-target T will be added to the command. PRSice cannot determine the type of effect / data type if the base file contains both OR and BETA. PRSice will detail all effective options in its log file. Quality Control of Target Samples Quality control can be performed on the target samples using PLINK. A good starting point is (assume ($target) is the prefix of the target binary file) Unix plink --bfile ($target) \\ --maf 0.05 \\ --mind 0.1 \\ --geno 0.1 \\ --hwe 1e-6 \\ --make-just-bim \\ --make-just-fam \\ --out ($target).qc Windows plink.exe --bfile ($target) ^ --maf 0.05 ^ --mind 0.1 ^ --geno 0.1 ^ --hwe 1e-6 ^ --make-just-bim ^ --make-just-fam ^ --out ($target).qc Then, --keep ($target).qc.fam --extract ($target).qc.bim can be added to the PRSice command to restrict the analysis to the samples and SNPs that passed quality control.
You can refer to Marees et al (2018) for a more detail guide. You can also find our PRS tutorial here .","title":"PRSice"},{"location":"quick_start/#preparation","text":"Before performing PRSice, quality control should be performed on the target samples. See here for an example.","title":"Preparation"},{"location":"quick_start/#input","text":"PRSice.R file: A wrapper for the PRSice executable and for plotting PRSice executable file: Perform all analysis except plotting Base data set: GWAS summary results, which the PRS is based on Target data set: Raw genotype data of target phenotype . Can be in the form of PLINK binary or BGEN","title":"Input"},{"location":"quick_start/#running-prsice","text":"In most case, assuming the PRSice executable is located in ($HOME)/PRSice/ and the working directory is ($HOME)/PRSice , you can run PRSice with the following commands: Note For window users, please use Rscript.exe instead of Rscript Important Do not copy codes to Microsoft Word. Word has a tendency to change characters from codes into special characters that cannot be recognized by the terminal","title":"Running PRSice"},{"location":"quick_start/#binary-traits","text":"For binary traits, the following command can be used (commands specific to binary traits are highlighted in yellow) Unix Rscript PRSice.R --dir . \\ --prsice ./PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --thread 1 \\ --stat OR \\ --binary-target T Windows Rscript.exe PRSice.R --dir . ^ --prsice ./PRSice.exe ^ --base TOY_BASE_GWAS.assoc ^ --target TOY_TARGET_DATA ^ --thread 1 ^ --stat OR ^ --binary-target T","title":"Binary Traits"},{"location":"quick_start/#quantitative-traits","text":"For quantitative traits, the following can be used instead (commands specific to quantitative traits are highlighted in yellow) Unix Rscript PRSice.R --dir . 
\\ --prsice ./PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --thread 1 \\ --stat BETA \\ --beta \\ --binary-target F Windows Rscript.exe PRSice.R --dir . ^ --prsice ./PRSice.exe ^ --base TOY_BASE_GWAS.assoc ^ --target TOY_TARGET_DATA ^ --thread 1 ^ --stat BETA ^ --beta ^ --binary-target F Note If the type of effect ( --stat ) or data type ( --binary-target ) is not specified, PRSice will try to determine this information based on the header of the base file: When BETA (case insensitive) is found in the header and --stat was not provided, --beta will be added to the command, and if --binary-target was not provided, --binary-target F will be added to the command. When OR (case insensitive) is found in the header and --stat was not provided, --or will be added to the command, and if --binary-target was not provided, --binary-target T will be added to the command. PRSice cannot determine the type of effect / data type if the base file contains both OR and BETA. PRSice will detail all effective options in its log file.","title":"Quantitative Traits"},{"location":"quick_start/#quality-control-of-target-samples","text":"Quality control can be performed on the target samples using PLINK. A good starting point is (assume ($target) is the prefix of the target binary file) Unix plink --bfile ($target) \\ --maf 0.05 \\ --mind 0.1 \\ --geno 0.1 \\ --hwe 1e-6 \\ --make-just-bim \\ --make-just-fam \\ --out ($target).qc Windows plink.exe --bfile ($target) ^ --maf 0.05 ^ --mind 0.1 ^ --geno 0.1 ^ --hwe 1e-6 ^ --make-just-bim ^ --make-just-fam ^ --out ($target).qc Then, --keep ($target).qc.fam --extract ($target).qc.bim can be added to the PRSice command to restrict the analysis to the samples and SNPs that passed quality control. You can refer to Marees et al (2018) for a more detailed guide.
You can also find our PRS tutorial here .","title":"Quality Control of Target Samples"},{"location":"quick_start_prset/","text":"Background A new feature of PRSice is the ability to perform set base/pathway based analysis. This new feature is called PRSet. Paper on PRSet currently under preparation. Important PRSet is currently under active development. Preparation PRSet is based on PRSice , with additional input requirements Input PRSice.R file : A wrapper for the PRSice binary and for plotting PRSice binary file : Perform all analysis except plotting Base data set : GWAS summary results, which the PRS is based on Target data set : Raw genotype data of \"target phenotype\". Can be in the form of PLINK binary or BGEN PRSet Specific Input Bed file(s) : Bed file(s) containing region of genes within a gene set; or MSigDB file : File containing name of each gene sets and the ID of genes within the gene set on each individual line. If MSigDB is provided, GTF file is required. GTF file : A file contain the genome boundary of each individual gene SNP file : A file containing SNPs constituting the gene set of interest. Can be in MSigDB (gmt) format or a file contain a single column of SNP IDs. Running PRSet In most case, assuming the PRSice binary is located in ($HOME)/PRSice/bin/ and the working directory is ($HOME)/PRSice , you can run PRSet with the following commands: With MSigDB data Assuming a MSigDB file ( set.txt ) is downloaded and a gene gtf file (gene.gtf) from Ensemble is available, PRSet can then be performed using: Rscript PRSice.R \\ --prsice ./bin/PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --binary-target T \\ --thread 1 \\ --gtf gene.gtf \\ --msigdb set.txt \\ --multi-plot 10 This will perform PRSet analysis and generate the multi-set plot with the top 10 gene sets With Bed Files Alternatively, if a list of bed files are available, e.g. 
A.bed,B.bed , PRSet can be performed by running Rscript PRSice.R \\ --prsice ./bin/PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --binary-target T \\ --thread 1 \\ --bed A.bed:SetA,B.bed \\ --multi-plot 10 Note Both bed and GTF+MSigDB input can be used together Tips Name of the set will be the bed file name or can be provided using --bed File:Name With SNP Set Finally, if you want to construct sets based on a list of SNPs, you can use --snp-set : Rscript PRSice.R \\ --prsice ./bin/PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --binary-target T \\ --thread 1 \\ --snp-set A.snp:A,B.snp \\ --multi-plot 10 Two different format are allowed: SNP Set list format: A file containing a single column of SNP ID. Name of the set will be the file name or can be provided using --snp-set File:Name MSigDB format: Each row represent a single SNP set with the first column containing the name of the SNP set.","title":"PRSet"},{"location":"quick_start_prset/#background","text":"A new feature of PRSice is the ability to perform set base/pathway based analysis. This new feature is called PRSet. Paper on PRSet currently under preparation. Important PRSet is currently under active development.","title":"Background"},{"location":"quick_start_prset/#preparation","text":"PRSet is based on PRSice , with additional input requirements","title":"Preparation"},{"location":"quick_start_prset/#input","text":"PRSice.R file : A wrapper for the PRSice binary and for plotting PRSice binary file : Perform all analysis except plotting Base data set : GWAS summary results, which the PRS is based on Target data set : Raw genotype data of \"target phenotype\". 
Can be in the form of PLINK binary or BGEN","title":"Input"},{"location":"quick_start_prset/#prset-specific-input","text":"Bed file(s) : Bed file(s) containing region of genes within a gene set; or MSigDB file : File containing name of each gene sets and the ID of genes within the gene set on each individual line. If MSigDB is provided, GTF file is required. GTF file : A file contain the genome boundary of each individual gene SNP file : A file containing SNPs constituting the gene set of interest. Can be in MSigDB (gmt) format or a file contain a single column of SNP IDs.","title":"PRSet Specific Input"},{"location":"quick_start_prset/#running-prset","text":"In most case, assuming the PRSice binary is located in ($HOME)/PRSice/bin/ and the working directory is ($HOME)/PRSice , you can run PRSet with the following commands:","title":"Running PRSet"},{"location":"quick_start_prset/#with-msigdb-data","text":"Assuming a MSigDB file ( set.txt ) is downloaded and a gene gtf file (gene.gtf) from Ensemble is available, PRSet can then be performed using: Rscript PRSice.R \\ --prsice ./bin/PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --binary-target T \\ --thread 1 \\ --gtf gene.gtf \\ --msigdb set.txt \\ --multi-plot 10 This will perform PRSet analysis and generate the multi-set plot with the top 10 gene sets","title":"With MSigDB data"},{"location":"quick_start_prset/#with-bed-files","text":"Alternatively, if a list of bed files are available, e.g. 
A.bed,B.bed , PRSet can be performed by running Rscript PRSice.R \\ --prsice ./bin/PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --binary-target T \\ --thread 1 \\ --bed A.bed:SetA,B.bed \\ --multi-plot 10 Note Both bed and GTF+MSigDB input can be used together Tips Name of the set will be the bed file name or can be provided using --bed File:Name","title":"With Bed Files"},{"location":"quick_start_prset/#with-snp-set","text":"Finally, if you want to construct sets based on a list of SNPs, you can use --snp-set : Rscript PRSice.R \\ --prsice ./bin/PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --binary-target T \\ --thread 1 \\ --snp-set A.snp:A,B.snp \\ --multi-plot 10 Two different format are allowed: SNP Set list format: A file containing a single column of SNP ID. Name of the set will be the file name or can be provided using --snp-set File:Name MSigDB format: Each row represent a single SNP set with the first column containing the name of the SNP set.","title":"With SNP Set"},{"location":"resources/","text":"Introduction PRSice relies on a number of open source projects to achieve the current performance. We also used algorithm found in other projects and translate them into C++ code for our own use. Below are number of projects we relies on Open source projects Project Developer(s) Description PLINK 2 Christopher Chang Provide the backbone of the clumping algorithm and PRS calculation BGEN lib Gavin Band Provide API to handle BGEN files. 
Slight modifications were made to accommodate PRSice's usage Eigen C++ Ga\u00ebl Guennebaud and Beno\u00eet Jacob and others For all matrix algebra gzstream Deepak Bandyopadhyay and Lutz Kettner For reading gz files fastglm Jared Huling, Douglas Bates, Dirk Eddelbuettel, Romain Francois and Yixuan Qiu Basis of our glm class RcppEigen Douglas Bates, Dirk Eddelbuettel, Romain Francois, and Yixuan Qiu Provide the fastlm algorithm","title":"Useful Resources"},{"location":"resources/#introduction","text":"PRSice relies on a number of open source projects to achieve the current performance. We also translated algorithms found in other projects into C++ code for our own use. Below are the projects we rely on","title":"Introduction"},{"location":"resources/#open-source-projects","text":"Project Developer(s) Description PLINK 2 Christopher Chang Provide the backbone of the clumping algorithm and PRS calculation BGEN lib Gavin Band Provide API to handle BGEN files. Slight modifications were made to accommodate PRSice's usage Eigen C++ Ga\u00ebl Guennebaud and Beno\u00eet Jacob and others For all matrix algebra gzstream Deepak Bandyopadhyay and Lutz Kettner For reading gz files fastglm Jared Huling, Douglas Bates, Dirk Eddelbuettel, Romain Francois and Yixuan Qiu Basis of our glm class RcppEigen Douglas Bates, Dirk Eddelbuettel, Romain Francois, and Yixuan Qiu Provide the fastlm algorithm","title":"Open source projects"},{"location":"step_by_step/","text":"Background You will need a basic understanding of Genome Wide Association Studies (GWAS) in order to perform Polygenic risk score (PRS) analyses. If you are unfamiliar with GWAS, you can consider reading this paper . Input Data Here, we briefly discuss the different input files required by PRSice: Base Dataset Base (i.e. GWAS) data must be provided as a whitespace delimited file containing association analysis results for SNPs on the base phenotype.
PRSice has no problem reading in a gzipped base file (need to have a .gz suffix). If PLINK output is used, then please make sure there is a column for the effective allele (A1) and specify it with --A1 option. If your base data follows other formats, then the column headers can be provided using the --chr , --A1 , --A2 , --stat , --snp , --bp , --pvalue options Important PRSice requires the base file to contain information of the effective allele ( --A1 ), effect size estimates ( --stat ), p-value for association ( --pvalue ), and the SNP ID ( --snp ). If the input file does not contain a column header, the column can be specified using their index (start counting from 0) with the --index flag. For example, with the following input format: SNP CHR BP A1 A2 OR SE P rs3094315 1 752566 A G 0.9912 0.0229 0.7009 rs3131972 1 752721 A G 1.007 0.0228 0.769 rs3131971 1 752894 T C 1.003 0.0232 0.8962 the parameters can either be --snp SNP --chr CHR --bp BP --A1 A1 --A2 A2 --stat OR --pvalue P or --snp 0 --chr 1 --bp 2 --A1 3 --A2 4 --stat 5 --pvalue 7 --index Strand flips are automatically detected and accounted for. If an imputation info score or the minor allele frequencies (MAF) are also included in the file, --base-info : and --base-maf : can be used to filter SNPs based on their INFO score and MAF respectively. For binary trait base file, SNPs can be filtered according to the MAF in case and control separately using --base-maf :,: By default, PRSice will look for the following column names automatically from the base file header if --index was not provided or if the column name of the specific arguement(s) were not provided: CHR, BP, A1, A2, SNP, P, INFO (case sensitive) and OR / BETA (case insensitive) --no-default can be used to disable all the defaults of PRSice. Note PRSice will ignore any columns that were not found in the base file (e.g. 
If --A2 B is specified but none of the column header is B , then PRSice will treat it as if no A2 information is presented) Target Dataset Currently two different target file format is supported by PRSice: PLINK Binary A target dataset in PLINK binary format must consist of three files: .bed , .bim , and a .fam file - where bed contains the compressed genotype data, bim contains the SNP information and fam contains the family information. Currently only SNP major PLINK format are supported (default output of the latest PLINK program). The .bed and .bim file must have the same prefix. If the .fam file follow a different prefix from the .bed and bim file, it can be specified using --target , Warning The fam file MUST contains the correct number of samples or PRSice will crash Missing phenotype data can be coded as NA, or -9 for binary traits and NA for quantitative traits. Note -9 will NOT be considered as missing for quantitative traits If the binary file is separated into individual chromosomes, then an # can be used to specify the location of the chromosome number in the file name. PRSice will automatically substitute # with 1-22 i.e. If the files are chr1. ,chr2. ,...,chr22. , just use --target chr# Note Chromosome number substitution will not be performed on the external fam file as the fam file should be the same for all chromosomes. Alternatively, if your PLINK files do not have a unified prefix, you can use --target-list to provide a file containing all prefix to PRSice. Note .pgen files are not currently supported BGEN PRSice currently support BGEN v1.1 and v1.2. To specify a BGEN file, simply add the --type bgen or --ld-type bgen to the PRSice command Note In theory, we can support BGEN v1.3, but that will require us to include zstd library, developed by facebook. You can enable the support by including the zstd library and changing the bgen_lib files. 
As BGEN does not store the phenotype information and sometimes not even the sample ID, you must provide a phenotype file ( --pheno ). Alternatively, if you have a sample file containing the phenotype information, you can provide it with --target , Note The sample file is required even if --no-regress is set as the sample ID is required for output. This requirement might be loosened in future versions With BGEN input, a number of PRSice options become available: --hard : Normally, with BGEN format, PRS is calculated using the dosage information. But hard-thresholding can be performed by using the --hard option. SNPs will then be coded as the genotype (0, 1 or 2) and filtered according to the threshold set by --hard-thres . If no such genotype is present, the SNP will be coded as missing --hard-thres : A hardcall is saved when the distance to the nearest hardcall is less than the hardcall threshold. See here for more detail. To perform clumping on a BGEN file, we need to repeatedly decompress the genotype dosages and convert them into PLINK binary format. To speed up the clumping process, you can allow PRSice to generate a large intermediate file, containing the hard coded genotypes in PLINK binary format, by using the --allow-inter option. Phenotype files An external phenotype file can be provided to PRSice using the --pheno parameter. This must be a tab / space delimited file and missing data must be represented by either NA or -9 (only for binary traits). The first two columns of the phenotype file should be the FID and the IID, or when --ignore-fid is set, the first column should be the IID. The rest of the columns can be the phenotype(s). To specify a trait within the phenotype file, the column name for the trait can be specified using --pheno-col , provided that the phenotype file contains a header. Multiple column names can be provided via a comma separated list: e.g. --pheno-col A,B,C,D . Trait(s) not found within the phenotype file will be automatically skipped.
Important The column name(s) should not contain spaces or commas Note When more than one trait is provided, the column name will be appended to the output prefix. LD reference When the target sample is small (e.g. < 500 samples), an external reference panel can be used to improve the LD estimation for clumping. The LD reference follows the same conventions as the target dataset. Simply use --ld to specify your LD reference panel file and --ld-type to specify the format When an LD reference file is not provided and --no-clump is not specified, the target file will be used as the LD reference panel Important Any parameters with the --ld prefix will only work on the file specified by the --ld parameter. That is, if an LD reference file is not provided, none of the --ld-* options will be used. If a different set of filters is to be applied to the target file when performing the LD calculation, it must be provided separately to the --ld parameter e.g. --target --ld --keep --ld-keep Note BGEN files will always be hard coded when used to estimate the LD Clumping By default, PRSice will perform clumping to remove SNPs that are in LD with each other. Similar to PLINK, the \\(r^2\\) values computed by PRSice are based on maximum likelihood haplotype frequency estimates. Both cases and controls are included in the LD calculation. Alternatively, a combination of --ld and --ld-keep / --ld-remove can be used to restrict the LD calculation to control samples. Clumping parameters can be changed by using the --clump-kb , --clump-r2 and --clump-p options. Clumping can be disabled using --no-clump PRS calculation PRSice allows different genetic models to be specified (e.g.
add, dom, het, rec), and the polygenic score is calculated differently for each Assuming \\(S\\) is the summary statistic for the effective allele and \\(G\\) is the number of effective alleles observed, then the main difference between the models is how the genotypes are coded: For the additive model (add) \\[ G = G \\] For the dominant model (with respect to the effective allele of the base file) \\[ G = \\begin{cases} 0 & \\text{if $G$ = 0} \\\\ 1 & \\text{otherwise} \\end{cases} \\] For the recessive model (with respect to the effective allele of the base file) \\[ G = \\begin{cases} 1 & \\text{if $G$ = 2} \\\\ 0 & \\text{otherwise} \\end{cases} \\] For the heterozygous model \\[ G = \\begin{cases} 1 & \\text{if $G$ = 1} \\\\ 0 & \\text{otherwise} \\end{cases} \\] Then depending on the --score option, the PRS is calculated as (assuming \\(M_j\\) is the number of alleles included in the PRS of the \\(j^{th}\\) individual) --score avg (default): $$ PRS_j = \\sum_i{\\frac{S_i\\times G_{ij}}{M_j}} $$ --score sum : $$ PRS_j = \\sum_i{S_i\\times G_{ij}} $$ --score std : $$ PRS_j = \\frac{\\sum_i({S_i\\times G_{ij}}) - \\text{Mean}(PRS)}{\\text{SD}(PRS)} $$ --score con-std : $$ PRS_j = \\frac{\\sum_i({S_i\\times G_{ij}}) - \\text{Mean}(\\text{PRS in control})}{\\text{SD}(\\text{PRS in control})} $$ Sometimes, samples can have missing genotypes. The --missing option is used to determine how PRSice handles the missingness. When not specified, the Minor Allele Frequency (MAF) in the target sample will be used as the genotype for samples with a missing genotype. If --missing SET_ZERO is set, the SNP will be excluded for the samples in which it is missing. Alternatively, if --missing CENTER is set, the MAF of the SNP will be subtracted from all calculated PRS (therefore, samples with a missing genotype will have a PRS of 0). Note Missingness imputation is usually based on the target samples.
If you would like to impute the missingness using the reference sample, you can use the --use-ref-maf parameter so that all MAFs are calculated using the reference samples. Empirical P-value calculation All approaches to PRS calculation involve parameter optimisation and are therefore overfitted. There are a few methods to account for the overfitting: Evaluate performance in an independent validation sample Cross validation Calculate an empirical P-value In PRSice-2, we have implemented a permutation procedure to calculate the empirical P-value. Permutation Procedure To calculate the empirical P-value, PRSice-2 performs the following Perform standard PRSice analysis Obtain the p-value of association of the best p-value threshold ( \\(P_o\\) ) Randomly shuffle the phenotype and repeat the PRSice analysis Obtain the p-value of association of the best p-value threshold under the null ( \\(P_{null}\\) ) Repeat step 2 \\(N=10,000\\) times (for --perm 10000 ) The empirical p-value can then be calculated as \\[ \\text{Empirical-}P = \\frac{\\sum_{n=1}^NI(P_{null}\\lt P_o)+1}{N+1} \\] where \\(I(.)\\) is the indicator function. Warning While the empirical p-value for association will be controlled for Type 1 error, the observed phenotypic variance explained, \\(R^2\\) , remains unadjusted and is affected by overfitting. Therefore, it is imperative to perform out-of-sample prediction or cross-validation to evaluate the predictive accuracy of PRS. Computation Algorithm In reality, PRSice-2 exploits a property of random number generation to speed up the permutation analysis. To generate random numbers, a random seed is required. When the same seed is provided, the same sequence of random numbers will always be generated.
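As an illustration only, the seeding property and the empirical p-value formula can be sketched in Python. This is not the PRSice-2 implementation (which is written in C++); the function names shuffled_phenotypes and empirical_p, the seed 1234 and the toy p-values are all hypothetical.

```python
import random


def shuffled_phenotypes(phenotype, seed, n_perm):
    # Hypothetical helper: re-seeding with the same value replays
    # exactly the same sequence of shuffles.
    rng = random.Random(seed)
    copies = []
    for _ in range(n_perm):
        permuted = list(phenotype)
        rng.shuffle(permuted)
        copies.append(permuted)
    return copies


def empirical_p(observed_p, null_p_values):
    # (number of null p-values smaller than the observed one + 1) / (N + 1)
    n_smaller = sum(1 for p in null_p_values if p < observed_p)
    return (n_smaller + 1) / (len(null_p_values) + 1)


# Toy phenotype vector; the same seed yields identical null phenotypes.
phenotype = [0, 1, 1, 0, 1, 0, 0, 1]
first = shuffled_phenotypes(phenotype, seed=1234, n_perm=5)
second = shuffled_phenotypes(phenotype, seed=1234, n_perm=5)
same_shuffles = first == second

# One of the four toy null p-values is below 0.01, so (1 + 1) / (4 + 1).
example = empirical_p(0.01, [0.2, 0.005, 0.5, 0.03])
```

Because the shuffles depend only on the seed, the same null phenotypes can be regenerated on demand instead of being stored, which is the property the permutation analysis relies on.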
PRSice-2 exploits this property, such that the permutation analysis is performed as follows Generate the random seed or set the random seed to the user provided random seed ( \\(S\\) ) For each p-value threshold Calculate the observed p-value Seed the random number generator with \\(S\\) For quantitative traits (and binary traits, unless --logit-perm is set), decompose the matrix of the independent variables ( \\(Intercept+PRS+Covariates\\) ) Generate N copies of random phenotypes via random shuffling. Calculate the p-value of association for each null phenotype For each permutation, check if the current null p-value is the most significant. Replace the previous \"best\" p-value if the current null p-value is more significant Calculate the empirical p-value once all p-value thresholds have been processed As we re-seed the random number generator for each p-value threshold, we ensure the random phenotypes generated at each p-value threshold are identical, allowing us to reuse the calculated PRS and the decomposed matrix, which leads to a significant speed up of the permutation process. Note With binary traits, unless --logit-perm is set, we will still perform linear regression as we assume linear regression and logistic regression should produce similar t-statistics Output of Results Bar Plot Note Hereon, [Name] is assumed to be the output prefix specified using --out and [date] is the date when the analysis was performed. PRSice will always generate a bar plot displaying the model fit of the PRS at the P-value thresholds indicated by --bar-levels The plot will be named [Name]_BARPLOT_[date].png . An example bar plot: High Resolution Plot If --fastscore is not specified, a high-resolution plot named [Name]_HIGH-RES_PLOT_[date].png will be generated. This plot presents the model fit of PRS calculated at all P-value thresholds.
Important The model fit is defined as the \\(R^2\\) of the Full model - the \\(R^2\\) of the Null model For example, if Sex is a covariate in the PRSice calculation, then model fit = \\(R^2\\) of \\(Pheno\\sim PRS+Sex\\) - \\(R^2\\) of \\(Pheno\\sim Sex\\) A green line connects points showing the model fit at the broad P-value thresholds used in the corresponding bar plot are also added. An example high-resolution plot: Quantile Plots If --quantile [number of quantile] is specified, a quantile plot named [Name]_QUANTILE_PLOT_[date].png will be generated. The quantile plot provide an illustration of the effect of increasing PRS on predicted risk of phenotype. An example quantile plot: Specifically, the quantile plot is generated by the following steps Distribute samples into user specified number of quantiles based on their PRS Treat the quantiles as a factor, where the --quant-ref is the base factor Perform regression with \\(Pheno \\sim Quantile + Covariates\\) (use logistic regression if phenotype is binary, and linear regression otherwise) Set the reference quantile to have coefficient of 1 (if binary) or 0 (otherwise) The point of each quantile is their OR (if binary) or coefficient (otherwise) from the regression analysis A text file [Name]_QUANTILE\\_[date].txt is also produced, which provides all the data used for the plotting. Moreover, uneven distribution of quantiles can be specified using the --quant-break function, which will generate the strata plot. 
For example, to replicate the quantile break from Natarajan et al (2015): Percentile of PRS, % All studies in iCOGS excluding pKARMA OR (95% CI) pKARMA only OR (95% CI) <1 0.29 (0.23 to 0.37) 0.48 (0.28 to 0.83) >1\u20135 0.42 (0.37 to 0.47) 0.48 (0.36 to 0.63) 5\u201310 0.55 (0.50 to 0.61) 0.58 (0.45 to 0.74) 10\u201320 0.65 (0.60 to 0.70) 0.68 (0.57 to 0.81) 20\u201340 0.80 (0.76 to 0.85) 0.81 (0.71 to 0.94) 40\u201360 1 (referent) 1 (referent) 60\u201380 1.18 (1.12 to 1.24) 1.35 (1.19 to 1.54) 80\u201390 1.48 (1.39 to 1.57) 1.56 (1.34 to 1.82) 90\u201395 1.69 (1.56 to 1.82) 2.05 (1.70 to 2.47) 95\u201399 2.20 (2.03 to 2.38) 2.12 (1.73 to 2.59) >99 2.81 (2.43 to 3.24) 3.06 (2.16 to 4.34) The following command can be added to PRSice command: --quantile 100 \\ --quant-break 1,5,10,20,40,60,80,90,95,99,100 \\ --quant-ref 60 Specifically, --quant-break indicates the upper bound of each group and --quant-ref specify the upper bound of the reference quantiles Note The quantile boundaries are non-overlapping, with the inclusive upper bound and exclusive lower bound Note Usually, you will need --quantile 100 together with --quant-break PRS model-fit A file containing the PRS model fit across thresholds is named [Name].prsice ; this is stored as Set, Threshold, \\(R^2\\) , P-value, Coefficient, Standard Deviation and Number of SNPs at this threshold Important \\(R^2\\) reported in the prsice file is the \\(R^2\\) of the Full model - the \\(R^2\\) of the Null model Scores for each individual A file containing PRS for each individual at the best-fit PRS named [Name].best is provide. This file has the format of: FID,IID,In_Regression, PRS at best threshold of first set, PRS at best threshold of second set, ... Where the In_Regression column indicate whether the sample is included in the regression model performed by PRSice. 
If --all-score option is used, a file named [Name].all.score is also generated This file has the format of FID, IID, PRS for first set at first threshold, PRS for first set at second threshold, ... If --all-score is used, the PRS for each individual at all threshold and all sets will be given. In the event where the target sample size is large and a lot of threshold are tested, this file can be large. Summary Information Information of the best model fit of each phenotype and gene set is stored in [Name].summary . The summary file contain the following fields: Phenotype - Name of Phenotype Set - Name of Gene Set Threshold - Best P-value Threshold PRS.R2 - Variance explained by the PRS. If prevalence is provided, this will be adjusted for ascertainment Full.R2 - Variance explained by the full model (including the covariates). If prevalence is provided, this will be adjusted for ascertainment Null.R2 - Variance explained by the covariates. If prevalence is provided, this will be adjusted for ascertainment Prevalence - Population prevalence as indicated by the user. \"-\" if not provided. Coefficient - Regression coefficient of the model. Can provide insight of the direction of effect. P - P value of the model fit Num_SNP - Number of SNPs included in the model Empirical-P - Only provided if permutation is performed. This is the empirical p-value and should account for multiple testing and over-fitting Only one summary file will be generated for each PRSice run (disregarding the number of target phenotype used) Log File To allow for easy replication, a log file named [Name].log is generated for each PRSice run, which contain the all the commands used for the analysis and information regarding filtering, field selected etc. 
This also allow easy identification of problems and should always be included in the bug report.","title":"PRSice"},{"location":"step_by_step/#background","text":"You will need to have basic understanding of Genome Wide Association Studies (GWAS) in order to be able to perform Polygenic risk score (PRS) analyses. If you are unfamiliar with GWAS, you can consider reading this paper .","title":"Background"},{"location":"step_by_step/#input-data","text":"Here, we briefly discuss different input files required by PRSice:","title":"Input Data"},{"location":"step_by_step/#base-dataset","text":"Base (i.e. GWAS) data must be provided as a whitespace delimited file containing association analysis results for SNPs on the base phenotype. PRSice has no problem reading in a gzipped base file (need to have a .gz suffix). If PLINK output is used, then please make sure there is a column for the effective allele (A1) and specify it with --A1 option. If your base data follows other formats, then the column headers can be provided using the --chr , --A1 , --A2 , --stat , --snp , --bp , --pvalue options Important PRSice requires the base file to contain information of the effective allele ( --A1 ), effect size estimates ( --stat ), p-value for association ( --pvalue ), and the SNP ID ( --snp ). If the input file does not contain a column header, the column can be specified using their index (start counting from 0) with the --index flag. For example, with the following input format: SNP CHR BP A1 A2 OR SE P rs3094315 1 752566 A G 0.9912 0.0229 0.7009 rs3131972 1 752721 A G 1.007 0.0228 0.769 rs3131971 1 752894 T C 1.003 0.0232 0.8962 the parameters can either be --snp SNP --chr CHR --bp BP --A1 A1 --A2 A2 --stat OR --pvalue P or --snp 0 --chr 1 --bp 2 --A1 3 --A2 4 --stat 5 --pvalue 7 --index Strand flips are automatically detected and accounted for. 
If an imputation info score or the minor allele frequencies (MAF) are also included in the file, --base-info : and --base-maf : can be used to filter SNPs based on their INFO score and MAF respectively. For binary trait base file, SNPs can be filtered according to the MAF in case and control separately using --base-maf :,: By default, PRSice will look for the following column names automatically from the base file header if --index was not provided or if the column name of the specific arguement(s) were not provided: CHR, BP, A1, A2, SNP, P, INFO (case sensitive) and OR / BETA (case insensitive) --no-default can be used to disable all the defaults of PRSice. Note PRSice will ignore any columns that were not found in the base file (e.g. If --A2 B is specified but none of the column header is B , then PRSice will treat it as if no A2 information is presented)","title":"Base Dataset"},{"location":"step_by_step/#target-dataset","text":"Currently two different target file format is supported by PRSice:","title":"Target Dataset"},{"location":"step_by_step/#plink-binary","text":"A target dataset in PLINK binary format must consist of three files: .bed , .bim , and a .fam file - where bed contains the compressed genotype data, bim contains the SNP information and fam contains the family information. Currently only SNP major PLINK format are supported (default output of the latest PLINK program). The .bed and .bim file must have the same prefix. If the .fam file follow a different prefix from the .bed and bim file, it can be specified using --target , Warning The fam file MUST contains the correct number of samples or PRSice will crash Missing phenotype data can be coded as NA, or -9 for binary traits and NA for quantitative traits. Note -9 will NOT be considered as missing for quantitative traits If the binary file is separated into individual chromosomes, then an # can be used to specify the location of the chromosome number in the file name. 
PRSice will automatically substitute # with 1-22 i.e. If the files are chr1. ,chr2. ,...,chr22. , just use --target chr# Note Chromosome number substitution will not be performed on the external fam file as the fam file should be the same for all chromosomes. Alternatively, if your PLINK files do not have a unified prefix, you can use --target-list to provide a file containing all prefixes to PRSice. Note .pgen files are not currently supported","title":"PLINK Binary"},{"location":"step_by_step/#bgen","text":"PRSice currently supports BGEN v1.1 and v1.2. To specify a BGEN file, simply add --type bgen or --ld-type bgen to the PRSice command Note In theory, we can support BGEN v1.3, but that will require us to include the zstd library, developed by Facebook. You can enable the support by including the zstd library and changing the bgen_lib files. As BGEN does not store the phenotype information and sometimes not even the sample ID, you must provide a phenotype file ( --pheno ). Alternatively, if you have a sample file containing the phenotype information, you can provide it with --target , Note The sample file is required even if --no-regress is set as the sample ID is required for output. This requirement might be loosened in future versions With BGEN input, a number of PRSice options become available: --hard : Normally, with BGEN format, PRS is calculated using the dosage information. But hard-thresholding can be performed by using the --hard option. SNPs will then be coded as the genotype (0, 1 or 2) and filtered according to the threshold set by --hard-thres . If no such genotype is present, the SNP will be coded as missing --hard-thres : A hardcall is saved when the distance to the nearest hardcall is less than the hardcall threshold. See here for more detail. To perform clumping on a BGEN file, we need to repeatedly decompress the genotype dosages and convert them into PLINK binary format.
To speed up the clumping process, you can allow PRSice to generate a large intermediate file, containing the hard coded genotypes in PLINK binary format by using the --allow-inter option.","title":"BGEN"},{"location":"step_by_step/#phenotype-files","text":"An external phenotype file can be provided to PRSice using the --pheno parameter. This must be a tab / space delimited file and missing data must be represented by either NA or -9 (only for binary traits). The first two columns of the phenotype file should be the FID and the IID, or when --ignore-fid is set, the first column should be the IID. The rest of the columns can be the phenotype(s). To specify a trait within the phenotype file, the column name for the trait can be specified using --pheno-col , providing that the phenotype file contains a header. Multiple column names can be provided via a comma separated list: e.g. --pheno-col A,B,C,D . Trait(s) not found within the phenotype file will be automatically skipped. Important The column name(s) should not contain spaces or commas. Note When more than one trait is provided, the column name will be appended to the output prefix.","title":"Phenotype files"},{"location":"step_by_step/#ld-reference","text":"When the target sample is small (e.g. < 500 samples), an external reference panel can be used to improve the LD estimation for clumping. The LD reference follows the same conventions as the target dataset. Simply use --ld to specify your LD reference panel file and --ld-type to specify the format. When an LD reference file is not provided and --no-clump is not specified, the target file will be used as the LD reference panel. Important Any parameters with the --ld prefix will only work on the file specified by the --ld parameter. That is, if an LD reference file is not provided, none of the --ld-* options will be used.
If a different set of filtering is to be performed on the target file when performing LD calculation, it must be provided separately to the --ld parameter e.g. --target --ld --keep --ld-keep Note BGEN file will always be hard coded when used to estimate the LD","title":"LD reference"},{"location":"step_by_step/#clumping","text":"By default, PRSice will perform clumping to remove SNPs that are in LD with each other. Similar to PLINK, the \\(r^2\\) values computed by PRSice are based on maximum likelihood haplotype frequency estimates. Both cases and controls are included in the LD calculation. Alternatively, a combination of --ld and --ld-keep / --ld-remove can be used to restrict LD calculation to control samples. Clumping parameters can be changed by using the --clump-kb , --clump-r2 and --clump-p options. Clumping can be disabled using --no-clump","title":"Clumping"},{"location":"step_by_step/#prs-calculation","text":"PRSice allows different genetic models to be specified (e.g. add, dom, het, rec), and the polygenic score of each of those is calculated differently. Assuming \\(S\\) is the summary statistic for the effective allele and \\(G\\) is the number of the effective allele observed, then the main difference between the models is how the genotypes are coded: For additive model (add) \\[ G = G \\] For dominant model (with respect to the effective allele of the base file) \\[ G = \\begin{cases} 0 & \\text{if $G$ = 0} \\\\ 1 & \\text{otherwise} \\end{cases} \\] For recessive model (with respect to the effective allele of the base file) \\[ G = \\begin{cases} 1 & \\text{if $G$ = 2} \\\\ 0 & \\text{otherwise} \\end{cases} \\] For heterozygous model \\[ G = \\begin{cases} 1 & \\text{if $G$ = 1} \\\\ 0 & \\text{otherwise} \\end{cases} \\] Then depending on the --score option, the PRS is calculated as (assuming \\(M_j\\) is the number of alleles included in the PRS of the \\(j^{th}\\) individual) --score avg (default): $$ PRS_j = \\sum_i{\\frac{S_i\\times G_{ij}}{M_j}} $$
--score sum : $$ PRS_j = \\sum_i{S_i\\times G_{ij}} $$ --score std : $$ PRS_j = \\frac{\\sum_i({S_i\\times G_{ij}}) - \\text{Mean}(PRS)}{\\text{SD}(PRS)} $$ --score con-std : $$ PRS_j = \\frac{\\sum_i({S_i\\times G_{ij}}) - \\text{Mean}(\\text{PRS in control})}{\\text{SD}(\\text{PRS in control})} $$ Sometimes, samples can have missing genotypes. The --missing option is used to determine how PRSice handles the missingness. When not specified, the minor allele frequency (MAF) in the target sample will be used as the genotype for samples with missing genotypes. If --missing SET_ZERO is set, the SNP for the missing samples will be excluded. Alternatively, if --missing CENTER is set, the MAF of the SNP will be subtracted from all calculated PRS (therefore, missing samples will have PRS of 0). Note Missingness imputation is usually based on the target samples. If you would like to impute the missingness using the reference sample, you can use the --use-ref-maf parameter to specify that all MAF are to be calculated using the reference samples.","title":"PRS calculation"},{"location":"step_by_step/#empirical-p-value-calculation","text":"All approaches to PRS calculation involve parameter optimisation and are therefore overfitted.
There are a few methods to account for the overfitting: Evaluate performance in an independent validation sample; Cross validation; Calculate an empirical P-value. In PRSice-2, we have implemented a permutation procedure to calculate the empirical P-value.","title":"Empirical P-value calculation"},{"location":"step_by_step/#permutation-procedure","text":"To calculate the empirical P-value, PRSice-2 performs the following steps: Perform standard PRSice analysis Obtain the p-value of association of the best p-value threshold ( \\(P_o\\) ) Randomly shuffle the phenotype and repeat the PRSice analysis Obtain the p-value of association of the best p-value threshold under the null ( \\(P_{null}\\) ) Repeat the shuffling and re-analysis \\(N=10,000\\) times (for --perm 10000 ) The empirical p-value can then be calculated as \\[ \\text{Empirical-}P = \\frac{\\sum_{n=1}^NI(P_{null}\\lt P_o)+1}{N+1} \\] where \\(I(.)\\) is the indicator function. Warning While the empirical p-value for association will be controlled for Type 1 error, the observed phenotypic variance explained, \\(R^2\\) , remains unadjusted and is affected by overfitting. Therefore, it is imperative to perform out-of-sample prediction, or cross-validation, to evaluate the predictive accuracy of PRS.","title":"Permutation Procedure"},{"location":"step_by_step/#computation-algorithm","text":"In reality, PRSice-2 exploits a certain property of random number generation to speed up the permutation analysis. To generate random numbers, a random seed is required. When the same seed is provided, the same sequence of random numbers will always be generated.
PRSice-2 exploits this property, such that the permutation analysis is performed as follows: Generate the random seed or set the random seed to the user provided random seed ( \\(S\\) ) For each p-value threshold Calculate the observed p-value Seed the random number generator with \\(S\\) For quantitative traits (and binary traits, unless --logit-perm is set), decompose the matrix of the independent variables ( \\(Intercept+PRS+Covariates\\) ) Generate N copies of random phenotypes via random shuffling. Calculate the p-value of association for each null phenotype For each permutation, check if the current null p-value is the most significant. Replace the previous \"best\" p-value if the current null p-value is more significant Calculate the empirical p-value once all p-value thresholds have been processed. As we re-seed the random number generator for each p-value threshold, we ensure the random phenotypes generated at each p-value threshold are identical, allowing us to reuse the calculated PRS and the decomposed matrix, which leads to a significant speed up of the permutation process. Note With binary traits, unless --logit-perm is set, we will still perform linear regression as we assume linear regression and logistic regression should produce similar t-statistics","title":"Computation Algorithm"},{"location":"step_by_step/#output-of-results","text":"","title":"Output of Results"},{"location":"step_by_step/#bar-plot","text":"Note Hereon, [Name] is assumed to be the output prefix specified using --out and [date] is the date when the analysis was performed. PRSice will always generate a bar plot displaying the model fit of the PRS at the P-value thresholds indicated by --bar-levels . The plot will be named [Name]_BARPLOT_[date].png . An example bar plot:","title":"Bar Plot"},{"location":"step_by_step/#high-resolution-plot","text":"If --fastscore is not specified, a high-resolution plot named [Name]_HIGH-RES_PLOT_[date].png will be generated.
This plot presents the model fit of PRS calculated at all P-value thresholds. Important The model fit is defined as the \\(R^2\\) of the Full model - the \\(R^2\\) of the Null model. For example, if Sex is a covariate in the PRSice calculation, then model fit = \\(R^2\\) of \\(Pheno\\sim PRS+Sex\\) - \\(R^2\\) of \\(Pheno\\sim Sex\\) A green line connecting the points that show the model fit at the broad P-value thresholds used in the corresponding bar plot is also added. An example high-resolution plot:","title":"High Resolution Plot"},{"location":"step_by_step/#quantile-plots","text":"If --quantile [number of quantiles] is specified, a quantile plot named [Name]_QUANTILE_PLOT_[date].png will be generated. The quantile plot provides an illustration of the effect of increasing PRS on predicted risk of phenotype. An example quantile plot: Specifically, the quantile plot is generated by the following steps: Distribute samples into the user specified number of quantiles based on their PRS Treat the quantiles as a factor, where the --quant-ref is the base factor Perform regression with \\(Pheno \\sim Quantile + Covariates\\) (use logistic regression if phenotype is binary, and linear regression otherwise) Set the reference quantile to have coefficient of 1 (if binary) or 0 (otherwise) The point of each quantile is its OR (if binary) or coefficient (otherwise) from the regression analysis. A text file [Name]_QUANTILE_[date].txt is also produced, which provides all the data used for the plotting. Moreover, uneven distribution of quantiles can be specified using the --quant-break function, which will generate the strata plot.
For example, to replicate the quantile breaks from Natarajan et al (2015): Percentile of PRS, % All studies in iCOGS excluding pKARMA OR (95% CI) pKARMA only OR (95% CI) <1 0.29 (0.23 to 0.37) 0.48 (0.28 to 0.83) >1\u20135 0.42 (0.37 to 0.47) 0.48 (0.36 to 0.63) 5\u201310 0.55 (0.50 to 0.61) 0.58 (0.45 to 0.74) 10\u201320 0.65 (0.60 to 0.70) 0.68 (0.57 to 0.81) 20\u201340 0.80 (0.76 to 0.85) 0.81 (0.71 to 0.94) 40\u201360 1 (referent) 1 (referent) 60\u201380 1.18 (1.12 to 1.24) 1.35 (1.19 to 1.54) 80\u201390 1.48 (1.39 to 1.57) 1.56 (1.34 to 1.82) 90\u201395 1.69 (1.56 to 1.82) 2.05 (1.70 to 2.47) 95\u201399 2.20 (2.03 to 2.38) 2.12 (1.73 to 2.59) >99 2.81 (2.43 to 3.24) 3.06 (2.16 to 4.34) The following options can be added to the PRSice command: --quantile 100 \\ --quant-break 1,5,10,20,40,60,80,90,95,99,100 \\ --quant-ref 60 Specifically, --quant-break indicates the upper bound of each group and --quant-ref specifies the upper bound of the reference quantile. Note The quantile boundaries are non-overlapping, with an inclusive upper bound and an exclusive lower bound. Note Usually, you will need --quantile 100 together with --quant-break","title":"Quantile Plots"},{"location":"step_by_step/#prs-model-fit","text":"A file containing the PRS model fit across thresholds is named [Name].prsice ; this is stored as Set, Threshold, \\(R^2\\) , P-value, Coefficient, Standard Deviation and Number of SNPs at this threshold Important \\(R^2\\) reported in the prsice file is the \\(R^2\\) of the Full model - the \\(R^2\\) of the Null model","title":"PRS model-fit"},{"location":"step_by_step/#scores-for-each-individual","text":"A file containing PRS for each individual at the best-fit PRS named [Name].best is provided. This file has the format of: FID,IID,In_Regression, PRS at best threshold of first set, PRS at best threshold of second set, ... Where the In_Regression column indicates whether the sample is included in the regression model performed by PRSice.
If the --all-score option is used, a file named [Name].all.score is also generated. This file has the format of FID, IID, PRS for first set at first threshold, PRS for first set at second threshold, ... If --all-score is used, the PRS for each individual at all thresholds and all sets will be given. In the event where the target sample size is large and a lot of thresholds are tested, this file can be large.","title":"Scores for each individual"},{"location":"step_by_step/#summary-information","text":"Information on the best model fit of each phenotype and gene set is stored in [Name].summary . The summary file contains the following fields: Phenotype - Name of Phenotype Set - Name of Gene Set Threshold - Best P-value Threshold PRS.R2 - Variance explained by the PRS. If prevalence is provided, this will be adjusted for ascertainment Full.R2 - Variance explained by the full model (including the covariates). If prevalence is provided, this will be adjusted for ascertainment Null.R2 - Variance explained by the covariates. If prevalence is provided, this will be adjusted for ascertainment Prevalence - Population prevalence as indicated by the user. \"-\" if not provided. Coefficient - Regression coefficient of the model. Can provide insight into the direction of effect. P - P value of the model fit Num_SNP - Number of SNPs included in the model Empirical-P - Only provided if permutation is performed. This is the empirical p-value and should account for multiple testing and over-fitting. Only one summary file will be generated for each PRSice run (regardless of the number of target phenotypes used)","title":"Summary Information"},{"location":"step_by_step/#log-file","text":"To allow for easy replication, a log file named [Name].log is generated for each PRSice run, which contains all the commands used for the analysis and information regarding filtering, fields selected etc.
This also allows easy identification of problems and should always be included in the bug report.","title":"Log File"},{"location":"update_log/","text":"From now on, I will try to archive our update log here. 2020-08-05 ( v2.3.3 ) Thanks to report from @charlisech, we were able to pinpoint a bug related to sample selection when using bgen data. 2020-07-15 ( v2.3.2 ) Fix off by one error in PRSet best score output Fix problem for bgen file when sample selection is performed on bgen files containing sample information 2020-05-30 ( v2.3.1.e ) Fix bug where SNPs without missingness will be wrongly considered as having 100% missingness Fix error log where PRSice should now correct stat if a parameter is missing the required arguments 2020-05-29 ( v2.3.1.d ) Fix segmentation fault when --ld is used 2020-05-28 ( v2.3.1.c ) Fix problem with missing covariate Fix Rscript such that it properly read in phenotype file when --pheno-col is specified 2020-05-26 ( v2.3.1.b ) Fix best score output when --ignore-fid is used Also fix Rscript covariate and phenotype file read when handling IDs start with 00 and when --ignore-fid is used 2020-05-26 ( v2.3.1.a ) Fix bar plot with covariate. Was plotting the full R2 instead of the PRS.R2 2020-05-23 ( v2.3.1 ) Update Rscript such that it match features in executable (thus avoid problem in plotting) Fix a bug where PRSice will crash when there are missing covariates 2020-05-21 ( v2.3.0.e ) Fix Rscript bar plot problem 2020-05-21 ( v2.3.0.d ) Fix problem introduced by previous fix. Was hoping 2.3.0's unit test will help reducing the amount of bugs. Sorry for the troubles. 2020-05-20 ( v2.3.0.c ) Fix all score output format Fix problem with --no-regress . Might still have problem with --no-regress --score con-std 2020-05-19 ( v2.3.0.b ) Fix error where sample selection will distort phenotype loading, loading the wrong phenotype to wrong sample. As this is a major bug, we deleted the previous 2 releases. Sorry for the troubles.
2020-05-19 ( v2.3.0.a ) Fix output error where we always say 0 valid phenotype were included for continuous trait Fix problem with permutation where PRSice will crash if input are rank deficient Fix problem when provide a binary phenotype file with a fam file containing -9 as phenotype, PRSice will wrongly state that there are no phenotype presented Fix problem in Rscript where if sample ID is numeric and starts with 0, the best file will not merge with the phenotype file, causing 0 valid PRS to be observed 2020-05-18 ( v2.3.0 ) We now support multi-threaded clumping (separated by chromosome) Genotypes will be stored to memory during clumping (increase memory usage, significantly speed up clumping) Will only generate one .prsice file for all phenotypes .prsice file now has additional column call \"Pheno\" Introduced --chr-id which generate rs id based on user provided formula (see detail for more info) Format of --base-maf and --base-info are now changed to : from , Fix a bug related to ambiguous allele dosage flipping when --keep-ambig is used Better mismatch handling. For example, if your base file only provide the effective allele A without the non-effective allele information, PRSice will now do dosage flipping if your target file has G/C as effective allele and A /T as an non-effective allele (whereas previous this SNP will be considered as a mismatch) Fix bug in 2.2.13 where PRSice won't output the error message during command parsing stage If user provided the --stat information, PRSice will now error out instead of trying to look for BETA or OR in the file. 
PRSice should now better recognize if phenotype file contains a header various small bug fix","title":"Update Log"},{"location":"update_log/#2020-08-05-v233","text":"Thanks to report from @charlisech, we were able to pinpoint a bug related to sample selection when using bgen data.","title":"2020-08-05 (v2.3.3)"},{"location":"update_log/#2020-07-15-v232","text":"Fix off by one error in PRSet best score output Fix problem for bgen file when sample selection is performed on bgen files containing sample information","title":"2020-07-15 (v2.3.2)"},{"location":"update_log/#2020-05-30-v231e","text":"Fix bug where SNPs without missingness will be wrongly considered as having 100% missingness Fix error log where PRSice should now correct stat if a parameter is missing the required arguments","title":"2020-05-30 (v2.3.1.e)"},{"location":"update_log/#2020-05-29-v231d","text":"Fix segmentation fault when --ld is used","title":"2020-05-29 (v2.3.1.d)"},{"location":"update_log/#2020-05-28-v231c","text":"Fix problem with missing covariate Fix Rscript such that it properly read in phenotype file when --pheno-col is specified","title":"2020-05-28 (v2.3.1.c)"},{"location":"update_log/#2020-05-26-v231b","text":"Fix best score output when --ignore-fid is used Also fix Rscript covariate and phenotype file read when handling IDs start with 00 and when --ignore-fid is used","title":"2020-05-26 (v2.3.1.b)"},{"location":"update_log/#2020-05-26-v231a","text":"Fix bar plot with covariate.
Was plotting the full R2 instead of the PRS.R2","title":"2020-05-26 (v2.3.1.a)"},{"location":"update_log/#2020-05-23-v231","text":"Update Rscript such that it match features in executable (thus avoid problem in plotting) Fix a bug where PRSice will crash when there are missing covariates","title":"2020-05-23 (v2.3.1)"},{"location":"update_log/#2020-05-21-v230e","text":"Fix Rscript bar plot problem","title":"2020-05-21 (v2.3.0.e)"},{"location":"update_log/#2020-05-21-v230d","text":"Fix problem introduced by previous fix. Was hoping 2.3.0's unit test will help reducing the amount of bugs. Sorry for the troubles.","title":"2020-05-21 (v2.3.0.d)"},{"location":"update_log/#2020-05-20-v230c","text":"Fix all score output format Fix problem with --no-regress . Might still have problem with --no-regress --score con-std","title":"2020-05-20 (v2.3.0.c)"},{"location":"update_log/#2020-05-19-v230b","text":"Fix error where sample selection will distort phenotype loading, loading the wrong phenotype to wrong sample. As this is a major bug, we deleted the previous 2 releases. 
Sorry for the troubles.","title":"2020-05-19 (v2.3.0.b)"},{"location":"update_log/#2020-05-19-v230a","text":"Fix output error where we always say 0 valid phenotype were included for continuous trait Fix problem with permutation where PRSice will crash if input are rank deficient Fix problem when provide a binary phenotype file with a fam file containing -9 as phenotype, PRSice will wrongly state that there are no phenotype presented Fix problem in Rscript where if sample ID is numeric and starts with 0, the best file will not merge with the phenotype file, causing 0 valid PRS to be observed","title":"2020-05-19 (v2.3.0.a)"},{"location":"update_log/#2020-05-18-v230","text":"We now support multi-threaded clumping (separated by chromosome) Genotypes will be stored to memory during clumping (increase memory usage, significantly speed up clumping) Will only generate one .prsice file for all phenotypes .prsice file now has additional column call \"Pheno\" Introduced --chr-id which generate rs id based on user provided formula (see detail for more info) Format of --base-maf and --base-info are now changed to : from , Fix a bug related to ambiguous allele dosage flipping when --keep-ambig is used Better mismatch handling. For example, if your base file only provide the effective allele A without the non-effective allele information, PRSice will now do dosage flipping if your target file has G/C as effective allele and A /T as an non-effective allele (whereas previous this SNP will be considered as a mismatch) Fix bug in 2.2.13 where PRSice won't output the error message during command parsing stage If user provided the --stat information, PRSice will now error out instead of trying to look for BETA or OR in the file. PRSice should now better recognize if phenotype file contains a header various small bug fix","title":"2020-05-18 (v2.3.0)"}]}
\ No newline at end of file
+{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"PRS Workshop 2024 in Japan!!! PRS Workshop 2024 in Japan!!! PRS Workshop 2024 in Japan!!! Please see details about a 2-day training workshop on the theory and application of PRS that we are running with the Okada Lab in Tokyo: (see the website ) PRSice-2: Polygenic Risk Score software PRSice (pronounced 'precise') is a Polygenic Risk Score software for calculating, applying, evaluating and plotting the results of polygenic risk scores (PRS) analyses. Some of the features include: High-resolution scoring (PRS calculated across a large number of P-value thresholds) Identify Most predictive PRS Empirical P-values output (not subject to over-fitting) Genotyped (PLINK binary) and imputed (Oxford bgen v1.2) data input Biobank-scale genotyped data can be analysed within hours Incorporation of covariates Application across multiple target traits simultaneously Results plotted in several formats (bar plots, high-res plots, quantile plots) PRSet: function for calculating PRS across user-defined pathways / gene sets Executable downloads Operating System Link Linux 64-bit v2.3.5 OS X 64-bit v2.3.5 Windows 32-bit Not available Windows 64-bit v2.3.5 Latest Update 2021-09-20 (v2.3.5) This is a temporary fix for known issues. We are currently re-building PRSice with best practices and try to make sure robustness and extensibility with unit tests. Some fixes are Fixed --perm Fixed --prevalence Reduced memory usage for bgen in multi-threaded mode Increased speed of file generation when there's enough memory Some chr-id fix, though it is still rather buggy 2020-08-05 (v2.3.3) Thanks to report from @charlisech, we were able to pinpoint a bug related to sample selection when using bgen data. 
2020-05-18 (v2.3.0) We now support multi-threaded clumping (separated by chromosome) Genotypes will be stored to memory during clumping (increase memory usage, significantly speed up clumping) Will only generate one .prsice file for all phenotypes .prsice file now has additional column call \"Pheno\" Introduced --chr-id which generate rs id based on user provided formula (see detail for more info) Format of --base-maf and --base-info are now changed to : from , If user provided the --stat information, PRSice will now error out instead of trying to look for BETA or OR in the file. update log for previous release can be found here Caution We have now fixed window problem. But was unable to access the computer that is used for compilation due to COVID. Will try to compile it when we regain access. Caution PRSet are currently under open beta - results output are reliable but please report any specific problems to our google group (see Support below)3 R Packages Requirements To plot graphs, PRSice requires R ( version 3.2.3+ ) installed. Additional steps might be required for Mac and Windows users. Installing required R packages PRSice can automatically download all required packages, even without administrative right. You can specify the install directory using --dir . For example Rscript PRSice.R --dir . will install all required packages under the local directory. 
Quick Start For Quick start use, please refer to Quick Start List user options You can also type ./PRSice to view all available parameters unrelated to plotting, or Rscript PRSice.R -h to view all available parameters, including those used for plotting Output of Results You can see the expected output of PRSice here Detailed Guide You can find a more detailed document explaining the input and output of PRSice in this page Full command line options You can find all command line options of PRSice under the section Details of PRSice/PRSet Citation If you use PRSice, then please cite: Citation Choi SW, and O\u2019Reilly PF. \"PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.\" GigaScience 8, no. 7 (July 1, 2019). https://doi.org/10.1093/gigascience/giz082 . Support This wiki should contain all the basic instruction for the use of PRSice. Shall you have any problems, please feel free to start an issue here or visit our google group . You can help us to speed up the debug process by including the log file generated by PRSice. In addition, you can use the search bar in this webpage to search for specific functions. Authors For more details on the authors, see: Dr Shing Wan Choi Dr Paul O'Reilly PRSice-2 and all new functionalities are coded by: Dr Shing Wan Choi Acknowledgement PRSice is a software package written in C++ (main) and R (plotting). The code relies partially on those written in PLINK by Christopher Chang . Management of BGEN file is based on BGEN lib written by Gavin Band . We also utilize the Eigen C++ library, the gzstream library.","title":"Home"},{"location":"#executable-downloads","text":"Operating System Link Linux 64-bit v2.3.5 OS X 64-bit v2.3.5 Windows 32-bit Not available Windows 64-bit v2.3.5 Latest Update","title":"Executable downloads"},{"location":"#2021-09-20-v235","text":"This is a temporary fix for known issues. 
We are currently re-building PRSice with best practices and try to make sure robustness and extensibility with unit tests. Some fixes are Fixed --perm Fixed --prevalence Reduced memory usage for bgen in multi-threaded mode Increased speed of file generation when there's enough memory Some chr-id fix, though it is still rather buggy","title":"2021-09-20 (v2.3.5)"},{"location":"#2020-08-05-v233","text":"Thanks to report from @charlisech, we were able to pinpoint a bug related to sample selection when using bgen data.","title":"2020-08-05 (v2.3.3)"},{"location":"#2020-05-18-v230","text":"We now support multi-threaded clumping (separated by chromosome) Genotypes will be stored to memory during clumping (increase memory usage, significantly speed up clumping) Will only generate one .prsice file for all phenotypes .prsice file now has additional column call \"Pheno\" Introduced --chr-id which generate rs id based on user provided formula (see detail for more info) Format of --base-maf and --base-info are now changed to : from , If user provided the --stat information, PRSice will now error out instead of trying to look for BETA or OR in the file. update log for previous release can be found here Caution We have now fixed window problem. But was unable to access the computer that is used for compilation due to COVID. Will try to compile it when we regain access. Caution PRSet are currently under open beta - results output are reliable but please report any specific problems to our google group (see Support below)3","title":"2020-05-18 (v2.3.0)"},{"location":"#r-packages-requirements","text":"To plot graphs, PRSice requires R ( version 3.2.3+ ) installed. Additional steps might be required for Mac and Windows users. Installing required R packages PRSice can automatically download all required packages, even without administrative right. You can specify the install directory using --dir . For example Rscript PRSice.R --dir . 
will install all required packages under the local directory.","title":"R Packages Requirements"},{"location":"#quick-start","text":"For Quick start use, please refer to Quick Start List user options You can also type ./PRSice to view all available parameters unrelated to plotting, or Rscript PRSice.R -h to view all available parameters, including those used for plotting","title":"Quick Start"},{"location":"#output-of-results","text":"You can see the expected output of PRSice here","title":"Output of Results"},{"location":"#detailed-guide","text":"You can find a more detailed document explaining the input and output of PRSice in this page","title":"Detailed Guide"},{"location":"#full-command-line-options","text":"You can find all command line options of PRSice under the section Details of PRSice/PRSet","title":"Full command line options"},{"location":"#citation","text":"If you use PRSice, then please cite: Citation Choi SW, and O\u2019Reilly PF. \"PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.\" GigaScience 8, no. 7 (July 1, 2019). https://doi.org/10.1093/gigascience/giz082 .","title":"Citation"},{"location":"#support","text":"This wiki should contain all the basic instruction for the use of PRSice. Shall you have any problems, please feel free to start an issue here or visit our google group . You can help us to speed up the debug process by including the log file generated by PRSice. In addition, you can use the search bar in this webpage to search for specific functions.","title":"Support"},{"location":"#authors","text":"For more details on the authors, see: Dr Shing Wan Choi Dr Paul O'Reilly PRSice-2 and all new functionalities are coded by: Dr Shing Wan Choi","title":"Authors"},{"location":"#acknowledgement","text":"PRSice is a software package written in C++ (main) and R (plotting). The code relies partially on those written in PLINK by Christopher Chang . Management of BGEN file is based on BGEN lib written by Gavin Band . 
We also utilize the Eigen C++ library and the gzstream library.","title":"Acknowledgement"},{"location":"archive/","text":"You can download previous versions of PRSice here Warning We no longer support PRSice-1.x. Please use PRSice-2 unless you need specific features in PRSice-1 that aren't implemented in PRSice-2. Bugs and errors within PRSice-1.x will not be fixed nor will the script be updated. Version Software Manual Vignette 1.25 download download download 1.23 download download download 1.22 download download download 1.21 download download download 1.2 download download download Citation If you use PRSice-1, then please cite: Citation PRSice: Polygenic Risk Score software, Euesden, Lewis, O'Reilly, Bioinformatics (2015) 31 (9):1466-1468 Authors Authors of PRSice-1 are as follows: Dr Jack Euesden Professor Cathryn Lewis Dr Paul O'Reilly","title":"Archive"},{"location":"archive/#citation","text":"If you use PRSice-1, then please cite: Citation PRSice: Polygenic Risk Score software, Euesden, Lewis, O'Reilly, Bioinformatics (2015) 31 (9):1466-1468","title":"Citation"},{"location":"archive/#authors","text":"Authors of PRSice-1 are as follows: Dr Jack Euesden Professor Cathryn Lewis Dr Paul O'Reilly","title":"Authors"},{"location":"command_detail/","text":"Available Commands This page contains all commands available in PRSice. Tips When constructing new parameters, we follow the following rule: if the command has an effect on any file that is not the target, it will have a prefix of the file name. For example, --base-info applies INFO score filtering on the base file, --ld-info performs INFO score filtering on the LD reference file and --info applies the INFO score filtering on the target file. Base File --a1 Column header containing the effective allele . There isn't any standardized label for the effective allele, therefore extra care must be taken to ensure the correct label is provided, otherwise, the effect will be flipped.
--a2 Column header containing non-effective allele . --base | -b Base (i.e. GWAS) association file. This is a whitespace delimited file containing association results for SNPs on the base phenotype. This file can be gzipped (must have the .gz suffix). For PRSice to run, the base file must contain the effective allele ( --A1 ), effect size estimates ( --stat ), p-value for association ( --pvalue ), and the SNP ID ( --snp ). --beta This flag is used to indicate if the test statistic is in the form of BETA. If set, PRSice assume the statistic is in the form BETA. Mutually exclusive from --or --bp Column header containing the coordinate of SNPs. When provided, the coordinate of the SNPs will be scrutinized between the base and target file. SNPs with mismatched coordinate will be excluded. --chr Column header containing the chromosome information. When provided, the chromosome information of the SNPs will be scrutinized between the base and target file. SNPs with mismatched chromosome information will be automatically excluded. --index If set, assume the base columns are INDEX instead of the name of the corresponding columns. Index should be 0-based (start counting from 0) --base-info Base INFO score filtering. Format should be : . SNPs with info score less than will be ignored. It is useful to perform INFO score filtering to remove SNPs with low imputation confidence score. By default, PRSice will search for the INFO column in your base file and perform info score filtering with threshold of 0.9 . You can disable this behaviour by using --no-default --base-maf Base minor allele frequency (MAF) filtering. Format should be ,: . SNPs with MAF less than will be ignored. Additional column can be provided (e.g. different filtering threshold for case and control), using the following format: :,: --no-default Remove all default options. If set, PRSice will not set any defaults. --or This flag is used to indicate if the test statistic is in the form of odd ratios. 
If set, PRSice assumes the statistic is in the form OR. Mutually exclusive with --beta --pvalue | -p Column header containing the p-value. The p-value information must be provided --snp Column header containing the SNP ID. This is required to allow SNP matching between the base and target file. Note While it is possible to implement a feature to allow SNP matching purely based on the chromosome number and coordinate of a variant, the possibility of flipping and multi-allelic input complicates the matter. Therefore this feature will not be implemented until an elegant solution can be provided. --stat Column header containing the summary statistic. If --beta is set, default to BETA ; likewise, if --or is set, default to OR . Otherwise, will try and search for OR or BETA from the header of the base file. If both OR and BETA are present in the header, PRSice will terminate. Target File --binary-target Indicate whether the target phenotype is binary or not. Either T or F should be provided where T represents a binary phenotype. For multiple phenotypes, the input should be separated by commas without spaces. Default: F if --beta is set and T if --or is set --geno Filter SNPs based on genotype missingness. Must be a value between 0.0 and 1.0 . --info Filter SNPs based on info score. Only used for imputed target data. The INFO score is calculated as the MaCH imputation r-squared value, represented by the following pseudo code: m=Mean of expected genotype v=variance of expected genotype p=m/2 p_a = 2p(1-p) INFO = v/p_a --keep File containing the sample(s) to be extracted from the target file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID Mutually exclusive with --remove --maf Filter SNPs based on minor allele frequency (MAF). 
MAF is calculated using only the founder samples Note When perform MAF filtering on dosage data, the MAF is calculated using the hard-coded genotype --nonfounders By default, PRSice will exclude all non-founders from the analysis. When this flag is set, non-founders will be included in the regression model but will still be excluded from LD estimation. --pheno | -f Tab or space delimited phenotype file containing the phenotype(s). First column must be FID of the samples and the second column must be IID of the samples. When --ignore-fid is set, first column must be the IID of the samples. Must contain a header if --pheno-col is specified --pheno-col Headers of phenotypes to be included from the phenotype file. When multiple phenotypes are provided, the phenotype name will be used as part of the file output prefix --prevalence | -k Prevalence of all binary trait. If provided, PRSice will adjust the ascertainment bias of the R2. Note When multiple binary trait is found, prevalence information must be provided for all of them (Either adjust all binary traits, or don't adjust at all). For example, if there are 3 traits A, B and C, where A and C are binary traits with population prevalence of 0.1 and 0.2 respectively. The correct input should be --binary-target T,F,T --prevalence 0.1,0.2 --remove File containing the sample(s) to be removed from the target file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID Mutually exclusive from --keep --target | -t Target genotype file. Currently support both BGEN and binary PLINK format. For multiple chromosome input, simply substitute the chromosome number with #. PRSice will automatically replace # with 1-22. A separate fam/sample file can be specified by --target , --target-list File containing prefix of target genotype files. Similar to --target but allow for more flexibility. 
A separate fam/sample file can be specified by --target-list , --type File type of the target file. Support bed (binary plink) and bgen format. Default: bed Dosage --allow-inter Allow the generate of intermediate file. This will speed up PRSice when using dosage data as clumping reference and for hard coding PRS calculation --dose-thres Translate any SNPs with highest genotype probability less than this threshold to missing call. For example, with --dose-thres 0.9 , sample with genotype probability of \\(P(0/0)=0.2\\) , \\(P(0/1)=0.52\\) , \\(P(1/1)=0.28\\) will be set to missing --hard-thres A hardcall is saved when the distance to the nearest hardcall is less than the hardcall threshold. Otherwise a missing code is saved. Default is: 0.1 The distance ( \\(D\\) ) to the nearest hardcall is calculated as: \\[ P(Ref) = 2 \\times P(HomRef) + P(Het) \\\\ P(Alt) = 2 \\times P(HomAlt) + P(Het) \\\\ D = 0.5 \\times \\left(|P\\left(Ref\\right)- round\\left(P\\left(Ref\\right)\\right)| + |P\\left(Alt\\right)- round\\left(P\\left(Alt\\right)\\right)|\\right) \\] Note If dosage data is used as a LD reference, it will always be hard coded to calculate the LD Default: 0.9 --hard When set, will use hard thresholding instead of dosage for PRS construction. Default is to use dosage. Clumping --clump-kb The distance for clumping in kb. For example, if --clump-kb 250 is provided, PRSice will clump any SNPs that is within 250kb to both end of the index SNP (therefore a 500kb window with the index SNP at the center). Now also support distance with a unit. e.g. --clump-kb 1M is a valid input. Default: 250kb for PRSice, 1mb for PRSet --clump-r2 The r 2 threshold for clumping. Default: 0.1 --clump-p The p-value threshold use for clumping. Default: 1. --ld | -L LD reference file. Use for estimation of LD during clumping. If not provided, will use the post-filtered target genotype for LD calculation. Support multiple chromosome input. Please see --target for more information. 
When the target sample is small (e.g. < 500) and external panel of the same population is available (e.g. 1000 genome), an external reference panel might be used to improve the LD estimation for clumping. --ld-dose-thres Translate any SNPs with highest genotype probability less than this threshold to missing call. For example, with --ld-dose-thres 0.9 , sample with genotype probability of \\(P(0/0)=0.2\\) , \\(P(0/1)=0.52\\) , \\(P(1/1)=0.28\\) will be set to missing --ld-geno Filter SNPs based on genotype missingness. Must be a value between 0.0 and 1.0 . --ld-info Filter SNPs based on info score. Only used for imputed LD reference. The INFO score is calculated as the MaCH imputation r-squared value, represented by the following pseudo code m=Mean of expected genotype v=variance of expected genotype p=m/2 p_a = 2p(1-p) INFO = v/p_a --ld-hard-thres A hardcall is saved when the distance to the nearest hardcall is less than the hardcall threshold. Otherwise a missing code is saved. Default is: 0.1 The distance ( \\(D\\) ) to the nearest hardcall is calculated as: \\[ P(Ref) = 2 \\times P(HomRef) + P(Het) \\\\ P(Alt) = 2 \\times P(HomAlt) + P(Het) \\\\ D = 0.5 \\times \\left(|P\\left(Ref\\right)- round\\left(P\\left(Ref\\right)\\right)| + |P\\left(Alt\\right)- round\\left(P\\left(Alt\\right)\\right)|\\right) \\] Note If dosage data is used as a LD reference, it will always be hard coded to calculate the LD Default: 0.9 --ld-keep File containing the sample(s) to be extracted from the LD reference file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID. Mutually exclusive from --ld-remove . No effect if --ld was not provided --ld-list File containing prefix of multiple LD reference files. Similar to --ld but allow more flexibility. 
A separate fam/sample file can be specified by --ld-list , --ld-maf Filter SNPs based on minor allele frequency (MAF) Note When performing MAF filtering on dosage data, MAF is calculated using the hard-coded genotype --ld-remove File containing the sample(s) to be removed from the LD reference file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID. Mutually exclusive with --ld-keep --ld-type File type of the LD file. Supports bed (binary plink) and bgen format. Default: bed --no-clump When set, PRSice will not perform clumping. This is useful when a pre-clumped list of SNPs is available. --proxy Proxy threshold for index SNP to be considered as part of the region represented by the clumped SNP(s). e.g. --proxy 0.8 means the index SNP will represent region of any clumped SNP(s) that has r 2 =0.8 even if the index SNP does not physically locate within the region Covariate --cov | -C Covariate file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID --cov-col | -c Header of covariates. If not provided, will use all variables in the covariate file. By adding @ in front of the string, any numbers within [ and ] will be parsed. E.g. @PC[1-3] will be read as PC1,PC2,PC3 . Discontinuous input is also supported: @cov[1.3-5] will be parsed as cov1,cov3,cov4,cov5 --cov-factor Header of categorical covariate(s). Dummy variables will be automatically generated. Any items in --cov-factor must also be found in --cov-col Also accepts continuous input (start with @ ). P-value Thresholding --bar-levels Level of barchart to be plotted. When --fastscore is set, PRSice will only calculate the PRS for thresholds within the bar levels. Levels should be comma separated without space --fastscore Only calculate thresholds stated in --bar-levels --no-full By default, PRSice will include the full model, i.e. p-value threshold = 1. 
Setting this flag will disable that behaviour --interval | -i The step size of the threshold. Default: 5e-05 --lower | -l The starting p-value threshold. Default: 5e-08 --model Genetic model use for regression. The genetic encoding is based on the base data where the encoding represent number of the effective allele Available models include: add - Additive model, code as 0/1/2 (default) dom - Dominant model, code as 0/1/1 rec - Recessive model, code as 0/0/1 het - Heterozygous only model, code as 0/1/0 --missing Method to handle missing genotypes. Available methods include: MEAN_IMPUTE - Missing genotypes contribute an amount proportional to imputed allele frequency (default) SET_ZERO - To throw out missing observations instead CENTER - shift all scores to mean zero. --no-regress Do not perform the regression analysis and simply output all PRS. --score Method to calculate the polygenic score. Available methods include: avg - Take the average effect size (default) std - Standardize the effect size con-std - Standardize the effect size using mean and sd derived from control samples sum - Direct summation of the effect size --upper | -u The final p-value threshold. Default: 0.5 PRSet --background String to indicate a background file. This string should have the format of Name:Type where type can be bed - 0-based range with 3 column. Chr Start End range - 1-based range with 3 column. Chr Start End gene - A file contain a column of gene name As the name suggest, the background file inform PRSet of the background signal to be used for competitive p-value calculation. When a background file is not provided, PRSet will construct the background using the GTF file. However, if both the background and the GTF file isn't provided, PRSet cannot perform the set base permutation. In this case, you can use --full-back to indicate that you'd like to use the whole genome as the background set --bed | -B Bed file containing the selected regions. 
Name of bed file will be used as the region identifier. Warning Bed file is 0-based. --feature Feature(s) to be included from the gtf file. Default: exon,CDS,gene,protein_coding --full-back Use the whole genome as background set for competitive p-value calculation --gtf | -g GTF file containing gene boundaries. Required when --msigdb is used Tip Human Genome build GRCh38 can be downloaded from here . --msigdb | -m MSigDB file containing the pathway information. Requires the gtf file. The GMT file format used by MSigDB is a simple tab/space delimited text file where each line corresponds to a single gene set followed by Gene IDs: [Set A] [Gene 1] [Gene 2] ... [Set B] [Gene 1] [Gene 2] ... Tip Curated MSigDB files can be downloaded from here after registration in here --set-perm The number of set-based permutations to perform. This is only used for calculating the competitive p-value. 10,000 permutations should generally be enough. --snp-set Provide gene sets using SNP ID. Two different formats are allowed: SNP Set list format: A file containing a single column of SNP ID. Name of the set will be the file name or can be provided using --snp-set File:Name MSigDB format: Each row represents a single SNP set with the first column containing the name of the SNP set. --wind-3 Add N base(s) to the 3' region of each gene region. Unit suffixes are allowed e.g. --wind-3 1M --wind-5 Add N base(s) to the 5' region of each gene region. Unit suffixes are allowed e.g. --wind-5 1M R specific commands --prsice Location of the PRSice executable. --dir Location to install required R packages. Only required if the required packages are not installed. We require the following packages: optparse , methods , tools , ggplot2 , data.table , grDevices , RColorBrewer Plotting --bar-col-high Colour of the most predictive threshold. Default: firebrick --bar-col-low Colour of the poorest predicting threshold. 
Default: dodgerblue --bar-col-p When set, will change the colour of bar to p-value threshold instead of the p-value from the association with phenotype --bar-palatte Colour palatte to be used for bar plotting when --bar_col_p is set. Default: YlOrRd --device Select different plotting devices. You can choose any plotting devices supported by base R. Default: png --multi-plot Plot the top N target phenotypes / gene sets in a summary plot --plot When set, will only perform plotting using existing PRSice result files. All other parameters are still required such that PRSice can correctly locate the required input files for plotting. --plot-set The default behaviour of PRSet is to plot the bar-chart, high-resolution plot and quantile plot of the \"Base\" gene set, which consider all SNPs within the genome. By using the --plot-set option, you can plot the specific set of interest. --quantile | -q Number of quantiles to plot. No quantile plot will be generated when this is not provided. --quant-break Parameter to indicate an uneven distribution of quantile. Values represent the upperbound of each quantile group. e.g. With --quantile 10 --quant-break 1,5,10 , the quantiles will be grouped into \\(0\\lt Q \\le 1\\) , \\(1\\lt Q \\le 5\\) , \\(5\\lt Q \\le 10\\) Note To use --quant-break , you must set the correct amount of quantiles. For example, if the largest value in --quant-break is 100, then you must use --quantile 100 --quant-extract | -e File containing sample ID to be plot on a separated quantile e.g. extra quantile containing only schizophrenia samples. Must contain IID. Should contain FID if --ignore-fid isn't set. Note This will only work if the base and target has a different phenotype or if the target phenotype is quantitative --quant-ref Reference quantile for quantile plot. 
Default is number of quantiles divided by 2 Or in the event where --quant-break is used, represent the upper bound of the reference quantile --scatter-r2 When set, will change the y-axis of the high resolution scatter plot to R2 instead Miscellaneous --all-score Output PRS for ALL thresholds. Warning This will generate a huge file --exclude File containing SNPs to be excluded from the analysis. Mutually exclusive with --extract --chr-id Try to construct an RS ID for SNP based on its chromosome, coordinate, effective allele and non-effective allele. For example, c:L-aBd is translated to: :-d This ID will always be used to represent SNPs on the target file, whereas for the base file, we will still prefer to use the column provided in the --snp parameter. SNPs in base file will only be represented by the --chr-id if the RS ID is not provided. --extract File containing SNPs to be included in the analysis. Mutually exclusive with --exclude --id-delim Delimiter used to concatenate FID and IID when performing ID matching. Especially useful for BGEN file processing --ignore-fid Ignore FID for all input. When this is set, first column of all files will be assumed to be IID instead of FID --keep-ambig Keep ambiguous SNPs. PRSice will only perform dosage flipping but not strand flipping on ambiguous SNPs. e.g. If your base data contain A/T, with effective allele being A, and your target data is T/A with dosage of T, then PRSice will change the dosage in target to A/T with dosage of A. Only use this option when you are certain your base and target are on the same strand. --logit-perm When performing permutation on binary phenotypes, use logistic regression instead of linear regression. This will substantially slow down PRSice. Note One problem with using --logit-perm is that some of the permuted phenotypes might suffer from perfect separation. This leads to the GLM logistic model not being able to converge (thus terminating PRSice). 
If you encounter such a problem, you might want to exclude the --logit-perm option. In most cases, the p-value of the linear model should be similar to the logistic model --memory Maximum memory usage allowed. PRSice will try its best to honor this setting. For example, --memory 10Gb will restrict PRSice to use no more than 10Gb of memory. However, as we are not using a memory pool like PLINK, it is possible for PRSice to use more than the allowed amount. PRSice will mainly check the memory usage when: Perform Clumping Perform permutation analysis Perform set-based permutation --non-cumulate Calculate non-cumulative PRS. PRS will be reset to 0 for each new P-value threshold instead of adding up --out | -o Prefix for all file output. Note If multiple target phenotypes are included (e.g. using --pheno-col ), the phenotype will be appended to the output prefix If multiple gene sets are included, the name of the set will be appended to the output prefix (after the phenotype (if any)) --perm Number of permutations to perform. This will generate the empirical p-value. Recommend to use value larger than or equal to 10,000 Note When permutation is required, PRSice will perform the following operation Perform normal PRSice across all thresholds and obtain p-value of the most significant threshold Repeat PRSice analysis N times with permuted phenotype. Count the number of time where the p-value of the most significant threshold for the permuted --print-snp Print all SNPs that remain in the analysis after clumping is performed. For PRSet, 1 indicates the SNP falls within the gene set of interest and 0 otherwise. If only PRSice is performed, a single \"gene set\" called \"Base\" will be indicated with all entries marked as 1 --seed | -s Seed used for permutation. If not provided, system time will be used as seed. 
This will allow the same results to be generated when the same seed and input is used --thread | -n Number of threads to use Tip Maximum number of threads can be specified by using --thread max Note PRSice will limit the maximum number of threads used to the number of cores available on the system as detected by PRSice. --ultra Ultra aggressive memory management. Will store all genotypes into the memory after clumping is performed. This will significantly speed up PRSice and PRSet at the expense of increased memory usage. --x-range Range of SNPs to be excluded from the whole analysis. It can either be a single bed file or a comma separated list of ranges. Range must be in the format of chr:start-end or chr:coordinate --help | -h Display the help messages","title":"Available Commands"},{"location":"command_detail/#available-commands","text":"This page contains all commands available in PRSice. Tips When constructing new parameters, we follow the following rule: if the command has effect on any file that is not the target, it will have a prefix of the file name. For example, --base-info applies INFO score filtering on the base file, --ld-info performs INFO score filtering on the LD reference file and --info applies the INFO score filtering on the target file.","title":"Available Commands"},{"location":"command_detail/#base-file","text":"--a1 Column header containing the effective allele . There isn't any standardized label for the effective allele, therefore extra care must be taken to ensure the correct label is provided, otherwise, the effect will be flipped. --a2 Column header containing non-effective allele . --base | -b Base (i.e. GWAS) association file. This is a whitespace delimited file containing association results for SNPs on the base phenotype. This file can be gzipped (must have the .gz suffix). 
For PRSice to run, the base file must contain the effective allele ( --A1 ), effect size estimates ( --stat ), p-value for association ( --pvalue ), and the SNP ID ( --snp ). --beta This flag is used to indicate if the test statistic is in the form of BETA. If set, PRSice assume the statistic is in the form BETA. Mutually exclusive from --or --bp Column header containing the coordinate of SNPs. When provided, the coordinate of the SNPs will be scrutinized between the base and target file. SNPs with mismatched coordinate will be excluded. --chr Column header containing the chromosome information. When provided, the chromosome information of the SNPs will be scrutinized between the base and target file. SNPs with mismatched chromosome information will be automatically excluded. --index If set, assume the base columns are INDEX instead of the name of the corresponding columns. Index should be 0-based (start counting from 0) --base-info Base INFO score filtering. Format should be : . SNPs with info score less than will be ignored. It is useful to perform INFO score filtering to remove SNPs with low imputation confidence score. By default, PRSice will search for the INFO column in your base file and perform info score filtering with threshold of 0.9 . You can disable this behaviour by using --no-default --base-maf Base minor allele frequency (MAF) filtering. Format should be ,: . SNPs with MAF less than will be ignored. Additional column can be provided (e.g. different filtering threshold for case and control), using the following format: :,: --no-default Remove all default options. If set, PRSice will not set any defaults. --or This flag is used to indicate if the test statistic is in the form of odd ratios. If set, PRSice assume the statistic is in the form OR. Mutually exclusive from --beta --pvalue | -p Column header containing the p-value. The p-value information must be provided --snp Column header containing the SNP ID. 
This is required to allow SNP matching between the base and target file. Note While it is possible to implement a feature to allow SNP matching purely based on the chromosome number and coordinate of a variant, the possibility of flipping and multi-allelic input complicates the matter. Therefore this feature will not be implemented until an elegant solution can be provided. --stat Column header containing the summary statistic. If --beta is set, default to BETA ; likewise, if --or is set, default to OR . Otherwise, will try and search for OR or BETA from the header of the base file. If both OR and BETA are present in the header, PRSice will terminate.","title":"Base File"},{"location":"command_detail/#target-file","text":"--binary-target Indicate whether the target phenotype is binary or not. Either T or F should be provided where T represents a binary phenotype. For multiple phenotypes, the input should be separated by commas without spaces. Default: F if --beta is set and T if --or is set --geno Filter SNPs based on genotype missingness. Must be a value between 0.0 and 1.0 . --info Filter SNPs based on info score. Only used for imputed target data. The INFO score is calculated as the MaCH imputation r-squared value, represented by the following pseudo code: m=Mean of expected genotype v=variance of expected genotype p=m/2 p_a = 2p(1-p) INFO = v/p_a --keep File containing the sample(s) to be extracted from the target file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID Mutually exclusive with --remove --maf Filter SNPs based on minor allele frequency (MAF). MAF is calculated using only the founder samples Note When performing MAF filtering on dosage data, the MAF is calculated using the hard-coded genotype --nonfounders By default, PRSice will exclude all non-founders from the analysis. 
When this flag is set, non-founders will be included in the regression model but will still be excluded from LD estimation. --pheno | -f Tab or space delimited phenotype file containing the phenotype(s). First column must be FID of the samples and the second column must be IID of the samples. When --ignore-fid is set, first column must be the IID of the samples. Must contain a header if --pheno-col is specified --pheno-col Headers of phenotypes to be included from the phenotype file. When multiple phenotypes are provided, the phenotype name will be used as part of the file output prefix --prevalence | -k Prevalence of all binary trait. If provided, PRSice will adjust the ascertainment bias of the R2. Note When multiple binary trait is found, prevalence information must be provided for all of them (Either adjust all binary traits, or don't adjust at all). For example, if there are 3 traits A, B and C, where A and C are binary traits with population prevalence of 0.1 and 0.2 respectively. The correct input should be --binary-target T,F,T --prevalence 0.1,0.2 --remove File containing the sample(s) to be removed from the target file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID Mutually exclusive from --keep --target | -t Target genotype file. Currently support both BGEN and binary PLINK format. For multiple chromosome input, simply substitute the chromosome number with #. PRSice will automatically replace # with 1-22. A separate fam/sample file can be specified by --target , --target-list File containing prefix of target genotype files. Similar to --target but allow for more flexibility. A separate fam/sample file can be specified by --target-list , --type File type of the target file. Support bed (binary plink) and bgen format. Default: bed","title":"Target File"},{"location":"command_detail/#dosage","text":"--allow-inter Allow the generate of intermediate file. 
This will speed up PRSice when using dosage data as clumping reference and for hard coding PRS calculation --dose-thres Translate any SNPs with highest genotype probability less than this threshold to missing call. For example, with --dose-thres 0.9 , sample with genotype probability of \\(P(0/0)=0.2\\) , \\(P(0/1)=0.52\\) , \\(P(1/1)=0.28\\) will be set to missing --hard-thres A hardcall is saved when the distance to the nearest hardcall is less than the hardcall threshold. Otherwise a missing code is saved. Default is: 0.1 The distance ( \\(D\\) ) to the nearest hardcall is calculated as: \\[ P(Ref) = 2 \\times P(HomRef) + P(Het) \\\\ P(Alt) = 2 \\times P(HomAlt) + P(Het) \\\\ D = 0.5 \\times \\left(|P\\left(Ref\\right)- round\\left(P\\left(Ref\\right)\\right)| + |P\\left(Alt\\right)- round\\left(P\\left(Alt\\right)\\right)|\\right) \\] Note If dosage data is used as a LD reference, it will always be hard coded to calculate the LD Default: 0.9 --hard When set, will use hard thresholding instead of dosage for PRS construction. Default is to use dosage.","title":"Dosage"},{"location":"command_detail/#clumping","text":"--clump-kb The distance for clumping in kb. For example, if --clump-kb 250 is provided, PRSice will clump any SNPs that is within 250kb to both end of the index SNP (therefore a 500kb window with the index SNP at the center). Now also support distance with a unit. e.g. --clump-kb 1M is a valid input. Default: 250kb for PRSice, 1mb for PRSet --clump-r2 The r 2 threshold for clumping. Default: 0.1 --clump-p The p-value threshold use for clumping. Default: 1. --ld | -L LD reference file. Use for estimation of LD during clumping. If not provided, will use the post-filtered target genotype for LD calculation. Support multiple chromosome input. Please see --target for more information. When the target sample is small (e.g. < 500) and external panel of the same population is available (e.g. 
1000 genome), an external reference panel might be used to improve the LD estimation for clumping. --ld-dose-thres Translate any SNPs with highest genotype probability less than this threshold to missing call. For example, with --ld-dose-thres 0.9 , sample with genotype probability of \\(P(0/0)=0.2\\) , \\(P(0/1)=0.52\\) , \\(P(1/1)=0.28\\) will be set to missing --ld-geno Filter SNPs based on genotype missingness. Must be a value between 0.0 and 1.0 . --ld-info Filter SNPs based on info score. Only used for imputed LD reference. The INFO score is calculated as the MaCH imputation r-squared value, represented by the following pseudo code m=Mean of expected genotype v=variance of expected genotype p=m/2 p_a = 2p(1-p) INFO = v/p_a --ld-hard-thres A hardcall is saved when the distance to the nearest hardcall is less than the hardcall threshold. Otherwise a missing code is saved. Default is: 0.1 The distance ( \\(D\\) ) to the nearest hardcall is calculated as: \\[ P(Ref) = 2 \\times P(HomRef) + P(Het) \\\\ P(Alt) = 2 \\times P(HomAlt) + P(Het) \\\\ D = 0.5 \\times \\left(|P\\left(Ref\\right)- round\\left(P\\left(Ref\\right)\\right)| + |P\\left(Alt\\right)- round\\left(P\\left(Alt\\right)\\right)|\\right) \\] Note If dosage data is used as a LD reference, it will always be hard coded to calculate the LD Default: 0.9 --ld-keep File containing the sample(s) to be extracted from the LD reference file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID. Mutually exclusive from --ld-remove . No effect if --ld was not provided --ld-list File containing prefix of multiple LD reference files. Similar to --ld but allow more flexibility. 
A separate fam/sample file can be specified by --ld-list , --ld-maf Filter SNPs based on minor allele frequency (MAF) Note When performing MAF filtering on dosage data, MAF is calculated using the hard-coded genotype --ld-remove File containing the sample(s) to be removed from the LD reference file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID. Mutually exclusive from --ld-keep --ld-type File type of the LD file. Supports bed (binary plink) and bgen format. Default: bed --no-clump When set, PRSice will not perform clumping. This is useful when a pre-clumped list of SNPs is available. --proxy Proxy threshold for index SNP to be considered as part of the region represented by the clumped SNP(s). e.g. --proxy 0.8 means the index SNP will represent region of any clumped SNP(s) that has r 2 =0.8 even if the index SNP does not physically locate within the region","title":"Clumping"},{"location":"command_detail/#covariate","text":"--cov | -C Covariate file. First column should be FID and the second column should be IID. If --ignore-fid is set, first column should be IID --cov-col | -c Header of covariates. If not provided, will use all variables in the covariate file. By adding @ in front of the string, any numbers within [ and ] will be parsed. E.g. @PC[1-3] will be read as PC1,PC2,PC3 . Discontinuous input is also supported: @cov[1.3-5] will be parsed as cov1,cov3,cov4,cov5 --cov-factor Header of categorical covariate(s). Dummy variables will be automatically generated. Any items in --cov-factor must also be found in --cov-col Also accepts continuous input (start with @ ).","title":"Covariate"},{"location":"command_detail/#p-value-thresholding","text":"--bar-levels Levels of the bar chart to be plotted. When --fastscore is set, PRSice will only calculate the PRS for thresholds within the bar levels. 
Levels should be comma separated without space --fastscore Only calculate threshold stated in --bar-levels --no-full By default, PRSice will include the full model, i.e. p-value threshold = 1. Setting this flag will disable that behaviour --interval | -i The step size of the threshold. Default: 5e-05 --lower | -l The starting p-value threshold. Default: 5e-08 --model Genetic model use for regression. The genetic encoding is based on the base data where the encoding represent number of the effective allele Available models include: add - Additive model, code as 0/1/2 (default) dom - Dominant model, code as 0/1/1 rec - Recessive model, code as 0/0/1 het - Heterozygous only model, code as 0/1/0 --missing Method to handle missing genotypes. Available methods include: MEAN_IMPUTE - Missing genotypes contribute an amount proportional to imputed allele frequency (default) SET_ZERO - To throw out missing observations instead CENTER - shift all scores to mean zero. --no-regress Do not perform the regression analysis and simply output all PRS. --score Method to calculate the polygenic score. Available methods include: avg - Take the average effect size (default) std - Standardize the effect size con-std - Standardize the effect size using mean and sd derived from control samples sum - Direct summation of the effect size --upper | -u The final p-value threshold. Default: 0.5","title":"P-value Thresholding"},{"location":"command_detail/#prset","text":"--background String to indicate a background file. This string should have the format of Name:Type where type can be bed - 0-based range with 3 column. Chr Start End range - 1-based range with 3 column. Chr Start End gene - A file contain a column of gene name As the name suggest, the background file inform PRSet of the background signal to be used for competitive p-value calculation. When a background file is not provided, PRSet will construct the background using the GTF file. 
However, if neither the background nor the GTF file is provided, PRSet cannot perform the set-based permutation. In this case, you can use --full-back to indicate that you'd like to use the whole genome as the background set --bed | -B Bed file containing the selected regions. Name of bed file will be used as the region identifier. Warning Bed file is 0-based. --feature Feature(s) to be included from the gtf file. Default: exon,CDS,gene,protein_coding --full-back Use the whole genome as background set for competitive p-value calculation --gtf | -g GTF file containing gene boundaries. Required when --msigdb is used Tip Human Genome build GRCh38 can be downloaded from here . --msigdb | -m MSigDB file containing the pathway information. Requires the gtf file. The GMT file format used by MSigDB is a simple tab/space delimited text file where each line corresponds to a single gene set followed by Gene IDs: [Set A] [Gene 1] [Gene 2] ... [Set B] [Gene 1] [Gene 2] ... Tip Curated MSigDB files can be downloaded from here after registration in here --set-perm The number of set-based permutations to perform. This is only used for calculating the competitive p-value. 10,000 permutations should generally be enough. --snp-set Provide gene sets using SNP ID. Two different formats are allowed: SNP Set list format: A file containing a single column of SNP ID. Name of the set will be the file name or can be provided using --snp-set File:Name MSigDB format: Each row represents a single SNP set with the first column containing the name of the SNP set. --wind-3 Add N base(s) to the 3' region of each gene region. Unit suffixes are allowed e.g. --wind-3 1M --wind-5 Add N base(s) to the 5' region of each gene region. Unit suffixes are allowed e.g. --wind-5 1M","title":"PRSet"},{"location":"command_detail/#r-specific-commands","text":"--prsice Location of the PRSice executable. --dir Location to install required R packages. Only required if the required packages are not installed. 
We require the following packages: optparse , method , tools , ggplot2 , data.table , grDevices , RColorBrewer","title":"R specific commands"},{"location":"command_detail/#plotting","text":"--bar-col-high Colour of the most predicting threshold. Default: firebrick --bar-col-low Colour of the poorest predicting threshold. Default: dodgerblue --bar-col-p When set, will change the colour of bar to p-value threshold instead of the p-value from the association with phenotype --bar-palatte Colour palatte to be used for bar plotting when --bar_col_p is set. Default: YlOrRd --device Select different plotting devices. You can choose any plotting devices supported by base R. Default: png --multi-plot Plot the top N target phenotypes / gene sets in a summary plot --plot When set, will only perform plotting using existing PRSice result files. All other parameters are still required such that PRSice can correctly locate the required input files for plotting. --plot-set The default behaviour of PRSet is to plot the bar-chart, high-resolution plot and quantile plot of the \"Base\" gene set, which consider all SNPs within the genome. By using the --plot-set option, you can plot the specific set of interest. --quantile | -q Number of quantiles to plot. No quantile plot will be generated when this is not provided. --quant-break Parameter to indicate an uneven distribution of quantile. Values represent the upperbound of each quantile group. e.g. With --quantile 10 --quant-break 1,5,10 , the quantiles will be grouped into \\(0\\lt Q \\le 1\\) , \\(1\\lt Q \\le 5\\) , \\(5\\lt Q \\le 10\\) Note To use --quant-break , you must set the correct amount of quantiles. For example, if the largest value in --quant-break is 100, then you must use --quantile 100 --quant-extract | -e File containing sample ID to be plot on a separated quantile e.g. extra quantile containing only schizophrenia samples. Must contain IID. Should contain FID if --ignore-fid isn't set. 
Note This will only work if the base and target has a different phenotype or if the target phenotype is quantitative --quant-ref Reference quantile for quantile plot. Default is number of quantiles divided by 2 Or in the event where --quant-break is used, represent the upper bound of the reference quantile --scatter-r2 When set, will change the y-axis of the high resolution scatter plot to R2 instead","title":"Plotting"},{"location":"command_detail/#miscellaneous","text":"--all-score Output PRS for ALL threshold. Warning This will generate a huge file --exclude File contains SNPs to be excluded from the analysis. Mutually exclusive from --extract --chr-id Try to construct an RS ID for SNP based on its chromosome, coordinate, effective allele and non-effective allele. For example, c:L-aBd is translated to: :-d This ID will always be used to represent SNPs on the target file, whereas for the base file, we will still prefer to use the column provided in the --snp parameter. SNPs in base file will only be represented by the --chr-id if the RS ID is not provided. --extract File contains SNPs to be included in the analysis. Mutually exclusive from --exclude --id-delim Delimiter used to concatinate FID and IID when performing ID matching. Especially useful for BGEN file processing --ignore-fid Ignore FID for all input. When this is set, first column of all file will be assume to be IID instead of FID --keep-ambig Keep ambiguous SNPs. PRSice will only perform dosage flipping but not strand flipping on ambiguous SNPs. e.g. If your base data contain A/T, with effective allele being A, and your target data is T/A with dosage of T, then PRSice will change the dosage in target to A/T with dosage of A. Only use this option when you are certain your base and target are on the same strand. --logit-perm When performing permutation on binary phenotypes, use logistic regression instead of linear regression. This will substantially slow down PRSice. 
Note One problem with using --logit-perm is that some of the permuted phenotypes might suffer from perfect separation. This leads to the GLM logistic model not being able to converge (thus terminating PRSice). If you encounter such a problem, you might want to exclude the --logit-perm option. In most cases, the p-value of the linear model should be similar to that of the logistic model --memory Maximum memory usage allowed. PRSice will try its best to honor this setting. For example, --memory 10Gb will restrict PRSice to use no more than 10Gb of memory. However, as we are not using a memory pool like PLINK, it is possible for PRSice to use more than the allowed amount. PRSice will mainly check the memory usage when: Perform Clumping Perform permutation analysis Perform set-based permutation --non-cumulate Calculate non-cumulative PRS. PRS will be reset to 0 for each new P-value threshold instead of adding up --out | -o Prefix for all file output. Note If multiple target phenotypes are included (e.g. using --pheno-col ), the phenotype will be appended to the output prefix If multiple gene sets are included, the name of the set will be appended to the output prefix (after the phenotype (if any)) --perm Number of permutations to perform. This will generate the empirical p-value. We recommend using a value larger than or equal to 10,000 Note When permutation is required, PRSice will perform the following operation Perform normal PRSice across all thresholds and obtain p-value of the most significant threshold Repeat PRSice analysis N times with permuted phenotype. Count the number of time where the p-value of the most significant threshold for the permuted --print-snp Print all SNPs that remain in the analysis after clumping is performed. For PRSet, 1 indicates the SNP falls within the gene set of interest and 0 otherwise. If only PRSice is performed, a single \"gene set\" called \"Base\" will be indicated with all entries marked as 1 --seed | -s Seed used for permutation. 
If not provided, system time will be used as seed. Providing a seed allows the same results to be generated when the same seed and input are used --thread | -n Number of threads to use Tip Maximum number of threads can be specified by using --thread max Note PRSice will limit the maximum number of threads used to the number of cores available on the system as detected by PRSice. --ultra Ultra aggressive memory management. Will store all genotypes in memory after clumping is performed. This will significantly speed up PRSice and PRSet at the expense of increased memory usage. --x-range Range of SNPs to be excluded from the whole analysis. It can either be a single bed file or a comma separated list of ranges. Range must be in the format of chr:start-end or chr:coordinate --help | -h Display the help messages","title":"Miscellaneous"},{"location":"compilation/","text":"Introduction Here is the guideline for anyone who might want to compile PRSice from source. Prerequisites For the C++ executable 1. GCC version 7 or higher (for C++17 support) 2. CMake version 3.1 or higher (Optional) 3. 
Git (Optional) Note Only the C++ executable need to be built Using CMake With CMake, you can simply do the following: git clone https://github.com/choishingwan/PRSice.git cd PRSice mkdir build cd build cmake ../ make Then the PRSice executable will be located within PRSice/bin If you don't have git installed, you can still do (remember to download eigen to lib ) curl https://codeload.github.com/choishingwan/PRSice/tar.gz/2.3.3 > PRSice.tar.gz tar -xvf PRSice.tar.gz cd PRSice-2.3.3 mkdir build cd build cmake ../ make Note The above procedure was not tested on Windows Without CMake Without CMake, you will need to first download the eigen library You can then do the following git clone https://github.com/choishingwan/PRSice.git cd PRSice g++ -std=c++17 -O3 -DNDEBUG -march=native -isystem lib -isystem ${PATH_TO_EIGEN} -I inc src/*.cpp -lpthread -lz -o PRSice Then PRSice will be located in the current directory Alternatively, if you don't have git installed, you can still do curl https://codeload.github.com/choishingwan/PRSice/tar.gz/2.3.3 > PRSice.tar.gz tar -xvf PRSice.tar.gz cd PRSice-2.3.3 g++ -std=c++17 -O3 -DNDEBUG -march=native -isystem lib -isystem ${PATH_TO_EIGEN} -I inc src/*.cpp -lpthread -lz -o PRSice Intel MKL If you know how to setup the Intel \\(\\circledR\\) MKL library, you can compile PRSice with it to speed up the processing speed. You can use this to help you with the linking.","title":"Compile from Source"},{"location":"compilation/#introduction","text":"Here is the guideline for anyone who might want to compile PRSice from source.","title":"Introduction"},{"location":"compilation/#prerequisites","text":"For the C++ executable 1. GCC version 7 or higher (for C++17 support) 2. CMake version 3.1 or higher (Optional) 3. 
Git (Optional) Note Only the C++ executable needs to be built","title":"Prerequisites"},{"location":"compilation/#using-cmake","text":"With CMake, you can simply do the following: git clone https://github.com/choishingwan/PRSice.git cd PRSice mkdir build cd build cmake ../ make Then the PRSice executable will be located within PRSice/bin If you don't have git installed, you can still do (remember to download eigen to lib ) curl https://codeload.github.com/choishingwan/PRSice/tar.gz/2.3.3 > PRSice.tar.gz tar -xvf PRSice.tar.gz cd PRSice-2.3.3 mkdir build cd build cmake ../ make Note The above procedure was not tested on Windows","title":"Using CMake"},{"location":"compilation/#without-cmake","text":"Without CMake, you will need to first download the eigen library You can then do the following git clone https://github.com/choishingwan/PRSice.git cd PRSice g++ -std=c++17 -O3 -DNDEBUG -march=native -isystem lib -isystem ${PATH_TO_EIGEN} -I inc src/*.cpp -lpthread -lz -o PRSice Then PRSice will be located in the current directory Alternatively, if you don't have git installed, you can still do curl https://codeload.github.com/choishingwan/PRSice/tar.gz/2.3.3 > PRSice.tar.gz tar -xvf PRSice.tar.gz cd PRSice-2.3.3 g++ -std=c++17 -O3 -DNDEBUG -march=native -isystem lib -isystem ${PATH_TO_EIGEN} -I inc src/*.cpp -lpthread -lz -o PRSice","title":"Without CMake"},{"location":"compilation/#intel-mkl","text":"If you know how to set up the Intel \\(\\circledR\\) MKL library, you can compile PRSice with it to speed up processing. You can use this to help you with the linking.","title":"Intel MKL"},{"location":"decisions/","text":"Introduction Here, we detail some of the decisions we made during the implementation of PRSice Support of BGEN v1.3 We have purposefully disabled support for BGEN v1.3 to avoid the inclusion of the zstd library. 
This is because - UKBB is v1.2 - We are not familiar with the licensing of zstd library (developed by facebook) Removal of PCA calculation The main goal of PRSice 2 is to support the polygenic score analysis on large scale data. With such data, it is very time consuming to perform the PCA, which requires specific algorithms such as those implemented in flashPCA . In order to support the in-place PCA calculation, not only would we have to implement the flashPCA algorithm, we would also need to implement pruning, which is required prior to PCA calculation. Due to the lack of manpower, we therefore decided that we will not implement the PCA calculation. Another reason is that we believe users should first examine the PCA results before directly applying them to the PRS analysis.","title":"Development Decisions"},{"location":"decisions/#introduction","text":"Here, we detail some of the decisions we made during the implementation of PRSice","title":"Introduction"},{"location":"decisions/#support-of-bgen-v13","text":"We have purposefully disabled support for BGEN v1.3 to avoid the inclusion of the zstd library. This is because - UKBB is v1.2 - We are not familiar with the licensing of zstd library (developed by facebook)","title":"Support of BGEN v1.3"},{"location":"decisions/#removal-of-pca-calculation","text":"The main goal of PRSice 2 is to support the polygenic score analysis on large scale data. With such data, it is very time consuming to perform the PCA, which requires specific algorithms such as those implemented in flashPCA . In order to support the in-place PCA calculation, not only would we have to implement the flashPCA algorithm, we would also need to implement pruning, which is required prior to PCA calculation. Due to the lack of manpower, we therefore decided that we will not implement the PCA calculation. 
Another reason is that we believe users should first examine the PCA results before directly applying them to the PRS analysis.","title":"Removal of PCA calculation"},{"location":"extra_steps/","text":"Introduction After installation of R, additional steps might be required for Mac and Windows users. Below are the instructions MAC Users Download and install the latest XQuartz This is because MAC no longer ships the X11 package which is required by R to perform plotting. Run xcode-select --install on your terminal This will install the required zlib package on your system, which is required by PRSice (for decompressing bgen files) Note For anyone with older Mac Versions (e.g. Mountain Lion or before), you should follow the guide here to install the required Command Line Tools . Window Users As installation of R does not automatically add it to the system path, one will need to type the full path of the R.exe and Rscript.exe in order to use PRSice. To avoid this complication, we can manually add the folder containing the R binary to the system path: For Windows 8 and 10 In Search, search for and then select: System (Control Panel) Click the Advanced system settings link Click the Advanced tab Click Environment Variables Under System Variables , select path (If you cannot find path, you can click new to make it) Click Edit Click Browse and select the location of the executable of R. If you use the default installation path, you can add C:\\Program Files\\R\\R-3.3.2\\bin , where (eg.) 3.3.2 is the version number. Some installations might also have an i386 and an x64 version and either one of those will work. For Windows 7 From the desktop, right click the Computer icon. Choose Properties from the context menu. Click the Advanced system settings link. Click Environment Variables . In the section System Variables , find the PATH environment variable and select it. Click Edit. In the Edit System Variable (or New System Variable ) window, add the location of the executable of R. 
If you use the default installation path, you can add C:\\Program Files\\R\\R-3.3.2\\bin , where (eg.) 3.3.2 is the version number. Some installations might also have an i386 and an x64 version and either one of those will work.","title":"Additional Steps for MAC and Window users"},{"location":"extra_steps/#introduction","text":"After installation of R, additional steps might be required for Mac and Windows users. Below are the instructions","title":"Introduction"},{"location":"extra_steps/#mac-users","text":"Download and install the latest XQuartz This is because MAC no longer ships the X11 package which is required by R to perform plotting. Run xcode-select --install on your terminal This will install the required zlib package on your system, which is required by PRSice (for decompressing bgen files) Note For anyone with older Mac Versions (e.g. Mountain Lion or before), you should follow the guide here to install the required Command Line Tools .","title":"MAC Users"},{"location":"extra_steps/#window-users","text":"As installation of R does not automatically add it to the system path, one will need to type the full path of the R.exe and Rscript.exe in order to use PRSice. To avoid this complication, we can manually add the folder containing the R binary to the system path:","title":"Window Users"},{"location":"extra_steps/#for-windows-8-and-10","text":"In Search, search for and then select: System (Control Panel) Click the Advanced system settings link Click the Advanced tab Click Environment Variables Under System Variables , select path (If you cannot find path, you can click new to make it) Click Edit Click Browse and select the location of the executable of R. If you use the default installation path, you can add C:\\Program Files\\R\\R-3.3.2\\bin , where (eg.) 3.3.2 is the version number. 
Some installations might also have an i386 and an x64 version and either one of those will work.","title":"For Windows 8 and 10"},{"location":"extra_steps/#for-windows-7","text":"From the desktop, right click the Computer icon. Choose Properties from the context menu. Click the Advanced system settings link. Click Environment Variables . In the section System Variables , find the PATH environment variable and select it. Click Edit. In the Edit System Variable (or New System Variable ) window, add the location of the executable of R. If you use the default installation path, you can add C:\\Program Files\\R\\R-3.3.2\\bin , where (eg.) 3.3.2 is the version number. Some installations might also have an i386 and an x64 version and either one of those will work.","title":"For Windows 7"},{"location":"faq/","text":"Frequently Asked Questions We will continue to update this list to address the more common questions. I've received the following error message, what should I do? GLM model did not converge! Please send me the DEBUG files This error message means that the logistic regression model cannot converge. This is usually caused by a small sample size or by a problem in the input file. You should first check the DEBUG file and see if that contains any NaN or Inf . These will likely be caused by un-quality controlled input, which can contain complete missingness. If that isn't the case, then you can load the DEBUG and DEBUG.y file into R and see if you can perform the logistic regression on the data (DEBUG.y is the y, whereas DEBUG contains the independent variables, including the intercept). My base/ target data do not contain the RS ID. Can I still use PRSice? As of version 2.3.x, PRSice now supports chromosome ID via the parameter --chr-id . 
--chr-id will automatically generate an ID for each of your SNPs based on user input string, some characters are reserved: c = chromosome l = coordinates a = effective allele b = non-effective allele So --chr-id c:l-ab will construct SNP ID as :- Note It is not case sensitive","title":"Frequently Asked Questions"},{"location":"faq/#frequently-asked-questions","text":"We will continue to update this list to address the more common questions. I've receive the following error message, what should I do? GLM model did not converge! Please send me the DEBUG files This error message means that the logistic regression model cannot converge. This is usually caused by small sample size or caused by problem in the input file. You should first check the DEBUG file and see if that contains any NaN or Inf . These will likely be caused by un-quality controlled input, which can contain complete missingness. If that isn't the case, then you can load the DEBUG and DEBUG.y file into R and see if you can perform the logistic regression on the data (DEBUG.y is the y, whereas DEBUG contains the independent variables, including the intercept). My base/ target data do not contain the RS ID. Can I still use PRSice? As of version 2.3.x, PRSice now support chromosome ID via the parameter --chr-id . --chr-id will automatically generate an ID for each of your SNPs based on user input string, some characters are reserved: c = chromosome l = coordinates a = effective allele b = non-effective allele So --chr-id c:l-ab will construct SNP ID as :- Note It is not case sensitive","title":"Frequently Asked Questions"},{"location":"prset_detail/","text":"Input Data MSigDB One simple way to obtain gene sets or pathway is through the MSigDB . After registration in here , you can download different gene sets curated by the Broad Institute. Alternatively, you can also generate your own gene sets in the GMT format: [Set A] [Gene 1] [Gene 2] ... [Set B] [Gene 1] [Gene 2] ... 
Sometimes, the MSigDB file might store the URL for the gene set on the second column [Set A] [url for set A] [Gene 1] [Gene 2] ... And PRSet can properly handle that. Gene GTF As the MSigDB file does not contain the genome boundary of the genes within the gene set, one must also provide a GTF file. A GTF file contains the genome boundary of the genetic elements within the human genome and PRSet can use the information from GTF to determine if a SNP falls within a specific gene. One can download the GTF file for Human (Genome build GRCh38.p7) here . PRSet will look for any regions with feature of exon , gene , protein_coding or CDS ( case sensitive ). Any genomic regions without these features will be ignored. Alternatively, you can specify the features using the --feature command. Note For those who are unfamiliar, different versions of the genome might differ slightly in their coordinates. Therefore it is vital to ensure all the files originate from the same genome build Bed Files In addition, PRSet also accepts bed file(s) as an input. Important A bed file must contain a minimum of 3 columns: chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671). chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99. PRSet will read in any number of bed files (comma separated) and use the file names as the name of the gene set. Note An annoying feature of the bed file is that its coordinates start at 0 whereas, for example, the plink formats start the coordinates at 1. So do remember to subtract 1 from the region start when you build your own bed file from scratch. 
SNP Set Files Finally, PRSet also allow SNP sets, input via the --snp-set option. Two different formats are allowed SNP list format, a file containing a single column of SNP ID. Name of the set will be the file name or can be provided using --snp-set File:Name MSigDB format: Each row represent a single SNP set with the first column containing the name of the SNP set. Clumping in PRSet In PRSice-2, clumping is performed to account for linkage disequilibrium (LD) between SNPs. However, when performing set based analysis, special care are required to perform clumping. Take the following as an example: Assume that: Light Blue fragments are the intergenic regions Dark Blue fragments are the genic regions Red fragments are the gene set regions SNPs are represented as thunder bolt, with the \"index\" SNP in clumping denoted by the green thunderbolt If we simply perform a genome wide clumping, we might remove all SNPs residing within the gene set of interest, reducing the signal: Therefore, to maximize signal within each gene set, we must perform clumping for each gene sets separately: this can be a tedious process and are prone to error. To speed up clumping, PRSice-2 adopt a \" capture the flag \" system. Each SNPs contains a flag to represent their gene set membership. If a SNP is a member for the set, it will have a flag of 1, otherwise it will have a flag of 0. For example: SNP Set A Set B Set C Set D SNP 1 1 0 1 1 SNP 2 0 0 1 1 SNP 3 1 1 0 1 If we use SNP 1 as the index SNP, then after clumping, we will have SNP Set A Set B Set C Set D SNP 1 1 0 1 1 SNP 2 0 0 0 0 SNP 3 0 1 0 0 which removes SNP 2, but will retain SNP 3. This allow us to achieve set based clumping by only performing a single pass genome wide clumping. P-value Threshold and Proxy Clumping Options Proxy PRSet One complication in PRSet is the definition of SNP membership. The default option of PRSet is to only include SNPs that are physically within the target region. 
However, it is also likely for SNPs outside the region to influence functions of the set. Therefore we provide the --proxy option. Essentially, this provide a soft cutoff to SNP membership. For example, when user define --proxy 0.8 , if LD between SNP A and SNP B is more than 0.8, then SNP A will be considered to be within the same regions as SNP B and vice versa. P-value thresholding By default, PRSet do not perform p-value thresholding and will simply calculate the set based PRS at P-value threshold of 1. This is because it is unclear whether the set is associated with the phenotype when the best-threshold contained only a small portion of SNPs within the gene sets. If you wish to perform p-value thresholding with PRSet, you will need to specify any of the parameters related to p-value thresholding, i.e. --interval , --lower , --upper , --fastscore or --bar-levels . Competitive P-value Calculation A challenge in Set base analysis is to obtain a competitive p-value, which indicates the level of enrichment, as opposed to the self-contained p-value which indicates the level of association. To obtain a competitive p-value, PRSet can perform a permutation analysis as follow Allocate SNPs to each gene sets Allocate SNPs to a background gene set if --full-back is specified, use the whole genome as the background if a background file is provided via the --background command, it will be used to construct the background set otherwise, will try to use the GTF file provided from --gtf command as the background (with feature filtering w.r.t --feature ) Perform set based clumping on all sets (including the background set) Obtain the p-value of association for the best threshold for each sets ( \\(P_{observed}\\) ) While PRSet allow one to perform p-value thresholding on the set scores, we recommend against it as it is difficult to interpret the result. 
Using an extreme example, if only one SNP is included in the best threshold for a set, should we really consider this single SNP as representative of the gene set? For each gene set with \\(N\\) post-clump SNPs Randomly select \\(N\\) post-clump SNPs from the background set and construct a null PRS Calculate the p-value of association of the null PRS to obtain a null P-value ( \\(P_{null}\\) ) Repeat 1-2 \\(M\\) times, where \\(M\\) can be set via --set-perm The competitive P-value is calculated as $$ \\text{Competitive-}P = \\frac{\\sum_{m=1}^{M}I(P_{null}\\lt P_{observed})+1}{M+1} $$ where \\(I(.)\\) is the indicator function and the sum runs over the \\(M\\) permutations. Computation Algorithm Due to the number of operations required, the set based permutation is extremely time consuming. To speed up the set based permutation, we noted that in regression, $$ Y\\sim X\\beta+C+\\epsilon $$ and \\[ X\\sim Y\\beta+C+\\epsilon \\] will generate the same t-statistic for \\(X\\) in the first equation and \\(Y\\) in the second equation. Based on this observation, we can then do the following Generate a matrix \\(A\\) containing the phenotype of interest and the covariates Decompose matrix \\(A\\) For each new PRS calculated, solve \\(PRS=A\\beta+\\epsilon\\) and obtain the t-statistic. These t-statistics are then used to construct the null distribution, allowing us to obtain the competitive p-value As we only need to do the decomposition once, this should significantly increase the speed of set based permutation. 
In our test, for the TOY data, with --set-perm 5000 , we can speed up the set-based permutation by around 20~25% Note With binary traits, unless --logit-perm is set, we will still perform linear regression as we assume linear regression and logistic regression should produce similar t-statistics Output Data PRS model-fit A file containing the PRS model fit across thresholds is named [Name].prsice , where [Name] is the output prefix name as specified by --out this is stored as Name of Set, Threshold, R2, P-value, Coefficient, Standard Error, and Number of SNPs at this threshold Scores for each individual A file containing PRS for each individual at the best-fit PRS named [Name].best is provide. This file has the format of: FID,IID, In Regression, PRS at best threshold for Set 1, PRS at best threshold for Set 2, ... Where the has phenotype column indicate whether the sample contain all the required phenotype for PRSice analysis (e.g. Samples with missing phenotype/covariate will not be included in the regression. These samples will be indicated as \"No\" under the in regression column) If --all option is used, a file named [Name].all.score is also generated Please note, if --all options is used, the PRS for each individual at all threshold will be given. In the event where the target sample size is large and a lot of threshold are tested, this file can be large. This is especially true when large number of gene sets were provided. Note PRSice also supports multiple phenotypes for target data. All output prefix will change to [Name].[Pheno] where [Pheno] is the name of the phenotype. For more details on the options used to implement this, see here . Summary Information Information of the best model fit of each phenotype and gene set is stored in [Name].summary. The summary file contain the following fields: Phenotype - Name of Phenotype Set - Name of Gene Set Threshold - Best P-value Threshold PRS.R2 - Variance explained by the PRS. 
If prevalence is provided, this will be adjusted for ascertainment Full.R2 - Variance explained by the full model (including the covariates). If prevalence is provided, this will be adjusted for ascertainment Null.R2 - Variance explained by the covariates. If prevalence is provided, this will be adjusted for ascertainment Prevalence - Population prevalence as indicated by the user. \"-\" if not provided. Coefficient - Regression coefficient of the model. Can provide insight of the direction of effect. P - P value of the model fit Num_SNP - Number of SNPs included in the model Empirical-P - Only provided if permutation is performed. This is the empirical p-value and should account for multiple testing and over-fitting Competitive-P - Only provided if set permutation is performed. This is the competitive p-value and should measure the enrichment of signal of the gene set Multi-Set Plot When the --multi-plot option is set, the results of the top N gene sets will be plotted. An example of the multi-set plot is: Other Figures The default behaviour of PRSet is to only plot the High-resolution plot, bar-plot and the quantile plot for the \"Base\" data. You can change this behaviour by using the --plot-set option. Log File We value reproducible research. Therefore we try our best to make replicating PRSice run easier. For every PRSice run, a log file named [Name].log is generated which contain the all the commands used for the analysis and information regarding filtering, field selected etc. This also allow users to quickly identify problems in the input dataset.","title":"PRSet"},{"location":"prset_detail/#input-data","text":"","title":"Input Data"},{"location":"prset_detail/#msigdb","text":"One simple way to obtain gene sets or pathway is through the MSigDB . After registration in here , you can download different gene sets curated by the Broad Institute. Alternatively, you can also generate your own gene sets in the GMT format: [Set A] [Gene 1] [Gene 2] ... 
[Set B] [Gene 1] [Gene 2] ... Sometimes, the MSigDB file might store the URL for the gene set in the second column [Set A] [url for set A] [Gene 1] [Gene 2] ... PRSet can handle that properly.","title":"MSigDB"},{"location":"prset_detail/#gene-gtf","text":"As the MSigDB file does not contain the genome boundaries of the genes within the gene set, one must also provide a GTF file. A GTF file contains the genome boundaries of the genetic elements within the human genome and PRSet can use the information from the GTF to determine if a SNP falls within a specific gene. One can download the GTF file for Human (Genome build GRCh38.p7) here . PRSet will look for any regions with feature of exon , gene , protein_coding or CDS ( case sensitive ). Any genomic regions without these features will be ignored. Alternatively, you can specify the features using the --feature command. Note For those who are unfamiliar, different versions of the genome might differ slightly in their coordinates. Therefore it is vital to ensure all the files originate from the same genome build","title":"Gene GTF"},{"location":"prset_detail/#bed-files","text":"In addition, PRSet also accepts bed file(s) as an input. Important A bed file must contain a minimum of 3 columns: chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671). chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99. PRSet will read in any number of bed files (comma separated) and use the file names as the name of the gene set. Note An annoying feature of the bed format is that coordinates start at 0 whereas, for example, the PLINK formats start the coordinates at 1.
So do remember to -1 from the region start when you build your own bed file from scratch.","title":"Bed Files"},{"location":"prset_detail/#snp-set-files","text":"Finally, PRSet also allow SNP sets, input via the --snp-set option. Two different formats are allowed SNP list format, a file containing a single column of SNP ID. Name of the set will be the file name or can be provided using --snp-set File:Name MSigDB format: Each row represent a single SNP set with the first column containing the name of the SNP set.","title":"SNP Set Files"},{"location":"prset_detail/#clumping-in-prset","text":"In PRSice-2, clumping is performed to account for linkage disequilibrium (LD) between SNPs. However, when performing set based analysis, special care are required to perform clumping. Take the following as an example: Assume that: Light Blue fragments are the intergenic regions Dark Blue fragments are the genic regions Red fragments are the gene set regions SNPs are represented as thunder bolt, with the \"index\" SNP in clumping denoted by the green thunderbolt If we simply perform a genome wide clumping, we might remove all SNPs residing within the gene set of interest, reducing the signal: Therefore, to maximize signal within each gene set, we must perform clumping for each gene sets separately: this can be a tedious process and are prone to error. To speed up clumping, PRSice-2 adopt a \" capture the flag \" system. Each SNPs contains a flag to represent their gene set membership. If a SNP is a member for the set, it will have a flag of 1, otherwise it will have a flag of 0. For example: SNP Set A Set B Set C Set D SNP 1 1 0 1 1 SNP 2 0 0 1 1 SNP 3 1 1 0 1 If we use SNP 1 as the index SNP, then after clumping, we will have SNP Set A Set B Set C Set D SNP 1 1 0 1 1 SNP 2 0 0 0 0 SNP 3 0 1 0 0 which removes SNP 2, but will retain SNP 3. 
This allow us to achieve set based clumping by only performing a single pass genome wide clumping.","title":"Clumping in PRSet"},{"location":"prset_detail/#p-value-threshold-and-proxy-clumping","text":"","title":"P-value Threshold and Proxy Clumping"},{"location":"prset_detail/#options","text":"","title":"Options"},{"location":"prset_detail/#proxy-prset","text":"One complication in PRSet is the definition of SNP membership. The default option of PRSet is to only include SNPs that are physically within the target region. However, it is also likely for SNPs outside the region to influence functions of the set. Therefore we provide the --proxy option. Essentially, this provide a soft cutoff to SNP membership. For example, when user define --proxy 0.8 , if LD between SNP A and SNP B is more than 0.8, then SNP A will be considered to be within the same regions as SNP B and vice versa.","title":"Proxy PRSet"},{"location":"prset_detail/#p-value-thresholding","text":"By default, PRSet do not perform p-value thresholding and will simply calculate the set based PRS at P-value threshold of 1. This is because it is unclear whether the set is associated with the phenotype when the best-threshold contained only a small portion of SNPs within the gene sets. If you wish to perform p-value thresholding with PRSet, you will need to specify any of the parameters related to p-value thresholding, i.e. --interval , --lower , --upper , --fastscore or --bar-levels .","title":"P-value thresholding"},{"location":"prset_detail/#competitive-p-value-calculation","text":"A challenge in Set base analysis is to obtain a competitive p-value, which indicates the level of enrichment, as opposed to the self-contained p-value which indicates the level of association. 
To obtain a competitive p-value, PRSet can perform a permutation analysis as follows Allocate SNPs to each gene set Allocate SNPs to a background gene set if --full-back is specified, use the whole genome as the background if a background file is provided via the --background command, it will be used to construct the background set otherwise, will try to use the GTF file provided from --gtf command as the background (with feature filtering w.r.t --feature ) Perform set based clumping on all sets (including the background set) Obtain the p-value of association for the best threshold for each set ( \\(P_{observed}\\) ) While PRSet allows one to perform p-value thresholding on the set scores, we recommend against it as it is difficult to interpret the result. Using an extreme example, if only one SNP is included in the best threshold for a set, should we really consider this single SNP as representative of the gene set? For each gene set with \\(N\\) post-clump SNPs Randomly select \\(N\\) post-clump SNPs from the background set and construct a null PRS Calculate the p-value of association of the null PRS to obtain a null P-value ( \\(P_{null}\\) ) Repeat 1-2 \\(M\\) times, where \\(M\\) can be set via --set-perm The competitive P-value is calculated as $$ \\text{Competitive-}P = \\frac{\\sum_{m=1}^{M}I(P_{null}\\lt P_{observed})+1}{M+1} $$ where \\(I(.)\\) is the indicator function.","title":"Competitive P-value Calculation"},{"location":"prset_detail/#computation-algorithm","text":"Due to the number of operations required, set based permutation is extremely time consuming. To speed up the set based permutation, we note that in regression, $$ Y\\sim X\\beta+C+\\epsilon $$ and \\[ X\\sim Y\\beta+C+\\epsilon \\] will generate the same t-statistic for \\(X\\) in the first equation and \\(Y\\) in the second equation.
Based on this observation, we can then do the following Generate a matrix \\(A\\) containing the phenotype of interest and the covariates Decompose matrix \\(A\\) For each new PRS calculated, solve \\(PRS=A\\beta+\\epsilon\\) and obtain the t-statistic. These t-statistics are then used to construct the null distribution, allow us to obtain the competitive p-value As we only need to do the decomposition once, this should significantly increase the speed of set based permutation. In our test, for the TOY data, with --set-perm 5000 , we can speed up the set-based permutation by around 20~25% Note With binary traits, unless --logit-perm is set, we will still perform linear regression as we assume linear regression and logistic regression should produce similar t-statistics","title":"Computation Algorithm"},{"location":"prset_detail/#output-data","text":"","title":"Output Data"},{"location":"prset_detail/#prs-model-fit","text":"A file containing the PRS model fit across thresholds is named [Name].prsice , where [Name] is the output prefix name as specified by --out this is stored as Name of Set, Threshold, R2, P-value, Coefficient, Standard Error, and Number of SNPs at this threshold","title":"PRS model-fit"},{"location":"prset_detail/#scores-for-each-individual","text":"A file containing PRS for each individual at the best-fit PRS named [Name].best is provide. This file has the format of: FID,IID, In Regression, PRS at best threshold for Set 1, PRS at best threshold for Set 2, ... Where the has phenotype column indicate whether the sample contain all the required phenotype for PRSice analysis (e.g. Samples with missing phenotype/covariate will not be included in the regression. These samples will be indicated as \"No\" under the in regression column) If --all option is used, a file named [Name].all.score is also generated Please note, if --all options is used, the PRS for each individual at all threshold will be given. 
In the event where the target sample size is large and a lot of threshold are tested, this file can be large. This is especially true when large number of gene sets were provided. Note PRSice also supports multiple phenotypes for target data. All output prefix will change to [Name].[Pheno] where [Pheno] is the name of the phenotype. For more details on the options used to implement this, see here .","title":"Scores for each individual"},{"location":"prset_detail/#summary-information","text":"Information of the best model fit of each phenotype and gene set is stored in [Name].summary. The summary file contain the following fields: Phenotype - Name of Phenotype Set - Name of Gene Set Threshold - Best P-value Threshold PRS.R2 - Variance explained by the PRS. If prevalence is provided, this will be adjusted for ascertainment Full.R2 - Variance explained by the full model (including the covariates). If prevalence is provided, this will be adjusted for ascertainment Null.R2 - Variance explained by the covariates. If prevalence is provided, this will be adjusted for ascertainment Prevalence - Population prevalence as indicated by the user. \"-\" if not provided. Coefficient - Regression coefficient of the model. Can provide insight of the direction of effect. P - P value of the model fit Num_SNP - Number of SNPs included in the model Empirical-P - Only provided if permutation is performed. This is the empirical p-value and should account for multiple testing and over-fitting Competitive-P - Only provided if set permutation is performed. This is the competitive p-value and should measure the enrichment of signal of the gene set","title":"Summary Information"},{"location":"prset_detail/#multi-set-plot","text":"When the --multi-plot option is set, the results of the top N gene sets will be plotted. 
An example of the multi-set plot is:","title":"Multi-Set Plot"},{"location":"prset_detail/#other-figures","text":"The default behaviour of PRSet is to only plot the High-resolution plot, bar-plot and the quantile plot for the \"Base\" data. You can change this behaviour by using the --plot-set option.","title":"Other Figures"},{"location":"prset_detail/#log-file","text":"We value reproducible research. Therefore we try our best to make replicating PRSice run easier. For every PRSice run, a log file named [Name].log is generated which contain the all the commands used for the analysis and information regarding filtering, field selected etc. This also allow users to quickly identify problems in the input dataset.","title":"Log File"},{"location":"quick_start/","text":"Preparation Before performing PRSice, quality control should be performed on the target samples. See here for an example. Input PRSice.R file: A wrapper for the PRSice executable and for plotting PRSice executable file: Perform all analysis except plotting Base data set: GWAS summary results, which the PRS is based on Target data set: Raw genotype data of target phenotype . Can be in the form of PLINK binary or BGEN Running PRSice In most case, assuming the PRSice executable is located in ($HOME)/PRSice/ and the working directory is ($HOME)/PRSice , you can run PRSice with the following commands: Note For window users, please use Rscript.exe instead of Rscript Important Do not copy codes to Microsoft Word. Word has a tendency to change characters from codes into special characters that cannot be recognized by the terminal Binary Traits For binary traits, the following command can be used (commands specific to binary traits are highlighted in yellow) Unix Rscript PRSice.R --dir . \\ --prsice ./PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --thread 1 \\ --stat OR \\ --binary-target T Windows Rscript.exe PRSice.R --dir . 
^ --prsice ./PRSice.exe ^ --base TOY_BASE_GWAS.assoc ^ --target TOY_TARGET_DATA ^ --thread 1 ^ --stat OR ^ --binary-target T Quantitative Traits For quantitative traits, the following can be used instead (commands specific to quantitative traits are highlighted in yellow) Unix Rscript PRSice.R --dir . \\ --prsice ./PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --thread 1 \\ --stat BETA \\ --beta \\ --binary-target F Windows Rscript.exe PRSice.R --dir . ^ --prsice ./PRSice.exe ^ --base TOY_BASE_GWAS.assoc ^ --target TOY_TARGET_DATA ^ --thread 1 ^ --stat BETA ^ --beta ^ --binary-target F Note If the type of effect ( --stat ) or data type ( --binary-target ) was not specified, PRSice will try to determine this information based on the header of the base file: When BETA (case insensitive) is found in the header and --stat was not provided, --beta will be added to the command, and if --binary-target was not provided, --binary-target F will be added to the command When OR (case insensitive) is found in the header and --stat was not provided, --or will be added to the command, and if --binary-target was not provided, --binary-target T will be added to the command PRSice cannot determine the type of effect / data type if the base file contains both OR and BETA PRSice will detail all effective options in its log file. Quality Control of Target Samples Quality control can be performed on the target samples using PLINK. A good starting point is (assume ($target) is the prefix of the target binary file) Unix plink --bfile ($target) \\ --maf 0.05 \\ --mind 0.1 \\ --geno 0.1 \\ --hwe 1e-6 \\ --make-just-bim \\ --make-just-fam \\ --out ($target).qc Windows plink.exe --bfile ($target) ^ --maf 0.05 ^ --mind 0.1 ^ --geno 0.1 ^ --hwe 1e-6 ^ --make-just-bim ^ --make-just-fam ^ --out ($target).qc Then, --keep ($target).qc.fam --extract ($target).qc.bim can be added to the PRSice command to filter out the samples and SNPs.
You can refer to Marees et al (2018) for a more detail guide. You can also find our PRS tutorial here .","title":"PRSice"},{"location":"quick_start/#preparation","text":"Before performing PRSice, quality control should be performed on the target samples. See here for an example.","title":"Preparation"},{"location":"quick_start/#input","text":"PRSice.R file: A wrapper for the PRSice executable and for plotting PRSice executable file: Perform all analysis except plotting Base data set: GWAS summary results, which the PRS is based on Target data set: Raw genotype data of target phenotype . Can be in the form of PLINK binary or BGEN","title":"Input"},{"location":"quick_start/#running-prsice","text":"In most case, assuming the PRSice executable is located in ($HOME)/PRSice/ and the working directory is ($HOME)/PRSice , you can run PRSice with the following commands: Note For window users, please use Rscript.exe instead of Rscript Important Do not copy codes to Microsoft Word. Word has a tendency to change characters from codes into special characters that cannot be recognized by the terminal","title":"Running PRSice"},{"location":"quick_start/#binary-traits","text":"For binary traits, the following command can be used (commands specific to binary traits are highlighted in yellow) Unix Rscript PRSice.R --dir . \\ --prsice ./PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --thread 1 \\ --stat OR \\ --binary-target T Windows Rscript.exe PRSice.R --dir . ^ --prsice ./PRSice.exe ^ --base TOY_BASE_GWAS.assoc ^ --target TOY_TARGET_DATA ^ --thread 1 ^ --stat OR ^ --binary-target T","title":"Binary Traits"},{"location":"quick_start/#quantitative-traits","text":"For quantitative traits, the following can be used instead (commands specific to quantitative traits are highlighted in yellow) Unix Rscript PRSice.R --dir . 
\\ --prsice ./PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --thread 1 \\ --stat BETA \\ --beta \\ --binary-target F Windows Rscript.exe PRSice.R --dir . ^ --prsice ./PRSice.exe ^ --base TOY_BASE_GWAS.assoc ^ --target TOY_TARGET_DATA ^ --thread 1 ^ --stat BETA ^ --beta ^ --binary-target F Note If the type of effect ( --stat ) or data type ( --binary-target ) was not specified, PRSice will try to determine this information based on the header of the base file: When BETA (case insensitive) is found in the header and --stat was not provided, --beta will be added to the command, and if --binary-target was not provided, --binary-target F will be added to the command When OR (case insensitive) is found in the header and --stat was not provided, --or will be added to the command, and if --binary-target was not provided, --binary-target T will be added to the command PRSice cannot determine the type of effect / data type if the base file contains both OR and BETA PRSice will detail all effective options in its log file.","title":"Quantitative Traits"},{"location":"quick_start/#quality-control-of-target-samples","text":"Quality control can be performed on the target samples using PLINK. A good starting point is (assume ($target) is the prefix of the target binary file) Unix plink --bfile ($target) \\ --maf 0.05 \\ --mind 0.1 \\ --geno 0.1 \\ --hwe 1e-6 \\ --make-just-bim \\ --make-just-fam \\ --out ($target).qc Windows plink.exe --bfile ($target) ^ --maf 0.05 ^ --mind 0.1 ^ --geno 0.1 ^ --hwe 1e-6 ^ --make-just-bim ^ --make-just-fam ^ --out ($target).qc Then, --keep ($target).qc.fam --extract ($target).qc.bim can be added to the PRSice command to filter out the samples and SNPs. You can refer to Marees et al (2018) for a more detailed guide.
You can also find our PRS tutorial here .","title":"Quality Control of Target Samples"},{"location":"quick_start_prset/","text":"Background A new feature of PRSice is the ability to perform set-based/pathway-based analysis. This new feature is called PRSet. A paper on PRSet is currently under preparation. Important PRSet is currently under active development. Preparation PRSet is based on PRSice , with additional input requirements Input PRSice.R file : A wrapper for the PRSice binary and for plotting PRSice binary file : Perform all analysis except plotting Base data set : GWAS summary results, which the PRS is based on Target data set : Raw genotype data of \"target phenotype\". Can be in the form of PLINK binary or BGEN PRSet Specific Input Bed file(s) : Bed file(s) containing the regions of genes within a gene set; or MSigDB file : File containing the name of each gene set and the IDs of genes within the gene set on each individual line. If MSigDB is provided, a GTF file is required. GTF file : A file containing the genome boundary of each individual gene SNP file : A file containing SNPs constituting the gene set of interest. Can be in MSigDB (gmt) format or a file containing a single column of SNP IDs. Running PRSet In most cases, assuming the PRSice binary is located in ($HOME)/PRSice/bin/ and the working directory is ($HOME)/PRSice , you can run PRSet with the following commands: With MSigDB data Assuming a MSigDB file ( set.txt ) is downloaded and a gene gtf file (gene.gtf) from Ensembl is available, PRSet can then be performed using: Rscript PRSice.R \\ --prsice ./bin/PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --binary-target T \\ --thread 1 \\ --gtf gene.gtf \\ --msigdb set.txt \\ --multi-plot 10 This will perform PRSet analysis and generate the multi-set plot with the top 10 gene sets With Bed Files Alternatively, if a list of bed files is available, e.g.
A.bed,B.bed , PRSet can be performed by running Rscript PRSice.R \\ --prsice ./bin/PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --binary-target T \\ --thread 1 \\ --bed A.bed:SetA,B.bed \\ --multi-plot 10 Note Both bed and GTF+MSigDB input can be used together Tips Name of the set will be the bed file name or can be provided using --bed File:Name With SNP Set Finally, if you want to construct sets based on a list of SNPs, you can use --snp-set : Rscript PRSice.R \\ --prsice ./bin/PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --binary-target T \\ --thread 1 \\ --snp-set A.snp:A,B.snp \\ --multi-plot 10 Two different format are allowed: SNP Set list format: A file containing a single column of SNP ID. Name of the set will be the file name or can be provided using --snp-set File:Name MSigDB format: Each row represent a single SNP set with the first column containing the name of the SNP set.","title":"PRSet"},{"location":"quick_start_prset/#background","text":"A new feature of PRSice is the ability to perform set base/pathway based analysis. This new feature is called PRSet. Paper on PRSet currently under preparation. Important PRSet is currently under active development.","title":"Background"},{"location":"quick_start_prset/#preparation","text":"PRSet is based on PRSice , with additional input requirements","title":"Preparation"},{"location":"quick_start_prset/#input","text":"PRSice.R file : A wrapper for the PRSice binary and for plotting PRSice binary file : Perform all analysis except plotting Base data set : GWAS summary results, which the PRS is based on Target data set : Raw genotype data of \"target phenotype\". 
Can be in the form of PLINK binary or BGEN","title":"Input"},{"location":"quick_start_prset/#prset-specific-input","text":"Bed file(s) : Bed file(s) containing the regions of genes within a gene set; or MSigDB file : File containing the name of each gene set and the IDs of genes within the gene set on each individual line. If MSigDB is provided, a GTF file is required. GTF file : A file containing the genome boundary of each individual gene SNP file : A file containing SNPs constituting the gene set of interest. Can be in MSigDB (gmt) format or a file containing a single column of SNP IDs.","title":"PRSet Specific Input"},{"location":"quick_start_prset/#running-prset","text":"In most cases, assuming the PRSice binary is located in ($HOME)/PRSice/bin/ and the working directory is ($HOME)/PRSice , you can run PRSet with the following commands:","title":"Running PRSet"},{"location":"quick_start_prset/#with-msigdb-data","text":"Assuming a MSigDB file ( set.txt ) is downloaded and a gene gtf file (gene.gtf) from Ensembl is available, PRSet can then be performed using: Rscript PRSice.R \\ --prsice ./bin/PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --binary-target T \\ --thread 1 \\ --gtf gene.gtf \\ --msigdb set.txt \\ --multi-plot 10 This will perform PRSet analysis and generate the multi-set plot with the top 10 gene sets","title":"With MSigDB data"},{"location":"quick_start_prset/#with-bed-files","text":"Alternatively, if a list of bed files is available, e.g.
A.bed,B.bed , PRSet can be performed by running Rscript PRSice.R \\ --prsice ./bin/PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --binary-target T \\ --thread 1 \\ --bed A.bed:SetA,B.bed \\ --multi-plot 10 Note Both bed and GTF+MSigDB input can be used together Tips Name of the set will be the bed file name or can be provided using --bed File:Name","title":"With Bed Files"},{"location":"quick_start_prset/#with-snp-set","text":"Finally, if you want to construct sets based on a list of SNPs, you can use --snp-set : Rscript PRSice.R \\ --prsice ./bin/PRSice \\ --base TOY_BASE_GWAS.assoc \\ --target TOY_TARGET_DATA \\ --binary-target T \\ --thread 1 \\ --snp-set A.snp:A,B.snp \\ --multi-plot 10 Two different format are allowed: SNP Set list format: A file containing a single column of SNP ID. Name of the set will be the file name or can be provided using --snp-set File:Name MSigDB format: Each row represent a single SNP set with the first column containing the name of the SNP set.","title":"With SNP Set"},{"location":"resources/","text":"Introduction PRSice relies on a number of open source projects to achieve the current performance. We also used algorithm found in other projects and translate them into C++ code for our own use. Below are number of projects we relies on Open source projects Project Developer(s) Description PLINK 2 Christopher Chang Provide the backbone of the clumping algorithm and PRS calculation BGEN lib Gavin Band Provide API to handle BGEN files. 
Slight modifications were made to accommodate PRSice's usage Eigen C++ Ga\u00ebl Guennebaud and Beno\u00eet Jacob and others For all matrix algebra gzstream Deepak Bandyopadhyay and Lutz Kettner For reading gz files fastglm Jared Huling, Douglas Bates, Dirk Eddelbuettel, Romain Francois and Yixuan Qiu Basis of our glm class RcppEigen Douglas Bates, Dirk Eddelbuettel, Romain Francois, and Yixuan Qiu Provide the fastlm algorithm","title":"Useful Resources"},{"location":"resources/#introduction","text":"PRSice relies on a number of open source projects to achieve its current performance. We also took algorithms found in other projects and translated them into C++ code for our own use. Below are a number of projects we rely on","title":"Introduction"},{"location":"resources/#open-source-projects","text":"Project Developer(s) Description PLINK 2 Christopher Chang Provide the backbone of the clumping algorithm and PRS calculation BGEN lib Gavin Band Provide API to handle BGEN files. Slight modifications were made to accommodate PRSice's usage Eigen C++ Ga\u00ebl Guennebaud and Beno\u00eet Jacob and others For all matrix algebra gzstream Deepak Bandyopadhyay and Lutz Kettner For reading gz files fastglm Jared Huling, Douglas Bates, Dirk Eddelbuettel, Romain Francois and Yixuan Qiu Basis of our glm class RcppEigen Douglas Bates, Dirk Eddelbuettel, Romain Francois, and Yixuan Qiu Provide the fastlm algorithm","title":"Open source projects"},{"location":"step_by_step/","text":"Background You will need to have a basic understanding of Genome Wide Association Studies (GWAS) in order to be able to perform Polygenic risk score (PRS) analyses. If you are unfamiliar with GWAS, you can consider reading this paper . Input Data Here, we briefly discuss different input files required by PRSice: Base Dataset Base (i.e. GWAS) data must be provided as a whitespace delimited file containing association analysis results for SNPs on the base phenotype.
PRSice has no problem reading in a gzipped base file (need to have a .gz suffix). If PLINK output is used, then please make sure there is a column for the effective allele (A1) and specify it with --A1 option. If your base data follows other formats, then the column headers can be provided using the --chr , --A1 , --A2 , --stat , --snp , --bp , --pvalue options Important PRSice requires the base file to contain information of the effective allele ( --A1 ), effect size estimates ( --stat ), p-value for association ( --pvalue ), and the SNP ID ( --snp ). If the input file does not contain a column header, the column can be specified using their index (start counting from 0) with the --index flag. For example, with the following input format: SNP CHR BP A1 A2 OR SE P rs3094315 1 752566 A G 0.9912 0.0229 0.7009 rs3131972 1 752721 A G 1.007 0.0228 0.769 rs3131971 1 752894 T C 1.003 0.0232 0.8962 the parameters can either be --snp SNP --chr CHR --bp BP --A1 A1 --A2 A2 --stat OR --pvalue P or --snp 0 --chr 1 --bp 2 --A1 3 --A2 4 --stat 5 --pvalue 7 --index Strand flips are automatically detected and accounted for. If an imputation info score or the minor allele frequencies (MAF) are also included in the file, --base-info : and --base-maf : can be used to filter SNPs based on their INFO score and MAF respectively. For binary trait base file, SNPs can be filtered according to the MAF in case and control separately using --base-maf :,: By default, PRSice will look for the following column names automatically from the base file header if --index was not provided or if the column name of the specific arguement(s) were not provided: CHR, BP, A1, A2, SNP, P, INFO (case sensitive) and OR / BETA (case insensitive) --no-default can be used to disable all the defaults of PRSice. Note PRSice will ignore any columns that were not found in the base file (e.g. 
If --A2 B is specified but none of the column headers is B , then PRSice will treat it as if no A2 information is present) Target Dataset Currently two different target file formats are supported by PRSice: PLINK Binary A target dataset in PLINK binary format must consist of three files: .bed , .bim , and a .fam file - where bed contains the compressed genotype data, bim contains the SNP information and fam contains the family information. Currently only the SNP major PLINK format is supported (default output of the latest PLINK program). The .bed and .bim files must have the same prefix. If the .fam file follows a different prefix from the .bed and .bim files, it can be specified using --target , Warning The fam file MUST contain the correct number of samples or PRSice will crash Missing phenotype data can be coded as NA or -9 for binary traits, and as NA for quantitative traits. Note -9 will NOT be considered as missing for quantitative traits If the binary file is separated into individual chromosomes, then a # can be used to specify the location of the chromosome number in the file name. PRSice will automatically substitute # with 1-22 i.e. If the files are chr1. ,chr2. ,...,chr22. , just use --target chr# Note Chromosome number substitution will not be performed on the external fam file as the fam file should be the same for all chromosomes. Alternatively, if your PLINK files do not have a unified prefix, you can use --target-list to provide a file containing all prefixes to PRSice. Note .pgen files are not currently supported BGEN PRSice currently supports BGEN v1.1 and v1.2. To specify a BGEN file, simply add --type bgen or --ld-type bgen to the PRSice command Note In theory, we can support BGEN v1.3, but that will require us to include the zstd library, developed by Facebook. You can enable the support by including the zstd library and changing the bgen_lib files.
As BGEN does not store the phenotype information and sometimes not even the sample ID, you must provide a phenotype file ( --pheno ). Alternatively, if you have a sample file containing the phenotype information, you can provide it with --target , Note The sample file is required even if --no-regress is set as the sample ID is required for output. This requirement might be loosened in future versions With BGEN input, a number of PRSice options become effective: --hard : Normally, with BGEN format, PRS is calculated using the dosage information. But hard-thresholding can be performed by using the --hard option. SNPs will then be coded as the genotype (0, 1 or 2) and filtered according to the threshold set by --hard-thres . If no such genotype is present, the SNP will be coded as missing --hard-thres : A hardcall is saved when the distance to the nearest hardcall is less than the hardcall threshold. See here for more detail. To perform clumping on a BGEN file, we need to repeatedly decompress the genotype dosages and convert them into PLINK binary format. To speed up the clumping process, you can allow PRSice to generate a large intermediate file containing the hard coded genotypes in PLINK binary format by using the --allow-inter option. Phenotype files An external phenotype file can be provided to PRSice using the --pheno parameter. This must be a tab / space delimited file and missing data must be represented by either NA or -9 (only for binary traits). The first two columns of the phenotype file should be the FID and the IID, or when --ignore-fid is set, the first column should be the IID. The rest of the columns can be the phenotype(s). To specify a trait within the phenotype file, the column name for the trait can be specified using --pheno-col , provided that the phenotype file contains a header. Multiple column names can be provided via a comma separated list: e.g. --pheno-col A,B,C,D . Trait(s) not found within the phenotype file will be automatically skipped.
Important The column name(s) should not contain spaces or commas Note When more than one trait is provided, the column name will be appended to the output prefix. LD reference When the target sample is small (e.g. < 500 samples), an external reference panel can be used to improve the LD estimation for clumping. The LD reference follows the same conventions as the target dataset. Simply use --ld to specify your LD reference panel file and --ld-type to specify the format When an LD reference file is not provided and --no-clump is not specified, the target file will be used as the LD reference panel Important Any parameters with the --ld prefix will only work on the file specified by the --ld parameter. That is, if an LD reference file is not provided, none of the --ld-* options will be used. If a different set of filtering is to be performed on the target file when performing LD calculation, the file must be provided separately to the --ld parameter e.g. --target --ld --keep --ld-keep Note BGEN files will always be hard coded when used to estimate the LD Clumping By default, PRSice will perform clumping to remove SNPs that are in LD with each other. Similar to PLINK, the \\(r^2\\) values computed by PRSice are based on maximum likelihood haplotype frequency estimates. Both cases and controls are included in the LD calculation. Alternatively, a combination of --ld and --ld-keep / --ld-remove can be used to restrict the LD calculation to control samples. Clumping parameters can be changed by using the --clump-kb , --clump-r2 and --clump-p options. Clumping can be disabled using --no-clump PRS calculation PRSice allows different genetic models to be specified (e.g.
add, dom, het, rec), and the polygenic score for each of them is calculated differently Assuming \\(S\\) is the summary statistic for the effective allele and \\(G\\) is the number of effective alleles observed, then the main difference between the models is how the genotypes are coded: For additive model (add) \\[ G = G \\] For dominant model (with respect to the effective allele of the base file) \\[ G = \\begin{cases} 0 & \\text{if $G$ = 0} \\\\ 1 & \\text{otherwise} \\end{cases} \\] For recessive model (with respect to the effective allele of the base file) \\[ G = \\begin{cases} 1 & \\text{if $G$ = 2} \\\\ 0 & \\text{otherwise} \\end{cases} \\] For heterozygous model \\[ G = \\begin{cases} 1 & \\text{if $G$ = 1} \\\\ 0 & \\text{otherwise} \\end{cases} \\] Then depending on the --score option, the PRS is calculated as (assuming \\(M_j\\) is the number of alleles included in the PRS of the \\(j^{th}\\) individual) --score avg (default): $$ PRS_j = \\sum_i{\\frac{S_i\\times G_{ij}}{M_j}} $$ --score sum : $$ PRS_j = \\sum_i{S_i\\times G_{ij}} $$ --score std : $$ PRS_j = \\frac{\\sum_i({S_i\\times G_{ij}}) - \\text{Mean}(PRS)}{\\text{SD}(PRS)} $$ --score con-std : $$ PRS_j = \\frac{\\sum_i({S_i\\times G_{ij}}) - \\text{Mean}(PRS in control)}{\\text{SD}(PRS in control)} $$ Sometimes, samples can have missing genotypes. The --missing option is used to determine how PRSice handles the missingness. When not specified, the Minor Allele Frequency (MAF) in the target sample will be used as the genotype for samples with a missing genotype. If --missing SET_ZERO is set, the SNP will be excluded for the samples in which it is missing. Alternatively, if --missing CENTER is set, the MAF of the SNP will be subtracted from all calculated PRS (therefore, missing samples will have a PRS of 0). Note Missingness imputation is usually based on the target samples.
If you would like to impute the missingness using the reference sample, you can use the --use-ref-maf parameter to specify that all MAF be calculated using the reference samples. Empirical P-value calculation All approaches to PRS calculation involve parameter optimisation and are therefore overfitted. There are a few methods to account for the overfitting: Evaluate performance in an independent validation sample Cross validation Calculate an empirical P-value In PRSice-2, we have implemented a permutation procedure to calculate the empirical P-value. Permutation Procedure To calculate the empirical P-value, PRSice-2 performs the following Perform standard PRSice analysis Obtain the p-value of association of the best p-value threshold ( \\(P_o\\) ) Randomly shuffle the phenotype and repeat the PRSice analysis Obtain the p-value of association of the best p-value threshold under the null ( \\(P_{null}\\) ) Repeat the shuffling step \\(N=10,000\\) times (for --perm 10000 ) The empirical p-value can then be calculated as \\[ \\text{Empirical-}P = \\frac{\\sum_{n=1}^NI(P_{null}\\lt P_o)+1}{N+1} \\] where \\(I(.)\\) is the indicator function. Warning While the empirical p-value for association will be controlled for Type 1 error, the observed phenotypic variance explained, \\(R^2\\) , remains unadjusted and is affected by overfitting. Therefore, it is imperative to perform out-of-sample prediction or cross-validation to evaluate the predictive accuracy of PRS. Computation Algorithm In reality, PRSice-2 exploits a property of random number generation to speed up the permutation analysis. To generate random numbers, a random seed is required. When the same seed is provided, the same sequence of random numbers will always be generated.
PRSice-2 exploits this property, such that the permutation analysis is performed as follows Generate the random seed or set the random seed to the user provided random seed ( \\(S\\) ) For each p-value threshold Calculate the observed p-value Seed the random number generator with \\(S\\) For quantitative traits (and binary traits, unless --logit-perm is set), decompose the matrix of the independent variables ( \\(Intercept+PRS+Covariates\\) ) Generate N copies of random phenotypes via random shuffling. Calculate the p-value of association for each null phenotype For each permutation, check if the current null p-value is the most significant. Replace the previous \"best\" p-value if the current null p-value is more significant Calculate the empirical p-value once all p-value thresholds have been processed As we re-seed the random number generator for each p-value threshold, we ensure the random phenotypes generated at each p-value threshold are identical, allowing us to reuse the calculated PRS and the decomposed matrix, which leads to a significant speed up of the permutation process. Note With binary traits, unless --logit-perm is set, we will still perform linear regression as we assume linear regression and logistic regression should produce similar t-statistics Output of Results Bar Plot Note Hereon, [Name] is assumed to be the output prefix specified using --out and [date] is the date when the analysis was performed. PRSice will always generate a bar plot displaying the model fit of the PRS at the P-value thresholds indicated by --bar-levels The plot will be named [Name]_BARPLOT_[date].png . An example bar plot: High Resolution Plot If --fastscore is not specified, a high-resolution plot named [Name]_HIGH-RES_PLOT_[date].png will be generated. This plot presents the model fit of PRS calculated at all P-value thresholds.
Important The model fit is defined as the \\(R^2\\) of the Full model - the \\(R^2\\) of the Null model For example, if Sex is a covariate in the PRSice calculation, then model fit = \\(R^2\\) of \\(Pheno\\sim PRS+Sex\\) - \\(R^2\\) of \\(Pheno\\sim Sex\\) A green line connects points showing the model fit at the broad P-value thresholds used in the corresponding bar plot are also added. An example high-resolution plot: Quantile Plots If --quantile [number of quantile] is specified, a quantile plot named [Name]_QUANTILE_PLOT_[date].png will be generated. The quantile plot provide an illustration of the effect of increasing PRS on predicted risk of phenotype. An example quantile plot: Specifically, the quantile plot is generated by the following steps Distribute samples into user specified number of quantiles based on their PRS Treat the quantiles as a factor, where the --quant-ref is the base factor Perform regression with \\(Pheno \\sim Quantile + Covariates\\) (use logistic regression if phenotype is binary, and linear regression otherwise) Set the reference quantile to have coefficient of 1 (if binary) or 0 (otherwise) The point of each quantile is their OR (if binary) or coefficient (otherwise) from the regression analysis A text file [Name]_QUANTILE\\_[date].txt is also produced, which provides all the data used for the plotting. Moreover, uneven distribution of quantiles can be specified using the --quant-break function, which will generate the strata plot. 
For example, to replicate the quantile break from Natarajan et al (2015): Percentile of PRS, % All studies in iCOGS excluding pKARMA OR (95% CI) pKARMA only OR (95% CI) <1 0.29 (0.23 to 0.37) 0.48 (0.28 to 0.83) >1\u20135 0.42 (0.37 to 0.47) 0.48 (0.36 to 0.63) 5\u201310 0.55 (0.50 to 0.61) 0.58 (0.45 to 0.74) 10\u201320 0.65 (0.60 to 0.70) 0.68 (0.57 to 0.81) 20\u201340 0.80 (0.76 to 0.85) 0.81 (0.71 to 0.94) 40\u201360 1 (referent) 1 (referent) 60\u201380 1.18 (1.12 to 1.24) 1.35 (1.19 to 1.54) 80\u201390 1.48 (1.39 to 1.57) 1.56 (1.34 to 1.82) 90\u201395 1.69 (1.56 to 1.82) 2.05 (1.70 to 2.47) 95\u201399 2.20 (2.03 to 2.38) 2.12 (1.73 to 2.59) >99 2.81 (2.43 to 3.24) 3.06 (2.16 to 4.34) The following command can be added to PRSice command: --quantile 100 \\ --quant-break 1,5,10,20,40,60,80,90,95,99,100 \\ --quant-ref 60 Specifically, --quant-break indicates the upper bound of each group and --quant-ref specify the upper bound of the reference quantiles Note The quantile boundaries are non-overlapping, with the inclusive upper bound and exclusive lower bound Note Usually, you will need --quantile 100 together with --quant-break PRS model-fit A file containing the PRS model fit across thresholds is named [Name].prsice ; this is stored as Set, Threshold, \\(R^2\\) , P-value, Coefficient, Standard Deviation and Number of SNPs at this threshold Important \\(R^2\\) reported in the prsice file is the \\(R^2\\) of the Full model - the \\(R^2\\) of the Null model Scores for each individual A file containing PRS for each individual at the best-fit PRS named [Name].best is provide. This file has the format of: FID,IID,In_Regression, PRS at best threshold of first set, PRS at best threshold of second set, ... Where the In_Regression column indicate whether the sample is included in the regression model performed by PRSice. 
If --all-score option is used, a file named [Name].all.score is also generated This file has the format of FID, IID, PRS for first set at first threshold, PRS for first set at second threshold, ... If --all-score is used, the PRS for each individual at all threshold and all sets will be given. In the event where the target sample size is large and a lot of threshold are tested, this file can be large. Summary Information Information of the best model fit of each phenotype and gene set is stored in [Name].summary . The summary file contain the following fields: Phenotype - Name of Phenotype Set - Name of Gene Set Threshold - Best P-value Threshold PRS.R2 - Variance explained by the PRS. If prevalence is provided, this will be adjusted for ascertainment Full.R2 - Variance explained by the full model (including the covariates). If prevalence is provided, this will be adjusted for ascertainment Null.R2 - Variance explained by the covariates. If prevalence is provided, this will be adjusted for ascertainment Prevalence - Population prevalence as indicated by the user. \"-\" if not provided. Coefficient - Regression coefficient of the model. Can provide insight of the direction of effect. P - P value of the model fit Num_SNP - Number of SNPs included in the model Empirical-P - Only provided if permutation is performed. This is the empirical p-value and should account for multiple testing and over-fitting Only one summary file will be generated for each PRSice run (disregarding the number of target phenotype used) Log File To allow for easy replication, a log file named [Name].log is generated for each PRSice run, which contain the all the commands used for the analysis and information regarding filtering, field selected etc. 
This also allows easy identification of problems and should always be included in the bug report.","title":"PRSice"},{"location":"step_by_step/#background","text":"You will need a basic understanding of Genome Wide Association Studies (GWAS) in order to perform Polygenic Risk Score (PRS) analyses. If you are unfamiliar with GWAS, you can consider reading this paper .","title":"Background"},{"location":"step_by_step/#input-data","text":"Here, we briefly discuss the different input files required by PRSice:","title":"Input Data"},{"location":"step_by_step/#base-dataset","text":"Base (i.e. GWAS) data must be provided as a whitespace delimited file containing association analysis results for SNPs on the base phenotype. PRSice has no problem reading in a gzipped base file (it needs to have a .gz suffix). If PLINK output is used, then please make sure there is a column for the effective allele (A1) and specify it with the --A1 option. If your base data follows other formats, then the column headers can be provided using the --chr , --A1 , --A2 , --stat , --snp , --bp , --pvalue options Important PRSice requires the base file to contain information on the effective allele ( --A1 ), effect size estimates ( --stat ), p-value for association ( --pvalue ), and the SNP ID ( --snp ). If the input file does not contain a column header, the columns can be specified using their index (counting from 0) with the --index flag. For example, with the following input format: SNP CHR BP A1 A2 OR SE P rs3094315 1 752566 A G 0.9912 0.0229 0.7009 rs3131972 1 752721 A G 1.007 0.0228 0.769 rs3131971 1 752894 T C 1.003 0.0232 0.8962 the parameters can either be --snp SNP --chr CHR --bp BP --A1 A1 --A2 A2 --stat OR --pvalue P or --snp 0 --chr 1 --bp 2 --A1 3 --A2 4 --stat 5 --pvalue 7 --index Strand flips are automatically detected and accounted for.
If an imputation info score or the minor allele frequencies (MAF) are also included in the file, --base-info : and --base-maf : can be used to filter SNPs based on their INFO score and MAF respectively. For a binary trait base file, SNPs can be filtered according to the MAF in cases and controls separately using --base-maf :,: By default, PRSice will look for the following column names automatically from the base file header if --index was not provided or if the column name of the specific argument(s) was not provided: CHR, BP, A1, A2, SNP, P, INFO (case sensitive) and OR / BETA (case insensitive) --no-default can be used to disable all the defaults of PRSice. Note PRSice will ignore any columns that were not found in the base file (e.g. If --A2 B is specified but none of the column headers is B , then PRSice will treat it as if no A2 information is present)","title":"Base Dataset"},{"location":"step_by_step/#target-dataset","text":"Currently two different target file formats are supported by PRSice:","title":"Target Dataset"},{"location":"step_by_step/#plink-binary","text":"A target dataset in PLINK binary format must consist of three files: .bed , .bim , and a .fam file - where bed contains the compressed genotype data, bim contains the SNP information and fam contains the family information. Currently only the SNP major PLINK format is supported (default output of the latest PLINK program). The .bed and .bim files must have the same prefix. If the .fam file follows a different prefix from the .bed and .bim files, it can be specified using --target , Warning The fam file MUST contain the correct number of samples or PRSice will crash Missing phenotype data can be coded as NA or -9 for binary traits, and as NA for quantitative traits. Note -9 will NOT be considered as missing for quantitative traits If the binary file is separated into individual chromosomes, then a # can be used to specify the location of the chromosome number in the file name.
PRSice will automatically substitute # with 1-22 i.e. If the files are chr1. ,chr2. ,...,chr22. , just use --target chr# Note Chromosome number substitution will not be performed on the external fam file as the fam file should be the same for all chromosomes. Alternatively, if your PLINK files do not have a unified prefix, you can use --target-list to provide a file containing all prefixes to PRSice. Note .pgen files are not currently supported","title":"PLINK Binary"},{"location":"step_by_step/#bgen","text":"PRSice currently supports BGEN v1.1 and v1.2. To specify a BGEN file, simply add --type bgen or --ld-type bgen to the PRSice command Note In theory, we can support BGEN v1.3, but that will require us to include the zstd library, developed by Facebook. You can enable the support by including the zstd library and changing the bgen_lib files. As BGEN does not store the phenotype information and sometimes not even the sample ID, you must provide a phenotype file ( --pheno ). Alternatively, if you have a sample file containing the phenotype information, you can provide it with --target , Note The sample file is required even if --no-regress is set as the sample ID is required for output. This requirement might be loosened in future versions With BGEN input, a number of PRSice options become effective: --hard : Normally, with BGEN format, PRS is calculated using the dosage information. But hard-thresholding can be performed by using the --hard option. SNPs will then be coded as the genotype (0, 1 or 2) and filtered according to the threshold set by --hard-thres . If no such genotype is present, the SNP will be coded as missing --hard-thres : A hardcall is saved when the distance to the nearest hardcall is less than the hardcall threshold. See here for more detail. To perform clumping on a BGEN file, we need to repeatedly decompress the genotype dosages and convert them into PLINK binary format.
To speed up the clumping process, you can allow PRSice to generate a large intermediate file containing the hard coded genotypes in PLINK binary format by using the --allow-inter option.","title":"BGEN"},{"location":"step_by_step/#phenotype-files","text":"An external phenotype file can be provided to PRSice using the --pheno parameter. This must be a tab / space delimited file and missing data must be represented by either NA or -9 (only for binary traits). The first two columns of the phenotype file should be the FID and the IID, or when --ignore-fid is set, the first column should be the IID. The rest of the columns can be the phenotype(s). To specify a trait within the phenotype file, the column name for the trait can be specified using --pheno-col , provided that the phenotype file contains a header. Multiple column names can be provided via a comma separated list: e.g. --pheno-col A,B,C,D . Trait(s) not found within the phenotype file will be automatically skipped. Important The column name(s) should not contain spaces or commas Note When more than one trait is provided, the column name will be appended to the output prefix.","title":"Phenotype files"},{"location":"step_by_step/#ld-reference","text":"When the target sample is small (e.g. < 500 samples), an external reference panel can be used to improve the LD estimation for clumping. The LD reference follows the same conventions as the target dataset. Simply use --ld to specify your LD reference panel file and --ld-type to specify the format When an LD reference file is not provided and --no-clump is not specified, the target file will be used as the LD reference panel Important Any parameters with the --ld prefix will only work on the file specified by the --ld parameter. That is, if an LD reference file is not provided, none of the --ld-* options will be used.
If a different set of filtering is to be performed on the target file when performing LD calculation, the file must be provided separately to the --ld parameter e.g. --target --ld --keep --ld-keep Note BGEN files will always be hard coded when used to estimate the LD","title":"LD reference"},{"location":"step_by_step/#clumping","text":"By default, PRSice will perform clumping to remove SNPs that are in LD with each other. Similar to PLINK, the \\(r^2\\) values computed by PRSice are based on maximum likelihood haplotype frequency estimates. Both cases and controls are included in the LD calculation. Alternatively, a combination of --ld and --ld-keep / --ld-remove can be used to restrict the LD calculation to control samples. Clumping parameters can be changed by using the --clump-kb , --clump-r2 and --clump-p options. Clumping can be disabled using --no-clump","title":"Clumping"},{"location":"step_by_step/#prs-calculation","text":"PRSice allows different genetic models to be specified (e.g. add, dom, het, rec), and the polygenic score for each of them is calculated differently Assuming \\(S\\) is the summary statistic for the effective allele and \\(G\\) is the number of effective alleles observed, then the main difference between the models is how the genotypes are coded: For additive model (add) \\[ G = G \\] For dominant model (with respect to the effective allele of the base file) \\[ G = \\begin{cases} 0 & \\text{if $G$ = 0} \\\\ 1 & \\text{otherwise} \\end{cases} \\] For recessive model (with respect to the effective allele of the base file) \\[ G = \\begin{cases} 1 & \\text{if $G$ = 2} \\\\ 0 & \\text{otherwise} \\end{cases} \\] For heterozygous model \\[ G = \\begin{cases} 1 & \\text{if $G$ = 1} \\\\ 0 & \\text{otherwise} \\end{cases} \\] Then depending on the --score option, the PRS is calculated as (assuming \\(M_j\\) is the number of alleles included in the PRS of the \\(j^{th}\\) individual) --score avg (default): $$ PRS_j = \\sum_i{\\frac{S_i\\times G_{ij}}{M_j}} $$
--score sum : $$ PRS_j = \\sum_i{S_i\\times G_{ij}} $$ --score std : $$ PRS_j = \\frac{\\sum_i({S_i\\times G_{ij}}) - \\text{Mean}(PRS)}{\\text{SD}(PRS)} $$ --score con-std : $$ PRS_j = \\frac{\\sum_i({S_i\\times G_{ij}}) - \\text{Mean}(PRS in control)}{\\text{SD}(PRS in control)} $$ Sometimes, samples can have missing genotypes. The --missing option is used to determine how PRSice handles the missingness. When not specified, the Minor Allele Frequency (MAF) in the target sample will be used as the genotype for samples with a missing genotype. If --missing SET_ZERO is set, the SNP will be excluded for the samples in which it is missing. Alternatively, if --missing CENTER is set, the MAF of the SNP will be subtracted from all calculated PRS (therefore, missing samples will have a PRS of 0). Note Missingness imputation is usually based on the target samples. If you would like to impute the missingness using the reference sample, you can use the --use-ref-maf parameter to specify that all MAF be calculated using the reference samples.","title":"PRS calculation"},{"location":"step_by_step/#empirical-p-value-calculation","text":"All approaches to PRS calculation involve parameter optimisation and are therefore overfitted.
There are a few methods to account for the overfitting: Evaluate performance in an independent validation sample Cross validation Calculate an empirical P-value In PRSice-2, we have implemented a permutation procedure to calculate the empirical P-value.","title":"Empirical P-value calculation"},{"location":"step_by_step/#permutation-procedure","text":"To calculate the empirical P-value, PRSice-2 performs the following Perform standard PRSice analysis Obtain the p-value of association of the best p-value threshold ( \\(P_o\\) ) Randomly shuffle the phenotype and repeat the PRSice analysis Obtain the p-value of association of the best p-value threshold under the null ( \\(P_{null}\\) ) Repeat the shuffling step \\(N=10,000\\) times (for --perm 10000 ) The empirical p-value can then be calculated as \\[ \\text{Empirical-}P = \\frac{\\sum_{n=1}^NI(P_{null}\\lt P_o)+1}{N+1} \\] where \\(I(.)\\) is the indicator function. Warning While the empirical p-value for association will be controlled for Type 1 error, the observed phenotypic variance explained, \\(R^2\\) , remains unadjusted and is affected by overfitting. Therefore, it is imperative to perform out-of-sample prediction or cross-validation to evaluate the predictive accuracy of PRS.","title":"Permutation Procedure"},{"location":"step_by_step/#computation-algorithm","text":"In reality, PRSice-2 exploits a property of random number generation to speed up the permutation analysis. To generate random numbers, a random seed is required. When the same seed is provided, the same sequence of random numbers will always be generated.
PRSice-2 exploits this property, such that the permutation analysis is performed as follows Generate the random seed or set the random seed to the user provided random seed ( \\(S\\) ) For each p-value threshold Calculate the observed p-value Seed the random number generator with \\(S\\) For quantitative traits (and binary traits, unless --logit-perm is set), decompose the matrix of the independent variables ( \\(Intercept+PRS+Covariates\\) ) Generate N copies of random phenotypes via random shuffling. Calculate the p-value of association for each null phenotype For each permutation, check if the current null p-value is the most significant. Replace the previous \"best\" p-value if the current null p-value is more significant Calculate the empirical p-value once all p-value thresholds have been processed As we re-seed the random number generator for each p-value threshold, we ensure the random phenotypes generated at each p-value threshold are identical, allowing us to reuse the calculated PRS and the decomposed matrix, which leads to a significant speed up of the permutation process. Note With binary traits, unless --logit-perm is set, we will still perform linear regression as we assume linear regression and logistic regression should produce similar t-statistics","title":"Computation Algorithm"},{"location":"step_by_step/#output-of-results","text":"","title":"Output of Results"},{"location":"step_by_step/#bar-plot","text":"Note Hereon, [Name] is assumed to be the output prefix specified using --out and [date] is the date when the analysis was performed. PRSice will always generate a bar plot displaying the model fit of the PRS at the P-value thresholds indicated by --bar-levels The plot will be named [Name]_BARPLOT_[date].png . An example bar plot:","title":"Bar Plot"},{"location":"step_by_step/#high-resolution-plot","text":"If --fastscore is not specified, a high-resolution plot named [Name]_HIGH-RES_PLOT_[date].png will be generated.
This plot present the model fit of PRS calculated at all P-value thresholds. Important The model fit is defined as the \\(R^2\\) of the Full model - the \\(R^2\\) of the Null model For example, if Sex is a covariate in the PRSice calculation, then model fit = \\(R^2\\) of \\(Pheno\\sim PRS+Sex\\) - \\(R^2\\) of \\(Pheno\\sim Sex\\) A green line connects points showing the model fit at the broad P-value thresholds used in the corresponding bar plot are also added. An example high-resolution plot:","title":"High Resolution Plot"},{"location":"step_by_step/#quantile-plots","text":"If --quantile [number of quantile] is specified, a quantile plot named [Name]_QUANTILE_PLOT_[date].png will be generated. The quantile plot provide an illustration of the effect of increasing PRS on predicted risk of phenotype. An example quantile plot: Specifically, the quantile plot is generated by the following steps Distribute samples into user specified number of quantiles based on their PRS Treat the quantiles as a factor, where the --quant-ref is the base factor Perform regression with \\(Pheno \\sim Quantile + Covariates\\) (use logistic regression if phenotype is binary, and linear regression otherwise) Set the reference quantile to have coefficient of 1 (if binary) or 0 (otherwise) The point of each quantile is their OR (if binary) or coefficient (otherwise) from the regression analysis A text file [Name]_QUANTILE\\_[date].txt is also produced, which provides all the data used for the plotting. Moreover, uneven distribution of quantiles can be specified using the --quant-break function, which will generate the strata plot. 
For example, to replicate the quantile break from Natarajan et al (2015): Percentile of PRS, % All studies in iCOGS excluding pKARMA OR (95% CI) pKARMA only OR (95% CI) <1 0.29 (0.23 to 0.37) 0.48 (0.28 to 0.83) >1\u20135 0.42 (0.37 to 0.47) 0.48 (0.36 to 0.63) 5\u201310 0.55 (0.50 to 0.61) 0.58 (0.45 to 0.74) 10\u201320 0.65 (0.60 to 0.70) 0.68 (0.57 to 0.81) 20\u201340 0.80 (0.76 to 0.85) 0.81 (0.71 to 0.94) 40\u201360 1 (referent) 1 (referent) 60\u201380 1.18 (1.12 to 1.24) 1.35 (1.19 to 1.54) 80\u201390 1.48 (1.39 to 1.57) 1.56 (1.34 to 1.82) 90\u201395 1.69 (1.56 to 1.82) 2.05 (1.70 to 2.47) 95\u201399 2.20 (2.03 to 2.38) 2.12 (1.73 to 2.59) >99 2.81 (2.43 to 3.24) 3.06 (2.16 to 4.34) The following command can be added to PRSice command: --quantile 100 \\ --quant-break 1,5,10,20,40,60,80,90,95,99,100 \\ --quant-ref 60 Specifically, --quant-break indicates the upper bound of each group and --quant-ref specifies the upper bound of the reference quantile Note The quantile boundaries are non-overlapping, with the inclusive upper bound and exclusive lower bound Note Usually, you will need --quantile 100 together with --quant-break","title":"Quantile Plots"},{"location":"step_by_step/#prs-model-fit","text":"A file containing the PRS model fit across thresholds is named [Name].prsice ; this is stored as Set, Threshold, \\(R^2\\) , P-value, Coefficient, Standard Deviation and Number of SNPs at this threshold Important \\(R^2\\) reported in the prsice file is the \\(R^2\\) of the Full model - the \\(R^2\\) of the Null model","title":"PRS model-fit"},{"location":"step_by_step/#scores-for-each-individual","text":"A file containing PRS for each individual at the best-fit PRS named [Name].best is provided. This file has the format of: FID,IID,In_Regression, PRS at best threshold of first set, PRS at best threshold of second set, ... Where the In_Regression column indicates whether the sample is included in the regression model performed by PRSice. 
If the --all-score option is used, a file named [Name].all.score is also generated. This file has the format of FID, IID, PRS for first set at first threshold, PRS for first set at second threshold, ... If --all-score is used, the PRS for each individual at all thresholds and all sets will be given. In the event where the target sample size is large and many thresholds are tested, this file can be large.","title":"Scores for each individual"},{"location":"step_by_step/#summary-information","text":"Information of the best model fit of each phenotype and gene set is stored in [Name].summary . The summary file contains the following fields: Phenotype - Name of Phenotype Set - Name of Gene Set Threshold - Best P-value Threshold PRS.R2 - Variance explained by the PRS. If prevalence is provided, this will be adjusted for ascertainment Full.R2 - Variance explained by the full model (including the covariates). If prevalence is provided, this will be adjusted for ascertainment Null.R2 - Variance explained by the covariates. If prevalence is provided, this will be adjusted for ascertainment Prevalence - Population prevalence as indicated by the user. \"-\" if not provided. Coefficient - Regression coefficient of the model. Can provide insight of the direction of effect. P - P value of the model fit Num_SNP - Number of SNPs included in the model Empirical-P - Only provided if permutation is performed. This is the empirical p-value and should account for multiple testing and over-fitting Only one summary file will be generated for each PRSice run (regardless of the number of target phenotypes used)","title":"Summary Information"},{"location":"step_by_step/#log-file","text":"To allow for easy replication, a log file named [Name].log is generated for each PRSice run, which contains all the commands used for the analysis and information regarding filtering, fields selected etc. 
This also allows easy identification of problems and should always be included in the bug report.","title":"Log File"},{"location":"update_log/","text":"From now on, I will try to archive our update log here. 2020-08-05 ( v2.3.3 ) Thanks to report from @charlisech, we were able to pinpoint a bug related to sample selection when using bgen data. 2020-07-15 ( v2.3.2 ) Fix off by one error in PRSet best score output Fix problem for bgen file when sample selection is performed on bgen files containing sample information 2020-05-30 ( v2.3.1.e ) Fix bug where SNPs without missingness will be wrongly considered as having 100% missingness Fix error log where PRSice should now correct stat if a parameter is missing the required arguments 2020-05-29 ( v2.3.1.d ) Fix segmentation fault when --ld is used 2020-05-28 ( v2.3.1.c ) Fix problem with missing covariate Fix Rscript such that it properly read in phenotype file when --pheno-col is specified 2020-05-26 ( v2.3.1.b ) Fix best score output when --ignore-fid is used Also fix Rscript covariate and phenotype file read when handling IDs start with 00 and when --ignore-fid is used 2020-05-26 ( v2.3.1.a ) Fix bar plot with covariate. Was plotting the full R2 instead of the PRS.R2 2020-05-23 ( v2.3.1 ) Update Rscript such that it match features in executable (thus avoid problem in plotting) Fix a bug where PRSice will crash when there are missing covariates 2020-05-21 ( v2.3.0.e ) Fix Rscript bar plot problem 2020-05-21 ( v2.3.0.d ) Fix problem introduced by previous fix. Was hoping 2.3.0's unit test will help reducing the amount of bugs. Sorry for the troubles. 2020-05-20 ( v2.3.0.c ) Fix all score output format Fix problem with --no-regress . Might still have problem with --no-regress --score con-std 2020-05-19 ( v2.3.0.b ) Fix error where sample selection will distort phenotype loading, loading the wrong phenotype to wrong sample. As this is a major bug, we deleted the previous 2 releases. Sorry for the troubles. 
2020-05-19 ( v2.3.0.a ) Fix output error where we always say 0 valid phenotype were included for continuous trait Fix problem with permutation where PRSice will crash if input are rank deficient Fix problem when provide a binary phenotype file with a fam file containing -9 as phenotype, PRSice will wrongly state that there are no phenotype presented Fix problem in Rscript where if sample ID is numeric and starts with 0, the best file will not merge with the phenotype file, causing 0 valid PRS to be observed 2020-05-18 ( v2.3.0 ) We now support multi-threaded clumping (separated by chromosome) Genotypes will be stored to memory during clumping (increase memory usage, significantly speed up clumping) Will only generate one .prsice file for all phenotypes .prsice file now has additional column call \"Pheno\" Introduced --chr-id which generate rs id based on user provided formula (see detail for more info) Format of --base-maf and --base-info are now changed to : from , Fix a bug related to ambiguous allele dosage flipping when --keep-ambig is used Better mismatch handling. For example, if your base file only provide the effective allele A without the non-effective allele information, PRSice will now do dosage flipping if your target file has G/C as effective allele and A/T as an non-effective allele (whereas previously this SNP will be considered as a mismatch) Fix bug in 2.2.13 where PRSice won't output the error message during command parsing stage If user provided the --stat information, PRSice will now error out instead of trying to look for BETA or OR in the file. 
PRSice should now better recognize if phenotype file contains a header various small bug fix","title":"Update Log"},{"location":"update_log/#2020-08-05-v233","text":"Thanks to report from @charlisech, we were able to pinpoint a bug related to sample selection when using bgen data.","title":"2020-08-05 (v2.3.3)"},{"location":"update_log/#2020-07-15-v232","text":"Fix off by one error in PRSet best score output Fix problem for bgen file when sample selection is performed on bgen files containing sample information","title":"2020-07-15 (v2.3.2)"},{"location":"update_log/#2020-05-30-v231e","text":"Fix bug where SNPs without missingness will be wrongly considered as having 100% missingness Fix error log where PRSice should now correct stat if a parameter is missing the required arguments","title":"2020-05-30 (v2.3.1.e)"},{"location":"update_log/#2020-05-29-v231d","text":"Fix segmentation fault when --ld is used","title":"2020-05-29 (v2.3.1.d)"},{"location":"update_log/#2020-05-28-v231c","text":"Fix problem with missing covariate Fix Rscript such that it properly read in phenotype file when --pheno-col is specified","title":"2020-05-28 (v2.3.1.c)"},{"location":"update_log/#2020-05-26-v231b","text":"Fix best score output when --ignore-fid is used Also fix Rscript covariate and phenotype file read when handling IDs start with 00 and when --ignore-fid is used","title":"2020-05-26 (v2.3.1.b)"},{"location":"update_log/#2020-05-26-v231a","text":"Fix bar plot with covariate. 
Was plotting the full R2 instead of the PRS.R2","title":"2020-05-26 (v2.3.1.a)"},{"location":"update_log/#2020-05-23-v231","text":"Update Rscript such that it match features in executable (thus avoid problem in plotting) Fix a bug where PRSice will crash when there are missing covariates","title":"2020-05-23 (v2.3.1)"},{"location":"update_log/#2020-05-21-v230e","text":"Fix Rscript bar plot problem","title":"2020-05-21 (v2.3.0.e)"},{"location":"update_log/#2020-05-21-v230d","text":"Fix problem introduced by previous fix. Was hoping 2.3.0's unit test will help reducing the amount of bugs. Sorry for the troubles.","title":"2020-05-21 (v2.3.0.d)"},{"location":"update_log/#2020-05-20-v230c","text":"Fix all score output format Fix problem with --no-regress . Might still have problem with --no-regress --score con-std","title":"2020-05-20 (v2.3.0.c)"},{"location":"update_log/#2020-05-19-v230b","text":"Fix error where sample selection will distort phenotype loading, loading the wrong phenotype to wrong sample. As this is a major bug, we deleted the previous 2 releases. 
Sorry for the troubles.","title":"2020-05-19 (v2.3.0.b)"},{"location":"update_log/#2020-05-19-v230a","text":"Fix output error where we always say 0 valid phenotype were included for continuous trait Fix problem with permutation where PRSice will crash if input are rank deficient Fix problem when provide a binary phenotype file with a fam file containing -9 as phenotype, PRSice will wrongly state that there are no phenotype presented Fix problem in Rscript where if sample ID is numeric and starts with 0, the best file will not merge with the phenotype file, causing 0 valid PRS to be observed","title":"2020-05-19 (v2.3.0.a)"},{"location":"update_log/#2020-05-18-v230","text":"We now support multi-threaded clumping (separated by chromosome) Genotypes will be stored to memory during clumping (increase memory usage, significantly speed up clumping) Will only generate one .prsice file for all phenotypes .prsice file now has additional column call \"Pheno\" Introduced --chr-id which generate rs id based on user provided formula (see detail for more info) Format of --base-maf and --base-info are now changed to : from , Fix a bug related to ambiguous allele dosage flipping when --keep-ambig is used Better mismatch handling. For example, if your base file only provide the effective allele A without the non-effective allele information, PRSice will now do dosage flipping if your target file has G/C as effective allele and A/T as an non-effective allele (whereas previously this SNP will be considered as a mismatch) Fix bug in 2.2.13 where PRSice won't output the error message during command parsing stage If user provided the --stat information, PRSice will now error out instead of trying to look for BETA or OR in the file. PRSice should now better recognize if phenotype file contains a header various small bug fix","title":"2020-05-18 (v2.3.0)"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index 98bfaf27..24895894 100644
Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ
We're hiring!!
+PRS Workshop 2024 in Japan!!!