1. Computing prior causal probabilities with PolyFun
There are three ways to run PolyFun:
- Using precomputed prior causal probabilities of 19 million imputed UK Biobank SNPs with MAF>0.1%, based on a meta-analysis of 15 UK Biobank traits. This is the simplest approach, but it may not include all your SNPs of interest (especially when analyzing non-European populations) and the prior causal probabilities may not be optimal for some traits.
- Computing prior causal probabilities via an L2-regularized extension of stratified LD-score regression (S-LDSC). This is a relatively simple approach, but the prior causal probabilities may not be robust to modeling misspecification (please see the paragraph beginning with "To assess the robustness of PolyFun to model misspecification of functional architectures" in the PolyFun paper Main Simulation section for details).
- Computing prior causal probabilities non-parametrically. This is the most robust approach, but it is computationally intensive and requires access to individual-level genotypic data from a large reference panel (optimally >10,000 population-matched individuals), unless you are analyzing summary statistics based only on UK Biobank British-ancestry individuals.
Approach (3) is generally the best, but it suffers from the caveats specified above. If you are unable to use approach (3), we recommend using approach (1) if (a) all of your SNPs are included in our set of 19M SNPs; and (b) the trait you are analyzing has "typical" functional enrichments (e.g. coding and conserved SNPs are extremely enriched). If you can use neither approach (3) nor approach (1), we recommend using approach (2), keeping its limitations in mind.
Below are instructions and examples on how to use each approach. We recommend that you run these examples to become familiar with PolyFun. The examples use small datasets and run very quickly (typically <1 minute).
PolyFun uses input files that are very similar to the input files of S-LDSC. The main differences are:
- The .annot files must contain two additional columns called A1,A2, which encode the identities of the effect and alternative alleles (i.e., the sign of the effect sizes is with respect to A1)
- The .l2.ldscore files may contain the additional columns A1,A2. We strongly encourage including these columns.
- PolyFun supports files in .parquet format in addition to gzip/bzip2-compressed text formats. Parquet files can be loaded substantially faster than the alternative formats, at the cost of slightly larger file sizes.
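For illustration, here is a minimal sketch of what such an annotation file could look like, built with pandas (the output file name, the SNPs, and the toy Coding_UCSC annotation below are all hypothetical; writing parquet requires pyarrow or fastparquet):

```python
import pandas as pd

# A minimal sketch of a PolyFun-style .annot file. The key difference from
# standard S-LDSC .annot files is the A1,A2 allele columns.
df_annot = pd.DataFrame({
    'CHR': [1, 1],
    'BP': [10000006, 10000179],
    'SNP': ['rs186077422', '1:10000179_AAAAAAAC_A'],
    'A1': ['G', 'AAAAAAAC'],          # effect alleles
    'A2': ['A', 'A'],                 # alternative alleles
    'Coding_UCSC': [0, 1],            # a toy binary annotation
})

# parquet loads substantially faster than gzipped text, at slightly larger size
df_annot.to_parquet('toy.1.annot.parquet')
```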
PolyFun approach 1: Using precomputed prior causal probabilities based on a meta-analysis of 15 UK Biobank traits
Here, all you need to do is provide a file with SNP identifiers. PolyFun will extract the per-SNP heritabilities of these SNPs, which are proportional to prior causal probabilities. To do this, use the following command:
python extract_snpvar.py --sumstats <sumstats_file> --out <output_file>
The sumstats_file should be either a parquet file or a whitespace-delimited file (which can be gzipped) with a header line. It can contain any additional columns, as long as it includes the following columns:
- CHR - chromosome
- BP - base pair position (in hg19 coordinates)
- A1 - The effect allele (i.e., the sign of the effect size is with respect to A1)
- A2 - The second allele
The script will output a file with exactly the same columns and one additional column called SNPVAR. Specifically, extract_snpvar.py can accept files created by the script munge_polyfun_sumstats.py, and the output can be used directly for fine-mapping with the finemapper script.
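If you want to assemble such an input file yourself, here is a minimal sketch using pandas (the SNPs and the output file name below are hypothetical):

```python
import pandas as pd

# A minimal input file for extract_snpvar.py, with the four required columns;
# any additional columns are carried through to the output unchanged.
df = pd.DataFrame({
    'CHR': [1, 1],
    'BP': [10000006, 10000400],   # hg19 coordinates
    'A1': ['G', 'T'],             # effect alleles
    'A2': ['A', 'A'],             # second alleles
})

# write a gzipped whitespace-delimited file with a header line
df.to_csv('my_snps.txt.gz', sep='\t', index=False, compression='gzip')
```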
Here is a toy example you can try:
mkdir -p output
python extract_snpvar.py --sumstats example_data/snps_to_finemap.txt.gz --out output/snps_with_var.gz
cat output/snps_with_var.gz | zcat | head
The top lines of the output should be:
CHR BP SNP A1 A2 SNPVAR
1 10000006 rs186077422 G A 4.0733e-09
1 10000179 1:10000179_AAAAAAAC_A AAAAAAAC A 4.0733e-09
1 10000400 rs1237370 T A 4.0733e-09
1 10000476 rs182770070 A T 4.0733e-09
1 10000553 rs574892739 T G 4.0733e-09
1 10000732 rs563811805 T C 4.0733e-09
1 10000804 rs114880362 T C 4.0733e-09
1 10001239 rs68058227 G T 4.0733e-09
1 10001401 rs60132751 C T 4.0733e-09
The column SNPVAR contains the per-SNP heritabilities, which are proportional to prior causal probabilities. These per-SNP heritabilities can be used directly as prior causal probabilities in fine-mapping.
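For intuition about what "proportional to prior causal probabilities" means, here is a sketch that rescales SNPVAR within a hypothetical locus so that the priors sum to 1 (this is only illustrative; fine-mapping tools typically perform an equivalent normalization of prior weights internally):

```python
import pandas as pd

# read the gzipped whitespace-delimited output of extract_snpvar.py
df = pd.read_csv('output/snps_with_var.gz', sep=r'\s+')

# take a toy "locus" spanning the SNPs shown above
locus = df[(df['CHR'] == 1) & df['BP'].between(10000006, 10001401)].copy()

# rescale per-SNP heritabilities into prior causal probabilities
locus['PRIOR'] = locus['SNPVAR'] / locus['SNPVAR'].sum()
print(locus[['SNP', 'SNPVAR', 'PRIOR']])
```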
We emphasize that the script will terminate with an error if the input file contains SNPs for which we did not precompute per-SNP heritabilities. This behavior safeguards against the removal of potentially causal SNPs. You may override it by providing the flag "--allow-missing", but we caution that this may remove causal SNPs, which could lead to spurious fine-mapping results downstream.
Tip: You can use the script munge_polyfun_sumstats.py (see below) to create a parquet file with the required columns.
PolyFun approach 2: Computing prior causal probabilities via an L2-regularized extension of S-LDSC
This is done in two stages:
1. Create a munged summary statistics file in a PolyFun-friendly parquet format.
To do this, use the script munge_polyfun_sumstats.py, which takes an input summary statistics file and creates a munged output file. The script tries to be flexible and accommodate multiple file formats and column names. It generally requires only a sample size parameter (n) and a whitespace-delimited input file with SNP rsids, chromosome and base-pair info, and either an effect size estimate and its standard error, a Z-score, or a p-value.
Here is a usage example:
python munge_polyfun_sumstats.py \
--sumstats example_data/boltlmm_sumstats.gz \
--n 327209 \
--out example_data/sumstats_munged.parquet \
--min-info 0.6 \
--min-maf 0.001
This takes the input BOLT-LMM file example_data/boltlmm_sumstats.gz and converts it to the parquet file example_data/sumstats_munged.parquet, excluding SNPs with INFO score <0.6, with MAF <0.001, or in the MHC region. It will additionally compute the BOLT-LMM effective sample size. You can see other possible arguments with the command python munge_polyfun_sumstats.py --help. You can inspect the output by loading the parquet file in Python with the command df = pd.read_parquet('example_data/sumstats_munged.parquet').
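For example, here is a short sketch for inspecting the munged file (reading parquet files with pandas requires pyarrow or fastparquet):

```python
import pandas as pd

# load the munged summary statistics and inspect their structure
df = pd.read_parquet('example_data/sumstats_munged.parquet')
print(df.head())                # first few munged SNPs
print(df.columns.tolist())      # column names in the munged file
```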
2. Run PolyFun with L2-regularized S-LDSC
In this stage PolyFun will estimate per-SNP heritabilities for SNPs on odd (resp. even) chromosomes by applying L2-regularized S-LDSC to even (resp. odd) chromosomes. To do this, run the script polyfun.py. This script handles all possible uses of PolyFun, but here we'll only compute prior causal probabilities with L2-extended S-LDSC, using a subset of the baseline-LF model annotations. Here is an example command that uses 8 annotations from the baseline-LF model:
mkdir -p output
python polyfun.py \
--compute-h2-L2 \
--no-partitions \
--output-prefix output/testrun \
--sumstats example_data/sumstats.parquet \
--ref-ld-chr example_data/annotations. \
--w-ld-chr example_data/weights.
This will create two output files for each chromosome: output/testrun.<CHR>.snpvar_ridge.gz and output/testrun.<CHR>.snpvar_ridge_constrained.gz. The first contains estimated per-SNP heritabilities for all SNPs (which can be used for downstream analysis with PolyFun; see below), and the second contains truncated per-SNP heritabilities, which can be used directly as prior causal probabilities in fine-mapping. For example, here is the output for the first SNPs in chromosome 22 (seen with cat output/testrun.22.snpvar_ridge_constrained.gz | zcat | head):
CHR SNP BP A1 A2 SNPVAR Z N
22 rs139069276 16866502 G A 7.1906e-09 -1.2188e-02 383290
22 rs34747326 16870173 A G 7.1906e-09 -1.8948e+00 383290
22 rs4010550 16900134 G A 1.4878e-08 1.3657e+00 383290
22 rs5994099 16905044 G A 7.1906e-09 7.2224e-01 383290
22 rs59750689 16936369 T C 7.1906e-09 9.4481e-02 383290
22 rs148021587 16939232 C T 1.4878e-08 8.0397e-01 383290
22 rs3954635 17024983 A C 1.4878e-08 -3.4335e-01 383290
22 rs371202053 17034394 G A 1.4878e-08 -7.8644e-01 383290
22 rs200071370 17037779 C T 1.4878e-08 3.5049e-01 383290
The column called 'SNPVAR' contains truncated per-SNP heritabilities, which can be used directly as prior causal probabilities in fine-mapping.
The parameters we provided are the following:
- --compute-h2-L2 - this tells PolyFun to compute per-SNP heritabilities via L2-regularized S-LDSC.
- --no-partitions - this tells PolyFun not to partition SNPs into bins based on their estimated per-SNP heritabilities. This makes the computations slightly faster. You should only provide this flag if you are only interested in L2-regularized estimation of per-SNP heritabilities.
- --output-prefix output/testrun - this specifies the prefix of all the PolyFun output files.
- --sumstats - this specifies an input summary statistics file (created via the munge_polyfun_sumstats.py script).
- --ref-ld-chr - this is the prefix of the LD-score and annotation files that S-LDSC uses. These are similar to the standard S-LDSC input files, with an important addition: the annotation files must include columns called A1,A2 for the effect and alternative alleles (because, unfortunately, SNP rsid is not a unique SNP identifier). Additionally, it is strongly recommended that the LD-score files also include columns called A1,A2, to prevent excluding multiple SNPs with the same rsid from the estimation stage. PolyFun will accept files with either a .gz or a .parquet extension (parquet is faster).
- --w-ld-chr - this is the prefix of the LD-score weight files. These files should be similar to standard LD-score files, but with only one 'annotation' that includes weights. These weights should be equal to the (non-stratified) LD-scores of the SNPs, computed using plink files that include only the set of SNPs used for fitting S-LDSC. As before, it is strongly recommended that these files include A1,A2 columns.
We strongly encourage you to look at the input files provided in the example_data directory to get a sense of their structure.
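For example, a quick way to peek at these files from Python (the exact per-chromosome file names below are assumptions; adjust them to the actual names in the example_data directory):

```python
import pandas as pd

# peek at a per-chromosome annotation file; the columns should include A1,A2
annot = pd.read_parquet('example_data/annotations.1.annot.parquet')
print(annot.columns.tolist())

# peek at the matching LD-score weight file
weights = pd.read_parquet('example_data/weights.1.l2.ldscore.parquet')
print(weights.head())
```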
Please note that the annotation files in this example are just a toy example. Please see the annotations wiki page for the full set of annotations.
PolyFun approach 3: Computing prior causal probabilities non-parametrically
This is done in four stages:
1. Create a munged summary statistics file in a PolyFun-friendly parquet format.
This is done exactly as in step 1 of Approach 2 (see above).
2. Run PolyFun with L2-regularized S-LDSC
In this stage PolyFun will estimate per-SNP heritabilities for SNPs on odd (resp. even) chromosomes by applying L2-regularized S-LDSC to even (resp. odd) chromosomes, and will then partition the SNPs into bins. This is done similarly to stage 2 of Approach 2 (see above), except that you should remove the --no-partitions flag. There are several additional flags that you can specify:
- --skip-Ckmedian - this tells PolyFun to partition SNPs into bins using scikit-learn instead of Ckmedian. This yields a suboptimal clustering, so Ckmedian is preferable. You should only specify this argument if rpy2 and/or Ckmeans.1d.dp are not installed on your machine or you can't get them to run.
- --num-bins <K> - this specifies the number of bins to partition SNPs into. By default PolyFun will try to estimate this number. You should specify this number if either (a) you specified --skip-Ckmedian (because scikit-learn cannot estimate the number of bins) or (b) the estimation is too slow.
Here is a usage example:
mkdir -p output
python polyfun.py \
--compute-h2-L2 \
--output-prefix output/testrun \
--sumstats example_data/sumstats.parquet \
--ref-ld-chr example_data/annotations. \
--w-ld-chr example_data/weights.
3. Compute LD-scores for each SNP bin
In this stage PolyFun will compute LD-scores for each SNP bin. This is the most computationally intensive part of PolyFun. PolyFun can compute LD-scores for all chromosomes in the same run, or for only one chromosome at a time. The second option is recommended if you can run multiple jobs on a cluster (see the sketch after the notes below).
Here is a usage example that computes LD-scores for only chromosome 1, using genotypes from the 1000 Genomes reference panel:
python polyfun.py \
--compute-ldscores \
--output-prefix output/testrun \
--bfile-chr example_data/reference. \
--chr 1
Here is an example that computes LD-scores for only chromosome 1 by downloading precomputed UK Biobank LD matrices:
mkdir -p LD_cache
python polyfun.py \
--compute-ldscores \
--output-prefix output/testrun \
--ld-ukb \
--ld-dir LD_cache \
--chr 1
- You must specify the same --output-prefix argument that you provided in stage 2, because PolyFun requires intermediate files that were created in stage 2.
- If you remove the flag --chr, PolyFun will iterate over all chromosomes and compute LD-scores for all of them.
- The genotype or LD data should be population-matched to your study. Ideally this data should come from your study directly. In the first example we used a small subset of SNPs of European-ancestry individuals from the 1000 Genomes project. In the second example we used precomputed summary LD information from British-ancestry individuals in the UK Biobank.
- If you compute LD from individual-level data, there are various parameters that you can use to control the LD-score computations, analogous to the respective parameters in the ldsc package. Please type python polyfun.py --help to see all available parameters. The parameter --keep <keep file> can be especially useful if you have a very large reference panel and would like to speed up the computations by using only a subset of individuals.
- The --ld-dir flag specifies a directory to which precomputed UK Biobank LD files will be downloaded and cached for future use. If you don't specify this flag, PolyFun will use a temporary directory that may be deleted at the end of the analysis.
- You can run stages 2 and 3 together by invoking polyfun.py with both the flags --compute-h2-L2 and --compute-ldscores.
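As noted above, running one chromosome at a time is useful on a cluster. Here is a sketch that launches the stage-3 command once per chromosome, shown as a serial Python loop (on a real cluster you would instead wrap each command in your scheduler's submission syntax, e.g. qsub or sbatch):

```python
import subprocess

# run stage 3 separately for each autosome, using the same flags as in the
# UK Biobank LD example above; check=True aborts the loop on any failure
for chrom in range(1, 23):
    subprocess.run(
        ['python', 'polyfun.py',
         '--compute-ldscores',
         '--output-prefix', 'output/testrun',
         '--ld-ukb',
         '--ld-dir', 'LD_cache',
         '--chr', str(chrom)],
        check=True,
    )
```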
4. Re-estimate per-SNP heritabilities via S-LDSC
In this stage PolyFun will re-estimate per-SNP heritabilities robustly by estimating the heritability causally explained by each bin, using only the chromosomes that were not used to partition SNPs into bins. Technically, PolyFun will estimate the heritability causally explained by SNPs in each bin that are in an even (resp. odd) target chromosome, by applying S-LDSC with non-negativity constraints to even (resp. odd) chromosomes, excluding the target chromosome, and dividing the bin heritability by the number of SNPs in the bin. You can only run this stage after computing LD-scores for all chromosomes. Here is a usage example:
python polyfun.py \
--compute-h2-bins \
--output-prefix output/testrun \
--sumstats example_data/sumstats.parquet \
--w-ld-chr example_data/weights.
This script will output files with re-estimated per-SNP heritabilities that can be used directly for fine-mapping. Here is the output for chromosome 22 (seen via cat output/testrun.22.snpvar_constrained.gz | zcat | head):
CHR SNP BP A1 A2 SNPVAR Z N
22 rs139069276 16866502 G A 4.5173e-06 -1.2188e-02 383290
22 rs34747326 16870173 A G 3.5684e-06 -1.8948e+00 383290
22 rs4010550 16900134 G A 4.9595e-06 1.3657e+00 383290
22 rs5994099 16905044 G A 3.5684e-06 7.2224e-01 383290
22 rs59750689 16936369 T C 4.5173e-06 9.4481e-02 383290
22 rs148021587 16939232 C T 4.9595e-06 8.0397e-01 383290
22 rs3954635 17024983 A C 4.9595e-06 -3.4335e-01 383290
22 rs371202053 17034394 G A 4.9595e-06 -7.8644e-01 383290
22 rs200071370 17037779 C T 4.9595e-06 3.5049e-01 383290
The SNPVAR column contains per-SNP heritabilities. These can be used directly as prior causal probabilities in fine-mapping.
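If you want a single table of priors across all chromosomes for downstream fine-mapping, here is a sketch that concatenates the per-chromosome outputs (it assumes stage 4 was run for all autosomes; the merged file name is hypothetical):

```python
import pandas as pd

# read each per-chromosome output of stage 4; sep=r'\s+' handles both
# tab- and space-delimited variants of the whitespace-delimited files
dfs = [pd.read_csv(f'output/testrun.{c}.snpvar_constrained.gz', sep=r'\s+')
       for c in range(1, 23)]

# stack them into one genome-wide table of per-SNP priors
priors = pd.concat(dfs, ignore_index=True)
priors.to_csv('output/testrun.snpvar_constrained_all.gz',
              sep='\t', index=False)
```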