Fine-mapping : Fine-mapping aims to identify the causal variant(s) within a locus for a disease, given the evidence of the significant association of the locus (or genomic region) in GWAS of a disease.
Fine-mapping using individual data is usually performed by fitting the multiple linear regression model:
-
is a vector of genetic effects of variants.
Fine-mapping (using Bayesian methods) aims to estimate the PIP (posterior inclusion probability), which indicates the evidence for SNP j having a non-zero effect (namely, causal).
!!! info "PIP(Posterior Inclusion Probability)" PIP is often calculated by the sum of the posterior probabilities over all models that include variant j as causal.
$$ PIP_j:=Pr(b_j\neq0|X,y) $$
!!! info "Bayesian methods and Posterior probability" $$ Pr(M_m | O) = {{Pr(O | M_m) Pr(M_m)}\over{\sum_{i=1}^n{Pr( O | M_i) Pr(M_i)}}} $$
$O$ : Observed data
$M$ : Models (**the configurations of causal variants** in the context of fine-mapping).
$Pr(M_m | O)$: **Posterior Probability** of Model m
$Pr(O | M_m)$: **Likelihood** (the probability of observing your dataset given Model m is true.)
$Pr(M_m)$: **Prior** distribution of Model m (the probability of Model m being true)
${\sum_{i=1}^n{Pr( O | M_i) Pr(M_i)}}$: **Evidence** (the probability of observing your dataset), namely $Pr(O)$
!!! info "Credible sets"
A credible set refers to the minimum set of variants that contains all causal SNPs with probability
!!! note "Commonly used tools for fine-mapping"
Methods assuming only **one causal variant** in the locus
- ABF
Methods assuming **multiple causal variants** in the locus
- SUSIE / SUSIE-RSS
- CAVIAR, CAVIARBF, eCAVIAR
- FINEMAP
Methods assuming a small number of **larger causal effects** with a large number of **infinitesimal effects**
- SUSIE-inf
- FINEMAP-inf
Methods for **Cross-ancestry** fine-mapping
- SUSIEX
You can check [here](https://cloufield.github.io/CTGCatalog/Tools_Fine_mapping_README/) for more information.
In this tutorial, we will introduce SuSiE as an example. SuSiE stands for Sum of Single Effects” model.
The key idea behind SuSiE is :
where each vector
For fine-mapping with summary statistics using Susie (SuSiE-RSS), IBSS was modified (IBSS-ss) to take sufficient statistics (which can be computed from other combinations of summary statistics) as input. SuSie will then approximate the sufficient statistics to run fine-mapping.
!!! quote For details of SuSiE and SuSiE-RSS, please check : Zou, Y., Carbonetto, P., Wang, G., & Stephens, M. (2022). Fine-mapping from summary data with the “Sum of Single Effects” model. PLoS Genetics, 18(7), e1010299. Link
- Sumstats for the loci
- SNP list for calculating LD matrix
!!! example "Using python to check novel loci and extract the files." ```python import gwaslab as gl import pandas as pd import numpy as np
sumstats = gl.Sumstats("../06_Association_tests/1kgeas.B1.glm.firth",fmt="plink2")
...
sumstats.basic_check()
...
sumstats.get_lead()
Fri Jan 13 23:31:43 2023 Start to extract lead variants...
Fri Jan 13 23:31:43 2023 -Processing 1122285 variants...
Fri Jan 13 23:31:43 2023 -Significance threshold : 5e-08
Fri Jan 13 23:31:43 2023 -Sliding window size: 500 kb
Fri Jan 13 23:31:44 2023 -Found 59 significant variants in total...
Fri Jan 13 23:31:44 2023 -Identified 3 lead variants!
Fri Jan 13 23:31:44 2023 Finished extracting lead variants successfully!
SNPID CHR POS EA NEA SE Z P OR N STATUS
110723 2:55574452:G:C 2 55574452 C G 0.160948 -5.98392 2.178320e-09 0.381707 503 9960099
424615 6:29919659:T:C 6 29919659 T C 0.155457 -5.89341 3.782970e-09 0.400048 503 9960099
635128 9:36660672:A:G 9 36660672 G A 0.160275 5.63422 1.758540e-08 2.467060 503 9960099
```
We will perform fine-mapping for the first significant loci whose lead variant is `2:55574452:G:C`.
```
# filter in the variants in the this locus.
locus = sumstats.filter_value('CHR==2 & POS>55074452 & POS<56074452')
locus.fill_data(to_fill=["BETA"])
locus.harmonize(basic_check=False, ref_seq="/Users/he/mydata/Reference/Genome/human_g1k_v37.fasta")
locus.data.to_csv("sig_locus.tsv",sep="\t",index=None)
locus.data["SNPID"].to_csv("sig_locus.snplist",sep="\t",index=None,header=None)
```
check in terminal:
```bash
head sig_locus.tsv
SNPID CHR POS EA NEA BETA SE Z P OR N STATUS
2:54535206:C:T 2 54535206 T C 0.30028978 0.142461 2.10786 0.0350429 1.35025 503 9960099
2:54536167:C:G 2 54536167 G C 0.14885099 0.246871 0.602952 0.546541 1.1605 503 9960099
2:54539096:A:G 2 54539096 G A -0.0038474211 0.288489 -0.0133355 0.98936 0.99616 503 9960099
2:54540264:G:A 2 54540264 A G -0.1536723 0.165879 -0.926409 0.354234 0.857553 503 9960099
2:54540614:G:T 2 54540614 T G -0.1536723 0.165879 -0.926409 0.354234 0.857553 503 9960099
2:54540621:A:G 2 54540621 G A -0.1536723 0.165879 -0.926409 0.354234 0.857553 503 9960099
2:54540970:T:C 2 54540970 C T -0.049506452 0.149053 -0.332144 0.739781 0.951699 503 9960099
2:54544229:T:C 2 54544229 C T -0.14338203 0.151172 -0.948468 0.342891 0.866423 503 9960099
2:54545593:T:C 2 54545593 C T -0.1536723 0.165879 -0.926409 0.354234 0.857553 503 9960099
head sig_locus.snplist
2:54535206:C:T
2:54536167:C:G
2:54539096:A:G
2:54540264:G:A
2:54540614:G:T
2:54540621:A:G
2:54540970:T:C
2:54544229:T:C
2:54545593:T:C
2:54546032:C:G
```
!!! example Calculate LD matrix using PLINK ``` #!/bin/bash
plinkFile="../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing"
# LD r matrix
plink \
--bfile ${plinkFile} \
--keep-allele-order \
--r square \
--extract sig_locus.snplist \
--out sig_locus_mt
# LD r2 matrix
plink \
--bfile ${plinkFile} \
--keep-allele-order \
--r2 square \
--extract sig_locus.snplist \
--out sig_locus_mt_r2
```
Take a look at the LD matrix (first 5 rows and columns)
```
head -5 sig_locus_mt.ld | cut -f 1-5
1 -0.145634 0.252616 -0.0876317 -0.0876317
-0.145634 1 -0.0916734 -0.159635 -0.159635
0.252616 -0.0916734 1 0.452333 0.452333
-0.0876317 -0.159635 0.452333 1 1
-0.0876317 -0.159635 0.452333 1 1
head -5 sig_locus_mt_r2.ld | cut -f 1-5
1 0.0212091 0.0638148 0.00767931 0.00767931
0.0212091 1 0.00840401 0.0254833 0.0254833
0.0638148 0.00840401 1 0.204605 0.204605
0.00767931 0.0254833 0.204605 1 1
0.00767931 0.0254833 0.204605 1 1
```
Heatmap of the LD matrix:

!!! note Susie in R ``` install.packages("susieR")
# Fine-mapping with summary statistics
fitted_rss2 = susie_rss(bhat = sumstats$betahat, shat = sumstats$sebetahat, R = R, n = n, L = 10)
```
`R` : a `p` x `p` LD r matrix.
`N` : Sample size.
`bhat` : Alternative summary data giving the estimated effects (a vector of length `p`). This, together with shat, may be provided instead of z.
`shat` : Alternative summary data giving the standard errors of the estimated effects (a vector of length `p`). This, together with bhat, may be provided instead of z.
`L` : Maximum number of non-zero effects in the susie regression model. (defaul : `L = 10`)
!!! quote SusieR tutorial
For deatils, please check [SusieR tutorial - Fine-mapping with susieR using summary statistics](https://stephenslab.github.io/susieR/articles/finemapping_summary_statistics.html)
!!! tip "Use susieR in jupyter notebook (with Python):"
Please check : https://github.com/Cloufield/GWASTutorial/blob/main/12_fine_mapping/finemapping_susie.ipynb
- Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. (2020). A simple new approach to variable selection in regression, with application to genetic fine mapping. Journal of the Royal Statistical Society, Series B 82, 1273–1300. https://doi.org/10.1111/rssb.12388
- Zou, Y., Carbonetto, P., Wang, G. & Stephens, M. (2021). Fine-mapping from summary data with the “Sum of Single Effects” model. bioRxiv https://doi.org/10.1101/2021.11.03.467167
- Schaid, Daniel J., Wenan Chen, and Nicholas B. Larson. "From genome-wide associations to candidate causal variants by statistical fine-mapping." Nature Reviews Genetics 19.8 (2018): 491-504.
- Purcell, Shaun, et al. "PLINK: a tool set for whole-genome association and population-based linkage analyses." The American journal of human genetics 81.3 (2007): 559-575.