Commiting changes to Rpackage branch
1 parent
9ffe1bc
commit 8db6061
Showing 18 changed files with 1,086 additions and 77 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,9 @@ | ||
# FGEM | ||
Functional GWAS by Expectation Maximization | ||
Install using: | ||
|
||
`install.packages("devtools")` | ||
`library(devtools)` | ||
`install_github("CreRecombinase/FGEM")` | ||
|
||
Annotation scripts are in `data-raw` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
--- | ||
title: "Statistical Gene Mapping Using Gene Annotation and Expectation Maximization" | ||
author: "Nicholas Knoblauch" | ||
date: "October 7, 2015" | ||
output: html_document | ||
--- | ||
|
||
|
||
### Introduction | ||
|
||
GWAS, in its simplest form, seeks to identify loci significantly associated with a trait of interest. In the majority of GWAS, variants are weighted equally; that is, the prior probability that a particular variant contributes to the trait of interest is uniform. Because the number of loci in the human genome that are known to vary is in the millions, while the number of loci that contribute to a given trait is much smaller, the multiple-testing problem imposes an extremely high threshold for significance. | ||
The functional characterization of elements of the genome is another avenue by which we can understand the biological basis of complex traits. | ||
|
||
Instead of testing each observed variant in isolation for association with a trait of interest, it would be desirable to prioritize variants based on their functional relevance to that trait. Furthermore, in the context of coding regions, it would be desirable to summarize the contribution of all variants within a gene, rather than treating each variant individually. | ||
|
||
### Method | ||
|
||
#### Data | ||
For each of the $I$ genes being tested, $B_i$ is the gene-level Bayes factor from an association study for the trait of interest. For each gene we also have $J$ functional annotations. These annotations can be a mix of almost any type of data: binary data (e.g. presence or absence of a particular GO term), count data (e.g. number of exons), or continuous data (e.g. level of conservation across mammals). These annotations make up the matrix $A$, which is $I \times J$. | ||
|
||
#### Model | ||
For each gene, $Z_i=1$ indicates that gene $i$ is involved in the trait of interest, and $Z_i=0$ indicates that it is not. If we knew $Z_i$, we could use logistic regression to learn the importance of each annotation for the trait of interest, where $A_i$ is the $i$th row of $A$: | ||
$$\text{logit}(P(Z_i=1))=A_i\beta$$ | ||
|
||
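However, we do not know $Z_i$. We can instead use Empirical Bayes. Let $x_i$ denote the genotype data for gene $i$, and let $\pi_i(\beta)=P(Z_i=1|\beta)$ be the prior probability given by the logistic model above. If we write the Bayes factor as the probability of observing the genotype data given that the gene is causal divided by the probability of observing the genotype data given that the gene is not causal, $B_i=P(x_i|Z_i=1)/P(x_i|Z_i=0)$, then we can write the following likelihood function: | ||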
|
||
$$ P(x|\beta)=\prod_iP(x_i|\beta)=\prod_i [\pi_i(\beta)P(x_i|Z_i=1)+(1-\pi_i(\beta))P(x_i|Z_i=0)]$$ | ||
|
||
Dividing each factor by $P(x_i|Z_i=0)$, which does not depend on $\beta$, and recalling the definition of the Bayes factor, this is equivalent to | ||
|
||
$$P(x|\beta) \propto \prod_i[\pi_i(\beta)B_i+(1-\pi_i(\beta))]$$ | ||
We can then estimate $\beta$ by maximum likelihood. | ||
|
||
Using Bayes' rule, the posterior probability that gene $i$ is involved in the trait is | ||
$$P(Z_i=1|x)=\frac{P(x|Z_i=1)P(Z_i=1)}{P(x)}=\frac{\pi_i(\beta)B_i}{\pi_i(\beta)B_i+(1-\pi_i(\beta))}$$ | ||
|
||
### Code | ||
|
||
We will use Expectation Maximization to estimate $\beta$ (the current estimate of $\beta$ is denoted $\beta^{(t)}$) and the membership probabilities (i.e. $P(Z_i=1|x,\beta^{(t)})$). The first step is simply to set up our matrix of annotations. For this example we will use the Biological Process (BP) and Molecular Function (MF) GO terms. | ||
|
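The EM iteration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the FGEM package: the function name `fgem_em_sketch`, the added intercept column, the convergence tolerance, and the simulated annotation and Bayes factors are all invented for this example. The E-step computes $P(Z_i=1|x,\beta^{(t)})$ from the current estimate, and the M-step refits the logistic regression using those probabilities as fractional responses.

```r
# Illustrative EM sketch; NOT the FGEM package implementation.
# A: I x J annotation matrix, B: length-I vector of gene-level Bayes factors.
fgem_em_sketch <- function(A, B, max_iter = 500, tol = 1e-6) {
  X <- cbind(1, as.matrix(A))            # assumption: add an intercept column
  beta <- rep(0, ncol(X))                # beta^(0): prior of 0.5 for every gene
  loglik <- function(b) {
    prior <- plogis(drop(X %*% b))       # pi_i(beta)
    sum(log(prior * B + (1 - prior)))    # log of prod_i [pi_i B_i + (1 - pi_i)]
  }
  ll <- loglik(beta)
  for (iter in seq_len(max_iter)) {
    # E-step: posterior membership probabilities under the current beta
    prior <- plogis(drop(X %*% beta))
    post <- prior * B / (prior * B + (1 - prior))
    # M-step: logistic regression with fractional responses
    # (quasibinomial gives the same estimates without non-integer warnings)
    beta <- glm.fit(X, post, family = quasibinomial())$coefficients
    ll_new <- loglik(beta)
    converged <- abs(ll_new - ll) < tol
    ll <- ll_new
    if (converged) break
  }
  prior <- plogis(drop(X %*% beta))
  list(beta = unname(beta),
       posterior = prior * B / (prior * B + (1 - prior)),
       loglik = ll,
       iterations = iter)
}

# Simulated example: one binary annotation that raises the prior odds.
set.seed(1)
n_genes <- 2000
A <- matrix(rbinom(n_genes, 1, 0.3), ncol = 1)
Z <- rbinom(n_genes, 1, plogis(-2 + 1.5 * A[, 1]))   # true memberships
x <- rnorm(n_genes, mean = 2 * Z)                    # toy per-gene evidence
B <- dnorm(x, mean = 2) / dnorm(x, mean = 0)         # Bayes factor as a likelihood ratio
fit <- fgem_em_sketch(A, B)
fit$beta   # intercept and annotation effect
```

Because the simulated Bayes factors are true likelihood ratios, the fitted annotation coefficient should land near the simulated value; with real data the quality of $B_i$ determines how informative the fit is.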
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
--- | ||
title: "ASD Gene Set" | ||
output: html_document | ||
--- | ||
|
||
```{r setup, include=FALSE} | ||
knitr::opts_chunk$set(echo = TRUE) | ||
``` | ||
|
||
# Candidate ASD genes | ||
|
||
```{r load,echo=FALSE,message=FALSE,warning=FALSE} | ||
library(dplyr) | ||
library(ggplot2) | ||
library(tidyr)  # provides separate(), used in the effect chunk | ||
source("~/Dropbox/BayesianDA/FGEM/SFARIFunc.R") | ||
outdir <- "~/Dropbox/BayesianDA/GSE/" | ||
featdir <- "~/Desktop/SFARI/" | ||
consdf <- gen_annotations(filelist,featdir) | ||
consdf <- mutate(consdf,TADAq=ifelse(is.na(TADAq),1,TADAq)) | ||
consdf <- mutate(consdf,TADAi=ifelse(TADAq<0.15,1,0)) | ||
consdf <- mutate(consdf,UASD=pmax(TADAi,ASDc)) | ||
result_df <-readRDS("~/Dropbox/BayesianDA/GSE/COMBUASD-3.RDS") | ||
pred_df <- readRDS("~/Dropbox/BayesianDA/GSE/COMBUASD-3-PRED.RDS") | ||
err_df <- group_by(result_df,iter) %>% summarise(err=err[1]) | ||
nconsdf <- select(consdf,-ASDc,-TADAq,-TADAi) %>% rename(y=UASD) %>% combofeat(sep="5") | ||
``` | ||
|
||
## Including Plots | ||
|
||
You can also embed plots, for example: | ||
|
||
```{r misclass, echo=FALSE} | ||
ggplot(err_df)+geom_histogram(aes(x=err),binwidth=.005)+ggtitle(label="CV misclassification error across iterations") | ||
``` | ||
|
||
|
||
```{r effect,echo=FALSE} | ||
ggplot(result_df)+geom_histogram(aes(x=B))+facet_wrap(~coef)+ggtitle("Distribution of effect Betas") | ||
filter(result_df,B!=0) %>% mutate(BetaSign=ifelse(B>0,"Positive","Negative")) %>% filter(!grepl("5",coef)) %>% | ||
ggplot(aes(coef))+geom_bar()+geom_bar(aes(fill=BetaSign))+ggtitle("Number of nonzero Betas") | ||
filter(result_df,B!=0) %>% mutate(BetaSign=ifelse(B>0,"Positive","Negative")) %>% filter(grepl("5",coef)) %>% | ||
separate(coef,c("firstCoef","secondCoef"),sep="5",remove=T) %>% do(bind_rows(.,rename(.,firstCoef=secondCoef,secondCoef=firstCoef))) %>% | ||
ggplot(aes(secondCoef))+geom_bar()+geom_bar(aes(fill=BetaSign))+ggtitle("Number of nonzero Betas(Interaction Terms)") +facet_wrap(~firstCoef,scales="free") | ||
``` | ||
|
||
|
||
|
||
```{r} | ||
mpred <- group_by(pred_df,gene) %>% summarise(varp=var(pred),meanp=mean(pred),medp=median(pred),minp=min(pred),maxp=max(pred)) %>% ungroup() %>% arrange(desc(minp)) | ||
predf <- inner_join(consdf,mpred) %>% arrange(desc(minp)) | ||
predf <- arrange(predf,desc(meanp)) | ||
ggplot(predf)+geom_point(aes(x=log(meanp),y=log(TADAq))) | ||
ggplot(predf)+geom_point(aes(x=log(minp),y=log(TADAq))) | ||
write.table(predf,"~/Dropbox/BayesianDA/FGEM/ASD_Candidates.txt",col.names=T,row.names=F,sep="\t") | ||
finlist <- filter(predf,TADAq<0.5) %>% arrange(desc(meanp)) | ||
arrange(finlist,desc(minp)) %>% head  # assumes the lowest per-gene prediction (`minp`) is the intended sort key | ||
``` | ||
|
||
Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot. |
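The label construction in the `load` chunk can be illustrated on a toy table. This is a sketch only: the gene names and values are made up, and it assumes that `TADAq` is a TADA q-value and `ASDc` is a 0/1 flag for a curated candidate list, which the source does not state explicitly. A missing q-value is treated as 1 (no evidence), genes with q below 0.15 are flagged, and the final label is the union of the two flags.

```r
# Toy data; gene names and values are hypothetical.
toy <- data.frame(
  gene  = c("g1", "g2", "g3", "g4"),
  TADAq = c(0.01, NA, 0.40, 0.10),
  ASDc  = c(0, 1, 0, 0)
)
toy$TADAq[is.na(toy$TADAq)] <- 1            # missing q-value -> no evidence
toy$TADAi <- as.numeric(toy$TADAq < 0.15)   # indicator for a significant q-value
toy$UASD  <- pmax(toy$TADAi, toy$ASDc)      # union of the two gene sets
toy
```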
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
--- | ||
title: "ExAC_Corrections" | ||
author: "Nicholas Knoblauch" | ||
date: "October 15, 2016" | ||
output: html_document | ||
--- | ||
|
||
```{r setup, include=FALSE} | ||
knitr::opts_chunk$set(echo = TRUE) | ||
``` | ||
|
||
## R Markdown | ||
|
||
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>. | ||
|
||
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this: | ||
|
||
```{r data} | ||
library(FGEM) | ||
library(readr) | ||
library(dplyr) | ||
library(lazyeval) | ||
datafile <- "~/Dropbox/BayesianDA/FGEM_Data/TADA_ASC_SSC_results_Dec23.csv" | ||
anno_dff <- "~/Dropbox/BayesianDA/FGEM_Data/all_annotations.RDS" | ||
datadf <-read.table(datafile,header=T,sep=",",stringsAsFactors = F) | ||
anno_df <- readRDS(anno_dff) | ||
exacf <- "~/Dropbox/BayesianDA/FGEM_Data/fordist_cleaned_exac_r03_march16_z_pli_rec_null_data.txt" | ||
exacdf <- read.table(exacf,sep="\t",stringsAsFactors = F,header=T) | ||
exacdf <- dplyr::select(exacdf,-chr,-transcript) %>% dplyr::rename(Gene=gene) | ||
exacdf <- inner_join(exacdf,datadf) | ||
exac_feat <- c("syn_z", | ||
"mis_z", | ||
"lof_z", | ||
"pLI", | ||
"pRec", | ||
"pNull") | ||
# Regress each ExAC score on `bp` and keep the residuals. | ||
# Assumes no missing values, so each residual vector has one entry per row. | ||
eresid <- mutate(exacdf, | ||
res_syn_z=residuals(lm(syn_z~bp)), | ||
res_mis_z=residuals(lm(mis_z~bp)), | ||
res_lof_z=residuals(lm(lof_z~bp)), | ||
res_pLI_z=residuals(lm(pLI~bp)), | ||
res_pnull_z=residuals(lm(pNull~bp)), | ||
res_pRec_z=residuals(lm(pRec~bp))) | ||
exac_feat <- c("syn_z", | ||
"mis_z", | ||
"pLI", | ||
"pRec", | ||
"pNull") | ||
ofeat <- anno2df(select(eresid,Gene,one_of(exac_feat)),feat.name="ExAC_z") | ||
eres_feat <- select(eresid,Gene,starts_with("res_")) | ||
eres_feat <- anno2df(eres_feat,feat.name="ExAC_resid") | ||
anno_df <- bind_rows(anno_df,eres_feat) | ||
go_sigfeat <- c("GO:0071420", "GO:0006473", "GO:0097119", "GO:0001711", "GO:0097114") | ||
nsig_feat <- c(go_sigfeat,unique(eres_feat$feature)) | ||
nsig_feat <- nsig_feat[!nsig_feat%in%c("res_lof_z","res_pRec_z")] | ||
sanno_df <- filter(anno_df,feature=="GO:0018024") | ||
smdf <- cfeat_df(sanno_df,datadf) | ||
tgmodel <- gen_model("GO:0018024",anno_df,datadf) | ||
res_post <- posterior_results(tgmodel,datadf,anno_df) %>% arrange(desc(post_improvement))  # assumes `tgmodel` from the previous line is the intended model | ||
head(res_post) | ||
tail(res_post) | ||
res_results <- group_by(eres_feat,feature) %>% do(sem_df(cfeat_df(.,datadf))) | ||
exac_zs <- select(exacdf,Gene,BF,qvalue,one_of(exac_feat))  # needed by texac_zs below | ||
neresid <- inner_join(eresid,res_post) | ||
cor(log(neresid$new_posterior),log(neresid$res_pRec_z)) | ||
texac_zs <- mutate_at(exac_zs,vars(ends_with("_z")),percent_rank) | ||
filter(texac_zs,Gene=="KATNAL2") | ||
exacdf <- anno2df(exacdf,feat.name="ExAC") | ||
``` | ||
|
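The residualization step in the chunk above can be sketched in base R on simulated data. This is an illustration under assumptions, not the analysis itself: it assumes `bp` is a gene-length covariate in base pairs, and the gene names, score values, and effect sizes are invented. Each score is regressed on `bp`, and the residuals are stored in a `res_`-prefixed column.

```r
# Simulated stand-in for the ExAC table; all values are hypothetical.
set.seed(2)
toy <- data.frame(Gene = paste0("gene", 1:200),
                  bp = round(runif(200, 500, 10000)))
toy$mis_z <- 0.0003 * toy$bp + rnorm(200)                 # score that grows with length
toy$pLI   <- plogis(-1 + 0.0002 * toy$bp + rnorm(200))    # probability-scale score

score_cols <- c("mis_z", "pLI")
for (col in score_cols) {
  # Build the formula `<score> ~ bp` from the column name.
  fit <- lm(reformulate("bp", response = col), data = toy)
  toy[[paste0("res_", col)]] <- residuals(fit)
}

# Residuals from a least-squares fit are uncorrelated with the covariate.
cor(toy$res_mis_z, toy$bp)
```

Building the formula with `reformulate()` avoids string interpolation and keeps the loop to one line per score.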