Fine_Mapping_PD.Rmd

---
title: "<center><h1>Fine Mapping:</h1>Parkinson's Disease</h1></center>" 
author: 
    "<div class='container'>
     <h3>Brian M. Schilder, Bioinformatician II<br>
     Raj Lab<br>
     Department of Neuroscience<br>
     Icahn School of Medicine at Mount Sinai<br>
     NYC, New York<br>
     </h3> 
     <a href='https://github.com/RajLabMSSM/Fine_Mapping' target='_blank'><img src='./echolocatoR/images/echo_logo_sm.png'></a> 
     <a href='https://github.com/RajLabMSSM' target='_blank'><img src='./web/images/github.png'></a> 
     <a class='item' href='https://rajlabmssm.github.io/RajLab_website/' target='_blank'>
        <img src='./web/images/brain-icon.png'>
        <span class='caption'>RAJ LAB</span>
     <a href='https://icahn.mssm.edu/' target='_blank'><img src='./web/images/sinai.png'></a>
     </div>"
date: "<br>Most Recent Update:<br> `r Sys.Date()`"
output:
 html_document:
    theme: cerulean
    highlight: zenburn
    code_folding: show
    toc: true
    toc_float: true
    smooth_scroll: true
    number_sections: false
    self_contained: true
    css: ./web/css/style.css
editor_options: 
  chunk_output_type: inline
--- 


```{r setup, message=F, warning=F, dpi = 600, class.output = "pre"}
# knitr::opts_chunk$set(error = T) # Knit even with errors
# devtools::install_local("echolocatoR/", force = T)
# library("echolocatoR")
source("echolocatoR/R/MAIN.R")
# reticulate::conda_install(envname = "echoR",
#                           packages = c("tqdm","scipy","scikit-learn",
#                                        "requests","pandas","pyarrow","fastparquet",
#                                        "bitarray","networkx","rpy2","tabix"), 
#                           python_version = 3.7,
#                           conda="/opt/anaconda3/condabin/conda")
if(startsWith(getwd(),"/sc/")){
  reticulate::use_condaenv("echoR")
} else {
  reticulate::use_condaenv("echoR", conda="/opt/anaconda3/condabin/conda")
}
# setwd("~/Desktop/Fine_Mapping")
# MINERVA PATHS TO SUMMARY STATISTICS DATASETS:
# Data_dirs <- read.csv("./Data/directories_table.csv")
finemap_results <- list()
coloc_results <- list()  
```

  
# Nalls et. al. (2019) w/ 23andMe

## LRRK2

LRRK2 is a tricky locus with multiple peaks. We therefore fine-mapped it a little differently:
1. Remove G2019S (known rare variant) and its correlates.
2. Identify the minimum and maximum span of SNPs that are in LD (r2) with the lead common SNP within the LRRK2 gene.

```{r}
quick_finemap(locus = "LRRK2")
lead.snp <- subset(finemap_DT, leadSNP)
merged <-data.table:::merge.data.table(finemap_DT, 
                              data.table::data.table(r2=LD_matrix[lead.snp$SNP,]^2,
                                                     SNP=names(LD_matrix[lead.snp$SNP,])), 
                              by="SNP")
subset(merged, r2>0.4)$POS %>% summary()
min_POS=min(subset(merged, r2>0.4)$POS)
max_POS=max(subset(merged, r2>0.4)$POS)
finemap_DT %>% arrange(P)
```

 
## Fine-mapping {.tabset .tabset-fade .tabset-pills} 

Parkinson's Disease GWAS.  

BA, specificity spelling, T-cell?, FDR5%, networks of co-expression[ed genes], singel-cell; NOD2, GPNMB, BST1 and CHRNB1 are eQTLs in MyND monocytes

```{r PD: Nalls23andMe_2019, results = 'asis', fig.height=10, fig.width=7}
# fig.show='hold'
dataset_name <- "Nalls23andMe_2019"
top_SNPs <- Nalls_top_SNPs <- import_topSNPs(
  topSS_path = Directory_info(dataset_name, "topSS"),
  chrom_col = "CHR", position_col = "BP", snp_col="SNP",
  pval_col="P, all studies", effect_col="Beta, all studies", gene_col="Nearest Gene",
  caption= "Nalls et al. (2018) w/ 23andMe PD GWAS Summary Stats",
  group_by_locus = T, 
  locus_col = "Locus Number",
  remove_variants = "rs34637584")
# Manually add new SNP of interest  
top_SNPs <- data.table::rbindlist(list(top_SNPs, 
                                      data.frame(CHR=14,POS=55360203, SNP="rs3783642", 
                                                   P=2e-308, Effect=1, Gene="ATG14", Locus=NA ),
                                      data.frame(CHR=12,POS=53797236, SNP="rs34656641",
                                                   P=2e-308, Effect=1, Gene="SP1", Locus=NA ),
                                      data.frame(CHR=5,POS=126181862, SNP="rs959573",
                                                   P=2e-308, Effect=1, Gene="LMNB1", Locus=NA ),
                                      data.frame(CHR=17,POS=40609405, SNP="rs9897702",
                                                   P=2e-308, Effect=1, Gene="ATP6V0A1", Locus=NA ),
                                      data.frame(CHR=12,POS=40388109, SNP="rs140722239",
                                                   P=2.691e-37, Effect=0.3869, Gene="SLC2A13", Locus=NA )
                                      ), use.names = T)
# loci <- unique(top_SNPs$Gene)
# # Filter loci
# finished_loci <- list.files("./Data/GWAS/Nalls23andMe_2019",pattern = "_ggbio.png$",
#            recursive = T, full.names = T) 
# info <- file.info(finished_loci)$ctime %>% as.character()
# loci <- finished_loci[which(!startsWith(info,"2020-01-24"))]%>% dirname() %>% dirname() %>% basename()
#  

finemap_results[[dataset_name]] <- finemap_loci(top_SNPs,
                                               loci = "LRRK2",#top_SNPs$Gene, #unique(c("LRRK2", loci)),
                                               trim_gene_limits = F,#c(T, rep(F,length(gene_list))),
                                               dataset_name = dataset_name,
                                               dataset_type = "GWAS",
                                               query_by ="tabix", # coordinates
                                               subset_path = "auto", 
                                               finemap_methods = c("ABF","SUSIE","POLYFUN_SUSIE","FINEMAP"),  
                                               force_new_subset = F,
                                               force_new_LD = F,
                                               force_new_finemap = F, 
                 fullSS_path = Directory_info(dataset_name, "fullSS.local"),
                 chrom_col = "CHR", position_col = "POS", snp_col = "RSID",
                 pval_col = "p", effect_col = "beta", stderr_col = "se", 
                 freq_col = "freq", MAF_col = "calculate",
                 A1_col = "A1",
                 A2_col = "A2",
                 bp_distance = 500000*2,
                 plot_window = 100000,
                 plot_Nott_binwidth = 100,
                 # max_snps=1000,
                 download_reference = T, 
                 LD_reference = "UKB",#"1KG_Phase1",
                 superpopulation = "EUR", 
                 LD_block = F, 
                 min_MAF = 0.001,
                 # min_POS = min_POS,
                 # max_POS = max_POS,
                 # remove_variants = c("rs34637584"),
                 # remove_correlates = c("rs34637584"),
                 conditioned_snps = "auto", # Use the lead SNP for each gene  
                 n_causal = 5, 
                 plot_types = c("simple"), #"fancy" 
                 Nott_bigwig_dir = "/pd-omics/data/Nott_2019",
                 PP_threshold = .95,
                 remove_tmps = T,  
                 server=F)  
```


# Summarise Results

## Gather AD Results

```{r}
# MERGED_DT <- lapply(c("Kunkle_2019","Posthuma_2018","Marioni_2018"), function(x){
#   message(x)
#   merged_DT <- merge_finemapping_results(minimum_support=1,
#                                           include_leadSNPs=T,
#                                           dataset = file.path("./Data/GWAS",x),
#                                           xlsx_path=F,
#                                           from_storage=T,
#                                           consensus_thresh = 2,
#                                           haploreg_annotation=F,
#                                           biomart_annotation=F, 
#                                           verbose = F)
#   return(merged_DT)
# }) %>% data.table::rbindlist(fill=T)
```

## Gather PD Results
```{r Gather Results}
merged_DT <- merge_finemapping_results(minimum_support=0,
                                      include_leadSNPs=T,
                                      dataset = "./Data/GWAS/Nalls23andMe_2019",
                                      xlsx_path=F,
                                      from_storage=T,
                                      consensus_thresh = 2,
                                      haploreg_annotation=F,
                                      biomart_annotation=F, 
                                      verbose = F) 
no_no_loci <- c("HLA-DRB5","MAPT","ATG14","SP1","LMNB1","ATP6V0A1", 
                # Tau region 
                "RETREG3","UBTF","FAM171A2","MAP3K14","CRHR1","MAPT-AS1","KANSL1","NSF","WNT3")
merged_DT <- subset(merged_DT, !(Locus %in% no_no_loci))

mean(subset(merged_DT, leadSNP)$mean.PP, na.rm = T)
mean(subset(merged_DT, Support>0)$mean.PP, na.rm = T)
mean(subset(merged_DT, Consensus_SNP)$mean.PP, na.rm = T)

head(arrange(merged_DT, Locus), 50)
```


```{r locus_order}

# Sort from loci with the smallest to largest UCS size.
locus_order <- merged_DT %>%
  dplyr::group_by(Locus, .drop=F) %>% 
  dplyr::summarise_at(vars(ends_with("Credible_Set")),
                      funs(size=n_distinct(SNP[.>0], na.rm = T),
                           length=n_distinct(SNP) )
                   ) %>%  
  dplyr::mutate(UCS_size = mean.Credible_Set_length) %>% 
 dplyr::arrange(-UCS_size) 
locus_order <- locus_order[!endsWith(colnames(locus_order), ".Credible_Set_length")]
locus_order$Locus <- factor(locus_order$Locus,  levels = locus_order$Locus, ordered = T)   

createDT(subset(merged_DT, Consensus_SNP == T))  

```

## Get filtering estimates

```{r}
snp_groups <- merged_DT %>% 
  dplyr::group_by(Locus) %>% 
  dplyr::summarise(Total.SNPs=n_distinct(SNP, na.rm = T),
                   nom.sig.GWAS=n_distinct(SNP[P<.05], na.rm = T),
                   sig.GWAS=n_distinct(SNP[P<5e-8], na.rm = T),
                   CS=n_distinct(SNP[Support>0], na.rm = T),
                   Consensus=n_distinct(SNP[Support>1], na.rm = T),
                   topConsensus=n_distinct(SNP[Consensus_SNP & mean.PP==max(mean.PP)], na.rm = T ),
                   topConsensus.leadGWAS=n_distinct(SNP[Consensus_SNP & leadSNP], na.rm = T ))
 
# snp_groups[,-1] %>% dplyr::summarise_all(.funs =  funs( mean(subset(., .>0)) ))
snp_groups[,-1] %>% colSums() / n_distinct(snp_groups$Locus) 
```

## Credible Set Bins (Fig 2a)

```{r Credible Set Bins}    
# n_bins <- 5
# seq(0,max_CS_size,n_bins)  
max_CS_size <- sapply(locus_order[,-1], max, na.rm=T) %>% max()
labels = c("0","1","2-4","5-7","8-10","11-15")
bin_counts <-
  locus_order %>% 
  reshape2:::melt.data.frame(measure.vars = grep("*_size$", colnames(locus_order), value = T), 
                             value.name = "CS_size") %>%
  dplyr::mutate(Method=gsub(".Credible_Set_size$|_size$","",variable))  %>%
  dplyr::group_by(Method, .drop=F) %>%
  # subset(Method=="UCS") %>%
  dplyr::mutate(bin = case_when(
      CS_size == 0 ~ labels[1],
      CS_size == 1 ~ labels[2],
      CS_size > 1 & CS_size <= 4 ~ labels[3],
      CS_size > 4 & CS_size <= 7 ~ labels[4],
      CS_size > 7 & CS_size <= 10 ~ labels[5],
      CS_size > 10 & CS_size <= 15 ~ labels[6], 
      ))
  # dplyr::mutate(count_bins = gsub("\\(|\\]|\\[","", cut(CS_size, breaks = breaks, include.lowest = F))) 
  # tidyr::separate(count_bins, into=c("start","end"),sep=",") %>% 
  # dplyr::mutate(bin= paste0(start,"-",end), type="CS") %>% 
bin_counts$bin <- factor(bin_counts$bin, levels = rev(labels), ordered = T)
```

```{r}
# Plot
custom_colors <- c("lightgrey",RColorBrewer::brewer.pal(n=5,"GnBu"))
gg_locus_summary <- ggplot(subset(bin_counts, Method!="mean"), aes(x=Method, fill=bin)) + 
  geom_bar(stat="count",show.legend = T, position = position_stack(reverse = F), color="lightblue") + 
  # scale_fill_brewer(palette = "Spectral", direction = -1) +
  scale_fill_manual(values = rev(custom_colors)) +
  # geom_text(aes(label = paste(bin,"SNPs")), position =  position_stack(vjust = .5), vjust=-1, stat = "count") +
  geom_text(aes(label = ..count..),  position =  position_stack(vjust = .5), vjust=.5, stat = "count") +
  theme_bw() + 
  labs(x=NULL, y="Loci", fill="Credible Set size") +
  coord_flip() +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        rect = element_blank(), 
        axis.text.x =element_blank(),
        axis.ticks = element_blank(),
        legend.position = "top") +
  guides(fill = guide_legend(nrow = 1, reverse = T))
gg_locus_summary
```
 
## Variant Annotation

```{r Variant Annotation}
snp_info <- biomart_snp_info(unique(subset(merged_DT, Support>0)$SNP))
# unique(snp_info$consequence_type_tv)
missense <- snp_info %>% 
  dplyr::group_by(refsnp_id) %>% 
  dplyr::summarise(Missense = ifelse(any(consequence_type_tv=="missense_variant",na.rm = T),T,F)) %>%
  data.table::data.table()

merged_DT <- data.table:::merge.data.table(merged_DT, missense,
                                           all.x=T,
                                           by.x="SNP",
                                           by.y = "refsnp_id")
missense_counts <- merged_DT %>% dplyr::group_by(Locus) %>%
  dplyr::summarise(Missense=sum(Missense, na.rm=T))
printer(sum(subset(merged_DT, Support>0)$Missense, na.rm = T),"missense mutations detected in UCS.")
```

## Consensus SNP Annotations

### Nott data

```{r Nott data}
# regions <- NOTT_2019.get_regulatory_regions(merged_DT)
# gr.regions <- GenomicRanges::makeGRangesFromDataFrame(df = regions, 
#                                                       seqnames.field = "chr", 
#                                                        start.field = "start", 
#                                                       end.field = "end", 
#                                                       keep.extra.columns = T)
# gr.peaks <- NOTT_2019.get_epigenomic_peaks(peak.dir="../../data/Nott_2019/peaks",
#                                              narrow_peaks = T,
#                                              broad_peaks = F)
regions <- readRDS("/Volumes/Steelix/fine_mapping_files/narrow_peaks_granges.RDS") 
## $$$$$ HERE $$$$$$ : use for fig 1, third col
# regions <- NOTT_2019.get_regulatory_regions(finemap_DT = merged_DT)  
# Report overlap 
gr.hits <- NOTT_2019.report_regulatory_overlap(finemap_DT=subset(merged_DT,Consensus_SNP),
                                              regions=regions)
# gr.hits <- data.frame(gr.hits)[,c("SNP","Name","Cell_type","Element","start","end","middle")]
gr.hits <- data.frame(gr.hits)[,c("SNP","name","score","Cell_type","Assay","marker")]

 
merged_annot <- data.table:::merge.data.table(data.table::data.table(merged_DT), 
                                               data.table::data.table(gr.hits),
                                               all.x = T,
                                               by = "SNP") 
if(nrow( subset(merged_annot, !is.na(Cell_type)) )==0){warning("No SNPs overlapping with peaks...")} 
```

### Melt Consensus Annotation Data        
 
```{r Melt Consensus Annotation Data}
# Identify whether consensus SNPs are GWAS lead
# You have to subset within summarise (not beforehand) to avoid  dropping any loci without consensus SNPs  
merged_annot <- merged_annot %>% 
  dplyr::group_by(Locus, Cell_type, Assay) %>% 
  dplyr::mutate(topConsensus = ifelse(Consensus_SNP & mean.PP==max(mean.PP,na.rm = T),T,F)) %>%
  data.table::data.table()


consensus_melt <- # merged_annot %>%
  # dplyr::group_by(Locus, Cell_type, Assay) %>% 
  # dplyr::summarise(Count =  n_distinct(SNP[topConsensus], na.rm = T)) %>%
  setDT(merged_annot)[, .(Count = n_distinct(SNP[Consensus_SNP==T], na.rm = T)),
                                       by=c("Locus","Cell_type","Assay")] %>%
  subset(!is.na(Cell_type) & !is.na(Assay)) %>%
  dplyr::mutate(Annotation = paste0(Cell_type,"_",Assay)) %>%
  # Merge with missense data
  merge(missense_counts, on="Locus", all=T)
consensus_melt$Locus <- factor(consensus_melt$Locus,  levels = locus_order$Locus, ordered = T)
length(unique(consensus_melt$Locus)) 


# consensus_melt %>% group_by(Locus) %>% summarise(cell_types = paste(unique(Cell_type), collapse="; "))
# subset(consensus_melt, Locus %in% c("CRLS1","CLCN3","HIP1R"))
```


```{r merge data and plot} 
library(patchwork)

# Group and melt CS sizes
CS_cols <- grep(".Credible_Set$", colnames(merged_DT), value = T)
CS_cols <- CS_cols[!startsWith(CS_cols, "mean")]
melt.dat <- merged_DT %>%
  dplyr::select(c("Locus","SNP",CS_cols)) %>% 
  unique() %>%
  dplyr::group_by(Locus, .drop=F) %>%
    summarise_at(vars(ends_with(".Credible_Set")),
                 funs(n_distinct(SNP[.>0], na.rm = T))) %>% 
  reshape2:::melt.data.frame(measure.vars = CS_cols, 
                             variable.name = "CS", 
                             value.name = "Credible Set size") %>%  
  merge(locus_order, by="Locus", all = T) %>%
   dplyr::mutate(Method=gsub(".Credible_Set$","", CS),
                Locus_UCS=paste0(Locus,"  (",UCS_size,")")) %>%
  arrange(UCS_size, Method) %>%
  dplyr::mutate(Locus = factor(Locus,  levels = locus_order$Locus, ordered = T),
                Method=factor(Method ))
melt.dat$Locus <- factor(melt.dat$Locus,  levels = locus_order$Locus, ordered = T)
 

## UCS
# gg_CS_count <- ggplot(data=locus_order,aes(x=Locus, y=UCS_size, fill=-UCS_size)) +
#   geom_col(color="white", show.legend = F) +
#   geom_text(aes(label = UCS_size, color=-UCS_size), nudge_y = .5, size=3, show.legend = F) +
#   labs(y="Union Credible Set size") +
#   coord_flip() + 
#  theme_classic()

## Method-specific CS
melt.dat[melt.dat$`Credible Set size`==0 | is.na(melt.dat$`Credible Set size`),"Credible Set size"] <- NA
gg_method_CS_count <- ggplot(data = melt.dat,
                             aes(y=Locus, x=`Credible Set size`, fill=Method)) + 
  geom_bar(stat = "identity", color="white", size=.05) + 
  geom_text(aes(label = `Credible Set size`), color="grey20",
            size=3, show.legend = F, position = position_stack(vjust = .5)) +
  geom_text(aes(x=sum(`Credible Set size`), label = UCS_size),
            size=3, show.legend = F, position = position_stack(vjust = 1)) +
  
  labs(x=NULL, y="Locus  (UCS size)\n") + 
  # coord_flip() +
  theme_bw() +
  theme(legend.position = "top", 
        # axis.text.y = element_blank(),
        # axis.title.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        legend.text = element_text(size = 8), 
        legend.key.size = unit(.5, units = "cm" )) +
  guides(fill = guide_legend(nrow = 2, 
                             title.position = "top",
                             title.hjust = .5)) 

# gg_method_CS_count
consensus_melt$dummy <- "UCS missense\nmutations"
consensus_melt[consensus_melt$Missense==0 | is.na(consensus_melt$Missense),"Missense"] <- NA
consensus_melt$Missense <- factor(consensus_melt$Missense, ordered = T)

gg_missense <- ggplot(data=consensus_melt, aes(x=dummy, y=Locus, fill=Missense)) +
  geom_tile(show.legend = F) +
  geom_text(aes(label = Missense)) +
  scale_fill_discrete(na.value = "transparent") +
  # scale_fill_viridis_d(na.value = "transparent") +
  theme_bw() +
  labs(y=NULL, x=NULL) +
  theme(axis.text.y = element_blank(),
        axis.text.x = element_text(angle=40, hjust=1),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())
# gg_missense

## Variant annotations
# Extend palette beyond limit
# consensus_colors <-  c("white",colorRampPalette(RColorBrewer::brewer.pal(8, "PuRd"))(max(consensus_melt$Count)))
consensus_melt[consensus_melt$Count==0 | is.na(consensus_melt$Count),"Count"] <- NA
consensus_melt$Cell_type <- gsub("oligodendrocytes","oligo",consensus_melt$Cell_type)
consensus_melt$Cell_type <- gsub("peripheral PU1\\+","periph",consensus_melt$Cell_type)
gg_consensus_annot <- ggplot(data=subset(consensus_melt, !is.na(Assay)), 
                             aes(x=Assay, y=Locus, fill=Count, group=Count)) +
   geom_tile(color="white") +  
   # scale_fill_manual(values = consensus_colors) +
  # scale_fill_discrete(na.value = "transparent") +
   scale_fill_gradient(na.value = "transparent",low = scales::alpha("blue",.7), high = scales::alpha("red",.7)) +
   # geom_point(aes(size=ifelse(Count>0, "dot", "no_dot")), show.legend = F, alpha=.8, color="white") +
  facet_grid(facets = .~Cell_type) +  
   scale_size_manual(values=c(dot=.5, no_dot=NA), guide="none") +
   labs(fill = "UCS SNPs within\nepigenomic peaks") +
   theme_bw() +
   theme(legend.position = "top",  
         legend.title.align = .5,
         axis.text.x = element_text(angle = 40, hjust = 1),
         # legend.background =  element_rect(fill = "lightgray"),
         legend.key = element_rect(colour = "gray60"),
         axis.text.y = element_blank(), 
         axis.title.y = element_blank(), 
         strip.background = element_rect(fill="grey90"),
         legend.text = element_text(size = 8),
         legend.text.align = .5,
         # legend.key.size = unit(.5, units = "cm" ),
         legend.box="horizontal",
         panel.background = element_rect(fill = 'transparent'),
         # panel.grid = element_line(color="gray", size=5),
         # panel.grid.major.x = element_line(color="gray", size=.5),
         # panel.grid.major.x = element_line(color="gray", size=.5),

         panel.grid.minor = element_line(color="white", size=.5)) +  
  guides(color = guide_legend(nrow = 1, reverse = F,
                               title.position = "top",
                               # label.position = "top",
                               title.hjust = .5,
                               label.hjust = -1)) +
  # Keep unused levels/Loci
  scale_y_discrete(drop=FALSE)
# gg_consensus_annot

# Merge
gg_merged <- (patchwork::plot_spacer() + gg_locus_summary + plot_layout(widths = c(.4,.6)))  / 
  (gg_method_CS_count + gg_missense + gg_consensus_annot + plot_layout(widths = c(1,.1,1))) +
  patchwork::plot_layout(heights = c(.15,1),ncol = 1)
gg_merged

ggsave("./Data/GWAS/Nalls23andMe_2019/_genome_wide/Fig2.svg",dpi = 500,
       height=16, width=10)
```


## Regulatory Variants

(3) Regulatory - does it overlap chromatin accessibility in Glass data (or single cell) or histone modification and PLAC-seq? This class of variants will overlap TF binding sites, etc.  

```{r Regulatory Variants}

```


## QTLs

### catalogueR / eQTL Catalogue

```{r catalogueR / eQTL Catalogue}
if(startsWith(getwd(),"/sc/")){
  source("/sc/orga/projects/pd-omics/brian/catalogueR/functions/catalogueR.R")
} else {
  source("~/Desktop/catalogueR/functions/catalogueR.R")
}
# Query
sumstats_paths <- list.files("./Data/GWAS/Nalls23andMe_2019",
                             pattern = "*Multi-finemap_results.txt|*Multi-finemap.tsv.gz",
                             recursive = T, full.names = T)

no_no_loci <- c("HLA-DRB5","MAPT","ATG14","SP1","LMNB1","ATP6V0A1")
sumstats_paths <- sumstats_paths[!basename(dirname(dirname(sumstats_paths))) %in% no_no_loci]
# meta <- catalogueR.list_eQTL_datasets(save_path = F, force_new = T)
gwas.qtl_paths <- catalogueR.run(
                           sumstats_paths = sumstats_paths,
                           loci_names = basename(dirname(dirname(sumstats_paths))),
                           # qtl_search = c("ROSMAP","BrainSeq"),
                           # qtl_search =c("ROSMAP","Alasoo_2018","Fairfax_2014",
                           #               "Nedelec_2016","BLUEPRINT","HipSci.iPSC",
                           #               "Lepik_2017","BrainSeq","TwinsUK","Schmiedel_2018",
                           #               "blood","brain"),
                           qtl_search=NULL,
                           
                           nThread = 20,
                           multithread_qtl = F,
                           multithread_loci = F,
                           multithread_tabix = T,
                           split_files = T,
                           merge_with_gwas = T,
                           # use_tabix = F,
                           force_new_subset = T,
                           # output_path = "/Volumes/Scizor/eQTL_catalogue/Nalls23andMe_2019",
                           output_path = "../eQTL_catalogue/Nalls23andMe_2019",
                           progress_bar = F, 
                           genome_build="hg19"
                           ) 
# gwas.qtl_paths <- list.files("/pd-omics/brian/eQTL_catalogue/Nalls23andMe_2019", full.names = T, recursive = T)
```

#### top QTLs

```{r top QTLs, fig.height=7, fig.width=5}
qtl_root <- "./Data/GWAS/Nalls23andMe_2019/_genome_wide/eQTL_Catalogue"

# Gather top eVariants per QTL.ID per locus
# topQTL <- catalogueR.gather_top_eVariants(root_dir="/pd-omics/brian/eQTL_catalogue/Nalls23andMe_2019", 
#                                             save_path=file.path(qtl_root, "eQTL_catalogue_topHits.tsv.gz"),
#                                             nThread=4)

# Or just read in preprocessed file
topQTL <- data.table::fread(file.path(qtl_root,"eQTL_catalogue_topHits.tsv.gz"))

# Merge with fine-mapping results and filter by SNP groups ()
## Also filter by QTL p-value 
topQTL_sub <- catalogueR.top_eVariants_overlap(topQTL, 
                                             gwas_dataset="./Data/GWAS/Nalls23andMe_2019",
                                             gwas_min_support=1,
                                             qtl_pvalue_thresh=NULL)
# Construct the names of files in the topQTL ddf
topQTL_files <- catalogueR.top_files(topQTL = topQTL_sub, root_dir = root_dir)
# PLot heatmap with SNP group annotations
## THis plot is kind of unncessary now that the labels are integrated into the coloc plots
topQTL_plot <- catalogueR.plot_top_eVariants_overlap(topQTL_sub,
                                                     qtl_pvalue_thresh=NULL,
                                                     save_path=F)  
```


#### Colocalization

```{r Colocalization, fig.height=5, fig.width=10}
coloc_root <- "./Data/GWAS/Nalls23andMe_2019/_genome_wide/COLOC"
# Run COLOC
coloc_QTLs <- catalogueR.run_coloc(gwas.qtl_paths = gwas.qtl_paths,#gwas.qtl_paths,
save_path=file.path(coloc_root,"coloc.eQTL_Catalogue_topQTLs.tsv.gz"),
                                 nThread=4, split_by_group = F)


coloc_QTLs <- data.table::fread(file.path(coloc_root,"coloc.eQTL_Catalogue_topQTL.tsv.gz")) 

# PLOT
# PLot colocalized eGenes for each locus-qtl.id comparison 
coloc_plot_df <- catalogueR.plot_coloc_summary(coloc_QTLs, 
                                               topQTL,
                                               PP_thresh = .8,
                                               save_path=file.path(coloc_root,
                                                                   "coloc_topQTL.png"))
```


### Manually gathered QTL data

```{r}
dat <- data.table::fread(file.path("./Data/GWAS/Nalls23andMe_2019/_genome_wide",
                                       "Nalls23andMe_2019.QTL_overlaps.txt.gz"), 
                              nThread = 4)
query.snps <- subset(merged_DT, Locus %in% c("LRRK2","MED12L"))$SNP
# subset(qtl.dat, SNP %in% query.snps)
top.snps <- subset(dat, SNP %in% query.snps)

# top.snps_melt <-  setDT(top.snps)[, .( by=c("Locus","Cell_type","Assay")]
top.snps_melt <- top.snps %>% reshape2:::melt.data.frame(id.vars = c("SNP","Gene","leadSNP","Consensus_SNP"), 
                                        measure.vars = grep("*.FDR", colnames(top.snps), value = T),
                                        variable.name = "QTL",
                                        value.name ="FDR") %>% 
  dplyr::mutate(QTL=gsub(".FDR$","",QTL)) %>%
  subset(SNP %in% c("rs7294619","rs73159907") | leadSNP )
top.snps_melt <- top.snps_melt %>% 
  dplyr::group_by(SNP) %>%  
  dplyr::mutate(SNP.group = ifelse(leadSNP ,"red","goldenrod"))
# top.snps_melt <- top.snps_melt %>% 
#   dplyr::group_by(QTL) %>% 
#   dplyr::mutate(leadQTL = ifelse(FDR==max(FDR,na.rm = T),T,F))
# top.snps_melt$SNP <- factor(top.snps_melt$SNP, unique(top.snps_melt$SNP),ordered = T)
ggplot(subset(top.snps_melt,FDR<0.05), aes(x=SNP, y=QTL, fill=-log10(FDR))) + 
  geom_raster() + 
  # geom_point(data=subset(top.snps_melt, leadQTL), shape="*", color="cyan", size=5) +
  facet_grid(facets = .~Gene, scales = "free_x") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_continuous(low="thistle2", high="darkred", 
                       guide="colorbar",na.value="transparent") 


```

## Raj sQTLs

(2) Splicing variant (look up splicing QTL datasets that _WE_ generated). sQTLs  

  - [Paper](https://www.nature.com/articles/s41588-018-0238-1#Sec30)
    + Table S10 contains the top sQTLs from ROSMAP.
  - [sQTLviz](https://rajlab.shinyapps.io/sQTLviz_ROSMAP/)
  - [Raw Data](https://www.radc.rush.edu)  
  
```{r sQTLs}
sqtls <- readxl::read_excel("./Data/QTL/sQTLviz/Raj2018_TableS10.xlsx")  
colnames(sqtls) <- paste0("sQTL.",colnames(sqtls)) 

merged_DT <- merged_DT %>% dplyr::mutate(snp_id=paste0(CHR,":",POS)) %>% data.table::data.table()
merged_DT <- data.table:::merge.data.table(subset(merged_DT, Support>0),
                              data.table::data.table(sqtls), 
                              all.x = T,
                              by.x="snp_id", by.y="sQTL.snp_id")
merged_DT$Raj_2019.sQTL <- ifelse(is.na(merged_DT$sQTL.pval), 0, 1)
createDT(subset(merged_DT, !is.na(sQTL.pval)))
```


## Deep Learning Predictions
(5) Check to see DeepSEA and SpliceAI scores.  

### DeepSEA  
- Either use their server, or get pre-computed values from in silico mutagenesis?
```{r}
```

### SpliceAI
```{r}
```


(6) identify Transcription Factor Binding motifs for known TFs. Not sure which database to use.  
  - Does he mean TFBS? SNPs aren't good for identifying motifs bc they're not the full sequence.
```{r}
```

# Singleton comparison

Compare echolocatoR to Singleton results

```{r}
merged_DT <- merge_finemapping_results(minimum_support=0,
                                      include_leadSNPs=T,
                                      dataset = "./Data/GWAS/Nalls23andMe_2019",
                                      xlsx_path=F,
                                      from_storage=T,
                                      consensus_thresh = 2,
                                      haploreg_annotation=F,
                                      biomart_annotation=F, 
                                      verbose = F) 
dat <- data.table::fread("/Volumes/Steelix/fine_mapping_files/pd_meta5_sum_stats_fm_results.csv.gz")
DAT <- data.table::merge.data.table(merged_DT, dat,
                                    by.x =c("SNP","A1","A2"),
                                    by.y =c("SNP","A1","A2")) %>% 
  dplyr::group_by(SNP) %>% 
  top_n(n = 1, wt = -P) %>% 
  data.frame()
remove(merged_DT, dat)


CS_overlap <- function(DAT){
  OVERLAP <- lapply(c("ABF","FINEMAP","SUSIE","POLYFUN_SUSIE","mean"), function(x){
    print(x)
    cs1 <- unique(subset(DAT, prob>.95)$SNP)
    cs2 <- unique( DAT[DAT[[paste0(x,".Credible_Set")]]>0,]$SNP )
    cs2 <- cs2[!is.na(cs2)]
  
   
    overlap <- intersect(cs1, cs2)
    comp_df <- data.frame(Singleton_CS_size=length(cs1), 
                          Raj_CS_size=length(cs2),
                          Singleton_CS.overlap_proportion=round(length(overlap)/length(cs1),2),
                          Raj_CS.overlap_proportion=round(length(overlap)/length(cs2),2), 
                          overlap=length(overlap),
                          method=x)
    return(comp_df)
  }) %>% data.table::rbindlist()
  return(OVERLAP)
}


finemap_correlations <- function(DAT, 
                                 x_var="prob",
                                 y_var="FINEMAP.PP",
                                 dropna=T,
                                 method="pearson",
                                 y_CS_filter=F){
  if(y_CS_filter){
    fm_method <- gsub("\\.PP",".Credible_Set",y_var)
    DAT <- DAT[DAT[fm_method]>0,]
  }
 
  prob_dat <- data.frame(x=DAT[[x_var]], 
                         y=DAT[[y_var]])
  
  if(dropna){
     prob_dat <- prob_dat[complete.cases(prob_dat),] # or fillna with 0 
  } else { prob_dat[is.na(prob_dat)] <- 0}
 
  prob_thresh <- 0
  prob_dat <- subset(prob_dat, x>prob_thresh | y>prob_thresh)
  # if(method=="phi"){
  #   sqrt(chisq.test(table(prob_dat$x,prob_dat$y),
  #                   correct=FALSE)$statistic/length(prob_dat$x))
  # }
  out <- cor.test(prob_dat$x, prob_dat$y, method = method)   
  res <- data.table::data.table(x_var=x_var, y_var=y_var, 
                                t=out$statistic, 
                                pval=out$p.value,
                                coef=out$estimate,
                                method=method)
  return(res)
}

iterate_finemap_correlations <- function(DAT){
  RES <- lapply(c("ABF.PP","FINEMAP.PP","SUSIE.PP","POLYFUN_SUSIE.PP","mean.PP"), function(y){
  print(y)
  res <- finemap_correlations(DAT, x_var="prob", y_var=y, 
                              dropna = T, method = "pearson",
                              y_CS_filter=T)
  return(res)
}) %>% data.table::rbindlist()
  return(RES)
}

# RES 


plot_finemap_correlations <- function(DAT,
                                      x_var="prob",
                                      y_var="FINEMAP.PP"){
  prob_dat <- data.frame(x=DAT[[x_var]], 
                         y=DAT[[y_var]])
  ggplot(prob_dat, aes(x=x, y=y)) + 
    geom_point(alpha=.5) + 
    labs(title=paste("Singleton FINEMAP PP x Raj",gsub("\\."," ",y_var))) +
    theme_bw()
}
 
 
```


# Session Information

```{r Session Information}
sessionInfo()
print(paste("susieR ", packageVersion("susieR")))
```