Autoformat

Signed-off-by: Chaichontat Sriworarat <[email protected]>
gofflab · Nov 4, 2022 · a87a28e
1 parent 868a472

Showing 2 changed files with 110 additions and 60 deletions.
diff --git a/.lintr b/.lintr
@@ -0,0 +1,4 @@
+linters: with_defaults(
+    line_length_linter=line_length_linter(120L),
+    object_name_linter = NULL
+  )
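
A note for readers on newer lintr releases: to the best of my knowledge, `with_defaults()` was deprecated in lintr 3.0.0 in favour of `linters_with_defaults()`. The sketch below builds what should be the same linter set from an R session, assuming lintr >= 3.0.0 is installed. The `my_linters` name and the sample text are illustrative only and are not part of this commit.

```r
library(lintr)

# Same settings as the .lintr file above, using the newer helper name.
# Passing NULL removes a linter from the default set.
my_linters <- linters_with_defaults(
  line_length_linter = line_length_linter(120L),
  object_name_linter = NULL
)

# Lint a snippet of text directly (hypothetical input) to check the settings.
lints <- lint(text = "myVar <- 1\n", linters = my_linters)
print(lints)
```

With `object_name_linter` removed, a camelCase name such as `geneData.melt` or `plotData` in the notebook should not be flagged for its naming style.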
diff --git a/modules/module_9/notebooks/Module_9-EDA_and_plotting.rmd b/modules/module_9/notebooks/Module_9-EDA_and_plotting.rmd
@@ -20,7 +20,7 @@ knitr::opts_chunk$set(echo = TRUE)
 # setwd("modules/module_9/notebooks")
 library(SummarizedExperiment)
 library(tidyverse)
-#GEO <- "GSE63482"
+# GEO <- "GSE63482"
 
 # Read in the data
 data <- read.csv("https://github.com/gofflab/Quant_mol_neuro_2022/raw/main/modules/module_9/notebooks/GSE63482_Expression_matrix.tsv", sep = " ", header = T, row.names = 1)
@@ -114,41 +114,61 @@ se.pca$x %>%
 
 # Grammar of Graphics (ggplot2)
 
--   A layer-based framework to describe and construct visualizations in a structured manner.
+- A layer-based framework to describe and construct visualizations in a
+  structured manner.
 
--   Original grammar of graphics framework proposed by Leland Wilkinson
+- Original grammar of graphics framework proposed by Leland Wilkinson
 
--   Modern layered grammar of graphics framework was developed by Hadley Wickham (author of ggplot2 package)
+- Modern layered grammar of graphics framework was developed by Hadley Wickham
+  (author of ggplot2 package)
 
-    ![](images/paste-96C18E8F.png)
+  ![](images/paste-96C18E8F.png)
 
-    When designing a visualization, consider these components in hierarchical order:
+  When designing a visualization, consider these components in hierarchical
+  order:
 
-    1.  **Data**: Exactly *what* data do you plan on representing in your visualization? How does this need to be prepared in order to display your concept?
+  1.  **Data**: Exactly _what_ data do you plan on representing in your
+      visualization? How does this need to be prepared in order to display your
+      concept?
 
-    2.  **Aesthetics**: What axes are needed for the data dimensions, or positions of various data points in the plot? Also check if any form of encoding is needed including size, shape, color and so on which are useful for plotting multiple data dimensions. These 'aesthetics' will define how your multi-dimensional data will be represented.
+  2.  **Aesthetics**: What axes are needed for the data dimensions, or positions
+      of various data points in the plot? Also check if any form of encoding is
+      needed including size, shape, color and so on which are useful for
+      plotting multiple data dimensions. These 'aesthetics' will define how your
+      multi-dimensional data will be represented.
 
-    3.  **Scale:** Do we need to scale the potential values, use a specific scale to represent multiple values or a range?
+  3.  **Scale:** Do we need to scale the potential values, use a specific scale
+      to represent multiple values or a range?
 
-    4.  **Geometric objects:** What 'geometries' or data visualization types are needed (e.g. points, bars, lines, tiles, etc)? How will you depict the data on the visualization.
+  4.  **Geometric objects:** What 'geometries' or data visualization types are
+      needed (e.g. points, bars, lines, tiles, etc.)? How will you depict the
+      data in the visualization?
 
-    5.  **Statistics:** Do we need to show some statistical measures in the visualization like measures of central tendency, spread, confidence intervals? Will different dimensions of your data be summarized in a specific way?
+  5.  **Statistics:** Do we need to show some statistical measures in the
+      visualization like measures of central tendency, spread, confidence
+      intervals? Will different dimensions of your data be summarized in a
+      specific way?
 
-    6.  **Facets:** Do we need to create subplots based on specific data dimensions (e,g, making the same plot but for different subsets of the data)?
+  6.  **Facets:** Do we need to create subplots based on specific data
+      dimensions (e.g., making the same plot but for different subsets of the
+      data)?
 
-    7.  **Coordinate system:** What kind of a coordinate system should the visualization be based on --- should it be cartesian or polar?
+  7.  **Coordinate system:** What kind of a coordinate system should the
+      visualization be based on --- should it be cartesian or polar?
 
--   [ggplot2 Cheat Sheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf)
+- [ggplot2 Cheat Sheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf)
 
 ## Plotting with ggplot2
 
 ### Start with organizing and planning your data
 
-ggplot2 works by binding a dataset to a ggplot object. For the most effective use of ggplot, data are expected to be in a data.frame-style organization:
+ggplot2 works by binding a dataset to a ggplot object. For the most effective
+use of ggplot, data are expected to be in a data.frame-style organization:
 
 ![](images/paste-5590FC1F.png)
 
-If we are interested in plotting the expression information for our se object, we might start with looking at the `assay` data:
+If we are interested in plotting the expression information for our se object,
+we might start with looking at the `assay` data:
 
 ```{r}
 head(assay(se, "fpkm"))
@@ -157,19 +177,18 @@ head(assay(se, "fpkm"))
 With this layout we can plot samples (variables) against each other like so.
 
 ```{r}
-plotData<-as.data.frame(assay(se,'fpkm')+1)
+plotData <- as.data.frame(assay(se, "fpkm") + 1)
 
 p <- ggplot(plotData)
-
-p<- p + geom_point(aes(x=E15_subcereb,y=E16_subcereb))
+p <- p + geom_point(aes(x = E15_subcereb, y = E16_subcereb))
 p
-
 ```
 
-we can layer-on additional visualizations that might help our interpretation of the data
+We can layer on additional visualizations that might help our interpretation of
+the data:
 
 ```{r}
-p <- p + geom_smooth(aes(x=E15_subcereb,y=E16_subcereb))
+p <- p + geom_smooth(aes(x = E15_subcereb, y = E16_subcereb))
 
 p
 ```
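
The layering pattern in this hunk can be tried without the `se` object. The following is a minimal sketch on simulated data: the `toy` data frame and its values are made up, and only the column names are borrowed from the notebook. It also declares the `aes()` mapping once in `ggplot()` so that every layer inherits it, which is one way to avoid repeating the mapping per geom.

```r
library(ggplot2)

# Hypothetical stand-in for as.data.frame(assay(se, "fpkm") + 1)
set.seed(1)
toy <- data.frame(E15_subcereb = rlnorm(200, meanlog = 2, sdlog = 1))
toy$E16_subcereb <- toy$E15_subcereb * rlnorm(200, meanlog = 0, sdlog = 0.3)

# The mapping set in ggplot() is inherited by each layer added with `+`
p <- ggplot(toy, aes(x = E15_subcereb, y = E16_subcereb)) +
  geom_point(alpha = 0.5) +      # layer 1: raw points
  geom_smooth(method = "lm") +   # layer 2: fitted trend
  scale_x_log10() +              # scales are not layers; they rescale the axes
  scale_y_log10()
p
```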
@@ -182,7 +201,10 @@ p + scale_y_log10() + scale_x_log10()
 
 ### Melting a SE object
 
-To add sample information to this plot, we'll need to reorganize the data to make it compatible with ggplot's input expectations, Lets grab the information for a single gene and organize it for plotting with ggplot2. First let's grab the gene expression information from assay(se,'fpkm')
+To add sample information to this plot, we'll need to reorganize the data to
+make it compatible with ggplot's input expectations. Let's grab the information
+for a single gene and organize it for plotting with ggplot2. First, let's grab
+the gene expression information from `assay(se, "fpkm")`:
 
 ```{r}
 gene <- "Fezf2"
@@ -192,42 +214,46 @@ geneData <- assay(se, "fpkm")[gene, ]
 geneData
 ```
 
-In order to transform these data from a wide-form format into a long-form format, we need to 'melt' geneData.
+In order to transform these data from a wide-form format into a long-form
+format, we need to 'melt' geneData.
 
 ```{r}
 geneData.melt <- reshape2::melt(geneData)
 geneData.melt
 ```
 
-Furthermore, to add all of the other useful information for each sample, we need to retrieve it from colData(se). We need to `merge` the melted data with the column data using a matching key (rownames of the melted expression values and colData!)
+Furthermore, to add all of the other useful information for each sample, we need
+to retrieve it from colData(se). We need to `merge` the melted data with the
+column data using a matching key (rownames of the melted expression values and
+colData!)
 
 ```{r}
-geneData.melt <- merge(geneData.melt, colData(se), by.x = 0, by.y = 0)
+geneData.melt <- merge(geneData.melt, colData(se), by.x = 0, by.y = 0)
 geneData.melt <- as.data.frame(geneData.melt)
 geneData.melt
 ```
 
 We can now use this melted & merged data to make a ggplot plot.
 
 ```{r}
-p <- ggplot(geneData.melt, aes(x = SampleID, y = value))
+p <- ggplot(geneData.melt, aes(x = SampleID, y = value))
 
 p + geom_point()
 ```
 
 ### Group by Age
 
 ```{r}
-p + geom_line(aes(color=CellType,group=CellType))
+p + geom_line(aes(color = CellType, group = CellType))
 
 p <- ggplot(geneData.melt) +
-  geom_point(aes(x=Age,y=value,color=CellType)) + 
-  geom_line(aes(x=Age,y=value, group=CellType, color=CellType))
+  geom_point(aes(x = Age, y = value, color = CellType)) +
+  geom_line(aes(x = Age, y = value, group = CellType, color = CellType))
 p
 
-p <- ggplot(geneData.melt) + 
-  geom_boxplot(aes(x=CellType,y=value,fill=CellType)) + 
-  geom_point(aes(x=CellType,y=value,size=Age), color="black")
+p <- ggplot(geneData.melt) +
+  geom_boxplot(aes(x = CellType, y = value, fill = CellType)) +
+  geom_point(aes(x = CellType, y = value, size = Age), color = "black")
 ```
 
 ### Color by CellType
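
A self-contained sketch of the melt, merge, and color-by-CellType steps, using toy data in place of `assay(se, "fpkm")` and `colData(se)`. All values, sample names, and the `sampleInfo` and `merged` names are hypothetical. Column names are passed to `aes()` unquoted, because a quoted string is treated as a constant rather than as a column.

```r
library(ggplot2)

# Hypothetical expression values for one gene, named by sample
geneData <- c(
  E15_cpn = 1.2, E16_cpn = 1.8, E18_cpn = 2.1,
  E15_subcereb = 8.5, E16_subcereb = 12.3, E18_subcereb = 15.0
)

# Hypothetical sample annotation; row names match the names of geneData
sampleInfo <- data.frame(
  SampleID = names(geneData),
  Age = rep(c("E15", "E16", "E18"), times = 2),
  CellType = rep(c("cpn", "subcereb"), each = 3),
  row.names = names(geneData)
)

# Melting a named vector gives a one-column data frame with sample row names
geneData.melt <- reshape2::melt(geneData)

# by.x = 0 / by.y = 0 joins on the row names of both tables
merged <- merge(geneData.melt, sampleInfo, by.x = 0, by.y = 0)

# One line per cell type, colored by cell type
p <- ggplot(merged, aes(x = Age, y = value, color = CellType, group = CellType)) +
  geom_line() +
  geom_point()
p
```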
@@ -249,37 +275,42 @@ p +
 
 ## Functionalizing this process
 
-Let's make a new function that takes as arguments a SummarizedExperiment object and a gene name, and melts and merges the data as above so that we can quickly make a data structure to help with plotting for any single gene
+Let's make a new function that takes as arguments a SummarizedExperiment object
+and a gene name, and melts and merges the data as above so that we can quickly
+make a data structure to help with plotting for any single gene.
 
 ```{r}
 
 meltSE <- function(se, geneName = "Sox2") {
-  tmp<-assay(se,'fpkm')[geneName,]
-  tmp.melt<-reshape2::melt(tmp)
-  tmp.melt<-merge(tmp.melt,colData(se),by.x=0,by.y=0)
-  tmp.melt<-as.data.frame(tmp.melt)
+  tmp <- assay(se, "fpkm")[geneName, ]
+  tmp.melt <- reshape2::melt(tmp)
+  tmp.melt <- merge(tmp.melt, colData(se), by.x = 0, by.y = 0)
+  tmp.melt <- as.data.frame(tmp.melt)
   return(tmp.melt)
 }
 ```
 
-Use your new function to get the melted gene information for three different genes, "Satb2", "Bcl11b", "Tle4" and quickly make line plots by CellType across the sample Ages (Age vs. Expression grouped by CellType)
+Use your new function to get the melted gene information for three different
+genes, "Satb2", "Bcl11b", "Tle4" and quickly make line plots by CellType across
+the sample Ages (Age vs. Expression grouped by CellType)
 
 ```{r}
-gene<-"Tle4"
+gene <- "Tle4"
 
-gene<-"Sox2"
-plotData<-meltSE(se)
+gene <- "Sox2"
+plotData <- meltSE(se)
 
-p <- ggplot(plotData,aes(x=Age,y=value))
+p <- ggplot(plotData, aes(x = Age, y = value))
 p +
   geom_line(aes(color = CellType, group = CellType)) +
-  geom_point() + 
+  geom_point() +
   ggtitle(gene)
 ```
 
 ## Multiple genes
 
-Let's see what happens when we try and get the gene information for a handful of genes using our meltSE function
+Let's see what happens when we try and get the gene information for a handful of
+genes using our meltSE function
 
 ```{r}
 geneset <- c("Satb2", "Cux2", "Cux1", "Lhx2")
@@ -288,11 +319,11 @@ genesetData <- meltSE(se, geneName = geneset)
 
 # meltSE version for multiple genes
 multiMeltSE <- function(se, geneset) {
-  tmp<-assay(se,'fpkm')[geneset,]
-  tmp.melt<-reshape2::melt(tmp)
-  colnames(tmp.melt)<-c("gene","SampleID","value")
-  tmp.melt<-merge(tmp.melt,colData(se),by.x="SampleID",by.y="SampleID")
-  tmp.melt<-as.data.frame(tmp.melt)
+  tmp <- assay(se, "fpkm")[geneset, ]
+  tmp.melt <- reshape2::melt(tmp)
+  colnames(tmp.melt) <- c("gene", "SampleID", "value")
+  tmp.melt <- merge(tmp.melt, colData(se), by.x = "SampleID", by.y = "SampleID")
+  tmp.melt <- as.data.frame(tmp.melt)
   return(tmp.melt)
 }
 
@@ -304,8 +335,8 @@ geneset.melt <- multiMeltSE(se, geneset)
 ```{r}
 p <- ggplot(geneset.melt)
 
-p + geom_boxplot(aes(x = Age, y = value, fill = Age)) + 
-  geom_point(aes(x=Age,y=value))
+p + geom_boxplot(aes(x = Age, y = value, fill = Age)) +
+  geom_point(aes(x = Age, y = value))
 
 p + geom_boxplot(aes(x = CellType, y = value, fill = CellType))
 p + geom_violin(aes(x = CellType, y = value, fill = CellType))
@@ -328,7 +359,15 @@ p + geom_line(aes(x = Age, y = value, color = CellType, group = CellType)) +
 
 ## Heatmaps
 
-Heatmaps are a staple of gene expression analysis and are an information dense way of exploring or summarizing the results of many types of tests or gene set selections. While you *can* generate a heatmap using ggplot2 (hint: geom_tile()), there are also several specialized tools for generating heatmaps that can be used instead. Here we will be using the R package '[ComplexHeatmap](https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html)' to generate and manipulate heatmaps of a subset of the genes in our SE object. The complete manual for ComplexHeatmap can be found [here](https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html).
+Heatmaps are a staple of gene expression analysis and are an information-dense
+way of exploring or summarizing the results of many types of tests or gene set
+selections. While you _can_ generate a heatmap using ggplot2 (hint:
+geom_tile()), there are also several specialized tools for generating heatmaps
+that can be used instead. Here we will be using the R package
+'[ComplexHeatmap](https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html)'
+to generate and manipulate heatmaps of a subset of the genes in our SE object.
+The complete manual for ComplexHeatmap can be found
+[here](https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html).
 
 ```{r}
 geneset <- c(
@@ -340,7 +379,8 @@ geneset <- c(
 )
 ```
 
-Again, let's make a subset of the SE assay data for only the genes in this list (we can do this because the rownames of the se object are the gene names)
+Again, let's make a subset of the SE assay data for only the genes in this list
+(we can do this because the rownames of the se object are the gene names)
 
 ```{r}
 se.subset <- se[geneset, ]
@@ -365,9 +405,16 @@ heatData <- assay(se.subset, "logfpkm")
 Heatmap(heatData)
 ```
 
-Let's add some annotation to the top of the heatmap to help us visually identify different sample parameterizations. Heatmap annotations are important components of a heatmap that it shows additional information that associates with rows or columns in the heatmap. ComplexHeatmap package provides very flexible supports for setting annotations and defining new annotation graphics. The annotations can be put on the four sides of the heatmap, by top_annotation, bottom_annotation, left_annotation and right_annotation arguments.
+Let's add some annotation to the top of the heatmap to help us visually identify
+different sample parameterizations. Heatmap annotations are important components
+of a heatmap that show additional information associated with rows or columns
+in the heatmap. The ComplexHeatmap package provides very flexible support for
+setting annotations and defining new annotation graphics. Annotations can be
+placed on any of the four sides of the heatmap via the `top_annotation`,
+`bottom_annotation`, `left_annotation`, and `right_annotation` arguments.
 
-The value for the four arguments should be in the HeatmapAnnotation class and should be constructed by HeatmapAnnotation() function
+The value for each of these four arguments should be a HeatmapAnnotation object
+constructed by the `HeatmapAnnotation()` function.
 
 ```{r}
 sampleAnnot <- HeatmapAnnotation(
@@ -379,11 +426,10 @@ sampleAnnot <- HeatmapAnnotation(
 Heatmap(heatData, top_annotation = sampleAnnot)
 ```
 
-
 What do the dendrograms represent? What do they tell us about the relationship
-between genes? between samples? Can we improve the sample clustering?
-Lets try a version where we scale the genes (row z-score) so that
-higher-expressing genes don't dominate the clustering.
+between genes? between samples? Can we improve the sample clustering? Let's try a
+version where we scale the genes (row z-score) so that higher-expressing genes
+don't dominate the clustering.
 
 ```{r}
 heatData.scaled <- t(scale(t(heatData)))