Skip to content

Commit

Permalink
Autoformat
Browse files Browse the repository at this point in the history
Signed-off-by: Chaichontat Sriworarat <[email protected]>
  • Loading branch information
chaichontat committed Nov 4, 2022
1 parent 868a472 commit a87a28e
Show file tree
Hide file tree
Showing 2 changed files with 110 additions and 60 deletions.
4 changes: 4 additions & 0 deletions .lintr
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
linters: with_defaults(
line_length_linter=line_length_linter(120L),
object_name_linter = NULL
)
166 changes: 106 additions & 60 deletions modules/module_9/notebooks/Module_9-EDA_and_plotting.rmd
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ knitr::opts_chunk$set(echo = TRUE)
# setwd("modules/module_9/notebooks")
library(SummarizedExperiment)
library(tidyverse)
#GEO <- "GSE63482"
# GEO <- "GSE63482"
# Read in the data
data <- read.csv("https://github.com/gofflab/Quant_mol_neuro_2022/raw/main/modules/module_9/notebooks/GSE63482_Expression_matrix.tsv", sep = " ", header = T, row.names = 1)
Expand Down Expand Up @@ -114,41 +114,61 @@ se.pca$x %>%

# Grammar of Graphics (ggplot2)

- A layer-based framework to describe and construct visualizations in a structured manner.
- A layer-based framework to describe and construct visualizations in a
structured manner.

- Original grammar of graphics framework proposed by Leland Wilkinson
- Original grammar of graphics framework proposed by Leland Wilkinson

- Modern layered grammar of graphics framework was developed by Hadley Wickham (author of ggplot2 package)
- Modern layered grammar of graphics framework was developed by Hadley Wickham
(author of ggplot2 package)

![](images/paste-96C18E8F.png)
![](images/paste-96C18E8F.png)

When designing a visualization, consider these components in hierarchical order:
When designing a visualization, consider these components in hierarchical
order:

1. **Data**: Exactly *what* data do you plan on representing in your visualization? How does this need to be prepared in order to display your concept?
1. **Data**: Exactly _what_ data do you plan on representing in your
visualization? How does this need to be prepared in order to display your
concept?

2. **Aesthetics**: What axes are needed for the data dimensions, or positions of various data points in the plot? Also check if any form of encoding is needed including size, shape, color and so on which are useful for plotting multiple data dimensions. These 'aesthetics' will define how your multi-dimensional data will be represented.
2. **Aesthetics**: What axes are needed for the data dimensions, or positions
of various data points in the plot? Also check if any form of encoding is
needed including size, shape, color and so on which are useful for
plotting multiple data dimensions. These 'aesthetics' will define how your
multi-dimensional data will be represented.

3. **Scale:** Do we need to scale the potential values, use a specific scale to represent multiple values or a range?
3. **Scale:** Do we need to scale the potential values, use a specific scale
to represent multiple values or a range?

4. **Geometric objects:** What 'geometries' or data visualization types are needed (e.g. points, bars, lines, tiles, etc)? How will you depict the data on the visualization.
4. **Geometric objects:** What 'geometries' or data visualization types are
needed (e.g. points, bars, lines, tiles, etc)? How will you depict the
data on the visualization.

5. **Statistics:** Do we need to show some statistical measures in the visualization like measures of central tendency, spread, confidence intervals? Will different dimensions of your data be summarized in a specific way?
5. **Statistics:** Do we need to show some statistical measures in the
visualization like measures of central tendency, spread, confidence
intervals? Will different dimensions of your data be summarized in a
specific way?

6. **Facets:** Do we need to create subplots based on specific data dimensions (e,g, making the same plot but for different subsets of the data)?
6. **Facets:** Do we need to create subplots based on specific data
dimensions (e,g, making the same plot but for different subsets of the
data)?

7. **Coordinate system:** What kind of a coordinate system should the visualization be based on --- should it be cartesian or polar?
7. **Coordinate system:** What kind of a coordinate system should the
visualization be based on --- should it be cartesian or polar?

- [ggplot2 Cheat Sheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf)
- [ggplot2 Cheat Sheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf)

## Plotting with ggplot2

### Start with organizing and planning your data

ggplot2 works by binding a dataset to a ggplot object. For the most effective use of ggplot, data are expected to be in a data.frame-style organization:
ggplot2 works by binding a dataset to a ggplot object. For the most effective
use of ggplot, data are expected to be in a data.frame-style organization:

![](images/paste-5590FC1F.png)

If we are interested in plotting the expression information for our se object, we might start with looking at the `assay` data:
If we are interested in plotting the expression information for our se object,
we might start with looking at the `assay` data:

```{r}
head(assay(se, "fpkm"))
Expand All @@ -157,19 +177,18 @@ head(assay(se, "fpkm"))
With this layout we can plot samples (variables) against each other like so.

```{r}
plotData<-as.data.frame(assay(se,'fpkm')+1)
plotData <- as.data.frame(assay(se, "fpkm") + 1)
p <- ggplot(plotData)
p<- p + geom_point(aes(x=E15_subcereb,y=E16_subcereb))
p <- p + geom_point(aes(x = E15_subcereb, y = E16_subcereb))
p
```

we can layer-on additional visualizations that might help our interpretation of the data
we can layer-on additional visualizations that might help our interpretation of
the data

```{r}
p <- p + geom_smooth(aes(x=E15_subcereb,y=E16_subcereb))
p <- p + geom_smooth(aes(x = E15_subcereb, y = E16_subcereb))
p
```
Expand All @@ -182,7 +201,10 @@ p + scale_y_log10() + scale_x_log10()

### Melting a SE object

To add sample information to this plot, we'll need to reorganize the data to make it compatible with ggplot's input expectations, Lets grab the information for a single gene and organize it for plotting with ggplot2. First let's grab the gene expression information from assay(se,'fpkm')
To add sample information to this plot, we'll need to reorganize the data to
make it compatible with ggplot's input expectations, Lets grab the information
for a single gene and organize it for plotting with ggplot2. First let's grab
the gene expression information from `assay(se, 'fpkm')`

```{r}
gene <- "Fezf2"
Expand All @@ -192,42 +214,46 @@ geneData <- assay(se, "fpkm")[gene, ]
geneData
```

In order to transform these data from a wide-form format into a long-form format, we need to 'melt' geneData.
In order to transform these data from a wide-form format into a long-form
format, we need to 'melt' geneData.

```{r}
geneData.melt <- reshape2::melt(geneData)
geneData.melt
```

Furthermore, to add all of the other useful information for each sample, we need to retrieve it from colData(se). We need to `merge` the melted data with the column data using a matching key (rownames of the melted expression values and colData!)
Furthermore, to add all of the other useful information for each sample, we need
to retrieve it from colData(se). We need to `merge` the melted data with the
column data using a matching key (rownames of the melted expression values and
colData!)

```{r}
geneData.melt <- merge(geneData.melt, colData(se), by.x = 0, by.y = 0)
geneData.melt <- merge(geneData.melt, colData(se), by.x = "SampleId", by.y = "SampleId")
geneData.melt <- as.data.frame(geneData.melt)
geneData.melt
```

We can now use this melted & merged data to make a ggplot plot.

```{r}
p <- ggplot(geneData.melt, aes(x = SampleID, y = value))
p <- ggplot(geneData.melt, aes(x = "SampleID", y = "value"))
p + geom_point()
```

### Group by Age

```{r}
p + geom_line(aes(color=CellType,group=CellType))
p + geom_line(aes(color = CellType, group = CellType))
p <- ggplot(geneData.melt) +
geom_point(aes(x=Age,y=value,color=CellType)) +
geom_line(aes(x=Age,y=value, group=CellType, color=CellType))
geom_point(aes(x = Age, y = value, color = CellType)) +
geom_line(aes(x = Age, y = value, group = CellType, color = CellType))
p
p <- ggplot(geneData.melt) +
geom_boxplot(aes(x=CellType,y=value,fill=CellType)) +
geom_point(aes(x=CellType,y=value,size=Age), color="black")
p <- ggplot(geneData.melt) +
geom_boxplot(aes(x = CellType, y = value, fill = CellType)) +
geom_point(aes(x = CellType, y = value, size = Age), color = "black")
```

### Color by CellType
Expand All @@ -249,37 +275,42 @@ p +

## Functionalizing this process

Let's make a new function that takes as arguments a SummarizedExperiment object and a gene name, and melts and merges the data as above so that we can quickly make a data structure to help with plotting for any single gene
Let's make a new function that takes as arguments a SummarizedExperiment object
and a gene name, and melts and merges the data as above so that we can quickly
make a data structure to help with plotting for any single gene

```{r}
meltSE <- function(se, geneName = "Sox2") {
tmp<-assay(se,'fpkm')[geneName,]
tmp.melt<-reshape2::melt(tmp)
tmp.melt<-merge(tmp.melt,colData(se),by.x=0,by.y=0)
tmp.melt<-as.data.frame(tmp.melt)
tmp <- assay(se, "fpkm")[geneName, ]
tmp.melt <- reshape2::melt(tmp)
tmp.melt <- merge(tmp.melt, colData(se), by.x = 0, by.y = 0)
tmp.melt <- as.data.frame(tmp.melt)
return(tmp.melt)
}
```

Use your new function to get the melted gene information for three different genes, "Satb2", "Bcl11b", "Tle4" and quickly make line plots by CellType across the sample Ages (Age vs. Expression grouped by CellType)
Use your new function to get the melted gene information for three different
genes, "Satb2", "Bcl11b", "Tle4" and quickly make line plots by CellType across
the sample Ages (Age vs. Expression grouped by CellType)

```{r}
gene<-"Tle4"
gene <- "Tle4"
gene<-"Sox2"
plotData<-meltSE(se)
gene <- "Sox2"
plotData <- meltSE(se)
p <- ggplot(plotData,aes(x=Age,y=value))
p <- ggplot(plotData, aes(x = Age, y = value))
p +
geom_line(aes(color = CellType, group = CellType)) +
geom_point() +
geom_point() +
ggtitle(gene)
```

## Multiple genes

Let's see what happens when we try and get the gene information for a handful of genes using our meltSE function
Let's see what happens when we try and get the gene information for a handful of
genes using our meltSE function

```{r}
geneset <- c("Satb2", "Cux2", "Cux1", "Lhx2")
Expand All @@ -288,11 +319,11 @@ genesetData <- meltSE(se, geneName = geneset)
# meltSE version for multiple genes
multiMeltSE <- function(se, geneset) {
tmp<-assay(se,'fpkm')[geneset,]
tmp.melt<-reshape2::melt(tmp)
colnames(tmp.melt)<-c("gene","SampleID","value")
tmp.melt<-merge(tmp.melt,colData(se),by.x="SampleID",by.y="SampleID")
tmp.melt<-as.data.frame(tmp.melt)
tmp <- assay(se, "fpkm")[geneset, ]
tmp.melt <- reshape2::melt(tmp)
colnames(tmp.melt) <- c("gene", "SampleID", "value")
tmp.melt <- merge(tmp.melt, colData(se), by.x = "SampleID", by.y = "SampleID")
tmp.melt <- as.data.frame(tmp.melt)
return(tmp.melt)
}
Expand All @@ -304,8 +335,8 @@ geneset.melt <- multiMeltSE(se, geneset)
```{r}
p <- ggplot(geneset.melt)
p + geom_boxplot(aes(x = Age, y = value, fill = Age)) +
geom_point(aes(x=Age,y=value))
p + geom_boxplot(aes(x = Age, y = value, fill = Age)) +
geom_point(aes(x = Age, y = value))
p + geom_boxplot(aes(x = CellType, y = value, fill = CellType))
p + geom_violin(aes(x = CellType, y = value, fill = CellType))
Expand All @@ -328,7 +359,15 @@ p + geom_line(aes(x = Age, y = value, color = CellType, group = CellType)) +

## Heatmaps

Heatmaps are a staple of gene expression analysis and are an information dense way of exploring or summarizing the results of many types of tests or gene set selections. While you *can* generate a heatmap using ggplot2 (hint: geom_tile()), there are also several specialized tools for generating heatmaps that can be used instead. Here we will be using the R package '[ComplexHeatmap](https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html)' to generate and manipulate heatmaps of a subset of the genes in our SE object. The complete manual for ComplexHeatmap can be found [here](https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html).
Heatmaps are a staple of gene expression analysis and are an information dense
way of exploring or summarizing the results of many types of tests or gene set
selections. While you _can_ generate a heatmap using ggplot2 (hint:
geom_tile()), there are also several specialized tools for generating heatmaps
that can be used instead. Here we will be using the R package
'[ComplexHeatmap](https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html)'
to generate and manipulate heatmaps of a subset of the genes in our SE object.
The complete manual for ComplexHeatmap can be found
[here](https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html).

```{r}
geneset <- c(
Expand All @@ -340,7 +379,8 @@ geneset <- c(
)
```

Again, let's make a subset of the SE assay data for only the genes in this list (we can do this because the rownames of the se object are the gene names)
Again, let's make a subset of the SE assay data for only the genes in this list
(we can do this because the rownames of the se object are the gene names)

```{r}
se.subset <- se[geneset, ]
Expand All @@ -365,9 +405,16 @@ heatData <- assay(se.subset, "logfpkm")
Heatmap(heatData)
```

Let's add some annotation to the top of the heatmap to help us visually identify different sample parameterizations. Heatmap annotations are important components of a heatmap that it shows additional information that associates with rows or columns in the heatmap. ComplexHeatmap package provides very flexible supports for setting annotations and defining new annotation graphics. The annotations can be put on the four sides of the heatmap, by top_annotation, bottom_annotation, left_annotation and right_annotation arguments.
Let's add some annotation to the top of the heatmap to help us visually identify
different sample parameterizations. Heatmap annotations are important components
of a heatmap that it shows additional information that associates with rows or
columns in the heatmap. ComplexHeatmap package provides very flexible supports
for setting annotations and defining new annotation graphics. The annotations
can be put on the four sides of the heatmap, by top_annotation,
bottom_annotation, left_annotation and right_annotation arguments.

The value for the four arguments should be in the HeatmapAnnotation class and should be constructed by HeatmapAnnotation() function
The value for the four arguments should be in the HeatmapAnnotation class and
should be constructed by HeatmapAnnotation() function

```{r}
sampleAnnot <- HeatmapAnnotation(
Expand All @@ -379,11 +426,10 @@ sampleAnnot <- HeatmapAnnotation(
Heatmap(heatData, top_annotation = sampleAnnot)
```


What do the dendrograms represent? What do they tell us about the relationship
between genes? between samples? Can we improve the sample clustering?
Lets try a version where we scale the genes (row z-score) so that
higher-expressing genes don't dominate the clustering.
between genes? between samples? Can we improve the sample clustering? Lets try a
version where we scale the genes (row z-score) so that higher-expressing genes
don't dominate the clustering.

```{r}
heatData.scaled <- t(scale(t(heatData)))
Expand Down

0 comments on commit a87a28e

Please sign in to comment.