-
Notifications
You must be signed in to change notification settings - Fork 0
/
rna-seq-eda.Rmd
290 lines (246 loc) · 9.75 KB
/
rna-seq-eda.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
# RNA-seq EDA
Objective: learn how to make basic EDA plots (plotting scaled counts)
of RNA-seq data using tidy-style verbs.
The *tidybulk* Bioconductor package and associated
[tidy transcriptomics](https://stemangiola.github.io/tidytranscriptomics/post/tidy-transcriptomics-manifesto/)
ecosystem (tidySummarizedExperiment, tidySingleCell, tidyseurat, etc.)
provide a bridge between the *SummarizedExperiment*-style
representation of matrix data with attached metadata, to the
tidy-style representation of data (one row per observation).
For more details on the package, check out the
[tidybulk](https://stemangiola.github.io/tidybulk/)
website, and the publication: @tidybulk.
Here we will briefly examine the difference between tidy manipulation
of these matrix-style objects compared to how these objects would be
manipulated in other Bioconductor workflows (e.g. DESeq2, edgeR, or
limma workflows). As you will see, there are multiple ways to generate
the same or similar exploratory plots and analyses. we recommend to
consider which is more appropriate to your purpose. For package code,
you may prefer to have fewer dependencies, going with core
Bioconductor objects and packages. For scripting, the "fluent" and
piped style of *tidybulk* may be preferable, and easier for others to
read and modify your code at a future date.
Here we just compare code for some basic tasks, scaling counts and
performing DE with DESeq2. However, note that *tidybulk* has
many functionalities implemented, including dimension reduction and
visualization (PCA, MDS, tSNE, UMAP), clustering, gene set testing,
cell type composition analysis and cell abundance testing, unwanted
variation modeling, imputation, etc.
Start by loading this RNA-seq dataset from the following paper:
@king_klose "The pioneer factor OCT4 requires the chromatin
remodeller BRG1 to support gene regulatory element function in mouse
embryonic stem cells" <https://doi.org/10.7554/eLife.22631>. The
experiment focused on OCT4 as it is a "core pluripotency transcription
factor" which "occupies sites that would otherwise be inaccessible and
is required to shape the occupancy of additional pluripotency
transcription factors".
In this experiment, transcription in mouse embryonic stem cells (ESC)
was compared with and without OCT4. This was done using a conditional
mouse ESC line where treatment with a compound leads to loss of OCT4
expression. The experiment also involved the same approach to another
transcription factor BRG1, but we focus here on the OCT4 samples.
First we load the metadata about the samples:
```{r}
library(oct4)
dir <- system.file("extdata", package="oct4")
coldata <- read.csv(file.path(dir,"coldata.csv"))
coldata
coldata$files <- file.path(dir, "quants", coldata$names, "quant.sf.gz")
```
Read in the count data with *tximeta*, which automatically imports the
information about the gene provenance (which transcripts were used to
quantify the gene and isoform abundance). We then summarize the
quantification to the gene-level, and add the gene `SYMBOL`.
```{r eval=FALSE}
library(tximeta)
se <- tximeta(coldata)
gse <- summarizeToGene(se)
library(org.Mm.eg.db)
gse <- addIds(gse, "SYMBOL")
```
For this workflow, we don't need the inferential replicates (about
uncertainty regarding the quantification), so we keep just the counts,
abundances (TPM), and gene lengths. We also manipulate the metadata a
little bit.
```{r eval=FALSE}
library(SummarizedExperiment)
assayNames(gse)
assays(gse) <- assays(gse)[1:3]
gse$rep <- rep(1:3, 4)
colnames(gse) <- paste(gse$line,gse$condition,gse$rep,sep="-")
assay(gse, "counts") <- round(assay(gse, "counts")) # for DE consistency
# save for easy loading later
save(gse, file="data/oct4_obj.rda")
```
```{r message=FALSE}
library(SummarizedExperiment)
load("data/oct4_obj.rda")
```
The dataset looks like this (remember it has untreated and treated
samples for both OCT4 and BRG1).
```{r}
gse
```
We will be interested in the gene set from the Gene Ontology project,
which describes maintenance of pluripotency. We can extra this from
the mouse organism data package with the following three lines of
code:
```{r}
library(AnnotationDbi)
library(org.Mm.eg.db)
# pluripotency
tab <- AnnotationDbi::select(org.Mm.eg.db, "GO:0019827", "SYMBOL", "GO")
tab <- tab[!duplicated(tab$SYMBOL),]
pluri <- tab$SYMBOL
```
Now we start with *tidybulk* code, comparing in turn to base R.
Loading *tidySummarizedExperiment* allows us to operate on the `gse`
object using familiar "tidy" verbs, thanks to an abstraction described
in the [tidySummarizedExperiment](https://stemangiola.github.io/tidySummarizedExperiment/)
documentation.
We filter the samples that correspond to the OCT4 experiment, and then
modify the sample names. Samples are referred to with the special
`.sample` string, while features are referred to with `.feature`, so
here we create a new variable to plot the samples `sample_name`.
```{r message=FALSE}
library(tidySummarizedExperiment)
library(dplyr)
library(stringr)
oct4 <- gse %>%
filter(line == "OCT4") %>%
mutate(sample_name = .sample %>%
str_remove("OCT4-") %>%
factor(levels = unique(.)),
condition = condition %>% factor(c("untrt","trt")))
```
Note, the third line in the `mutate` call uses the `unique` function
to set the levels: the consequence of this is that the `sample_name`
factor will have levels in the order in which they are present in the
current character vector and not alphabetical.
```{r}
oct4
```
*tidybulk* provides access to many steps in bulk analysis, including
filtering and count scaling. For details on what is happening behind
the scene, see the help, e.g. `?keep_abundant` describes that it makes
use of `edgeR::filterByExpr`.
```{r message=FALSE}
library(tidybulk)
oct4 <- oct4 %>%
keep_abundant(factor_of_interest = condition) %>%
scale_abundance(method="RLE") # DESeq2 scaling
```
It is straightforward to pipe the data directly into plots:
```{r tidy-boxplot}
library(ggplot2)
oct4 %>%
ggplot(aes(sample_name, counts_scaled + 1)) +
geom_boxplot() +
scale_y_log10()
```
For comparing code, let's pull out the genes that remain:
```{r}
gene_idx <- oct4 %>% pivot_transcript() %>% pull(.feature)
head(gene_idx)
```
The equivalent code in DESeq2. Understanding this code requires
knowledge that `boxplot` plots columns of a matrix.
```{r deseq-boxplot}
library(DESeq2)
gse_sub <- gse[ gene_idx , gse$line == "OCT4" ]
gse_sub$condition <- factor(gse_sub$condition)
dds <- gse_sub %>%
DESeqDataSet(~condition) %>%
estimateSizeFactors()
boxplot(counts(dds, normalized=TRUE) + 1, log="y")
```
We can also make more interesting plots. E.g. for the genes involved
in pluripotency, make a line plot, highlighting OCT4. In addition,
center the log counts for each gene (subtract the mean of log counts
across samples).
```{r tidy-lines}
oct4 %>%
filter(SYMBOL %in% pluri) %>%
mutate(logcounts = log10(counts_scaled + 1)) %>%
mutate(Oct4 = ifelse(SYMBOL == "Pou5f1", "red", "black")) %>%
group_by(.feature) %>%
mutate(logcounts = logcounts - mean(logcounts)) %>%
ungroup() %>%
ggplot(aes(sample_name, logcounts, group=.feature, color=Oct4)) +
geom_point() +
geom_line() +
scale_color_identity()
```
The equivalent code for base R requires defining more intermediate
variables and control flow code (the `for` loop). While there's
nothing particularly right or wrong about the two choices, the above
prioritizes the operations in a way that is human readable. In some
cases, e.g. performing linear algebra operations on matrices, base R
code may prove to be more efficient, which is a consideration for what
to use in package source code.
```{r deseq-lines}
pluri_idx <- mcols(dds)$SYMBOL %in% pluri
mat <- log10(counts(dds, normalized=TRUE)[pluri_idx,] + 1)
mat <- mat - rowMeans(mat)
hilite <- rownames(dds)[which(mcols(dds)$SYMBOL == "Pou5f1")]
plot(mat[1,], type="n", ylim=c(-1,1), xlab="samples", ylab="logcounts")
for (i in 1:nrow(mat)) {
col <- ifelse(rownames(mat)[i] == hilite, "red", "black")
points(mat[i,], type="b", col=col)
}
```
We can test for differential expression with DESeq2:
```{r}
res <- dds[gene_idx,] %>%
DESeq(quiet=TRUE) %>%
results()
```
Or equivalently with *tidybulk*:
```{r}
oct4 <- oct4 %>%
test_differential_abundance(~condition, method="deseq2")
tidy_res <- oct4 %>%
pivot_transcript()
```
Because we have filtered the two objects identically, we obtain the
same test results:
```{r}
all.equal(rownames(res), tidy_res$.feature)
table(base_sig = res$padj < .1, tidy_sig = tidy_res$padj < .1)
```
Finally, we build up to a more interesting plot. Suppose we now want
to split the genes involved in pluripotency by the DE result (the
significance and LFC), and then add the gene symbol to the side.
We begin by building the dataset:
```{r}
plot_data <- oct4 %>%
filter(SYMBOL %in% pluri) %>%
mutate(logcounts = log10(counts_scaled + 1)) %>%
mutate(Oct4 = ifelse(SYMBOL == "Pou5f1", "red", "black")) %>%
group_by(.feature) %>%
mutate(logcounts = logcounts - mean(logcounts)) %>%
ungroup() %>%
mutate(gene_type = case_when(
padj < .1 & log2FoldChange > 0 ~ "up",
padj < .1 & log2FoldChange < 0 ~ "down",
TRUE ~ "null"))
```
Now we repeat the code from before, but now faceting by
`gene_type`. Furthermore, we use `geom_text_repel` to add labels to
the right side.
```{r tidy-lines-faceted}
library(ggrepel)
plot_data %>%
filter(gene_type != "null") %>%
ggplot(aes(sample_name, logcounts, group=.feature, color=Oct4)) +
geom_point() +
geom_line() +
geom_text_repel(data=plot_data %>%
filter(sample_name == "trt-3", gene_type != "null"),
aes(sample_name, logcounts, label=SYMBOL),
nudge_x=.5, seed=1, max.overlaps=Inf) +
scale_color_identity() +
facet_wrap(~gene_type) +
scale_x_discrete(expand = expansion(add = 2)) +
xlab("sample")
```