-
Notifications
You must be signed in to change notification settings - Fork 13
/
monocle3-example.Rmd
397 lines (323 loc) · 15.3 KB
/
monocle3-example.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
In order to run monocle3 a few steps must first be followed. Monocle3 requires R version 3.5 or higher, bioconductor 3.5 or higher and monocle3 0.1.0 to run the latest features.
1) Installing bioconductor:
```{r message=FALSE, warning=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()
```
A few bioconductor dependencies are not automatically installed so you should install the,.
2) Installing bioconductor dependencies:
```{r message=FALSE, warning=FALSE}
BiocManager::install(c('BiocGenerics', 'DelayedArray', 'DelayedMatrixStats',
'limma', 'S4Vectors', 'SingleCellExperiment',
'SummarizedExperiment', 'batchelor'))
```
In order to install monocle3 a few preparations must be made. First devtools must be installed into your R environment via the R console.
3) Installing devtools:
```{r message=FALSE, warning=FALSE}
install.packages('devtools')
```
If your rlang version is out of data this can cause a failure when trying to install monocle3 and garnett later on.
4) Updating rlang
```{r message=FALSE, warning=FALSE}
install.packages('rlang')
library(rlang)
```
Next your must install libudunits2-dev in your terminal.
5) Installing libudunits2-dev:
sudo apt-get update
sudo apt-get install libudunits2-dev
Finally libgdal-dev must also be installed on your terminal.
6) Installing libgdal-dev:
sudo apt-get update
sudo apt-get install --fix-missing libgdal-dev
Now we can finally installmonocle3.
7)Installing monocle3:
```{r message=FALSE, warning=FALSE}
devtools::install_github('cole-trapnell-lab/monocle3')
```
8) Installing garnett:
```{r message=FALSE, warning=FALSE}
BiocManager::install('org.Mm.eg.db')
BiocManager::install('org.Hs.eg.db')
devtools::install_github("cole-trapnell-lab/garnett", ref = 'monocle3')
```
Run the following code chunk to check everything has been installed correctly, and to load all relevant packages for this tutorial.
```{r message=FALSE, warning=FALSE}
library(monocle3)
library(shiny)
library(ggplot2)
library(dplyr)
library(garnett)
library(Matrix)
library(irlba)
```
#Clustering and classifying cells
The first thing we must do is read cellranger data into R. In order for the to correctly be done you must first ensure the data is in the right folder system. The folder system used depends on whether you have 10X version 2 or 10X version 3 data. Incase your aren't sure which version of 10X data you have: v2 barcode and gene files end .tsv and matrix file ends .mtx while v3 barcode and gene files end .tsv.gz and matrix file ends .mtx.gz.
7) a) If 10X version 2 the folder system is:
~/data/monocle-data/cellranger-data/outs/filtered_gene_bc_matrices/genome(e.g 'hg19' or 'mm10' etc depending on genome used)/
b) If 10X version 3 the folder system is:
~/data/monocle-data/cellranger-data/outs/filtered_feature_bc_matrix/genome(e.g 'hg19' or 'mm10' etc depending on genome used)/
Next we preprocess the data. Firstly we state how may principle components (PC) we want monocle to use, this can be changed by changing the value num_dimis equal to. Then we plot the data set to see if increasing the PC number would capture significantly more variance. The elbow plot produced will then inform you whether increasing PC number will capture a significant amount of more variance. When visualising the data you can use t-SNE or UMAP, UMAP is used by default as the people at monocle believe it is both faster and more suited to clustering. When doing gene expression analysis it is important to remove batch effect. You should always check for batch effect when performing dimensionality reduction, batch effect is reduced by the preprocess_cds function. When order_cells() runs it will try to load a shiny web page but will fail. That is expected. Go to your Jupyter terminal and run ngrok http PORT (where PORT is the 4 digit number at the end of the http message ran by shiny). This will produce multiple lines of code, copy the http message on the penultimate line (should start with the word 'fowarding'). Select from 'http' to before the arrow '->'. Enter this web address into a new tab and the then select the node on the graph you wish to use as a basis. You can add umap.fast_sgd=TRUE to the reduce dimensions command to speed it up but will produce slightly different output each time it runs.
```{r}
# Load RDS data
expression_matrix <- readRDS("../data/monocle-data/cao_l2_expression.rds")
cell_metadata <- readRDS("../data/monocle-data/cao_l2_colData.rds")
gene_annotation <- readRDS("../data/monocle-data/cao_l2_rowData.rds")
# Make the cell dataset object from the RDS data
cds <- new_cell_data_set(expression_matrix,
cell_metadata = cell_metadata,
gene_metadata = gene_annotation)
```
```{r}
# Reading in cellranger data
cdr <- load_cellranger_data("/home/jovyan/data/monocle-data/cellranger-data")
```
```{r}
# Principle component analysis
cds = preprocess_cds(cds, num_dim = 100)
```
```{r}
# Checking if 100 principle components captures enough variance
plot_pc_variance_explained(cds)
```
```{r}
# Reducing dimensionality for visualisation of the cells
# can add umap.fast_sgd=TRUE to the reduce dimensions command to speed it up
# but will produce slightly different output each time its run
cds = reduce_dimension(cds)
```
```{r}
# Clustering cells using community detection
cds = cluster_cells(cds, resolution=c(10^seq(-6,-1)))
```
```{r}
# Plotting the data colouring by partition
plot_cells(cds, color_cells_by="partition", group_cells_by="partition")
```
The data frame marker_test_res contains metrics which tell us how differently expressed each gene is in each partition. You can rank the cells by one or more specifity metrics and take the top genes for each cluster, the way in which is do this is pseudo_R2. The default number of marker genes in this alogrithm is 3 but if you want to change this edit the 3 in top_n(3, pseudo_R2) to however many marker genes you want. Then we can plot the expression and fraction of cells that express each marker in each group usng plot_genes_by_group().
```{r}
pheatmap::pheatmap(log(table(clusters(cds), colData(cds)$cao_cell_type)+1),
clustering_method="ward.D2",
fontsize=6)
```
```{r results = FALSE, echo=FALSE, message=FALSE, warning=FALSE}
marker_test_res = top_markers(cds, group_cells_by="partition", reference_cells=1000, cores=8)
```
```{r}
# Find marker genes expressed by each cluster
top_specific_markers = marker_test_res %>%
filter(fraction_expressing >= 0.10) %>%
group_by(cell_group) %>%
top_n(3, pseudo_R2)
top_specific_marker_ids = unique(top_specific_markers %>% pull(gene_id))
```
```{r message=FALSE, warning=FALSE}
# Plotting marker genes
plot_genes_by_group(cds,
top_specific_marker_ids,
group_cells_by="partition",
ordering_type="cluster_row_col",
max.size=3)
```
Monocle3 has developed some software which automates the the annotating of cells called Garnett. Garnett classifies cells based on marker genes. Firstly we must fnd the top marker that each annotated cell type expresses using top_markers(). Next we filter these markers based on the conditions required (in this example it is JS specificity > 0.5 and be significant in logistic test marker_test_q_value < 0.01). We then generate a marker file called ./marker_file.txt which contains each cell type and top 5 expressed markers.
```{r results = FALSE, echo=FALSE, message=FALSE, warning=FALSE}
# Annotate your cells according to type
assigned_type_marker_test_res = top_markers(cds,
group_cells_by="cluster",
reference_cells=1000,
cores=8)
```
```{r}
# Require that markers have at least JS specificty score > 0.5 and be significant
# in the logistic test for identifying their cell type:
garnett_markers = assigned_type_marker_test_res %>%
filter(marker_test_q_value < 0.01 & specificity >= 0.5) %>%
group_by(cell_group) %>%
top_n(5, marker_score)
# Exclude genes that are good markers for more than one cell type:
garnett_markers = garnett_markers %>% group_by(gene_short_name) %>%
filter(n() == 1)
generate_garnett_marker_file(garnett_markers, file="./marker_file.txt")
```
```{r message=FALSE, warning=FALSE}
plot_cells(cds,
group_cells_by="cluster",
color_cells_by="cluster")
```
#Constructing Single Cell Trajectories
Loading a different data set in for example purposes. If you are usin same data set as earlier the first 2 steps can be missed (assuming data has already been downloaded and preprocessed)
```{r}
# Loading more RDS data
expression_matrix2 = readRDS("../data/monocle-data/packer_embryo_expression.rds")
cell_metadata2 = readRDS("../data/monocle-data/packer_embryo_colData.rds")
gene_annotation2 = readRDS("../data/monocle-data/packer_embryo_rowData.rds")
```
```{r}
cdt <- new_cell_data_set(expression_matrix2,
cell_metadata = cell_metadata2,
gene_metadata = gene_annotation2)
```
```{r message=FALSE, warning=FALSE}
cdt <- preprocess_cds(cdt, num_dim = 100, residual_model_formula_str = "~ bg.300.loading + bg.400.loading + bg.500.1.loading + bg.500.2.loading + bg.r17.loading + bg.b01.loading + bg.b02.loading")
cdt <- reduce_dimension(cdt)
```
```{r}
cdt <- cluster_cells(cdt, reduction_method = 'UMAP')
```
A principal graph is fitted within each partition in the next component.
```{r results = FALSE, echo=FALSE, message=FALSE, warning=FALSE}
cdt <- learn_graph(cdt)
```
The next part is creating a list of genes to try and identify in the trajectory of the sample. If you have different genes you want to check then switch out the names in the ciliated genes list and add or remove genes as required.
```{r}
ciliated_genes = c("che-1",
"hlh-17",
"nhr-6",
"dmd-6",
"ceh-36",
"ham-1")
```
The following is plotting the trajectory multiple times, one for each gene in ciliated genes. And identifying where on the trajectory of cells this gene is expressed.
```{r}
# Visualise how individual genes vary along the trajectory
plot_cells(cdt,
genes=ciliated_genes,
label_cell_groups=FALSE,
show_trajectory_graph=FALSE)
```
Now you have to select which cells you want as you starting point. A pop up will appear with options "trust browser" or cancel. Click "try again". Then the graph should appear and you can select your starting cells.
```{r}
# a helper function to identify the root principal points:
get_earliest_principal_node <- function(cds, time_bin="130-170"){
cell_ids <- which(colData(cds)[, "embryo.time.bin"] == time_bin)
closest_vertex <-
cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])
root_pr_nodes <-
igraph::V(principal_graph(cds)[["UMAP"]])$name[as.numeric(names
(which.max(table(closest_vertex[cell_ids,]))))]
root_pr_nodes
}
```
```{r}
cdt = order_cells(cdt, root_pr_nodes=get_earliest_principal_node(cdt))
```
```{r}
# Visualising pseudotime on the graph when manually selecting the root principal points
plot_cells(cdt,
color_cells_by = "pseudotime",
label_cell_groups=FALSE,
label_leaves=FALSE,
label_branch_points=FALSE,
graph_label_size=1.5)
```
```{r}
# A helper function to identify the root principal points:
get_earliest_principal_node <- function(cdt, time_bin="130-170"){
cell_ids <- which(colData(cdt)[, "embryo.time.bin"] == time_bin)
closest_vertex <-
cdt@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
closest_vertex <- as.matrix(closest_vertex[colnames(cdt), ])
root_pr_nodes <-
igraph::V(principal_graph(cdt)[["UMAP"]])$name[as.numeric(names
(which.max(table(closest_vertex[cell_ids,]))))]
root_pr_nodes
}
cdt = order_cells(cdt, root_pr_nodes=get_earliest_principal_node(cdt))
```
```{r}
# Visualising pseudotime on the graph when the helper function decideds the root principal points
plot_cells(cdt,
color_cells_by = "pseudotime",
label_cell_groups=FALSE,
label_leaves=FALSE,
label_branch_points=FALSE,
graph_label_size=1.5)
```
```{r results = FALSE, echo=FALSE, message=FALSE, warning=FALSE}
# Processing data for a 3D plot
cds_3d = reduce_dimension(cdt, max_components = 3)
cds_3d = cluster_cells(cds_3d)
cds_3d = learn_graph(cds_3d)
cds_3d = order_cells(cds_3d, root_pr_nodes=get_earliest_principal_node(cdt))
```
```{r message=FALSE, warning=FALSE}
# 3D visualisation of data
plot_cells_3d(cds_3d, color_cells_by="pseudotime")
```
#Differential expression analysis
```{r}
ciliated_genes = c("che-1",
"hlh-17",
"nhr-6",
"dmd-6",
"ceh-36",
"ham-1")
cds_subset = cdt[rowData(cdt)$gene_short_name %in% ciliated_genes,]
gene_fits = fit_models(cds_subset, model_formula_str = "~embryo.time")
fit_coefs = coefficient_table(gene_fits)
emb_time_terms = fit_coefs %>% filter(term == "embryo.time")
emb_time_terms %>% filter (q_value < 0.05) %>%
select(gene_short_name, term, q_value, estimate)
```
```{r message=FALSE, warning=FALSE}
plot_genes_violin(cds_subset, group_cells_by="embryo.time.bin", ncol=2) +
theme(axis.text.x=element_text(angle=45, hjust=1))
```
```{r}
gene_fits = fit_models(cds_subset, model_formula_str = "~embryo.time + batch")
fit_coefs = coefficient_table(gene_fits)
fit_coefs %>% filter(term != "(Intercept)") %>%
select(gene_short_name, term, q_value, estimate)
```
```{r message=FALSE, warning=FALSE}
evaluate_fits(gene_fits)
```
```{r}
time_batch_models = fit_models(cds_subset,
model_formula_str = "~embryo.time + batch",
expression_family="negbinomial")
time_models = fit_models(cds_subset,
model_formula_str = "~embryo.time",
expression_family="negbinomial")
compare_models(time_batch_models, time_models) %>% select(gene_short_name, q_value)
```
```{r}
# reload the data
expression_matrix3 <- readRDS("../data/monocle-data/cao_l2_expression.rds")
cell_metadata3 <- readRDS("../data/monocle-data/cao_l2_colData.rds")
gene_annotation3 <- readRDS("../data/monocle-data/cao_l2_rowData.rds")
```
```{r}
# Make the CDS object
cdn <- new_cell_data_set(expression_matrix3,
cell_metadata = cell_metadata3,
gene_metadata = gene_annotation3)
```
```{r results = FALSE, echo=FALSE, message=FALSE, warning=FALSE}
cdn <- preprocess_cds(cdn)
cdn <- reduce_dimension(cdn, reduction_method = 'UMAP')
cdn <- cluster_cells(cdn)
cdn<- learn_graph(cdn)
```
```{r}
neurons_cds <- cdn[,grepl("neurons", colData(cdn)$assigned_cell_type, ignore.case=TRUE)]
plot_cells(neurons_cds)
```
```{r message=FALSE, warning=FALSE}
plot_cells(cdt,
color_cells_by = "cell.type",
label_groups_by_cluster=FALSE,
label_leaves=FALSE,
label_branch_points=FALSE)
```
```{r}
plot_cells(cdt, genes=c("hlh-4", "gcy-8", "dac-1", "oig-8"),
show_trajectory_graph=FALSE,
label_cell_groups=FALSE,
label_leaves=FALSE)
```
```{r}
plot_cells(cdt, show_trajectory_graph=FALSE)
```