-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathIntro_to_COSICC_pdf.Rmd
406 lines (315 loc) · 19.2 KB
/
Intro_to_COSICC_pdf.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
---
title: "Intro to COSICC"
author: "Magdalena Strauss"
date: "`r Sys.Date()`"
bibliography: vignettes/references.bib
output: pdf_document
---
Welcome to COSICC! COSICC R an R package for **CO**mparative analysis of **S**ingle-**C**ell-RNA-seq data for **C**omplex cell populations. This vignette gives an introduction and overview to COSICC.
# Overview and use cases
For exact reproducibility, we set a fixed seed.
```{r}
set.seed(112244)
suppressPackageStartupMessages(library(COSICC))
suppressPackageStartupMessages(library(SingleCellExperiment))
```
# COSICC_DA_group
COSICC_DA_group tests for differential abundance of groups of cells (.e.g. cell types) after perturbations, compared to a control experiment without perturbations. This type of set-up occurs for many experiments. One example is patient data. Assume we have an scRNA-seq data set of a cancer patient at the day of diagnosis, and another one from after treatment. Both samples contain normal and malignant cells, and different cell types.
## Simulating data
### Simulating the control experiment
The control experiment is an experiment without the perturbation studied. For instance, it could be a sample from a cancer patient at the day of diagnosis if the perturbation we want to look at is the treatment following the diagnosis. Or it could be a control chimera experiment as in @strauss, where unperturbed stem cells were injected into a mouse embryo at an early developmental stage, which served as a control experiment for injected cells with a gene knockout.
We simulate data using the Splatter Bioconductor package (@splatter). We simulate 10 cell types with an average of 500 cells each, and visualise the data on a UMAP.
```{r,fig.width=7}
sce_sim_cell_type_control <- splatSimulate(
nGenes = 10,
batchCells = 5000,
group.prob = rep(0.1,10),
method = "groups",
verbose = FALSE
)
sce_sim_cell_type_control <- logNormCounts(sce_sim_cell_type_control)
sce_sim_cell_type_control <- runPCA(sce_sim_cell_type_control,ncomponents=3)
sce_sim_cell_type_control <- runUMAP(sce_sim_cell_type_control)
names(colData(sce_sim_cell_type_control))[3] <- "cell_type"
plotReducedDim(sce_sim_cell_type_control, dimred="UMAP",colour_by = "cell_type") +
theme_classic()
```
For each of the cell types, we label 50% of the cells as normal and 50% as marked. In practice, this may be a mock perturbation (e.g. injected cells for chimeras marked with fluorescent markers), malignant cells in a cancer at time of diganosis etc.
```{r}
sce_sim_cell_type_control$marked <- FALSE
for (j in 1:10){
xx <- which(sce_sim_cell_type_control$cell_type == paste0("Group",j))
sce_sim_cell_type_control$marked[sample(xx,floor(length(xx) * 0.5+
(runif(1)<0.5)))] <- TRUE
}
```
We plot the distribution of cell types.
```{r,fig.width=7}
table_cell_types <-
as.data.frame(colData(sce_sim_cell_type_control)[,c("cell_type","marked")]) %>%
dplyr::group_by_all() %>% dplyr::count()
ggplot(table_cell_types,aes(x=cell_type,y=n,fill=marked)) +
geom_bar(stat="identity",position="dodge")+theme_classic()+
xlab("cell type") + ylab("number of cells") +
ggtitle("Control experiment: Distribution of cells across cell types")+
scale_fill_manual(values=c("FALSE"="darkblue","TRUE"="red"))
```
### Simulating the experiment with perturbation
For the experiment with perturbation, we first perform the same simulation as above. Note that while the parameters are identical, this is a separate independent simulation.
```{r,fig.width=7}
sce_sim_cell_type_case <- splatSimulate(
nGenes = 10,
batchCells = 5000,
group.prob = rep(0.1,10),
method = "groups",
verbose = FALSE
)
sce_sim_cell_type_case <- logNormCounts(sce_sim_cell_type_case)
sce_sim_cell_type_case <- runPCA(sce_sim_cell_type_case,ncomponents=3)
sce_sim_cell_type_case <- runUMAP(sce_sim_cell_type_case)
names(colData(sce_sim_cell_type_case))[3] <- "cell_type"
plotReducedDim(sce_sim_cell_type_case, dimred="UMAP",colour_by = "cell_type") +
theme_classic()
sce_sim_cell_type_case$marked <- FALSE
for (j in 1:10){
xx <- which(sce_sim_cell_type_case$cell_type == paste0("Group",j))
sce_sim_cell_type_case$marked[sample(xx,floor(length(xx) * 0.5 +
(runif(1) <= 0.5)))] <- TRUE
}
```
Now we assume that Group1 and Group2 and Group3 have been affected by a perturbation and therefore the marked cells have been depleted to 50%, 20% and 5% of their original number of cells, respectively.
```{r}
marked_cells_group1 <- which(sce_sim_cell_type_case$marked &
sce_sim_cell_type_case$cell_type=="Group1")
marked_cells_group2 <- which(sce_sim_cell_type_case$marked &
sce_sim_cell_type_case$cell_type=="Group2")
marked_cells_group3 <- which(sce_sim_cell_type_case$marked &
sce_sim_cell_type_case$cell_type=="Group3")
depleted_cells <- c(sample(marked_cells_group1,floor(0.5*length(marked_cells_group1) +
(runif(1)<0.5))),
sample(marked_cells_group2,floor(0.8*length(marked_cells_group2) +
(runif(1)<0.5))),
sample(marked_cells_group3,floor(0.95*length(marked_cells_group3) +
(runif(1)<0.5))))
sce_sim_cell_type_case <- sce_sim_cell_type_case[,-depleted_cells]
```
We now plot the distribution of cells across cell types.
```{r,fig.width=7}
table_cell_types <-
as.data.frame(colData(sce_sim_cell_type_case)[,c("cell_type","marked")]) %>%
dplyr::group_by_all() %>% dplyr::count()
ggplot(table_cell_types,aes(x=cell_type,y=n,fill=marked)) +
geom_bar(stat="identity",position="dodge")+theme_classic()+
xlab("cell type") + ylab("number of cells") +
ggtitle("Case experiment: Distribution of cells across cell types")+
scale_fill_manual(values=c("FALSE"="darkblue","TRUE"="red"))
```
For experiments with visually marked cells, e.g. using fluorescent markers as in @pijuan-sala, @guibentif, or @strauss, the following sampling bias occurs: experimentalists aim for approximately 50% (or a different fixed proportion) of cells with the fluorescent markers, which means that the non-depleted cell types will appear enriched in marked cells.
We simulate the experimental bias described above by sampling an equal number of marked and unmarked cells.
```{r}
marked_cells <- which(sce_sim_cell_type_case$marked)
unmarked_cells <- which(!(sce_sim_cell_type_case$marked))
sce_sim_cell_type_case_sampled <-
sce_sim_cell_type_case[,c(sample(marked_cells,length(marked_cells)),
sample(unmarked_cells,length(marked_cells)))]
```
We plot the resulting distribution of cells across cell types. The unaffected cell types now appear to be enriched for marked cells. However, this is only because of the experimental bias.
```{r,fig.width=7}
table_cell_types <-
as.data.frame(colData(sce_sim_cell_type_case_sampled)[,c("cell_type","marked")]) %>%
dplyr::group_by_all() %>% dplyr::count()
ggplot(table_cell_types,aes(x=cell_type,y=n,fill=marked)) +
geom_bar(stat="identity",position="dodge")+theme_classic()+
xlab("cell type") + ylab("number of cells") +
ggtitle("Case experiment: Distribution of cells across cell types after experiment")+
scale_fill_manual(values=c("FALSE"="darkblue","TRUE"="red"))
```
Below we show that, by running COSICC_DA_group, we are able to identify the different levels of depletion for the three cell type correctly, as COSICC_DA_group is able to correct for the spurious enrichment for marked cells resulting from the experimental bias explained above.
```{r,fig.width=7}
COSICC_DA_result <-
COSICC_DA_group(sce_case=sce_sim_cell_type_case_sampled,
sce_control=sce_sim_cell_type_control)
```
## Note about method and applicability
COSICC_DA_group assumes that the true median depletion or enrichment of marked cells across all cell types is 0. In our example, at most 4 groups could be depleted for the method to identify the true enrichment and depletion.
# COSICC_kinetics
COSICC_kinetics identifies whether, as a result of a perturbation, there is significant delay or acceleration along a lineage trajectory.
# Simulating data
Like for COSICC_DA_group above, we first simulate control data using the Splatter package.
## Simulating the control experiment
```{r}
sce_kinetics_control <-
splatSimulatePaths(batchCells=500,verbose=FALSE,nGenes=10,de.prob=0.5)
```
Splatter simulates developmental progression as "Step", we rename as pt (pseudotime), to apply COSICC_kinetics.
```{r}
sce_kinetics_control$pt <- sce_kinetics_control$Step
sce_kinetics_control$Step <- NULL
```
We mark cells, with somewhat more cells marked from the earlier part of the lineage trajectory. This simulates an effect in the control experiment caused by experimental and not biological reasons.
```{r}
sce_kinetics_control$marked <- (1:ncol(sce_kinetics_control)) %in% sample(1:ncol(sce_kinetics_control),floor(0.5*ncol(sce_kinetics_control)))
cells_higher_pt_marked <- colnames(sce_kinetics_control[,sce_kinetics_control$marked & sce_kinetics_control$pt>median(sce_kinetics_control$pt)])
cells_depleted_experimental <- sample(cells_higher_pt_marked,40)
sce_kinetics_control <- sce_kinetics_control[,!(sce_kinetics_control$Cell %in% cells_depleted_experimental)]
```
We compute normalised log-counts and PCA.
```{r}
sce_kinetics_control <- logNormCounts(sce_kinetics_control)
sce_kinetics_control <- runPCA(sce_kinetics_control,ncomponents=3)
```
We create a plot of a diffusion map for the data set to illustrate pseudotime and progression along the lineage trajectory.
```{r,fig.width=7}
dm <- DiffusionMap(reducedDims(sce_kinetics_control)$PCA)
xx <- sample(1:length(dm$DC1))
tmp <- data.frame(DC1 = eigenvectors(dm)[xx, 1],
DC2 = eigenvectors(dm)[xx, 2],
pt = sce_kinetics_control$pt[xx],
marked = sce_kinetics_control$marked[xx])
ggplot(tmp,mapping=aes(x=DC1,y=DC2,color=pt)) + geom_point() + theme_classic() +
xlab("diffusion component 1") + ylab("diffusion component 2") +
scale_colour_viridis_c()+
ggtitle("Diffusion map sce_kinetics_control")
```
Now we repeat the plot, but coloured by whether the cells are marked.
```{r,fig.width=7}
ggplot(tmp,mapping=aes(x=DC1,y=DC2,color=marked)) + geom_point() + theme_classic() +
xlab("diffusion component 1") + ylab("diffusion component 2") +
scale_color_manual(values=c("FALSE"="darkblue","TRUE"="red"))+
ggtitle("Diffusion map sce_kinetics_control")
```
We also plot the distributions of marked an unmarked cells across pseudotime. The figure below shows that there is some difference between pseudotimes for marked and unmarked cells for the control data set. This illustrates a possible experimental effect.
```{r,fig.width=7}
ggplot(colData(sce_kinetics_control),aes(fill=marked,x=pt)) + geom_density(alpha=0.3)+theme_classic() + scale_fill_manual(values=c("FALSE"="darkblue","TRUE"="red"))+
ggtitle("Distribution of cells across pseudotime for sce_kinetics_control")+
xlab("pseudotime")
```
## Simulating the experiment with perturbation
To simulate a perturbation that delays progression along a lineage trajectory, we define probabilites for marking cells along the lineage trajectory that decrease with progression along the lineage trajectory.
```{r}
probs <- sort(runif(ncol(sce_kinetics_control)),decreasing=TRUE)
cells_higher_pt_marked <- colnames(sce_kinetics_control[,sce_kinetics_control$marked & sce_kinetics_control$pt>median(sce_kinetics_control$pt)])
kinetics_depleted_biological <- sample(cells_higher_pt_marked,80)
sce_kinetics_case <- sce_kinetics_control[,!(colnames(sce_kinetics_control)%in%kinetics_depleted_biological)]
```
We plot the perturbed data set on a UMAP.
```{r,fig.width=7}
tmp <- data.frame(DC1 = eigenvectors(dm)[xx, 1],
DC2 = eigenvectors(dm)[xx, 2],
pt = sce_kinetics_case$pt[xx],
marked = sce_kinetics_control$marked[xx])
ggplot(tmp,mapping=aes(x=DC1,y=DC2,color=marked)) + geom_point() +
theme_classic() +
xlab("diffusion component 1") + ylab("diffusion component 2") +
scale_color_manual(values=c("FALSE"="darkblue","TRUE"="red"))+
ggtitle("Diffusion map sce_kinetics_case")
```
We also plot the distributions of marked an unmarked cells across pseudotime. The figure below shows that a difference between pseudotimes for marked and unmarked cells for the perturbation data set, which is larger than the effect seen for the control data. This illustrates a biological effect.
```{r,fig.width=7}
ggplot(colData(sce_kinetics_case),aes(fill=marked,x=pt)) +
geom_density(alpha=0.3)+theme_classic() +
scale_fill_manual(values=c("FALSE"="darkblue","TRUE"="red"))+
ggtitle("Distribution of cells across pseudotime for sce_kinetics_case")+
xlab("pseudotime")
```
## Identification of developmental delay with COSICC_kinetics
We now run COSICC_kinetics to test whether the delay in development along the lineage trajectory seen in the figure above is significant. Indeed, COSICC_kinetics identifies a significant delay of marked versus unmarked cells for the perturbed case data. The fact that convidence_interval_overlapping_with_control = FALSE means that the 95% confidence intervals for the case and the control data do not overlap. Therefore, while the control data also shows a developmental delay, the one for the perturbed case data set may be seen as significantly more pronounced. COSICC_kinetics thereby allows the detection of significant developmental delay, while also comparing the this delay to the control data set in a principled manner.
```{r}
COSICC_kinetics(sce_case=sce_kinetics_case,sce_control=sce_kinetics_control)
```
# Illustration of COSICC_DA_lineage
COSICC_DA_lineage tests for differential abundance of lineages after perturbations, compared to a control experiment without perturbations. The input for lineage contribution of cells is a score for each cell that reflects the probability of the cell being part of the lineage, e.g. Waddington-OT scores (@Schiebinger).
## Simulation of lineage scores
We simulate scores for 5 lineages, with one lineage (lineage_1) reduced for marked cells in the perturbed condition. First, we simulate the scores for the control condition.
```{r}
sim_lineage_scores_control <- matrix(rnorm(1500) * 0.1,nrow=300,ncol=5)
for (j in 1:5){
sim_lineage_scores_control[((j-1)*60+1):(j*60),j] <- rnorm(60) *0.1+0.85
}
sim_lineage_scores_control <- apply(sim_lineage_scores_control,1,function(x) x/colSums(sim_lineage_scores_control))
sim_lineage_scores_control <- mapply(function(x) min(x,1),sim_lineage_scores_control)
sim_lineage_scores_control <- mapply(function(x) max(x,0),sim_lineage_scores_control)
dim(sim_lineage_scores_control) <- c(300,5)
colnames(sim_lineage_scores_control) <- paste0("lineage_",1:5)
rownames(sim_lineage_scores_control) <- paste0("cell_",1:300,"_control")
sim_lineage_scores_control <- as.data.frame(sim_lineage_scores_control)
colData_DA_lineage_control <- data.frame(cell=rownames(sim_lineage_scores_control),
marked = sample(c(TRUE,FALSE),300,replace=TRUE))
sim_lineage_scores_control$id <- colData_DA_lineage_control$cell
colData_DA_lineage_control$cell <- sim_lineage_scores_control$id
sce_DA_lineage_control <- SingleCellExperiment(colData=colData_DA_lineage_control)
```
We plot the scores for lineage 1 for the control condition.
```{r,fig.width=4,fig.height=2}
df <- as.data.frame(sim_lineage_scores_control)
df$marked <- colData_DA_lineage_control$marked
ggplot(df,aes(x=lineage_1,color=marked,fill=marked)) +
geom_density(alpha=0.5) +
scale_color_manual(values=c("FALSE"="darkblue","TRUE"="red"))+
scale_fill_manual(values=c("FALSE"="darkblue","TRUE"="red"))+
theme_classic()+
labs(title="Scores for lineage 1 - control condition")
```
Now we simulate the unmarked cells for the perturbed condition.
```{r}
sim_lineage_scores_case_unmarked <- matrix(rnorm(750) * 0.1,nrow=150,ncol=5)
for (j in 1:5){
sim_lineage_scores_case_unmarked[((j-1)*30+1):(j*30),j] <- rnorm(30) *0.1+0.85
}
sim_lineage_scores_case_unmarked <- mapply(function(x) min(x,1),sim_lineage_scores_case_unmarked)
sim_lineage_scores_case_unmarked <- mapply(function(x) max(x,0),sim_lineage_scores_case_unmarked)
dim(sim_lineage_scores_case_unmarked) <- c(150,5)
colnames(sim_lineage_scores_case_unmarked) <- paste0("lineage_",1:5)
rownames(sim_lineage_scores_case_unmarked) <- paste0("cell_",1:150,"_case")
sim_lineage_scores_case_unmarked <- as.data.frame(sim_lineage_scores_case_unmarked)
colData_DA_lineage_case_unmarked <- data.frame(cell=rownames(sim_lineage_scores_case_unmarked),
marked = rep(FALSE,150))
sim_lineage_scores_case_unmarked$id <- colData_DA_lineage_case_unmarked$cell
colData_DA_lineage_case_unmarked$cell <- sim_lineage_scores_case_unmarked$id
sce_DA_lineage_case_unmarked <- SingleCellExperiment(colData=colData_DA_lineage_case_unmarked)
```
We simulate the marked cells for the perturbed condition, with fewer cells assigned to lineage 1.
```{r}
sim_lineage_scores_case_marked <- matrix(rnorm(750) * 0.1,nrow=150,ncol=5)
sim_lineage_scores_case_marked[sample(1:30,5),1] <- rnorm(5) *0.1+0.85 # a
# smaller number of cells is assigned to lineage 1 for the perturbed condition
for (j in 2:5){
sim_lineage_scores_case_marked[((j-1)*30+1):(j*30),j] <- runif(30) *0.1+0.85
}
sim_lineage_scores_case_marked <- mapply(function(x) min(x,1),sim_lineage_scores_case_marked)
sim_lineage_scores_case_marked <- mapply(function(x) max(x,0),sim_lineage_scores_case_marked)
dim(sim_lineage_scores_case_marked) <- c(150,5)
colnames(sim_lineage_scores_case_marked) <- paste0("lineage_",1:5)
rownames(sim_lineage_scores_case_marked) <- paste0("cell_",151:300,"_case")
sim_lineage_scores_case_marked <- as.data.frame(sim_lineage_scores_case_marked)
colData_DA_lineage_case_marked <- data.frame(cell=rownames(sim_lineage_scores_case_marked),
marked = rep(TRUE,150))
sim_lineage_scores_case_marked$id <- colData_DA_lineage_case_marked$cell
colData_DA_lineage_case_marked$cell <- sim_lineage_scores_case_marked$id
sce_DA_lineage_case <-
SingleCellExperiment(colData=rbind(colData_DA_lineage_case_marked,
colData_DA_lineage_case_unmarked),
)
```
We plot the scores for lineage 1 for the perturbed condition.
```{r,fig.width=4,fig.height=2}
df <- rbind(sim_lineage_scores_case_marked,sim_lineage_scores_case_unmarked)
df$marked <- sce_DA_lineage_case$marked
ggplot(df,aes(x=lineage_1,color=marked,fill=marked)) +
geom_density(alpha=0.5) +
scale_color_manual(values=c("FALSE"="darkblue","TRUE"="red"))+
scale_fill_manual(values=c("FALSE"="darkblue","TRUE"="red"))+
theme_classic()+
labs(title="Scores for lineage 1 - perturbed condition")
```
We apply COSICC_DA_lineage and identify the depletion of lineage 1, illustrated in the plot below.
```{r}
lineage_scores <- rbind(sim_lineage_scores_case_marked,
sim_lineage_scores_case_unmarked,sim_lineage_scores_control)
COSICC_DA_lineage(sce_DA_lineage_case,sce_DA_lineage_control,lineage_scores,alpha=0.1)
```
# Session Information
```{r}
sessionInfo()
```
# References