# Batch effect correction {#batch-effects}
In Section \@ref(cell-quality) we observed staining/expression differences
between the individual samples. These can arise for technical (e.g.,
differences in sample processing) as well as biological (e.g., differential
expression between patients/indications) reasons. Regardless of their origin, these differences
hinder cell phenotyping via clustering, as highlighted in Section \@ref(clustering).
To integrate cells across samples, we can use computational
strategies developed for correcting batch effects in single-cell RNA sequencing
data. In the following sections, we will use functions of the
[batchelor](https://www.bioconductor.org/packages/release/bioc/html/batchelor.html),
[harmony](https://github.com/immunogenomics/harmony) and
[Seurat](https://satijalab.org/seurat/articles/integration_introduction.html)
packages to correct for such batch effects.
Of note, the correction approaches presented here aim to remove any
differences between samples. This includes biological differences
between the patients/indications. Nevertheless, integrating cells across samples
can facilitate the detection of cell phenotypes via clustering.
First, we will read in the `SpatialExperiment` object containing the single-cell
data.
```{r read-data-batch-correction, message=FALSE}
spe <- readRDS("data/spe.rds")
```
## fastMNN correction
The `batchelor` package provides the `mnnCorrect` and `fastMNN` functions to
correct for differences between samples/batches. Both functions build on
finding mutual nearest neighbors (MNN) among the cells of different samples and
correct expression differences between the cells [@Haghverdi2018]. The `mnnCorrect` function
returns corrected expression counts while the `fastMNN` function performs the
correction in reduced dimension space. As such, `fastMNN` returns integrated
cells in the form of a low-dimensional embedding.
Paper: [Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors](https://www.nature.com/articles/nbt.4091)
Documentation: [batchelor](https://www.bioconductor.org/packages/release/bioc/vignettes/batchelor/inst/doc/correction.html)
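
To make the idea of mutual nearest neighbors concrete, the following base R sketch finds MNN pairs between two simulated batches. It is an illustration under simplifying assumptions and not the `batchelor` implementation: the toy data, the helper `find_mnn_pairs` and the choice `k = 3` are hypothetical.

```r
# Toy data: two batches of cells in a 2-D marker space (hypothetical values)
set.seed(1)
batch_a <- matrix(rnorm(40), ncol = 2)              # 20 cells
batch_b <- matrix(rnorm(30, mean = 0.5), ncol = 2)  # 15 cells, slightly shifted

# Hypothetical helper: return all mutual nearest neighbor pairs for a given k
find_mnn_pairs <- function(a, b, k = 3) {
  # Squared Euclidean distance between every cell in a and every cell in b
  d <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * tcrossprod(a, b)
  # TRUE if cell j of b is among the k nearest neighbors of cell i of a
  a_to_b <- t(apply(d, 1, function(x) rank(x, ties.method = "first") <= k))
  # TRUE if cell i of a is among the k nearest neighbors of cell j of b
  b_to_a <- apply(d, 2, function(x) rank(x, ties.method = "first") <= k)
  # A pair is mutual if it is a nearest neighbor pair in both directions
  pairs <- which(a_to_b & b_to_a, arr.ind = TRUE)
  colnames(pairs) <- c("cell_a", "cell_b")
  pairs
}

pairs <- find_mnn_pairs(batch_a, batch_b, k = 3)

# Average difference between paired cells: a crude estimate of the batch shift
shift <- colMeans(batch_b[pairs[, "cell_b"], , drop = FALSE] -
                  batch_a[pairs[, "cell_a"], , drop = FALSE])
```

Each row of `pairs` links one cell of the first batch to one cell of the second batch. In this toy setting, the averaged difference between paired cells serves as a rough estimate of the shift between the batches.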
### Perform sample correction
Here, we apply the `fastMNN` function to integrate cells between
patients. By setting `auto.merge = TRUE`, the function estimates the best
batch merging order by maximizing the number of MNN pairs at each merging step.
This is more time-consuming than the default, which merges batches sequentially
in the order in which they appear in the
dataset. We again select the markers defined in Section \@ref(cell-processing)
for sample correction.
The function returns a `SingleCellExperiment` object which contains corrected
low-dimensional coordinates for each cell in the `reducedDim(out, "corrected")`
slot. This low-dimensional embedding can be further used for clustering and
non-linear dimensionality reduction. We transfer the corrected coordinates
to the main `SpatialExperiment` object.
```{r, echo=FALSE}
start_time <- Sys.time()
```
```{r batch-correction-fastMNN, message=FALSE, warning=FALSE}
library(batchelor)
set.seed(220228)
out <- fastMNN(spe, batch = spe$patient_id,
               auto.merge = TRUE,
               subset.row = rowData(spe)$use_channel,
               assay.type = "exprs")
# Transfer the correction results to the main spe object
reducedDim(spe, "fastMNN") <- reducedDim(out, "corrected")
```
```{r, echo=FALSE}
end_time <- Sys.time()
```
The computational time of the `fastMNN` function call is
`r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.
### Quality control of correction results
The `fastMNN` function further returns outputs that can be used to assess the
quality of the batch correction. The `metadata(out)$merge.info` entry collects
diagnostics for each individual merging step. Here, the `batch.size` and
`lost.var` entries are important. The `batch.size` entry reports the relative
magnitude of the batch effect and the `lost.var` entry represents the percentage
of lost variance per merging step, reported separately for each batch. A large `batch.size` and low `lost.var`
indicate sufficient batch correction. Below, we summarize `lost.var` by its
maximum across batches for each merging step.
```{r batch-correction-fastMNN-QC, message=FALSE}
merge_info <- metadata(out)$merge.info
DataFrame(left = merge_info$left,
          right = merge_info$right,
          batch.size = merge_info$batch.size,
          max_lost_var = rowMax(merge_info$lost.var))
```
We observe that Patient4 and Patient2 are most similar with a low batch effect.
Merging cells of Patient3 into the combined batch of Patient1,
Patient2 and Patient4 resulted in the highest percentage of lost variance and
the detection of the largest batch effect. In the next section, we
visualize the correction results.
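
The per-step summary of lost variance can be reproduced on a toy matrix in base R. This is a minimal sketch: the matrix values are invented, we assume that values are stored as fractions, and the 10% cut-off is a rule of thumb that should be treated as adjustable.

```r
# Hypothetical lost.var matrix: rows are merge steps, columns are batches.
# The values are invented for illustration and assumed to be fractions.
lost_var <- matrix(c(0.002, 0.004, 0.000, 0.000,
                     0.001, 0.003, 0.010, 0.000,
                     0.020, 0.015, 0.030, 0.120),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(paste0("step", 1:3),
                                   paste0("Patient", 1:4)))

# Largest loss across batches for each merge step
max_lost_var <- apply(lost_var, 1, max)

# Merge steps exceeding the assumed 10% rule of thumb
flagged <- names(max_lost_var)[max_lost_var > 0.1]
```

Merge steps listed in `flagged` are candidates for closer inspection, since a large loss of variance can indicate that biological heterogeneity is removed together with the batch effect.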
### Visualization
The simplest way to check whether the sample effects were corrected is to use
non-linear dimensionality reduction techniques and observe the mixing of cells across
samples. We will recompute the UMAP embedding using the corrected
low-dimensional coordinates for each cell.
```{r dimred-batch-correction-fastMNN, message=FALSE, warning=FALSE}
library(scater)
set.seed(220228)
spe <- runUMAP(spe, dimred = "fastMNN", name = "UMAP_mnnCorrected")
```
Next, we visualize the corrected UMAP while overlaying patient IDs.
```{r visualizing-batch-correction-fastMNN-1, message=FALSE, warning=FALSE, fig.height=3}
library(cowplot)
library(dittoSeq)
library(viridis)
# visualize patient id
p1 <- dittoDimPlot(spe, var = "patient_id",
                   reduction.use = "UMAP", size = 0.2) +
    scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
    ggtitle("Patient ID on UMAP before correction")
p2 <- dittoDimPlot(spe, var = "patient_id",
                   reduction.use = "UMAP_mnnCorrected", size = 0.2) +
    scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
    ggtitle("Patient ID on UMAP after correction")

plot_grid(p1, p2)
```
We observe an imperfect merging of Patient3 into all other samples. This
was already seen when displaying the merging information above.
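
Visual inspection can be complemented by a simple numeric summary of mixing. The sketch below is not part of the workflow presented here: the helper `neighbor_mixing`, the toy embeddings and the choice `k = 10` are hypothetical. For each cell, it computes the fraction of its nearest neighbors that originate from a different batch.

```r
# Hypothetical helper: for each cell, the fraction of its k nearest neighbors
# (in the given embedding) that come from a different batch
neighbor_mixing <- function(embedding, batch, k = 10) {
  d <- as.matrix(dist(embedding))
  diag(d) <- Inf  # a cell is not its own neighbor
  vapply(seq_len(nrow(d)), function(i) {
    nn <- order(d[i, ])[seq_len(k)]
    mean(batch[nn] != batch[i])
  }, numeric(1))
}

set.seed(1)
batch <- rep(c("A", "B"), each = 50)

# Toy "uncorrected" embedding: batch B is shifted far away from batch A
separated <- rbind(matrix(rnorm(100), ncol = 2),
                   matrix(rnorm(100, mean = 10), ncol = 2))

# Toy "corrected" embedding: both batches are drawn from the same distribution
mixed <- matrix(rnorm(200), ncol = 2)

mixing_separated <- mean(neighbor_mixing(separated, batch))
mixing_mixed <- mean(neighbor_mixing(mixed, batch))
```

A score close to zero indicates that cells are mostly surrounded by cells of their own batch, while higher scores indicate stronger mixing. Applied to real data, such a score would be computed on the low-dimensional embedding (e.g., the `fastMNN` coordinates) rather than on toy matrices.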
We now also visualize the expression of selected markers across all cells
before and after batch correction.
```{r visualizing-batch-correction-fastMNN-2, warning=FALSE, message=FALSE, fig.height=8}
markers <- c("Ecad", "CD45RO", "CD20", "CD3", "FOXP3", "CD206", "MPO", "SMA", "Ki67")
# Before correction
plot_list <- multi_dittoDimPlot(spe, var = markers, reduction.use = "UMAP",
                                assay = "exprs", size = 0.2, list.out = TRUE)
plot_list <- lapply(plot_list, function(x) x + scale_color_viridis())
plot_grid(plotlist = plot_list)

# After correction
plot_list <- multi_dittoDimPlot(spe, var = markers, reduction.use = "UMAP_mnnCorrected",
                                assay = "exprs", size = 0.2, list.out = TRUE)
plot_list <- lapply(plot_list, function(x) x + scale_color_viridis())
plot_grid(plotlist = plot_list)
```
We observe that immune cells across patients are merged after batch correction
using `fastMNN`. However, the tumor cells of different patients still cluster
separately.
## harmony correction
The `harmony` algorithm performs batch correction by iteratively clustering
and correcting the positions of cells in PCA space [@Korsunsky2019]. It
requires a matrix of transformed expression counts and internally performs
PCA before k-means clustering. We will first create the expression matrix and
then call the `HarmonyMatrix` function to perform the correction.
Paper: [Fast, sensitive and accurate integration of single-cell data with Harmony](https://www.nature.com/articles/s41592-019-0619-0)
Documentation: [harmony](https://portals.broadinstitute.org/harmony/index.html)
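
The general idea of clustering cells in PCA space and then correcting their positions can be sketched in base R. This is a deliberately simplified illustration and not the `harmony` algorithm: it uses hard k-means and a single pass of per-cluster centering, whereas `harmony` relies on soft clustering and iterates until convergence. The toy data and all variable names are hypothetical.

```r
set.seed(1)

# Toy data: two cell types measured in two batches across six markers.
# Batch B carries an additive offset on all markers.
n <- 50
cell_type <- rep(c("immune", "tumor"), each = 2 * n)
batch <- rep(rep(c("A", "B"), each = n), times = 2)
expr <- matrix(rnorm(4 * n * 6, sd = 0.5), ncol = 6) +
  ifelse(cell_type == "tumor", 6, 0)
expr[batch == "B", ] <- expr[batch == "B", ] + 1.5

# Step 1: PCA on the expression matrix (cells in rows)
pcs <- prcomp(expr, center = TRUE, scale. = FALSE)$x[, 1:2]

# Step 2: hard k-means clustering in PCA space
km <- kmeans(pcs, centers = 2, nstart = 10)

# Step 3: within each cluster, shift the cells of every batch so that the
# batch centroid coincides with the cluster centroid
corrected <- pcs
for (cl in unique(km$cluster)) {
  in_cl <- km$cluster == cl
  cl_center <- colMeans(pcs[in_cl, , drop = FALSE])
  for (b in unique(batch[in_cl])) {
    idx <- in_cl & batch == b
    b_center <- colMeans(pcs[idx, , drop = FALSE])
    corrected[idx, ] <- sweep(pcs[idx, , drop = FALSE], 2,
                              b_center - cl_center)
  }
}

# Distance between the centroids of batch A and batch B in an embedding
batch_gap <- function(emb) {
  sqrt(sum((colMeans(emb[batch == "A", ]) - colMeans(emb[batch == "B", ]))^2))
}
gap_before <- batch_gap(pcs)
gap_after <- batch_gap(corrected)
```

In this toy example the clusters correspond to the two cell types, so centering the batches within each cluster removes the offset between batches while keeping the cell types apart.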
Similar to the `fastMNN` function, `harmony` returns the corrected
low-dimensional coordinates for each cell. These can be saved in the
`reducedDim` slot of the original `SpatialExperiment` object.
```{r, echo=FALSE}
start_time <- Sys.time()
```
```{r batch-correction-harmony, message=FALSE, warning=FALSE}
library(harmony)
mat <- t(assay(spe, "exprs")[rowData(spe)$use_channel,])
harmony_emb <- HarmonyMatrix(mat, spe$patient_id, do_pca = TRUE)
reducedDim(spe, "harmony") <- harmony_emb
```
```{r, echo=FALSE}
end_time <- Sys.time()
```
The computational time of the `HarmonyMatrix` function call is
`r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.
### Visualization
We will now again visualize the cells in low dimensions after UMAP embedding.
```{r dimred-batch-correction-harmony, message=FALSE}
set.seed(220228)
spe <- runUMAP(spe, dimred = "harmony", name = "UMAP_harmony")
```
```{r visualizing-batch-correction-harmony-1, message=FALSE, warning=FALSE, fig.height=3}
# visualize patient id
p1 <- dittoDimPlot(spe, var = "patient_id",
                   reduction.use = "UMAP", size = 0.2) +
    scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
    ggtitle("Patient ID on UMAP before correction")
p2 <- dittoDimPlot(spe, var = "patient_id",
                   reduction.use = "UMAP_harmony", size = 0.2) +
    scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
    ggtitle("Patient ID on UMAP after correction")

plot_grid(p1, p2)
```
We also visualize the expression of the markers selected above.
```{r visualizing-batch-correction-harmony-2, warning=FALSE, message=FALSE, fig.height=8}
# Before correction
plot_list <- multi_dittoDimPlot(spe, var = markers, reduction.use = "UMAP",
                                assay = "exprs", size = 0.2, list.out = TRUE)
plot_list <- lapply(plot_list, function(x) x + scale_color_viridis())
plot_grid(plotlist = plot_list)

# After correction
plot_list <- multi_dittoDimPlot(spe, var = markers, reduction.use = "UMAP_harmony",
                                assay = "exprs", size = 0.2, list.out = TRUE)
plot_list <- lapply(plot_list, function(x) x + scale_color_viridis())
plot_grid(plotlist = plot_list)
```
We observe a more aggressive merging of cells from different patients compared
to the results after `fastMNN` correction. Important immune cell and epithelial
markers are expressed in distinct regions of the UMAP.
## Seurat correction
The `Seurat` package provides a number of functionalities to analyze single-cell
data. As such it also allows the integration of cells across different samples.
Conceptually, `Seurat` performs batch correction similarly to `fastMNN` by
finding mutual nearest neighbors (MNN) in low dimensional space before
correcting the expression values of cells [@Stuart2019].
Paper: [Comprehensive Integration of Single-Cell Data](https://www.cell.com/cell/fulltext/S0092-8674(19)30559-8)
Documentation: [Seurat](https://satijalab.org/seurat/index.html)
To use `Seurat`, we will first create a `Seurat` object from the `SpatialExperiment`
object and add relevant metadata. The object also needs to be split by patient
prior to integration.
```{r prepare-seurat, message=FALSE, warning=FALSE}
library(Seurat)
library(SeuratObject)
seurat_obj <- as.Seurat(spe, counts = "counts", data = "exprs")
seurat_obj <- AddMetaData(seurat_obj, as.data.frame(colData(spe)))
seurat.list <- SplitObject(seurat_obj, split.by = "patient_id")
```
To avoid long run times, we will use an approach that relies on reciprocal PCA
instead of canonical correlation analysis for dimensionality reduction and
initial alignment. For an extended tutorial on how to use `Seurat` for data
integration, please refer to their
[vignette](https://satijalab.org/seurat/articles/integration_rpca.html).
We will first define the features used for integration and perform PCA on cells
of each patient individually. The `FindIntegrationAnchors` function detects MNNs between
cells of different patients and the `IntegrateData` function corrects the
expression values of cells. We slightly increase the number of neighbors to be
considered for MNN detection (the `k.anchor` parameter). This increases the integration
strength.
```{r, echo=FALSE}
start_time <- Sys.time()
```
```{r seurat-correction, message=FALSE, warning=FALSE}
features <- rownames(spe)[rowData(spe)$use_channel]
seurat.list <- lapply(X = seurat.list, FUN = function(x) {
    x <- ScaleData(x, features = features, verbose = FALSE)
    x <- RunPCA(x, features = features, verbose = FALSE)
    return(x)
})

anchors <- FindIntegrationAnchors(object.list = seurat.list,
                                  anchor.features = features,
                                  reduction = "rpca",
                                  k.anchor = 20)

combined <- IntegrateData(anchorset = anchors)
```
```{r, echo=FALSE}
end_time <- Sys.time()
```
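
Why a larger neighborhood strengthens integration can be illustrated outside of `Seurat` with a base R toy example. The helper `count_mnn_pairs` and the simulated batches are hypothetical, and the anchor detection in `Seurat` involves further steps that are not reproduced here. The sketch only shows that considering more neighbors can never reduce the number of mutual nearest neighbor pairs that link two batches.

```r
set.seed(1)

# Two toy batches with 100 cells each in a 2-D space (hypothetical values)
a <- matrix(rnorm(200), ncol = 2)
b <- matrix(rnorm(200, mean = 0.5), ncol = 2)

# Hypothetical helper: number of mutual nearest neighbor pairs for a given k
count_mnn_pairs <- function(a, b, k) {
  # Distances between cells of a (rows) and cells of b (columns)
  d <- as.matrix(dist(rbind(a, b)))[seq_len(nrow(a)),
                                    nrow(a) + seq_len(nrow(b))]
  a_to_b <- t(apply(d, 1, function(x) rank(x, ties.method = "first") <= k))
  b_to_a <- apply(d, 2, function(x) rank(x, ties.method = "first") <= k)
  sum(a_to_b & b_to_a)
}

# Compare a small neighborhood with the larger one used above
n_pairs <- vapply(c(k5 = 5, k20 = 20),
                  function(k) count_mnn_pairs(a, b, k),
                  integer(1))
n_pairs
```

Because the nearest neighbor sets are nested, every pair found with `k = 5` is also found with `k = 20`. More pairs mean that more cells contribute to the correction, which is one way to read the stronger integration obtained with a larger `k.anchor`.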
We now select the `integrated` assay and perform PCA dimensionality reduction.
The cell coordinates in PCA reduced space can then be transferred to the
original `SpatialExperiment` object.
```{r message=FALSE, warning=FALSE}
DefaultAssay(combined) <- "integrated"
combined <- ScaleData(combined, verbose = FALSE)
combined <- RunPCA(combined, npcs = 30, verbose = FALSE)
reducedDim(spe, "seurat") <- combined@reductions$pca@cell.embeddings
```
The computational time of the `Seurat` function calls is
`r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.
### Visualization
As above, we recompute the UMAP embeddings based on `Seurat` integrated results
and visualize the embedding.
```{r umap-seurat, message=FALSE, warning=FALSE}
set.seed(220228)
spe <- runUMAP(spe, dimred = "seurat", name = "UMAP_seurat")
```
Next, we visualize the patient IDs.
```{r visualizing-batch-correction-seurat-1, message=FALSE, warning=FALSE, fig.height=3}
# visualize patient id
p1 <- dittoDimPlot(spe, var = "patient_id",
                   reduction.use = "UMAP", size = 0.2) +
    scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
    ggtitle("Patient ID on UMAP before correction")
p2 <- dittoDimPlot(spe, var = "patient_id",
                   reduction.use = "UMAP_seurat", size = 0.2) +
    scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
    ggtitle("Patient ID on UMAP after correction")

plot_grid(p1, p2)
```
We also visualize the expression of the selected markers.
```{r visualizing-batch-correction-seurat-2, warning=FALSE, message=FALSE, fig.height=8}
# Before correction
plot_list <- multi_dittoDimPlot(spe, var = markers, reduction.use = "UMAP",
                                assay = "exprs", size = 0.2, list.out = TRUE)
plot_list <- lapply(plot_list, function(x) x + scale_color_viridis())
plot_grid(plotlist = plot_list)

# After correction
plot_list <- multi_dittoDimPlot(spe, var = markers, reduction.use = "UMAP_seurat",
                                assay = "exprs", size = 0.2, list.out = TRUE)
plot_list <- lapply(plot_list, function(x) x + scale_color_viridis())
plot_grid(plotlist = plot_list)
```
Similar to the methods presented above, `Seurat` integrates immune cells correctly.
When visualizing the patient IDs, slight patient-to-patient differences within tumor
cells can be detected.
Choosing the correct integration approach is challenging without having ground truth
cell labels available. It is recommended to compare different techniques and different
parameter settings. Please refer to the documentation of the individual tools
to become familiar with the possible parameter choices. Furthermore, in the following
section, we will discuss clustering and classification approaches in light of
expression differences between samples.
In general, it appears that MNN-based approaches are more conservative in terms
of merging compared to `harmony`. On the other hand, `harmony` could well merge
cells in a way that regresses out biological signals.
## Save objects
The modified `SpatialExperiment` object is saved for further downstream analysis.
```{r save-objects-batch-correction}
saveRDS(spe, "data/spe.rds")
```
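
Since the object is written back to the file it was read from, it can be worth confirming that a saved object is restored unchanged. The following base R sketch shows such a round trip on a small stand-in list; the object `toy_result` and the temporary file are hypothetical and unrelated to `data/spe.rds`.

```r
# Hypothetical stand-in for an analysis object
toy_result <- list(embedding = matrix(1:6, ncol = 2),
                   patient_id = c("Patient1", "Patient2", "Patient3"))

# Write to a temporary file and read it back
tmp_file <- tempfile(fileext = ".rds")
saveRDS(toy_result, tmp_file)
restored <- readRDS(tmp_file)

# TRUE if the restored object matches the original exactly
round_trip_ok <- identical(restored, toy_result)

# Remove the temporary file
unlink(tmp_file)
```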
## Session Info
<details>
<summary>SessionInfo</summary>
```{r, echo = FALSE}
sessionInfo()
```
</details>