
# Batch effect correction {#batch-effects}

In Section \@ref(cell-quality) we observed staining/expression differences
between the individual samples. These differences can arise for technical (e.g.,
differences in sample processing) as well as biological (e.g., differential
expression between patients/indications) reasons. Whatever their source, the combination of these effects
hinders cell phenotyping via clustering, as highlighted in Section \@ref(clustering).

To integrate cells across samples, we can use computational
strategies developed for correcting batch effects in single-cell RNA sequencing
data. In the following sections, we will use functions of the
[batchelor](https://www.bioconductor.org/packages/release/bioc/html/batchelor.html),
[harmony](https://github.com/immunogenomics/harmony) and
[Seurat](https://satijalab.org/seurat/articles/integration_introduction.html)
packages to correct for such batch effects.

Of note: the correction approaches presented here aim to remove any
differences between samples. This includes biological differences
between the patients/indications. Nevertheless, integrating cells across samples
can facilitate the detection of cell phenotypes via clustering.
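
The caveat above can be illustrated with a deliberately simple stand-in for
batch correction: centering each sample on its own mean. This is not what the
methods in this chapter do. It is a toy sketch on simulated (hypothetical)
values, showing that a correction which removes all between-sample differences
cannot tell a technical shift from a biological one.

```r
set.seed(1)

# Simulated expression of one marker (hypothetical values) in two samples.
# Sample B is shifted by +2; the shift could be technical or biological.
expr <- c(rnorm(100, mean = 1), rnorm(100, mean = 3))
sample_id <- rep(c("A", "B"), each = 100)

# Naive correction: subtract each sample's own mean from its cells
corrected <- expr - ave(expr, sample_id)

# Difference between the sample means before and after the naive correction
diff_before <- diff(tapply(expr, sample_id, mean))
diff_after <- diff(tapply(corrected, sample_id, mean))

diff_before
diff_after
```

After centering, the difference between the two sample means is zero up to
numerical precision, regardless of where the original shift came from.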

First, we will read in the `SpatialExperiment` object containing the single-cell
data.

```{r read-data-batch-correction, message=FALSE}
spe <- readRDS("data/spe.rds")
```

## fastMNN correction

The `batchelor` package provides the `mnnCorrect` and `fastMNN` functions to
correct for differences between samples/batches. Both functions build on
finding mutual nearest neighbors (MNN) among the cells of different samples and
correct expression differences between the cells [@Haghverdi2018]. The `mnnCorrect` function 
returns corrected expression counts while the `fastMNN` function performs the 
correction in reduced dimension space. As such, `fastMNN` returns integrated
cells in the form of a low-dimensional embedding.

Paper: [Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors](https://www.nature.com/articles/nbt.4091)  
Documentation: [batchelor](https://www.bioconductor.org/packages/release/bioc/vignettes/batchelor/inst/doc/correction.html)
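
To make the idea of mutual nearest neighbors concrete, the following base R
sketch finds MNN pairs between two tiny, made-up batches. The helper
`find_mnn_pairs` is hypothetical and not part of `batchelor`. It only covers
the pair-finding step and leaves out how the pairs are then used to correct the
data.

```r
# Toy data (hypothetical): two batches with 3 cells each and 2 markers
batch1 <- matrix(c(0, 0,
                   5, 5,
                   10, 0), ncol = 2, byrow = TRUE)
batch2 <- matrix(c(1, 1,
                   6, 6,
                   20, 20), ncol = 2, byrow = TRUE)

# Hypothetical helper: return the (row, col) indices of all MNN pairs,
# where row indexes cells in x and col indexes cells in y
find_mnn_pairs <- function(x, y, k = 1) {
    # Euclidean distances between cells of x (rows) and cells of y (columns)
    all_d <- as.matrix(dist(rbind(x, y)))
    d <- all_d[seq_len(nrow(x)), nrow(x) + seq_len(nrow(y)), drop = FALSE]
    # TRUE where cell j of y is among the k nearest neighbors of cell i of x
    x_to_y <- t(apply(d, 1, function(r) rank(r, ties.method = "first") <= k))
    # TRUE where cell i of x is among the k nearest neighbors of cell j of y
    y_to_x <- apply(d, 2, function(r) rank(r, ties.method = "first") <= k)
    # A pair is mutual only if both conditions hold
    which(x_to_y & y_to_x, arr.ind = TRUE)
}

find_mnn_pairs(batch1, batch2, k = 1)
```

With `k = 1`, the first two cells of each batch form MNN pairs, while the third
cell of each batch has no mutual partner. Cells without a mutual partner do not
contribute pairs, which is how the MNN idea copes with populations that are
present in only one batch.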

### Perform sample correction

Here, we apply the `fastMNN` function to integrate cells between 
patients. By setting `auto.merge = TRUE` the function estimates the best 
batch merging order by maximizing the number of MNN pairs at each merging step. 
This is more time consuming than merging sequentially based on how batches appear in the 
dataset (default). We again select the markers defined in Section \@ref(cell-processing)
for sample correction.

The function returns a `SingleCellExperiment` object which contains corrected
low-dimensional coordinates for each cell in the `reducedDim(out, "corrected")`
slot. This low-dimensional embedding can be further used for clustering and
non-linear dimensionality reduction. We transfer the corrected coordinates
to the main `SpatialExperiment` object.

```{r, echo=FALSE}
start_time <- Sys.time()
```

```{r batch-correction-fastMNN, message=FALSE, warning=FALSE}
library(batchelor)
set.seed(220228)
out <- fastMNN(spe, batch = spe$patient_id,
               auto.merge = TRUE,
               subset.row = rowData(spe)$use_channel,
               assay.type = "exprs")

# Transfer the correction results to the main spe object
reducedDim(spe, "fastMNN") <- reducedDim(out, "corrected")
```

```{r, echo=FALSE}
end_time <- Sys.time()
```

The computational time of the `fastMNN` function call is 
`r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.

### Quality control of correction results

The `fastMNN` function further returns outputs that can be used to assess the
quality of the batch correction. The `metadata(out)$merge.info` entry collects
diagnostics for each individual merging step. Here, the `batch.size` and
`lost.var` entries are important. The `batch.size` entry reports the relative
magnitude of the batch effect and the `lost.var` entry represents the percentage
of lost variance per merging step. A large `batch.size` and low `lost.var`
indicate sufficient batch correction.

```{r batch-correction-fastMNN-QC, message=FALSE}
merge_info <- metadata(out)$merge.info 

DataFrame(left = merge_info$left,
          right = merge_info$right,
          batch.size = merge_info$batch.size,
          max_lost_var = rowMax(merge_info$lost.var))
```

We observe that Patient4 and Patient2 are most similar with a low batch effect. 
Merging cells of Patient3 into the combined batch of Patient1,
Patient2 and Patient4 resulted in the highest percentage of lost variance and
the detection of the largest batch effect. In the next section, we
visualize the correction results.

### Visualization

The simplest way to check whether the sample effects were corrected is to use
non-linear dimensionality reduction techniques and observe the mixing of cells across
samples. We will recompute the UMAP embedding using the corrected
low-dimensional coordinates for each cell.

```{r dimred-batch-correction-fastMNN, message=FALSE, warning=FALSE}
library(scater)

set.seed(220228)
spe <- runUMAP(spe, dimred = "fastMNN", name = "UMAP_mnnCorrected")
```

Next, we visualize the corrected UMAP while overlaying patient IDs.

```{r visualizing-batch-correction-fastMNN-1, message=FALSE, warning=FALSE, fig.height=3}
library(cowplot)
library(dittoSeq)
library(viridis)

# visualize patient id 
p1 <- dittoDimPlot(spe, var = "patient_id", 
                   reduction.use = "UMAP", size = 0.2) + 
    scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
    ggtitle("Patient ID on UMAP before correction")
p2 <- dittoDimPlot(spe, var = "patient_id", 
                   reduction.use = "UMAP_mnnCorrected", size = 0.2) + 
    scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
    ggtitle("Patient ID on UMAP after correction")

plot_grid(p1, p2)
```

We observe an imperfect merging of Patient3 into all other samples. This
was already seen when displaying the merging information above.
We now also visualize the expression of selected markers across all cells 
before and after batch correction.

```{r visualizing-batch-correction-fastMNN-2, warning=FALSE, message=FALSE, fig.height=8}
markers <- c("Ecad", "CD45RO", "CD20", "CD3", "FOXP3", "CD206", "MPO", "SMA", "Ki67")

# Before correction
plot_list <- multi_dittoDimPlot(spe, var = markers, reduction.use = "UMAP", 
                   assay = "exprs", size = 0.2, list.out = TRUE) 
plot_list <- lapply(plot_list, function(x) x + scale_color_viridis())
plot_grid(plotlist = plot_list) 

# After correction
plot_list <- multi_dittoDimPlot(spe, var = markers, reduction.use = "UMAP_mnnCorrected", 
                   assay = "exprs", size = 0.2, list.out = TRUE) 
plot_list <- lapply(plot_list, function(x) x + scale_color_viridis())
plot_grid(plotlist = plot_list) 
```

We observe that immune cells across patients are merged after batch correction 
using `fastMNN`. However, the tumor cells of different patients still cluster
separately.
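
Visual inspection of mixing can be complemented by a simple number. The sketch
below computes, for each cell, the fraction of its nearest neighbors that come
from a different sample, and averages this over all cells. The helper
`mixing_score` and the simulated embeddings are hypothetical and only
illustrate the idea. The helper builds a full distance matrix, so it is only
suitable for small inputs such as a subsample of cells.

```r
set.seed(1)
n <- 50
batch <- rep(c("Patient1", "Patient2"), each = n)

# Two simulated 2D embeddings (hypothetical): separated vs. overlapping batches
emb_separated <- rbind(matrix(rnorm(2 * n, mean = 0), ncol = 2),
                       matrix(rnorm(2 * n, mean = 10), ncol = 2))
emb_mixed <- matrix(rnorm(4 * n), ncol = 2)

# Hypothetical helper: mean fraction of each cell's k nearest neighbors
# that belong to a different batch than the cell itself
mixing_score <- function(emb, batch, k = 10) {
    d <- as.matrix(dist(emb))
    diag(d) <- Inf  # a cell is not its own neighbor
    frac_other <- vapply(seq_len(nrow(d)), function(i) {
        nn <- order(d[i, ])[seq_len(k)]
        mean(batch[nn] != batch[i])
    }, numeric(1))
    mean(frac_other)
}

score_separated <- mixing_score(emb_separated, batch)
score_mixed <- mixing_score(emb_mixed, batch)
```

A score near 0 indicates that cells are mostly surrounded by cells of their own
sample, while two equally sized and well-mixed samples give a score near 0.5.
As with the UMAP, a low score for one cell population can reflect a biological
difference between patients and not only a failed correction.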

## harmony correction

The `harmony` algorithm performs batch correction by iteratively clustering
and correcting the positions of cells in PCA space [@Korsunsky2019]. The
`HarmonyMatrix` function takes a matrix of transformed expression counts and,
with `do_pca = TRUE`, internally performs PCA before the iterative k-means
clustering. We will first create the expression matrix and then
call the `HarmonyMatrix` function to perform the correction. 

Paper: [Fast, sensitive and accurate integration of single-cell data with Harmony](https://www.nature.com/articles/s41592-019-0619-0)  
Documentation: [harmony](https://portals.broadinstitute.org/harmony/index.html)
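
The cluster-then-correct loop can be sketched in a strongly simplified form.
The code below is not the `harmony` algorithm: it uses hard k-means clusters
and plain centroid shifts, whereas `harmony` uses soft cluster assignments that
favor clusters containing cells from several batches, together with a
cluster-specific linear correction. The helper `simple_cluster_correct` and the
simulated coordinates are hypothetical.

```r
set.seed(1)
n <- 100
batch <- rep(c("A", "B"), each = n)
cell_type <- rep(rep(c("T1", "T2"), each = n / 2), times = 2)

# Simulated 2D "PCA" coordinates (hypothetical): the two cell types differ on
# the first axis, batch B is shifted by +1.5 on the second axis
emb <- cbind(ifelse(cell_type == "T1", 0, 8) + rnorm(2 * n, sd = 0.5),
             ifelse(batch == "B", 1.5, 0) + rnorm(2 * n, sd = 0.5))

# Hypothetical helper: repeatedly cluster the cells, then move each batch's
# centroid within a cluster onto the overall centroid of that cluster
simple_cluster_correct <- function(emb, batch, centers = 2, n_iter = 3) {
    for (iter in seq_len(n_iter)) {
        cl <- kmeans(emb, centers = centers, nstart = 10)$cluster
        for (k in unique(cl)) {
            in_cl <- cl == k
            cl_centroid <- colMeans(emb[in_cl, , drop = FALSE])
            for (b in unique(batch[in_cl])) {
                idx <- in_cl & batch == b
                shift <- colMeans(emb[idx, , drop = FALSE]) - cl_centroid
                emb[idx, ] <- sweep(emb[idx, , drop = FALSE], 2, shift)
            }
        }
    }
    emb
}

corrected <- simple_cluster_correct(emb, batch)

# Batch difference on the second axis before and after the toy correction
batch_gap <- function(e) abs(diff(tapply(e[, 2], batch, mean)))
batch_gap(emb)
batch_gap(corrected)
```

In this toy setting the batch shift disappears while the separation of the two
cell types on the first axis is kept. The sketch also hints at the risk
discussed in this chapter: any difference between batches within a cluster is
removed, whether it is technical or biological.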

Similar to the `fastMNN` function, `harmony` returns the corrected
low-dimensional coordinates for each cell. These can be saved in the
`reducedDim` slot of the original `SpatialExperiment` object.

```{r, echo=FALSE}
start_time <- Sys.time()
```

```{r batch-correction-harmony, message=FALSE, warning=FALSE}
library(harmony)

mat <- t(assay(spe, "exprs")[rowData(spe)$use_channel,])

harmony_emb <- HarmonyMatrix(mat, spe$patient_id, do_pca = TRUE)

reducedDim(spe, "harmony") <- harmony_emb
```

```{r, echo=FALSE}
end_time <- Sys.time()
```

The computational time of the `HarmonyMatrix` function call is 
`r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.

### Visualization

We will now again visualize the cells in low dimensions after UMAP embedding.

```{r dimred-batch-correction-harmony, message=FALSE}
set.seed(220228)
spe <- runUMAP(spe, dimred = "harmony", name = "UMAP_harmony") 
```

```{r visualizing-batch-correction-harmony-1, message=FALSE, warning=FALSE, fig.height=3}
# visualize patient id 
p1 <- dittoDimPlot(spe, var = "patient_id", 
                   reduction.use = "UMAP", size = 0.2) + 
    scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
    ggtitle("Patient ID on UMAP before correction")
p2 <- dittoDimPlot(spe, var = "patient_id", 
                   reduction.use = "UMAP_harmony", size = 0.2) + 
    scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
    ggtitle("Patient ID on UMAP after correction")

plot_grid(p1, p2)
```

We then visualize the expression of the selected markers defined above.

```{r visualizing-batch-correction-harmony-2, warning=FALSE, message=FALSE, fig.height=8}
# Before correction
plot_list <- multi_dittoDimPlot(spe, var = markers, reduction.use = "UMAP", 
                   assay = "exprs", size = 0.2, list.out = TRUE) 
plot_list <- lapply(plot_list, function(x) x + scale_color_viridis())
plot_grid(plotlist = plot_list) 

# After correction
plot_list <- multi_dittoDimPlot(spe, var = markers, reduction.use = "UMAP_harmony", 
                   assay = "exprs", size = 0.2, list.out = TRUE) 
plot_list <- lapply(plot_list, function(x) x + scale_color_viridis())
plot_grid(plotlist = plot_list) 
```

We observe a more aggressive merging of cells from different patients compared
to the results after `fastMNN` correction. Important immune cell and epithelial
markers are expressed in distinct regions of the UMAP. 

## Seurat correction

The `Seurat` package provides a number of functionalities to analyze single-cell
data. As such it also allows the integration of cells across different samples.
Conceptually, `Seurat` performs batch correction similarly to `fastMNN` by
finding mutual nearest neighbors (MNN) in low dimensional space before
correcting the expression values of cells [@Stuart2019].

Paper: [Comprehensive Integration of Single-Cell Data](https://www.cell.com/cell/fulltext/S0092-8674(19)30559-8)  
Documentation: [Seurat](https://satijalab.org/seurat/index.html)

To use `Seurat`, we will first create a `Seurat` object from the `SpatialExperiment`
object and add relevant metadata. The object also needs to be split by patient
prior to integration.

```{r prepare-seurat, message=FALSE, warning=FALSE}
library(Seurat)
library(SeuratObject)
seurat_obj <- as.Seurat(spe, counts = "counts", data = "exprs")
seurat_obj <- AddMetaData(seurat_obj, as.data.frame(colData(spe)))

seurat.list <- SplitObject(seurat_obj, split.by = "patient_id")
```

To avoid long run times, we will use an approach that relies on reciprocal PCA
instead of canonical correlation analysis for dimensionality reduction and
initial alignment. For an extended tutorial on how to use `Seurat` for data
integration, please refer to their
[vignette](https://satijalab.org/seurat/articles/integration_rpca.html).

We will first define the features used for integration and perform PCA on cells
of each patient individually. The `FindIntegrationAnchors` function detects MNNs between
cells of different patients and the `IntegrateData` function corrects the
expression values of cells. We slightly increase the number of neighbors to be
considered for MNN detection (the `k.anchor` parameter). This increases the integration
strength.

```{r, echo=FALSE}
start_time <- Sys.time()
```

```{r seurat-correction, message=FALSE, warning=FALSE}
features <- rownames(spe)[rowData(spe)$use_channel]
seurat.list <- lapply(X = seurat.list, FUN = function(x) {
    x <- ScaleData(x, features = features, verbose = FALSE)
    x <- RunPCA(x, features = features, verbose = FALSE)
    return(x)
})

anchors <- FindIntegrationAnchors(object.list = seurat.list, 
                                  anchor.features = features,
                                  reduction = "rpca", 
                                  k.anchor = 20)

combined <- IntegrateData(anchorset = anchors)
```

```{r, echo=FALSE}
end_time <- Sys.time()
```
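
Why a larger neighborhood leads to stronger integration can be illustrated by
counting MNN pairs for different neighborhood sizes. The sketch below uses
simulated coordinates and a hypothetical helper, `count_mnn_pairs`. It is not
how `Seurat` selects, filters or scores anchors. It only shows that more
candidate pairs between batches become available as `k` grows.

```r
set.seed(1)

# Simulated 2D coordinates (hypothetical) for two batches of 30 cells each
x <- matrix(rnorm(60), ncol = 2)
y <- matrix(rnorm(60, mean = 0.5), ncol = 2)

# Hypothetical helper: count mutual nearest neighbor pairs for a given k
count_mnn_pairs <- function(x, y, k) {
    all_d <- as.matrix(dist(rbind(x, y)))
    d <- all_d[seq_len(nrow(x)), nrow(x) + seq_len(nrow(y))]
    # TRUE where cell j of y is among the k nearest neighbors of cell i of x
    x_to_y <- t(apply(d, 1, function(r) rank(r, ties.method = "first") <= k))
    # TRUE where cell i of x is among the k nearest neighbors of cell j of y
    y_to_x <- apply(d, 2, function(r) rank(r, ties.method = "first") <= k)
    sum(x_to_y & y_to_x)
}

# Number of MNN pairs for a small, medium and large neighborhood
n_pairs <- vapply(c(1, 5, 20), function(k) count_mnn_pairs(x, y, k), numeric(1))
names(n_pairs) <- c("k = 1", "k = 5", "k = 20")
n_pairs
```

The number of pairs can only stay the same or increase with `k`, because a
pair that is mutual for a small neighborhood remains mutual for a larger one.
More pairs tie the batches together more tightly, at the cost of pairing cells
that are less similar to each other.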

We now select the `integrated` assay and perform PCA dimensionality reduction.
The cell coordinates in PCA reduced space can then be transferred to the 
original `SpatialExperiment` object.

```{r message=FALSE, warning=FALSE}
DefaultAssay(combined) <- "integrated"

combined <- ScaleData(combined, verbose = FALSE)
combined <- RunPCA(combined, npcs = 30, verbose = FALSE)

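# Match rows to the cell order of spe before the transfer; this assumes that
# colnames(spe) are unique and were kept as cell names by as.Seurat
emb <- Embeddings(combined, reduction = "pca")
reducedDim(spe, "seurat") <- emb[colnames(spe), ]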
```

The computational time of the `Seurat` function calls is 
`r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.
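
When an embedding computed outside the `SpatialExperiment` object is stored in
`reducedDim`, as done above, its rows need to be in the same order as the
cells of the object. Splitting and re-assembling an object can change that
order. The following toy sketch, with hypothetical cell names and values, shows
how matching by name guards against a silent mix-up.

```r
# Hypothetical cell names in the order of the main object
cell_order <- c("cell1", "cell2", "cell3", "cell4")

# Toy embedding whose rows come back in a different order
emb <- matrix(c(3, 1, 4, 2,
                30, 10, 40, 20), ncol = 2,
              dimnames = list(c("cell3", "cell1", "cell4", "cell2"),
                              c("PC_1", "PC_2")))

# Guard: every cell must be present exactly once before reordering
stopifnot(setequal(rownames(emb), cell_order),
          !anyDuplicated(rownames(emb)))

# Reorder rows by name so that row i corresponds to cell i of the main object
emb_matched <- emb[cell_order, , drop = FALSE]
emb_matched
```

Indexing by name instead of by position means that the transfer stays correct
even if the order of cells changed along the way.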

### Visualization

As above, we recompute the UMAP embeddings based on `Seurat` integrated results
and visualize the embedding.

```{r umap-seurat, message=FALSE, warning=FALSE}
set.seed(220228)
spe <- runUMAP(spe, dimred = "seurat", name = "UMAP_seurat") 
```

We first visualize the patient IDs.

```{r visualizing-batch-correction-seurat-1, message=FALSE, warning=FALSE, fig.height=3}
# visualize patient id 
p1 <- dittoDimPlot(spe, var = "patient_id", 
                   reduction.use = "UMAP", size = 0.2) + 
    scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
    ggtitle("Patient ID on UMAP before correction")
p2 <- dittoDimPlot(spe, var = "patient_id", 
                   reduction.use = "UMAP_seurat", size = 0.2) + 
    scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
    ggtitle("Patient ID on UMAP after correction")

plot_grid(p1, p2)
```

We then visualize the expression of the selected markers.

```{r visualizing-batch-correction-seurat-2, warning=FALSE, message=FALSE, fig.height=8}
# Before correction
plot_list <- multi_dittoDimPlot(spe, var = markers, reduction.use = "UMAP", 
                   assay = "exprs", size = 0.2, list.out = TRUE) 
plot_list <- lapply(plot_list, function(x) x + scale_color_viridis())
plot_grid(plotlist = plot_list) 

# After correction
plot_list <- multi_dittoDimPlot(spe, var = markers, reduction.use = "UMAP_seurat", 
                   assay = "exprs", size = 0.2, list.out = TRUE) 
plot_list <- lapply(plot_list, function(x) x + scale_color_viridis())
plot_grid(plotlist = plot_list) 
```

Similar to the methods presented above, `Seurat` integrates immune cells correctly.
When visualizing the patient IDs, slight patient-to-patient differences within tumor
cells can be detected. 

Choosing the correct integration approach is challenging without having ground truth
cell labels available. It is recommended to compare different techniques and different
parameter settings. Please refer to the documentation of the individual tools
to become familiar with the possible parameter choices. Furthermore, in the following
section, we will discuss clustering and classification approaches in light of
expression differences between samples.

In general, it appears that MNN-based approaches are more conservative in terms
of merging compared to `harmony`. The more aggressive merging by `harmony`, on
the other hand, could well regress out biological signals. 

## Save objects

The modified `SpatialExperiment` object is saved for further downstream analysis.

```{r save-objects-batch-correction}
saveRDS(spe, "data/spe.rds")
```

## Session Info

<details>
   <summary>SessionInfo</summary>
   
```{r, echo = FALSE}
sessionInfo()
```
</details>