Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

batch effects and PCs blog posts #40

Open
wants to merge 14 commits into
base: master
Choose a base branch
from
11 changes: 7 additions & 4 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,10 @@ Depends:
Imports:
tidybulk,
tidyseurat,
sccomp
sccomp,
Seurat,
SeuratData,
harmony
Suggests:
knitr,
rmarkdown,
Expand All @@ -46,10 +49,10 @@ Suggests:
survival,
dittoSeq,
GGally,
tidygate,
Seurat
tidygate
Remotes:
stemangiola/tidygate
stemangiola/tidygate,
satijalab/seurat-data
Biarch: true
biocViews: RNASeq, DifferentialExpression, GeneExpression, Normalization, Clustering, QualityControl, Sequencing, Transcription, Transcriptomics
URL: https://tidyomics.github.io/tidyomicsBlog/
Expand Down
119 changes: 119 additions & 0 deletions blog/content/post/2024-03-11-batch-effects/BatchEffects.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,119 @@
---
title: "Batch Effect Visualization"
author: "Pau Paiz"
date: "2024-03-11"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

```{r library calls, results = FALSE}
library(Seurat)
library(SeuratData)
library(harmony)
library(tidyverse)
library(ggplot2)

# set seed for reproducibility
set.seed(1234)
```

One important step in single-cell RNA-Seq data analysis is batch correction. Its importance lies in the fact that batch effects occur when the variation in data is caused by technical factors instead of biological factors. Technical factors can result from using different sequencing platforms, capturing times, reagents, laboratories, and in general from experimental designs and handling. They occur in batches because they affect a group of samples that are processed differently in contrast to other samples or data. Moreover, batch effects can be highly nonlinear, which makes it difficult to align the datasets and keep biological variations.These batch effects can lead to erroneous conclusions. But, how do we detect batch effects?

There are various methods to detect them, however two simple ways are dissecting PCA and clusters.

Here we will detect batch effects from clusters. To do so, we will run clustering analysis and compare UMAP plots before and after batch correction. If there is batch effect in data, we will see in the first plot that cells from different batches will cluster together. After we perform batch correction we shouldn’t see that fragmentation.

Various algorithms are being used to perform batch correction, three of the most commonly used are:
1. Mutual Nearest Neighbor (MNN)
2. canonical correlation analysis (CCA)
3. [Harmony](https://github.com/immunogenomics/harmony)

Here we will use [Harmony](https://portals.broadinstitute.org/harmony/index.html), an algorithm that projects cells into a shared embedding so that they group together by cell type rather than dataset-specific conditions. The dataset we will be working with is from Peripheral Blood Mononuclear Cells (PBMC) from 8 lupus patients that were split into stimulated and control groups. The stimulated group was treated with interferon-beta therapy.

```{r}
InstallData("ifnb")
LoadData("ifnb")
str(ifnb)
```

## Quality control metrics
In this step we will filter out low quality cells. It is important to be looking at the number of genes in a cell (`nFeature_RNA`) and the number of total molecules (`nCount_RNA`). This parameters can give us an idea of thequality of the cell because a poor quality cell would have low number of genes or molecules detected. We can also have an extremely high number of genes or molecules detected due to doublets or multiple cells being sequenced together. The % of mitochondrial genes is also important because in dying or low quality cells we can see higher mitochondrial gene contamination.

```{r}
ifnb@images <- list()
ifnb$mito.percent <- PercentageFeatureSet(ifnb, pattern = '^MT-')
View([email protected])
ifnb <- UpdateSeuratObject(ifnb)

# filter out cells according to your desired properties
ifnb.filtered <- subset(ifnb, subset = nCount_RNA > 800 &
nFeature_RNA > 200 &
mito.percent < 5)
```

## Data normalization
Now we need to do data normalization in order to be able to compare gene expression across different cells. The function `NormalizeData` does this procedure, and here *normalization.method = "LogNormalize"*, and *scale.factor = 10000* are the default values.

```{r}
# standard workflow steps
ifnb.filtered <- NormalizeData(ifnb.filtered)
ifnb.filtered <- FindVariableFeatures(ifnb.filtered)
ifnb.filtered <- ScaleData(ifnb.filtered)
ifnb.filtered <- RunPCA(ifnb.filtered)
ElbowPlot(ifnb.filtered)
# if below function gives you an error downgrade to Matrix 1.6-1
# remotes::install_version("Matrix", version = "1.6-1")
ifnb.filtered <- RunUMAP(ifnb.filtered, dims = 1:20)
```

To start, lets plot the data again with the DimPlot function and group the cells by their origin identity to identify the cells from each repeat.

```{r}
before <- DimPlot(ifnb.filtered, reduction = 'umap', group.by = 'stim')
before
```

In the plot we can see that the cells were not well integrated into each other since they are clearly clustering by conditions (CTRL vs STIM). To perform data integration we can use [Harmony](https://portals.broadinstitute.org/harmony/index.html) by specifying what variable we want to use to perform the integration. In this case it will be based on the condition, which is stored in `stim`.

```{r}
# run Harmony -----------
ifnb.harmony <- ifnb.filtered %>%
# below will return corrected dimensionality values / embeddings
RunHarmony(group.by.vars = 'stim', plot_convergence = FALSE)
```

Below you can see in addition to the PCA and UMAP embeddings, Harmony adds a slot with its batch-corrected values.
```{r}
ifnb.harmony@reductions
```
We can always access and get the embeddings we are interested with the functions below. You can see each cell will have corresponding Harmony values.
```{r}
ifnb.harmony.embed <- Embeddings(ifnb.harmony, "harmony")
ifnb.harmony.embed[1:10,1:10]
```

```{r}
# run UMAP and clustering using Harmony embeddings instead of PCA
ifnb.harmony <- ifnb.harmony %>%
RunUMAP(reduction = 'harmony', dims = 1:20) %>%
FindNeighbors(reduction = "harmony", dims = 1:20) %>%
FindClusters(resolution = 0.5)
```

Finally, lets compare the results from the previous workflow and the results after data integration. For that we will plot the two UMAPs with the `DimPlot` function. Note that we are using the updated umap which was computed using the Harmony values. As you can see, there is overlap between cells from both conditions meaning our integration was performed successfully.

```{r}
# visualize
after <- DimPlot(ifnb.harmony, reduction = 'umap', group.by = 'stim')
before|after
```

## References
- Tutorial code is from [Bioinformagician](https://github.com/kpatel427/YouTubeTutorials/blob/main/singleCell_integrate_harmony.R).
- Tran, H.T.N., Ang, K.S., Chevrier, M. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 21, 12 (2020). https://doi.org/10.1186/s13059-019-1850-9
- Batch effect corrections, (2023), 10X Genomics, Source: <https://www.10xgenomics.com/resources/analysis-guides/introduction-batch-effect-correction>
- Hoffman p. Introduction to scRNA-seq integration, (2023), Source: <https://satijalab.org/seurat/articles/integration_introduction.html>

167 changes: 167 additions & 0 deletions blog/content/post/2024-03-11-pca-single-cell/PCs_SingleCell.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,167 @@
---
title: "Interpreting PCs in relation to clusters in single cell analysis"
author: "Pau Paiz"
date: "2024-03-11"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

Single-cell RNA sequential analysis is implemented for cell types identification both existing and novel, tumor types classification, investigation of heterogeneity in different cells, and cell fate prediction and depiction. In this context, an unsupervised learning method such as single-cell clustering represents an important approach to help execute these applications, since it is a key component of cell identification and characterization of gene expression patterns. The upstream data processing includes quality control (QC), normalization and dimension reduction. There are several dimension reduction methods, but here we will focus on principal components analysis (PCA) due to its simplicity and efficiency. The dimension reduction step is important because it reduces the computational work in further steps of clustering, reduces noise and enables more efficient data plotting.

Clustering is a tool to explore data, and its main objective is to summarize complex scRNA-seq data to make it easier to understand for humans. This is achieved by computing euclidean distances across genes in order to identify cells with similar transcriptomic profiles, which allows us to describe population heterogeneity. Regardless of the method used, clustering is a critical step for extracting biological insights from scRNA-seq data. In R, the Seurat package is used for QC and exploration of single-cell RNA-seq data. Here, we will use Seurat as well as Tidyverse packages for the analysis.This post is inspired by Seurat's own clustering tutorial available [here](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html).

```{r results = FALSE}
library(Seurat)
library(tidyverse)
library(GEOquery)

# set seed for reproducibility
set.seed(1234)
```

The dataset we will be working with is from Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. It consists of 2,700 single cells sequenced on the Illumina NextSeq 500. We can read it in using the `Read10X` function.

```{r}
#Read data
pbmc <- Read10X("./filtered_gene_bc_matrices/hg19/")
```

It is in dgCmatrix format so we need to cast it as a Seurat object. We will keep all the features that are expressing at least 3 cells, keep all those cells that have at least 200 features or genes.

```{r}
pbmc <- CreateSeuratObject(counts= pbmc, project= "pbmc", min.cells=3, min.features=200)
```

## Quality control metrics
In this step we will filter out low quality cells. It is important to be looking at the number of genes in a cell (`nFeature_RNA`) and the number of total molecules (`nCount_RNA`). This parameters can give us an idea of thequality of the cell because a poor quality cell would have low number of genes or molecules detected. We canalso have an extremely high number of genes or molecules detected due to doublets or multiple cells being sequenced together. The % of mitochondrial genes is also important because in dying or low quality cells we can see higher mitochondrial gene contamination.

```{r}
View([email protected])
```

Let's calculate the percentage of mitochondrial genes with the function `PercentageFeatureSet`, for this function we need to provide a pattern. We are going to calculate the % in all the genes that start with MT.
```{r}
pbmc <- PercentageFeatureSet(pbmc, pattern= "^MT-", col.name = "percent.mt")
View([email protected])
```

We can visualize this QC metrics as a violin plot, in features we need to include all the columns that we want to visualize.
```{r}
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol=3)
```

In the *Percent.mt* graph we can see that the cells that have higher mitochondrial percentage and that are disperse are the ones that need to be filtered out.

We can also see the features together with the `FeatureScatter` function which allows us to plot two metrics.
```{r}
FeatureScatter(pbmc, feature2= "nFeature_RNA", feature1 = "nCount_RNA")+
geom_smooth(method = 'lm')
```

In this plot we are plotting the **number of genes** on the Y axes, and the **number of molecules** in the X axes. A good quality dataset should follow the straight line, we can see that the majority of the data follow the line but we can see some sparse points that need to be filtered.

## Filtering data
Now we need to filter the data, in this case we will set the boundaries to have the number of genes greater than 400, the number of molecules up to 2000 and the mitochondrial percentage minor to 5%.
In the plot we can see the difference with the cells already filtered out.

```{r}
pbmc <- subset(pbmc, subset= nFeature_RNA>400 & nCount_RNA <2000 & percent.mt <5)
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol=3)
pbmc
```

## Data normalization
Now we need to do data normalization in order to be able to compare gene expression across different cells. The function `NormalizeData` does this procedure, and here *normalization.method = "LogNormalize"*, and *scale.factor = 10000* are the default values.
```{r}
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
```

Next, we are going to select the features that show high cell to cell variation, so we will perform feature selection with the `FindVariableFeatures` function. *Selection.method = "vst", nfeatures = 2000* are the default values.
```{r}
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
```

We can visualize this top features with the `VariableFeaturePlot` function.
```{r}
VariableFeaturePlot(pbmc)
```

In red we can see the variable features, the points that are higher in the plot are the top variable features, and the points in black are the non-variable features.

## Data scaling
This step converts the absolute expression measurements to relative concentrations and simultaneously removes efficiency noise, we can do this with the `ScaleData` function. We are going to select all the genes as features, the names of all genes will be extracted from the Seurat object.
```{r}
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features= all.genes)
```

## Perform linear dimensionality reduction (PCA) on the scaled data
It is important to perform PCA on scRNA-Seq data because the real dimensionality of the data is much lower than the number of genes. For example, many genes are co-expressed or highly correlated suggesting the cells expressing those genes belong to similar celltypes. Sometimes technical noise is a significant source of variation in the data so it might be captured by some of the principal components. Researchers often have to decide which components to consider as representing biological signal versus technical noise, which can be a challenging task. We will run PCA with the `RunPCA` function, we just need to provide the Seurat object.

```{r}
pbmc <- RunPCA(pbmc)
```

## Examine and visualize PCA results
Here we will see the top 5 principal components, we can change the number of PCs we want to see by changing the number in *nfeatures*. We will also see the features that have negative and positive PCs scores.
```{r}
print(pbmc[["pca"]], dims= 1:3, nfeatures = 5)
```

Another way of visualize the PCs is with a Heatmap. We can do this with the `DimHeatmap` function, we can change the number of PCs by changing the number in *dmis*. Here we are plotting 500 cells.
```{r}
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
```

The heatmap is colored by the PCs scores and we can see the features that exhibit heterogeneity.

## Determine the dimensionality of the data
Now we are going to choose only the statistical significant PCs that capture the majority of the signal in a downstream analysis. The elbow plot shows an elbow after which the PCs do not vary much in the % of variance. It is better to use too many PCs rather than too few since this could affect the downstream analysis.

```{r}
ElbowPlot(pbmc)
#it only runs the first 20 cps by default use function below if you want to customize this
#ElbowPlot(pbmc, ndims = 50, reduction = "pca")
```

Here we can see that the point in which the elbow is formed is around 4.

## Clustering
We want to cluster similar cells, with similar feature expression patterns. For that we use the function `FindNeighbors` with the 4 PCs
```{r}
pbmc <- FindNeighbors(pbmc, dims= 1:4)
```
Now we want to assign the cells to the clusters. For that we can use the function `FindClusters`. Here, resolution is the granularity of the clusters, the lower the number, the lower the clusters. Tools such as [Clustree](https://github.com/lazappi/clustree) and [sigclust2](https://github.com/pkimes/sigclust2) can help you select the resolution granularity. In this example we will go with 0.5.
```{r}
pbmc <- FindClusters(pbmc, resolution= 0.5)
```
Next we are going to visualize how many clusters do we have for each resolution so that we can choose the resolution that works best. We are going to do that with `DimPlot`, grouping the cells by resolution, starting with resolution 0.1, then with resolution 0.3
```{r}
DimPlot(pbmc, group.by = "RNA_snn_res.0.5", label = TRUE)
```

## Setting identity of clusters

Below we are assigning each cell to the clusters we found.
```{r}
Idents(pbmc) <- "RNA_snn_res.0.5"
head(Idents(pbmc), 5)
```

## Non-linear dimensionality reduction
We can then see what cells are similar to each other using non-linear dimensional reduction techniques such as UMAP.
```{r}
pbmc <- RunUMAP(pbmc, dims= 1:20)
DimPlot(pbmc, reduction = "umap", label = TRUE, repel = TRUE)
```

There are many resources that can help you interpret UMAPs.[Here](https://alleninstitute.org/resource/what-is-a-umap/) is one example. In the next tutorial on Batch effects we will see how to make use of this technique.

## References
- Seurat Developers. (n.d.). Guided Tutorial: Analyzing PBMC scRNA-seq data. Satija Lab. Retrieved from https://satijalab.org/seurat/articles/pbmc3k_tutorial.html
- Seth S, Mallik S, Bhadra T, Zhao Z. Dimensionality Reduction and Louvain Agglomerative Hierarchical Clustering for Cluster-Specified Frequent Biomarker Discovery in Single-Cell Sequencing Data. Front Genet. 2022 Feb 7;13:828479. doi: 10.3389/fgene.2022.828479. PMID: 35198011; PMCID: PMC8859265.
- Zhang S, Li X, Lin J, Lin Q, Wong KC. Review of single-cell RNA-seq data clustering for cell-type identification and characterization. RNA. 2023 May;29(5):517-530. doi: 10.1261/rna.078965.121. Epub 2023 Feb 3. PMID: 36737104; PMCID: PMC10158997.
- Amezquita R, Lun A, Hicks S, Gottardo S, 2021. Bioconductor, Source: <https://github.com/OSCA-source/OSCA.basic>
Binary file not shown.
Loading
Loading