-
Notifications
You must be signed in to change notification settings - Fork 13
/
new-10kPBMC-Seurat.Rmd
450 lines (330 loc) · 21.5 KB
/
new-10kPBMC-Seurat.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
---
title: "Seurat tutorial using 10k PBMCs dataset"
output: html_document
---
This vignette should introduce you to some typical tasks, using Seurat eco-system. Seurat vignettes are available [here](https://satijalab.org/seurat/articles/get_started.html). An alternative to this vignette in Python (using scanpy) is also available in the same repository. The interconversion and exploration of datasets from Python to Seurat (and SCE) is described in a separate vignette. Let's install the packages currently missing in a default JupyterHub environment.
Only run the chunk below if you're running the code for the first time, and do not have the packages installed.
```{r eval=FALSE}
remotes::install_github('chris-mcginnis-ucsf/DoubletFinder')
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("celldex")
BiocManager::install("SingleR")
BiocManager::install("glmGamPoi")
```
Let's now load all the libraries that will be needed for the tutorial.
```{r warning=FALSE, message=FALSE}
library(Seurat)
library(DropletUtils)
library(ggplot2)
library(DoubletFinder)
library(SingleR)
library(dplyr)
library(celldex)
library(knitr)
library(RColorBrewer)
```
### PART 1. Basic quality control and filtering.
We start the analysis after two preliminary steps have been completed: 1) ambient RNA correction using soupX; 2) doublet detection using scrublet. Both vignettes can be found in this repository.
To start the analysis, let's read in the corrected matrices:
```{r}
adj.matrix <- Read10X("../data/soupX_pbmc10k_filt")
```
After this, we will make a `Seurat` object. Seurat object summary shows us that 1) number of cells ("samples") approximately matches
the description of each dataset (10194); 2) there are 36601 genes (features) in the reference.
```{r}
srat <- CreateSeuratObject(adj.matrix,project = "pbmc10k")
srat
```
Let's look at the combined object a bit closer. Str allows us to see all fields of the class:
```{r}
str(srat)
```
Meta.data is the most important field for next steps. It can be acessed using both @ and [[]] operators. Right now it has 3 fields per celL: dataset ID, number of UMI reads detected per cell (nCount_RNA), and the number of expressed (detected) genes per same cell (nFeature_RNA).
```{r}
meta <- [email protected]
dim(meta)
head(meta)
summary(meta$nCount_RNA)
summary(meta$nFeature_RNA)
```
Let's add several more values useful in diagnostics of cell quality. Michochondrial genes are useful indicators of cell state. For mouse datasets, change pattern to "Mt-", or explicitly list gene IDs with the _features = ..._ option.
```{r}
srat[["percent.mt"]] <- PercentageFeatureSet(srat, pattern = "^MT-")
```
Similarly, we can define ribosomal proteins (their names begin with __RPS__ or __RPL__), which often take substantial fraction of reads:
```{r}
srat[["percent.rb"]] <- PercentageFeatureSet(srat, pattern = "^RP[SL]")
```
Now, let's add the doublet annotation generated by `scrublet` to the Seurat object metadata.
```{r}
doublets <- read.table("../data/scrublet_calls.tsv",header = F,row.names = 1)
colnames(doublets) <- c("Doublet_score","Is_doublet")
srat <- AddMetaData(srat,doublets)
head(srat[[]])
```
Let's make violin plots of the selected metadata features. Note that the plots are grouped by categories named identity class.
Identity class can be seen in [email protected], or using `Idents()` function. Active identity can be changed using `SetIdents()`.
```{r fig.align="center"}
VlnPlot(srat, features = c("nFeature_RNA","nCount_RNA","percent.mt","percent.rb"),ncol = 4) &
theme(plot.title = element_text(size=10))
```
Let's plot some of the metadata features against each other and see how they correlate. The number above each plot is a Pearson correlation coefficient.
```{r fig.align="center"}
FeatureScatter(srat, feature1 = "nCount_RNA", feature2 = "percent.mt")
FeatureScatter(srat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
FeatureScatter(srat, feature1 = "nCount_RNA", feature2 = "percent.rb")
FeatureScatter(srat, feature1 = "percent.rb", feature2 = "percent.mt")
FeatureScatter(srat, feature1 = "nFeature_RNA", feature2 = "Doublet_score")
```
The plots above clearly show that high MT percentage strongly correlates with low UMI counts, and usually is interpreted as dead cells.
High ribosomal protein content, however, strongly anti-correlates with MT, and seems to contain biological signal. There's also a strong correlation between the doublet score and number of expressed genes.
Let's set QC column in metadata and define it in an informative way.
```{r}
srat[['QC']] <- ifelse([email protected]$Is_doublet == 'True','Doublet','Pass')
srat[['QC']] <- ifelse([email protected]$nFeature_RNA < 500 & [email protected]$QC == 'Pass','Low_nFeature',[email protected]$QC)
srat[['QC']] <- ifelse([email protected]$nFeature_RNA < 500 & [email protected]$QC != 'Pass' & [email protected]$QC != 'Low_nFeature',paste('Low_nFeature',[email protected]$QC,sep = ','),[email protected]$QC)
srat[['QC']] <- ifelse([email protected]$percent.mt > 15 & [email protected]$QC == 'Pass','High_MT',[email protected]$QC)
srat[['QC']] <- ifelse([email protected]$nFeature_RNA < 500 & [email protected]$QC != 'Pass' & [email protected]$QC != 'High_MT',paste('High_MT',[email protected]$QC,sep = ','),[email protected]$QC)
table(srat[['QC']])
```
We can see that doublets don't often overlap with cell with low number of detected genes; at the same time, the latter often co-insides with high mitochondrial content. Let's plot metadata only for cells that pass tentative QC:
```{r fig.align="center"}
VlnPlot(subset(srat, subset = QC == 'Pass'),
features = c("nFeature_RNA", "nCount_RNA", "percent.mt","percent.rb"), ncol = 4) &
theme(plot.title = element_text(size=10))
```
### PART 2. Normalization and dimensionality reduction.
In order to do further analysis, we need to normalize the data to account for sequencing depth. Conventional way is to scale it to 10,000 (as if all cells have 10k UMIs overall), and log2-transform the obtained values. Normalized data are stored in `srat[['RNA']]@data` of the 'RNA' assay.
```{r}
srat <- NormalizeData(srat)
```
Next step discovers the most variable features (genes) - these are usually most interesting for downstream analysis.
```{r}
srat <- FindVariableFeatures(srat, selection.method = "vst", nfeatures = 2000)
```
Identify the 10 most highly variable genes:
```{r}
top10 <- head(VariableFeatures(srat), 10)
top10
```
Plot variable features with and without labels:
```{r fig.align="center",warning=FALSE}
plot1 <- VariableFeaturePlot(srat)
LabelPoints(plot = plot1, points = top10, repel = TRUE, xnudge = 0, ynudge = 0)
```
ScaleData converts normalized gene expression to Z-score (values centered at 0 and with variance of 1). It's stored in srat[['RNA']]@scale.data and used in following PCA. Default is to run scaling only on variable genes.
```{r}
all.genes <- rownames(srat)
srat <- ScaleData(srat, features = all.genes)
```
We can now do PCA, which is a common way of linear dimensionality reduction. By default we use 2000 most variable genes.
```{r}
srat <- RunPCA(srat, features = VariableFeatures(object = srat))
```
Prinicpal component "loadings" should match markers of distinct populations for well behaved datasets. Note that you can change many plot parameters using ggplot2 features - passing them with & operator.
```{r fig.align="center"}
VizDimLoadings(srat, dims = 1:9, reduction = "pca") &
theme(axis.text=element_text(size=5), axis.title=element_text(size=8,face="bold"))
```
Alternatively, one can do heatmap of each principal component or several PCs at once:
```{r fig.align="center"}
DimHeatmap(srat, dims = 1:6, nfeatures = 20, cells = 500, balanced = T)
```
DimPlot is used to visualize all reduced representations (PCA, tSNE, UMAP, etc). Identity is still set to "orig.ident". DimPlot has built-in hiearachy of dimensionality reductions it tries to plot: first, it looks for UMAP, then (if not available) tSNE, then PCA.
```{r fig.align="center"}
DimPlot(srat, reduction = "pca")
```
It's often good to find how many PCs can be used without much information loss. In our case a big drop happens at 10, so seems like a good _initial_ choice:
```{r fig.align="center"}
ElbowPlot(srat)
```
We can now do clustering. Higher resolution leads to more clusters (default is 0.8). It would be very important to find the correct cluster resolution in the future, since cell type markers depends on cluster definition.
```{r}
srat <- FindNeighbors(srat, dims = 1:10)
srat <- FindClusters(srat, resolution = 0.5)
```
For visualization purposes, we also need to generate UMAP reduced dimensionality representation:
```{r}
srat <- RunUMAP(srat, dims = 1:10, verbose = F)
```
Once clustering is done, active identity is reset to clusters ("seurat_clusters" in metadata). Let's look at cluster sizes.
```{r}
table([email protected]$seurat_clusters)
```
DimPlot uses _UMAP_ by default, with Seurat clusters as _identity_:
```{r fig.align="center"}
DimPlot(srat,label.size = 4,repel = T,label = T)
```
In order to control for clustering resolution and other possible artifacts, we will take a close look at two minor cell populations: 1) dendritic cells (DCs), 2) platelets, aka thrombocytes. Let's visualise two markers for each of this cell type: LILRA4 and TPM2 for DCs, and PPBP and GP1BB for platelets.
```{r fig.align="center"}
FeaturePlot(srat, features = c("LILRA4", "TPM2", "PPBP", "GP1BB"))
```
Let's visualize other confounders:
```{r fig.align="center"}
FeaturePlot(srat, features = "Doublet_score") & theme(plot.title = element_text(size=10))
FeaturePlot(srat, features = "percent.mt") & theme(plot.title = element_text(size=10))
FeaturePlot(srat, features = "nFeature_RNA") & theme(plot.title = element_text(size=10))
```
Let's remove the cells that did not pass QC and compare plots. We can now see much more defined clusters. Our filtered dataset now contains 8824 cells - so approximately 12% of cells were removed for various reasons.
```{r fig.align="center"}
DimPlot(srat,label.size = 4,repel = T,label = T)
srat <- subset(srat, subset = QC == 'Pass')
DimPlot(srat,label.size = 4,repel = T,label = T)
```
Finally, let's calculate cell cycle scores, as described [here](https://satijalab.org/seurat/archive/v3.1/cell_cycle_vignette.html). This has to be done after normalization and scaling. Seurat has a built-in list, cc.genes (older) and cc.genes.updated.2019 (newer), that defines genes involved in cell cycle. For CellRanger reference GRCh38 2.0.0 and above, use cc.genes.updated.2019 (three genes were renamed: MLF1IP, FAM64A and HN1 became CENPU, PICALM and JPT). For mouse cell cycle genes you can use the solution detailed [here](https://github.com/satijalab/seurat/issues/2493).
```{r}
cc.genes.updated.2019
s.genes <- cc.genes.updated.2019$s.genes
g2m.genes <- cc.genes.updated.2019$g2m.genes
srat <- CellCycleScoring(srat, s.features = s.genes, g2m.features = g2m.genes)
table(srat[[]]$Phase)
```
Let's see if we have clusters defined by any of the technical differences. Mitochnondrial genes show certain dependency on cluster, being much lower in clusters 2 and 12.
```{r fig.align="center"}
FeaturePlot(srat,features = "percent.mt",label.size = 4,repel = T,label = T) & theme(plot.title = element_text(size=10))
```
```{r fig.align="center"}
VlnPlot(srat,features = "percent.mt") & theme(plot.title = element_text(size=10))
```
Ribosomal protein genes show very strong dependency on the putative cell type! Some cell clusters seem to have as much as 45%, and some as little as 15%.
```{r fig.align="center"}
FeaturePlot(srat,features = "percent.rb",label.size = 4,repel = T,label = T) & theme(plot.title = element_text(size=10))
```
```{r fig.align="center"}
VlnPlot(srat,features = "percent.rb") & theme(plot.title = element_text(size=10))
```
There are also differences in RNA content per cell type. From earlier considerations, clusters 6 and 7 are probably lower quality cells that will disapper when we redo the clustering using the QC-filtered dataset.
```{r fig.align="center"}
VlnPlot(srat,features = c("nCount_RNA","nFeature_RNA")) & theme(plot.title = element_text(size=10))
```
Finally, cell cycle score does not seem to depend on the cell type much - however, there are dramatic outliers in each group.
```{r fig.align="center"}
FeaturePlot(srat,features = c("S.Score","G2M.Score"),label.size = 4,repel = T,label = T) & theme(plot.title = element_text(size=10))
VlnPlot(srat,features = c("S.Score","G2M.Score")) & theme(plot.title = element_text(size=10))
```
### PART 3. SCTransform normalization and clustering.
Since we have performed extensive QC with doublet and empty cell removal, we can now apply [SCTransform normalization](https://satijalab.org/seurat/articles/sctransform_vignette.html), that was shown to be beneficial for finding rare cell populations by improving signal/noise ratio. Single SCTransform command replaces NormalizeData(), ScaleData(), and FindVariableFeatures(). We will also correct for % MT genes and cell cycle scores.
```{r}
srat <- SCTransform(srat, method = "glmGamPoi", ncells = 8824,
vars.to.regress = c("percent.mt","S.Score","G2M.Score"), verbose = F)
srat
```
After this let's do standard PCA, UMAP, and clustering. Note that SCT is the active assay now. It is conventional to use more PCs with SCTransform; the exact number can be adjusted depending on your dataset.
```{r fig.align="center"}
srat <- RunPCA(srat, verbose = F)
srat <- RunUMAP(srat, dims = 1:30, verbose = F)
srat <- FindNeighbors(srat, dims = 1:30, verbose = F)
srat <- FindClusters(srat, verbose = F)
table(srat[[]]$seurat_clusters)
DimPlot(srat, label = T)
```
It is very important to set clusters correctly. Let's check the markers of smaller cell populations we have mentioned before - namely, platelets and dendritic cells. Let's also try another color scheme - just to show how it can be done.
```{r fig.align="center",warning=F,message=F}
FeaturePlot(srat,"PPBP") & scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
FeaturePlot(srat,"LILRA4") & scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
```
We can see there's a cluster of platelets located between clusters 6 and 14, that has not been identified. Increasing clustering resolution in `FindClusters` to 2 would help separate the platelet cluster (try it!), but also generates too many clusters.
If need arises, we can separate some clusters manualy. There are also clustering methods geared towards indentification of rare cell populations. Let's try using fewer neighbors in the KNN graph, combined with Leiden algorithm (now default in `scanpy`) and slightly increased resolution:
```{r fig.align="center"}
srat <- FindNeighbors(srat, dims = 1:30, k.param = 15, verbose = F)
srat <- FindClusters(srat, verbose = F, algorithm = 4, resolution = 0.9)
table(srat[[]]$seurat_clusters)
DimPlot(srat, label = T)
```
We already know that cluster 16 corresponds to platelets, and cluster 15 to dendritic cells. Let's get a very crude idea of what the big cell clusters are.
```{r fig.align="center",warning=F,message=F}
FeaturePlot(srat,"MS4A1") +
scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral"))) + ggtitle("MS4A1: B cells")
FeaturePlot(srat,"LYZ") +
scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral"))) + ggtitle("LYZ: monocytes")
FeaturePlot(srat,"NKG7") +
scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral"))) + ggtitle("NKG7: natural killers")
FeaturePlot(srat,"CD8B") +
scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral"))) + ggtitle("CD8B: CD8 T cells")
FeaturePlot(srat,"IL7R") +
scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral"))) + ggtitle("IL7R: CD4 T cells")
```
We can see better separation of some subpopulations. These will be further addressed in PART 5.
### PART 4. Differential expression and marker selection.
Differential expression allows us to define gene markers specific to each cluster. By definition it is influenced by how clusters are defined, so it's important to find the correct resolution. If some clusters lack any notable markers, adjust the clustering. It is [recommended](https://github.com/satijalab/seurat/discussions/4032) to do differential expression on RNA assay, and not the SCTransform. Differential expression can be done between two clusters, as well as between a cluster and all other cells.
First, let's set the active assay back to "RNA", and re-do the normalization and scaling (since we removed a notable fraction of cells that failed QC):
```{r}
DefaultAssay(srat) <- "RNA"
srat <- NormalizeData(srat)
srat <- FindVariableFeatures(srat, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(srat)
srat <- ScaleData(srat, features = all.genes)
```
The following function allows to find markers for every cluster by comparing it to all remaining cells, while reporting only the positive ones. There are many tests that can be used to define markers, including a very fast and intuitive [tf-idf](https://constantamateur.github.io/2020-04-10-scDE/). By default, Wilcoxon Rank Sum test is used. This takes a while - take few minutes to make coffee or a cup of tea!
```{r warning = F,message = F}
all.markers <- FindAllMarkers(srat, only.pos = T, min.pct = 0.25, logfc.threshold = 0.25)
```
Let's take a quick glance at the markers.
```{r}
dim(all.markers)
table(all.markers$cluster)
top3_markers <- as.data.frame(all.markers %>% group_by(cluster) %>% top_n(n = 3, wt = avg_log2FC))
top3_markers
```
Some markers are less informative than others. For detailed dissection, it might be good to do differential expression between subclusters (see below).
### PART 5. Cell type annotation using SingleR.
Given the markers that we've defined, we can mine the literature and identify each observed cell type (it's probably the easiest for PBMC). However, we can try automaic annotation with SingleR is workflow-agnostic (can be used with Seurat, SCE, etc). Detailed signleR manual with advanced usage can be found [here](http://bioconductor.org/books/release/SingleRBook/).
Let's get reference datasets from celldex package. Note that there are two cell type assignments, `label.main` and `label.fine`:
```{r warning=F,message=F}
hpca.ref <- celldex::HumanPrimaryCellAtlasData()
dice.ref <- celldex::DatabaseImmuneCellExpressionData()
monaco.ref <- celldex::MonacoImmuneData()
```
Let's convert our Seurat object to single cell experiment (SCE) for convenience. After this, using `SingleR` becomes very easy:
```{r}
sce <- as.SingleCellExperiment(srat)
sce
hpca.main <- SingleR(test = sce,assay.type.test = 1,ref = hpca.ref,labels = hpca.ref$label.main)
hpca.fine <- SingleR(test = sce,assay.type.test = 1,ref = hpca.ref,labels = hpca.ref$label.fine)
dice.main <- SingleR(test = sce,assay.type.test = 1,ref = dice.ref,labels = dice.ref$label.main)
dice.fine <- SingleR(test = sce,assay.type.test = 1,ref = dice.ref,labels = dice.ref$label.fine)
monaco.main <- SingleR(test = sce,assay.type.test = 1,ref = monaco.ref,labels = monaco.ref$label.main)
monaco.fine <- SingleR(test = sce,assay.type.test = 1,ref = monaco.ref,labels = monaco.ref$label.fine)
```
Let's see the summary of general cell type annotations. These match our expectations (and each other) reasonably well.
```{r}
table(hpca.main$pruned.labels)
table(dice.main$pruned.labels)
table(monaco.main$pruned.labels)
```
Finer cell type annotations are harder to achieve; as you can see, many of these do not match:
```{r}
table(hpca.fine$pruned.labels)
table(dice.fine$pruned.labels)
table(monaco.fine$pruned.labels)
```
Let's add the annotations to the Seurat object metadata so we can use them:
```{r}
[email protected]$hpca.main <- hpca.main$pruned.labels
[email protected]$dice.main <- dice.main$pruned.labels
[email protected]$monaco.main <- monaco.main$pruned.labels
[email protected]$hpca.fine <- hpca.fine$pruned.labels
[email protected]$dice.fine <- dice.fine$pruned.labels
[email protected]$monaco.fine <- monaco.fine$pruned.labels
```
Finally, let's visualize the fine-grained annotations.
```{r fig.align="center"}
srat <- SetIdent(srat, value = "hpca.fine")
DimPlot(srat, label = T , repel = T, label.size = 3) + NoLegend()
srat <- SetIdent(srat, value = "dice.fine")
DimPlot(srat, label = T , repel = T, label.size = 3) + NoLegend()
srat <- SetIdent(srat, value = "monaco.fine")
DimPlot(srat, label = T , repel = T, label.size = 3) + NoLegend()
```
Comparing the labels obtained from the three sources, we can see many interesting discrepancies. Sorthing those out requires manual curation. However, many informative assignments can be seen.
For example, small cluster 17 is repeatedly identified as plasma B cells. This distinct subpopulation displays markers such as CD38 and CD59.
```{r fig.align="center",warning=F,message=F}
FeaturePlot(srat,"CD38") + scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
FeaturePlot(srat,"CD59") + scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
```
Similarly, cluster 13 is identified to be MAIT cells. Literature suggests that blood MAIT cells are characterized by high expression of CD161 (KLRB1), and chemokines like CXCR6. This indeed seems to be the case:
```{r fig.align="center",warning=F,message=F}
FeaturePlot(srat,"KLRB1") + scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
FeaturePlot(srat,"CXCR6") + scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
```
In general, even simple example of PBMC shows how complicated cell type assignment can be, and how much effort it requires. A detailed book on how to do cell type assignment / label transfer with `singleR` [is available](https://bioconductor.org/books/3.12/SingleRBook/).