-
Notifications
You must be signed in to change notification settings - Fork 1
/
vignette.Rmd
3602 lines (3012 loc) · 166 KB
/
vignette.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: Vignette - Computational proteomics of a THP-1 human leukaemia cell line
author: "Lisa M Breckels"
date: "`r format(Sys.time(), '%d %B, %Y')`"
fontsize: 11pt
bibliography: refs.bib
link-citations: yes
csl: nature-communications.csl
output:
rmdformats::robobook:
theme: default
fig_caption: true
use_bookdown: true
editor_options:
chunk_output_type: console
---
<style>
.book .book-body .page-inner section.normal h1 {
font-size: 16px;
}
.book .book-body .page-inner section.normal h2,
.book .book-body .page-inner section.normal h3,
.book .book-body .page-inner section.normal h4 {
font-size: 14px;
}
.book .book-body .page-inner section.normal table td {
font-size: 12px;
}
.book .book-body .page-inner section.normal pre>code.r {
font-size: 10px;
}
.book .book-body .page-inner section.normal pre {
font-size: 10px
}
</style>
# Abstract
This vignette contains the [R](https://www.r-project.org)[@R] pipeline used for the computational analysis reported in the paper "Spatiotemporal proteomic profiling of the pro-inflammatory response to lipopolysaccharide in the THP-1 human leukemia cell line" by Mulvey, Breckels et al [@Mulvey:2021].
# Introduction
In this vignette we work with the data at the PSM and protein level which which are freely and openly available as part of the [Bioconductor pRolocdata package](https://bioconductor.org/packages/release/data/experiment/html/pRolocdata.html) (≥v1.27.3) [@pRoloc:2014] and in the supplementary information as disseminated with the accompanying manuscript [@Mulvey:2021].
All analysis in the manuscript is done in the R programming language, unless otherwise stated, using the suite of mass spectrometry spatial proteomics packages [`MSnbase`](https://www.bioconductor.org/packages/release/bioc/html/MSnbase.html)[@MSnbase:2012], [`pRoloc`](https://www.bioconductor.org/packages/release/bioc/html/pRoloc.html)[@pRoloc:2014], [`pRolocGUI`](https://www.bioconductor.org/packages/release/bioc/html/pRolocGUI.html) and [`pRolocdata`](https://bioconductor.org/packages/release/data/experiment/html/pRolocdata.html).
All raw mass spectrometry proteomics data from this study have been deposited to the [ProteomeXchange](http://proteomecentral.proteomexchange.org/cgi/GetDataset) Consortium via the PRIDE partner repository [@pride:2019], with the dataset identifier PXD023509.
## Before you start {#start}
To run through the code in this vignette please download the [THP LOPIT Github repository](https://github.com/CambridgeCentreForProteomics/thp-lopit-2021), the [`R` programming language](https://www.r-project.org) and we also recommend you download [`RStudio` IDE](https://www.rstudio.com/products/rstudio/download/) which is a great interface to R.
Please open RStudio and navigate to the `thp-lopit-2021` directory you have just saved and downloaded. To do this go to menu
Session -> Set Working Directory -> Choose Directory ...
If you type `getwd` into the R console. You should see the directory name in your path. For more information on how to use RStudio please see [RStudio education website](https://education.rstudio.com) for some nice introductory tutorials.
## Accessing the data in R
The `pRolocdata` package is a R [Bioconductor](http://bioconductor.org/) [experiment package](http://bioconductor.org/packages/release/BiocViews.html#___ExperimentData) that collects mass spectrometry-based spatial/organelle and protein-complex datasets from published experiments. Each data is stored in a container called a `MSnSet` instance (see the [`MSnbase`](http://www.bioconductor.org/packages/release/bioc/html/MSnbase.html) for details) which allows users to work seamlessly with the R [`pRoloc`](http://bioconductor.org/packages/release/data/experiment/html/pRolocdata.html) and [`pRolocGUI`](http://bioconductor.org/packages/devel/bioc/html/pRolocGUI.html) software for spatial proteomics data analysis and visualisation.
The first step is to install the `pRoloc` and `pRolocdata` packages from Bioconductor
```{r getpkgs, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("pRoloc", "pRolocdata", "MSnbase")
## we also use the knitr and VennDiagram, ggplot2 packages in this
## vignette so please install if you do not have them
BiocManager::install(c("knitr", "VennDiagram", "ggplot2", "dplyr", "kableExtra",
"coda", "reshape2", "circlize", "ggalluvial", "tidyr",
"Rmisc", "gridExtra", "gplots", "scico", "plotrix",
"RColorBrewer"))
```
Note: To access the widest selection on proteomics datasets we advise users to install the latest version `pRolocdata` package from Bioconductor. The THP data is in version >=1.29.1
Now we can load the packages by calling `library` function in R
```{r loadpkgs, eval=TRUE, warning=FALSE, message=FALSE}
library("pRoloc")
library("pRolocdata")
library("MSnbase")
```
and subsequently we can now access all the datasets in `pRolocdata`. To list all available datasets we use the `data` function.
```{r loadthp}
data(package = "pRolocdata")
```
We see there are 27 datasets (including the PSM and protein level data) associated with the manuscript [1], 20 from the hyperLOPIT data anlysis and 7 from the LPS timecourse [@Mulvey:2021]. For more information on these datasets, we can type
```{r helpdata}
?thpLOPIT_lps_mulvey2021
?lpsTimecourse_mulvey2021
```
```{r ssfig1, echo=FALSE, fig.cap="Screenshot of the documentation files in R for the THP LOPIT and LPS timecourse datasets", fig.show='hold',fig.align='center'}
knitr::include_graphics("figures/Screenshot-docs.png")
```
Each dataset can be loaded using the `data` function. For example, to load the final protein level LOPIT datasets for the unstimulated and 12 hour LPS stimulated datasets,
```{r loadadataset, results = "hide"}
data("thpLOPIT_unstimulated_mulvey2021")
thpLOPIT_unstimulated_mulvey2021
data("thpLOPIT_lps_mulvey2021")
thpLOPIT_lps_mulvey2021
```
In the next sections we introduce the data, experimental design and brief overview of the experimental protocol before walking users through the downstream analysis which was used to generate the datasets available in `pRolocdata` and the manuscript Mulvey et al 2021 [@Mulvey:2021].
## Source code
All the core packages that we use for data analysis, namely MSnbase, pRolocdata, pRolocdata and pRolocGUI are all freely and openly available on R/Bioconductor and Github. Specific code and functions used that have been used throughout this workflow for analysis and generating figures can be found in the `r` directory of this repo.
# Temporal proteomics data analysis
## Summary
We assessed the global proteomic landscape of the THP-1 monocytic leukaemic cell line using a shotgun proteomics time course. Total (unfractionated) cell lysates were collected at 0, 2, 4, 6, 12, and 24 hour time-points following LPS stimulation and processed for quantitative, MS-based proteomics analysis. We reproducibly quantified 4,292 proteins across 3 biological replicate experiments, at all time-points. In total 311 proteins were found to be altered in abundance during the time course of LPS stimulation, with statistical significance (adjusted p-value < 0.01, log_2 fold-change > 0.6). The results are reported in full in Mulvey et al. 2021 [@Mulvey:2021].
## PSM level data processing
The 3 PSM level datasets were imported into R and missing values were carefully assessed.
We can load the PSM and protein level data from the `pRolocdata` package by using the data function.
```{r psmdata_tc}
## Unstimulated data
data("psms_lpsTimecourse_rep1_mulvey2021")
data("psms_lpsTimecourse_rep2_mulvey2021")
data("psms_lpsTimecourse_rep3_mulvey2021")
psms_lpsTimecourse_rep1_mulvey2021
dim(psms_lpsTimecourse_rep1_mulvey2021)
psms_lpsTimecourse_rep2_mulvey2021
dim(psms_lpsTimecourse_rep2_mulvey2021)
psms_lpsTimecourse_rep3_mulvey2021
dim(psms_lpsTimecourse_rep3_mulvey2021)
```
The summary above tells us that 34640, 49806 and 41853 PSMs were detected in replicates 1, 2 and 3, respectively.
In the below code chunk we put the replicates into a `MSnSetList` for ease of data processing.
```{r makealistofpsms}
psms <- MSnSetList(list(psms_lpsTimecourse_rep1_mulvey2021,
psms_lpsTimecourse_rep2_mulvey2021,
psms_lpsTimecourse_rep3_mulvey2021))
```
### Missing values assessent
It is important to examine any missing values, whether they are biological or technical. In MS proteomics experiments that use isobaric labelling, such as we have here, we find very few missing values and it is easy to examine them on case-by-case basis. In the below code chunks we indeed see that there are very few PSMs with missing values after filtering. There are 19 missing values in rep 1, 7 in rep2, and 16 in rep3.
As we have such few missing values we begin by assessing if we have any other PSMs corresponding to the same protein group available for quantitation. The below code chunk counts the number of PSMs available for a protein group, for those protein groups which have missing values.
```{r countpsms_quant}
## Display number of PSMs for each protein with a MV
xx <- lapply(psms, function(z) which(apply(exprs(z), 1, anyNA)))
(numpsms <- sapply(seq(psms), function(y)
{sapply(fData(psms[[y]])$Master.Protein.Accessions[xx[[y]]],
function(z) length(which(fData(psms[[y]])$Master.Protein.Accessions
== z)))}))
sapply(xx, length)
```
We find that there is only one singleton PSM (one protein group with one only PSM for quantitation) in rep 1 (P06307). The below code chunk plots the PSM quantitation profiles of every protein group that has a missing PSM. Red triangles highlight the missing value.
```{r exmainpsmsplot, eval=FALSE, results = 'hide'}
## plot reps
plotMV <- function(r) {
tochk <- unique(names(numpsms[[r]]))
for (i in 1:length(tochk)) {
mm <- exprs(psms[[r]])[which(fData(psms[[r]])$Master.Protein.Accessions == tochk[i]), , drop = FALSE]
matplot(t(mm), type = "l", xlab = "", ylab = "m/z", lwd = 2,
main = tochk[i], col = "grey", axes = FALSE)
axis(2)
axis(1, at = 1:ncol(mm), labels = colnames(mm), las = 2)
if (any(is.na(mm))) {
ycoord <- sapply(1:ncol(mm), function(z) which(is.na(mm[, z])))
.l <- sapply(ycoord, length)
.xi <- which(.l > 0)
xcoord <- unlist(lapply(1:length(.xi), function(z) rep(.xi[z], .l[.xi[z]])))
ycoord <- unlist(ycoord)
points(x = xcoord, y = ycoord, bg = "red", col = "black", pch = 24, cex = 1.5)}
}
}
par(mfrow = c(4, 4))
plotMV(r = 1)
par(mfrow = c(2, 4))
plotMV(r = 2)
par(mfrow = c(3, 4))
plotMV(r = 3)
```
```{r mvplot1, echo=FALSE, fig.cap="", fig.show='hold',fig.align='center'}
knitr::include_graphics("figures/shotgun_rep1_mv.png")
```
```{r mvplot2, echo=FALSE, fig.cap="", fig.show='hold',fig.align='center'}
knitr::include_graphics("figures/shotgun_rep2_mv.png")
```
```{r mvplot3, echo=FALSE, fig.cap="PSM quantitation profile plots are shown for protein groups which contained a PSM with a missing value. The red triangles show the location of the missing values in the timecourse", fig.show='hold',fig.align='center'}
knitr::include_graphics("figures/shotgun_rep3_mv.png")
```
Looking at the above plots we see for the PSMs that are missing there are two general trends (1) they occur at the lower end/start of the time course, (2) missing PSMs tend to appear at the beginning of the time-course and show the same trend over time (a few exceptions listed below)
Given these two observations we decide for the majority of PSMs to use a left-censored imputation approach. It wouldn't be appropriate for a few cases and these are P06307, P04075 in rep 1, P113388-4, P12956 in rep 2 and so for these PSMs we will use k-nearest neighbour (k-NN), and thus we use a mixed imputation strategy. We remove the PSM at row 31435 for the protein group P12956 as it has 5 missing values out of 6, and there are many other PSMs available for protein quantitation.
Type `?impute` in your R console for more information on missing values imputation methods available in the `MSnbase` and `pRoloc` pipeline.
Imputation is done using the `impute` function in `MSnbase`
```{r imputepsms, message=FALSE, warning=FALSE, eval = TRUE}
## Split the data by replicate
psms_rep1 <- psms[[1]]
psms_rep2 <- psms[[2]]
psms_rep3 <- psms[[3]]
## define psms which are missing at random - see ?impute
fData(psms_rep1)$randna <- FALSE
fData(psms_rep1)$randna[11237] <- TRUE # P06307
fData(psms_rep1)$randna[26194] <- TRUE # P04075
# remove missing this PSM for protein P12956 as it has 5 missing values out of the 6 timepoints
# remove PSM 48162 for protein P113388-4 as 42 other PSMs available and it has no clear trend
psms_rep2 <- psms_rep2[-c(31435, 48162), ]
## replicate 1 imputation
psms_rep1 <- MSnbase::impute(psms_rep1, "mixed",
randna = fData(psms_rep1)$randna,
mar = "knn",
mnar = "MinDet")
## replicate 2 imputation
psms_rep2 <- MSnbase::impute(psms_rep2, "MinDet")
## replicate 3 imputation
psms_rep3 <- MSnbase::impute(psms_rep3, "MinDet")
## Re-create MSnSetList of the data
psms <- MSnSetList(list(rep1 = psms_rep1, rep2 = psms_rep2, rep3 = psms_rep3))
```
We can check these look sensible by re-plotting them again.
```{r replotchk, eval=FALSE, results = 'hide'}
## Display number of PSMs for each protein with a MV
numpsms <- sapply(seq(psms), function(y)
{sapply(fData(psms[[y]])$Master.Protein.Accessions[xx[[y]]],
function(z) length(which(fData(psms[[y]])$Master.Protein.Accessions
== z)))})
par(mfrow = c(4, 4))
plotMV(r = 1)
par(mfrow = c(2, 4))
plotMV(r = 2)
par(mfrow = c(3, 4))
plotMV(r = 3)
```
### Data normalisation
The data was visually assessed by examining pairwise scatter plots and `MAplots` of the data and then normalised using `vsn` followed by `diff.median`.
```{r normalise, message=FALSE, warning=FALSE, fig.show="hide", results = 'hide'}
## make all PSMs uppercase so that PSMs with different PTMs are not
## considered as different sequences
for (i in seq(psms))
fData(psms@x[[i]])$Annotated.Sequence <- toupper(fData(psms@x[[i]])$Annotated.Sequence)
## combine the data and normalise
cmb <- MSnbase::combine(psms[[1]], psms[[2]])
cmb <- MSnbase::combine(cmb, psms[[3]])
dim(cmb)
## Let's have a look at the intensities
boxplot(log(exprs(cmb)))
## Try a few different normalisation approaches
cmb.vsn <- normalise(cmb, "vsn")
cmb.median <- normalise(cmb, "diff.median")
cmb.vsn.median <- normalise(cmb.vsn, "diff.median")
cmb.quantiles <- normalise(cmb, "quantiles")
## Plot the data
par(mfrow = c(2, 3), las = 2, oma = c(6, 1, 1, 1))
boxplot(log(exprs(cmb)), main = "method = untransformed", ylab = "log intensity")
boxplot(log(exprs(cmb.median)), main = "method = diff.median", ylab = "log intensity")
boxplot(exprs(cmb.vsn), main = "method = vsn", ylab = "log intensity") # NOTE: VSN internally logs the data
boxplot(log(exprs(cmb.vsn.median)), main = "method = vsn + diff.median", ylab = "log intensity")
boxplot(log(exprs(cmb.quantiles)), main = "method = quantiles", ylab = "log intensity")
## We decide to go with 'vsn' followed by 'diff.median'
cmb <- cmb.vsn.median
```
### Aggregation to protein
We use the `combineFeatures` function from `MSnbase` to aggregate to protein level.
```{r aggtoprot}
psms1 <- filterNA(cmb[, 1:6])
psms2 <- filterNA(cmb[, 7:12])
psms3 <- filterNA(cmb[, 13:18])
psms <- MSnSetList(list(psms1, psms2, psms3))
## Aggregate to protein using median PSM per protein group
prots <- lapply(seq(psms), function(z)
combineFeatures(psms[[z]],
groupBy = fData(psms[[z]])$Master.Protein.Accession,
method = "median", verbose = FALSE))
## Do data fusion - create concatenated complete dataset
prots[[1]] <- updateFvarLabels(prots[[1]])
prots[[2]] <- updateFvarLabels(prots[[2]])
prots[[3]] <- updateFvarLabels(prots[[3]])
cmbprots <- MSnbase::combine(prots[[1]], prots[[2]])
cmbprots <- MSnbase::combine(cmbprots, prots[[3]])
cmbprots <- MSnbase::filterNA(cmbprots)
dim(cmbprots)
```
We identify 5221, 5812 and 5210 proteins in replicates 1, 2 and 3, respectively, and a total of 4292 common across all channels and timepoints in all samples.
## Protein level data processing
We load the individual protein level datasets from `pRolocdata` using the `data` function.
```{r loadprotdata}
data("lpsTimecourse_rep1_mulvey2021")
data("lpsTimecourse_rep2_mulvey2021")
data("lpsTimecourse_rep3_mulvey2021")
lpsTimecourse_rep1_mulvey2021
dim(lpsTimecourse_rep1_mulvey2021)
lpsTimecourse_rep2_mulvey2021
dim(lpsTimecourse_rep2_mulvey2021)
lpsTimecourse_rep3_mulvey2021
dim(lpsTimecourse_rep3_mulvey2021)
```
We see we have 5221, 5812 and 5210 proteins for replicates 1, 2 and 3, respectively.
### Data fusion
Data is concatenated into one matrix using the `combine` function in `MSnbase` and then proteins not common across all replicates are removed by using the `filterNA` function.
```{r concatenate}
cmb <- MSnbase::combine(lpsTimecourse_rep1_mulvey2021,
lpsTimecourse_rep2_mulvey2021)
cmb <- MSnbase::combine(cmb, lpsTimecourse_rep3_mulvey2021)
cmb <- filterNA(cmb)
dim(cmb)
```
We find 4292 proteins common across all timepoints in all replicates.
```{r addgn}
## Function to add Gene Names to the data for adding Gene Name labels when we plot the results
source("r/getGN.R")
cmb <- addGeneName(cmb, fcol = "Protein.Descriptions.rep1", col.name = "GN")
```
Note: the combined dataset is also available directly from `pRolocdata` type `data("lpsTimecourse_mulvey2021")`
We use the `venn.diagram` function from the `VennDiagram` package to draw venns of the data.
```{r venntc, eval=FALSE}
library("VennDiagram")
## lps timecourse
venn.diagram(
x = list(featureNames(lpsTimecourse_rep1_mulvey2021),
featureNames(lpsTimecourse_rep2_mulvey2021),
featureNames(lpsTimecourse_rep3_mulvey2021)),
category.names = c("Replicate 1" , "Replicate 2 " , "Replicate 3"),
filename = "figures/venn_TC_replicates.png",
main = "LPS timecourse: 0-24h",
col=c("#440154ff", '#21908dff', '#D6BA0D'),
fill = c(alpha("#440154ff",0.3), alpha('#21908dff',0.3), alpha('#D6BA0D',0.3)),
fontfamily = "sans",
cat.fontfamily = "sans",
main.fontfamily = "sans",
cat.col = c("#440154ff", '#21908dff', '#D6BA0D'),
cat.cex = 1.2,
main.cex = 1.6,
cex = 1.7,
cat.fontface = 2,
cat.dist = c(0.07, 0.07, 0.03),
sub = "",
output = FALSE
)
```
```{r venntcplot, echo=FALSE, fig.cap="Proteins common across LPS timecourse replicates", fig.show='hold',fig.align='center', out.height = '320px'}
knitr::include_graphics("figures/venn_TC_replicates.jpg")
```
## Protein expression analysis
### Batch effects
Before we can have a look at changes in protein expression we first need to check if we see any batch effects. We plot the data by 'Tag' and 'Replicate' to see if this may be the case.
```{r checkbatch, fig.height=4, fig.cap="PCA plots showing of the effect of TMT tag and replicate on the intensities"}
par(mfrow = c(1, 2))
plot2D(t(cmb), fcol = "Tag", main = "PCA: TMT tag")
plot2D(t(cmb), fcol = "Replicate", main = "PCA: Replicates")
```
We see clearly from the PCA that the samples are confounded by replicate. We can correct for this by using `limma`. The `limma` package in Bioconductor provides a number of linear models that can accommodate complex experimental designs and account for experimental technicalities such as batch effects and confounding factors. To statistically analyse relative abundance we use `limma`'s empirical Bayes moderated t-test, a shrinkage method appropriate for small sample sizes (see Smyth's paper: [http://www.ncbi.nlm.nih.gov/pubmed/16646809](http://www.ncbi.nlm.nih.gov/pubmed/16646809)). A paired t-test is valid as experiments within each replicate use the same lysate.
### Differential protein expression analysis
We use the classic `eBayes` method in the `limma` package using `lmFit` and add `Replicate` to the model matrix so we can account for this confounding factor
i.e. `design <- model.matrix(~Replicate+Time)`.
`limma` takes care of these potential batch effects.
In the next sections we perform differential expression analysis using the `limma` package [@Ritchie2015]. We first load some code for generating volcano plots.
```{r sourceVolcano}
source("r/ggVolcano.R")
```
#### 24 hour timepoint
The below results of the differential analysis at each time point are written to the `csv` directory of this repo, Supplementary Table 1 of Mulvey et al 2021 and are also stored in the `fData` of the `lpsTimecourse_mulvey2021` dataset found in `pRolocdata` for easy access.
```{r dolimmat24, message=FALSE, warning=FALSE, fig.cap="Histogram of raw p-values from differential analysis at 24 h post LPS"}
library("limma")
## 0hr and 24hr
i <- grep("X0.hr", sampleNames(cmb))
j <- grep("X24.hr", sampleNames(cmb))
eset <- cmb[, c(i, j)]
data <- exprs(eset)
rownames(data) <- fData(cmb)[["GN"]]
head(data)
## Create a model matrix
Time <- as.factor(pData(eset)$Time)
Replicate <- as.factor(pData(eset)$Replicate)
(design <- model.matrix(~Replicate+Time))
## Perform the fit
fit <- lmFit(data, design)
fit <- eBayes(fit)
## Get the results
res_t24 <- topTable(fit, coef = "Time24",
adjust.method="BH", number = Inf)
res_t24 <- addPlottingInfo(res_t24)
msnset_t24 <- addLimmaInfo(cmb, res_t24)
## Look at p-value distribution
hist(res_t24[, "P.Value"], breaks = 50,
main = "Histogram of p-values for 0/24hr LPS \n(limma's paired moderated t-test)",
xlab = "p-value")
## save data as a TSV/CSV
write.csv(res_t24, file = "csv/limma_t24.csv")
write.exprs(msnset_t24, fDataCols = 131:138, sep = "\t",
file = "csv/timecourse_t24.tsv")
```
It's always a good idea to check the distribution of your raw uncorrected p-values. The easiest way to do this is to make a histogram of your p-values as per the above code chunk. [This is a great post](http://varianceexplained.org/statistics/interpreting-pvalue-histogram/) on how to interpret your p-values. We find we have a nice anti-conservative distribution.
The object `res_24` is a `data.frame` containing the output from running `topTable` which is a function to extract a table of the top-ranked genes/proteins from the limma linear model fit.
We can now take this output and append it to our `MSnSet` and then plot the results. The classic way to present gene/protein expression data is a volcano plot.
The `ggVolcano` function plots the results on a volcano plot.
```{r plotVolcanot24, message=FALSE, warning=FALSE, fig.show="hide"}
library("ggplot2")
library("dplyr")
library("ggrepel")
ggVolcano(res_t24, "24 hours", N = 20, lfc = 0.6, p = 0.01,
addLegend = TRUE, text.size = 12, base.text.size = 12,
legend.text.size = 12, geom.point.size = 2,
label.text.size = 4)
```
```{r fig4vol, echo=FALSE, fig.cap="Volcano plot showing up/downregulated proteins at 24 hours post LPS", fig.show='hold',fig.align='center'}
knitr::include_graphics("figures/volcano_LPS_24h.png")
```
#### 12 hour timepoint
We repeat the analysis at each time point and look to see if we find any proteins that are significantly up- or downregulated.
```{r dolimmat12, message=FALSE, warning=FALSE, fig.cap="Histogram of raw p-values from differential analysis at 12 h post LPS"}
## 0hr and 12hr
i <- grep("X0.hr", sampleNames(cmb))
j <- grep("X12.hr", sampleNames(cmb))
eset <- cmb[, c(i, j)]
data <- exprs(eset)
rownames(data) <- fData(cmb)[["GN"]]
head(data)
## Create a model matrix
Time <- as.factor(pData(eset)$Time)
Replicate <- as.factor(pData(eset)$Replicate)
design <- model.matrix(~Replicate+Time)
## Perform the fit
fit <- lmFit(data, design)
fit <- eBayes(fit)
## Get the results
res_t12 <- topTable(fit, coef = "Time12",
adjust.method="BH", number = Inf)
res_t12 <- addPlottingInfo(res_t12)
msnset_t12 <- addLimmaInfo(cmb, res_t12)
## Look at p-value distribution
hist(res_t12[, "P.Value"], breaks = 50,
main = "Histogram of p-values for 0/12hr LPS \n(limma's paired moderated t-test)",
xlab = "p-value")
## save data as a TSV/CSV
write.csv(res_t12, file = "csv/limma_t12.csv")
write.exprs(msnset_t12, fDataCols = 131:138, sep = "\t",
file = "csv/timecourse_t12.tsv")
```
```{r plotvolcano12, eval=FALSE, warning=FALSE, message=FALSE}
## plot volcanos
ggVolcano(res_t12, "12 hours", N = 20, lfc = 0.6,
p = 0.01, addLegend = FALSE,
text.size = 12, base.text.size = 12,
legend.text.size = 12, geom.point.size = 2,
label.text.size = 4)
```
```{r fig8vol, echo=FALSE, fig.cap="Volcano plot showing up/downregulated proteins at 12 hours post LPS", fig.show='hold',fig.align='center'}
knitr::include_graphics("figures/volcano_LPS_12hr.png")
```
#### 6 hour timepoint
```{r dolimma_t6, warning=FALSE, message=FALSE, fig.cap="Histogram of raw p-values from differential analysis at 6 h post LPS"}
## 0hr and 6r
i <- grep("X0.hr", sampleNames(cmb))
j <- grep("X6.hr", sampleNames(cmb))
eset <- cmb[, c(i, j)]
data <- exprs(eset)
rownames(data) <- fData(cmb)[["GN"]]
head(data)
## Create a model matrix
Time <- as.factor(pData(eset)$Time)
Replicate <- as.factor(pData(eset)$Replicate)
design <- model.matrix(~Replicate+Time)
## Perform the fit
fit <- lmFit(data, design)
fit <- eBayes(fit)
## Get the results
res_t6 <- topTable(fit, coef = "Time6",
adjust.method="BH", number = Inf)
res_t6 <- addPlottingInfo(res_t6)
msnset_t6 <- addLimmaInfo(cmb, res_t6)
## Look at p-value distribution
hist(res_t6[, "P.Value"], breaks = 50,
main = "Histogram of p-values for 0/6hr LPS \n(limma's paired moderated t-test)",
xlab = "p-value")
## save data as a TSV/CSV
write.csv(res_t6, file = "csv/limma_t6.csv")
write.exprs(msnset_t6, fDataCols = 131:138, sep = "\t",
file = "csv/timecourse_t6.tsv")
```
```{r volcano6, eval=FALSE, warning=FALSE, message=FALSE}
## plot volcanos
ggVolcano(res_t6, "6 hours", N = 20, lfc = 0.6,
p = 0.01, addLegend = FALSE,
text.size = 12, base.text.size = 12,
legend.text.size = 12, geom.point.size = 2,
label.text.size = 4)
```
```{r fig10vol, echo=FALSE, fig.cap="Volcano plot showing up/downregulated proteins at 6 hours post LPS", fig.show='hold',fig.align='center'}
knitr::include_graphics("figures/volcano_LPS_6hr.png")
```
#### 4 hour timepoint
```{r dolimma_t4, warning=FALSE, message=FALSE, fig.cap="Histogram of raw p-values from differential analysis at 4 h post LPS"}
## 0hr and 4hr
i <- grep("X0.hr", sampleNames(cmb))
j <- grep("X4.hr", sampleNames(cmb))
eset <- cmb[, c(i, j)]
data <- exprs(eset)
rownames(data) <- fData(cmb)[["GN"]]
head(data)
## Create a model matrix
Time <- as.factor(pData(eset)$Time)
Replicate <- as.factor(pData(eset)$Replicate)
design <- model.matrix(~Replicate+Time)
## Perform the fit
fit <- lmFit(data, design)
fit <- eBayes(fit)
## Get the results
res_t4 <- topTable(fit, coef = "Time4",
adjust.method="BH", number = Inf)
res_t4 <- addPlottingInfo(res_t4)
msnset_t4 <- addLimmaInfo(cmb, res_t4)
## Look at p-value distribution
hist(res_t4[, "P.Value"], breaks = 50,
main = "Histogram of p-values for 0/4hr LPS \n(limma's paired moderated t-test)",
xlab = "p-value")
## save data as a TSV/CSV
write.csv(res_t4, file = "csv/limma_t4.csv")
write.exprs(msnset_t4, fDataCols = c(131:138), sep = "\t",
file = "csv/timecourse_t4.tsv")
```
```{r plotvolcano4, eval=FALSE}
## plot volcanos
ggVolcano(res_t4, "4 hours", N = 20, lfc = 0.6,
p = 0.01, addLegend = FALSE,
text.size = 12, base.text.size = 12,
legend.text.size = 12, geom.point.size = 2,
label.text.size = 4)
```
```{r fig12vol, echo=FALSE, fig.cap="Volcano plot showing up/downregulated proteins at 4 hours post LPS", fig.show='hold',fig.align='center'}
knitr::include_graphics("figures/volcano_LPS_4hr.png")
```
#### 2 hour timepoint
```{r dolimma_t2, warning=FALSE, message=FALSE, fig.cap="Histogram of raw p-values from differential analysis at 2 h post LPS"}
## 0hr and 2hr
i <- grep("X0.hr", sampleNames(cmb))
j <- grep("X2.hr", sampleNames(cmb))
eset <- cmb[, c(i, j)]
data <- exprs(eset)
rownames(data) <- fData(cmb)[["GN"]]
head(data)
## Create a model matrix
Time <- as.factor(pData(eset)$Time)
Replicate <- as.factor(pData(eset)$Replicate)
design <- model.matrix(~Replicate+Time)
## Perform the fit
fit <- lmFit(data, design)
fit <- eBayes(fit)
## Get the results
res_t2 <- topTable(fit, coef = "Time2",
adjust.method="BH", number = Inf)
res_t2 <- addPlottingInfo(res_t2)
msnset_t2 <- addLimmaInfo(cmb, res_t2)
## Look at p-value distribution
hist(res_t2[, "P.Value"], breaks = 50,
main = "Histogram of p-values for 0/2hr LPS \n(limma's paired moderated t-test)",
xlab = "p-value")
## save data as a TSV/CSV
write.csv(res_t2, file = "csv/limma_t2.csv")
write.exprs(msnset_t4, fDataCols = c(131:138), sep = "\t",
file = "csv/timecourse_t2.tsv")
```
```{r plotvolcano2, eval=FALSE}
## plot volcanos
ggVolcano(res_t2, "2 hours", N = 20, lfc = 0.6,
p = 0.05, addLegend = FALSE,
text.size = 12, base.text.size = 12,
legend.text.size = 12, geom.point.size = 2,
label.text.size = 4)
```
```{r fig14vol, echo=FALSE, fig.cap="Volcano plot showing up/downregulated proteins at 2 hours post LPS", fig.show='hold',fig.align='center'}
knitr::include_graphics("figures/volcano_LPS_2hr.png")
```
### Summary
A total of 311 proteins were found to be altered in abundance during the time course of LPS stimulation, with statistical significance (adjusted p-value < 0.01 and log2FC > 0.6) (Supplementary Table 1 of [@Mulvey:2021]). We do not discuss the functional effects of these changes in detail in this workflow and instead we refer readers to the Mulvey et al 2021.
## Bayesian temporal clustering
Bayesian correlated clustering was conducted on the shotgun proteomic time-course series to capture clusters of co-regulated proteins and temporal patterns of protein expression in response to LPS. Replicate experiments were combined using multiple dataset integration MDI [@Kirk2012] where the number of clusters were automatically inferred.
Bayesian inference was performed using Markov-chain Monte Carlo (MCMC) as implemented in the MDI-GPU software [@Kirk2012]. Clusters were extracted only for proteins that were consistently allocated to the same clusterings (sampled from the posterior distribution) across replicates (the results can be found in Supplementary Table 3 of Mulvey et al. and also in the csv directory of this Git repo).
In the below code chunk we load the results from temporal clustering and plot 12 clusters of interest as discussed in Mulvey et al. (Figure 2 of [@Mulvey:2021])
```{r mdiclusteringup, message=FALSE, warning=FALSE, fig.height=12, fig.cap="Temporal abundance profiles 6 clusters output from the MDI clustering analysis that showed a pattern of upregulation over time."}
library("RColorBrewer")
source("r/plotRibbons.R")
## Read in MDI clustering results
mdi_rep1 <- read.csv("csv/mdi_clustering_lpsrep1.csv")
mdi_rep2 <- read.csv("csv/mdi_clustering_lpsrep2.csv")
mdi_rep3 <- read.csv("csv/mdi_clustering_lpsrep3.csv")
mdi <- read.csv("csv/mdi_clusters_output.csv")
# get median normalised intensity
dat_mat <- list(data.matrix(mdi_rep1[, -1]),
data.matrix(mdi_rep2[, -1]),
data.matrix(mdi_rep3[, -1]))
av_mat <- apply(simplify2array(dat_mat), c(1,2), median)
rownames(av_mat) <- mdi_rep1[,1]
colnames(av_mat) <- c(0, 2, 4, 6, 12, 24)
head(av_mat)
## plot clusters of interest
ids <- c(1:6, 9:11, 13, 14, 17)
mdi_profs <- sapply(ids, function(z)
av_mat[mdi$Protein.Accession[which(mdi$Cluster.Number == z)], ])
names(mdi_profs) <- paste0("cluster", ids)
## pick a good palette
ribbonCol <- c(brewer.pal(n = 9, "Oranges")[-c(1:3)],
brewer.pal(n = 9, "GnBu")[-c(1:3)])
par(mfrow = c(3,2), mar = c(6, 5, 3, 3))
for (i in 1:6) {
ribbonPlot(mdi_profs[[i]], col = ribbonCol[i], c(0.05, .95),
main = paste("Cluster", ids[i]), cex.main = 2, size = 2)
}
```
```{r mdiclusteringdown, message=FALSE, warning=FALSE, fig.height=12, fig.cap="Temporal abundance profiles 6 clusters output from the MDI clustering analysis that showed a pattern of downregulation over time."}
par(mfrow = c(3,2), mar = c(6, 5, 3, 3))
for (i in 7:12) {
ribbonPlot(mdi_profs[[i]], col = ribbonCol[i], c(0.05, .95),
main = paste("Cluster", ids[i]), cex.main = 2, size = 2)
}
```
## Gene Ontology enrichment
Temporal clusters were functionally annotated by using the Gene Ontology (GO) [@Ashburner:2000]. For each cluster output from the Bayesian temporal clustering, in turn, an enrichment test for biological process (BP), cellular component (CC) and molecular function (MF) annotations, was carried out using the Database for Annotation, Visualisation and Integrated Discovery (DAVID v6.8; https://david.ncifcrf.gov/) where a cutoff of < 0.05 was applied according to the Benjamini-Hochberg procedure [@Benjamini1995]. The GO annotation enrichment results are found in Supplementary Table 4 of Mulvey et al 2021.
In the below code chunk some of the results from the GO enrichment (Figure 3 of Mulvey et al. 2021). GOCC, GOMF and GOBP annotation term enrichment for the the clusters are shown below where the -log10(adjusted p-value) is shown along the x-axis and most highly enriched term along the y-axis. The numbers within each barplot refer to the number of proteins associated with the term in each cluster.
```{r addGoAnnotationForMDIbp, fig.width=10, fig.cap="Gene Ontology molecular function enrichment results for 12 clusters of interest output from the MDI analysis. The numbers within each barchart refer to the number of proteins associated with the term in that cluster."}
## use custom code for bar plot figures
## https://github.com/CambridgeCentreForProteomics/thp-lopit-2021/blob/main/R/GObarplot.R
source("r/GObarplot.R")
GObarplots(filename = "csv/david_data_gobp.csv", cols = ribbonCol, N = 30) +
ggtitle("GO Biological Process Terms")
```
```{r addGoAnnotationForMDIcc, fig.width=10, fig.cap="Gene Ontology cellular component enrichment results for 12 clusters of interest output from the MDI analysis. The numbers within each barchart refer to the number of proteins associated with the term in that cluster."}
GObarplots(filename = "csv/david_data_gocc.csv", cols = ribbonCol) +
ggtitle("GO Cellular Component Terms")
```
```{r addGoAnnotationForMDImf, fig.width=10, fig.cap="Gene Ontology molecular function enrichment results for 12 clusters of interest output from the MDI analysis. The numbers within each barchart refer to the number of proteins associated with the term in that cluster."}
GObarplots(filename = "csv/david_data_gomf.csv", cols = ribbonCol, N = 15) +
ggtitle("GO Molecular Function Terms")
```
The timecourse of LPS provides an insight into the protein expression of THP-1 proteome during the initial phases of a pro-inflammatory response. In order the gain a complete picture of protein relocalisation events we performed a series of hyperLOPIT experiments and 0 and 12 hour LPS time-points to give a deeper spaial understanding of the dynamic changes that occur in the proteome following stimulation.
# Spatial proteomics data analysis
## The hyperLOPIT workflow
Using the hyperLOPIT proteomics platform [@Christoforou:2016 @Mulvey:2017] we obtained spatial maps that capture the steady-state distribution of thousands of proteins in the THP-1 human leukaemia cell line. Experiments were conducted in triplicate and the samples were analysed and collected when cells were (1) unstimulated and then (2) at 12 hours following stimulation with LPS.
The hyperLOPIT method combines biochemical cell fractionation with multiplexed high-resolution mass-spectrometry [@Christoforou:2016 @Mulvey:2017], the full protocol and experimental design can be found in [@Mulvey:2021]. Briefly, TMT labelling was conducted as described in [@Mulvey:2017] and 20 fractions including the cytosol-enriched fraction were labelled per gradient, by combining two TMT 10-plex experiments in an interleaved labelling design to capture as much subcellular diversity as possible. The experimental design for each dataset is stored in the `pData` slot of each dataset.
To access the full experimental design use the `pData` function
```{r showpdata}
## TMT experiment data for the unstimulated hyperLOPIT data
pData(thpLOPIT_unstimulated_mulvey2021)
## TMT labelling scheme for the LPS hyperLOPIT data
pData(thpLOPIT_lps_mulvey2021)
```
As described in [@Mulvey:2017 @Breckels:2016b @Christoforou:2016] the data was processed with Proteome Discoverer v2.1 (Thermo Fisher Scientific) and Mascot server v2.3.02 (Matrix Science) using the SwissProt sequence database for Homo sapiens (www.uniprot.org in November 2016) and the cRAP database (common Repository for Adventitious Proteins, https://www.thegpm.org/crap). Standard filtering of the raw data was conducted as per [@Mulvey:2017] to remove contaminants and low abundant PSMs. A maximum of two missing values per TMT experiment was allowed and the data was directly exported from Proteome Discoverer at the PSM level for downstream analysis in R.
The experiments were done in triplicate for each condition and each replicate contains 2 x TMT 10-plex sets (which we denote as "set 1" and "set 2"). Thus, in total we conducted a total of 12 experiments, 6 per condition wherein we had 2 sets per replicate containing a total of 20 fractions/channels per replicate. One TMT channel was removed (TMT tag 126 in replicate 2, set 2 in the unstimulated) due to erroneous labelling of insoluble material during the sample preparation for this specific tag. The “Average Reporter S/N” value was recalculated for the nine remaining channels in the corresponding 10plex set and PSMs with a value less than 9.0 were discarded (please see Data Processing in the Supplementary Information of [@Mulvey:2021]).
## PSM level data processing
### Missing values assessment
The 12 PSM level datasets were imported into R and missing values were carefully assessed.
Load the PSM level unstimulated data,
```{r psmdata}
## Unstimulated data
data("psms_thpLOPIT_unstim_rep1_set1")
data("psms_thpLOPIT_unstim_rep1_set2")
data("psms_thpLOPIT_unstim_rep2_set1")
data("psms_thpLOPIT_unstim_rep2_set2")
data("psms_thpLOPIT_unstim_rep3_set1")
data("psms_thpLOPIT_unstim_rep3_set2")
```
Load the PSM level 12h-LPS stimulated data,
```{r psmstimulated}
## 12h post LPS-stimulated
data("psms_thpLOPIT_lps_rep1_set1")
data("psms_thpLOPIT_lps_rep1_set2")
data("psms_thpLOPIT_lps_rep2_set1")
data("psms_thpLOPIT_lps_rep2_set2")
data("psms_thpLOPIT_lps_rep3_set1")
data("psms_thpLOPIT_lps_rep3_set2")
```
For ease of coding we create a list of `MSnSets` for each condition
```{r list}
psms <- MSnSetList(
list(
psms_thpLOPIT_unstim_rep1_set1,
psms_thpLOPIT_unstim_rep1_set2,
psms_thpLOPIT_unstim_rep2_set1,
psms_thpLOPIT_unstim_rep2_set2,
psms_thpLOPIT_unstim_rep3_set1,
psms_thpLOPIT_unstim_rep3_set2,
psms_thpLOPIT_lps_rep1_set1,
psms_thpLOPIT_lps_rep1_set2,
psms_thpLOPIT_lps_rep2_set1,
psms_thpLOPIT_lps_rep2_set2,
psms_thpLOPIT_lps_rep3_set1,
psms_thpLOPIT_lps_rep3_set2
)
)
## add list names for each dataset
.nams <- paste0("Rep", rep(1:3, each = 2), "_set", rep(1:2, 3))
names(psms) <- c(paste0(.nams, "_unstim"), paste0(.nams, "_lps"))
```
We can view the number of PSMs per experiment,
```{r count_psms}
sapply(psms, nrow)
```
If we examine the corresponding protein group of each PSM with a missing value, we find that there are other PSMs available for quantitation for the majority of proteins. There are still several hundred cases where we only have 1 PSM for a given protein group, so we would lose several hundred proteins per replicate if we were to just remove these PSMs. We assess the missing values across all experiments using the `naPlot` function to see if there is a trend in where missing values occur.
Generate `naPlots`,
```{r naplots, eval=FALSE}
for (i in seq(psms)) {
par(las = 2, oma = c(10, 1, 1, 1))
naplot(psms[[i]], col = "black", las = 2, reorderColumns = FALSE,
main = names(psms)[i], cex.axis = 2)
}
```
```{r fignaplotun, echo=FALSE, fig.show='hold',fig.align='center'}
knitr::include_graphics("figures/naplot_unstim.png")
```
```{r fignaplotlps, echo=FALSE, fig.cap="Heatmaps showing missing value distributions per set and replciate", fig.show='hold',fig.align='center'}
knitr::include_graphics("figures/naplot_lps.png")
```
The above `naPlots` show that missing values tend to accumulate at the end of the gradient, and more specifically in the first few fractions of the gradient of each experiment, which classically reflect the gradient distribution e.g. PSMs that are often highly mitochondrial have a huge signal in the heavy channels but little elsewhere in the gradient and thus can result in missing values in the other channels. It is also common to find biologically relevant missing values such as those resulting from the absence of the low abundance of ions. As values do not appear to be missing at random we use a left-censored deterministic minimal value approach, `MinDet` in `MSnbase`.
```{r mindet, warning=FALSE, message=FALSE}
psms.imputed <- lapply(psms, function(z) MSnbase::impute(z, method = "MinDet"))
```
PSMs were quality controlled post-imputation and then combined to protein level by calculating the median of all PSM intensities corresponding to the leading Uniprot Accession number for each protein group.
### Data normalisation
Following the standard [pRoloc workflow](https://f1000research.com/articles/5-2926) [@Breckels:2016b] PSMs are scaled into the same intensity interval by dividing each intensity by the `sum` of the intensities for that quantitative feature. This transformation of the data assures cancellation of the effect of the absolute intensities of the quantitative features along the rows, and focuses subsequent analyses on the relative profiles along the sub-cellular channels. This is important for spatial proteomic experiments are proteins that co-localise in a cell are known to exhibit similar quantitative profiles across the gradient fractions employed [@Deduve:1981].
```{r norm}
psms.imputed.norm <- lapply(psms.imputed, function(z) normalise(z, "sum"))
```
### Aggregation to protein
We use all available PSMs for quantification and aggregate PSMs to proteins using the `combineFeatures` function in `MSnbase` by calculating the median of all PSM intensities corresponding to the leading Uniprot Accession number for each protein group.
```{r aggregate}
prots <- lapply(psms.imputed.norm, function(z)
combineFeatures(z, method = "median",
groupBy = fData(z)[, "Master.Protein.Accessions"]))
## become combining all protein sets we tidy up the feature labels of the dataset
ll <- paste0(rep(c("unst", "lps"), 1, each = 6), ".r",
rep(1:3, 1, each = 2), ".s", 1:2)
for (i in seq(prots)) {
prots@x[[i]] <- updateFvarLabels(prots[[i]], label = ll[i], sep = "_")
}
## combine TMT sets and examine each replicate for each condition
unst_r1 <- filterNA(MSnbase::combine(prots[[1]], prots[[2]]))
unst_r2 <- filterNA(MSnbase::combine(prots[[3]], prots[[4]]))
unst_r3 <- filterNA(MSnbase::combine(prots[[5]], prots[[6]]))
lps_r1 <- filterNA(MSnbase::combine(prots[[7]], prots[[8]]))
lps_r2 <- filterNA(MSnbase::combine(prots[[9]], prots[[10]]))
lps_r3 <- filterNA(MSnbase::combine(prots[[11]], prots[[12]]))
tot_prots <- data.frame("Unstimulated" = c(`Replicate 1` = nrow(unst_r1),
`Replicate 2` = nrow(unst_r2),
`Replicate 3`= nrow(unst_r3)),
"LPS" = c(`Replicate 1` = nrow(lps_r1),
`Replicate 2` = nrow(lps_r2),
`Replicate 3` = nrow(lps_r3)))
```
```{r kableoutput, warning=FALSE, message=FALSE}
library("knitr")
library("kableExtra")
kable(tot_prots, caption = "Number of quantified proteins per replicate per condition")
```
## Protein level data processing
As shown in the above section we quantify between ~4800-5800 proteins per replicate. The below Venn diagrams show the overlap between replicates within conditions. We find 3882 proteins common in the unstimulated hyperLOPIT experiment and 4067 in the 12h-LPS stimulated hyperLOPIT experiment.
We use the `VennDiagram` package.
```{r venns, warning=FALSE, message=FALSE, eval=FALSE}
## Load R packages required for generating Venn diagrams
library("VennDiagram")
library("scales")
library("ggplot2")
## HL unstimulated
venn.diagram(
x = list(featureNames(unst_r1),
featureNames(unst_r2),
featureNames(unst_r3)),
category.names = c("Replicate 1" , "Replicate 2 " , "Replicate 3"),
filename = "figures/venn_HL_unst_replicates.png",
main = "hyperLOPIT: Untreated THP-1 cells",
col=c("#440154ff", '#21908dff', '#D6BA0D'),
fill = c(alpha("#440154ff",0.3), alpha('#21908dff',0.3), alpha('#D6BA0D',0.3)),
fontfamily = "sans",
cat.fontfamily = "sans",
main.fontfamily = "sans",
cat.col = c("#440154ff", '#21908dff', '#D6BA0D'),
cat.cex = 1.2,
main.cex = 1.6,
cex = 1.7,
cat.fontface = 2,
cat.dist = c(0.07, 0.07, 0.03),
sub = "",
output = FALSE
)
# HL LPS
venn.diagram(
x = list(featureNames(lps_r1),
featureNames(lps_r2),
featureNames(lps_r3)),
category.names = c("Replicate 1" , "Replicate 2 " , "Replicate 3"),
filename = "figures/venn_HL_lps_replicates.png",
main = "hyperLOPIT: 12h-LPS stimulated THP-1 cells",
col=c("#440154ff", '#21908dff', '#D6BA0D'),
fill = c(alpha("#440154ff",0.3), alpha('#21908dff',0.3), alpha('#D6BA0D',0.3)),
fontfamily = "sans",
cat.fontfamily = "sans",
main.fontfamily = "sans",
cat.col = c("#440154ff", '#21908dff', '#D6BA0D'),
cat.cex = 1.2,
main.cex = 1.6,
cex = 1.7,
cat.fontface = 2,
cat.dist = c(0.07, 0.07, 0.03),
sub = "",
output = FALSE
)
```
```{r venn1, echo=FALSE, fig.cap="Overlap of protein groups identified between replicated experiments", out.height = '320px', fig.show='hold',fig.align='center'}
knitr::include_graphics("figures/venn_hyperlopit.jpg")
```
## Dataset concatenation
We use the `combine` function in `MSnbase` to perform data fusion (concatenation) within each condition to maximise subcellular resolution and thus the reliability in protein classification [as per the pRoloc pipeline [@Christoforou:2016 @Breckels:2016b @Mulvey:2017]. Trotter et al. [@Trotter:2010] have shown a significant improvement in classifier accuracy when combining replicated experiments in spatial proteomics. Specifically, we concatenate all gradient fractions across replicated experiments to give better discrimination between subcellular niches (as shown in [@Christoforou:2016]) which gives users the opportunity to resolve niches that may not have been discovered when examining replicates alone.
We note (as explained in [@Breckels:2016b, @Mulvey:2017]) that direct comparisons of individual TMT channels in replicated experiments do not provide an adequate, goal-driven assessment of different experiments due to the nature of the gradient fraction collection, where quantitative channels do not correspond to identical selected fractions along the gradient.
Concatenate replicates for each condition,
```{r combineall}
## combine replicates for each condition