-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
666 lines (465 loc) · 20.5 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
ggplot2::theme_set(
ggplot2::theme_minimal()
)
```
# mspms
<!-- badges: start -->
[![R-CMD-check](https://github.com/baynec2/mspms/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/baynec2/mspms/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/baynec2/mspms/branch/main/graph/badge.svg)](https://app.codecov.io/gh/baynec2/mspms?branch=main)
[![BioC status](http://www.bioconductor.org/shields/build/release/bioc/mspms.svg)](https://bioconductor.org/checkResults/release/bioc-LATEST/mspms)
<!-- badges: end -->
The goal of mspms is provide a concise code-base for the normalization and
data processing required to analyze data from the
[Multiplex Substrate Profiling by Mass Spectrometry (MSP-MS) method](https://pubmed.ncbi.nlm.nih.gov/36948708/).
Additionally, we provide a
[graphical user interface powered by shiny apps](https://gonzalezlab.shinyapps.io/mspms_shiny/)
that allows for a user to utilize the method without requiring any R coding
knowledge.
## Installation
You can install the development version of mspms from github.
```{r,eval = FALSE}
devtools::install_github("baynec2/mspms",
ref = "develop")
```
## Quickstart
To generate a general report using your own data, run the following code.
It requires data that has been prepared for mspms data analysis by a
converter function, and a design matrix. For more information see subsequent
sections.
```{r,eval = FALSE}
mspms::generate_report(
prepared_data = mspms::peaks_prepared_data,
design_matrix = mspms::design_matrix,
outdir = "../Desktop/mspms_report"
)
```
The above command will generate a .html file containing a general mspms analysis.
There is much more that can be done using the mspms package- see the following
sections for more information.
## Overview
There are 4 different types of functions in this package.
1. Pre-processing data. These functions are focused on making mspms
more generally useful to a wider audience outside of the very specific workflow
traditionally used in the O’Donoghue lab. Allows for the use of other types of
upstream proteomic data processing, different peptide libraries, etc.
2. Data processing/ normalization. These functions allow the user to normalize
and process the MSP-MS data.
3. Statistics. These methods allow the user to perform basic statistics on the
normalized/processed data.
4. Data visualization. These functions allow the user to visualize the data.
**Preprocessing data**.
1. *prepare_peaks()*:Takes two exported files from PEAKS software,
combines them, and prepares for processing by the mspms R package.
2. *prepare_pd()*: prepares exported files from proteome discoverer.
3. *calculate_all_cleavages()*: Calculates all possible cleavages for peptide
library sequences.
**Data Processing/ Normalization**.
1. *normalyze()*: Normalizes values.
2. *handle_outliers()*: Looks for outliers across replicates. Removes them.
3. *impute()*: Imputes data for missing values (not including NAs introduced
by handle_outliers()).
4. *join_with_library()*: Joins the normalized data with the peptide library
sequences.
5. *add_cleavages()*: Figures out the locations of the detected cleavages within
the library of peptide sequences used. Is it cleaved at the N or C terminus,
or both? Uses *cterm_cleavage()* and *nterm_cleavage()* to do this.
6. *polish()*: combines the cleavage info into one column. Discards peptides
that were cleaved on both sides or not at all.
7. *prepare_for_stats()*: Reshapes the data into a long format and appends the
data in the design matrix to make statistical testing easy to perform.
8 *mspms():* combines all of the above functions into one convenient function.
**Calculations/ Statistics**.
1. *mspms_log2fc()*: Calculates the log2 fold change across experimental
conditions relative to T0.
2.*mspms_t_tests()*: Conducts T-tests to determine if there are any peptides
that are significantly different across experimental conditions relative to time
0.
3. *log2fc_t_test()*: Combines results from mspms_log2fc() and mspms_t_tests()
into one convenient tibble.
4. *log2fct_condition()*: Calculates the log2 fold change and t tests for the
specified condition at a user specified time.
5.*log2fct_time()*: Calculates the log2 fold change and t tests between the
specified times within a user specified time.
6. *mspms_anova()*: Conducts ANOVA tests to determine if there are any peptides
that are significantly different across all time points.
7. *count_cleavages_per_pos():* Counts the number of cleavages at each position
of the library. Useful for determining endoprotease vs exoprotease activity.
8. *count_cleavages_per_pos()*: Counts the number of cleavages at the position
of the initial peptide detected peptides were cleaved from.
**Icelogo calculations**
1. *calc_AA_count_of_motif()*: calculates the count of each amino acid at each
position in a vector of motifs.
2. *calc_AA_prop_of_motif()*: calculates the proportion of each amino acid at
each position from the output of calc_AA_count_of_motif().
3. *calc_AA_motif_zscore()*: calculates z scores for each amino acid at each
position in a vector of motifs.
4. *calc_sig_zscore()*: given a matrix of zscores, determines which are significant given a user defined p value.
5. *calc_AA_percent_difference()*: calculates the percentage difference of
6. *calc_AA_fold_change()*: calculates the fold change of a matrix containing
the proportions of amino acids at each position relative to a background matrix.
7. *prepare_sig_p_dif()*: prepares the data for plotting percent difference
icelogos. Removes AAs at positions that are not significantly different.
8. *prepare_pd()*:prepares a matrix with significant amino acid percent
differences by position for plotting.
9. *prepare_fc()*: prepares the data for plotting fold change icelogos.
10. *prepare_icelogo_data()*: uses all the functions described above to
prepare the data for plotting icelogos.
11. *extract_re()*: extracts regular expressions specifying amino acids above a
defined threshold cutoff for each position of a cleavage motif from an icelogo
matrix.
**Data Visualization**.
1. *plot_heatmap()*: Conducts hierarchical clustering analysis and plots an
interactive heatmap to visualize overall patterns in the data.
2. *plot_pca()*: PCA analysis to visualize the data in a 2D space.
3. *plot_time_course()*: Plot peptides over time by condition.
4. *plot_cleavages_per_pos()*: Plot the number of cleavages at each position
in the library.Useful for determining endoprotease vs exoprotease activity.
5.*plot_icelogo()*: Plot an icelogo given an icelogo matrix.
6.*plot_all_icelogos()*: generates icelogos representing enriched sequences
relative to time 0 for each conditions. This function compares cleavage
sequences corresponding to peptides that are significant
(p.adj < 0.05 and log2 fold change > 3) to a background of all possible
sequences present in the initial peptide library.
**Reports**.
1. *generate_report()*: Generates a simple analysis of the mspms data that
should be generically applicable to all mspms experiments.
## Data Pre-processing
### Formating Peaks File Outputs.
Analysis downstream of data exported from PEAKS software is supported.
exported files are generated in the O'donoghue lab as follows:
1. Create a New Project in PEAKS.
2. Select data to add to PEAKS project.
3. Add data, rename samples, set correct parameters.
• Highlight all the added data.
• Select Create new sample for each file.
• Select appropriate instrument and fragmentation method.
• Set Enzyme to “None”.
• Rename samples according to enzymes, time points, and replicates.
• Can also ensure appropriate data is selected for these
• Select “Data Refinement” to continue.
4. Data Refinement.
• On the top right, select MSP-MS as the predefined parameters.
• Parameters shown are what should be used for MSP-MS data analysis.
• Select “Identification” to continue.
5. Peptide Identification.
• On the top right, select MSP-MS as the predefined parameters.
• Parameters shown are what should be used for MSP-MS data analysis.
• Identification -> Select Database -> TPD_237.
• Identification -> no PTMs.
• Can remove the PTMs by highlighting them and selecting Remove
6. Label Free Quantification.
• Make sure Label Free is selected.
• Group samples -> add quadruplicates of samples to new group.
```{r}
library(dplyr)
library(mspms)
### Loading the files ###
lfq_filename <- "tests/testdata/protein-peptides-lfq.csv"
# file "protein-peptides.csv" exported from PEAKS identification
id_filename <- "tests/testdata/protein-peptides-id.csv"
# Prepare the data for normalyzer analysis
peaks_prepared_data <- prepare_peaks(lfq_filename, id_filename)
```
### Formating Proteome Discoverer File Outputs.
We can prepare proteome discover files for analysis with mspms as follows.
```{r}
prepared_proteome_discoverer <-
prepare_pd("tests/testdata/proteome_discoverer_output.xlsx")
```
### Calculating all cleavages
We might want to calculate all possible cleavages for the peptide library
sequences. This is useful for downstream analysis, especially when we are
looking at the specificity of the cleavage sites via plot_cleavage_motif() as
this requires a background of all possible cleavages
We can do this by specifying the number of amino acids after the cleavage site
that we are interested in. First lets try 4, which is the default.
```{r}
all_peptide_sequences <- mspms::calculate_all_cleavages(
mspms::peptide_library$library_real_sequence,
n_AA_after_cleavage = 4
)
head(all_peptide_sequences)
```
We could also try 5 AA after each cleavage site
```{r}
all_peptide_sequences <- mspms::calculate_all_cleavages(
mspms::peptide_library$library_real_sequence,
n_AA_after_cleavage = 5
)
head(all_peptide_sequences)
```
## Data Normalization/ Processing
### Loading design matrix
Before we normalyze data, we need to know what samples are in what groups. We can do that by defining a design matrix.
```{r}
design_matrix <- readr::read_csv("tests/testdata/design_matrix.csv")
head(design_matrix)
```
### Normalyzing data
mspms uses the NormalyzerDE package to do normalization under the hood.
We can normalyze the data as follows:
```{r}
normalyzed_data <- normalyze(peaks_prepared_data, design_matrix)
```
### Handling Outliers
Here we use a Dixon test from the outliers package to detect outliers from each
of our replicates. We need to know what samples are part of which groups, so we
need to specify the design matrix here too. Make sure that the column header
names are "sample" and "group" just like before.
```{r}
design_matrix <- readr::read_csv("tests/testdata/design_matrix.csv")
outliers <- handle_outliers(normalyzed_data, design_matrix)
```
### Imputation of data
We have a lot of missing, or 0 values. For these, we need to impute them so we
can do downstream statistics.
```{r}
imputed <- mspms::impute(outliers)
```
### Joining with Library
Next we need to join everything with the sequences of the peptide library.
```{r}
joined_with_library <- mspms::join_with_library(imputed)
```
### Calcuating cleavages.
Next, we need to determine where the peptide sequences are cleaved.
We check both the N and C terminus.
Sequences are presented as the user specified number of amino acids on
both sides of a cleavage. The default is 4, but there is interest in looking at
motifs that are farther away from the cut site. Note that X indicates that there
was nothing at that position because it is over the end.
```{r}
cleavage_added_data <- mspms::add_cleavages(joined_with_library, n_residues = 4)
head(cleavage_added_data)
```
### Adding metadata back
We need to add information about our samples back to perform downstream
statistics on the data. We accomplish this with the prepare for stats file.
```{r}
prepared_for_stats <- mspms::prepare_for_stats(cleavage_added_data,
design_matrix)
```
### Polishing
This function polishes the data by combining the cterm and nterm cleavage
information into one column while removing any rows that don't have any
cleavage information or have cleavage information on the cterm and nterm.
```{r}
polished_data <- mspms::polish(cleavage_added_data)
```
### Complete workflow
There are a lot of steps to get to the final normalized data we listed above.
We have combined all of these steps into one function, mspms() for convenience.
This offers a nice balance between allowing the user the option to use
individual steps and providing a quick way to get to the final data.
```{r}
mspms_data <- mspms::mspms(peaks_prepared_data, design_matrix)
```
## Statistics
mspms provides a number of convenience functions to conduct statistics
on the data.
First, we need to prepare the data for stats. This involves reshaping the data
into the long format so it is easy to conduct statistics on and then appending
the data in the design matrix that allows us to determine what sample contains
what conditions/time and conduct the appropriate statistics.
Now we can conduct the statistics:
### T tests
T tests are performed within each condition and compared to time 0.
For example, for an experiment where there are two conditions,
DMSO and MZB as well as 3 time points, 0, 1, and 2, the t tests would be
as follows:
* DMSO.TO vs DMSO.T0.
* DMSO.T0 vs DMSO.T1.
* DMSO.T0 vs DMSO.T2.
* MZB.T0 vs MZB.T0.
* MZB.T0 vs MZB.T1.
* MZB.T0 vs MZB.T2.
```{r}
# Perform T test
t_test_stats <- mspms::mspms_t_tests(mspms_data)
```
### log2FC
We can also calculate the log 2 fold change.
The comparisons here are the same as for the T tests.
```{r}
# calculate log2fc
log2fc <- mspms::mspms_log2fc(mspms_data)
head(log2fc)
```
### log2fc_t_tests.
Oftentimes (such as when you want to make volcano plots) it is most useful to
look at the log2fc and the t test statistics at the same time.
```{r}
log2fc_t_test <- mspms::log2fc_t_test(mspms_data)
head(log2fc_t_test)
```
### log2fct_condition
You may wish to compare two conditions at a specific time point.
This function allows this
```{r}
condition_t = mspms::log2fct_condition(mspms_data,
ref_condition = "DMSO",
comparison_condition = "MZB",
at_time = 60)
head(condition_t)
```
### log2fct_time
You may wish to compare two time points within a specific condition.
This function allows
```{r}
time_t = mspms::log2fct_time(mspms_data,
within_condition = "DMSO",
ref_time = 0,
comparison_time = 60)
head(time_t)
```
### ANOVA
We also might want to perform an anova. Here, we have it set up to show the
effect of time within each condition.
For example, for an experiment where there are two conditions, DMSO and MZB as
well as 3 time points, 0, 1, and 2, the anova would statistics for the effect of
time for each peptide within DMSO or MZB.
```{r}
# Doing ANOVA
anova_stats <- mspms::mspms_anova(mspms_data)
head(anova_stats)
```
## Common Data Visualizations
We also provide some functions that make common data visualizations easier.
### volcano plots
```{r}
volcano_plot <- mspms::plot_volcano(log2fc_t_test)
volcano_plot
```
### Plotting sequence specificity motif
Here use an approach similar to what is implemented in ICELOGO to visualize the
sequence specificity of the cleavage sites. This was used as reference to build
the code: https://iomics.ugent.be/icelogoserver/resources/manual.pdf.
Below we can see this applied to see what cleavage sequences are enriched among
the peptides significantly different between time 0 and time 60
within DMSO.
```{r}
cleavage_seqs <- time_t %>%
dplyr::filter(log2fc > 3, p.adj <= 0.05) %>%
dplyr::pull(cleavage_seq)
background_universe <- mspms::all_possible_8mers_from_228_library
mspms::plot_icelogo(cleavage_seqs, background_universe)
```
We could also look at the fold change instead of the percent difference.
```{r}
mspms::plot_icelogo(cleavage_seqs, background_universe, type = "fold_change")
```
We also provide a convenience function for looking at all icelogos relative to
time 0 within the experiment.
This function includes cleavages that were significant at any time point once in
the experimental set of the ICELOGO. Below we show this applied to create
percent difference and fold change icelogos.
```{r}
plot_all_icelogos(mspms_data)
plot_all_icelogos(mspms_data, type = "fold_change")
```
### Extracting regular expressions from icelogos
It might be useful for the user to extract the information contained in the
Icelogo and use it to filter their data for peptides that are predicted
to be well cleaved, or not cleaved according to the icelogo plot.
For example, we can get the amino acids that are enriched at each position of
the icelogo using the code below.
```{r}
# Getting data for Icelog
cleavage_seqs <- time_t %>%
dplyr::filter(log2fc > 3, p.adj <= 0.05) %>%
dplyr::pull(cleavage_seq)
background_universe <- mspms::all_possible_8mers_from_228_library
# Plotting icelogo
mspms::plot_icelogo(cleavage_seqs, background_universe)
# corresponding icelogo data to the plot
icelogo_matrix = mspms::prepare_icelogo_data(cleavage_seqs, background_universe)
# Extracting regular expressions
RE_pos = mspms::extract_re(icelogo_matrix)
RE_pos
```
We can also do this for amino acids that are negatively enriched.
```{r}
RE_neg = mspms::extract_re(icelogo_matrix,type = "negative")
RE_neg
```
Once we have these regular expressions, we can use it to filter our data as
follows:
```{r}
pos_enriched = mspms_data %>%
dplyr::filter(grepl(RE_pos, cleavage_seq))
```
In this case, none of the peptides in the original data set match all the rules
in our library (peptide library does not contain all possible combinations, so
none of the peptides was a match to the combination of rules at each position)
We could then filter to only include peptides that are greater than a certain
threshold percent difference. We can then visualize these results easily with
the plot_time_course function described further below. Here we will only
consider AAs with a percent difference greater than 10 at each position.
```{r}
# Extracting regular expressions
RE_pos = mspms::extract_re(icelogo_matrix,threshold = 10)
RE_pos
# Filtering data for peptides that have cleavage sequences matching regular
# expressions
f = mspms_data %>%
dplyr::filter(grepl(RE_pos, cleavage_seq))
#visualizing these peptides that match over time.
plot_time_course(f)
```
Here we see that these peptides were cleaved over time in DMSO, as would be
expected based on the icelogo plot.
### PCA.
We can generate a PCA plot to visualize the data.
Here, the colors show the different time points while the shape shows
the different conditions.
```{r}
mspms::plot_pca(mspms_data)
```
### Hierchical clustering
We can also generate an interactive heatmap with the hierarchical clustering
results.
The format of this readme does not allow for interactivity, so a static picture
of the output is shown instead.
```{r,eval = FALSE}
mspms::plot_heatmap(mspms_data)
```
![](README_files/plot_heatmap_output.png)
### Ploting time course.
We also provide a function for plotting the mean intensity and standard
deviation over time for each peptide in the data set by conditions.
Facets show peptide, color shows the condition.
This can be used in combination with the ANOVA function.
Below, we will use anova to calculate the peptides where we see an
effect of time (in the DMSO group), and then plot these.
```{r}
top10sig <- anova_stats %>%
dplyr::arrange(p.adj) %>%
dplyr::pull(Peptide) %>%
head(10)
p1 <- mspms_data %>%
dplyr::filter(Peptide %in% top10sig) %>%
mspms::plot_time_course()
p1
```
## Calculating the number of cleavages at each position in the library
We can also calculate the number of cleavages at each position of the library.
Here we will do this for peptides significantly different relative to time 0 for
DMSO and MZB.
```{r}
# Performing T tests
t_tests = log2fc_t_test(mspms_data) %>%
dplyr::filter(p.adj < 0.05,
log2fc > 3)
count = mspms::count_cleavages_per_pos(t_tests)
p1 = mspms:::plot_cleavages_per_pos(count)
p1
```