Multigroup analysis

Witold Wolski edited this page Aug 29, 2017 · 17 revisions

Abstract: The idea is to provide an easy-to-use API for processing and analyzing quantitative mass spectrometry data from different sources.

Aims

  • A single configuration script that specifies the imputation and normalization methods, the models used, and the visualizations.

  • Generate an output report, including figure captions, in HTML and PDF format. Place all figures, sensibly named, as high-resolution PNG and PDF files in folders.

  • Allow for variable column naming:

    • change column names depending on the input software and allow meaningful column names for factors.
  • Ensure that naming is consistent throughout the analysis.

  • Report p-values and adjusted p-values but, more importantly, confidence intervals.

  • Provide statistics of fold changes in case of multiple comparisons:

    • Barplots, correlation plots, Venn diagrams.
  • Support data on protein, peptide/precursor, and transition level (likely peptide/precursor level only).

    • Use filtering and aggregation on precursor level. Filtering options: correlation filtering, top N filtering. Aggregation options: sum, median, or mean of transitions.
  • Support algorithms working on the precursor level (only MSStats and MapDIA), the peptide level (msqrob), and the protein level (limma).

  • Support one to three explanatory variables, e.g. condition, gender, patient. This also means that the plots need to be annotated.

  • Support for specifying contrasts for one condition.

Functionality includes:

  • generate QC plots for the data:

    • Scatter plot within condition
    • CV for each condition (if N > 2) and overall.
    • Distribution plot for each file (violin plot or density):
      • densities within condition, before and after normalization
      • compare each condition's density with the common density, before and after normalization
    • clustering on non-normalized and normalized data with condition labels
      • specify which variables to encode and work out how to encode them (color scheme).
  • Data preprocessing:

    • data filtering
      • top X transitions, top X precursors
      • correlation
      • NA counting and removal, globally and per condition.
      • top N precursors, peptides per protein
      • min X precursors per protein
      • for DIA data: q-value filtering using two thresholds.
    • data aggregation (protein from precursors, and precursors from transitions)
      • sum, mean, median
    • data normalization
      • vsn
      • median
      • median and variance
      • quantile
    • missing value imputation (must be used if aggregation is enabled)

      • row mean and column mean imputation
      • linear regression
      • min 15% imputation
  • Data modeling:

    • ANOVA
    • linear models
    • mixed linear models
    • external packages such as limma and msqrob (eventually MSStats) applied to the data.
  • Visualization of modeling results.

    • Global for all proteins:
      • MA plot for each comparison (color-code p-values?)
      • volcano plot for each comparison (p-values, adjusted p-values)
      • scatterplot to compare fold changes
      • distribution of p-values
      • distribution of adjusted p-values (FDR)
    • Local (for each protein)
      • boxplot for each protein (unpaired)
      • line plot showing all transitions.
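
The top N filtering and aggregation steps listed above could look roughly like the following minimal sketch. It uses base R only to stay dependency-free; the data frame, its column names (`PrecursorID`, `Intensity`), and the helpers `top_n_per_protein` and `aggregate_to_protein` are hypothetical and not part of any existing API. For brevity it ranks precursors within a single sample; with several samples one would probably rank on a summary such as the mean intensity.

```r
# Hypothetical precursor-level data in long format (column names are assumptions).
precursors <- data.frame(
  ProteinID   = c("P1", "P1", "P1", "P2", "P2"),
  PrecursorID = c("a", "b", "c", "d", "e"),
  Intensity   = c(100, 400, 200, 50, 70),
  stringsAsFactors = FALSE
)

# Keep the n most intense precursors of each protein.
top_n_per_protein <- function(data, n) {
  # Rank within each protein; negating makes the highest intensity rank 1.
  ranks <- ave(-data$Intensity, data$ProteinID,
               FUN = function(x) rank(x, ties.method = "first"))
  data[ranks <= n, , drop = FALSE]
}

# Aggregate precursor intensities to protein level (sum, mean, median, ...).
aggregate_to_protein <- function(data, fun = sum) {
  aggregate(Intensity ~ ProteinID, data = data, FUN = fun)
}

top2 <- top_n_per_protein(precursors, n = 2)
protein_sum <- aggregate_to_protein(top2, fun = sum)
print(protein_sum)
```

Passing `fun = median` or `fun = mean` switches the aggregation method without touching the filtering step.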

Implementation

  • Use the long format (tibble) for as long as possible and implement functions to switch between wide and long.
  • Methods work on tibbles and should support the magrittr pipe operator.

Protein inference problem

We measure precursors, but the subjects of interest are proteins. Precursors can be grouped into proteins, either unambiguously, with no conflicting peptide-to-protein assignments, or, optionally, allowing for ambiguity, which means that a peptide can be assigned to more than one protein.

In entity-relationship (ER) terms, the unambiguous case is an N:1 relationship between peptides and proteins; the ambiguous case is an N:M relationship.
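
Whether a given data set falls into the N:1 or the N:M case can be checked directly from the peptide-to-protein assignments. The following is a minimal sketch in base R; the table `assignments` and the helper `proteins_per_peptide` are hypothetical illustrations, not part of an existing package.

```r
# Hypothetical peptide-to-protein assignments (names are assumptions).
assignments <- data.frame(
  Peptide   = c("AAK", "CCR", "DDK", "DDK"),
  ProteinID = c("P1", "P1", "P2", "P3"),
  stringsAsFactors = FALSE
)

# Count the distinct proteins each peptide is assigned to.
proteins_per_peptide <- function(map) {
  unique_map <- unique(map[, c("Peptide", "ProteinID")])
  table(unique_map$Peptide)
}

counts <- proteins_per_peptide(assignments)

# Peptides assigned to more than one protein make the relationship N:M.
ambiguous_peptides <- names(counts)[counts > 1]
is_unambiguous <- length(ambiguous_peptides) == 0

# One way to enforce the N:1 case is to drop the shared peptides.
unambiguous <- assignments[!assignments$Peptide %in% ambiguous_peptides, ]
```

Dropping shared peptides is only one possible policy; keeping them and modeling protein groups is the alternative the ambiguous case implies.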

What is needed for modeling?

We want to learn about proteins, therefore the input needs to be:

  ProteinID, fixed effects, random effects, Response, Obsolete columns. 

Here, ProteinID is the grouping variable. Obsolete columns are columns the model does not use, e.g. Fasta.headers containing additional information, or the original intensities if transformed intensities are used.

The specification needs to contain:

  • subject = "ProteinID"

  • FixedEffects = c("a","b","c","d")

  • RandomEffects = c("x","y","z")

  • Response = "Intensity"

  • Formula for fixed and random effects or design matrix if only fixed effects are considered.

  • A list of contrasts.
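
As an illustration, such a specification could be held in a plain R list and turned into a model formula. This is a sketch under assumptions: the effect names (`Condition`, `Plant`, `Batch`), the `contrasts` entry, and the helper `build_formula` are hypothetical, and the random effects are written as random intercepts in the `(1 | group)` notation used by lme4, which the notes above do not prescribe.

```r
# Hypothetical specification; effect names are placeholders.
spec <- list(
  subject       = "ProteinID",
  FixedEffects  = c("Condition", "Plant"),
  RandomEffects = c("Batch"),
  Response      = "Intensity",
  contrasts     = list(c("B", "A"))  # assumed encoding: numerator, denominator
)

# Build a formula from the specification.
# Random effects become random intercepts, "(1 | x)" (an assumption).
build_formula <- function(spec) {
  terms <- spec$FixedEffects
  if (length(spec$RandomEffects) > 0) {
    terms <- c(terms, paste0("(1 | ", spec$RandomEffects, ")"))
  }
  as.formula(paste(spec$Response, "~", paste(terms, collapse = " + ")))
}

model_formula <- build_formula(spec)
print(model_formula)
```

If `RandomEffects` is empty, the resulting formula contains fixed effects only and could be passed to `model.matrix()` to obtain a design matrix.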

Supported backends

  • msqrob
  • limma

How do we preprocess the data?

data normalization

Most normalization methods, e.g. quantile normalization or justvsn, work with data in matrix format. The transformation results should be appended to the data.
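
The round trip from long format to a matrix and back might look like the sketch below. It uses median normalization on log2 intensities because that needs base R only; the column names and example values are assumptions, and a matrix-based method such as quantile normalization would slot in at the same place.

```r
# Hypothetical long-format data (column names are assumptions).
long <- data.frame(
  ProteinID = rep(c("P1", "P2", "P3"), times = 2),
  Raw.file  = rep(c("s1", "s2"), each = 3),
  Intensity = c(2, 4, 8, 4, 8, 16),
  stringsAsFactors = FALSE
)
long$log2Intensity <- log2(long$Intensity)

# Long -> wide: matrix with proteins in rows and files in columns.
wide <- tapply(long$log2Intensity, list(long$ProteinID, long$Raw.file), mean)

# Median normalization: align all column medians to their common mean.
col_medians <- apply(wide, 2, median, na.rm = TRUE)
normalized <- sweep(wide, 2, col_medians - mean(col_medians), "-")

# Wide -> long: append the result as a new column, looked up by
# (row name, column name) pairs so the original row order is kept.
long$normalized <- normalized[cbind(long$ProteinID, long$Raw.file)]
```

Keeping both `log2Intensity` and `normalized` in the long table makes before/after comparisons in the QC plots straightforward.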

Adjusting p-values

What do we need to plot?

Most of the plots will work with the long format (ggplot2) but some of the plots need the data in a wide format (e.g. pairs plot).

Diagnostic plots

  • Pairs plot within condition: works only with 2 to 4 samples. Group labels required.

  • Matrix (line) plot of transition or peptide intensities per sample and protein. No group labels required.

  • Correlation map of samples.

  • Violin plot showing the CV for each condition. Open question: how to interpret it in a paired experiment?

  • Pairs plot of group averages (actually an output of modeling).

Visualization of modeling results.

  • Bland-Altman plot: mean vs. log2 fold change, ideally with significant proteins labeled.
  • Distribution of p-values with description.
  • Volcano plot

Are modeling diagnostics for single proteins needed?

  • boxplot for a single protein when intensities are available at protein level only
  • boxplot or matrix plot for each peptide and precursor, if available.

Output

  • Long format

     ProteinID, Comparison, fold change, p-value, adjusted p-value, obsolete columns (e.g. Fasta.headers).
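
A long-format result table of this shape, with p-values adjusted separately within each comparison, could be assembled as in the following sketch. The values are made up for illustration, the column names `log2FC`, `p.value`, and `p.adjusted` are assumptions, and Benjamini-Hochberg is chosen only as a common default; the notes do not fix an adjustment method.

```r
# Hypothetical modeling results in long format (values are illustrative).
results <- data.frame(
  ProteinID  = rep(c("P1", "P2", "P3", "P4"), times = 2),
  Comparison = rep(c("B_vs_A", "C_vs_A"), each = 4),
  log2FC     = c(1.2, -0.3, 0.1, 2.0, 0.4, -1.5, 0.0, 0.9),
  p.value    = c(0.01, 0.04, 0.03, 0.005, 0.5, 0.02, 0.9, 0.2),
  stringsAsFactors = FALSE
)

# Adjust p-values within each comparison (Benjamini-Hochberg, an assumption).
results$p.adjusted <- ave(results$p.value, results$Comparison,
                          FUN = function(p) p.adjust(p, method = "BH"))
print(results)
```

Adjusting per comparison rather than across the whole table is a design choice; pooling all comparisons would give a more conservative result.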
    

Example Projects

Example project 2147

factors - levels

  • Raw.file
  • Measurement.Date
  • Measurement.Order
  • Strain
  • Timepoint
  • Treatment
  • Plant
  • Leave
  • BATCH

Fixed effects

Of interest to the analysis:

  • Condition (Strain + Timepoint + Treatment)
  • Since the experiment is paired, we also need Plant as an additional factor.

Random effects

Of no interest to the analysis. In this case, this would be the batch. If we move to the protein level, it will be the peptide or precursor.

Contrasts

Contrasts are computed based on a single factor, the condition, which can be a composition of several factors (see above).
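
Composing the condition from several factors and defining a contrast on it could be sketched as follows. The annotation values and the helper `make_contrast` are hypothetical; the contrast is represented as a named weight vector over the condition levels, which is one common encoding rather than a fixed requirement.

```r
# Hypothetical sample annotation (factor levels are made up).
annotation <- data.frame(
  Strain    = c("wt", "wt", "mut", "mut"),
  Timepoint = c("t0", "t1", "t0", "t1"),
  Treatment = c("ctrl", "ctrl", "ctrl", "ctrl"),
  stringsAsFactors = FALSE
)

# Compose a single Condition factor from the three factors.
annotation$Condition <- interaction(annotation$Strain, annotation$Timepoint,
                                    annotation$Treatment,
                                    sep = "_", drop = TRUE)

# Build a contrast as a named weight vector: +1 numerator, -1 denominator.
make_contrast <- function(levels, numerator, denominator) {
  stopifnot(numerator %in% levels, denominator %in% levels)
  weights <- setNames(numeric(length(levels)), levels)
  weights[numerator] <- 1
  weights[denominator] <- -1
  weights
}

contrast <- make_contrast(levels(annotation$Condition),
                          numerator = "mut_t0_ctrl",
                          denominator = "wt_t0_ctrl")
print(contrast)
```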

Protein Level Annotation

We are usually interested in inferences at the level of proteins or protein groups.

In proteomics, we usually measure precursors:

  • FileName
  • TransitionGroupID
  • StrippedSequence
  • ModifiedSequence
  • PrecursorCharge
  • Decoy
  • LabelType
  • Measured.PrecursorMZ
  • Measured.PrecursorRT
  • Measured.PrecursorScore (some arbitrary quality score).
  • Measured.MS1Intensity
  • Measured.MS2IntensityAggregated

Precursors are aggregated to proteins through several levels:

  • Precursor
  • Modified peptide sequence
  • Peptide
  • Protein