Introduction

Copyright (c) 2020 Tyrone Chen, Al J Abadi, Kim-Anh Lê Cao, Sonika Tyagi

Code in this package and git repository https://github.com/tyronechen/SARS-CoV-2/ is provided under an MIT license. This documentation is provided under a CC-BY-3.0 AU license.

Visit our lab website here. Contact Sonika Tyagi at [email protected].

1 Index

NOTE: The pipeline API has changed since the original publication. To reproduce the results in the original COVID-19 paper for case studies 1 and 2, please use only the specific version of the pipeline available on Zenodo. Please refer to case study 3 for the latest usage.

2 Installation

2.1 Quick

You can install this directly as an R package from GitHub. Installation may fail if the libgit2 and freetype system libraries are missing (these are not R packages, so install them with your system package manager).

NOTE: This version of the pipeline is compatible with R version 4.2.3 on Linux.

    install.packages("devtools")
    library("devtools")
    install_git("https://github.com/tyronechen/SARS-CoV-2.git", subdir="multiomics", build_vignettes=FALSE, INSTALL_opts="--no-multiarch")

The script that runs the pipeline is not exposed as an R function. It ships with the package as a separate file:

    # this will show you the path to the script
    system.file("scripts", "run_pipeline.R", package="multiomics")

2.2 Manual (for developers or if the above doesn't work)

Alternatively, clone the git repository with:

    git clone "https://github.com/tyronechen/SARS-CoV-2.git"

2.2.1 Install dependencies

With conda:

    conda config --add channels defaults
    conda config --add channels bioconda
    conda config --add channels conda-forge

    install_me="r-argparser r-brio r-colorspace r-diffobj r-dplyr r-ellipsis \
     r-farver r-ggplot2 r-ggrepel r-igraph r-isoband r-matrixStats r-mixOmics \
     r-parallel r-plyr r-rARPACK r-Rcpp r-RcppEigen r-reshape2 r-RSpectra \
     r-stringi r-testthat r-tibble r-tidyr r-utf8 r-vctrs r-zeallot \
     bioconductor-biocparallel"

    # create the environment with all dependencies in one step
    conda create -n my_environment ${install_me}

You can also install dependencies in R directly:

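    # mixOmics and BiocParallel are distributed via Bioconductor, not CRAN,
    # so install through BiocManager, which handles both repositories.
    # parallel ships with base R and needs no separate install.
    install_me <- c(
      "argparser", "brio", "colorspace", "diffobj", "dplyr", "ellipsis", "farver",
      "ggplot2", "ggrepel", "igraph", "isoband", "matrixStats", "mixOmics",
      "plyr", "rARPACK", "Rcpp", "RcppEigen", "reshape2", "RSpectra",
      "stringi", "testthat", "tibble", "tidyr", "utf8", "vctrs", "zeallot",
      "BiocParallel")
    install.packages("BiocManager")
    BiocManager::install(install_me)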

If you run into any issues with the manual install, please double check the library versions against multiomics/DESCRIPTION.
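
To compare against multiomics/DESCRIPTION, it helps to list what is actually installed. The following is a minimal sketch, not part of the pipeline: `pkgs` holds two stand-in package names that ship with R, and you would replace it with the dependency names from the install step.

```r
# Stand-in names so the sketch runs anywhere; replace with the
# dependency names you want to check (for example, install_me).
pkgs <- c("stats", "utils")

# Report the installed version of each package, or NA if it is missing.
versions <- vapply(pkgs, function(p) {
  if (requireNamespace(p, quietly = TRUE)) {
    as.character(packageVersion(p))
  } else {
    NA_character_
  }
}, character(1))

print(versions)
```

Any `NA` entry marks a package that still needs to be installed.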

3 Usage

Load the library.

    library(multiomics)

If you installed this pipeline as an R package, an example pipeline script is included. You can find it by running this command in your R environment:

    system.file("scripts", "run_pipeline.R", package="multiomics")

Otherwise, you can find a copy of this script in the public git repository: https://github.com/tyronechen/SARS-CoV-2/blob/master/src/run_pipeline.R

To inspect the arguments to the script, run this command:

    Rscript run_pipeline.R --help

A minimal script to run the pipeline is below. You can also download this here. This example may take a few hours to run fully.

Data is provided as part of the multiomics package and not directly as files. Extract it first with this:

    Rscript -e 'library(multiomics); data(BPH2819); names(BPH2819); export <- function(name, data) {write.table(data.frame(data), paste(name, ".tsv", sep=""), quote=FALSE, sep="\t", row.names=TRUE, col.names=NA)}; mapply(export, names(BPH2819), BPH2819, SIMPLIFY=FALSE)'

Four files will be generated in the current working directory. classes.tsv contains the sample information, and the remaining files contain the corresponding omics data:

    classes.tsv
    metabolome.tsv
    proteome.tsv
    transcriptome.tsv
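
To inspect an exported file, you can read it back into R. The sketch below is not part of the pipeline. It uses a made-up toy block and a temporary file, writes it with the same `write.table` settings as the export command above, and reads it back with `read.table`. Setting `check.names = FALSE` is a choice made here so that feature names are returned unmodified.

```r
# Toy block: 3 samples x 2 features (values are made up).
block <- data.frame(
  feat_a = c(0.1, 0.2, 0.3),
  feat_b = c(1.5, 2.5, 3.5),
  row.names = c("RPMI_0", "RPMI_1", "Sera_2")
)

# Write with the same settings as the export command.
path <- tempfile(fileext = ".tsv")
write.table(block, path, quote = FALSE, sep = "\t",
            row.names = TRUE, col.names = NA)

# Read back: the first column holds the sample names.
restored <- read.table(path, header = TRUE, sep = "\t",
                       row.names = 1, check.names = FALSE)

print(dim(restored))
print(rownames(restored))
```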

Then run the multiomics pipeline on the data:

    Rscript run_pipeline.R \
      --classes classes.tsv \
      --data metabolome.tsv \
             proteome.tsv \
             transcriptome.tsv \
      --data_names metabolome proteome transcriptome \
      --ncpus 2 \
      --icomp 12 \
      --pcomp 10 \
      --plsdacomp 2 \
      --splsdacomp 2 \
      --diablocomp 2 \
      --dist_plsda "centroids.dist" \
      --dist_splsda "centroids.dist" \
      --dist_diablo "centroids.dist" \
      --cross_val "Mfold" \
      --cross_val_folds 5 \
      --cross_val_nrepeat 50 \
      --corr_cutoff 0.1 \
      --outfile_dir BPH2819 \
      --contrib "max" \
      --progress_bar
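
If you prefer to launch the script from an R session, the argument list can be assembled as a character vector. This is only a sketch: it shows a subset of the flags from the command above (the remaining ones would be appended the same way), it assumes `run_pipeline.R` is in the working directory, and the actual call is left commented out.

```r
# Omics blocks to pass in; file names follow the <name>.tsv export convention.
data_names <- c("metabolome", "proteome", "transcriptome")

# Subset of the flags shown in the command above.
args <- c(
  "run_pipeline.R",
  "--classes", "classes.tsv",
  "--data", paste0(data_names, ".tsv"),
  "--data_names", data_names,
  "--ncpus", "2",
  "--outfile_dir", "BPH2819"
)

# Show the command that would be run.
cat("Rscript", args, "\n")

# Uncomment to run the pipeline:
# system2("Rscript", args)
```

Keeping `data_names` in one place means the `--data` and `--data_names` values stay in the same order.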

4 Pipeline minimum input data

4.1 Input data

The minimum input data needed is a file of classes and at least two files of quantitative omics data. Tab-separated data is expected by default.

A small example subset of test data is included in the package for reference. In this test case, data has already been log2 transformed and missing values filled in with imputation.

    data(BPH2819)
    names(BPH2819)
    #> [1] "classes"       "metabolome"    "proteome"      "transcriptome"

The class information is available as a vector:

    BPH2819$classes
    #> [1] "RPMI" "RPMI" "RPMI" "RPMI" "RPMI" "RPMI"
    #> [7] "Sera" "Sera" "Sera" "Sera" "Sera" "Sera"

Each of the three omics data blocks has 12 matched samples and an arbitrary number of features.

    sapply(BPH2819, dim)
    #> $classes
    #> NULL
    #> $metabolome
    #> [1]  12 153
    #> $proteome
    #> [1]   12 1451
    #> $transcriptome
    #> [1]   12 2771

    BPH2819$metabolome[,1:3]
    #>         X3.Aminoglutaric.acid HMDB0000005 HMDB0000008
    #> RPMI_0             -1.7814083   -9.103010   -3.471373
    #> RPMI_1             -1.9108074   -5.401229   -3.488496
    #> RPMI_2             -1.5458964  -10.898804   -2.845025
    #> RPMI_3             -2.1842312   -9.563557   -1.232155
    #> RPMI_4             -1.3106881   -4.755440   -1.723564
    #> RPMI_5             -0.9600247   -4.771127   -1.403044
    #> Sera_6             -0.8764074   -6.507606   -2.884537
    #> Sera_7             -1.4139388  -11.175670   -2.861640
    #> Sera_8             -3.7537269  -10.883382   -1.238028
    #> Sera_9             -2.6902848  -10.744718   -2.635249
    #> Sera_10            -3.3605788  -10.439710   -1.774845
    #> Sera_11            -2.9362071   -5.850829   -1.139523

Important notes on input data:

  1. Class information and the sample order in each omics dataset must be identical.
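  2. Ideally, data should already be preprocessed, with fewer than 20% missing values.
  3. Feature names in each omics dataset may be truncated. Overly long names cause issues with visualisation.
  4. R silently replaces characters that are not valid in names (such as spaces and hyphens) with a period (.), and prefixes names that start with a digit with X. This is why a name like X3.Aminoglutaric.acid appears in the example above.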
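
Notes (1) and (2) can be checked before a run. The following is a minimal sketch on made-up toy data, not a check performed by the pipeline itself: `classes` and `blocks` are stand-ins for your own inputs, and it assumes sample names are stored as row names, as in the example data.

```r
# Toy stand-ins for the real inputs (values are made up).
samples <- c("RPMI_0", "RPMI_1", "Sera_2", "Sera_3")
classes <- c("RPMI", "RPMI", "Sera", "Sera")
blocks <- list(
  metabolome = matrix(c(1, 2, NA, 4, 5, 6, 7, 8), nrow = 4,
                      dimnames = list(samples, c("m1", "m2"))),
  proteome = matrix((1:12) / 10, nrow = 4,
                    dimnames = list(samples, c("p1", "p2", "p3")))
)

# Note 1: every block lists the same samples in the same order,
# and there is one class label per sample.
sample_ids <- lapply(blocks, rownames)
same_order <- all(vapply(sample_ids, identical, logical(1), sample_ids[[1]]))
one_label_each <- length(classes) == nrow(blocks[[1]])

# Note 2: fraction of missing values in each block (guideline: below 20%).
missing_fraction <- vapply(blocks, function(x) mean(is.na(x)), numeric(1))

# Stop early if any check fails.
stopifnot(same_order, one_label_each, all(missing_fraction < 0.2))
print(missing_fraction)
```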

To work around (3) and (4), you can rename your feature names to a short alphanumeric ID in your files, and remap them back later.
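
One way to do this renaming is sketched below. It is an illustration only: the toy block, the feature names, the `F0001`-style ID scheme and the `hits` vector are all made up for the example and are not part of the pipeline.

```r
# Toy block with long, non-syntactic feature names (values are made up).
block <- matrix(
  c(-1.78, -1.91, -9.10, -5.40), nrow = 2,
  dimnames = list(c("RPMI_0", "RPMI_1"),
                  c("3-Aminoglutaric acid",
                    "a very long feature name (isomer 2)"))
)

# Lookup table from short IDs to the original names.
id_map <- data.frame(
  id = sprintf("F%04d", seq_len(ncol(block))),
  original = colnames(block),
  stringsAsFactors = FALSE
)

# Use the short IDs as feature names for the pipeline run.
colnames(block) <- id_map$id

# For comparison, this is what R would do to the original names.
print(make.names(id_map$original))

# After the run, map IDs found in the output back to the original names.
hits <- c("F0002", "F0001")  # hypothetical IDs taken from an output table
print(id_map$original[match(hits, id_map$id)])
```

Saving `id_map` alongside the renamed input files keeps the mapping available when you interpret the results.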

4.2 Input files

If you did not install the R package, you can obtain these example files from the GitHub repository at https://github.com/tyronechen/SARS-CoV-2/.

5 Pipeline output data

Files output by the pipeline include:

  • a PDF file of all plots generated by the pipeline
  • tab-separated txt files containing feature contribution weights to each biological class
  • a tab-separated txt file containing correlations between the omics data blocks

An RData object with all input and output is available in the git repository. It is not included directly in the multiomics package because of size constraints, and it includes data from three omics datasets.

6 Acknowledgements

We thank David A. Matthews for helpful discussions and feedback. We thank Yashpal Ramakrishnaiah for performing an extended analysis of the primary data. We thank Melcy Philip for performing downstream analysis of the data. This work was supported by the MASSIVE HPC facility and the authors thank the HPC team at Monash eResearch Centre for their continuous personnel support. This R package was compiled referring to information from blog posts or books by Hilary Parker, Fong Chun Chan, Karl Broman, Yihui Xie, J. J. Allaire, Garrett Grolemund as well as Jenny Bryan and Hadley Wickham. We acknowledge and pay respects to the Elders and Traditional Owners of the land on which our 4 Australian campuses stand.
