Tip
To import the workflow into your Terra workspace, click the Dockstore badge above and select 'Terra' from the 'Launch with' widget on the Dockstore workflow page.
High-throughput affinity and mass-spectrometry-based proteomic studies of large clinical cohorts generate vast amounts of proteomic data and can enable rapid disease biomarker discovery. Here, we introduce an advanced machine learning (ML) workflow designed to streamline the ML analysis of proteomics data, enabling researchers to efficiently leverage sophisticated algorithms in the search for critical disease biomarkers.
The workflow:
- takes proteomic data and sample labels as input, imputing missing values where necessary;
- pre-processes the data for ML models and optionally performs dimensionality reduction;
- makes available as standard a catalogue of machine learning and deep learning classification and regression models, including both well-established and cutting-edge methods;
- calculates the accuracy, sensitivity and specificity of models, enabling their evaluation and comparison on these metrics; and
- carries out feature selection in models using SHapley Additive exPlanations (SHAP) values.

In addition to these ML capabilities, the workflow also provides downstream modules for functional enrichment and protein–protein interaction (PPI) network analyses of feature-selected proteins.
The workflow is implemented in Python, R and Workflow Description Language (WDL), and can be executed on a cloud-based platform for biomedical data analysis. Deployment in this manner provides a standardized, user-friendly interface, and ensures the reproducibility and reliability of analytical outputs. Furthermore, such deployment renders the workflow scalable and streamlines the analysis of large, complex proteomic data. This ML workflow thus represents a significant advancement, empowering researchers to efficiently explore proteomic landscapes and identify biomarkers critical for early detection and treatment of diseases.
- **Preprocessing**: By default, Z-score standardisation is applied to the input data. Optionally, users can apply dimensionality reduction to the dataset; scatter plots are then displayed for every pair of dimensions, based on the selected number of output dimensions. The available methods are listed below (a short sketch of this step follows the list):
  - `PCA` (Principal Component Analysis for linear data)
  - `ELASTICNET` (ElasticNet Regularization)
  - `UMAP` (Uniform Manifold Approximation and Projection)
  - `TSNE` (t-Distributed Stochastic Neighbor Embedding)
  - `KPCA` (Kernel Principal Component Analysis for non-linear data)
  - `PLS` (Partial Least Squares Regression)
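As a rough illustration of this step, here is a minimal Python sketch (an approximation, not the workflow's actual code) of Z-score standardisation followed by PCA, with a scatter plot for every pair of output dimensions; the input path and plot styling are placeholder assumptions:

```python
# Minimal sketch: Z-score standardisation + PCA with pairwise scatter plots.
# "data.csv" and the figure layout are hypothetical; cf. main.num_of_dimensions.
import itertools
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv("data.csv", index_col="SampleID")
X = StandardScaler().fit_transform(df.drop(columns=["Label"]))  # Z-score per feature

n_dims = 3  # number of output dimensions
Z = PCA(n_components=n_dims).fit_transform(X)

# One scatter plot for every pair of output dimensions
pairs = list(itertools.combinations(range(n_dims), 2))
fig, axes = plt.subplots(1, len(pairs), figsize=(4 * len(pairs), 4))
for ax, (i, j) in zip(axes, pairs):
    ax.scatter(Z[:, i], Z[:, j], c=pd.factorize(df["Label"])[0], s=10)
    ax.set(xlabel=f"PC{i + 1}", ylabel=f"PC{j + 1}")
fig.savefig("pca_scatter.png")
```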
- **Classification**: This step applies the selected machine learning models to the standardized data and generates a confusion matrix, ROC plots for all classes and their averages, and other relevant evaluation metrics (accuracy, F1, sensitivity, specificity) for all the models. The available algorithms are listed below (a short metrics sketch follows the list):
  - `RF` (Random Forest)
  - `KNN` (K-Nearest Neighbors)
  - `NN` (Neural Network)
  - `SVM` (Support Vector Machine)
  - `XGB` (XGBoost)
  - `PLSDA` (Partial Least Squares Discriminant Analysis)
  - `VAE` (Variational Autoencoder with Multilayer Perceptron)
  - `LR` (Logistic Regression)
  - `GNB` (Gaussian Naive Bayes)
  - `LGBM` (LightGBM)
  - `MLPVAE` (Multilayer Perceptron inside Variational Autoencoder)
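A minimal sketch of how these classification metrics can be computed (illustrative only; the synthetic data and model settings are assumptions, not the workflow's own code):

```python
# Minimal sketch: Random Forest classification with accuracy, F1, sensitivity,
# specificity and ROC AUC, on a synthetic stand-in dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)  # cf. RF
y_pred = clf.predict(X_te)
tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()

print("accuracy   :", accuracy_score(y_te, y_pred))
print("F1         :", f1_score(y_te, y_pred))
print("sensitivity:", tp / (tp + fn))  # true positive rate
print("specificity:", tn / (tn + fp))  # true negative rate
print("ROC AUC    :", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```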
- **Regression**: This step applies the selected machine learning regression models to the standardized data and generates the relevant evaluation plots and metrics for all the models. The available algorithms are listed below (a short metrics sketch follows the list):
  - `RF_reg` (Random Forest Regression)
  - `NN_reg` (Neural Network Regression)
  - `SVM_reg` (Support Vector Regression)
  - `XGB_reg` (XGBoost Regression)
  - `PLS_reg` (Partial Least Squares Regression)
  - `KNN_reg` (K-Nearest Neighbors Regression)
  - `LGBM_reg` (LightGBM Regression)
  - `VAE_reg` (Variational Autoencoder with Multilayer Perceptron)
  - `MLPVAE_reg` (Multilayer Perceptron inside Variational Autoencoder)
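For the regression branch, a comparable sketch (again illustrative; the metric set shown here, R², MSE and MAE, is a common choice and an assumption, not necessarily the exact set the workflow reports):

```python
# Minimal sketch: Random Forest regression with common evaluation metrics
# on a synthetic stand-in dataset.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=50, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

reg = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)  # cf. RF_reg
y_pred = reg.predict(X_te)

print("R^2:", r2_score(y_te, y_pred))
print("MSE:", mean_squared_error(y_te, y_pred))
print("MAE:", mean_absolute_error(y_te, y_pred))
```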
- **SHAP analysis** (optional): This step calculates SHapley Additive exPlanations (SHAP) values for variable importance (a CSV file and a radar plot of the top features) and plots ROC curves for all the models specified by the user. A short sketch follows:
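A minimal sketch of SHAP-based feature ranking (illustrative; it uses XGBoost from the model catalogue, with placeholder feature names and a top-10 cut-off mirroring the default `main.shap_features`):

```python
# Minimal sketch: rank features by mean absolute SHAP value for a tree model.
# Feature names and the output file name are placeholders.
import numpy as np
import pandas as pd
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
X = pd.DataFrame(X, columns=[f"Protein{i + 1}" for i in range(X.shape[1])])

clf = xgb.XGBClassifier(n_estimators=100, random_state=0).fit(X, y)  # cf. XGB
shap_values = shap.TreeExplainer(clf).shap_values(X)  # (n_samples, n_features) for binary XGB

# Rank features by mean |SHAP| and keep the top 10 (cf. main.shap_features)
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
importance.nlargest(10).to_csv("shap_top_features.csv")
```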
- **Protein–Protein Interaction analysis** (optional): Biological functional analysis through protein–protein interaction network diagrams for the top-ranked biomarkers, with first-degree network expansions that incorporate protein co-expression patterns to highlight functional connectivity. A co-expression sketch follows:
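The co-expression part of this step can be pictured with the following sketch (an illustration under assumptions, not the workflow's R implementation; `data.csv` is a placeholder, and the defaults mirror `main.ppi_analysis.correlation_method` and `correlation_threshold`):

```python
# Minimal sketch: flag strongly co-expressed protein pairs via Spearman correlation.
import pandas as pd

df = pd.read_csv("data.csv", index_col="SampleID")         # hypothetical input path
corr = df.drop(columns=["Label"]).corr(method="spearman")  # protein x protein matrix

threshold = 0.8  # default correlation_threshold
pairs = (
    corr.where(lambda c: c.abs() >= threshold)
        .stack()  # MultiIndex (protein_a, protein_b) -> correlation
        .loc[lambda s: s.index.get_level_values(0) < s.index.get_level_values(1)]
)
print(pairs)  # strongly co-expressed pairs to overlay on the STRING network
```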
- **Report generation**: This step aggregates all output plots from the previous steps and compiles them into a `.pdf` report, as sketched below:
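A minimal sketch of this aggregation step using `fpdf` (the package listed under Licenses below); the plot directory and output name are placeholders:

```python
# Minimal sketch: collect plot images into a single PDF report.
from glob import glob
from fpdf import FPDF

pdf = FPDF()
for plot in sorted(glob("plots/*.png")):  # hypothetical plot directory
    pdf.add_page()
    pdf.image(plot, x=10, y=20, w=190)    # one plot per page, scaled to page width
pdf.output("my_prefix_report.pdf")        # cf. main.output_prefix
```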
Important
This workflow is primarily designed for cloud-based platforms (e.g., Terra.bio, DNAnexus, Verily) that support WDL workflows. However, you can also run it locally using the Cromwell workflow management system.
This workflow has also been tested locally on Ubuntu 22.04 with Docker v28.3.1 and Cromwell v40, running on a 12th Gen Intel Core i7-1270P with 32 GB RAM. The ARM64 (`linux/arm64`) architecture is currently not supported.
- **Docker**
  - Please check out the Docker installation guide.
- **Mamba package manager**
  - Please check out the mamba or micromamba official installation guide.
  - We prefer `mamba` over `conda` since it is faster and uses `libsolv` to resolve dependencies effectively.
- **Create a new environment with Cromwell**
  - Using `mamba` (recommended): `mamba create --name biomarkerml bioconda::cromwell`
  - Or, using `conda`: `conda create --name biomarkerml -c bioconda cromwell`
- **Activate the environment**
  - With `mamba`: `mamba activate biomarkerml`
  - Or, with `conda`: `conda activate biomarkerml`
- **Prepare your input file**
  - All workflow inputs must be specified in a JSON file.
  - Use the provided `example/inputs.json` file as a template. You can find this file in the `example/` directory of the repository.
  - Make a copy of `example/inputs.json` and edit it to specify your own input data file and desired output prefix. At a minimum, update these two fields:

    ```json
    "main.input_csv": "/full/path/to/your/input/data.csv",
    "main.output_prefix": "your_output_prefix"
    ```

  - Replace `/full/path/to/your/input/data.csv` with the absolute path to your CSV data file, and set `your_output_prefix` to a name you want for your analysis outputs.
  - You can adjust other parameters in the JSON file as needed. See the Inputs section below for descriptions of all available options.
- **Run the workflow locally**: `cromwell run workflows/main.wdl -i example/inputs.json`
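If you prefer to generate the inputs file programmatically, the following optional Python helper (not part of the workflow; all values are placeholders) writes a minimal inputs JSON for a SHAP-enabled classification run:

```python
# Optional helper: write a minimal inputs JSON. All values are placeholders;
# see the Inputs section below for all available options.
import json

inputs = {
    "main.input_csv": "/full/path/to/your/input/data.csv",
    "main.output_prefix": "my_analysis",
    "main.mode": "Classification",
    "main.dimensionality_reduction_choices": "NONE",  # cf. the Warning under main.run_ppi
    "main.classification_model_choices": "RF XGB",    # space-separated model names
    "main.calculate_shap": True,
    "main.shap_features": 10,
}
with open("my_inputs.json", "w") as fh:
    json.dump(inputs, fh, indent=2)
```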
- `main.input_csv`: [File] Input file in `.csv` format; includes a `Label` column, with each row representing a sample and each column representing a feature. An example of the `.csv` is shown below:

  | SampleID | Label  | Protein1 | Protein2 | ... | ProteinN |
  |----------|--------|----------|----------|-----|----------|
  | ID1      | Label1 | 0.1      | 0.4      | ... | 0.01     |
  | ID2      | Label2 | 0.2      | 0.1      | ... | 0.3      |
- `main.output_prefix`: [String] Analysis ID. This will be used as a prefix for all the output files.
- `main.mode`: [String] Specify the mode of the analysis. Options include `Classification`, `Regression`, and `Summary`. Default value: `Summary`.
- `main.dimensionality_reduction_choices`: [String] Specify the dimensionality reduction method name(s) to use. Options include `PCA`, `UMAP`, `TSNE`, `KPCA` and `PLS`. Multiple methods can be entered together, separated by a space. Default value: `PCA`
Warning
It is recommended to select only one dimensionality reduction method when using it alongside classification or regression models.
If multiple dimensionality reduction methods are specified, the workflow will only perform the dimensionality reduction and generate a report.
- `main.num_of_dimensions`: [Int] Total number of expected dimensions after applying dimensionality reduction for the visualization. This option only works when multiple `dimensionality_reduction_choices` are selected. Default value: `3`.
. -
main.classification_model_choices
: [String] Specify the classification model name(s) to use. Options includeRF
,KNN
,NN
,SVM
,XGB
,PLSDA
,VAE
,LR
,GNB
,LGBM
andMLPVAE
. Multiple model names can be entered together, separated by a space. Default value:RF
- `main.regression_model_choices`: [String] Specify the regression model name(s) to use. Options include `RF_reg`, `NN_reg`, `SVM_reg`, `XGB_reg`, `PLS_reg`, `KNN_reg`, `LGBM_reg`, `VAE_reg` and `MLPVAE_reg`. Multiple model names can be entered together, separated by a space. Default value: `RF_reg`
- `main.calculate_shap`: [Boolean] Whether to perform SHAP analysis. Default value: `false`
- `main.shap_features`: [Int] Number of top features to display on the radar/bar chart. Default value: `10`
- `main.run_ppi`: [Boolean] Execute Protein–Protein Interaction (PPI) analysis. Default value: `false`
Warning
The Protein–Protein Interaction analysis can be performed only when the `dimensionality_reduction_choices` option is set to either `ELASTICNET` or `NONE`, and the `calculate_shap` option is set to `true`.
- `main.ppi_analysis.score_threshold`: [Int] Confidence score threshold for loading the STRING database. Default value: `400`
- `main.ppi_analysis.combined_score_threshold`: [Int] Confidence score threshold for selecting nodes to plot in the network. Default value: `800`
- `main.ppi_analysis.SHAP_threshold`: [Int] The number of top important proteins selected for network analysis based on SHAP values. Default value: `100`
- `main.ppi_analysis.protein_name_mapping`: [Boolean] Whether to perform protein name mapping from UniProt IDs to Entrez symbols. Default value: `TRUE`
- `main.ppi_analysis.correlation_method`: [String] Correlation method used to define strongly co-expressed proteins. Options include `spearman`, `pearson` and `kendall`. Default value: `spearman`
- `main.ppi_analysis.correlation_threshold`: [Float] Threshold value of the correlation coefficient used to identify strongly co-expressed proteins. Default value: `0.8`
- `main.*.memory_gb`: [Int] Amount of memory in GB needed to execute the specific task. Default value: `24`
- `main.*.cpu`: [Int] Number of CPUs needed to execute the specific task. Default value: `16`
Note
We recommend that users adopt unique Entrez symbols as the protein naming convention for our network analysis, although we provide an approach using the R/Bioconductor annotation package `org.Hs.eg.db` to map UniProt IDs to Entrez symbols.
The protein name mapping process handles edge cases as follows (see the sketch after this list):
- UniProt IDs mapped to multiple Entrez symbols: all matched Entrez symbols corresponding to the same UniProt ID are concatenated using a semicolon (`;`) and later deconcatenated during network plot mapping to STRINGdb. This may occur, for example, where protein complexes are composed of subunits encoded by different genes.
- Multiple UniProt IDs mapping to the same Entrez symbol: only the first occurrence, corresponding to the protein with the highest SHAP value for that symbol, is retained in the final dataset. This may happen in cases involving protein isoforms, fusion proteins, etc.
- UniProt IDs with no associated Entrez symbol: these entries are removed from the dataset.
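To make these rules concrete, here is a small Python sketch of the same logic (the workflow itself performs the mapping in R with `org.Hs.eg.db`; the table below is a hypothetical UniProt-to-symbol mapping with SHAP values):

```python
# Minimal sketch of the three mapping edge cases; "A00001"/"X00000" are invented IDs.
import pandas as pd

mapping = pd.DataFrame({
    "uniprot": ["P01111", "A00001", "Q8WZ42", "X00000"],
    "symbols": [["NRAS"], ["NRAS"], ["TTN", "TTN-AS1"], []],  # [] = no Entrez symbol
    "shap":    [0.9, 0.3, 0.5, 0.4],
})

# 1) One UniProt ID, several symbols: concatenate with ';' (split again when plotting)
mapping["symbol"] = mapping["symbols"].str.join(";")

# 2) UniProt IDs with no associated symbol: drop the entry
mapping = mapping[mapping["symbol"] != ""]

# 3) Several UniProt IDs, same symbol: keep the one with the highest SHAP value
mapping = mapping.sort_values("shap", ascending=False).drop_duplicates("symbol")
print(mapping[["uniprot", "symbol", "shap"]])
```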
- `report`: [File] A `.pdf` file containing the final reports, including the plots generated through the analyses.
- `results`: [File] A `.gz` file containing the results and plots from all steps in the workflow.
Package | License |
---|---|
micromamba==1.5.5 | BSD-3-Clause |
python | PSF/GPL-compat |
joblib | BSD-3-Clause |
matplotlib | PSF/BSD-compat |
numpy | BSD |
pandas | BSD-3-Clause |
scikit-learn | BSD-3-Clause |
xgboost | Apache-2.0 |
shap | MIT |
pillow | Open Source HPND |
PyTorch | BSD |
Optuna | MIT |
fpdf | LGPL-3.0 |
seaborn | BSD-3-Clause |
umap-learn | BSD-3-Clause |
AnnotationDbi | Artistic-2.0 |
BiocManager | Artistic-2.0 |
fields | GPL (>= 2) |
ggplot2 | MIT |
igraph | GPL (>= 2) |
magrittr | MIT |
optparse | GPL (>= 2) |
STRINGdb | GPL (>= 2) |
tidyverse | GPL-3 |
writexl | BSD-2-Clause |
org.Hs.eg.db | Artistic-2.0 |
Zhou, Y., Maurya, A., Deng, Y., & Taylor, A. (2024). A cloud-based proteomics ML workflow for biomarker discovery. Zenodo. https://doi.org/10.5281/zenodo.13378490
If you use `proteomics-ML-workflow` for your analysis, please cite the Zenodo record for that specific version using the following DOI: 10.5281/zenodo.13378490.