Peptide Generalization Benchmarks

1. Introduction

This set of benchmarks evaluates the ability of peptide representation methods and models to provide features of standard and modified peptides that are meaningful for machine learning. The benchmark's main focus is the ability of a representation technique to extrapolate from standard to modified peptides, as this is the most common scenario in real-world pharmaceutical development.

Here, we define standard peptides as protein sequences with fewer than 50 amino acids composed of the 20 canonical amino acids; modified peptides are defined as peptides with chemical modifications, whether in the backbone, cyclizations, or any non-canonical side-chains.

If you want to learn more, please check out our paper preprint.

2. Datasets

The benchmark currently comprises four different prediction tasks:

  1. Protein-peptide binding affinity (Regression)
  2. Cell penetration (Classification)
  3. Antibacterial activity (Classification)
  4. Antiviral activity (Classification)

For each of these tasks there are two subsets of data: standard (file names start with c-) and modified (file names start with nc-). We are continuously looking to improve the benchmarks and make them more comprehensive, so we welcome any suggestions for tasks or datasets that may be relevant for 1) drug development or 2) bio-catalyst optimization. If you have a suggestion, please open an issue or contact us at [email protected].

The representations can be downloaded from here: PeptideGeneralizationBenchmarks - Representations.
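After downloading, a quick way to sanity-check the data layout is to split the files by prefix. This is only a minimal sketch; the downloads/ directory name is an assumption about where you saved the files:

from pathlib import Path

# Hypothetical local folder holding the downloaded benchmark files.
data_dir = Path("downloads")
standard = sorted(data_dir.glob("c-*"))   # standard-peptide datasets
modified = sorted(data_dir.glob("nc-*"))  # modified-peptide datasets
print(f"{len(standard)} standard files, {len(modified)} modified files")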

3. Running the benchmarks

First, you will need to clone the repository:

git clone https://github.com/IBM/PeptideGeneralizationBenchmarks
cd PeptideGeneralizationBenchmarks

3.1 Prepare peptide representations

Then you will need to adapt the rep_transfer/represent_peptides.py file to account for your peptide representation/featurization method or model. The output should be an $N \times E$ matrix, where $N$ is the number of peptides in each dataset and $E$ is the dimensionality of the embedding space.
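The script has its own interface, but as a rough, standalone sketch of producing such a matrix, the snippet below mean-pools residue embeddings from a pre-trained ESM2 checkpoint via Hugging Face transformers; the function name, checkpoint, and pooling strategy are illustrative choices, not the repository's implementation:

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

def embed_peptides(sequences, model_name="facebook/esm2_t6_8M_UR50D"):
    # Returns an (N, E) matrix of mean-pooled per-peptide embeddings.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    rows = []
    with torch.no_grad():
        for seq in sequences:
            inputs = tokenizer(seq, return_tensors="pt")
            hidden = model(**inputs).last_hidden_state          # (1, L, E)
            rows.append(hidden.mean(dim=1).squeeze(0).numpy())  # pool over residues
    return np.stack(rows)                                       # (N, E)

embeddings = embed_peptides(["ACDEFGHIK", "KLMNPQRST"])
print(embeddings.shape)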

3.2 Run all the benchmarks

To compute the representations and run all the benchmarks, run:

./run_all.sh <name-of-representation> svm
./run_all.sh <name-of-representation> lightgbm

3.3 Statistical analysis of the results

The statistical analysis of the results can be easily performed by running the analysis/results_analysis.ipynb notebook.
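For example, assuming Jupyter is installed in your environment:

jupyter notebook analysis/results_analysis.ipynb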

4. Submission and scoring

All datasets have been partitioned using the Hestia-GOOD framework (more information in the Hestia-GOOD paper or GitHub repository). The final model score for each dataset is the average across all thresholds and 5 independent runs. Error measurements are reported as the standard error of the mean across thresholds and independent runs. The significant rank is determined through statistical analysis of the significant differences between models, using a Kruskal-Wallis test followed by post-hoc Wilcoxon tests with Bonferroni correction for multiple testing.
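As a rough illustration of this scoring scheme (not the repository's analysis code), assume scores[model] holds the per-threshold, per-run scores of each model on one dataset; the data below are random placeholders:

import itertools
import numpy as np
from scipy import stats

def mean_and_sem(values):
    # Final score: mean across thresholds and runs, plus standard error of the mean.
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std(ddof=1) / np.sqrt(len(values))

def significance_tests(scores):
    # Kruskal-Wallis across all models, then post-hoc pairwise Wilcoxon tests
    # with Bonferroni correction; used to group models into significant ranks.
    _, p_global = stats.kruskal(*scores.values())
    pairs = list(itertools.combinations(scores, 2))
    corrected = {}
    for a, b in pairs:
        _, p = stats.wilcoxon(scores[a], scores[b])
        corrected[(a, b)] = min(p * len(pairs), 1.0)  # Bonferroni correction
    return p_global, corrected

rng = np.random.default_rng(0)
scores = {m: rng.uniform(0.6, 0.9, size=20) for m in ["model_A", "model_B", "model_C"]}
print(mean_and_sem(scores["model_A"]))
print(significance_tests(scores))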

Performance is measured as Spearman's $\rho$ correlation coefficient for regression tasks and as the Matthews correlation coefficient (MCC) for binary classification tasks.
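Both metrics are available in standard libraries; the values below are illustrative only:

from scipy.stats import spearmanr
from sklearn.metrics import matthews_corrcoef

# Regression tasks: Spearman's rho between predicted and measured values.
rho, _ = spearmanr([0.1, 0.4, 0.8, 0.9], [0.2, 0.3, 0.7, 1.0])

# Binary classification tasks: Matthews correlation coefficient.
mcc = matthews_corrcoef([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])

print(f"Spearman rho = {rho:.2f}, MCC = {mcc:.2f}")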

Currently, we support only one category of evaluation, representation transfer, in which a featurization method or representation learning model encodes each peptide into a single vector that is then used to train a machine learning model (LightGBM) to predict the associated label.
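A schematic of this protocol (not the repository's evaluation scripts), with random arrays standing in for a real $N \times E$ embedding matrix and its labels:

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))    # stand-in for an N x E embedding matrix
y = rng.integers(0, 2, size=500)  # stand-in binary labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LGBMClassifier(n_estimators=200)
clf.fit(X_train, y_train)
print("MCC:", matthews_corrcoef(y_test, clf.predict(X_test)))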

Submissions can be made through a dedicated issue (issue type: Submission). We expect a zip file containing the Results/ directory generated by running rep_transfer/evaluation.py and rep_transfer/evaluation_joint.py.

If you have any doubts as to how to run the scripts, please do not hesitate to open an issue or contact us at [email protected].

5. Leaderboards

5.1. Interpolation standard to standard peptides

Results with LightGBM.

| Representation | Antibacterial | Antiviral | Cell penetration | Protein-peptide binding affinity | Average | Significant rank |
|---|---|---|---|---|---|---|
| ESM2 8M | 0.81±0.02 | 0.78±0.01 | 0.91±0.01 | 0.90±0.01 | 0.85±0.01 | 1 |
| ESM2 650M | 0.81±0.02 | 0.76±0.01 | 0.92±0.01 | 0.91±0.00 | 0.84±0.01 | 1 |
| ECFP-16 counts | 0.79±0.02 | 0.75±0.01 | 0.94±0.01 | 0.91±0.01 | 0.84±0.01 | 1 |
| Prot-T5-XL | 0.81±0.02 | 0.77±0.01 | 0.91±0.01 | 0.90±0.00 | 0.84±0.01 | 1 |
| ESM2 150M | 0.81±0.02 | 0.74±0.01 | 0.91±0.01 | 0.90±0.01 | 0.83±0.01 | 1 |
| ECFP-16 | 0.77±0.02 | 0.74±0.01 | 0.92±0.01 | 0.90±0.01 | 0.83±0.01 | 1 |
| ChemBERTa-2 | 0.80±0.02 | 0.73±0.01 | 0.90±0.01 | 0.89±0.01 | 0.82±0.01 | 1 |
| ProtBERT | 0.79±0.02 | 0.71±0.01 | 0.91±0.01 | 0.92±0.01 | 0.82±0.01 | 1 |
| Pepland | 0.78±0.02 | 0.70±0.01 | 0.88±0.01 | 0.89±0.01 | 0.81±0.01 | 2 |
| PeptideCLM | 0.79±0.02 | 0.71±0.01 | 0.90±0.01 | 0.86±0.00 | 0.81±0.01 | 2 |
| Molformer-XL | 0.77±0.02 | 0.68±0.02 | 0.91±0.01 | 0.88±0.01 | 0.80±0.01 | 2 |
| PepFuNN | 0.68±0.02 | 0.73±0.01 | 0.89±0.01 | 0.76±0.01 | 0.76±0.01 | 3 |
| Avalon FP | 0.68±0.02 | 0.73±0.01 | 0.85±0.02 | 0.62±0.01 | 0.72±0.01 | 4 |

5.2. Interpolation modified to modified peptides

Results with LightGBM.

| Representation | Antibacterial | Antiviral | Cell penetration | Protein-peptide binding affinity | Average | Significant rank |
|---|---|---|---|---|---|---|
| Molformer-XL | 0.88±0.01 | 0.91±0.01 | 0.89±0.01 | 0.85±0.02 | 0.88±0.01 | 1 |
| ChemBERTa-2 | 0.87±0.00 | 0.91±0.01 | 0.84±0.02 | 0.88±0.01 | 0.88±0.01 | 1 |
| Prot-T5-XL | 0.87±0.01 | 0.84±0.02 | 0.93±0.01 | 0.84±0.02 | 0.87±0.01 | 1 |
| Avalon FP | 0.83±0.01 | 0.90±0.01 | 0.72±0.01 | 0.90±0.01 | 0.85±0.01 | 2 |
| ProtBERT | 0.85±0.01 | 0.87±0.02 | 0.87±0.01 | 0.81±0.02 | 0.85±0.01 | 2 |
| ECFP-16 | 0.90±0.01 | 0.87±0.01 | 0.71±0.01 | 0.87±0.01 | 0.84±0.01 | 2 |
| ESM2 650M | 0.89±0.00 | 0.91±0.01 | 0.72±0.01 | 0.80±0.02 | 0.83±0.01 | 2 |
| PeptideCLM | 0.88±0.00 | 0.83±0.02 | 0.78±0.01 | 0.85±0.01 | 0.83±0.01 | 2 |
| ECFP-16 counts | 0.89±0.01 | 0.87±0.01 | 0.65±0.04 | 0.86±0.02 | 0.82±0.01 | 2 |
| ESM2 150M | 0.86±0.01 | 0.91±0.01 | 0.60±0.02 | 0.82±0.02 | 0.80±0.01 | 2 |
| ESM2 8M | 0.78±0.01 | 0.89±0.02 | 0.68±0.02 | 0.82±0.02 | 0.80±0.01 | 3 |
| Pepland | 0.85±0.01 | 0.78±0.01 | 0.62±0.02 | 0.83±0.01 | 0.77±0.01 | 3 |
| PepFuNN | 0.88±0.01 | 0.74±0.02 | 0.44±0.01 | 0.73±0.02 | 0.70±0.02 | 4 |

5.3. Standard to modified extrapolation

This last subtask measures how well models trained with each representation generalize (extrapolate) from a standard training set to a modified test set.

| Representation | Antibacterial | Antiviral | Cell penetration | Protein-peptide binding affinity | Average | Significant rank |
|---|---|---|---|---|---|---|
| PepFuNN | 0.32±0.06 | 0.49±0.03 | -0.04±0.04 | 0.23±0.03 | 0.25±0.04 | 1 |
| ECFP-16 | 0.25±0.06 | 0.54±0.03 | -0.08±0.04 | 0.08±0.03 | 0.19±0.04 | 1 |
| ECFP-16 counts | 0.27±0.06 | 0.51±0.03 | -0.09±0.04 | 0.03±0.03 | 0.18±0.04 | 1 |
| ChemBERTa-2 | 0.02±0.06 | 0.14±0.03 | 0.03±0.04 | 0.37±0.03 | 0.14±0.04 | 1 |
| Prot-T5-XL | 0.08±0.06 | 0.01±0.03 | 0.01±0.04 | 0.34±0.03 | 0.11±0.04 | 2 |
| ESM2 150M | 0.08±0.06 | -0.00±0.03 | 0.02±0.04 | 0.31±0.03 | 0.10±0.04 | 2 |
| ESM2 650M | -0.01±0.06 | 0.09±0.03 | 0.11±0.04 | 0.14±0.03 | 0.08±0.04 | 2 |
| ProtBERT | 0.02±0.06 | 0.02±0.03 | 0.05±0.04 | 0.24±0.03 | 0.08±0.04 | 2 |
| Pepland | 0.20±0.06 | -0.06±0.03 | 0.07±0.04 | 0.12±0.03 | 0.08±0.04 | 2 |
| ESM2 8M | 0.08±0.06 | 0.01±0.03 | -0.03±0.04 | 0.21±0.03 | 0.07±0.04 | 2 |
| Molformer-XL | 0.07±0.06 | 0.01±0.03 | 0.06±0.04 | 0.11±0.03 | 0.06±0.04 | 2 |
| PeptideCLM | -0.01±0.06 | 0.03±0.03 | -0.14±0.04 | 0.26±0.03 | 0.04±0.04 | 2 |
