This repository stores the code and results from BioHackathon 2021, where our goal was to predict the stability of mutated proteins.
The description of this hackathon can be found here.
The data comes from this article: https://www.bakerlab.org/wp-content/uploads/2017/12/Science_Rocklin_etal_2017.pdf
This is our team page. Team members: Jiajun He, Zelin Li.
An outline of our work and results can be found here.
A more detailed description is shown below.
- Our main task is to use the amino-acid sequences of mini-proteins (43 a.a. long) and their secondary-structure information to predict their structural stability.
- The inputs are the a.a. sequence (20 standard a.a. plus one non-standard symbol, 21 kinds in total) and the secondary-structure sequence (E, T, H; 3 kinds).
- This is a regression task: the output is the stability score change of the mutated mini-protein (a score proportional to ΔG).
We built two kinds of models: simple machine-learning methods, and more complex deep-learning models with a transformer and LSTMs.
One-hot encoding is used for the amino acids and the secondary structures; an MLP, a random forest (RF), and an SVM perform the regression.
Here is the notebook for these models.
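As a rough illustration of this baseline, here is a minimal sketch (the vocabulary symbols, hyperparameters, and feature layout are assumptions and may differ from our notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

AA_VOCAB = "ACDEFGHIKLMNPQRSTVWYX"  # 20 standard amino acids + 1 non-standard symbol (21 kinds)
SS_VOCAB = "ETH"                    # secondary-structure classes (3 kinds)

def one_hot(seq, vocab):
    """One-hot encode a sequence over a fixed vocabulary and flatten it into a vector."""
    mat = np.zeros((len(seq), len(vocab)), dtype=np.float32)
    for i, ch in enumerate(seq):
        mat[i, vocab.index(ch)] = 1.0
    return mat.ravel()

def featurize(aa_seq, ss_seq):
    # 43 residues * (21 + 3) channels = 1032 features per mini-protein
    return np.concatenate([one_hot(aa_seq, AA_VOCAB), one_hot(ss_seq, SS_VOCAB)])

# Replace with the hackathon data: aa_seqs, ss_seqs, and stability scores y.
# X = np.stack([featurize(a, s) for a, s in zip(aa_seqs, ss_seqs)])
# for reg in (MLPRegressor(max_iter=1000), RandomForestRegressor(), SVR()):
#     reg.fit(X, y)
```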
We first obtained a latent embedding for each amino acid from a transformer (the pre-trained ESM-1b model), then fed the embeddings into an RNN with LSTM layers to make the prediction.
The overall structure is:
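A minimal sketch of this pipeline with the fair-esm package (the layer sizes, the last-step pooling, and the example sequence are illustrative assumptions, not necessarily our exact configuration):

```python
import torch
import esm

# Load the pre-trained ESM-1b model and its alphabet/tokenizer.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

class StabilityLSTM(torch.nn.Module):
    """LSTM regression head on top of per-residue ESM-1b embeddings (sizes are assumptions)."""
    def __init__(self, embed_dim=1280, hidden_dim=128):
        super().__init__()
        self.lstm = torch.nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = torch.nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):                       # x: (batch, length, embed_dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :]).squeeze(-1)  # one stability score per sequence

# Example: embed one hypothetical 43-residue mini-protein and run the (untrained) head.
data = [("seq1", "GSEEVKKAAEELKKLAEEAKKSGDEEELKKLAEEALKALKESG")]
_, _, tokens = batch_converter(data)
with torch.no_grad():
    reps = model(tokens, repr_layers=[33])["representations"][33]  # (1, L+2, 1280)
per_residue = reps[:, 1:-1, :]                  # drop BOS/EOS tokens
score = StabilityLSTM()(per_residue)            # prediction is illustrative only
```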
We also used the transformer to predict a contact map and combined it with the sequence embeddings using an attention mechanism. However, the improvement was not significant (the correlation coefficient was only 0.001 higher than Model 2; more details can be found in the "Notebook for testing"), so to keep the model simple we did not use the contact map for our final results.
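One possible way to fold the predicted contact map into the model, roughly in the spirit of Model 3 (a sketch only; the actual attention formulation in our notebook may differ):

```python
import torch

def contact_attention(residue_emb, contacts):
    """Pool per-residue embeddings using contact probabilities as attention weights.

    residue_emb: (L, D) per-residue transformer embeddings
    contacts:    (L, L) predicted contact probabilities (e.g. from ESM-1b's contact head)
    """
    attn = torch.softmax(contacts, dim=-1)  # each residue attends to its likely contacts
    return attn @ residue_emb               # (L, D) contact-aware residue features

# The contact map can come from ESM-1b itself:
# results = model(tokens, return_contacts=True)
# contacts = results["contacts"][0]         # (L, L)
```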
Here are the notebooks for these models:
Model 1: Transformer+LSTM using SS (Our Final Model for Testing)
Model 2: Transformer+LSTM without SS
Model 3: Transformer+LSTM+Contact Map
* Some weights were retrained, so there are slight differences from the results below, but the overall trends and conclusions are the same.
Model | Correlation Coefficient (Single Mutation) | Correlation Coefficient (Multiple Mutations)
---|---|---
MLP | 0.8451 | 0.3177 |
RF | 0.8136 | 0.3827 |
SVM | 0.8350 | 0.4089 |
Transformer Embedding + LSTMs | 0.8912 | 0.5940 |
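The correlation coefficients above can be computed along these lines (assuming Pearson correlation; the arrays are placeholders for the test-set labels and predictions):

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholders; replace with measured and predicted stability scores on the test split.
y_true = np.array([0.12, -0.35, 1.02, 0.48])
y_pred = np.array([0.20, -0.10, 0.95, 0.40])

r, _ = pearsonr(y_true, y_pred)
print(f"correlation coefficient: {r:.4f}")
```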
Plots on the test data: single-mutation and multiple-mutation scatter plots.
We also explored whether the secondary structure (SS) is necessary, and found that it is actually unnecessary for our task.
We built two models, one with SS and one without SS; the results are as follows:
- Better feature engineering yields better results: the transformer embedding outperforms simple one-hot encoding in our task.
- Multiple-mutation data are harder to predict than single-mutation data, especially for proteins with a negative score.
- Secondary structure is almost redundant for our task.
- Possible Reasons:
- We have only 4 original sequences. So our task can be seen as 4 individual regressions, and the secondary structure only serves as a category label.
- If the original energies are thought to be similar, then all the information is stored in the mutated amino acid sequence.
- Possible improvements:
  - Fine-tuning for each dataset separately.
  - Better feature engineering, e.g., considering the chemical properties of amino acids.
  - Better architectures, e.g., transfer learning with a transformer.
  - Using more proteins to collect mutation data, preferably from different organisms and environments.
Rocklin, G. J. et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357, 168–175 (2017).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv (2019). doi:10.1101/622803
Rao, R. M., Meier, J., Sercu, T., Ovchinnikov, S. & Rives, A. Transformer protein language models are unsupervised structure learners. bioRxiv (2020). doi:10.1101/2020.12.15.422761