Final update to README.md

AC-BO-Hackathon · Mar 29, 2024 · d814336
1 parent 2c12f89
commit d814336
Showing 1 changed file with 24 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -1,6 +1,29 @@
 **Authors**: Ben Weiser, Jérôme Genzling, Nicolas Gastellu, and Sylvester Zhang
 
-We want to optimize a protein to achieve better binding affinity to a certain substrate. This can have applications to biologics such as antibody therapeutics, enzymology, and in the case of this project, biosensors. We aimed to use Bayesian optimization to mutate the residues of a protein to achieve better binding affinity to fentanyl. This can be used for drug testing to detect fentanyl at higher sensitivities. Our methodology, was to use a pre-train a BERT language model trained to predict binding of ligands to a given protein from a sequence publish by Andrew E Blanchard. We give the amino acid sequence of the protein and the smiles string of fentanyl to this model and it outputs score related to binding affinity. We then use Bayesian optimization to query position and amino acid to mutate to and subsequently predict the next best position and amino acid to try next. We want to start with a protein that already has affinity to fentanyl so we took a previously developed protein published by Lisa M. Eubanks. We analyzed the protein and selected the residues with 5 angstroms of the ligand to be selected for possible sites of mutagenesis. Then take a mutation if we find it increased the binding above a certain threshold, and do this until we find 3 mutations chosen by BO which our language model predicts to have high binding affinity. We compared the methodology to a baseline of using random mutations. And we look at using other acquisition functions. We then took our results and analyzed them using pymol’s mutagenesis tool which showed how the suggested changes could lead to new interactions being formed, and how some lead to clashes potentially changing the conformation of the protein. Finally, we investigated the changes of these mutations on the structure using alpha fold.
+
+# Intro
+
+We built a tool that optimizes a protein for better binding affinity to a given substrate. This has applications in biologics such as antibody therapeutics, in enzymology, and, in the case of this project, in biosensors. We use Bayesian optimization to mutate the residues of a protein to improve its binding affinity to fentanyl. The resulting binder could be used in drug testing to detect fentanyl with higher sensitivity.
+
+
+# Protein-ligand system of interest
+
+As a proof of concept, we focused on a polypeptide which was computationally designed to have a high binding affinity to the opioid fentanyl (see References section). Our code searches for mutations of this polypeptide that are predicted to bind fentanyl more strongly, using the method described below.
+
+
+# Methods
+
+Our methodology uses a pre-trained BERT language model, published by Andrew E. Blanchard, which predicts the binding of ligands to a given protein from its sequence. We give the amino acid sequence of the protein and the SMILES string of fentanyl to this model, and it outputs a score related to binding affinity. We then use Bayesian optimization (BO) to choose which position to mutate and which amino acid to mutate it to, and subsequently to predict the next best position and amino acid to try. We wanted to start with a protein that already has affinity for fentanyl, so we took a previously developed protein published by Lisa M. Eubanks. We analyzed the protein and selected the residues within 5 angstroms of the ligand as possible sites of mutagenesis. We accept a mutation if we find that it increases the predicted binding above a certain threshold, and we repeat this until we have 3 mutations chosen by BO which our language model predicts to have high binding affinity. We compared the methodology to two baselines: (i) random mutations and (ii) a rudimentary genetic algorithm.
+
+# What you'll find in this repository
+
+* `final_notebook.ipynb`: This Jupyter notebook contains most of the code we developed for this project. It also includes our tests on our protein-ligand system of interest and the comparison of our results with the aforementioned baselines. It is our pride and joy. We ran it on Google Colab, so you should be able to copy it and run it yourself; just make sure to uncomment the code in the first cell, which installs all of the dependencies.
+* `get_close_resids.py`: A helper module which we wrote to identify the protein residues which are within some distance `rcut` (which we set to 5 Å) of the ligand.
+* `5tzo.pdb`: A PDB file containing the protein-ligand structure of our system of interest. (Obtained from [here](https://www.rcsb.org/structure/5TZO)).
+* `requirements.txt`: The list of our code's dependencies. It is not necessary if you plan on running `final_notebook.ipynb` directly on Google Colab.
+
+
+# References
 
 * Starting protein found [here](https://www.rcsb.org/structure/5TZO). ([Relevant paper](https://doi.org/10.7554/eLife.28909))
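
The residue-selection step that the README attributes to `get_close_resids.py` can be sketched as follows. This is a minimal illustration and not the module's actual code: the function names `parse_pdb_atoms` and `close_residues` are hypothetical, the ligand is assumed to be stored as `HETATM` records, and the ligand's residue name must be looked up in the PDB file rather than taken from this sketch. Only the `rcut` parameter and its 5 Å default come from the README.

```python
import math


def parse_pdb_atoms(pdb_text):
    """Split PDB text into protein atoms (ATOM) and hetero atoms (HETATM).

    Uses the fixed-column layout of PDB coordinate records.
    """
    protein, hetero = [], []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        atom = {
            "resname": line[17:20].strip(),
            "chain": line[21],
            "resid": int(line[22:26]),
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        }
        (protein if record == "ATOM" else hetero).append(atom)
    return protein, hetero


def close_residues(pdb_text, ligand_resname, rcut=5.0):
    """Return sorted (chain, resid, resname) tuples for every protein residue
    with at least one atom within rcut angstroms of any ligand atom."""
    protein, hetero = parse_pdb_atoms(pdb_text)
    # Assumption: the ligand is identified by its residue name among HETATM records.
    ligand_xyz = [a["xyz"] for a in hetero if a["resname"] == ligand_resname]
    close = set()
    for atom in protein:
        if any(math.dist(atom["xyz"], xyz) <= rcut for xyz in ligand_xyz):
            close.add((atom["chain"], atom["resid"], atom["resname"]))
    return sorted(close)
```

To try it on the structure from the repository, read `5tzo.pdb` into a string and pass the ligand's residue name as it appears in that file. The brute-force distance check is quadratic in the number of atoms, which is acceptable for a single small structure.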
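
The random-mutation baseline mentioned in the Methods section can be sketched as a simple accept-or-reject loop. This is an illustration under stated assumptions, not the notebook's code and not the Bayesian optimization itself: `score_fn` is a hypothetical stand-in for the language model's binding score, the names `random_mutation_baseline` and `min_gain` are invented here, and the acceptance rule (the new score must beat the current one by more than `min_gain`) is only one reading of "above a certain threshold". Each accepted site is removed from the candidate pool so that the three mutations fall on distinct residues, which is also an assumption.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def random_mutation_baseline(sequence, positions, score_fn, min_gain,
                             n_mutations=3, max_queries=1000, seed=0):
    """Propose random point mutations at the allowed positions and keep each
    one whose score exceeds the current score by more than min_gain."""
    rng = random.Random(seed)
    remaining = list(positions)      # candidate sites (0-based indices)
    current = sequence
    current_score = score_fn(current)
    accepted = []                    # (position, old residue, new residue)
    for _ in range(max_queries):
        if len(accepted) == n_mutations or not remaining:
            break
        pos = rng.choice(remaining)
        new_aa = rng.choice([aa for aa in AMINO_ACIDS if aa != current[pos]])
        candidate = current[:pos] + new_aa + current[pos + 1:]
        score = score_fn(candidate)
        if score - current_score > min_gain:
            accepted.append((pos, current[pos], new_aa))
            current, current_score = candidate, score
            remaining.remove(pos)    # assumption: mutate each site at most once
    return current, accepted
```

A Bayesian optimization variant would replace the two `rng.choice` calls with a proposal from an acquisition function fitted to the scores observed so far; the acceptance logic around it could stay the same.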