This repo accompanies the pub "How confident should we be in potential targets of tick protease inhibitors predicted by AlphaFold-Multimer?". We are interested in developing approaches to predict the targets of tick secreted effector proteins, with the goal of learning new strategies for modulating itch, pain, and inflammation in the skin. In this work, we try out using AlphaFold Multimer to predict targets for 10 representatives of a family of tick TIL-domain protease inhibitors (orthogroup OG0000058).
This pilot had three main steps:
- Protease inhibitor discovery: Identify protease inhibitor gene families across 15 species of ticks using ProteinCartography and orthogroup data from previous work
- Phylogenomic profiling: Identify which gene family is most predictive of the ability of ticks to suppress host-detection mechanisms such as itch, pain, and inflammation using trait mapping
- Protein-protein interaction prediction: Predict targets for protease inhibitor family orthogroup OG0000058 using AlphaFold Multimer. We also tried D-SCRIPT, but ultimately didn't move forward with that approach.
This repository uses conda to manage software environments and installations. You can find operating system-specific instructions for installing miniconda here. After installing conda and mamba, run the following commands to create the pipeline run environments.
To create the enviroment used for protease inhibitor discovery:
mamba env create -n pi-disco --file envs/pi-disco.yml
conda activate pi-disco
To create the enviroment used for protein-protein interaction prediction:
mamba env create -n tick_ppi --file envs/ppi.yml
conda activate tick_ppi
Our phylogenomic profiling workflow is implemented in an R Markdown notebook (Notebook 04), which sources a trait_mapping_functions.R
script located in the scripts
folder. To run this analysis, you can install Rstudio by following the instructions here. Once RStudio is installed, run the following from command line to open Rstudio from your activated conda environment:
open -a Rstudio
To execute the code in Notebook 04, please make sure you have installed all the dependencies shown in the sessionInfo()
printed at the beginning the notebook.
- Datasheets used in Notebooks 01-03 and 05 are found in
datasheets
- All chelicerate protein annotations are in
all_chelicerate_proteins.csv
on Zenodo - Tick protein structures are in
pi-disco_esmfold_structures.zip
- Animal toxin database downloaded from UniProt Animal toxin annotation project
- Our full ProteinCartography run is in
tick_PUFs_PIs_1200_plus_toxinDB_carto_run3.zip
on Zenodo, key results are found inoutputs/carto_run
- NovelTree orthogroup data comes from previous work here
- Datasheets of high-confidence and low-confidence protease inhibitor gene families
- NovelTree workflow output directory
chelicerata-v1-10062023
- Results from phylogenomic profiling in
detection_suppression_outputs
- Datasheets of Uniprot accessions and expression metadata found in
datasheets
- Results from D-script and AF-multimer runs are in
protein_protein_interaction_results
- Raw results files from AF-Multimer are on Zenodo The .json and .pae files are the most important for reanalyzing the data to calculate AF-Multimer metrics
datasheets/
contains input metadata and datasheets used in analysis notebooksenvs/
contains two.yml
files that specify the conda environments used for protease inhibitor discovery and protein-protein interaction predictionnotebooks/
contains the Jupyter notebooks used to do this analysisoutputs/
contains figures, select results from the ProteinCartography run, results from the phylogenomic profiling pipeline, and results from protein-protein interaction predictionscripts/
contains: scripts for folding proteins using an Arcadia-hosted ESMFold API, further details inscripts/README_esmfold_scripts.md
, scripts for protein-protein interaction prediction using D-SCRIPT and AlphaFold multimer, with further details inscripts/README_ppi_scripts.md
, and an R scriptscripts/trait_mapping_functions.R
that accompanies Notebook 04.
Protease inhibitor discovery:
- Activate environment
envs/pi-disco.yml
. - Get proteins:
01_get_proteins.ipynb
Select protease inhibitors (PIs) and secreted proteins of unknown function (PUFs) <1200 aa, using annotations found inall_chelicerate_proteins.csv
. PIs that we have selected are in./datasheets/tick_PIs_1200.csv
and secreted PUFs are in./datasheets/tick_PUFs_1200.csv
. - Fold tick proteins for structural analysis:
We use an Arcadia-hosted ESMFold API to predict and download the folded proteins. All the scripts we used are in
./scripts
, and detailed usage can be found inscripts/README_esmfold_scripts.md
. The structures that we computed can be found inpi-disco_esmfold_structures.zip
. - Download structures of Animal Toxins:
We downloaded structures from UniProt Animal toxin annotation project using ProteinCartography scripts
fetch_accession.py
andfetch_uniprot_metadata.py
. Our list of successfully downloaded structures can be found here. - Prepare metadata files to run ProteinCartography on tick proteins and other animal toxins:
02_make_metadata.ipynb
This notebook prepares the metadata file for tick proteins and toxin database proteins to go into ProteinCartography. - Use ProteinCartography to cluster tick proteins and toxins:
We cloned the ProteinCartography repo, and ran ProteinCartography on the combined tick and toxin database structures using
cluster
mode. Our config file can be found here and our full results can be found intick_PUFs_PIs_1200_plus_toxinDB_carto_run3.zip
. While the full dataset is too large to host on GitHub, we've moved the key files into an output folder here. - Analyze ProteinCartography results to get protease inhibitor gene families:03_analyze_cartography.ipynb
This notebook takes in the clustered output data from the ProteinCartography run on the tick proteins combined with the toxin database. By looking at semantic data (found intick_PUFs_PIs_1200_plus_toxinDB_carto_run3.zip
) and intra-cluster structural similarity, we find 7 clusters that are high quality protein inhibitors (2.5k proteins) and 4 clusters that are interesting but are much lower confidence (1.8 k proteins). We then combine these protease inhibitor sturctural clusters with NovelTree orthogroup data to select protease inhibitor orthogroups to put through trait mapping. The notebook selects 36 orthogroups that have either high quality or low quality protease inhibitor annotations for trait mapping.
Phylogenomic profiling:
- Download NovelTree workflow output directory
chelicerata-v1-10062023
. - In R-Studio, run Notebook 04. This performs trait mapping on the 36 orthogroups that have either high quality or low quality protease inhibitor annotations, using evolutionary data from the NovelTree analysis. You can see our full session here, and full results can be found here.
- To dig further into the results and select proteins for PPI prediction, activate the
pi-disco
environment. Launch Notebook 05, where we use model variable importance and individual conditional expectation (ICE) to select the gene families that most strongly and postively predict the detection-supression trait. We also filter by secretion signal and by expression level in the female Amblyomma americanum salivary gland to select the most promsing individual proteins to study using protein-protein interaction predicition. We select 10 TIL domain proteins, which can be found here.
Protein-protein interaction prediction:
- Activate environment
envs/ppi.yml
- Collect sets of tick and human proteins. Human proteins and associated metadata are pulled down from Uniprot accessions with the
fetch_accession.py
andfetch_uniprot_metadata.py
from ProteinCartography following the instructions there and providing TXT files of Uniprot accessions. All tick protein sequences used in this study can be found in2024-06-24-all-chelicerate-noveltree-proteins.fasta
. We specifically looked at these 10 proteins here, using Amblyomma-americanum_evm.model.contig-94090-1.4 as a representative, since it had one of the highest-quality structures of the group (pLDDT = 83.8). - Make sequence-based predictions of PPIs with D-script using the script
make_dscript_PPI_predictions.py
. Analysis of these results and attempted filtering are inscripts/dscript-results-analysis.R
. - Prepare FASTA files for running with the Goolge Colab Batch AF-multimer notebook. This can be done either providing the separate bait and prey FASTA sequences that should be combined, or from a TSV of comparisons and a FASTA containing all the sequences
- Analyze Alphafold results with the Jupyter notebook code in
notebooks/06_calculate_afmultimer_predictions.ipynb
and run with the LazyAF Google Colab notebook, but paste in the modified code in this notebook. The modified code adds lines to calculate the average pLDDT and include in the resulting CSV files for each comparison. Change the text in each cell to point to your data and rename the output CSV file. If you don't need the pLDDT information just use the default notebook code. - Save both the results and analysis folders from the AF-multimer predictions and the LazyAF analysis, which selects the highest ranking prediction JSON file and copies it into the analysis folder. Downloading the folder from Google Drive will automatically create a .zip archive and save this in your results folder for backup.
- Analyze AF-multimer results compared to D-script predictions with
notebooks/07_analyze-dscript-afmultimer-comparisons.ipynb
- Code for analysis and plotting is in the
notebooks/08_full-filtered-serine-protease-analysis.ipynb
- We also calculate additional AFmultimer metrics with the
colabfold_analysis.py
script which is part of the AF2multimer-analysis toolkit that underlies the predictomes.org webserver. To use the script:- Download the raw AF2multimer results from Google drive and unzip them. Google will sometimes split up large downloads into multiple zip files and not all results files that go together are in the same subdirectories. Make sure for this script to work all the
.json
and.pae
files together as well as a.done.txt
file with the name of the comparison - Clone the directory with
git clone https://github.com/walterlab-HMS/AF2multimer-analysis
- Run the script with
python3 colabfold_analysis.py <AF2MULTIMER_RESULTS_DIRECTORY>
. This will create a new folder with the same name as your input directory with_analysis
appended at the end. This creates three CSV files: a contacts, interfaces, and summary file. Thesummary.csv
is the most useful for downstream steps and is how we calculated the pDockQ score for each interaction prediction. The contacts and interfaces files give more information for each residue the calculated metrics, whereas the summary file is an average across all residues.
- Download the raw AF2multimer results from Google drive and unzip them. Google will sometimes split up large downloads into multiple zip files and not all results files that go together are in the same subdirectories. Make sure for this script to work all the
- Obtain the predicted expression of each human serine protease hit from the NCBI Gene Database. For each gene, we downloaded primary tissue expression data generated by Fagerberg et al. 2014, which analyzed gene expression in 27 different tissues from 95 healthy humans. Code for analysis and plotting is found in
notebooks/08_full-filtered-serine-protease-analysis.ipynb
Scripts and analysis run on MacOS except for AF-multimer runs which were through Google Colab. Most of the runs on Google Colab were with A100 GPUs, and when not available with either V100 or T4 GPUs.
See how we recognize feedback and contributions to our code