2025-pi-disco-pub

Purpose

This repo accompanies the pub "How confident should we be in potential targets of tick protease inhibitors predicted by AlphaFold-Multimer?". We are interested in developing approaches to predict the targets of tick secreted effector proteins, with the goal of learning new strategies for modulating itch, pain, and inflammation in the skin. In this work, we try out using AlphaFold Multimer to predict targets for 10 representatives of a family of tick TIL-domain protease inhibitors (orthogroup OG0000058).

This pilot had three main steps:

Protease inhibitor discovery: Identify protease inhibitor gene families across 15 species of ticks using ProteinCartography and orthogroup data from previous work
Phylogenomic profiling: Identify which gene family is most predictive of the ability of ticks to suppress host-detection mechanisms such as itch, pain, and inflammation using trait mapping
Protein-protein interaction prediction: Predict targets for protease inhibitor family orthogroup OG0000058 using AlphaFold Multimer. We also tried D-SCRIPT, but ultimately didn't move forward with that approach.

Installation and Setup

This repository uses conda to manage software environments and installations. You can find operating system-specific instructions for installing miniconda here. After installing conda and mamba, run the following commands to create the pipeline run environments.

To create the enviroment used for protease inhibitor discovery:

mamba env create -n pi-disco --file envs/pi-disco.yml
conda activate pi-disco

To create the enviroment used for protein-protein interaction prediction:

mamba env create -n tick_ppi --file envs/ppi.yml
conda activate tick_ppi

Our phylogenomic profiling workflow is implemented in an R Markdown notebook (Notebook 04), which sources a trait_mapping_functions.R script located in the scripts folder. To run this analysis, you can install Rstudio by following the instructions here. Once RStudio is installed, run the following from command line to open Rstudio from your activated conda environment:

open -a Rstudio

To execute the code in Notebook 04, please make sure you have installed all the dependencies shown in the sessionInfo() printed at the beginning the notebook.

Data related to protease inhibitor discovery

Datasheets used in Notebooks 01-03 and 05 are found in datasheets
All chelicerate protein annotations are in all_chelicerate_proteins.csv on Zenodo
Tick protein structures are in pi-disco_esmfold_structures.zip
Animal toxin database downloaded from UniProt Animal toxin annotation project
Our full ProteinCartography run is in tick_PUFs_PIs_1200_plus_toxinDB_carto_run3.zip on Zenodo, key results are found in outputs/carto_run
NovelTree orthogroup data comes from previous work here

Data related to phylogenomic profiling

Datasheets of high-confidence and low-confidence protease inhibitor gene families
NovelTree workflow output directory chelicerata-v1-10062023
Results from phylogenomic profiling in detection_suppression_outputs

Data related to Protein-protein interaction prediction

Datasheets of Uniprot accessions and expression metadata found in datasheets
Results from D-script and AF-multimer runs are in protein_protein_interaction_results
Raw results files from AF-Multimer are on Zenodo The .json and .pae files are the most important for reanalyzing the data to calculate AF-Multimer metrics

Overview

Description of the folder structure

datasheets/ contains input metadata and datasheets used in analysis notebooks
envs/ contains two .yml files that specify the conda environments used for protease inhibitor discovery and protein-protein interaction prediction
notebooks/ contains the Jupyter notebooks used to do this analysis
outputs/ contains figures, select results from the ProteinCartography run, results from the phylogenomic profiling pipeline, and results from protein-protein interaction prediction
scripts/ contains: scripts for folding proteins using an Arcadia-hosted ESMFold API, further details in scripts/README_esmfold_scripts.md, scripts for protein-protein interaction prediction using D-SCRIPT and AlphaFold multimer, with further details in scripts/README_ppi_scripts.md, and an R script scripts/trait_mapping_functions.R that accompanies Notebook 04.

Methods

Protease inhibitor discovery:

Activate environment envs/pi-disco.yml.
Get proteins: 01_get_proteins.ipynb
Select protease inhibitors (PIs) and secreted proteins of unknown function (PUFs) <1200 aa, using annotations found in all_chelicerate_proteins.csv. PIs that we have selected are in ./datasheets/tick_PIs_1200.csv and secreted PUFs are in ./datasheets/tick_PUFs_1200.csv.
Fold tick proteins for structural analysis: We use an Arcadia-hosted ESMFold API to predict and download the folded proteins. All the scripts we used are in ./scripts, and detailed usage can be found in scripts/README_esmfold_scripts.md. The structures that we computed can be found in pi-disco_esmfold_structures.zip.
Download structures of Animal Toxins: We downloaded structures from UniProt Animal toxin annotation project using ProteinCartography scripts fetch_accession.py and fetch_uniprot_metadata.py. Our list of successfully downloaded structures can be found here.
Prepare metadata files to run ProteinCartography on tick proteins and other animal toxins: 02_make_metadata.ipynb
This notebook prepares the metadata file for tick proteins and toxin database proteins to go into ProteinCartography.
Use ProteinCartography to cluster tick proteins and toxins: We cloned the ProteinCartography repo, and ran ProteinCartography on the combined tick and toxin database structures using cluster mode. Our config file can be found here and our full results can be found in tick_PUFs_PIs_1200_plus_toxinDB_carto_run3.zip. While the full dataset is too large to host on GitHub, we've moved the key files into an output folder here.
Analyze ProteinCartography results to get protease inhibitor gene families:03_analyze_cartography.ipynb
This notebook takes in the clustered output data from the ProteinCartography run on the tick proteins combined with the toxin database. By looking at semantic data (found in tick_PUFs_PIs_1200_plus_toxinDB_carto_run3.zip) and intra-cluster structural similarity, we find 7 clusters that are high quality protein inhibitors (2.5k proteins) and 4 clusters that are interesting but are much lower confidence (1.8 k proteins). We then combine these protease inhibitor sturctural clusters with NovelTree orthogroup data to select protease inhibitor orthogroups to put through trait mapping. The notebook selects 36 orthogroups that have either high quality or low quality protease inhibitor annotations for trait mapping.

Phylogenomic profiling:

Download NovelTree workflow output directory chelicerata-v1-10062023.
In R-Studio, run Notebook 04. This performs trait mapping on the 36 orthogroups that have either high quality or low quality protease inhibitor annotations, using evolutionary data from the NovelTree analysis. You can see our full session here, and full results can be found here.
To dig further into the results and select proteins for PPI prediction, activate the pi-disco environment. Launch Notebook 05, where we use model variable importance and individual conditional expectation (ICE) to select the gene families that most strongly and postively predict the detection-supression trait. We also filter by secretion signal and by expression level in the female Amblyomma americanum salivary gland to select the most promsing individual proteins to study using protein-protein interaction predicition. We select 10 TIL domain proteins, which can be found here.

Protein-protein interaction prediction:

Activate environment envs/ppi.yml
Collect sets of tick and human proteins. Human proteins and associated metadata are pulled down from Uniprot accessions with the fetch_accession.py and fetch_uniprot_metadata.py from ProteinCartography following the instructions there and providing TXT files of Uniprot accessions. All tick protein sequences used in this study can be found in 2024-06-24-all-chelicerate-noveltree-proteins.fasta. We specifically looked at these 10 proteins here, using Amblyomma-americanum_evm.model.contig-94090-1.4 as a representative, since it had one of the highest-quality structures of the group (pLDDT = 83.8).
Make sequence-based predictions of PPIs with D-script using the script make_dscript_PPI_predictions.py. Analysis of these results and attempted filtering are in scripts/dscript-results-analysis.R.
Prepare FASTA files for running with the Goolge Colab Batch AF-multimer notebook. This can be done either providing the separate bait and prey FASTA sequences that should be combined, or from a TSV of comparisons and a FASTA containing all the sequences
Analyze Alphafold results with the Jupyter notebook code in notebooks/06_calculate_afmultimer_predictions.ipynb and run with the LazyAF Google Colab notebook, but paste in the modified code in this notebook. The modified code adds lines to calculate the average pLDDT and include in the resulting CSV files for each comparison. Change the text in each cell to point to your data and rename the output CSV file. If you don't need the pLDDT information just use the default notebook code.
Save both the results and analysis folders from the AF-multimer predictions and the LazyAF analysis, which selects the highest ranking prediction JSON file and copies it into the analysis folder. Downloading the folder from Google Drive will automatically create a .zip archive and save this in your results folder for backup.
Analyze AF-multimer results compared to D-script predictions with notebooks/07_analyze-dscript-afmultimer-comparisons.ipynb
Code for analysis and plotting is in the notebooks/08_full-filtered-serine-protease-analysis.ipynb
We also calculate additional AFmultimer metrics with the colabfold_analysis.py script which is part of the AF2multimer-analysis toolkit that underlies the predictomes.org webserver. To use the script:
1. Download the raw AF2multimer results from Google drive and unzip them. Google will sometimes split up large downloads into multiple zip files and not all results files that go together are in the same subdirectories. Make sure for this script to work all the .json and .pae files together as well as a .done.txt file with the name of the comparison
2. Clone the directory with git clone https://github.com/walterlab-HMS/AF2multimer-analysis
3. Run the script with python3 colabfold_analysis.py <AF2MULTIMER_RESULTS_DIRECTORY>. This will create a new folder with the same name as your input directory with _analysis appended at the end. This creates three CSV files: a contacts, interfaces, and summary file. The summary.csv is the most useful for downstream steps and is how we calculated the pDockQ score for each interaction prediction. The contacts and interfaces files give more information for each residue the calculated metrics, whereas the summary file is an average across all residues.
Obtain the predicted expression of each human serine protease hit from the NCBI Gene Database. For each gene, we downloaded primary tissue expression data generated by Fagerberg et al. 2014, which analyzed gene expression in 27 different tissues from 95 healthy humans. Code for analysis and plotting is found in notebooks/08_full-filtered-serine-protease-analysis.ipynb

Compute Specifications

Scripts and analysis run on MacOS except for AF-multimer runs which were through Google Colab. Most of the runs on Google Colab were with A100 GPUs, and when not available with either V100 or T4 GPUs.

Contributing

See how we recognize feedback and contributions to our code

Name		Name	Last commit message	Last commit date
Latest commit History 95 Commits
.github		.github
datasheets		datasheets
envs		envs
notebooks		notebooks
outputs		outputs
scripts		scripts
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

2025-pi-disco-pub

Purpose

Installation and Setup

Data related to protease inhibitor discovery

Data related to phylogenomic profiling

Data related to Protein-protein interaction prediction

Overview

Description of the folder structure

Methods

Compute Specifications

Contributing

About

Uh oh!

Releases 1

Packages

Contributors 2

Uh oh!

Languages

License

Arcadia-Science/2025-pi-disco-pub

Folders and files

Latest commit

History

Repository files navigation

2025-pi-disco-pub

Purpose

Installation and Setup

Data related to protease inhibitor discovery

Data related to phylogenomic profiling

Data related to Protein-protein interaction prediction

Overview

Description of the folder structure

Methods

Compute Specifications

Contributing

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Uh oh!

Languages

Packages