ProtEx

This repository contains open source code related to the paper ProtEx: A Retrieval-Augmented Approach for Protein Function Prediction.

Installation

Clone the repository:

git clone https://github.com/google-deepmind/protex.git

It is then recommended to set up a virtual environment. We provide an example using conda:

conda create -n protex python=3.10
conda activate protex

Then install dependencies specified in setup.py:

pip install .

Overview

The code, along with the released model predictions, supports reproducing the main results from the paper. The code is organized as follows:

  • blast/ - Contains conversion scripts for reproducing BLAST results.
  • common/ - Some common utility libraries.
  • data/ - Contains conversion scripts for various datasets to a common format.
  • eval/ - Contains tools for computing various evaluation metrics.

We convert datasets to a common format consisting of newline-separated JSON (JSONL) files, where each line is a JSON object with the following keys (see the example after this list):

  • sequence - String containing the protein sequence.
  • accession - String with a unique identifier, e.g. a UniProt accession.
  • labels - List of label strings, e.g. EC numbers.
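
For illustration, a converted file can be loaded with a few lines of Python; the path and the example record below are hypothetical:

import json

# Load a converted split into memory. Each line is a JSON object with the
# "sequence", "accession", and "labels" keys described above.
with open("/tmp/proteinfer_clustered_ec_test.jsonl") as f:
    records = [json.loads(line) for line in f]

# A record might look like:
# {"sequence": "MKT...", "accession": "P12345", "labels": ["1.1.1.1"]}
print(records[0]["accession"], records[0]["labels"])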

Usage Examples

ProteInfer

Here we provide a usage example focused on reproducing the results for the ProteInfer dataset on the clustered EC split. Conversion and evaluation scripts for other datasets can be found in data/ and eval/, and their usage is similar.

The original dataset is available on GCP at gs://brain-genomics-public/research/proteins/proteinfer/datasets/swissprot/. We can set our input to the path of the EC clustered test split:

CLUSTERED_EC_TEST_TFR="gs://brain-genomics-public/research/proteins/proteinfer/datasets/swissprot/clustered/test.tfrecord"

We will assume that the variable DATA_DIR is set to a readable and writable directory, such as DATA_DIR=/tmp/.

We can then run the data conversion script:

CLUSTERED_EC_TEST_JSONL="${DATA_DIR}/proteinfer_clustered_ec_test.jsonl"
python -m data.convert_proteinfer \
--alsologtostderr \
--input=${CLUSTERED_EC_TEST_TFR} \
--output=${CLUSTERED_EC_TEST_JSONL} \
--labels=ec
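
Conceptually, this conversion iterates over serialized tf.train.Example records and emits one JSON object per line. A hypothetical sketch; the feature names below ("sequence", "id", "label") are assumptions for illustration, not the actual ProteInfer schema:

import json
import tensorflow as tf

def tfrecord_to_jsonl(input_path, output_path):
    # Assumed schema: a string sequence, a string identifier, and a
    # variable-length list of string labels per example.
    feature_spec = {
        "sequence": tf.io.FixedLenFeature([], tf.string),
        "id": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.VarLenFeature(tf.string),
    }
    with open(output_path, "w") as fout:
        for raw in tf.data.TFRecordDataset(input_path):
            example = tf.io.parse_single_example(raw, feature_spec)
            labels = [l.decode() for l in
                      tf.sparse.to_dense(example["label"]).numpy()]
            fout.write(json.dumps({
                "sequence": example["sequence"].numpy().decode(),
                "accession": example["id"].numpy().decode(),
                "labels": labels,
            }) + "\n")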

Model predictions for ProtEx on all test splits are available at gs://protex/predictions. Specifically, the clustered EC predictions are here:

PREDS_PROTEX=gs://protex/predictions/proteinfer-clustered-ec-test-protex.jsonl

We can then reproduce the max micro-averaged F1 metrics reported for this split with the following script:

python -m eval.eval_micro_f1 \
--alsologtostderr \
--dataset=${CLUSTERED_EC_TEST_JSONL} \
--predictions=${PREDS_PROTEX}
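
For reference, the "max" in max micro-averaged F1 comes from sweeping a decision threshold over the predicted label scores and keeping the best micro F1. A minimal sketch of the metric itself, not the repository's implementation; gold and scored_preds are assumed to be mappings keyed by accession:

def max_micro_f1(gold, scored_preds, thresholds):
    # gold: accession -> set of true labels.
    # scored_preds: accession -> list of (label, score) pairs.
    best = 0.0
    for t in thresholds:
        tp = fp = fn = 0
        for accession, true_labels in gold.items():
            predicted = {label for label, score in
                         scored_preds.get(accession, []) if score >= t}
            tp += len(predicted & true_labels)
            fp += len(predicted - true_labels)
            fn += len(true_labels - predicted)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best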

We have also released BLAST predictions, so the above script can also be used with the following --predictions argument to reproduce the reported BLAST results:

PREDS_BLAST=gs://protex/predictions/proteinfer-clustered-ec-test-blast.jsonl

Reproducing BLAST

We have also released code to reproduce the BLAST predictions. For this, we also need to convert the ProteInfer training set:

CLUSTERED_EC_TRAIN_TFR="gs://brain-genomics-public/research/proteins/proteinfer/datasets/swissprot/clustered/train.tfrecord"
CLUSTERED_EC_TRAIN_JSONL="${DATA_DIR}/proteinfer_clustered_ec_train.jsonl"
python -m data.convert_proteinfer \
--alsologtostderr \
--input=${CLUSTERED_EC_TRAIN_TFR} \
--output=${CLUSTERED_EC_TRAIN_JSONL} \
--labels=ec

We then need to convert both the train and test splits to .fasta format:

CLUSTERED_EC_TRAIN_FASTA="${DATA_DIR}/proteinfer_clustered_ec_train.fasta"
python -m blast.convert_to_fasta \
--alsologtostderr \
--input=${CLUSTERED_EC_TRAIN_JSONL} \
--output=${CLUSTERED_EC_TRAIN_FASTA}

CLUSTERED_EC_TEST_FASTA="${DATA_DIR}/proteinfer_clustered_ec_test.fasta"
python -m blast.convert_to_fasta \
--alsologtostderr \
--input=${CLUSTERED_EC_TEST_JSONL} \
--output=${CLUSTERED_EC_TEST_FASTA}
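
The conversion itself is simple; a minimal sketch of the JSONL-to-FASTA step, assuming one FASTA entry per record with the accession as the header (the actual blast/convert_to_fasta script may differ in details):

import json

def jsonl_to_fasta(input_path, output_path):
    # Write one FASTA entry per JSONL record, using the accession as header.
    with open(input_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            record = json.loads(line)
            fout.write(f">{record['accession']}\n{record['sequence']}\n")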

Note that if DATA_DIR refers to a GCP bucket rather than a local directory, the files may need to be copied locally (e.g. with gsutil cp) so that they can be read by the BLAST command line tools before proceeding to the next step. We will assume BLAST_DIR is set to the location of the BLAST binaries, e.g. BLAST_DIR=".../ncbi-blast-2.14.1+/bin".

We can then run BLAST:

BLAST_TSV="${DATA_DIR}/blast_proteinfer_clustered_ec_test.tsv"
${BLAST_DIR}/makeblastdb -in ${CLUSTERED_EC_TRAIN_FASTA} -dbtype prot
${BLAST_DIR}/blastp -query ${CLUSTERED_EC_TEST_FASTA} -db ${CLUSTERED_EC_TRAIN_FASTA} -outfmt 6 -max_hsps 1 -num_threads 16 -max_target_seqs 1 -out ${BLAST_TSV}

Finally, we can convert the TSV file generated by BLAST (the tabular -outfmt 6 output, one row per query-hit pair) to the standard predictions format we are using:

BLAST_JSONL="${DATA_DIR}/blast_proteinfer_clustered_ec_test.jsonl"
python -m blast.convert_blast \
--alsologtostderr \
--input=${BLAST_TSV} \
--database_records=${CLUSTERED_EC_TRAIN_FASTA} \
--output=${BLAST_JSONL}
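
Conceptually, this step transfers the labels of each query's top training hit to that query. A rough sketch under stated assumptions: the output keys ("accession", "predictions") and the identity-based score are illustrative rather than the repository's actual schema, and labels are read here from the training JSONL instead of the FASTA database records that blast.convert_blast takes:

import csv
import json

def blast_to_predictions(blast_tsv, train_jsonl, output_jsonl):
    # Map each training accession to its labels.
    train_labels = {}
    with open(train_jsonl) as f:
        for line in f:
            record = json.loads(line)
            train_labels[record["accession"]] = record["labels"]
    # -outfmt 6 columns begin with: query id, subject id, % identity.
    with open(blast_tsv) as fin, open(output_jsonl, "w") as fout:
        seen = set()
        for row in csv.reader(fin, delimiter="\t"):
            query, subject, identity = row[0], row[1], float(row[2])
            if query in seen:  # -max_target_seqs 1 should already ensure this
                continue
            seen.add(query)
            predictions = [[label, identity / 100.0]
                           for label in train_labels[subject]]
            fout.write(json.dumps(
                {"accession": query, "predictions": predictions}) + "\n")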

Citing this work

You can cite the preprint of our work as follows:

@article{shaw2024protex,
  title={ProtEx: A Retrieval-Augmented Approach for Protein Function Prediction},
  author={Shaw, Peter and Gurram, Bhaskar and Belanger, David and Gane, Andreea and Bileschi, Maxwell L and Colwell, Lucy J and Toutanova, Kristina and Parikh, Ankur P},
  journal={bioRxiv},
  url={https://www.biorxiv.org/content/early/2024/06/02/2024.05.30.596539},
  year={2024},
}

License and disclaimer

Copyright 2024 DeepMind Technologies Limited

All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0

All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode

Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.

This is not an official Google product.