Note
For further insights into the nnet-enzyme framework, refer to the corresponding paper accessible here.
Malte A. Weyrich, T. Truc Bui, Jan P. Hummel, Sofiia Kozoriz
Technical University of Munich | Technische Universität München
Contact via: [email protected]
Follow these steps to set up and configure the environment for your specific needs:

- Environment Setup:
  - Clone this repository:
    `git clone https://github.com/github4touchdouble/nnet-enzyme.git`
  - Install the necessary dependencies and libraries, ensuring compatibility with your system specifications:
    `pip install -r './requirements.txt'`
  - Create a `.env` file in the root of the project:
    `nano .env`
Then add the following lines and adjust them to your individual needs:

```
# --------------------------------
# Non-enzymatic protein data
# --------------------------------
# <WIP>
FASTA_NON_ENZYMES='PATH/TO/NON_ENZYME/FASTA'
FASTA_ENZYMES='PATH/TO/ENZYME/FASTA'
PROTT5_NON_ENZYMES='PATH/TO/NON_ENZYME/PROTT5'  # -- optional: based on your needs
ESM2_NON_ENZYMES='PATH/TO/NON_ENZYME/ESM2'      # -- optional: based on your needs
OHE_NON_ENZYMES='PATH/TO/NON_ENZYME/OHE'        # >> i.e. provide one-hot-encoded protein sequences

# ----------------------------
# Enzymatic protein data
# ----------------------------
# Enzyme, enzyme commission number, amino acid sequence
# CSV file: <Identifier>,<EC>,<Sequence> ~> C7C422,3.5.2.6,MEL...KLR
# If you intend to train models on datasets with varying levels of redundancy reduction,
# replace "X" with the percentage of similarity at which two sequences are deemed duplicates.
# Customize this as needed for your specific requirements. Refer to the "Run configuration"
# section for ESSENTIAL considerations before initiating a project.
CSVX_ENZYMES='PATH/TO/ENZYME/SPLITX'

# Enzyme, protein embedding vector
# H5 file: <Identifier>,<Embedding> ~> A0A024RBG1,[-0.015143169, 0.035552002, -0.02231326, ...]
# If you intend to train models on datasets with varying levels of redundancy reduction,
# replace "X" with the percentage of similarity at which two sequences are deemed duplicates.
# nnet-enzyme supports ESM2, ProtT5, and one-hot encoded vectors.
# Customize this as needed for your specific requirements. Refer to the "Run configuration"
# section for ESSENTIAL considerations before initiating a project.
ESM2_ENZYMES_SPLIT_X='PATH/TO/ENZYME/ESM2/SPLIT_X'      # i.a.
PROTT5_ENZYMES_SPLIT_X='PATH/TO/ENZYME/PROTT5/SPLIT_X'  # i.a.
OHE_ENZYMES_SPLIT_X='PATH/TO/ENZYME/OHE'                # i.a.
```
- Run configuration:

  Important: Without following the run configuration instructions below, you won't be able to run nnet-enzyme.
WIP
- Classification Pipelines:
  - Execute the provided Jupyter notebooks to follow the classification pipeline detailed in the accompanying paper.
  - To integrate nnet-enzyme into a custom pipeline, adapt the provided code to your framework's requirements and merge the relevant components into your project.
WIP
To use the configured paths in your own scripts, load the environment variables with python-dotenv:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # load environment variables, should return True

abs_path_to_split30 = os.getenv("CSV30_ENZYMES")
abs_path_to_non_enzyme_fasta = os.getenv("FASTA_NON_ENZYMES")
[...]
```
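Optionally, you can verify that the configured paths actually resolve before running any pipeline. Below is a minimal sketch; the variable names are taken from the `.env` template above, so adjust the list to whichever variables you actually set:

```python
import os
from dotenv import load_dotenv

load_dotenv()

# Sanity check: warn about unset variables or paths that do not exist on disk
for var in ["FASTA_ENZYMES", "FASTA_NON_ENZYMES", "ESM2_NON_ENZYMES", "CSV30_ENZYMES"]:
    path = os.getenv(var)
    if path is None:
        print(f"{var} is not set in .env")
    elif not os.path.exists(path):
        print(f"{var} points to a missing file: {path}")
```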
Make sure to add the `.env` file to the `.gitignore` file so that the environment variables are not pushed to the repository. In `.gitignore`, add the following line:

```
.env
```
First, import the `load_ml_data` function:

```python
from data_manipulation import load_ml_data
```

Then use it to load the enzyme data:

```python
enzyme_csv = os.getenv("CSVX_ENZYMES")           # replace X with the number of the split you want to use
enzyme_esm2 = os.getenv("ESM2_ENZYMES_SPLIT_X")  # replace X with the number of the split you want to use
X_enzymes, y_enzymes = load_ml_data(path_to_esm2=enzyme_esm2, path_to_enzyme_csv=enzyme_csv)
```
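As a quick, optional check (assuming `load_ml_data` returns NumPy-compatible arrays), you can confirm that the embedding matrix and the label vector line up:

```python
# One row of embeddings per enzyme, one label per row
print(X_enzymes.shape, y_enzymes.shape)
assert len(X_enzymes) == len(y_enzymes)
```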
Since we don't have a .csv file for our non-enzymes, we use the `load_non_enzyme_esm2` function instead:

```python
import numpy as np  # needed below to combine the two datasets

path_to_non_ez_fasta = os.getenv("FASTA_NON_ENZYMES")
path_to_non_ez_esm2 = os.getenv("ESM2_NON_ENZYMES")
X_non_enzymes, y_non_enzymes = load_non_enzyme_esm2(non_enzymes_fasta_path=path_to_non_ez_fasta, non_enzymes_esm2_path=path_to_non_ez_esm2)

# Combine enzyme and non-enzyme data into one feature matrix and one label vector
X = np.vstack((X_enzymes, X_non_enzymes))
y = np.hstack((y_enzymes, y_non_enzymes))
```
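From here, `X` and `y` can be fed into any downstream model; the architectures evaluated in the paper live in the provided notebooks. Purely as an illustrative baseline, and assuming scikit-learn is installed (it is not necessarily part of `requirements.txt`), a sketch might look like this:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Illustrative baseline only -- not the nnet-enzyme model from the paper
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```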