Note
For further insights into the nnet-enzyme framework, refer to the corresponding paper accessible here.
Malte A. Weyrich, T. Truc Bui, Jan P. Hummel, Sofiia Kozoriz
Technical University of Munich | Technische Universität München
Contact via: [email protected]
Follow these steps to set up and configure the environment for your specific needs:

- Environment Setup:
  - Clone this repository:
    `git clone https://github.com/github4touchdouble/nnet-enzyme.git`
  - Install the necessary dependencies and libraries, ensuring compatibility with your system specifications:
    `pip install -r './requirements.txt'`
  - Create a `.env` file in the root of the project:
    `nano .env`
Then add the following lines and adjust them to your individual needs:

```
# --------------------------------
# Non-enzymatic protein data
# --------------------------------
# <WIP>
FASTA_NON_ENZYMES='PATH/TO/NON_ENZYME/FASTA'
FASTA_ENZYMES='PATH/TO/ENZYME/FASTA'
PROTT5_NON_ENZYMES='PATH/TO/NON_ENZYME/PROTT5'  # -- optional: based on your needs
ESM2_NON_ENZYMES='PATH/TO/NON_ENZYME/ESM2'      # -- optional: based on your needs
OHE_NON_ENZYMES='PATH/TO/NON_ENZYME/OHE'        # >> i.e. provide one-hot-encoded protein sequences

# ----------------------------
# Enzymatic protein data
# ----------------------------
# Enzyme, enzyme commission number, amino acid sequence
# CSV file: <Identifier>,<EC>,<Sequence> ~> C7C422,3.5.2.6,MEL...KLR
# If you intend to train models on datasets with varying levels of redundancy reduction,
# replace "X" with the percentage of similarity at which two sequences are deemed duplicates.
# Customize this as needed for your specific requirements. Refer to the "Run configuration"
# section for ESSENTIAL considerations before initiating a project.
CSVX_ENZYMES='PATH/TO/ENZYME/SPLITX'

# Enzyme, protein embedding vector
# H5 file: <Identifier>,<Embedding> ~> A0A024RBG1,[-0.015143169, 0.035552002, -0.02231326, ...]
# If you intend to train models on datasets with varying levels of redundancy reduction,
# replace "X" with the percentage of similarity at which two sequences are deemed duplicates.
# nnet-enzyme supports ESM2, ProtT5, and one-hot encoded vectors.
# Customize this as needed for your specific requirements. Refer to the "Run configuration"
# section for ESSENTIAL considerations before initiating a project.
ESM2_ENZYMES_SPLIT_X='PATH/TO/ENZYME/ESM2/SPLIT_X'      # i.a.
PROTT5_ENZYMES_SPLIT_X='PATH/TO/ENZYME/PROTT5/SPLIT_X'  # i.a.
OHE_ENZYMES_SPLIT_X='PATH/TO/ENZYME/OHE'                # i.a.
```
- Run configuration:

  Important: Without following the run configuration instructions below, you won't be able to run nnet-enzyme.
WIP
- Classification Pipelines:
  - Execute the provided Jupyter notebooks to follow the classification pipeline detailed in the accompanying paper.
  - To integrate nnet-enzyme into a custom pipeline, adapt the provided code to your framework's requirements and merge the relevant components into your project.
WIP
To use the configured paths in your own scripts, load the environment variables with python-dotenv:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # load environment variables, should return True

abs_path_to_split30 = os.getenv("CSV30_ENZYMES")
abs_path_to_non_enzyme_fasta = os.getenv("FASTA_NON_ENZYMES")
[...]
```
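Optionally, you can verify that the configured paths actually resolve before running any pipeline. Below is a minimal sketch; the variable names are taken from the `.env` template above, so adjust the list to whichever variables you actually set:

```python
import os
from dotenv import load_dotenv

load_dotenv()

# Sanity check: warn about unset variables or paths that do not exist on disk
for var in ["FASTA_ENZYMES", "FASTA_NON_ENZYMES", "ESM2_NON_ENZYMES", "CSV30_ENZYMES"]:
    path = os.getenv(var)
    if path is None:
        print(f"{var} is not set in .env")
    elif not os.path.exists(path):
        print(f"{var} points to a missing file: {path}")
```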
Make sure to add the `.env` file to the `.gitignore` file so that the environment variables are not pushed to the repository. In `.gitignore`, add the following line:

```
.env
```
First, import the `load_ml_data` function:

```python
from data_manipulation import load_ml_data
```

Then use it to load the enzyme data:

```python
enzyme_csv = os.getenv("CSVX_ENZYMES")           # replace X with the number of the split you want to use
enzyme_esm2 = os.getenv("ESM2_ENZYMES_SPLIT_X")  # replace X with the number of the split you want to use
X_enzymes, y_enzymes = load_ml_data(path_to_esm2=enzyme_esm2, path_to_enzyme_csv=enzyme_csv)
```
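As a quick, optional check (assuming `load_ml_data` returns NumPy-compatible arrays), you can confirm that the embedding matrix and the label vector line up:

```python
# One row of embeddings per enzyme, one label per row
print(X_enzymes.shape, y_enzymes.shape)
assert len(X_enzymes) == len(y_enzymes)
```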
Since we don't have a .csv file for our non-enzymes, we use the `load_non_enzyme_esm2` function instead:

```python
import numpy as np  # needed below to combine the two datasets

path_to_non_ez_fasta = os.getenv("FASTA_NON_ENZYMES")
path_to_non_ez_esm2 = os.getenv("ESM2_NON_ENZYMES")
X_non_enzymes, y_non_enzymes = load_non_enzyme_esm2(non_enzymes_fasta_path=path_to_non_ez_fasta, non_enzymes_esm2_path=path_to_non_ez_esm2)

# Combine enzyme and non-enzyme data into one feature matrix and one label vector
X = np.vstack((X_enzymes, X_non_enzymes))
y = np.hstack((y_enzymes, y_non_enzymes))
```
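From here, `X` and `y` can be fed into any downstream model; the architectures evaluated in the paper live in the provided notebooks. Purely as an illustrative baseline, and assuming scikit-learn is installed (it is not necessarily part of `requirements.txt`), a sketch might look like this:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Illustrative baseline only -- not the nnet-enzyme model from the paper
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```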