Workflow for the analysis and evaluation of fragSMILES in de novo drug design, as presented in the related work
The notebook files are described below, each with its aim, in the order of the workflow steps (following their numbering).
The entire workflow is written in Python. Neural network training relies on the PyTorch package. All requirements are listed in requirements.txt. Essential Python packages include chemicalgof, for the molecular decomposition of each dataset, and MOSES, for the RNN model architecture and metric evaluation. Make sure these packages are installed before cloning this repository:
git clone https://github.com/f48r1/fragsmiles.git
After cloning, navigate to the folder and install the required packages for the pipeline (pip is suggested):
cd fragsmiles/
pip install -r requirements.txt
The experiments/scripts folder has been adapted from the MOSES benchmarking framework to accommodate the chemical notations used in our study during the training, sampling and evaluation phases. Because fragSMILES is a chemical-word-level notation, the char-RNN model architecture originally provided by MOSES was modified into a tailored word-RNN model architecture, found in experiments/scripts/word_rnn.
RNN models were trained for each notation (SMILES, SELFIES, fragSMILES) and each hyperparameter setting. In the experiments/ folder, you will find the train_single_model.sh, sample_single_model.sh, and eval_single_model.sh bash script files. These can be used to start the training, sampling, or evaluation phase for a single model. The scripts accept arguments that specify the chemical notation, the hyperparameters, and other parameters. They must be run from the directory that contains them (experiments/). Additionally, if you are using a conda or pure Python environment, it needs to be activated within the script.
For example, the following commands train an RNN model on the first fold of the Grisoni dataset, represented in fragSMILES notation, for 16 epochs with 2 hidden layers, 512 hidden units, a batch size of 512, a learning rate of 0.001, and an embedding size of 300. The device can be set to either CPU or GPU:
cd experiments/
bash train_single_model.sh --dataset grisoni --fold 0 --notation fragsmiles --hl 2 --hu 512 --bs 512 --lr 0.001 --es 300 --device cuda:0 --epochs 16
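Because one model is trained per hyperparameter setting, it can be convenient to generate the command lines programmatically. The snippet below is a minimal sketch, not part of the repository: the helper `build_train_commands` and the extra hidden-unit value 256 are hypothetical, and only the flags shown in the example above are used. It prints the commands rather than running them.

```python
import itertools
import shlex


def build_train_commands(dataset, fold, notation, grid, device="cuda:0", epochs=16):
    """Return one train_single_model.sh command line per hyperparameter combination.

    `grid` maps a flag name (without the leading dashes) to a list of values.
    Flags are emitted in the insertion order of `grid`.
    """
    keys = list(grid)
    commands = []
    for values in itertools.product(*(grid[key] for key in keys)):
        args = ["bash", "train_single_model.sh",
                "--dataset", dataset,
                "--fold", str(fold),
                "--notation", notation]
        for key, value in zip(keys, values):
            args += [f"--{key}", str(value)]
        args += ["--device", device, "--epochs", str(epochs)]
        commands.append(shlex.join(args))
    return commands


# Hypothetical grid: only the value 512 for --hu comes from the example above.
grid = {"hl": [2], "hu": [256, 512], "bs": [512], "lr": [0.001], "es": [300]}
commands = build_train_commands("grisoni", 0, "fragsmiles", grid)
for command in commands:
    print(command)
```

Each printed line can then be run from the experiments/ folder, for instance by pasting it into a shell.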
Saved model weights are not included in this repository. However, training log files, generated sets of molecules and their evaluation metrics are stored under the respective settings within the experiments/ folder.
This notebook highlights the differences between the notations SMILES, SELFIES, and fragSMILES for the molecules depicted in Figure 1 and Figure 2. Specifically, it examines three known drugs composed of indole fragments and molecules with chiral stereocenters. The images have been saved in the figures/ path.
The Zinc250K dataset was decomposed starting from the zinc.csv file stored in the data folder. The original file was provided by Cheng et al. (Group SELFIES).
The dataset was curated using functions imported from the processer.py file. Each molecule was then decomposed by the standard cleavage rule of the GoF framework (for fragSMILES notation) and by the rotatable-bond rule. The resulting decomposed molecules are stored as text in results/01_zincToks.csv and results/01_zincRotatableToks.csv.
The decomposed Zinc250K molecules represented as fragSMILES were compared with Group SELFIES, SELFIES and SMILES by number of tokens (encoding length). The benchmarking file (stored in data/compactnessBenchmarking.csv and provided by Cheng et al.) was used for this purpose, and the comparison is plotted in the figure saved in figures/encodingCount.pdf.
The counts of unique tokens in results/01_zincToks.csv and results/01_zincRotatableToks.csv are compared. The comparison is summarized in the table saved in results/03_cleavageBenchmarking.csv.
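The comparison of unique token counts can be illustrated with a minimal sketch. It assumes each decomposed molecule is already available as a list of tokens; the function `token_stats` and the placeholder tokens are hypothetical and do not reproduce the notebook code or real fragSMILES strings.

```python
from collections import Counter


def token_stats(tokenized_molecules):
    """Summarize a tokenized dataset: vocabulary size and mean encoding length."""
    counts = Counter(token for molecule in tokenized_molecules for token in molecule)
    lengths = [len(molecule) for molecule in tokenized_molecules]
    return {
        "unique_tokens": len(counts),
        "mean_length": sum(lengths) / len(lengths),
    }


# Toy placeholder data, not taken from the results/ files.
standard_rule = [["A", "B", "C"], ["A", "C"]]
rotatable_rule = [["A", "B1", "B2", "C"], ["A", "C1", "C2"]]

standard_stats = token_stats(standard_rule)
rotatable_stats = token_stats(rotatable_rule)
print(standard_stats)
print(rotatable_stats)
```

In this toy case the second rule produces more and longer token sequences, which is the kind of difference the benchmarking table is meant to quantify.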
The whole molecule dataset provided by the MOSES benchmarking framework was decomposed into fragSMILES notation. The functions used for this purpose were imported from processer.py (multiprocessing was adopted to accelerate decomposition). The dataset, represented as SMILES, SELFIES and fragSMILES, was split into train and test sets using a 5-fold cross-validation scheme. These datasets were then stored in experiments/data/ following a schematic naming convention such as moses_train_*.tar.xz and moses_test_*.tar.xz, where * indicates the fold number.
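A 5-fold cross-validation split of this kind can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code used in the notebook: the function `k_fold_indices`, the shuffling and the seed are hypothetical, and the archive names simply follow the naming convention described above.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (fold, train_indices, test_indices) for a shuffled k-fold split."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k nearly equal parts.
    parts = [indices[start::k] for start in range(k)]
    for fold in range(k):
        test = sorted(parts[fold])
        train = sorted(i for j, part in enumerate(parts) if j != fold for i in part)
        yield fold, train, test


splits = list(k_fold_indices(10))
for fold, train, test in splits:
    print(f"moses_train_{fold}.tar.xz: {len(train)} molecules, "
          f"moses_test_{fold}.tar.xz: {len(test)} molecules")
```

Every molecule appears in exactly one test set, so the five test sets together cover the whole dataset.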
The whole molecule dataset provided by Grisoni et al. was first curated and then decomposed into fragSMILES notation. The functions used for this purpose were imported from processer.py (multiprocessing was adopted to accelerate decomposition). Molecules represented by a number of tokens in the range 10-32 were retained. The resulting molecules, represented as SMILES, SELFIES and fragSMILES, were then augmented up to 5 representations each. Finally, the 2 resulting datasets were split into train and test sets using a 5-fold cross-validation scheme. These datasets were then stored in experiments/data/ following a schematic naming convention such as grisoni_train_*.tar.xz, grisoni_test_*.tar.xz and grisoni_trainAug5_*.tar.xz, where * indicates the fold number.
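The token-count filter applied to the Grisoni dataset can be sketched as follows. The helper `keep_by_length` is hypothetical, and treating both bounds as inclusive is an assumption, since the text only gives the range 10-32.

```python
def keep_by_length(tokenized_molecules, min_tokens=10, max_tokens=32):
    """Keep molecules whose token count lies in [min_tokens, max_tokens].

    Inclusive bounds are an assumption made for this sketch.
    """
    return [
        molecule
        for molecule in tokenized_molecules
        if min_tokens <= len(molecule) <= max_tokens
    ]


# Toy molecules made of placeholder tokens, with 9, 10, 32 and 33 tokens.
toy = [["X"] * n for n in (9, 10, 32, 33)]
kept = keep_by_length(toy)
print([len(molecule) for molecule in kept])
```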
Molecules sampled from the RNN models trained on the MOSES dataset were evaluated, for each hyperparameter setting, using functions imported from the evaluation.py file. Note that the experiments/ folder does not contain saved model weights, but it does include log files, generated sets and the sets of novel generated molecules. Results are stored in the results/ folder.
This notebook performs the same evaluation as the metricMoses notebook described above, but on the Grisoni et al. dataset. An evaluation on chiral molecules is also provided. Results are stored in the results/ folder.
A scaffold analysis was carried out on the sets of novel molecules generated by the RNN models trained on the Grisoni dataset. The functions used for this purpose were imported from the evaluation.py file.