Project name: Galaxy tool recommendation using Transformers

Project home page: https://github.com/anuprulez/galaxy_tool_recommendation_transformers

Example data: https://github.com/anuprulez/galaxy_tool_recommendation_transformers/tree/master/data/test_data

Complete data: https://doi.org/10.5281/zenodo.7825973

Operating system(s): Linux

License: MIT License

## (To reproduce this work) How to create a sample tool recommendation model:

**Note**: To reproduce this work by training on the complete set of workflows (~60,000), a decent compute resource (at least 10 GB of RAM) is required and training takes > 24 hours. However, the following steps can be used to create a sample tool recommendation model on a subset of workflows:

1. Install the dependencies by executing the following lines:
* `conda env create -f environment.yml`
* `conda activate tool_prediction_transformers`

2. Execute `sh train.sh` (https://github.com/anuprulez/galaxy_tool_recommendation_transformers/blob/master/train.sh). It runs on a subset of workflows.

3. After it finishes successfully (~2-3 minutes), a trained model is created at `log/saved_model/<<last training iteration>>/tf_model_h5/<<model.h5>>`.
4. To run on the complete data: all datasets are shared at https://doi.org/10.5281/zenodo.7825973. Download the two tabular files, add their paths to the `train.sh` file and execute it.
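Once training finishes, the saved H5 model can be loaded for inference. The sketch below is not part of the repository: the iteration number (`20`), the tool ids and the input encoding are assumptions for illustration, and the actual preprocessing lives in `scripts/main.py`.

```python
"""Minimal sketch (not part of the repository): load the trained H5 model
and ask it for tool recommendations. The iteration number, tool ids and
input encoding are assumptions; see scripts/main.py for the real pipeline."""
import numpy as np
import tensorflow as tf

# Custom transformer layers, if any, may need to be registered via
# `custom_objects` when loading.
model = tf.keras.models.load_model("log/saved_model/20/tf_model_h5/model.h5")

# A tool path encoded as integer ids, zero-padded to the maximum
# sequence length set by `-pl 25` in train.sh.
tool_sequence = np.zeros((1, 25), dtype=np.int64)
tool_sequence[0, -3:] = [12, 48, 7]  # hypothetical tool ids

scores = model.predict(tool_sequence)[0]  # one score per tool in the vocabulary
top_10 = np.argsort(scores)[::-1][:10]    # ids of the 10 highest-scoring tools
print(top_10)
```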

## Plots:

## Precision@k for Transformer, RNN, CNN and DNN

![Precision@k](https://raw.githubusercontent.com/anuprulez/galaxy_tool_recommendation_transformers/master/plots/prec_k_transformer_rnn_cnn_dnn.png "Precision@k")

## Precision@k for Transformer, RNN, CNN and DNN for infrequent tools

![Precision@k](https://raw.githubusercontent.com/anuprulez/galaxy_tool_recommendation_transformers/master/plots/prec_low_prec.png "Precision@k")
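For reference, a minimal sketch of the precision@k metric shown in these plots: the fraction of the model's top-k recommended tools that are correct next tools. All names and numbers are illustrative, and the repository's evaluation code may compute or normalise it differently.

```python
"""Minimal illustration of precision@k: the fraction of the model's top-k
recommendations that are correct next tools. Names and numbers are
illustrative; the repository's evaluation may normalise differently."""
import numpy as np

def precision_at_k(scores: np.ndarray, true_tools: set, k: int) -> float:
    top_k = np.argsort(scores)[::-1][:k]  # ids of the k highest-scoring tools
    hits = sum(1 for tool_id in top_k if tool_id in true_tools)
    return hits / k

# 3 of the top-5 recommended tools are correct -> precision@5 = 0.6
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2])
print(precision_at_k(scores, true_tools={1, 3, 5}, k=5))
```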

## Attention scores

Execute the data extraction script `extract_data.sh` to extract two tabular files -

### Description of all parameters mentioned in the training script:

`python <main python script> -wf <path to workflow file> -tu <path to tool usage file> -om <path to the final H5 model file> -cd <cutoff date to exclude old workflows> -pl <maximum length of tool path> -ti <number of training iterations> -nhd <number of attention heads> -ed <embedding dimensions> -fd <feed forward dimensions> -dt <dropout> -lr <learning rate> -ts <test data percentage> -trbs <training batch size> -tebs <test batch size> -trlg <train logging step> -telg <test logging step> -ud <use preprocessed data> --is_transformer <to use transformer or RNN> --model_type <use one of transformer, rnn, cnn or dnn> --restart_step <use step of last training>`

### (To reproduce this work on the complete set of workflows) Example command:

`python scripts/main.py -wf data/aug_22/wf-subset.csv -tu data/aug_22/tool_popularity_Aug_22.csv -om data/aug_22/tool_recommendation_model.hdf5 -cd '2017-12-31' -pl 25 -ti 200 -nhd 4 -ed 128 -fd 128 -dt 0.2 -lr 0.001 -ts 0.2 -trbs 512 -tebs 512 -trlg 10 -telg 10 -ud false --model_type transformer --restart_step 0`

## (For Galaxy admins) The following steps are only necessary for deploying on any Galaxy server:
