Project name: Galaxy tool recommendation using Transformers

Project home page: https://github.com/anuprulez/galaxy_tool_recommendation_transformers

Example data: https://github.com/anuprulez/galaxy_tool_recommendation_transformers/tree/master/data/test_data

Complete data: https://doi.org/10.5281/zenodo.7825973

Operating system(s): Linux

License: MIT License

## (To reproduce this work) How to create a sample tool recommendation model:

**Note**: To reproduce this work by training on the complete set of workflows (~60,000), a decent compute resource (at least 10 GB of RAM) is required and training takes > 24 hours. However, the following steps can be used to create a sample tool recommendation model on a subset of workflows:

1. Install the dependencies by executing the following lines:
* `conda env create -f environment.yml`
* `conda activate tool_prediction_transformers`

2. Execute `sh train.sh` (https://github.com/anuprulez/galaxy_tool_recommendation_transformers/blob/master/train.sh). It runs on a subset of workflows.

3. After it finishes successfully (~2-3 minutes), a trained model is created at `log/saved_model/<<last training iteration>>/tf_model_h5/<<model.h5>>`.
4. To run on the complete data: all datasets are shared at https://doi.org/10.5281/zenodo.7825973. Download the two tabular files, add their paths to the `train.sh` file and execute it.
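Once training finishes, the saved H5 model can be loaded for inference. The sketch below is not part of the repository: the iteration number (`20`), the tool ids and the input encoding are assumptions for illustration, and the actual preprocessing lives in `scripts/main.py`.

```python
"""Minimal sketch (not part of the repository): load the trained H5 model
and ask it for tool recommendations. The iteration number, tool ids and
input encoding are assumptions; see scripts/main.py for the real pipeline."""
import numpy as np
import tensorflow as tf

# Custom transformer layers, if any, may need to be registered via
# `custom_objects` when loading.
model = tf.keras.models.load_model("log/saved_model/20/tf_model_h5/model.h5")

# A tool path encoded as integer ids, zero-padded to the maximum
# sequence length set by `-pl 25` in train.sh.
tool_sequence = np.zeros((1, 25), dtype=np.int64)
tool_sequence[0, -3:] = [12, 48, 7]  # hypothetical tool ids

scores = model.predict(tool_sequence)[0]  # one score per tool in the vocabulary
top_10 = np.argsort(scores)[::-1][:10]    # ids of the 10 highest-scoring tools
print(top_10)
```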

## Plots:

## Precision@k for Transformer, RNN, CNN and DNN

![Precision@k](https://raw.githubusercontent.com/anuprulez/galaxy_tool_recommendation_transformers/master/plots/prec_k_transformer_rnn_cnn_dnn.png "Precision@k")

## Precision@k for Transformer, RNN, CNN and DNN for infrequent tools

![Precision@k](https://raw.githubusercontent.com/anuprulez/galaxy_tool_recommendation_transformers/master/plots/prec_low_prec.png "Precision@k")
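For reference, a minimal sketch of the precision@k metric shown in these plots: the fraction of the model's top-k recommended tools that are correct next tools. All names and numbers are illustrative, and the repository's evaluation code may compute or normalise it differently.

```python
"""Minimal illustration of precision@k: the fraction of the model's top-k
recommendations that are correct next tools. Names and numbers are
illustrative; the repository's evaluation may normalise differently."""
import numpy as np

def precision_at_k(scores: np.ndarray, true_tools: set, k: int) -> float:
    top_k = np.argsort(scores)[::-1][:k]  # ids of the k highest-scoring tools
    hits = sum(1 for tool_id in top_k if tool_id in true_tools)
    return hits / k

# 3 of the top-5 recommended tools are correct -> precision@5 = 0.6
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2])
print(precision_at_k(scores, true_tools={1, 3, 5}, k=5))
```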

## Attention scores

Execute the data extraction script `extract_data.sh` to extract two tabular files -

### Description of all parameters mentioned in the training script:

`python <main python script> -wf <path to workflow file> -tu <path to tool usage file> -om <path to the final H5 model file> -cd <cutoff date to exclude old workflows> -pl <maximum length of tool path> -ti <number of training iterations> -nhd <number of attention heads> -ed <embedding dimensions> -fd <feed forward dimensions> -dt <dropout> -lr <learning rate> -ts <test data percentage> -trbs <training batch size> -tebs <test batch size> -trlg <train logging step> -telg <test logging step> -ud <use preprocessed data> --is_transformer <to use transformer or RNN> --model_type <use one of transformer, rnn, cnn or dnn> --restart_step <use step of last training>`

### (To reproduce this work on the complete set of workflows) Example command:

`python scripts/main.py -wf data/aug_22/wf-subset.csv -tu data/aug_22/tool_popularity_Aug_22.csv -om data/aug_22/tool_recommendation_model.hdf5 -cd '2017-12-31' -pl 25 -ti 200 -nhd 4 -ed 128 -fd 128 -dt 0.2 -lr 0.001 -ts 0.2 -trbs 512 -tebs 512 -trlg 10 -telg 10 -ud false --model_type transformer --restart_step 0`

## (For Galaxy admins) The following steps are only necessary for deploying on any Galaxy server:
