This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Link Invent Dataset inconsistent with the code base and prior model. #39

Open
vincrichard opened this issue May 24, 2023 · 0 comments

vincrichard commented May 24, 2023

Hello and thank you for the opensource repository.

I was going through LinkInvent and wanted to try training the model in a transfer learning (TL) fashion, using the dataset provided in ReinventCommunity/notebooks/data/linkinvent_prior_training_data together with the prior model. However, I think there was an error in how the dataset was created. I did this mainly to test the code, and I am aware that this particular TL run has no practical use.

The code expects the data to have warheads/inputs in the first column and linkers/targets in the second column. This can be seen in the code, as well as in the vocabulary of ReinventCommunity/notebooks/models/linkinvent.prior, which has * and | as input tokens and [*] as a target token.
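A quick way to check which order a given row is in is to look at those tokens. The sketch below is only a heuristic based on the token observation above: it assumes the warheads column contains `|` and no `[*]`, while the linker column contains `[*]` and no `|`. The function name `looks_like_expected_order` is hypothetical and is not part of the Reinvent code base.

```python
def looks_like_expected_order(first_column: str, second_column: str) -> bool:
    """Heuristic check that a row is ordered as (warheads/inputs, linker/target).

    Assumption: warheads are two fragments joined by '|' with bare '*' attachment
    points, and the linker marks its attachment points as '[*]'.
    """
    first_is_warheads = "|" in first_column and "[*]" not in first_column
    second_is_linker = "[*]" in second_column and "|" not in second_column
    return first_is_warheads and second_is_linker


# Example row taken from this issue.
warheads = "*C#CCO|*CCC#CCCCCCCC(C)C"
linker = "[*]C#CC(O)CCCCCCC[*]"

expected_order_ok = looks_like_expected_order(warheads, linker)
provided_order_ok = looks_like_expected_order(linker, warheads)
print(expected_order_ok, provided_order_ok)
```

With the example row, the check passes for the order the code expects and fails for the order found in the provided dataset.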

The dataset provided, however, has the following column layout:
Linkers/target ---- warheads/inputs ---- Full SMILES
[*]C#CC(O)CCCCCCC[*] ---- *C#CCO|*CCC#CCCCCCCC(C)C ---- CC(C)CCCCCCC#CCCCCCCCCCC(O)C#CC#CCO

The columns should be reordered to:

Warheads/inputs ---- linkers/target ---- Full SMILES
*C#CCO|*CCC#CCCCCCCC(C)C ---- [*]C#CC(O)CCCCCCC[*] ---- CC(C)CCCCCCC#CCCCCCCCCCC(O)C#CC#CCO
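
The column swap can be sketched as below. This is a minimal sketch, not the repository's own tooling: it assumes one record per line and a tab as the column separator (the actual delimiter of the data file is not stated here, so pass a different `sep` if needed). The helper names `swap_first_two_columns` and `swap_file` are hypothetical.

```python
def swap_first_two_columns(line: str, sep: str = "\t") -> str:
    """Return the line with its first two columns exchanged.

    Assumption: columns are separated by `sep` (tab by default).
    Lines with fewer than two columns are returned unchanged.
    """
    fields = line.rstrip("\n").split(sep)
    if len(fields) < 2:
        return sep.join(fields)
    fields[0], fields[1] = fields[1], fields[0]
    return sep.join(fields)


def swap_file(src_path: str, dst_path: str, sep: str = "\t") -> None:
    """Write a copy of src_path to dst_path with the first two columns swapped."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(swap_first_two_columns(line, sep) + "\n")


# Example row from this issue, in the order found in the provided dataset.
provided = "\t".join([
    "[*]C#CC(O)CCCCCCC[*]",
    "*C#CCO|*CCC#CCCCCCCC(C)C",
    "CC(C)CCCCCCC#CCCCCCCCCCC(O)C#CC#CCO",
])
fixed = swap_first_two_columns(provided)
print(fixed)
```

Writing to a separate output file keeps the original dataset untouched, so the two versions can be compared before training.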

I tried this myself, and after swapping the columns training worked fine.
This might not be a big issue, since TL is less important in the case of LinkInvent, and when training a new model the vocabulary is recreated from the data anyway. I still wanted to share this feedback, since the dataset does not match the code logic.
