Issue with only train data being used for vocab creation #15

rangwani-harsh · 2019-05-18T16:03:42Z

Hi team,
Thanks for this wonderful repo. The code is generic and easy to reuse. I wanted to ask about vocab creation: in all the models, only tokens from the training data are used to build the vocab.

"datasets_for_vocab_creation": ["train"]

With the multitask model, the vocab is built from the training tokens of all the datasets, so it is large and covers many tokens. A test token therefore has a high probability of being found in that vocab. With a single-task model, the vocab is smaller, so a test token is much more likely to be OOV (out of vocabulary).
So how do we make sure that the improvements come from multitask learning itself rather than from the larger vocabulary coverage that the multitask setup brings?
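
To make the concern concrete, here is a minimal sketch of how the coverage gap could be measured. It is not code from this repo: the helper names (`build_vocab`, `oov_rate`) and the toy token lists are made up for illustration, and a real check would read the tokens from the actual dataset readers.

```python
def build_vocab(token_lists):
    """Collect the set of distinct tokens from several tokenized datasets."""
    vocab = set()
    for tokens in token_lists:
        vocab.update(tokens)
    return vocab


def oov_rate(test_tokens, vocab):
    """Fraction of test tokens that are missing from the vocab."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for tok in test_tokens if tok not in vocab)
    return missing / len(test_tokens)


# Toy data (hypothetical): training tokens of two tasks, test tokens of task A.
task_a_train = ["the", "cat", "sat"]
task_b_train = ["a", "dog", "ran", "home"]
task_a_test = ["the", "dog", "sat", "home"]

# Single-task vocab: only task A's training tokens.
single_vocab = build_vocab([task_a_train])
# Multitask vocab: training tokens of both tasks.
multi_vocab = build_vocab([task_a_train, task_b_train])

single_oov = oov_rate(task_a_test, single_vocab)
multi_oov = oov_rate(task_a_test, multi_vocab)
print(f"single-task OOV rate: {single_oov:.2f}")
print(f"multitask OOV rate:   {multi_oov:.2f}")
```

If the two rates differ a lot on the real test sets, one possible control would be to train the single-task model with the multitask vocab, so that both setups see the same coverage.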

The other point is that a vocab built only from training data makes the model work well only on tokens that appear in the training data. We lose the important information that the word embeddings carry for tokens that are not present in the training data.
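
As a rough illustration of what I have in mind, the vocab could be extended with tokens that have an embedding even though they never occur in the training data. This is only a sketch under my own assumptions: `extend_vocab_with_pretrained` and its `max_extra` parameter are hypothetical names, not something that exists in this repo, and the token lists are toy data.

```python
def extend_vocab_with_pretrained(train_vocab, pretrained_tokens, max_extra=None):
    """Return the training vocab plus tokens that only exist in the embeddings.

    pretrained_tokens is assumed to be ordered (for example by frequency), so
    max_extra keeps the first max_extra unseen tokens.
    """
    extra = [tok for tok in pretrained_tokens if tok not in train_vocab]
    if max_extra is not None:
        extra = extra[:max_extra]
    return set(train_vocab) | set(extra)


# Toy data (hypothetical): tokens seen in training and tokens with an embedding.
train_vocab = {"the", "cat"}
pretrained_tokens = ["the", "dog", "home"]

extended_vocab = extend_vocab_with_pretrained(train_vocab, pretrained_tokens)
capped_vocab = extend_vocab_with_pretrained(train_vocab, pretrained_tokens, max_extra=1)
print(sorted(extended_vocab))
print(sorted(capped_vocab))
```

With a vocab like this, a test token such as "dog" would map to its own embedding instead of the OOV token, provided the embeddings for the added tokens are loaded and not just randomly initialized.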

It would be great to hear your thoughts on it.


rangwani-harsh changed the title ~~Issue with only train data neing used for vocab creation~~ Issue with only train data being used for vocab creation May 18, 2019
