Issue with only train data being used for vocab creation #15

Open
rangwani-harsh opened this issue May 18, 2019 · 0 comments

@rangwani-harsh

Hi team,
Thanks for this wonderful repo. The code is generic and can easily be reused. I wanted to ask about vocab creation: in all the models, only training tokens are used to build the vocab.

"datasets_for_vocab_creation": ["train"]

With the multitask model, the vocab consists of tokens from all datasets, so it is large and has wide coverage. A test token is therefore likely to be found in that vocab. With a single-task model, the vocab is smaller and a test token is much more likely to be OOV (out of vocabulary).
So how do we make sure that the improvements are due to multitask learning rather than to the larger vocabulary coverage that comes with it?
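
One way to check this would be to measure the OOV rate of the test tokens against each vocab. Below is a minimal sketch in plain Python, not taken from the repo: the helper names `build_vocab` and `oov_rate` and the toy token lists are hypothetical and only illustrate the comparison.

```python
from collections import Counter


def build_vocab(token_lists, min_count=1):
    """Collect every token that occurs at least min_count times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, count in counts.items() if count >= min_count}


def oov_rate(tokens, vocab):
    """Fraction of tokens that are missing from the vocab."""
    if not tokens:
        return 0.0
    missing = sum(1 for tok in tokens if tok not in vocab)
    return missing / len(tokens)


# Toy data (hypothetical): training tokens for two tasks, test tokens for one.
train_task_a = ["john", "lives", "in", "paris"]
train_task_b = ["she", "said", "that", "mary", "lives", "there"]
test_task_a = ["mary", "lives", "in", "london"]

# Single-task setting: vocab built from task A training data only.
single_vocab = build_vocab([train_task_a])
# Multitask setting: vocab built from the training data of all tasks.
multi_vocab = build_vocab([train_task_a, train_task_b])

single_oov = oov_rate(test_task_a, single_vocab)
multi_oov = oov_rate(test_task_a, multi_vocab)
print(f"single-task OOV rate: {single_oov:.2f}")
print(f"multitask OOV rate:   {multi_oov:.2f}")
```

If the two rates differ a lot on the real datasets, a single-task baseline trained with the larger shared vocab could help separate the effect of coverage from the effect of multitask learning.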

The other point is that if we build the vocab only from training data, the model works well only on tokens that are present in the training data. We then lose the information that the pretrained word embeddings carry for tokens that do not appear in the training data.
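
To illustrate the concern, here is a small sketch, again in plain Python and not based on the repo's code. It assumes that tokens outside the vocab are mapped to a single unknown symbol; the `UNK` string, the `lookup` helper, and the word sets are all hypothetical.

```python
# Placeholder symbol for out-of-vocab tokens (hypothetical choice).
UNK = "<unk>"


def lookup(token, vocab):
    """Return the token itself if it is in the vocab, else the unknown symbol."""
    return token if token in vocab else UNK


# Vocab built from training data only (toy example).
train_vocab = {"john", "lives", "in", "paris"}
# Words that have a pretrained embedding vector (toy stand-in for a real file).
pretrained_words = {"john", "lives", "in", "paris", "london", "mary"}

test_tokens = ["mary", "lives", "in", "london"]

# Train-only vocab: "mary" and "london" collapse to the same unknown symbol,
# even though pretrained vectors exist for both.
train_only = [lookup(tok, train_vocab) for tok in test_tokens]

# Extended vocab: also keep every word that has a pretrained vector.
extended_vocab = train_vocab | pretrained_words
extended = [lookup(tok, extended_vocab) for tok in test_tokens]

print(train_only)
print(extended)
```

With the train-only vocab, the two unseen tokens become indistinguishable, while the extended vocab keeps them separate so their pretrained vectors can still be used.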

It would be great to hear your thoughts on it.

rangwani-harsh changed the title from "Issue with only train data neing used for vocab creation" to "Issue with only train data being used for vocab creation" on May 18, 2019