Hi team,
Thanks for this wonderful repo. The code is generic and can easily be reused. I wanted to ask about vocab creation: in all the models, only training tokens are used to build the vocabulary:
"datasets_for_vocab_creation": ["train"]
So when we use the multitask model, we get large token coverage, because the vocab contains tokens from all the datasets, and a test token is therefore very likely to be found in the vocab. In contrast, with a single-task model the vocab is much smaller, and there is a much higher chance of a token being OOV (out of vocabulary).
So how do we make sure that the improvements come from multitask learning itself, rather than from the larger vocabulary coverage that the multitask setup gets?
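For example, one sanity check would be to measure the test-set OOV rate under a train-only vocab versus a vocab built from all the multitask datasets, to see how much of the coverage gap the multitask setup closes. This is just a rough sketch; the file paths and the whitespace split are placeholders for the repo's actual dataset readers and tokenizers:

```python
# Sketch: compare test-set token coverage under a train-only vocab vs. a
# vocab built from all multitask datasets. Paths and the whitespace
# tokenizer are placeholders, not the repo's real readers.
from collections import Counter

def read_tokens(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield from line.split()  # stand-in for the real tokenizer

def build_vocab(paths, min_count=1):
    counts = Counter()
    for p in paths:
        counts.update(read_tokens(p))
    return {tok for tok, c in counts.items() if c >= min_count}

def oov_rate(test_path, vocab):
    total = oov = 0
    for tok in read_tokens(test_path):
        total += 1
        oov += tok not in vocab
    return oov / max(total, 1)

single_vocab = build_vocab(["task_a/train.txt"])  # placeholder paths
multi_vocab = build_vocab(["task_a/train.txt", "task_b/train.txt", "task_c/train.txt"])

print("OOV rate (single-task vocab):", oov_rate("task_a/test.txt", single_vocab))
print("OOV rate (multitask vocab):  ", oov_rate("task_a/test.txt", multi_vocab))
```

If the multitask vocab reduces the test OOV rate a lot, then at least part of the improvement could be explained by coverage alone.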
The other point is that if we only build the vocab from training data, the model is restricted to tokens that appear in the training data, and we lose the useful information that the pretrained word embeddings carry for tokens that are not present in the training data.
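A possible workaround would be to extend the train-only vocab with every token that has a pretrained vector, so that the embedding information for those tokens is not thrown away at vocab-construction time. I think AllenNLP's Vocabulary exposes something along these lines (pretrained_files / only_include_pretrained_words), but the basic idea, again only as a sketch with placeholder paths, is:

```python
# Sketch: extend a train-only vocab with every token that has a pretrained
# embedding, so those vectors are not discarded. Paths are placeholders.
def load_pretrained_tokens(embedding_path):
    tokens = set()
    with open(embedding_path, encoding="utf-8") as f:
        for line in f:
            tokens.add(line.split(" ", 1)[0])  # first field of a GloVe line is the token
    return tokens

train_vocab = set()
with open("task_a/train.txt", encoding="utf-8") as f:  # placeholder train file
    for line in f:
        train_vocab.update(line.split())               # stand-in for the real tokenizer

pretrained_tokens = load_pretrained_tokens("glove.6B.100d.txt")  # placeholder path
extended_vocab = train_vocab | pretrained_tokens

print(len(train_vocab), "->", len(extended_vocab), "tokens after extension")
```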
It would be great to hear your thoughts on it.