huggingface / hmtl

🌊HMTL: Hierarchical Multi-Task Learning - A State-of-the-Art neural network model for several NLP tasks based on PyTorch and AllenNLP

Issue with only train data being used for vocab creation

rangwani-harsh opened this issue

Hi team,
Thanks for this wonderful repo. The code is generic and can easily be reused. I wanted to ask about vocabulary creation: in all the models, only the training tokens are used to build the vocab.

"datasets_for_vocab_creation": ["train"]

When we train the multi-task model, the vocab is built from the tokens of all the datasets, so it has much wider coverage and a test token is very likely to be found in it. When we train a single-task model, the vocab is much smaller and there is a much larger chance of a test token being OOV (out of vocabulary).
So how do we make sure that the improvements come from multi-task learning rather than from the wider vocabulary coverage in the multi-task case? The kind of sanity check I have in mind is sketched below.
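Here is a rough sketch of the comparison (the `read_tokens` helper and the file names are placeholders, not files from the repo): measure the test-set OOV rate under a vocab built from one task's training data versus the union vocab that the multi-task model effectively gets.

```python
# Rough sketch: how much of one task's test set is OOV under a single-task vocab
# vs. the union vocab built from all tasks' training data.
# `read_tokens` and the file names below are placeholders.

def read_tokens(path):
    """Placeholder reader: yield whitespace-separated tokens from a text file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield from line.split()

def oov_rate(test_tokens, vocab):
    """Fraction of test tokens that are not in the vocabulary."""
    if not test_tokens:
        return 0.0
    return sum(1 for tok in test_tokens if tok not in vocab) / len(test_tokens)

single_task_vocab = set(read_tokens("ner_train.txt"))

multi_task_vocab = set()
for split in ("ner_train.txt", "emd_train.txt", "re_train.txt", "coref_train.txt"):
    multi_task_vocab |= set(read_tokens(split))

test_tokens = list(read_tokens("ner_test.txt"))
print("OOV rate with single-task vocab:", oov_rate(test_tokens, single_task_vocab))
print("OOV rate with multi-task vocab: ", oov_rate(test_tokens, multi_task_vocab))
```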

The other point is that if the vocab is built only from the training data, the model only works well on tokens that appear in training, and we lose the information contained in the pretrained word embeddings for tokens that do not appear in the training data. One possible workaround is sketched below.
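One option I was wondering about (if I am reading the AllenNLP `Vocabulary` API correctly) is to tie vocab coverage to the pretrained embedding file rather than to the training split alone. A hedged sketch, where the instance list and the embedding path are placeholders:

```python
# Hedged sketch, assuming the Vocabulary.from_instances signature of the AllenNLP
# version used here. `instances` and the embedding path are placeholders.
from allennlp.data import Vocabulary

instances = []  # placeholder: instances from all splits produced by the dataset reader(s)

vocab = Vocabulary.from_instances(
    instances,
    # Placeholder path to a pretrained embedding file for the "tokens" namespace.
    pretrained_files={"tokens": "/path/to/glove.840B.300d.txt"},
    # If I read the API correctly, only tokens that also have a pretrained vector are
    # kept, so coverage depends on the embeddings rather than on which split a token
    # happened to appear in, and no trainable rows are created for unseen words.
    only_include_pretrained_words=True,
)
print("vocab size:", vocab.get_vocab_size("tokens"))
```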

It would be great to hear your thoughts on it.