Question about vocab generation
shgabr opened this issue · comments
We extracted the encoder part from the T5 model and have been able to train it successfully; the results are fairly decent. Training works when we don't supply a vocab path, in which case the model generates its own vocab.
The problem is that the generated vocab is poor, while yours seems much better. So we wanted to train our T5 encoder model using your vocab, but doing so produces this error:
ERROR:allennlp.data.vocabulary:Namespace: d_tags
ERROR:allennlp.data.vocabulary:Token: INCORRECT
Traceback (most recent call last):
File "./modifiedGector/train.py", line 311, in <module>
main(args)
File "./modifiedGector/train.py", line 127, in main
special_tokens_fix=args.special_tokens_fix)
File "./modifiedGector/train.py", line 85, in get_model
confidence=confidence)
File "/home/cse/gector/modifiedGector/gector/seq2labels_model.py", line 76, in __init__
namespace=detect_namespace)
File "/home/cse/.conda/envs/gector/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 630, in get_token_index
return self._token_to_index[namespace][self._oov_token]
KeyError: '@@UNKNOWN@@'
We suspect the error occurs because the training data wasn't tokenized specifically for the T5 model, but we don't know how to do that while using your vocabulary.
We tried removing the d_tags file, but we get the same error again.
Any help or advice would be highly appreciated.
I tried it on my own and got the same result.
Here is another issue that seems to describe the same problem:
allenai/allennlp#881 (comment)
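The `KeyError: '@@UNKNOWN@@'` means AllenNLP looked up the OOV token in the d_tags namespace and didn't find it (by default, namespaces ending in "tags" or "labels" are non-padded, so their vocab files contain no OOV entry). One possible workaround, sketched below under the assumption that the supplied vocab directory uses AllenNLP's on-disk layout (one token per line in d_tags.txt), is to append the OOV token to that namespace file before training. `ensure_unknown_token` is a hypothetical helper name, not part of GECToR or AllenNLP:

```python
from pathlib import Path

def ensure_unknown_token(vocab_dir: str, namespace: str = "d_tags",
                         oov_token: str = "@@UNKNOWN@@") -> bool:
    """Append the OOV token to <vocab_dir>/<namespace>.txt if missing.

    Assumes AllenNLP's vocabulary-directory layout: one token per line.
    Returns True if the file was modified, False if the token was
    already present.
    """
    vocab_file = Path(vocab_dir) / f"{namespace}.txt"
    tokens = vocab_file.read_text(encoding="utf-8").splitlines()
    if oov_token in tokens:
        return False
    # Appending gives the OOV token the next free index in this
    # namespace, so existing tag indices are unchanged.
    with vocab_file.open("a", encoding="utf-8") as f:
        f.write(oov_token + "\n")
    return True
```

Note this only silences the lookup; whether an @@UNKNOWN@@ detect tag is meaningful for your model is a separate question, and you may instead want the namespace treated as non-padded consistently on both the writing and reading side.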
Problem solved
Hi @shgabr , I'm facing the same problem. How did you solve it? Thanks!