grammarly / gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)

Question about vocab generation

shgabr opened this issue

We extracted the encoder part from a T5 model and have been able to train it successfully; the results are pretty decent. Training works when we don't supply a vocab path, which means the model generates its own vocab.

The problem is that the generated vocab is quite poor, while your vocab seems much better. So we wanted to train our T5 encoder model with your vocab, but doing so results in this error:

```
ERROR:allennlp.data.vocabulary:Namespace: d_tags
ERROR:allennlp.data.vocabulary:Token: INCORRECT
Traceback (most recent call last):
  File "./modifiedGector/train.py", line 311, in <module>
    main(args)
  File "./modifiedGector/train.py", line 127, in main
    special_tokens_fix=args.special_tokens_fix)
  File "./modifiedGector/train.py", line 85, in get_model
    confidence=confidence)
  File "/home/cse/gector/modifiedGector/gector/seq2labels_model.py", line 76, in __init__
    namespace=detect_namespace)
  File "/home/cse/.conda/envs/gector/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 630, in get_token_index
    return self._token_to_index[namespace][self._oov_token]
KeyError: '@@UNKNOWN@@'
```

We suspect the error happens because the training data wasn't tokenized specifically for the T5 model, but we don't know how to do that while using your vocabulary.

We tried removing the d_tags file, but again the same error.
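
For reference, this is roughly how we inspect the d_tags namespace of the supplied vocabulary directory (a sketch using AllenNLP's public Vocabulary API; the path is a placeholder for wherever your copy of output_vocabulary lives):

```python
from allennlp.data.vocabulary import Vocabulary

# Load the released vocabulary directory (placeholder path) and look at
# the d_tags namespace that the traceback complains about.
vocab = Vocabulary.from_files("data/output_vocabulary")
d_tags = vocab.get_token_to_index_vocabulary(namespace="d_tags")
print(d_tags)

# If '@@UNKNOWN@@' is not among these keys, get_token_index() raises the
# KeyError above as soon as the model asks for the OOV tag index.
print("@@UNKNOWN@@" in d_tags)
```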

Any help or advice would be highly appreciated.

I tried it on my own and got the same result.
Here is another issue that seems to describe the same problem:
allenai/allennlp#881 (comment)

Problem solved

Hi @shgabr , I'm facing the same problem. How did you solve it? Thanks!

You need to use the codecs library when you create the label files in output_vocabulary.
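
In case it helps, here is a minimal sketch of what that can look like. The file names follow AllenNLP's vocabulary-directory layout (non_padded_namespaces.txt plus one file per namespace), but the helper name and the tag lists below are only illustrative, so check them against your own data:

```python
import codecs
import os

def write_output_vocabulary(out_dir, labels, d_tags=("CORRECT", "INCORRECT")):
    """Illustrative helper: write vocabulary files that
    Vocabulary.from_files() can read back without a KeyError."""
    os.makedirs(out_dir, exist_ok=True)

    # Mark both namespaces as non-padded (no automatic padding/OOV entries).
    with codecs.open(os.path.join(out_dir, "non_padded_namespaces.txt"),
                     "w", encoding="utf-8") as f:
        f.write("labels\nd_tags\n")

    # Because the namespaces are non-padded, each file has to list the
    # '@@UNKNOWN@@' token itself; writing with codecs + UTF-8 avoids
    # encoding artifacts that stop the token from matching.
    with codecs.open(os.path.join(out_dir, "labels.txt"),
                     "w", encoding="utf-8") as f:
        f.write("@@UNKNOWN@@\n")
        for label in labels:
            f.write(label + "\n")

    with codecs.open(os.path.join(out_dir, "d_tags.txt"),
                     "w", encoding="utf-8") as f:
        f.write("@@UNKNOWN@@\n")
        for tag in d_tags:
            f.write(tag + "\n")

# Example call with a couple of GECToR-style edit labels:
# write_output_vocabulary("data/output_vocabulary", ["$KEEP", "$DELETE"])
```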