grammarly / gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)

Question about vocab generation

shgabr opened this issue

We extracted the encoder part from a T5 model and have been able to train it successfully; the results are pretty decent. Training works when we don't supply a vocab path, which means the model generates its own vocab.

The problem is that the generated vocab is quite poor, while your vocab seems much better. So we wanted to train our T5 encoder model with your vocab, but doing so results in this error:

```
ERROR:allennlp.data.vocabulary:Namespace: d_tags
ERROR:allennlp.data.vocabulary:Token: INCORRECT
Traceback (most recent call last):
  File "./modifiedGector/train.py", line 311, in <module>
    main(args)
  File "./modifiedGector/train.py", line 127, in main
    special_tokens_fix=args.special_tokens_fix)
  File "./modifiedGector/train.py", line 85, in get_model
    confidence=confidence)
  File "/home/cse/gector/modifiedGector/gector/seq2labels_model.py", line 76, in __init__
    namespace=detect_namespace)
  File "/home/cse/.conda/envs/gector/lib/python3.7/site-packages/allennlp/data/vocabulary.py", line 630, in get_token_index
    return self._token_to_index[namespace][self._oov_token]
KeyError: '@@UNKNOWN@@'
```

We suspect the error happens because the training data wasn't tokenized specifically for the T5 model, but we don't know how to do that while using your vocabulary.

We tried removing the d_tags file, but again the same error.
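
For reference, this is roughly how we inspect the d_tags namespace of the supplied vocabulary directory (a sketch using AllenNLP's public Vocabulary API; the path is a placeholder for wherever your copy of output_vocabulary lives):

```python
from allennlp.data.vocabulary import Vocabulary

# Load the released vocabulary directory (placeholder path) and look at
# the d_tags namespace that the traceback complains about.
vocab = Vocabulary.from_files("data/output_vocabulary")
d_tags = vocab.get_token_to_index_vocabulary(namespace="d_tags")
print(d_tags)

# If '@@UNKNOWN@@' is not among these keys, get_token_index() raises the
# KeyError above as soon as the model asks for the OOV tag index.
print("@@UNKNOWN@@" in d_tags)
```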

Any help or advice would be highly appreciated.

I tried it on my own and got the same result.
Here is another issue that seems to describe the same problem:
allenai/allennlp#881 (comment)

Problem solved

Hi @shgabr , I'm facing the same problem. How did you solve it? Thanks!

You need to use the codecs library when you create the label files in output_vocabulary.
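
In case it helps, here is a minimal sketch of what that can look like. The file names follow AllenNLP's vocabulary-directory layout (non_padded_namespaces.txt plus one file per namespace), but the helper name and the tag lists below are only illustrative, so check them against your own data:

```python
import codecs
import os

def write_output_vocabulary(out_dir, labels, d_tags=("CORRECT", "INCORRECT")):
    """Illustrative helper: write vocabulary files that
    Vocabulary.from_files() can read back without a KeyError."""
    os.makedirs(out_dir, exist_ok=True)

    # Mark both namespaces as non-padded (no automatic padding/OOV entries).
    with codecs.open(os.path.join(out_dir, "non_padded_namespaces.txt"),
                     "w", encoding="utf-8") as f:
        f.write("labels\nd_tags\n")

    # Because the namespaces are non-padded, each file has to list the
    # '@@UNKNOWN@@' token itself; writing with codecs + UTF-8 avoids
    # encoding artifacts that stop the token from matching.
    with codecs.open(os.path.join(out_dir, "labels.txt"),
                     "w", encoding="utf-8") as f:
        f.write("@@UNKNOWN@@\n")
        for label in labels:
            f.write(label + "\n")

    with codecs.open(os.path.join(out_dir, "d_tags.txt"),
                     "w", encoding="utf-8") as f:
        f.write("@@UNKNOWN@@\n")
        for tag in d_tags:
            f.write(tag + "\n")

# Example call with a couple of GECToR-style edit labels:
# write_output_vocabulary("data/output_vocabulary", ["$KEEP", "$DELETE"])
```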