dmlc / gluon-nlp

NLP made easy

Home Page: https://nlp.gluon.ai/

BERT/Electra vocab file and tokenizer model

araitats opened this issue · comments

Description

The default learn_subword returns the special tokens <unk>, <pad>, <bos>, and <eos>. By convention, BERT uses [UNK], [PAD], [CLS], [SEP], and [MASK]. How can we set the --custom-special-tokens flag so that the tokenizer model and vocab file use these BERT special tokens? What is the best practice for that?
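
A minimal sketch of the kind of invocation the question is after. The entry point (nlp_process learn_subword), the corpus/model/vocab-size flags, and the key=value token format are all assumptions for illustration; only the --custom-special-tokens flag itself comes from the question, so verify the exact usage against the tool's --help for your gluon-nlp version:

    # Hypothetical invocation: entry point, key names, and the other flags
    # are assumptions; only --custom-special-tokens is taken from the issue.
    nlp_process learn_subword \
        --corpus train_corpus.txt \
        --model spm \
        --vocab-size 30000 \
        --save-dir ./bert_tokenizer \
        --custom-special-tokens "unk_token=[UNK]" "pad_token=[PAD]" \
                                "cls_token=[CLS]" "sep_token=[SEP]" \
                                "mask_token=[MASK]"

If the flag behaves this way, the resulting vocab file and tokenizer model would carry BERT's [UNK]/[PAD]/[CLS]/[SEP]/[MASK] tokens instead of the angle-bracketed defaults.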

@araitats This should now be solved. I'll close this issue.