BERT/ELECTRA vocab file and tokenizer model
araitats opened this issue · comments
Description
The default `learn_subword` produces special tokens in the sentencepiece angle-bracket style (e.g. `<unk>`, `<pad>`). By convention, BERT uses `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, and `[MASK]`. How can I pass the `--custom-special-tokens` flag so that the resulting tokenizer model and vocab file use these BERT special tokens? What is the best practice for that?
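Until the flag works as expected, one workaround is to post-process the learned vocab and remap the special tokens. The sketch below is a minimal, hypothetical example: it assumes the vocab is a list of tokens (one per line in the vocab file) and that the defaults are sentencepiece-style angle-bracket tokens; the exact mapping may differ from what `learn_subword` actually emits.

```python
# Hypothetical remapping from sentencepiece-style special tokens to
# BERT's conventions. The left-hand tokens are an assumption, not
# necessarily the tool's actual defaults.
SPM_TO_BERT = {
    "<unk>": "[UNK]",
    "<pad>": "[PAD]",
    "<s>": "[CLS]",
    "</s>": "[SEP]",
}

def remap_vocab(tokens):
    """Replace sentencepiece-style special tokens with BERT-style ones,
    leave regular subwords untouched, and append [MASK] if missing."""
    out = [SPM_TO_BERT.get(t, t) for t in tokens]
    if "[MASK]" not in out:
        out.append("[MASK]")
    return out

vocab = ["<unk>", "<pad>", "<s>", "</s>", "hello", "world"]
print(remap_vocab(vocab))
```

Note that any downstream tokenizer config must be updated consistently, since the token IDs of the remapped entries stay the same but their surface forms change.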
@araitats This should be fixed now. I'll close this issue.