BERT/ELECTRA vocab file and tokenizer model
araitats opened this issue · comments
Description
The default `learn_subword` produces special tokens in the sentencepiece angle-bracket style (e.g. `<unk>`, `<pad>`). By convention, BERT uses `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, and `[MASK]`. How can I pass the `--custom-special-tokens` flag so that the resulting tokenizer model and vocab file use these BERT special tokens? What is the best practice for that?
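Until the flag works as expected, one workaround is to post-process the learned vocab and remap the special tokens. The sketch below is a minimal, hypothetical example: it assumes the vocab is a list of tokens (one per line in the vocab file) and that the defaults are sentencepiece-style angle-bracket tokens; the exact mapping may differ from what `learn_subword` actually emits.

```python
# Hypothetical remapping from sentencepiece-style special tokens to
# BERT's conventions. The left-hand tokens are an assumption, not
# necessarily the tool's actual defaults.
SPM_TO_BERT = {
    "<unk>": "[UNK]",
    "<pad>": "[PAD]",
    "<s>": "[CLS]",
    "</s>": "[SEP]",
}

def remap_vocab(tokens):
    """Replace sentencepiece-style special tokens with BERT-style ones,
    leave regular subwords untouched, and append [MASK] if missing."""
    out = [SPM_TO_BERT.get(t, t) for t in tokens]
    if "[MASK]" not in out:
        out.append("[MASK]")
    return out

vocab = ["<unk>", "<pad>", "<s>", "</s>", "hello", "world"]
print(remap_vocab(vocab))
```

Note that any downstream tokenizer config must be updated consistently, since the token IDs of the remapped entries stay the same but their surface forms change.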
@araitats This should be fixed now. I'll close this issue.