Implement custom Chinese tokenizer.
emfomy opened this issue · comments
Mu Yang commented
We may implement our own tokenizer rather than using BertTokenizerFast.
Our own tokenizer should have the following features:
- Disable word piece. Convert text to token IDs character by character (e.g. `tokenizer.convert_tokens_to_ids(list(input_text))`).
- Reimplement the `clean_up_tokenization` method. The default method is implemented for English only. Our method may remove whitespace and convert half-width punctuation to full-width.
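A minimal sketch of the two proposed features, using a toy character vocabulary for illustration (a real implementation would wrap `BertTokenizerFast` and the `bert-base-chinese` vocab; the names `convert_text_to_ids` and the vocab contents here are assumptions, not the actual API):

```python
# Toy character vocabulary -- an assumption for illustration only;
# the real vocab would come from bert-base-chinese.
VOCAB = {'[UNK]': 0, '我': 1, '們': 2, ',': 3, '，': 4}

def convert_text_to_ids(text):
    """Character-by-character conversion that bypasses word piece,
    equivalent to tokenizer.convert_tokens_to_ids(list(text))."""
    return [VOCAB.get(ch, VOCAB['[UNK]']) for ch in text]

# Half-width to full-width punctuation mapping (partial, for illustration).
HALF_TO_FULL = {',': '，', '!': '！', '?': '？', ':': '：', ';': '；'}

def clean_up_tokenization(text):
    """Replacement for the English-only default: convert half-width
    punctuation to full-width and strip the spaces inserted between
    tokens when decoding."""
    text = ''.join(HALF_TO_FULL.get(ch, ch) for ch in text)
    return text.replace(' ', '')
```

For example, `clean_up_tokenization('我 們 ,')` joins the characters back together and yields `'我們，'`.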
Wacha commented
When running this:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('ckiplab/albert-tiny-chinese-pos')
I get this error:
f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
OSError: Unable to load weights from pytorch checkpoint file for 'ckiplab/albert-tiny-chinese-pos' at
I have
transformers==4.2.2
ckip-transformers==0.2.1
torch==1.4.0
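One possible cause, not confirmed for this exact checkpoint: torch 1.6 introduced a zip-based serialization format, and older torch versions (like 1.4.0 listed above) cannot read checkpoints saved in it, which raises exactly this kind of `OSError`. A small version-check heuristic (the function name is hypothetical):

```python
def needs_torch_upgrade(torch_version):
    """Hedged heuristic: torch < 1.6 cannot load checkpoints saved in the
    zip-based format introduced in torch 1.6, a common cause of the
    'Unable to load weights from pytorch checkpoint file' OSError."""
    major, minor = (int(x) for x in torch_version.split('.')[:2])
    return (major, minor) < (1, 6)
```

If this applies, upgrading torch (e.g. `pip install --upgrade torch`) may resolve the error.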
Mu Yang commented
Consider using 🤗 Tokenizers instead, once it releases a stable version.