Implement custom Chinese tokenizer.
emfomy opened this issue · comments
Mu Yang commented
We may implement our own tokenizer rather than using BertTokenizerFast.
Our own tokenizer should have the following features:
- Disable word piece. Convert text to token IDs character by character (e.g. `tokenizer.convert_tokens_to_ids(list(input_text))`).
- Reimplement the `clean_up_tokenization` method. The default method is implemented for English only. Our method may remove whitespace and convert half-width punctuation to full-width.
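A minimal sketch of the two proposed features, using a toy character vocabulary for illustration (a real implementation would wrap `BertTokenizerFast` and the `bert-base-chinese` vocab; the names `convert_text_to_ids` and the vocab contents here are assumptions, not the actual API):

```python
# Toy character vocabulary -- an assumption for illustration only;
# the real vocab would come from bert-base-chinese.
VOCAB = {'[UNK]': 0, '我': 1, '們': 2, ',': 3, '，': 4}

def convert_text_to_ids(text):
    """Character-by-character conversion that bypasses word piece,
    equivalent to tokenizer.convert_tokens_to_ids(list(text))."""
    return [VOCAB.get(ch, VOCAB['[UNK]']) for ch in text]

# Half-width to full-width punctuation mapping (partial, for illustration).
HALF_TO_FULL = {',': '，', '!': '！', '?': '？', ':': '：', ';': '；'}

def clean_up_tokenization(text):
    """Replacement for the English-only default: convert half-width
    punctuation to full-width and strip the spaces inserted between
    tokens when decoding."""
    text = ''.join(HALF_TO_FULL.get(ch, ch) for ch in text)
    return text.replace(' ', '')
```

For example, `clean_up_tokenization('我 們 ,')` joins the characters back together and yields `'我們，'`.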
Wacha commented
When running this:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('ckiplab/albert-tiny-chinese-pos')
I get this error:
f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
OSError: Unable to load weights from pytorch checkpoint file for 'ckiplab/albert-tiny-chinese-pos' at
I have
transformers==4.2.2
ckip-transformers==0.2.1
torch==1.4.0
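One possible cause, not confirmed for this exact checkpoint: torch 1.6 introduced a zip-based serialization format, and older torch versions (like 1.4.0 listed above) cannot read checkpoints saved in it, which raises exactly this kind of `OSError`. A small version-check heuristic (the function name is hypothetical):

```python
def needs_torch_upgrade(torch_version):
    """Hedged heuristic: torch < 1.6 cannot load checkpoints saved in the
    zip-based format introduced in torch 1.6, a common cause of the
    'Unable to load weights from pytorch checkpoint file' OSError."""
    major, minor = (int(x) for x in torch_version.split('.')[:2])
    return (major, minor) < (1, 6)
```

If this applies, upgrading torch (e.g. `pip install --upgrade torch`) may resolve the error.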
Mu Yang commented
Consider using 🤗 Tokenizers instead, once it releases a stable version.