ckiplab / ckip-transformers

CKIP Transformers

Home Page: https://ckip-transformers.readthedocs.io


Implement custom Chinese tokenizer.

emfomy opened this issue · comments

We may implement our own tokenizer rather than using BertTokenizerFast.
Our own tokenizer should have the following features:

  • Disable WordPiece: convert text to token IDs character by character (e.g. tokenizer.convert_tokens_to_ids(list(input_text))).
  • Reimplement the clean_up_tokenization method. The default implementation targets English only; ours may remove whitespace and convert half-width punctuation to full-width.
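The second bullet could look like the following minimal sketch. This is a hypothetical helper, not CKIP's actual implementation, and the punctuation table is only a partial example of the half-width to full-width mapping:

```python
# Sketch of the proposed clean_up_tokenization behaviour: strip the
# whitespace that detokenization inserts between characters, then map
# half-width (ASCII) punctuation to the full-width forms conventional
# in Chinese text. The mapping below is illustrative, not exhaustive.
HALF_TO_FULL = str.maketrans({
    ",": "，", ".": "。", "?": "？", "!": "！",
    ":": "：", ";": "；", "(": "（", ")": "）",
})

def clean_up_tokenization(text: str) -> str:
    # Remove inter-token spaces, then convert punctuation in one pass.
    return text.replace(" ", "").translate(HALF_TO_FULL)
```

For example, the detokenized string "你 好 , 世 界 !" would become "你好，世界！".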
commented

When running this:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('ckiplab/albert-tiny-chinese-pos')

I get this error:
f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
OSError: Unable to load weights from pytorch checkpoint file for 'ckiplab/albert-tiny-chinese-pos' at

My environment:
transformers==4.2.2
ckip-transformers==0.2.1
torch==1.4.0
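One common cause of this error is a serialization mismatch: PyTorch 1.6 switched its default checkpoint format to a zip archive, which torch==1.4.0 cannot read, so upgrading torch often resolves it. A quick stdlib check (a hypothetical helper; the checkpoint path is whatever you downloaded locally) to see whether a checkpoint uses the newer format:

```python
import zipfile

def is_zip_checkpoint(path: str) -> bool:
    """Return True if the file uses PyTorch's zip-based format (torch >= 1.6).

    Older versions such as torch 1.4.0 cannot load zip-format checkpoints
    and fail with "Unable to load weights from pytorch checkpoint file".
    """
    return zipfile.is_zipfile(path)
```

If this returns True for the cached pytorch_model.bin, installing a newer torch (>= 1.6) should let transformers load the weights.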

Consider using the 🤗 Tokenizers library instead, once it releases a stable version.