ckiplab / ckip-transformers

CKIP Transformers

Home Page: https://ckip-transformers.readthedocs.io


Some traditional Chinese characters mapped to UNK

da03 opened this issue

Thanks for the great library! I'm not sure if this is the correct place to ask, but I believe I'm using your tokenizer through Hugging Face transformers. I found that some traditional Chinese characters are mapped to [UNK]; see the screenshot below.

The code I used was:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
input_ids = tokenizer.encode("重刋道藏輯要高上玉皇本行集經天樞上相(臣)張良校正三淸勅門下湛寂常道信擬議之至難恢漠神通豈形容之可盡", return_tensors='pt')
print('encoded ids:', input_ids)
print('map encoded ids back to words:', tokenizer.decode(input_ids[0]))

[Screenshot (2021-04-12): encoded ids and the decoded string, with several characters shown as [UNK]]
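
In case the screenshot is hard to read, the affected characters can be listed by comparing each character's id against tokenizer.unk_token_id. This is just a minimal sketch using the same input text; the variable names are only for illustration:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
text = "重刋道藏輯要高上玉皇本行集經天樞上相(臣)張良校正三淸勅門下湛寂常道信擬議之至難恢漠神通豈形容之可盡"

# Collect the characters whose id equals the [UNK] id
unk_chars = [ch for ch in text
             if tokenizer.convert_tokens_to_ids(ch) == tokenizer.unk_token_id]
print('characters mapped to [UNK]:', unk_chars)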

Thanks in advance!

We use the original bert-base-chinese tokenizer (https://huggingface.co/bert-base-chinese), whose vocabulary contains only 8000 Chinese characters.
However, the UNK rate in my corpus is only about 0.3%.
For general usage, you can simply ignore these UNK characters.
If you really need to handle them, you might need to train your own LM or try other techniques that handle UNKs.
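
If retraining a language model is not an option, one common workaround (not something ckip-transformers provides out of the box, just a sketch of the standard Hugging Face approach) is to add the missing characters to the tokenizer and resize the model's embedding matrix. The characters below are only assumed to be missing, and the new embedding rows are randomly initialized, so fine-tuning is still needed before they carry any meaning:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

# Characters assumed to be missing from the vocabulary (illustrative only)
missing_chars = ["刋", "淸"]
num_added = tokenizer.add_tokens(missing_chars)

# Grow the embedding matrix so the new ids have rows; these rows start
# untrained and only become useful after further fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print('added', num_added, 'tokens; new vocab size:', len(tokenizer))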