ckiplab / ckip-transformers

CKIP Transformers

Home Page: https://ckip-transformers.readthedocs.io


Some traditional Chinese characters mapped to UNK

da03 opened this issue

Thanks for the great library! I'm not sure if this is the correct place to ask, but I believe I'm using your tokenizer through Hugging Face transformers. I found that some traditional Chinese characters are mapped to [UNK]; see the screenshot below.

The code I used was:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
input_ids = tokenizer.encode("重刋道藏輯要高上玉皇本行集經天樞上相(臣)張良校正三淸勅門下湛寂常道信擬議之至難恢漠神通豈形容之可盡", return_tensors='pt')
print('encoded ids:', input_ids)
print('map encoded ids back to words:', tokenizer.decode(input_ids[0]))

[Screenshot (2021-04-12): encoded ids and the decoded string, with several characters shown as [UNK]]
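
In case the screenshot is hard to read, the affected characters can be listed by comparing each character's id against tokenizer.unk_token_id. This is just a minimal sketch using the same input text; the variable names are only for illustration:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
text = "重刋道藏輯要高上玉皇本行集經天樞上相(臣)張良校正三淸勅門下湛寂常道信擬議之至難恢漠神通豈形容之可盡"

# Collect the characters whose id equals the [UNK] id
unk_chars = [ch for ch in text
             if tokenizer.convert_tokens_to_ids(ch) == tokenizer.unk_token_id]
print('characters mapped to [UNK]:', unk_chars)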

Thanks in advance!

We use the original bert-base-chinese tokenizer (https://huggingface.co/bert-base-chinese), whose vocabulary contains only 8000 Chinese characters.
However, the UNK rate in my corpus is only about 0.3%.
For general usage, you can simply ignore these UNK characters.
If you really need to handle them, you might need to train your own LM or try other techniques that handle UNKs.
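
If retraining a language model is not an option, one common workaround (not something ckip-transformers provides out of the box, just a sketch of the standard Hugging Face approach) is to add the missing characters to the tokenizer and resize the model's embedding matrix. The characters below are only assumed to be missing, and the new embedding rows are randomly initialized, so fine-tuning is still needed before they carry any meaning:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

# Characters assumed to be missing from the vocabulary (illustrative only)
missing_chars = ["刋", "淸"]
num_added = tokenizer.add_tokens(missing_chars)

# Grow the embedding matrix so the new ids have rows; these rows start
# untrained and only become useful after further fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print('added', num_added, 'tokens; new vocab size:', len(tokenizer))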