BUG: GLM-10B-Chinese model generate " ⁇".
Tebmer opened this issue · comments
Shawn Xu commented
Hi, when I use the seq2seq
code to evaluate the original GLM-10B-Chinese model, it sometimes generates ⁇
in the text (in fact these two question marks are a single token; the token id of _⁇
is 25383).
For example:
input: The word for "zī yuán" is [MASK]
output: ⁇
input: The pronunciation of the word "Duck" is
output: The pronunciation of the word "Duck" is /tə d ⁇ æk/
Is there anything wrong with the SentencePiece tokenizer or the pretraining stage? How can this be fixed? I think this token should be filtered.
Thanks!
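As a workaround, the unknown-token surface form can be stripped from decoded text as a post-processing step. A minimal sketch, assuming the ⁇ glyph in the output is SentencePiece's default unknown-token surface string (the exact replacement strings are an assumption based on the symptom above):

```python
# Sketch of a post-processing filter for the SentencePiece <unk>
# surface form reported in this issue. Assumes the offending glyph
# is the ⁇ character shown in the examples above.
def filter_unk(text: str) -> str:
    # Remove the spaced form first, then any bare occurrences.
    return text.replace(" ⁇", "").replace("⁇", "")

print(filter_unk("tə d ⁇ æk"))
```

Note that this only hides the symptom; the underlying character is still lost at tokenization time.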
Shawn Xu commented
OK. The reason is that the trained tokenizer encounters characters that were unseen during pretraining, such as "岿". Maybe the vocabulary of GLM-10B-Chinese is not big enough.
superhg commented
Met the same issue: the "蟥" in 蚂蟥 and the "椪" in 椪柑树 are converted to token id 0.