BUG: GLM-10B-Chinese model generate " ⁇".

Question

BUG: GLM-10B-Chinese model generate " ⁇".

Tebmer opened this issue a year ago · comments

Hi, When I use the seq2seq code to evaluate the original GLM-10-Chinese model, it sometimes generate ⁇ in text (in fact these two question marks is a token, and the token id of _⁇ is 25383) .

For example:
input: "zī yuán"的词语是[MASK]
output：⁇

input：单词“Duck”的读音是
output：单词“Duck”的读音是/tə d ⁇ æk/

Is ther anything wrong with the sentencepice tokenizer or the pretraining stage? And how to fix it? I think this token should be filtred.

THANKS!

Shawn Xu · Answer 1 · Tue Mar 28 2023 19:55:55 GMT+0800 (China Standard Time)

I checked the wudao dataest and found there are some irregular question marks in the text.

Is this the cause of the problem?

Shawn Xu · Answer 2 · Thu Mar 30 2023 14:25:29 GMT+0800 (China Standard Time)

OK. The reason is that the trained tokenizer encounter some unseen tokens while pretraining such as "岿". Maybe the vocabulary of GLM10bchinese is not big enough.

superhg · Answer 3 · Fri Apr 14 2023 09:34:41 GMT+0800 (China Standard Time)

met same issue: 蚂蟥的 “蟥 ” 椪柑树的 “椪” ，token id会被转换为0