Chinese tokenization removing characters

Question

jacopofar opened this issue 4 years ago

It seems the Chinese tokenization removes some characters. Example (https://tatoeba.org/eng/sentences/show/5):

今天是６月１８号，也是Muiriel的生日！

turns into

{{c1::今天}} 是 月 号 ， 也 是 Muiriel 的 生日 ！

Jacopo Farina · Answer 1 · Sun Jun 21 2020 06:37:30 GMT+0800 (China Standard Time)

The ICU tokenizer doesn't do this, and an assertion in the code ensures that the tokens can always be concatenated to reproduce the original sentence unchanged.
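
For anyone who wants to check another tokenizer's output against the same invariant, here is a minimal sketch of that kind of check. It is not the project's actual assertion: the function names `is_lossless` and `dropped_characters` are made up for illustration, and the token list is taken from the output reported above with the `{{c1::…}}` markup stripped (an assumption about how that output maps to tokens).

```python
def is_lossless(sentence: str, tokens: list[str]) -> bool:
    """Return True if joining the tokens reproduces the sentence exactly."""
    return "".join(tokens) == sentence


def dropped_characters(sentence: str, tokens: list[str]) -> list[str]:
    """List the characters of `sentence` that are absent from the joined tokens.

    Assumes the tokenizer only drops characters (no reordering or rewriting),
    so the joined tokens form a subsequence of the sentence.
    """
    rebuilt = "".join(tokens)
    dropped = []
    pos = 0  # index of the next unmatched character in `rebuilt`
    for ch in sentence:
        if pos < len(rebuilt) and rebuilt[pos] == ch:
            pos += 1
        else:
            dropped.append(ch)
    return dropped


# Sentence and tokens from the report above (hypothetical reconstruction).
sentence = "今天是６月１８号，也是Muiriel的生日！"
tokens = ["今天", "是", "月", "号", "，", "也", "是", "Muiriel", "的", "生日", "！"]

print(is_lossless(sentence, tokens))
print(dropped_characters(sentence, tokens))
```

With the tokens from the report, the check fails and the dropped characters are the full-width digits, which suggests the loss happens in whichever tokenizer produced that output rather than in a lossless one.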