Chinese tokenization removing characters
jacopofar opened this issue
It seems the Chinese tokenization removes some characters, for example the digits in https://tatoeba.org/eng/sentences/show/5
今天是6月18号,也是Muiriel的生日!
turns into
{{c1::今天}} 是 月 号 , 也 是 Muiriel 的 生日 !
The ICU tokenizer doesn't do this, and an assertion in the code ensures the tokens can always be concatenated to form the original sentence without any change.
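To make the problem easy to reproduce, here is a minimal sketch of the kind of round-trip check described above, applied to the output of the Chinese tokenizer. The function name `missing_characters` is hypothetical and not taken from the project's code, and the token list is copied by hand from the output shown above (with the cloze markup stripped).

```python
def missing_characters(sentence: str, tokens: list[str]) -> list[str]:
    """Return the pieces of `sentence` that are not covered by any token.

    Hypothetical helper for illustration: tokens are expected to appear
    in order inside the sentence.
    """
    missing = []
    pos = 0
    for token in tokens:
        idx = sentence.find(token, pos)
        if idx == -1:
            raise ValueError(f"token {token!r} not found after position {pos}")
        if idx > pos:
            # Text between the previous token and this one was dropped.
            missing.append(sentence[pos:idx])
        pos = idx + len(token)
    if pos < len(sentence):
        # Trailing text after the last token was dropped.
        missing.append(sentence[pos:])
    return missing


sentence = "今天是6月18号,也是Muiriel的生日!"
# Tokens as reported in this issue, copied by hand.
tokens = ["今天", "是", "月", "号", ",", "也", "是", "Muiriel", "的", "生日", "!"]

# The round-trip property the assertion relies on does not hold here.
round_trip_ok = "".join(tokens) == sentence
dropped = missing_characters(sentence, tokens)
print(round_trip_ok, dropped)
```

With the tokens from this issue, the check reports that the digits are the parts being dropped, which suggests the Chinese tokenizer is discarding numeric tokens rather than splitting them incorrectly.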