jacopofar / grammar-quiz

Online cloze deletion tool focused on grammar

Chinese tokenization removing characters

jacopofar opened this issue

It seems the Chinese tokenization removes some characters. For example, take the sentence at https://tatoeba.org/eng/sentences/show/5:

今天是6月18号,也是Muiriel的生日!

turns into the following, with the digits 6 and 18 dropped:

{{c1::今天}} 是 月 号 , 也 是 Muiriel 的 生日 !

The ICU tokenizer doesn't drop characters like this, and an assertion in the code ensures the tokens can always be concatenated to form the original sentence without any change.
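
To illustrate the kind of check described here, this is a minimal Python sketch, not the project's actual code. The function names `is_lossless` and `dropped_characters` are hypothetical, and the token list is copied by hand from the output above. It assumes tokens are meant to be joined with no separator, so any whitespace in the sentence would have to appear in the tokens too.

```python
from collections import Counter
from typing import List


def is_lossless(sentence: str, tokens: List[str]) -> bool:
    """Return True if the tokens, joined with no separator, rebuild the sentence."""
    return "".join(tokens) == sentence


def dropped_characters(sentence: str, tokens: List[str]) -> List[str]:
    """List characters that occur more often in the sentence than in the tokens."""
    missing = Counter(sentence) - Counter("".join(tokens))
    return sorted(missing.elements())


# The sentence from the report and the tokens shown in the faulty output.
sentence = "今天是6月18号,也是Muiriel的生日!"
lossy_tokens = ["今天", "是", "月", "号", ",", "也", "是", "Muiriel", "的", "生日", "!"]

print(is_lossless(sentence, lossy_tokens))         # False: the digits are gone
print(dropped_characters(sentence, lossy_tokens))  # ['1', '6', '8']
```

Running a check like this on every tokenizer's output, not only the ICU one, would make a lossy tokenization fail loudly instead of silently producing a cloze card with missing characters.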