Improve Japanese detection quality
greyblake opened this issue
At the moment, Japanese remains the only language that gives poor results, even with long texts.
This seems to be because Japanese text contains many Chinese characters.
LANG | AVG | <= 20 chars | 21-50 chars | 51-100 chars | > 100 chars |
---|---|---|---|---|---|
Japanese | 54.05% | 52.94% | 55.77% | 55.55% | 51.95% |
See the article https://eastasiastudent.net/regional/hanzi-and-kanji/:
Chinese is written entirely in hanzi, while Japanese makes heavy use of Chinese characters (kanji) alongside its kana scripts.
The detection algorithm could probably be adjusted in the following way (a sketch follows the list):
- If the text contains only Mandarin (Han) characters => it's Chinese.
- If the text contains Mandarin characters plus a significant share of Katakana or Hiragana (at least 25%) => it's Japanese.
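A minimal sketch of this heuristic in Rust, assuming plain Unicode block checks. The function name `detect_cjk`, the returned language codes, and the hardcoded 25% threshold are illustrative, not part of whatlang's actual API:

```rust
/// Hiragana block (U+3040..=U+309F) and Katakana block (U+30A0..=U+30FF).
fn is_kana(c: char) -> bool {
    matches!(c, '\u{3040}'..='\u{30FF}')
}

/// CJK Unified Ideographs, the main Han block (U+4E00..=U+9FFF).
fn is_han(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}')
}

/// Applies the proposed rule: only Han => Chinese ("cmn"),
/// Han plus at least 25% kana => Japanese ("jpn").
fn detect_cjk(text: &str) -> Option<&'static str> {
    let kana = text.chars().filter(|&c| is_kana(c)).count();
    let han = text.chars().filter(|&c| is_han(c)).count();
    let total = kana + han;
    if total == 0 {
        return None; // no CJK characters at all
    }
    if kana == 0 {
        return Some("cmn"); // only Mandarin characters => Chinese
    }
    // At least 25% kana among CJK characters => Japanese.
    if kana * 4 >= total {
        Some("jpn")
    } else {
        // Some kana but below the threshold: the proposal leaves this
        // case open; defaulting to Chinese here.
        Some("cmn")
    }
}

fn main() {
    assert_eq!(detect_cjk("你好，世界"), Some("cmn"));
    assert_eq!(detect_cjk("これは日本語のテキストです"), Some("jpn"));
}
```

Note that the ratio is computed over CJK characters only, so Latin text, punctuation, and whitespace don't dilute the kana share.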
I'm a native Japanese speaker. Feel free to mention me when you need help.
Your algorithm sounds good: 25% seems sufficient, and even less might be okay.
@KitaitiMakoto Thanks for the feedback!
Yeah, I just wanted to double-check that my idea makes sense.
I'm refactoring right now in order to implement and test that plan.
@KitaitiMakoto This seems to be a big improvement for Japanese detection!
I added benchmarks to #89.
Thank you!
Great!