greyblake / whatlang-rs

Natural language detection library for Rust. Try demo online: https://whatlang.org/

Home Page: https://whatlang.org/

Improve Japanese detection quality

greyblake opened this issue

At the moment Japanese remains the only language that gives poor results even with long texts.

This seems to be because Japanese text contains many Chinese characters.

LANG       AVG      <= 20    21-50    51-100   > 100
Japanese   54.05%   52.94%   55.77%   55.55%   51.95%

See the article https://eastasiastudent.net/regional/hanzi-and-kanji/

Chinese is written entirely in hanzi, and Japanese makes heavy use of Chinese characters.

The detection algorithm could probably be adjusted in the following way:

  • If the text contains only Mandarin characters => it's Chinese
  • If the text contains Mandarin characters and a large portion of Katakana or Hiragana (at least 25%) => it's Japanese
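
The two rules above could be sketched roughly as follows. This is only an illustration, not whatlang's actual implementation: the `Guess` enum and the `guess_cjk`, `is_kana` and `is_han` helpers are hypothetical names, the Unicode ranges are my assumption of what counts as Kana and Chinese characters, and returning `Unknown` when the Kana share is below the threshold is also an assumption, since the rules above do not cover that case.

```rust
/// Outcome of the heuristic. Hypothetical type, not part of whatlang's API.
#[derive(Debug, PartialEq)]
enum Guess {
    Chinese,
    Japanese,
    Unknown,
}

/// Hiragana (U+3040..=U+309F) and Katakana (U+30A0..=U+30FF) blocks.
fn is_kana(c: char) -> bool {
    matches!(c, '\u{3040}'..='\u{30FF}')
}

/// CJK Unified Ideographs plus Extension A (assumed to be enough for a sketch).
fn is_han(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}' | '\u{3400}'..='\u{4DBF}')
}

/// Applies the two rules: only Chinese characters => Chinese;
/// Kana share of at least `kana_threshold` => Japanese.
fn guess_cjk(text: &str, kana_threshold: f64) -> Guess {
    let mut han = 0usize;
    let mut kana = 0usize;
    for c in text.chars() {
        if is_kana(c) {
            kana += 1;
        } else if is_han(c) {
            han += 1;
        }
    }

    let total = han + kana;
    if total == 0 {
        // No CJK characters at all: this heuristic has nothing to say.
        return Guess::Unknown;
    }
    if kana == 0 {
        return Guess::Chinese;
    }
    if kana as f64 / total as f64 >= kana_threshold {
        Guess::Japanese
    } else {
        // Assumption: leave the decision to the rest of the detector.
        Guess::Unknown
    }
}
```

With the proposed threshold, a call would look like `guess_cjk(text, 0.25)`. The share is computed over Kana plus Chinese characters only, so punctuation and Latin letters do not dilute it; that is a design choice of this sketch rather than something stated in the proposal.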

I'm a native Japanese speaker. Feel free to mention me when you need help.

Your algorithm sounds good. 25% seems to be enough, and a lower threshold might also be okay.

@KitaitiMakoto Thanks for the feedback!
Yeah, I just wanted to double-check that my idea makes sense.
I am refactoring right now in order to implement and test that plan.

@KitaitiMakoto This seems to be a big improvement for Japanese detection!
I added benchmarks to #89
Thank you!

Great!