Improve Japanese detection quality
greyblake opened this issue
At the moment, Japanese remains the only language that gives poor results, even with long texts.
This seems to be because Japanese text contains many Chinese characters.
LANG | AVG | <= 20 chars | 21-50 chars | 51-100 chars | > 100 chars |
---|---|---|---|---|---|
Japanese | 54.05% | 52.94% | 55.77% | 55.55% | 51.95% |
See the article https://eastasiastudent.net/regional/hanzi-and-kanji/:
Chinese is written entirely in hanzi, while Japanese makes heavy use of Chinese characters (kanji) alongside its kana scripts.
The detection algorithm could probably be adjusted in the following way (a sketch follows the list):
- If the text contains only Mandarin (Han) characters => it's Chinese.
- If the text contains Mandarin characters plus a significant share of Katakana or Hiragana (at least 25%) => it's Japanese.
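A minimal sketch of this heuristic in Rust, assuming plain Unicode block checks. The function name `detect_cjk`, the returned language codes, and the hardcoded 25% threshold are illustrative, not part of whatlang's actual API:

```rust
/// Hiragana block (U+3040..=U+309F) and Katakana block (U+30A0..=U+30FF).
fn is_kana(c: char) -> bool {
    matches!(c, '\u{3040}'..='\u{30FF}')
}

/// CJK Unified Ideographs, the main Han block (U+4E00..=U+9FFF).
fn is_han(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}')
}

/// Applies the proposed rule: only Han => Chinese ("cmn"),
/// Han plus at least 25% kana => Japanese ("jpn").
fn detect_cjk(text: &str) -> Option<&'static str> {
    let kana = text.chars().filter(|&c| is_kana(c)).count();
    let han = text.chars().filter(|&c| is_han(c)).count();
    let total = kana + han;
    if total == 0 {
        return None; // no CJK characters at all
    }
    if kana == 0 {
        return Some("cmn"); // only Mandarin characters => Chinese
    }
    // At least 25% kana among CJK characters => Japanese.
    if kana * 4 >= total {
        Some("jpn")
    } else {
        // Some kana but below the threshold: the proposal leaves this
        // case open; defaulting to Chinese here.
        Some("cmn")
    }
}

fn main() {
    assert_eq!(detect_cjk("你好，世界"), Some("cmn"));
    assert_eq!(detect_cjk("これは日本語のテキストです"), Some("jpn"));
}
```

Note that the ratio is computed over CJK characters only, so Latin text, punctuation, and whitespace don't dilute the kana share.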
I'm a native Japanese speaker. Feel free to mention me when you need help.
Your algorithm sounds good: 25% seems sufficient, and even less might be okay.
@KitaitiMakoto Thanks for the feedback!
Yeah, I just wanted to double-check that my idea makes sense.
I'm refactoring right now in order to implement and test that plan.
@KitaitiMakoto This seems to be a big improvement for Japanese detection!
I added benchmarks to #89.
Thank you!
Great!