Mimino666 / langdetect

Port of Google's language-detection library to Python.

Korean is incorrectly detected, with way too much confidence

rspeer opened this issue

I compared the results of langdetect to cld2 on a number of snippets from the Common Crawl, and found that langdetect was frequently detecting Japanese or Chinese text as Korean. This is particularly odd because, in the digital era, Korean is overwhelmingly written using hangul, not using Chinese characters.

Here's an example of a Chinese text that langdetect says is Korean with 99.999% confidence:

>>> from langdetect import detect_langs
>>> text = "這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等,在教學環境方面,多數也屬非常基本實用,部分培訓中心的器材設施甚至有點不足或落後,但是在導師水平和態度上,普遍也很良好,有部分導師更主動 地一直在更新及改良教材,以配合受再培訓人士的能力和需要。"
>>> detect_langs(text)
[ko:0.9999977954260393]

I have the same problem. This is "Doshisha University" in Japanese.

>>> from langdetect import detect_langs
>>> detect_langs('同志社大学')
[ko:0.9999959410191299]

I also have the same problem. I am using this library to separate internet comments by language, and only a few percent of the comments that end up in the Korean category are actually Korean; most are Chinese or Japanese. This is odd because, as @rspeer said, these languages normally use completely different character sets: Chinese characters in Korean should be very rare nowadays, and Korean characters in Chinese should be nonexistent.

Is there any published fix for this problem?

For me, this issue is a deal breaker. polyglot seems to handle this really well, though: https://polyglot.readthedocs.io/en/latest/Detection.html
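
For reference, here is a rough sketch of how that looks with polyglot (based on my reading of its docs; the Detector API and the language.code attribute are assumptions worth double-checking):

>>> from polyglot.detect import Detector
>>> detector = Detector('同志社大学')
>>> detector.language.code   # cld2-backed, so this should come back as 'ja' rather than 'ko'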

@patrickmpoon polyglot uses cld2, which was already mentioned by the OP.

I got the same issue on a Traditional Chinese string:

評估產品的生命週期中,對環境造成的影響,影響包含對氣候的變化以及自然資源的枯竭程度
[ko:0.9999969462364235] --> detected as ko, but this should be zh-tw

评估产品的整个生命周期对环境产生的影响,包括对气候变化的影响以及对自然资源枯竭的影响
[zh-cn:0.9999981145247211] --> OK

This is indeed a massive deal breaker.

I ended up using fastText (cld2 or cld3 are fine too), and when Chinese is detected, I further detect the script (traditional or simplified) with hanzidentifier.
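
In case it helps, here is roughly what the script-detection step looks like with hanzidentifier, using the two strings from the earlier zh-tw/zh-cn comment (a sketch based on my understanding of its API; the identify() function and the SIMPLIFIED/TRADITIONAL constants are worth verifying against its docs):

>>> import hanzidentifier
>>> hanzidentifier.identify('評估產品的生命週期中,對環境造成的影響,影響包含對氣候的變化以及自然資源的枯竭程度') == hanzidentifier.TRADITIONAL
True
>>> hanzidentifier.identify('评估产品的整个生命周期对环境产生的影响,包括对气候变化的影响以及对自然资源枯竭的影响') == hanzidentifier.SIMPLIFIED
True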

As others mentioned above, Polyglot, or really the underlying pycld2/cld2 library, wins out in these cases:

>>> import pycld2 as cld2
>>> text = "這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等,在教學環境方面,多數也屬非常基本實用,部分培訓中心的器材設施甚至 有點不足或落後,但是在導師水平和態度上,普遍也很良好,有部分導師更主動 地一直在更新及改良教材,以配合受再培訓人士的能力和需要。"
>>> data = text.encode("utf-8")
>>> cld2.detect(data, bestEffort=False)
(True, 383, (('ChineseT', 'zh-Hant', 99, 1951.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))
>>> cld2.detect('同志社大学'.encode("utf-8"), bestEffort=False)
(True, 17, (('Japanese', 'ja', 94, 1984.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))

I spent a little time on this and found that the problem lies in the training data. Many Chinese characters (for example 且) do not show up in the Chinese training sample (Wikipedia abstracts, if I understand correctly) and therefore drive the Chinese probability very low, while the same characters do appear in the Korean training texts. Which characters appear in which profile can easily be checked in the profiles directory.
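
A quick sketch of that check (assuming the profiles are the JSON files shipped with langdetect and that detector_factory exposes PROFILES_DIRECTORY; adjust the path if your version differs):

import json
import os

# Assumed location of the constant pointing at langdetect's bundled profiles.
from langdetect.detector_factory import PROFILES_DIRECTORY

def char_in_profile(char, lang):
    # Each profile file (e.g. "ko", "zh-cn") is JSON with a "freq" map of n-gram counts.
    with open(os.path.join(PROFILES_DIRECTORY, lang), encoding='utf-8') as f:
        profile = json.load(f)
    return char in profile['freq']

for lang in ('ko', 'zh-cn', 'zh-tw'):
    print(lang, char_in_profile('且', lang))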

I'm having the same issue of Chinese being detected as Korean (e.g. "要素替代弹性, 价格加成对劳动收入份额的影响研究"). There are also cases where English is detected as Italian (e.g. "A novel comprehensive statistical model for spontaneous synaptic quantal release"). The result sometimes changes depending on the seed.
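
On the "depending on the seed" part: langdetect's README documents that detection is non-deterministic unless you fix the factory seed. That does not fix the mislabeling, but it does make runs reproducible:

>>> from langdetect import DetectorFactory
>>> DetectorFactory.seed = 0   # results are now the same (possibly still wrong) on every run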

I tried polyglot but had trouble compiling its native dependency libicu (icu4c via brew on macOS), so I ended up using fastText with a pretrained model. The results look much more reliable than what langdetect produces; at least the two cases above are detected correctly.
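
For anyone following the same route, this is roughly the fastText setup I mean (a sketch; it assumes you have downloaded the pretrained lid.176.ftz language-identification model from the fastText site and point load_model at its local path):

import fasttext

model = fasttext.load_model('lid.176.ftz')  # path to the downloaded pretrained model

for text in [
    '要素替代弹性, 价格加成对劳动收入份额的影响研究',
    'A novel comprehensive statistical model for spontaneous synaptic quantal release',
]:
    # predict() returns a tuple of labels like ('__label__zh',) plus an array of probabilities
    labels, probs = model.predict(text, k=1)
    print(labels[0], probs[0])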