Mimino666 / langdetect

Port of Google's language-detection library to Python.

Korean is incorrectly detected, with way too much confidence

rspeer opened this issue

I compared the results of langdetect to cld2 on a number of snippets from the Common Crawl, and found that langdetect was frequently detecting Japanese or Chinese text as Korean. This is particularly odd because, in the digital era, Korean is overwhelmingly written using hangul, not using Chinese characters.

Here's an example of a Chinese text that langdetect says is Korean with 99.999% confidence:

>>> from langdetect import detect_langs
>>> text = "這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等,在教學環境方面,多數也屬非常基本實用,部分培訓中心的器材設施甚至有點不足或落後,但是在導師水平和態度上,普遍也很良好,有部分導師更主動 地一直在更新及改良教材,以配合受再培訓人士的能力和需要。"
>>> detect_langs(text)
[ko:0.9999977954260393]

I have the same problem. This is "Doshisha University" in Japanese.

>>> from langdetect import detect_langs
>>> detect_langs('同志社大学')
[ko:0.9999959410191299]

I also have the same problem. I am using this library to separate internet comments by language, and only a few percent of the comments that end up in the Korean category are actually Korean; most are Chinese or Japanese. This is odd because, as @rspeer said, these languages normally use completely different character sets: Chinese characters in Korean should be very rare nowadays, and Korean characters in Chinese should be nonexistent.

Is there any published fix for this problem?

For me, this issue is a deal breaker. polyglot seems to handle this really well, though: https://polyglot.readthedocs.io/en/latest/Detection.html
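
For reference, here is a rough sketch of how that looks with polyglot (based on my reading of its docs; the Detector API and the language.code attribute are assumptions worth double-checking):

>>> from polyglot.detect import Detector
>>> detector = Detector('同志社大学')
>>> detector.language.code   # cld2-backed, so this should come back as 'ja' rather than 'ko'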

@patrickmpoon polyglot uses cld2, which was already mentioned by the OP.

I got the same issue on a Traditional Chinese string:

評估產品的生命週期中,對環境造成的影響,影響包含對氣候的變化以及自然資源的枯竭程度
[ko:0.9999969462364235] --> detected as ko, but this should be zh-tw

评估产品的整个生命周期对环境产生的影响,包括对气候变化的影响以及对自然资源枯竭的影响
[zh-cn:0.9999981145247211] --> OK

This is indeed a massive deal breaker.

I ended up using fastText (cld2 or cld3 are fine too), and when Chinese is detected, I further detect the script (traditional or simplified) with hanzidentifier.
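
In case it helps, here is roughly what the script-detection step looks like with hanzidentifier, using the two strings from the earlier zh-tw/zh-cn comment (a sketch based on my understanding of its API; the identify() function and the SIMPLIFIED/TRADITIONAL constants are worth verifying against its docs):

>>> import hanzidentifier
>>> hanzidentifier.identify('評估產品的生命週期中,對環境造成的影響,影響包含對氣候的變化以及自然資源的枯竭程度') == hanzidentifier.TRADITIONAL
True
>>> hanzidentifier.identify('评估产品的整个生命周期对环境产生的影响,包括对气候变化的影响以及对自然资源枯竭的影响') == hanzidentifier.SIMPLIFIED
True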

As others mentioned above, Polyglot, or really the underlying pycld2/cld2 library, wins out in these cases:

>>> import pycld2 as cld2
>>> text = "這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等,在教學環境方面,多數也屬非常基本實用,部分培訓中心的器材設施甚至 有點不足或落後,但是在導師水平和態度上,普遍也很良好,有部分導師更主動 地一直在更新及改良教材,以配合受再培訓人士的能力和需要。"
>>> data = text.encode("utf-8")
>>> cld2.detect(data, bestEffort=False)
(True, 383, (('ChineseT', 'zh-Hant', 99, 1951.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))
>>> cld2.detect('同志社大学'.encode("utf-8"), bestEffort=False)
(True, 17, (('Japanese', 'ja', 94, 1984.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))

I spent a little time on this and found that the problem lies in the training data. Many Chinese characters (for example 且) do not show up in the Chinese training sample (Wikipedia abstracts, if I understand correctly) and therefore drive the Chinese probability very low, while the same characters do appear in the Korean training texts. Which characters appear in which profile can easily be checked in the profiles directory.
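
A quick sketch of that check (assuming the profiles are the JSON files shipped with langdetect and that detector_factory exposes PROFILES_DIRECTORY; adjust the path if your version differs):

import json
import os

# Assumed location of the constant pointing at langdetect's bundled profiles.
from langdetect.detector_factory import PROFILES_DIRECTORY

def char_in_profile(char, lang):
    # Each profile file (e.g. "ko", "zh-cn") is JSON with a "freq" map of n-gram counts.
    with open(os.path.join(PROFILES_DIRECTORY, lang), encoding='utf-8') as f:
        profile = json.load(f)
    return char in profile['freq']

for lang in ('ko', 'zh-cn', 'zh-tw'):
    print(lang, char_in_profile('且', lang))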

I'm having the same issue of Chinese being detected as Korean (e.g. "要素替代弹性, 价格加成对劳动收入份额的影响研究"). There are also cases where English is detected as Italian (e.g. "A novel comprehensive statistical model for spontaneous synaptic quantal release"). The result sometimes changes depending on the seed.
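
On the "depending on the seed" part: langdetect's README documents that detection is non-deterministic unless you fix the factory seed. That does not fix the mislabeling, but it does make runs reproducible:

>>> from langdetect import DetectorFactory
>>> DetectorFactory.seed = 0   # results are now the same (possibly still wrong) on every run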

I tried polyglot but had trouble compiling its native dependency libicu (icu4c via brew on macOS), so I ended up using fastText with a pretrained model. The results look much more reliable than what langdetect produces; at least the two cases above are detected correctly.
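
For anyone following the same route, this is roughly the fastText setup I mean (a sketch; it assumes you have downloaded the pretrained lid.176.ftz language-identification model from the fastText site and point load_model at its local path):

import fasttext

model = fasttext.load_model('lid.176.ftz')  # path to the downloaded pretrained model

for text in [
    '要素替代弹性, 价格加成对劳动收入份额的影响研究',
    'A novel comprehensive statistical model for spontaneous synaptic quantal release',
]:
    # predict() returns a tuple of labels like ('__label__zh',) plus an array of probabilities
    labels, probs = model.predict(text, k=1)
    print(labels[0], probs[0])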