
Korean is incorrectly detected, with way too much confidence #51

Open
rspeer opened this issue Mar 6, 2018 · 10 comments

Comments


rspeer commented Mar 6, 2018

I compared the results of langdetect to cld2 on a number of snippets from the Common Crawl, and found that langdetect was frequently detecting Japanese or Chinese text as Korean. This is particularly odd because, in the digital era, Korean is overwhelmingly written using hangul, not using Chinese characters.

Here's an example of a Chinese text that langdetect says is Korean with 99.999% confidence:

>>> from langdetect import detect_langs
>>> text = "這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等,在教學環境方面,多數也屬非常基本實用,部分培訓中心的器材設施甚至有點不足或落後,但是在導師水平和態度上,普遍也很良好,有部分導師更主動 地一直在更新及改良教材,以配合受再培訓人士的能力和需要。"
>>> detect_langs(text)
[ko:0.9999977954260393]
@EdwardBetts

I have the same problem. This is "Doshisha University" in Japanese.

>>> from langdetect import detect_langs
>>> detect_langs('同志社大学')
[ko:0.9999959410191299]

@Dobatymo

I also have the same problem. I am using this library to separate internet comments by language, and only a few percent of the comments that end up in the Korean category are actually Korean; most are Chinese or Japanese. That is odd because, as @rspeer said, the two languages normally use completely different scripts: Chinese characters in Korean should be very rare nowadays, and Korean characters in Chinese should be nonexistent.
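For what it's worth, a cheap sanity check is to look at which Unicode script the text actually uses before trusting the detector. Below is a minimal sketch (not part of langdetect) based on the standard Hangul and CJK ideograph block ranges; the helper names and exact ranges are my own.

import re

# Hangul syllables, jamo and compatibility jamo vs. CJK unified ideographs.
HANGUL = re.compile(r"[\uac00-\ud7a3\u1100-\u11ff\u3130-\u318f]")
CJK = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf]")

def looks_korean(text):
    # Heuristic: only trust a 'ko' result if the text actually contains hangul.
    return bool(HANGUL.search(text))

def cjk_without_hangul(text):
    # True for text that has Chinese characters but no hangul at all.
    return bool(CJK.search(text)) and not HANGUL.search(text)

print(looks_korean("同志社大学"))       # False -> a 'ko' result here is suspect
print(cjk_without_hangul("同志社大学"))  # True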

@zafercavdar

Is there any published fix for this problem?

@patrickmpoon

For me, this issue is a deal killer. polyglot seems to handle this really well, though: https://polyglot.readthedocs.io/en/latest/Detection.html

@Dobatymo

@patrickmpoon polyglot uses cld2, which is what the OP mentioned.


matrey commented Mar 19, 2019

I got the same issue on a Traditional Chinese string:

評估產品的生命週期中,對環境造成的影響,影響包含對氣候的變化以及自然資源的枯竭程度 
[ko:0.9999969462364235] --> KO, this should be zh-tw

评估产品的整个生命周期对环境产生的影响,包括对气候变化的影响以及对自然资源枯竭的影响 
[zh-cn:0.9999981145247211] --> OK

This is indeed a massive deal breaker.

I ended up using fastText (cld2 or cld3 are fine too), and when Chinese is detected, I further detect the script (traditional or simplified) with hanzidentifier.
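For reference, here is a rough sketch of that pipeline, assuming the pretrained lid.176.bin model from fastText's language-identification page and the hanzidentifier package; treat the label handling and constants as an approximation rather than the exact code I run.

import fasttext          # pip install fasttext; lid.176.bin from https://fasttext.cc/docs/en/language-identification.html
import hanzidentifier    # pip install hanzidentifier

model = fasttext.load_model("lid.176.bin")

def detect(text):
    # fastText's predict() rejects newlines, so flatten the input first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    if lang == "zh":
        # fastText only reports generic "zh"; split it by script with hanzidentifier.
        script = hanzidentifier.identify(text)
        if script == hanzidentifier.TRADITIONAL:
            return "zh-tw"
        if script == hanzidentifier.SIMPLIFIED:
            return "zh-cn"
    return lang

print(detect("評估產品的生命週期中，對環境造成的影響"))  # expected: zh-tw
print(detect("评估产品的整个生命周期对环境产生的影响"))  # expected: zh-cn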


bsolomon1124 commented Oct 4, 2019

As mentioned by others above, Polyglot (or really the underlying library, pycld2/cld2) wins out in these cases:

>>> import pycld2 as cld2
>>> text = "這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等,在教學環境方面,多數也屬非常基本實用,部分培訓中心的器材設施甚至 有點不足或落後,但是在導師水平和態度上,普遍也很良好,有部分導師更主動 地一直在更新及改良教材,以配合受再培訓人士的能力和需要。"
>>> data = text.encode("utf-8")
>>> cld2.detect(data, bestEffort=False)
(True, 383, (('ChineseT', 'zh-Hant', 99, 1951.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))
>>> cld2.detect('同志社大学'.encode("utf-8"), bestEffort=False)
(True, 17, (('Japanese', 'ja', 94, 1984.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))


yjjg1993 commented Dec 13, 2019

I spent a little time on this and found that the problem lies in the training data. Many Chinese characters (for example 且) do not appear in the Chinese training sample (Wikipedia abstracts were used, if I understand correctly), which drives their probability very low, while the same characters do appear in the Korean training texts. Whether a character appears in a given language can easily be checked in the profile directory.
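To illustrate, the check is roughly this, assuming (as in the Python port) that each profile is a JSON file under langdetect/profiles/ with a "freq" table of n-gram counts; the key names may differ slightly between versions.

import json
import os
import langdetect

# One JSON profile per language code ships with the Python port of langdetect.
PROFILE_DIR = os.path.join(os.path.dirname(langdetect.__file__), "profiles")

def ngram_count(lang, ngram):
    # Stored frequency of an n-gram in a language profile; 0 if it never occurred.
    with open(os.path.join(PROFILE_DIR, lang), encoding="utf-8") as f:
        profile = json.load(f)
    return profile["freq"].get(ngram, 0)

for lang in ("zh-cn", "zh-tw", "ko", "ja"):
    print(lang, ngram_count(lang, "且"))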


weilu commented Sep 26, 2020

Having the same issue of Chinese being detected as Korean (e.g. "要素替代弹性, 价格加成对劳动收入份额的影响研究"). There are also cases where English is detected as Italian (e.g. "A novel comprehensive statistical model for spontaneous synaptic quantal release"); this happens intermittently, depending on the seed.

I tried polyglot but had trouble compiling the native dependency libicu (icu4c via brew on macOS), so I ended up using fastText with a pretrained model. The results look much more reliable than what langdetect provides; at least the two cases above are detected correctly.
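As an aside, the run-to-run variation can at least be pinned down by fixing langdetect's seed (this is documented in its README); it does not fix the misclassification, it only makes the output deterministic:

from langdetect import DetectorFactory, detect_langs

# langdetect is non-deterministic by default; fixing the factory seed makes
# repeated calls return the same probabilities.
DetectorFactory.seed = 0

print(detect_langs("A novel comprehensive statistical model for spontaneous synaptic quantal release"))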

@mrhaanraadts

See #9
