
Korean is incorrectly detected, with way too much confidence #51

Open
rspeer opened this issue Mar 6, 2018 · 10 comments

Comments


rspeer commented Mar 6, 2018

I compared the results of langdetect to cld2 on a number of snippets from the Common Crawl, and found that langdetect was frequently detecting Japanese or Chinese text as Korean. This is particularly odd because, in the digital era, Korean is overwhelmingly written using hangul, not using Chinese characters.

Here's an example of a Chinese text that langdetect says is Korean with 99.999% confidence:

>>> from langdetect import detect_langs
>>> text = "這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等,在教學環境方面,多數也屬非常基本實用,部分培訓中心的器材設施甚至有點不足或落後,但是在導師水平和態度上,普遍也很良好,有部分導師更主動 地一直在更新及改良教材,以配合受再培訓人士的能力和需要。"
>>> detect_langs(text)
[ko:0.9999977954260393]
@EdwardBetts

I have the same problem. This is "Doshisha University" in Japanese.

>>> from langdetect import detect_langs
>>> detect_langs('同志社大学')
[ko:0.9999959410191299]

@Dobatymo

I also have the same problem. I am using this library to separate internet comments by language, and only a few percent of the comments that end up in the Korean category are actually Korean; most are Chinese or Japanese. That is odd because, as @rspeer said, the two languages normally use completely different scripts: Chinese characters in Korean should be very rare nowadays, and Korean characters in Chinese should be nonexistent.
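For what it's worth, a cheap sanity check is to look at which Unicode script the text actually uses before trusting the detector. Below is a minimal sketch (not part of langdetect) based on the standard Hangul and CJK ideograph block ranges; the helper names and exact ranges are my own.

import re

# Hangul syllables, jamo and compatibility jamo vs. CJK unified ideographs.
HANGUL = re.compile(r"[\uac00-\ud7a3\u1100-\u11ff\u3130-\u318f]")
CJK = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf]")

def looks_korean(text):
    # Heuristic: only trust a 'ko' result if the text actually contains hangul.
    return bool(HANGUL.search(text))

def cjk_without_hangul(text):
    # True for text that has Chinese characters but no hangul at all.
    return bool(CJK.search(text)) and not HANGUL.search(text)

print(looks_korean("同志社大学"))       # False -> a 'ko' result here is suspect
print(cjk_without_hangul("同志社大学"))  # True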

@zafercavdar

Is there any published fix for this problem?

@patrickmpoon

For me, this issue is a deal killer. polyglot seems to handle this really well, though: https://polyglot.readthedocs.io/en/latest/Detection.html

@Dobatymo

@patrickmpoon polyglot uses cld2, which is what the OP mentioned.


matrey commented Mar 19, 2019

I got the same issue on a Traditional Chinese string:

評估產品的生命週期中,對環境造成的影響,影響包含對氣候的變化以及自然資源的枯竭程度 
[ko:0.9999969462364235] --> KO, this should be zh-tw

评估产品的整个生命周期对环境产生的影响,包括对气候变化的影响以及对自然资源枯竭的影响 
[zh-cn:0.9999981145247211] --> OK

This is indeed a massive deal breaker.

I ended up using fastText (cld2 or cld3 are fine too), and when Chinese is detected, I further detect the script (traditional or simplified) with hanzidentifier.
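For reference, here is a rough sketch of that pipeline, assuming the pretrained lid.176.bin model from fastText's language-identification page and the hanzidentifier package; treat the label handling and constants as an approximation rather than the exact code I run.

import fasttext          # pip install fasttext; lid.176.bin from https://fasttext.cc/docs/en/language-identification.html
import hanzidentifier    # pip install hanzidentifier

model = fasttext.load_model("lid.176.bin")

def detect(text):
    # fastText's predict() rejects newlines, so flatten the input first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    if lang == "zh":
        # fastText only reports generic "zh"; split it by script with hanzidentifier.
        script = hanzidentifier.identify(text)
        if script == hanzidentifier.TRADITIONAL:
            return "zh-tw"
        if script == hanzidentifier.SIMPLIFIED:
            return "zh-cn"
    return lang

print(detect("評估產品的生命週期中，對環境造成的影響"))  # expected: zh-tw
print(detect("评估产品的整个生命周期对环境产生的影响"))  # expected: zh-cn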


bsolomon1124 commented Oct 4, 2019

As mentioned by others above, Polyglot (or really the underlying library, pycld2/cld2) wins out in these cases:

>>> import pycld2 as cld2
>>> text = "這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等,在教學環境方面,多數也屬非常基本實用,部分培訓中心的器材設施甚至 有點不足或落後,但是在導師水平和態度上,普遍也很良好,有部分導師更主動 地一直在更新及改良教材,以配合受再培訓人士的能力和需要。"
>>> data = text.encode("utf-8")
>>> cld2.detect(data, bestEffort=False)
(True, 383, (('ChineseT', 'zh-Hant', 99, 1951.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))
>>> cld2.detect('同志社大学'.encode("utf-8"), bestEffort=False)
(True, 17, (('Japanese', 'ja', 94, 1984.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))


yjjg1993 commented Dec 13, 2019

I spent a little time on this and found that the problem lies in the training data. Many Chinese characters (for example 且) do not appear in the Chinese training sample (Wikipedia abstracts were used, if I understand correctly), which drives their probability very low, while the same characters do appear in the Korean training texts. Whether a character appears in a given language can easily be checked in the profile directory.
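To illustrate, the check is roughly this, assuming (as in the Python port) that each profile is a JSON file under langdetect/profiles/ with a "freq" table of n-gram counts; the key names may differ slightly between versions.

import json
import os
import langdetect

# One JSON profile per language code ships with the Python port of langdetect.
PROFILE_DIR = os.path.join(os.path.dirname(langdetect.__file__), "profiles")

def ngram_count(lang, ngram):
    # Stored frequency of an n-gram in a language profile; 0 if it never occurred.
    with open(os.path.join(PROFILE_DIR, lang), encoding="utf-8") as f:
        profile = json.load(f)
    return profile["freq"].get(ngram, 0)

for lang in ("zh-cn", "zh-tw", "ko", "ja"):
    print(lang, ngram_count(lang, "且"))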


weilu commented Sep 26, 2020

Having the same issue of Chinese being detected as Korean (e.g. "要素替代弹性, 价格加成对劳动收入份额的影响研究"). There are also cases where English is detected as Italian (e.g. "A novel comprehensive statistical model for spontaneous synaptic quantal release"); this happens intermittently, depending on the seed.

I tried polyglot but had trouble compiling the native dependency libicu (icu4c via brew on macOS), so I ended up using fastText with a pretrained model. The results look much more reliable than what langdetect provides; at least the two cases above are detected correctly.
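As an aside, the run-to-run variation can at least be pinned down by fixing langdetect's seed (this is documented in its README); it does not fix the misclassification, it only makes the output deterministic:

from langdetect import DetectorFactory, detect_langs

# langdetect is non-deterministic by default; fixing the factory seed makes
# repeated calls return the same probabilities.
DetectorFactory.seed = 0

print(detect_langs("A novel comprehensive statistical model for spontaneous synaptic quantal release"))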

@mrhaanraadts

See #9
