Korean is incorrectly detected, with way too much confidence #51
Comments
I have the same problem. This is "Doshisha University" in Japanese.
>>> from langdetect import detect_langs
>>> detect_langs('同志社大学')
[ko:0.9999959410191299]
I also have the same problem. I am using this library to separate internet comments by language, and only a few percent of the comments that end up in the Korean category are actually Korean. Most are Chinese or Japanese. That is odd because, as @rspeer said, the languages usually use completely different characters. Chinese characters should be very rare in Korean text nowadays, and Korean characters should be nonexistent in Chinese text.
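The script argument above can be turned into a cheap sanity check. The following is only a sketch of a workaround, not part of langdetect: the function names are hypothetical, and the Unicode ranges cover just the main hangul, kana, and CJK ideograph blocks. The idea is to reject a "ko" label whenever the text contains no hangul at all.

```python
def script_counts(text):
    """Count characters per script using a few fixed Unicode ranges (a simplification)."""
    counts = {"hangul": 0, "kana": 0, "han": 0}
    for ch in text:
        cp = ord(ch)
        # Hangul syllables, Hangul jamo, and Hangul compatibility jamo.
        if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF or 0x3130 <= cp <= 0x318F:
            counts["hangul"] += 1
        # Hiragana and katakana.
        elif 0x3040 <= cp <= 0x30FF:
            counts["kana"] += 1
        # CJK unified ideographs and extension A.
        elif 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
            counts["han"] += 1
    return counts


def plausibly_korean(text):
    """Hypothetical guard: accept a 'ko' label only if the text contains some hangul."""
    return script_counts(text)["hangul"] > 0
```

With this guard, a string such as '同志社大学' would not be accepted as Korean regardless of the detector's confidence. It does not tell Chinese from Japanese, so it only filters the false Korean labels.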
Is there a published fix for this problem?
For me, this issue is a deal killer. polyglot seems to handle this really well, though: https://polyglot.readthedocs.io/en/latest/Detection.html |
@patrickmpoon polyglot uses |
I got the same issue on a Traditional Chinese string:
This is indeed a massive deal breaker. I ended up using |
As mentioned by others above, Polyglot, or really the underlying library pycld2/cld2, wins out in these cases:
>>> import pycld2 as cld2
>>> text = "這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等,在教學環境方面,多數也屬非常基本實用,部分培訓中心的器材設施甚至 有點不足或落後,但是在導師水平和態度上,普遍也很良好,有部分導師更主動 地一直在更新及改良教材,以配合受再培訓人士的能力和需要。"
>>> data = text.encode("utf-8")
>>> cld2.detect(data, bestEffort=False)
(True, 383, (('ChineseT', 'zh-Hant', 99, 1951.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))
>>> cld2.detect('同志社大学'.encode("utf-8"), bestEffort=False)
(True, 17, (('Japanese', 'ja', 94, 1984.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))
I spent a little time on it and found that the problem lies in the training sample. Many Chinese characters (for example 且) do not show up in the Chinese training sample (Wikipedia abstracts were used, if I understand correctly), so their probability under the Chinese profile is very low. However, those same characters (且) do appear in the Korean training texts, which pushes the result toward Korean. Which characters appear in which language can easily be checked in the profile directory.
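A check like the one described above could look roughly like this. It is a sketch under one assumption: each file in the profile directory is a JSON document with a "name" field and a "freq" map from n-gram to count. The helper name is hypothetical, and the profile directory has to be supplied by the caller.

```python
import json
from pathlib import Path


def languages_containing(char, profile_dir):
    """Return {language: count} for every profile whose "freq" table lists `char`.

    Assumes each profile file is JSON with "name" and "freq" keys.
    """
    found = {}
    for path in sorted(Path(profile_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as fh:
            profile = json.load(fh)
        count = profile.get("freq", {}).get(char)
        if count:
            # Fall back to the file name if the profile has no "name" field.
            found[profile.get("name", path.name)] = count
    return found
```

If the profiles ship inside the installed package (for example in a `profiles` folder next to the module, which is an assumption to verify locally), calling `languages_containing('且', that_folder)` would show which languages have seen the character during training.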
I have the same issue: Chinese is detected as Korean (e.g. "要素替代弹性, 价格加成对劳动收入份额的影响研究"). There are also cases where English is detected as Italian (e.g. "A novel comprehensive statistical model for spontaneous synaptic quantal release"); this happens intermittently, depending on the seed. I tried polyglot but had trouble compiling its native dependency libicu (icu4c via brew on macOS), so I ended up using fasttext with a pretrained model. The results look much more reliable than what langdetect provides; at least the two cases above are detected correctly.
See #9 |
I compared the results of langdetect to cld2 on a number of snippets from the Common Crawl, and found that langdetect was frequently detecting Japanese or Chinese text as Korean. This is particularly odd because, in the digital era, Korean is overwhelmingly written using hangul, not Chinese characters. Here's an example of a Chinese text that langdetect says is Korean with 99.999% confidence: