I noticed that SudachiPy and Sudachi behave differently: SudachiPy does not recognize "İstanbul" as a single token, whereas Sudachi does, so I am reporting it.
$ echo "İstanbul" | sudachipy -a
İ 名詞,普通名詞,一般,*,*,* I I アイ 0 []
補助記号,一般,*,*,*,* ̇ ̇ -1 [] (OOV)
stanbul 名詞,普通名詞,一般,*,*,* stanbul stanbul -1 [] (OOV)
EOS
$ echo "İstanbul" | sudachi -a
İstanbul 名詞,固有名詞,一般,*,*,* Istanbul Istanbul Istanbul 0 [15600]
EOS
Apparently, the character normalization step passes different input to the tokenizer in each implementation.
It seems that "İ" (U+0130) is converted to "i" (U+0069) followed by the combining dot above "◌̇" (U+0307) in Python, but to just "i" (U+0069) in Java.
This may be due to differences in how each programming language specifies lowercase conversion.
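The Python side of this can be checked without SudachiPy. The snippet below is a minimal sketch using only the standard library: it shows that `str.lower()` expands "İ" into two code points, and it includes a hypothetical helper, `lower_first_codepoint`, that keeps only the first code point of each lowercased character. That helper only approximates the single-code-point result reported for Java above; it is an illustration, not SudachiPy's or Sudachi's actual normalization code.

```python
def codepoints(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def lower_first_codepoint(text):
    """Hypothetical helper: lowercase each character separately and keep
    only the first code point of the result, so the length never grows."""
    return "".join(ch.lower()[0] for ch in text)


source = "\u0130stanbul"  # "İstanbul" with U+0130 as the first character

# Python's str.lower() applies the full Unicode case mapping, which turns
# U+0130 into U+0069 followed by the combining dot above U+0307.
lowered = source.lower()
print(codepoints(source[:1]))
print(codepoints(lowered[:2]))

# The helper drops the combining mark, leaving a plain "i".
print(codepoints(lower_first_codepoint(source)[:1]))
```

If the normalized text contains the extra U+0307, a dictionary lookup for "istanbul" no longer matches the whole word, which is consistent with the three-token SudachiPy output above.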