I noticed that SudachiPy and Sudachi behave differently: SudachiPy does not recognize "İstanbul" as a single token, whereas Sudachi does. Reporting it here.
$ echo "İstanbul" | sudachipy -a
İ 名詞,普通名詞,一般,*,*,* I I アイ 0 []
補助記号,一般,*,*,*,* ̇ ̇ -1 [] (OOV)
stanbul 名詞,普通名詞,一般,*,*,* stanbul stanbul -1 [] (OOV)
EOS
$ echo "İstanbul" | sudachi -a
İstanbul 名詞,固有名詞,一般,*,*,* Istanbul Istanbul Istanbul 0 [15600]
EOS
The character normalization step appears to produce different input text in each implementation:
$ echo "İstanbul" | sudachipy -d
=== Inupt dump:
i(U+0307)stanbul
$ echo "İstanbul" | sudachi -d
=== Input dump:
istanbul
It seems that "İ" (U+0130) is converted to "i" (U+0069) followed by "◌̇" (U+0307) in Python, but to "i" (U+0069) alone in Java.
This may be due to differences in how each programming language specifies its lower (lowercase conversion) operation.
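The Python side of this can be reproduced without SudachiPy. The sketch below shows that str.lower() expands U+0130 into two code points, which matches the SudachiPy input dump above. The simple_lower helper is hypothetical (it is not part of SudachiPy or Sudachi) and only approximates a per-character, one-to-one lowercase mapping by keeping the first code point of each result. It assumes that the Java side lowercases one code point at a time, which I have not verified.

```python
import unicodedata


def code_points(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


def simple_lower(text):
    """Hypothetical helper: lowercase each character and keep only the
    first code point of the result, approximating a one-to-one mapping."""
    return "".join(ch.lower()[:1] for ch in text)


word = "\u0130stanbul"  # "İstanbul" with LATIN CAPITAL LETTER I WITH DOT ABOVE

full = word.lower()          # Python's full case mapping
simple = simple_lower(word)  # one code point in, one code point out

print(code_points(full[:2]))          # ['U+0069', 'U+0307']
print(unicodedata.name(full[1]))      # COMBINING DOT ABOVE
print(len(word), len(full))           # 8 9
print(simple)                         # istanbul
```

With the one-to-one mapping the text becomes "istanbul", which is what the Sudachi (Java) input dump shows, so the dictionary lookup can succeed.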