"İ" does not behave the same as the Java version of Sudachi. #202

Open
@katsutan

I noticed that SudachiPy and the Java version of Sudachi behave differently: SudachiPy does not recognize "İstanbul" as a single token. Reporting it here.

$ echo "İstanbul" | sudachipy -a
İ       名詞,普通名詞,一般,*,*,*        I       I       アイ    0       []
        補助記号,一般,*,*,*,*   ̇       ̇               -1      []      (OOV)
stanbul 名詞,普通名詞,一般,*,*,*        stanbul stanbul         -1      []      (OOV)
EOS

$ echo "İstanbul" | sudachi -a
İstanbul        名詞,固有名詞,一般,*,*,*        Istanbul        Istanbul        Istanbul        0       [15600]
EOS

The character normalization step appears to pass different input to the tokenizer in each implementation.

$ echo "İstanbul" | sudachipy -d
=== Inupt dump:
i(U+0307)stanbul

$ echo "İstanbul" | sudachi -d
=== Input dump:
istanbul

It seems that "İ" (U+0130) is converted to "i" (U+0069) followed by "◌̇" (U+0307) in Python, but to just "i" (U+0069) in Java.
This may be due to differences in how each programming language specifies lowercase conversion.
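
The Python side of this can be reproduced without SudachiPy. The sketch below only demonstrates the behavior of Python's built-in `str.lower()`. The helper `lower_single_code_point` is a hypothetical name introduced for illustration, not part of SudachiPy or Sudachi, and it is an assumption that a one-code-point-per-character lowercasing like this matches what the Java version does internally.

```python
DOTTED_CAPITAL_I = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def lower_single_code_point(text):
    """Hypothetical helper: lowercase each character, but keep only the
    first code point when the lowercase mapping expands to several."""
    out = []
    for ch in text:
        lowered = ch.lower()
        out.append(lowered if len(lowered) == 1 else lowered[0])
    return "".join(out)


# str.lower() applies the full Unicode case mapping, so U+0130 becomes
# "i" followed by a combining dot above.
print(code_points(DOTTED_CAPITAL_I.lower()))  # ['U+0069', 'U+0307']

# The hypothetical helper drops the combining mark instead.
print(lower_single_code_point("\u0130stanbul"))  # istanbul
```

With the first form, the normalized input contains the extra U+0307, which is consistent with the `sudachipy -d` dump above; with the second, the input is plain "istanbul", as in the `sudachi -d` dump.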
