"İ" does not behave the same as the Java version of Sudachi.

I am reporting a difference in behavior between SudachiPy and the Java version of Sudachi: "İstanbul" is recognized as a single token by Sudachi, but SudachiPy splits it into three tokens.

```
$ echo "İstanbul" | sudachipy -a
İ       名詞,普通名詞,一般,*,*,*        I       I       アイ    0       []
        補助記号,一般,*,*,*,*   ̇       ̇               -1      []      (OOV)
stanbul 名詞,普通名詞,一般,*,*,*        stanbul stanbul         -1      []      (OOV)
EOS

$ echo "İstanbul" | sudachi -a
İstanbul        名詞,固有名詞,一般,*,*,*        Istanbul        Istanbul        Istanbul        0       [15600]
EOS
```

The character normalization step appears to produce different input text in each implementation.

```
$ echo "İstanbul" | sudachipy -d
=== Inupt dump:
i(U+0307)stanbul

$ echo "İstanbul" | sudachi -d
=== Input dump:
istanbul
```
It seems that "İ" (U+0130) is converted to "i" (U+0069) followed by the combining dot above "◌̇" (U+0307) in Python, but to just "i" (U+0069) in Java.
This may be due to differences in how each programming language specifies lowercasing (`lower` in Python).
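
The Python side can be checked without Sudachi at all. The snippet below is only a sketch of the built-in `str.lower()` behavior, not SudachiPy's actual normalization code; the `describe` helper is a name made up for this illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    # "describe" is a hypothetical helper, not part of SudachiPy.
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


original = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = original.lower()

print(describe(original))
# [('U+0130', 'LATIN CAPITAL LETTER I WITH DOT ABOVE')]
print(describe(lowered))
# [('U+0069', 'LATIN SMALL LETTER I'), ('U+0307', 'COMBINING DOT ABOVE')]
```

So lowercasing the single code point U+0130 in Python yields two code points, which matches the `i(U+0307)stanbul` input dump above.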

"İ" does not behave the same as the Java version of Sudachi. #202
