As reported by @massongit in pull request #98, node-formatting seems to be ignored by MeCab when using Unidic. Please refer to taku910/mecab#41.
A workaround is to have natto-py accept an empty string as the value of the output-format-type option, -O.
Steps to reproduce:
1. Install Unidic 2.1.2.
2. Execute code snippet A below and observe that natto-py does not respect the specified node-formatting, but instead uses the default node-format for Unidic.
3. Contrast code snippet A (natto-py) with B and C (using mecab from the command line).
# Snippet A
# Note that node-formatting is ignored and defaults to node-format-unidic
>>> from natto import MeCab
>>> with MeCab(r'-d /opt/mecab/lib/mecab/dic/unidic -F%m\t%t,%f[12]\n') as nm:
...     for n in nm.parse('日本語だよ、これが。', as_nodes=True):
...         print(n.feature)
...
日本 ニッポン ニッポン 日本 名詞-固有名詞-地名-国
語 ゴ ゴ 語 名詞-普通名詞-一般
だ ダ ダ だ 助動詞 助動詞-ダ 終止形-一般
よ ヨ ヨ よ 助詞-終助詞
、 、 補助記号-読点
これ コレ コレ 此れ 代名詞
が ガ ガ が 助詞-格助詞
。 。 補助記号-句点
EOS
# Snippet B
# Note that node-formatting is ignored and defaults to node-format-unidic
$ echo '日本語だよ、これが。' | mecab -d /opt/mecab/lib/mecab/dic/unidic/ -F%m\\t%t,%f[12]\\n
日本 ニッポン ニッポン 日本 名詞-固有名詞-地名-国
語 ゴ ゴ 語 名詞-普通名詞-一般
だ ダ ダ だ 助動詞 助動詞-ダ 終止形-一般
よ ヨ ヨ よ 助詞-終助詞
、 、 補助記号-読点
これ コレ コレ 此れ 代名詞
が ガ ガ が 助詞-格助詞
。 。 補助記号-句点
EOS
# Snippet C
# node-formatting is honored when -O is passed an empty string!
$ echo '日本語だよ、これが。' | mecab -d /opt/mecab/lib/mecab/dic/unidic/ -F%m\\t%t,%f[12]\\n -O ""
日本 2,固
語 2,漢
だ 6,和
よ 6,和
、 3,記号
これ 6,和
が 6,和
。 3,記号
EOS
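For anyone who wants to reproduce Snippet C from Python without going through natto-py, the following is a minimal sketch that shells out to the `mecab` binary. The helper names `build_mecab_argv` and `run_mecab` are hypothetical and exist only for this illustration. The dictionary path is the one used in the snippets above, and the sketch assumes `mecab` is on `PATH`.

```python
import subprocess
from typing import List


def build_mecab_argv(dicdir: str, node_format: str,
                     clear_output_type: bool = True) -> List[str]:
    """Build the argument list for the mecab command line (hypothetical helper).

    -d, -F and -O are the same options used in Snippets B and C.
    """
    argv = ["mecab", "-d", dicdir, "-F" + node_format]
    if clear_output_type:
        # Equivalent to `-O ""` in the shell: an explicitly empty
        # output-format-type, passed as its own (empty) argument.
        argv += ["-O", ""]
    return argv


def run_mecab(text: str, argv: List[str]) -> str:
    """Pipe text to mecab on stdin and return its stdout (hypothetical helper)."""
    result = subprocess.run(argv, input=text, capture_output=True,
                            encoding="utf-8", check=True)
    return result.stdout


if __name__ == "__main__":
    argv = build_mecab_argv("/opt/mecab/lib/mecab/dic/unidic/",
                            r"%m\t%t,%f[12]\n")
    print(run_mecab("日本語だよ、これが。\n", argv))
```

Passing `clear_output_type=False` drops the `-O ""` pair and should correspond to the invocation in Snippet B.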
The output-format-type option is used in a dictionary's dicrc to specify a default output format type for node-formatting. For example, consider the following sample dicrc for Unidic:
Here, when no other format is specified, the default formatting is *-format-unidic2.
MeCab gives preference to output-format-type over node-format, etc., unless output-format-type is explicitly set to an empty string. This behavior is consistent across the ipadic, jumandic and unidic dictionaries.
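The precedence described above can be pictured with a small toy model. This is only a sketch of the rule as stated in this thread, not MeCab's actual implementation. The function name `effective_node_format` and the placeholder dicrc values are hypothetical, and the format type name `unidic` is taken from the comment in Snippet A.

```python
from typing import Mapping, Optional


def effective_node_format(dicrc: Mapping[str, str],
                          cli_node_format: Optional[str] = None,
                          cli_output_format_type: Optional[str] = None
                          ) -> Optional[str]:
    """Toy model of which node format ends up being used.

    None means "option not given on the command line";
    an empty string means "explicitly set to empty" (as with -O "").
    """
    # A command-line -O value, even an empty one, overrides the dicrc default.
    if cli_output_format_type is not None:
        fmt_type = cli_output_format_type
    else:
        fmt_type = dicrc.get("output-format-type", "")

    if fmt_type:
        # A non-empty output-format-type wins over any -F value:
        # the dictionary's node-format-<type> entry is used.
        return dicrc.get("node-format-" + fmt_type)

    # With an empty output-format-type, an explicit -F is honored.
    if cli_node_format is not None:
        return cli_node_format
    return dicrc.get("node-format")


# Placeholder dicrc contents, for illustration only.
sample_dicrc = {
    "output-format-type": "unidic",
    "node-format-unidic": "<format defined in dicrc>",
}
user_format = r"%m\t%t,%f[12]\n"

# Snippets A and B: -F alone is ignored because dicrc sets a format type.
print(effective_node_format(sample_dicrc, user_format))
# Snippet C: -O "" clears the format type, so -F is honored.
print(effective_node_format(sample_dicrc, user_format, ""))
```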
I will close this issue. However, I have updated the output-format-type MeCab option description in the project wiki to describe how to override an existing, default output format by specifying an empty string.