Enable to specify an empty string option #98

Merged: 2 commits, Nov 29, 2017

massongit
Contributor

@massongit massongit commented Nov 26, 2017

(Related to taku910/mecab#41)
This change allows an empty string to be passed as an option value, so that the node-format option can be specified when using UniDic.

@massongit
Contributor Author

massongit commented Nov 26, 2017

I will write a test for this change, but I am not sure where it belongs in tests/test_option_parse.py.
Could you tell me where to put it?

@massongit massongit changed the title from "Fix to be able to specify node-format option even when using UniDic" to "Don't specify node-format option when using UniDic" Nov 26, 2017
@massongit massongit changed the title from "Don't specify node-format option when using UniDic" to "Enable to specify an empty string option" Nov 26, 2017
@massongit massongit force-pushed the feature/node_format branch 2 times, most recently from c340524 to 0e01002 Compare November 27, 2017 07:58
@buruzaemon
Owner

Thank you @massongit for bringing this issue to my attention. I will first confirm this and then open up an issue ticket. Please give me some time to look into this.

@buruzaemon
Owner

OK, this was easy enough to confirm.

I have opened up issue #99 to track this. I will start by coming up with appropriate tests, hopefully for both Windows and UNIX-type platforms. I don't have any tests for dictionaries besides ipadic, so I will need some time to come up with something that can cover Unidic, and perhaps Jumandic as well.

@buruzaemon
Owner

@massongit, thank you for your patience. Here is what I have found:

  1. MeCab gives preference to output-format-type over node-format, etc.
  2. But if you explicitly override this behavior by unsetting output-format-type (specifying an empty string), node-format will then be used.

This behavior of MeCab is consistent across ipadic, jumandic and unidic, and is not a function of the dictionary used.

I expect that your Unidic dicrc has the following lines:

output-format-type = unidic

node-format-unidic = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n

That means that unless you explicitly unset output-format-type by passing MeCab an empty string/name with -O "", the node format will default to node-format-unidic even if you also used -F. If you comment out output-format-type = unidic in your dicrc, then you will see that you don't need -O "".
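The precedence described above can be sketched as a small model. This is only an illustration of the rule as stated in this comment, not MeCab's actual implementation: the function name `effective_node_format` and its parameters are hypothetical, the dicrc is represented as a plain dict, and built-in output types are ignored.

```python
from typing import Dict, Optional

# Hypothetical, simplified model of the precedence rule described in this
# thread. It is NOT MeCab's real code and ignores built-in output types.
def effective_node_format(dicrc: Dict[str, str],
                          cli_output_format_type: Optional[str] = None,
                          cli_node_format: Optional[str] = None) -> Optional[str]:
    # A value given with -O (even the empty string) overrides the dicrc entry.
    if cli_output_format_type is not None:
        format_type = cli_output_format_type
    else:
        format_type = dicrc.get("output-format-type", "")

    if format_type:
        # A named output format type takes preference over -F / node-format.
        return dicrc.get("node-format-" + format_type)

    # No output format type is in effect, so -F (or node-format) is used.
    if cli_node_format is not None:
        return cli_node_format
    return dicrc.get("node-format")


UNIDIC_FORMAT = r"%m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n"
dicrc = {"output-format-type": "unidic", "node-format-unidic": UNIDIC_FORMAT}
custom = r"%m\t%f[0]\n"  # an arbitrary -F value chosen for illustration

# -F alone: the dicrc's output-format-type still wins.
with_f_only = effective_node_format(dicrc, cli_node_format=custom)
# -O "" together with -F: the custom node format is used.
with_empty_o = effective_node_format(dicrc, cli_output_format_type="",
                                     cli_node_format=custom)
# output-format-type commented out in dicrc: -F works without -O "".
without_dicrc_type = effective_node_format({"node-format-unidic": UNIDIC_FORMAT},
                                           cli_node_format=custom)
```

The three calls at the end correspond to the three situations in the paragraph above: `-F` alone, `-F` with `-O ""`, and `-F` with `output-format-type` commented out of the dicrc.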

You are correct that natto-py must likewise be able to accept -O "" in order to mirror this behavior.
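As a rough illustration of why an empty string needs special care in an option parser, here is a minimal standard-library sketch. It does not reproduce natto-py's actual parser or the diff in this pull request; `parse_options` is a hypothetical helper, and the assumption is simply that a truthiness check on option values would discard `""`.

```python
import argparse
import shlex


def parse_options(option_string):
    """Hypothetical helper: parse a MeCab-style option string into a dict."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-O", "--output-format-type", dest="output_format_type")
    parser.add_argument("-F", "--node-format", dest="node_format")
    # shlex.split keeps "" as an empty-string token instead of dropping it.
    args, _unknown = parser.parse_known_args(shlex.split(option_string))
    # Compare against None: a truthiness check (`if value:`) would silently
    # discard the empty string and lose the -O "" override.
    return {key: value for key, value in vars(args).items() if value is not None}


opts = parse_options(r'-O "" -F "%m\t%f[0]\n"')

# What a naive truthiness filter would keep: the empty -O value disappears.
dropped = {key: value for key, value in opts.items() if value}
```

Here `opts` retains `output_format_type` as an empty string, while `dropped` shows how a check such as `if value:` would lose it, leaving only the `-F` value.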

Hence, I will be accepting your pull request. Thank you very much! I will come up with some unit tests to cover this new behavior.

@buruzaemon buruzaemon merged commit 6fe5022 into buruzaemon:master Nov 29, 2017
@massongit
Contributor Author

@buruzaemon Thank you for confirming and merging!

@massongit massongit deleted the feature/node_format branch November 30, 2017 08:22