Some katakana words have no pronunciation #120

annisat · 2023-07-20T07:16:23Z

I need to segment some sentences and get their pronunciations. Some katakana words don't seem to have any pronunciation information. I can of course transcribe them myself using katakana's pronunciation rules, but I'm wondering: is this by design, or is it a bug?

Here's the code to reproduce this:

from janome.tokenizer import Tokenizer
toker = Tokenizer()

stc = "米国上院では、エドワード・ケネディー上院議員、ジョン・マッケイン上院議員共著による議案についても検討される。"
for token in toker.tokenize(stc):
    print(token)

And here's the output

米国    名詞,固有名詞,地域,国,*,*,米国,ベイコク,ベイコク
上院    名詞,固有名詞,組織,*,*,*,上院,ジョウイン,ジョーイン
で      助詞,格助詞,一般,*,*,*,で,デ,デ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
、      記号,読点,*,*,*,*,、,、,、
エドワード      名詞,固有名詞,人名,名,*,*,エドワード,エドワード,エドワード
・      記号,一般,*,*,*,*,・,・,・
ケネディー      名詞,一般,*,*,*,*,ケネディー,*,*
上院    名詞,固有名詞,組織,*,*,*,上院,ジョウイン,ジョーイン
議員    名詞,一般,*,*,*,*,議員,ギイン,ギイン
、      記号,読点,*,*,*,*,、,、,、
ジョン  名詞,固有名詞,人名,名,*,*,ジョン,ジョン,ジョン
・      記号,一般,*,*,*,*,・,・,・
マッケイン      名詞,一般,*,*,*,*,マッケイン,*,*
上院    名詞,固有名詞,組織,*,*,*,上院,ジョウイン,ジョーイン
議員    名詞,一般,*,*,*,*,議員,ギイン,ギイン
共著    名詞,一般,*,*,*,*,共著,キョウチョ,キョーチョ
による  助詞,格助詞,連語,*,*,*,による,ニヨル,ニヨル
議案    名詞,一般,*,*,*,*,議案,ギアン,ギアン
について        助詞,格助詞,連語,*,*,*,について,ニツイテ,ニツイテ
も      助詞,係助詞,*,*,*,*,も,モ,モ
検討    名詞,サ変接続,*,*,*,*,検討,ケントウ,ケントー
さ      動詞,自立,*,*,サ変・スル,未然レル接続,する,サ,サ
れる    動詞,接尾,*,*,一段,基本形,れる,レル,レル
。      記号,句点,*,*,*,*,。,。,。

The last two columns (reading and pronunciation) for ケネディー and マッケイン are "*", while エドワード and ジョン have that info.


mocobeta · 2023-07-22T01:27:55Z

Hi, this is expected behavior. "エドワード" and "ジョン" exist in the mecab-ipadic dictionary, but there are no entries for "ケネディー" and "マッケイン".

In terms of morphological analysis, those are "unknown" words: apart from an estimated POS tag, they carry no morphological information such as pronunciation.

annisat · 2023-07-24T05:51:52Z

I see. Thanks for the reply.

In the case of katakana, maybe the pronunciation can be inferred from the word form?
For example, replace every イ that follows an エ段 katakana with ー

mocobeta · 2023-07-24T08:26:07Z

> In the case of katakana, maybe the pronunciation can be inferred from the word form?
> For example, replace every イ that follows an エ段 katakana with ー

That would be up to the application, but a conversion that is consistent with the dictionary entries makes sense to me.
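
For anyone who wants such a fallback on the application side, here is a minimal sketch, not part of janome itself. The helper names `infer_pronunciation` and `pronunciation_of` are hypothetical. It assumes tokens expose `surface` and `phonetic` attributes, as janome's `Token` does, and that "*" marks a missing value as in the output above. The オ段 + ウ rule mirrors dictionary entries in that output (ジョウイン → ジョーイン, ケントウ → ケントー). The エ段 + イ rule is the one proposed in this thread and is off by default, since the dictionary keeps ベイコク unchanged.

```python
import re

# Katakana letters (ァ..ヺ) plus the long-vowel mark ー.
KATAKANA = re.compile(r"[\u30A1-\u30FA\u30FC]+")

O_ROW = "オコソトノホモヨロヲゴゾドボポョォ"  # オ段 katakana
E_ROW = "エケセテネヘメレゲゼデベペェ"        # エ段 katakana


def infer_pronunciation(surface, e_row_rule=False):
    """Guess a pronunciation for an all-katakana word; return None otherwise.

    This is a heuristic, not a dictionary lookup.
    """
    if not KATAKANA.fullmatch(surface):
        return None
    out = []
    for ch in surface:
        prev = out[-1] if out else ""
        if ch == "ウ" and prev and prev in O_ROW:
            out.append("ー")  # e.g. ジョウ -> ジョー
        elif e_row_rule and ch == "イ" and prev and prev in E_ROW:
            out.append("ー")  # optional rule proposed in this thread
        else:
            out.append(ch)
    return "".join(out)


def pronunciation_of(token, e_row_rule=False):
    """Return the dictionary pronunciation, or a guess for unknown katakana words."""
    if token.phonetic != "*":
        return token.phonetic
    guess = infer_pronunciation(token.surface, e_row_rule)
    return guess if guess is not None else "*"
```

With the tokenizer from the first comment, `[pronunciation_of(t) for t in toker.tokenize(stc)]` would then fill in ケネディー and マッケイン from their surface forms while leaving dictionary-backed tokens untouched.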
