romaji transliteration #308
AFAIK, the simplest solution would be to use the "pronunciation" field of the JSON output to get the katakana reading and then map that reading to romaji.
$ echo "ローマ字変換プログラム作ってみた。" | kagome -json | jq -r '.[].pronunciation'
ローマジ
ヘンカン
プログラム
ツクッ
テ
ミ
タ
。
The disadvantage of this method is that its accuracy depends on the quality of the dictionary: some words have no pronunciation field in the default dictionary. |
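The same extraction can be done in Go instead of jq. The following is a minimal sketch, assuming the JSON output is an array of objects carrying a "pronunciation" field (as the jq filter above implies) and a "surface" field (an assumption, used here as the fallback for words without a pronunciation). The sample input is hand-written, not captured from a real run.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// token mirrors only the fields this sketch needs from one array element.
// "pronunciation" is the field used in the jq command; the "surface" field
// name is an assumption.
type token struct {
	Surface       string `json:"surface"`
	Pronunciation string `json:"pronunciation"`
}

// pronunciations returns one reading per token, falling back to the surface
// form when the dictionary entry has no pronunciation.
func pronunciations(data []byte) ([]string, error) {
	var tokens []token
	if err := json.Unmarshal(data, &tokens); err != nil {
		return nil, err
	}
	out := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if t.Pronunciation != "" {
			out = append(out, t.Pronunciation)
		} else {
			out = append(out, t.Surface)
		}
	}
	return out, nil
}

func main() {
	// Hand-written sample input (hypothetical), including one token that
	// lacks a pronunciation.
	sample := []byte(`[
		{"surface": "ローマ字", "pronunciation": "ローマジ"},
		{"surface": "kagome"}
	]`)
	readings, err := pronunciations(sample)
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range readings {
		fmt.Println(r)
	}
}
```

Falling back to the surface form keeps unknown words visible in the output rather than silently dropping them.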
Here's a simple example using kana, a Go alternative to cutlet.
package main
import (
"fmt"
"strings"
"unicode"
"github.com/gojp/kana"
)
func main() {
input := `ローマジ
ヘンカン
プログラム
ツクッ
テ
ミ
タ
。
`
lines := strings.Split(input, "\n")
for _, line := range lines {
line = strings.TrimSpace(line)
yomi := strings.Map(func(r rune) rune {
if unicode.IsLetter(r) {
return r
}
return -1
}, kana.KanaToRomaji(line))
if yomi == "" {
continue
}
fmt.Println(yomi)
}
}
// Output:
// romaji
// henkan
// puroguramu
// tsuku
// te
// mi
// ta
|
I've been interested in this issue ever since it was first proposed, and have tried to develop patterns that can be applied in practice. This is my current solution. If this is OK with you, I would like to open a PR to add it to the "_examples" directory.
package main
import (
"fmt"
"log"
"strings"
"github.com/gojp/kana"
"github.com/ikawaha/kagome-dict/dict"
"github.com/ikawaha/kagome-dict/ipa"
"github.com/ikawaha/kagome/v2/tokenizer"
)
func main() {
input := `
ローマ字変換プログラム作ってみた。
五街道のひとつである、東海道五十三次の品川宿などを変換してみると面白いかもしれない。
`
usrDict := `
東海道五十三次,東海道 五十三 次,トウカイドウ ゴジュウサン ツギ,カスタム名詞
品川宿,品川 宿,シナガワ ジュク,カスタム名詞
`
// Convert user dictionary string to tokenizer.Option
usrDictOpt, err := newUserDictOpt(usrDict)
if err != nil {
log.Fatal(err)
}
// Create IPA-dict-based tokenizer with user dictionary
tkn, err := tokenizer.New(ipa.Dict(), usrDictOpt, tokenizer.OmitBosEos())
if err != nil {
log.Fatal(err)
}
// Split input text by line
lines := strings.Split(input, "\n")
for _, line := range lines {
if strings.TrimSpace(line) == "" {
continue // ignore empty lines
}
tokens := tkn.Tokenize(line)
chunks := []string{}
tmpChunk := ""
// Evaluate each token to retrieve the pronunciation or reading as a
// slice of chunks to join them later. It is similar to Wakachi, but
// with a bit more complex logic.
for _, token := range tokens {
if usrExtra := token.UserExtra(); usrExtra != nil {
tmpChunk = strings.Join(usrExtra.Readings, " ")
} else if p, ok := token.Pronunciation(); ok {
tmpChunk = p
} else if r, ok := token.Reading(); ok {
tmpChunk = r // fallback to reading if pronunciation is not available
} else {
tmpChunk = token.Surface
}
tmpChunk = strings.TrimSpace(tmpChunk)
//fmt.Println("Log:", tmpChunk, token.POS())
// Guard against a leading particle or symbol, which has no previous chunk.
if len(chunks) > 0 && isPartOfPrev(token) {
chunks[len(chunks)-1] += tmpChunk // Append to the previous chunk
} else {
chunks = append(chunks, tmpChunk) // Append to the slice of chunks
}
}
fmt.Println(kana.KanaToRomaji(strings.Join(chunks, " ")))
}
// Output:
// ro-maji henkan puroguramu tsukutte mita。
// go kaido- no hitotsudearu、 toukaidou gojuusan tsugi no shinagawa juku nado wo henkan shite miruto omoshiroi kamo shirenai。
}
// isPartOfPrev returns true if the token prefers to be part of the previous chunk.
//
// e.g. tsuku te mi ta。-> tsukutte mita。
func isPartOfPrev(token tokenizer.Token) bool {
// Not "助詞" "助動詞" or "記号"
if !strings.ContainsAny(token.POS()[0], "助"+"記") {
return false
}
switch token.POS()[1] {
case "副助詞", "連体化", "格助詞":
return false
default:
return true
}
}
// newUserDictOpt creates a tokenizer.Option from a user dictionary string.
func newUserDictOpt(rec string) (tokenizer.Option, error) {
usrDictRec, err := dict.NewUserDicRecords(strings.NewReader(rec))
if err != nil {
return nil, err
}
usrDict, err := usrDictRec.NewUserDict()
if err != nil {
return nil, err
}
return tokenizer.UserDict(usrDict), nil
} |
My ideal solution would be to add a switch to the main binary. |
You are probably looking for a feature such as:
$ echo "ローマ字変換プログラム作ってみた。" | kagome yomi
ローマジ ヘンカン プログラム ツクッテ ミタ。
$ echo "ローマ字変換プログラム作ってみた。" | kagome yomi -katakana
ローマジ ヘンカン プログラム ツクッテ ミタ。
$ echo "ローマ字変換プログラム作ってみた。" | kagome yomi -hiragana
ろーまじ へんかん ぷろぐらむ つくって みた。
$ echo "ローマ字変換プログラム作ってみた。" | kagome yomi -romaji
ro-maji henkan puroguramu tsukutte mita。
$ echo "五街道のひとつである、東海道五十三次の品川宿などを変換してみると面白いかもしれない。" > text.txt
$ kagome yomi -file text.txt -userdict mydict.txt -romaji
go kaido- no hitotsudearu、 toukaidou gojuusan tsugi no shinagawa juku nado wo henkan shite miruto omoshiroi kamo shirenai。
If so, I agree. I also use Kagome heavily in a TTS reader, and I think it would be useful for creating romaji subtitles as well.
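The hiragana variant of the proposed output can be derived from the katakana reading without any dictionary. A minimal sketch, relying on the fact that the Unicode katakana letters U+30A1..U+30F6 sit exactly 0x60 above their hiragana counterparts; the function name toHiragana is hypothetical and not part of kagome:

```go
package main

import (
	"fmt"
	"strings"
)

// toHiragana shifts katakana letters (U+30A1..U+30F6) down by 0x60 to the
// matching hiragana letters. Everything else, including the long-vowel mark
// "ー", spaces, and punctuation, is passed through unchanged.
func toHiragana(s string) string {
	return strings.Map(func(r rune) rune {
		if r >= 0x30A1 && r <= 0x30F6 {
			return r - 0x60
		}
		return r
	}, s)
}

func main() {
	fmt.Println(toHiragana("ローマジ ヘンカン プログラム ツクッテ ミタ。"))
}
```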
The main problem is that there are too many variations in the romanization of Japanese, such as Nihon-shiki, Kunrei-shiki (ISO 3602), Traditional Hepburn, Modified Hepburn, and JSL romanization. To support them, we would need to implement a custom user dictionary for romanization and define its format.
$ kagome yomi -file text.txt -userdict mydict.txt -romajidict my_hepburn.txt -romaji
go kaidō no hitotudearu、 toukaidō gojyūsan tugi no sinagawa juku nado o henkan site miruto omosiroi kamo sirenai。
At this point, it would be best to add the example to the "_examples" directory first. Next, open an issue requesting a "yomi" subcommand for Katakana/Hiragana readings. Then implement the "-romaji" option for the "yomi" subcommand. |
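To make the romanization-variant problem concrete, here is a toy sketch of a per-system lookup table. It is not the proposed dictionary format; the names System, table, and romanize are hypothetical. It covers only a handful of single katakana and ignores digraphs, the small "ッ", and long vowels.

```go
package main

import (
	"fmt"
	"strings"
)

// System selects a romanization scheme (hypothetical type for this sketch).
type System int

const (
	Hepburn System = iota
	Kunrei
)

// table holds the spelling of a few katakana under each system:
// index 0 is Hepburn, index 1 is Kunrei-shiki.
var table = map[rune][2]string{
	'シ': {"shi", "si"},
	'ツ': {"tsu", "tu"},
	'チ': {"chi", "ti"},
	'フ': {"fu", "hu"},
	'ジ': {"ji", "zi"},
	'ナ': {"na", "na"},
	'ガ': {"ga", "ga"},
	'ワ': {"wa", "wa"},
	'ギ': {"gi", "gi"},
}

// romanize converts each known katakana to its spelling in the chosen
// system and passes unknown runes through unchanged.
func romanize(katakana string, sys System) string {
	var b strings.Builder
	for _, r := range katakana {
		if spell, ok := table[r]; ok {
			b.WriteString(spell[sys])
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	for _, w := range []string{"シナガワ", "ツギ"} {
		fmt.Println(romanize(w, Hepburn), romanize(w, Kunrei))
	}
}
```

A user-supplied romanization dictionary would effectively replace the hardcoded table with data loaded at runtime.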
OK, no problem for me. I've managed to write a tool for lyrics transliteration in Go thanks to the examples posted here. Currently, though, I am getting better results with cutlet, which is also able to detect foreign words. |
Nice. This example helps to get a concrete picture. I think fine-tuning the details is the difficult part.
Indeed. "Cutlet" is a cool Python application. The "gojp/kana" package, on the other hand, has been inactive for more than five years, and I have to admit that it is inaccurate in some cases. But Kagome is positioned in the same way as MeCab: implementing transliteration itself is out of scope and not ideal. Thus, we need to search for an alternative package, such as:
Can you provide us with some examples? Something like:
testData := []struct {
input string
expect string
}{
{input: "こぼれたままの流星群 一秒 一秒", expect: "Koboreta mama no ryu-sei gun ichi byo- ichi byo-"},
{input: "流転 lights 消せないコナゴナ銀河。", expect: "Ruten LIGHTS kesenai konagona ginga."},
}
The more examples the better. By the way, the test above is the current output of the example that I'm working on.
_example/romaji_transliteration (WIP)
package main
import (
"fmt"
"log"
"strings"
"unicode"
"github.com/gojp/kana"
"github.com/ikawaha/kagome-dict/dict"
"github.com/ikawaha/kagome-dict/ipa"
"github.com/ikawaha/kagome/v2/tokenizer"
"golang.org/x/text/width"
)
func main() {
// User input text
input := `
ローマ字変換プログラム作ってみた。
五街道のひとつである、東海道五十三次の品川宿などを変換してみると面白いかもしれない。
こぼれたままの流星群 一秒 一秒
流転 lights 消せないコナゴナ銀河。
`
// Built-in user dictionary
usrDict := `
東海道五十三次,東海道 五十三 次,トウカイドウ ゴジュウサン ツギ,カスタム名詞
品川宿,品川 宿,シナガワ ジュク,カスタム名詞
`
// Convert user dictionary string to tokenizer.Option
usrDictOpt, err := newUserDictOpt(usrDict)
if err != nil {
log.Fatal(err)
}
// Create IPA-dict-based tokenizer with user dictionary
tkn, err := tokenizer.New(ipa.Dict(), usrDictOpt, tokenizer.OmitBosEos())
if err != nil {
log.Fatal(err)
}
// Split input text by line
lines := strings.Split(input, "\n")
for _, line := range lines {
// Get Yomi (pronunciation/reading) from the line in Katakana
yomi := getYomi(tkn, line)
if yomi == "" {
continue // ignore empty lines
}
// Transliterate to Romaji.
romaji := getRomantic(yomi)
fmt.Println(romaji)
}
//
// Output:
// ro-maji henkan puroguramu tsukutte mita.
// go kaido- no hitotsudearu, toukaidou gojuusan tsugi no shinagawa juku nado wo henkan shite miruto omoshiroi kamo shirenai.
}
var conversionMap = map[rune]rune{
'、': ',',
'。': '.',
'!': '!',
'?': '?',
'「': '"',
'」': '"',
'『': '"',
'』': '"',
}
// getRomantic returns the Romaji transliteration of the input in Katakana.
func getRomantic(line string) (yomi string) {
defer func() {
// Finally, remove extra spaces
if yomi != "" {
yomi = strings.Join(strings.Fields(yomi), " ")
}
}()
// In this example we use the github.com/gojp/kana package for Katakana
// transliteration to Romaji. However, other packages are available.
// Such as:
// - github.com/robpike/nihongo
// - github.com/kotaroooo0/gojaconv
// - github.com/yosida95/romaji
// - github.com/goark/krconv
romaji := kana.KanaToRomaji(line)
// Roughly normalize full-width chars to half-width ('、' -> ',', '。' -> '.', etc.)
romaji = convToHalfWidth(romaji)
// Capitalize the first letter of each sentence
sentences := strings.SplitAfter(romaji, ".")
for index, sentence := range sentences {
sentence = strings.TrimSpace(sentence)
isFirst := true
// Capitalize the first letter of each sentence
sentence = strings.Map(func(r rune) rune {
if isFirst {
isFirst = false
return unicode.ToUpper(r)
}
return r
}, sentence)
//sentences[index] = cases.Title(language.English).String(sentence)
sentences[index] = sentence
}
return strings.Join(sentences, " ")
}
// convToHalfWidth converts full-width alpha-numeric characters to half-width
// characters according to the conversionMap.
func convToHalfWidth(input string) string {
// Convert half-width katakana characters to full-width and full-width
// alphanumeric characters to half-width.
input = width.Fold.String(input)
return strings.Map(func(r rune) rune {
if unicode.Is(unicode.Han, r) {
return r
}
if converted, ok := conversionMap[r]; ok {
return converted
}
return r
}, input)
}
// getYomi returns the pronunciation/reading (Yomi) of the input in Katakana
// using the given tokenizer.
func getYomi(tkn *tokenizer.Tokenizer, line string) string {
line = strings.TrimSpace(line)
if line == "" {
return ""
}
if isASCII(line) {
return line
}
tokens := tkn.Tokenize(line)
chunks := []string{}
tmpChunk := ""
isPrevASCII := false
// Evaluate each token to retrieve the pronunciation or reading as a
// slice of chunks to join them later. It is similar to Wakachi, but
// with a bit more complex logic.
for _, token := range tokens {
prevKey := len(chunks) - 1
// Detect ASCII words
if isASCII(token.Surface) {
if isPrevASCII {
chunks[prevKey] += token.Surface
} else {
chunks = append(chunks, token.Surface)
isPrevASCII = true
}
continue
} else if isPrevASCII {
// Capitalize the previous chunk if it was all in ASCII
chunks[prevKey] = strings.ToUpper(chunks[prevKey])
}
isPrevASCII = false
// Retrieve the pronunciation/reading from the token in katakana
if usrExtra := token.UserExtra(); usrExtra != nil {
tmpChunk = strings.Join(usrExtra.Readings, " ")
} else if p, ok := token.Pronunciation(); ok {
tmpChunk = p
} else if r, ok := token.Reading(); ok {
tmpChunk = r // fallback to reading if pronunciation is not available
} else {
tmpChunk = token.Surface
}
tmpChunk = strings.TrimSpace(tmpChunk)
//fmt.Println("Log:", tmpChunk, token.POS())
// Guard against a leading particle or symbol, which has no previous chunk.
if prevKey >= 0 && isPartOfPrev(token) {
chunks[prevKey] += tmpChunk // Append to the previous chunk
} else {
chunks = append(chunks, tmpChunk) // Append to the slice of chunks
}
}
return strings.Join(chunks, " ")
}
// isASCII returns true if the string is all in ASCII.
func isASCII(s string) bool {
for i := 0; i < len(s); i++ {
if s[i] > unicode.MaxASCII {
return false
}
}
return true
}
// isPartOfPrev returns true if the token prefers to be part of the previous chunk.
//
// e.g. tsuku te mi ta。--> tsukutte mita。
func isPartOfPrev(token tokenizer.Token) bool {
// Not "助詞" "助動詞" nor "記号"
if !strings.ContainsAny(token.POS()[0], "助"+"記") {
return false
}
switch token.POS()[1] {
// Ignore below particles, conjunctions, and auxiliary verbs
case "副助詞", "連体化", "格助詞":
return false
// Else, consider as part of the previous chunk
default:
return true
}
}
// newUserDictOpt creates a tokenizer.Option from a user dictionary string.
func newUserDictOpt(rec string) (tokenizer.Option, error) {
// Read user dictionary records from the string.
usrDictRec, err := dict.NewUserDicRecords(strings.NewReader(rec))
if err != nil {
return nil, err
}
// Create a dict.UserDict from the records.
usrDict, err := usrDictRec.NewUserDict()
if err != nil {
return nil, err
}
// Cast the UserDict to tokenizer.Option.
return tokenizer.UserDict(usrDict), nil
} |
This seems like a good example to me; the symbol conversion map in particular is much needed. If you need more text examples, you can find a lot on this site: https://www.animelyrics.com/ Some examples: Another use case I have for this is converting filenames with Japanese characters, often coming from audio CDs, e.g. https://vgmdb.net/album/35606 For this use, the script should produce only basic ASCII strings (in the range 0-127). Edit: for easier reusability, the script should act as a Unix filter: read one line from stdin and write one line to stdout. |
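The filter shape requested here can be sketched with the standard library alone. In this sketch the conversion step is only a stand-in: asciiOnly replaces every rune outside 0-127 with an underscore, whereas a real tool would first run the kagome-based transliteration and use such a pass only as a final safety net. All function names are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
	"unicode"
)

// asciiOnly replaces every rune outside the ASCII range (0-127) with '_'.
// It is a placeholder for the actual romaji transliteration.
func asciiOnly(s string) string {
	return strings.Map(func(r rune) rune {
		if r > unicode.MaxASCII {
			return '_'
		}
		return r
	}, s)
}

// filter reads r line by line and writes conv(line) to w, one output line
// per input line.
func filter(r io.Reader, w io.Writer, conv func(string) string) error {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if _, err := fmt.Fprintln(w, conv(sc.Text())); err != nil {
			return err
		}
	}
	return sc.Err()
}

// filterString runs filter over an in-memory string, which is handy for
// trying the conversion without a pipe.
func filterString(in string) string {
	var out strings.Builder
	_ = filter(strings.NewReader(in), &out, asciiOnly)
	return out.String()
}

func main() {
	if err := filter(os.Stdin, os.Stdout, asciiOnly); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Passing the conversion as a function keeps the line-in, line-out plumbing independent of whichever transliteration package ends up being chosen.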
I am working on it, but it is becoming more and more like a real application (more complex) and is no longer suitable as a sample application. The examples should be as simple as possible to illustrate the basic use of "kagome" as a library, right? To close this issue, I am considering the following steps. What do you think?
|
Sure, no problem for me. For symbol conversion, I've found some libraries that can do it, so you could remove the hardcoded table and use them instead. From the cmdline:
uconv can also do kana->romaji transliteration, so it could be the library of choice to replace "gojp/kana":
|
Is it possible to use the cmdline tool for interactive romaji transliteration?
e.g.
(same as cutlet)