
Adding new word in the vocabulary #79

Open
ckobus opened this issue Oct 22, 2019 · 25 comments
Comments

@ckobus

ckobus commented Oct 22, 2019

Hi,

I would like to use the pretrained acoustic model for English but use it in combination with a new in-domain language model, for which I have to generate pronunciations.

I am used to the Kaldi toolkit and the CMU dictionary, which uses the ARPABET alphabet. I saw in your repo the script to convert the CMU dictionary to the IPA format, but when I look at the phones.txt file associated with the acoustic model, I do not recognize the IPA format. For example, which phoneme in the ARPABET alphabet does tS correspond to?

I hope my question is clear enough.

Thank you for your answer!

CK

@joazoa

joazoa commented Oct 22, 2019

Hello,

I'm quite new to Kaldi and not qualified to answer, but does this help?
https://github.com/kaldi-asr/kaldi/blob/1ff668adbec7987a8b9f91932a786ad8c4b75d86/src/doc/data_prep.dox (search for words.txt)
and
https://white.ucc.asn.au/Kaldi-Notes/fst-example/
and
https://kaldi-asr.org/doc/graph_recipe_test.html

I hope it helps at least with the format question.

@ckobus
Author

ckobus commented Oct 22, 2019

Thanks for the answer, but what I want to do is add a set of new in-domain words to the vocabulary (I do not want them to be treated as OOV). To do that, I need to generate pronunciations for them, coming either from the CMU dictionary or from a G2P system.
The problem is that the CMU dict uses one set of phonemes (the ARPABET alphabet, I think), while in the Zamia Speech model the set of phonemes in the phones.txt file is different (following IPA?). I would like to know how to easily map the new pronunciations onto the set of phoneme labels handled by the pre-trained acoustic model.
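(For illustration, a partial ARPABET-to-IPA mapping built from the standard tables might look like the sketch below. The helper is hypothetical and only covers a handful of consonants; it is not guaranteed to match the Zamia phoneme set exactly. Note that the tS in phones.txt is the X-SAMPA spelling of IPA tʃ, i.e. ARPABET CH.)

```python
# Partial ARPABET -> IPA mapping for illustration (consonants only;
# the full tables also cover vowels and stress-marked variants).
ARPABET_TO_IPA = {
    "CH": "tʃ",   # as in "church" -- written tS in X-SAMPA
    "JH": "dʒ",   # as in "judge"
    "SH": "ʃ",    # as in "she"
    "ZH": "ʒ",    # as in "measure"
    "TH": "θ",    # as in "thin"
    "DH": "ð",    # as in "this"
    "NG": "ŋ",    # as in "sing"
    "HH": "h",    # as in "house"
}

def arpabet_to_ipa(phones):
    """Map a list of ARPABET phones to an IPA string;
    unknown phones pass through lowercased."""
    return "".join(ARPABET_TO_IPA.get(p, p.lower()) for p in phones)
```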

@ckobus
Author

ckobus commented Oct 23, 2019

Any hint?

@joazoa

joazoa commented Oct 23, 2019 via email

@svenha
Contributor

svenha commented Oct 23, 2019

@joazoa If the new word is not in the language model, you have to extend the language model too. An approach is provided by this repo: https://github.com/gooofy/kaldi-adapt-lm

@joazoa

joazoa commented Oct 23, 2019 via email

@ckobus
Author

ckobus commented Oct 23, 2019

Yes, but with kaldi-adapt-lm it seems you are restricted to the words the model can already recognise (i.e. words that are part of the lexicon), cf. "we also want to limit our language model to the vocabulary the audio model supports, so let's extract the vocabulary next".
In my case, I want to use an in-domain language model with a lot of new words that are OOV for the current model.
My question is: how do I generate pronunciations that are compliant with the phoneme set of the model? So far, with Kaldi, I have worked with pronunciations in ARPABET symbols, which do not match the ones in the English model. Has anyone already tried to do this?

@gooofy
Owner

gooofy commented Oct 28, 2019

You can use the speech_lex_edit.py script to add new words to the dictionary. The original dict uses IPA phoneme symbols; for the Kaldi models those get converted to X-SAMPA, AFAIR. You can find translation tables as well as mapping helper functions here:

https://github.com/gooofy/py-nltools/blob/master/nltools/phonetics.py
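(The translation in phonetics.py is table-driven. A minimal self-contained sketch of the idea, with a tiny illustrative table rather than the full one from the repo, and a hypothetical helper name:)

```python
# Tiny illustrative IPA -> X-SAMPA table; the real tables in
# nltools/phonetics.py are much larger.
IPA_TO_XSAMPA = {
    "tʃ": "tS",
    "dʒ": "dZ",
    "ʃ": "S",
    "θ": "T",
    "ŋ": "N",
    "ɪ": "I",
    "ː": ":",
}

def ipa_to_xsampa(ipa: str) -> str:
    """Greedy longest-match translation of an IPA string using the table above."""
    out, i = [], 0
    keys = sorted(IPA_TO_XSAMPA, key=len, reverse=True)  # try multi-char symbols first
    while i < len(ipa):
        for k in keys:
            if ipa.startswith(k, i):
                out.append(IPA_TO_XSAMPA[k])
                i += len(k)
                break
        else:
            out.append(ipa[i])  # pass unknown characters through unchanged
            i += 1
    return "".join(out)
```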

@besimali

Did you manage to do this? @ckobus

@ckobus
Author

ckobus commented Dec 4, 2019

Sorry, I just noticed your message.
Yes, I finally succeeded: I had to adapt the script to convert pronunciations from the ARPABET alphabet to IPA, and then I adapted the Kaldi script prepare_lang.sh to create a new L.fst.
In the end, the engine works quite well on my domain. Thanks for the quality of the acoustic models!!

@ammyt

ammyt commented May 13, 2020

Hi @ckobus, which scripts did you use, and from where, after converting to IPA? Can you please clarify?

@fquirin

fquirin commented Jun 30, 2021

@gooofy , @ckobus , @ammyt
I'm pretty confused about the phoneme set as well right now. When I have an IPA result, do I use SAMPA, X-SAMPA, Conlang X-SAMPA (the " doesn't really exist in lexicon.txt), X-ARPABET, or some variation of these? 😅
Did anyone figure this out?

@abdullah-tayeh

Hi @fquirin, there is a script in the package that does the conversion automatically (at least for German).
I think it was speech_lex_edit.
You basically run speech_lex_edit, type the word in German, and it does the conversion automatically for you.

@fquirin

fquirin commented Jun 30, 2021

Hi @abdullah-tayeh , thanks for the note :-)
I followed the breadcrumbs and I think they lead to ipa2xsampa, but looking at the translation table it differs in at least one point from the official X-SAMPA standard: it uses a different apostrophe for "primary stress", ' instead of ". I wonder what else is different 🤔

@gooofy
Owner

gooofy commented Jul 1, 2021

@fquirin, please check out the tables in https://github.com/gooofy/py-nltools/blob/master/nltools/phonetics.py which should contain all the phonemes used in zamia-speech

@fquirin

fquirin commented Jul 1, 2021

Hey @gooofy, yes, that's where I found ipa2xsampa, but when I compared it to the Gruut-IPA SAMPA conversion I realized it's using the wrong apostrophe for "primary stress". So far this is the only difference I've found, but I didn't check all the phonemes.

I'm building a new version of kaldi-adapt-lm and wanted to add an espeak-to-zamia feature (espeak IPA) for new lexicon entries 🙂 . Btw, the 2019 Zamia Kaldi models still rock 😎 👍

@gooofy
Owner

gooofy commented Jul 2, 2021

AFAIR I decided against the concept of "primary stress" vs "secondary stress" when designing the Zamia phoneme set; instead I went with general "stress" marks which can appear multiple times within one word. The main reason was dealing with German compound words, but also practicality: Zamia's phoneme set is geared towards dealing with TTS results, which can contain arbitrary numbers of stress marks depending on the tool used. In fact, I don't recall any TTS engine distinguishing primary and secondary stress.

@fquirin

fquirin commented Jul 2, 2021

Thanks for the explanation @gooofy! I tried to search for info about "AFAIR" before but couldn't find anything ^^.
I can't say that I fully understand how to work with "primary stress" and "secondary stress", but according to your explanation I should be safe if I convert IPA to X-SAMPA and then replace the apostrophe? Or, maybe even better, use the normalization given in the file?

IPA_normalization = {
        u':' : u'ː',
        u'?' : u'ʔ',
        u'ɾ' : u'ʁ',
        u'ɡ' : u'g',
        u'ŋ' : u'ɳ',
        u' ' : None,
        u'(' : None,
        u')' : None,
        u'\u02c8' : u'\'',
        u'\u032f' : None,
        u'\u0329' : None,
        u'\u02cc' : None,
        u'\u200d' : None,
        u'\u0279' : None,
        u'\u0361' : None,
}
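(Applying a table like this is straightforward; here is a sketch using only a subset of the entries quoted above, with a hypothetical helper name, where None means "drop the character":)

```python
# Subset of the IPA_normalization table quoted above; None means the
# character is simply dropped during normalization.
NORM = {
    ":": "ː",
    "?": "ʔ",
    "(": None,
    ")": None,
    "\u02c8": "'",   # IPA primary stress mark -> plain apostrophe
    "\u02cc": None,  # IPA secondary stress mark is dropped
}

def normalize_ipa(ipa: str) -> str:
    """Replace or drop characters according to NORM; everything else passes through."""
    out = []
    for ch in ipa:
        repl = NORM.get(ch, ch)  # default: keep the character unchanged
        if repl is not None:
            out.append(repl)
    return "".join(out)
```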

@gooofy
Owner

gooofy commented Jul 2, 2021

From my experience, converting from IPA can always be difficult, depending on the source. That IPA normalization table grew while I was extracting IPA from Wiktionary and is certainly by no means complete (or correct, for that matter).

@fquirin

fquirin commented Jul 2, 2021

OK, weird; shouldn't there be a clear set of characters and conversion rules for IPA to X-SAMPA? 😕
I was planning on using espeak-ng IPA (espeak-ng -v de -x -q --sep=" " --ipa "test") as the main source 🤔
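(For reference, that espeak-ng invocation wrapped in Python might look like the sketch below. The helper name is hypothetical; it requires espeak-ng on the PATH and returns None when it is missing.)

```python
import shutil
import subprocess
from typing import Optional

def espeak_ipa(word: str, voice: str = "de") -> Optional[str]:
    """Ask espeak-ng for an IPA transcription of a word.

    Returns None if espeak-ng is not installed."""
    if shutil.which("espeak-ng") is None:
        return None
    # Mirrors: espeak-ng -v de -x -q --sep=" " --ipa "test"
    cmd = ["espeak-ng", "-v", voice, "-x", "-q", "--sep= ", "--ipa", word]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```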

@fquirin

fquirin commented Jul 2, 2021

To be honest, I don't entirely understand this IPA normalization table 🤔 . For example, these characters:

u'ɾ' : u'ʁ',
...
u'ŋ' : u'ɳ',

All four of them exist in the IPA table and have different purposes. Why would you convert one to another?

[EDIT]
And I think u'\u0279' : None, should actually be u'\u0279' : u'r', 🤔

@gooofy
Owner

gooofy commented Jul 2, 2021

I am by no means an expert here, maybe you should discuss these questions with someone more proficient in the field of (computer-)linguistics.

That said, here is my take: IPA is typically written by humans, for humans, to convey some idea of how a written word could be pronounced. I came across dozens of Wiktionary IPA entries that looked very sensible to me until I fed them into a TTS system and listened to what that system produced out of them. IPA defines a huge number of phonemes and lots of additional symbols; all of that helps convey pronunciations to humans and support lots of different languages.

Designing a phoneme set for machines to produce mathematical models of human speech is a very different affair: typically you want a small set of phonemes, especially when you start with a relatively small set of samples. The larger your phoneme set, the more phonemes will have very few samples (or none at all) that they occur in, causing instabilities in your model.

But even if you have a large sample base, there is still the question of what good additional phonemes will do to your model: will those additional phonemes really improve recognition performance or the quality of the speech produced? At some point you will also face the question of which phonemes actually exist in nature and which of them you want to model; after all, speech is a natural, analog-world phenomenon which you model using discrete phonemes. In fact, even among linguists these questions seem debatable:

https://en.wikipedia.org/wiki/Phoneme#The_non-uniqueness_of_phonemic_solutions

One of my favorite examples in the German language is r vs ʀ vs ʁ: which one is used differs by region/dialect. So in this case it comes down to the question of whether you want to model dialects in your pronunciation dictionary. In Zamia, I definitely decided against that, but of course other designers may decide otherwise for their phoneme set.

@fquirin

fquirin commented Jul 2, 2021

Thanks again for the background info. I see now it's not a trivial problem to solve 😁 .

So, back to the drawing board: what's actually the best way to generate new words for the Zamia lexicon.txt files? 🤷‍♂️
Is there a chance to use espeak (IPA or "normal") output and get the correct set of supported phonemes? Or do we need to use the G2P models? Or do we need to implement a manual procedure (generate automatically, check whether the phonemes are OK, adapt by hand)?

NOTE: The reason I would like to use espeak is that I can create the phoneme set by actually listening to it (looking at the original speech_lex_edit.py file, I think you had the same intention).

@gooofy
Owner

gooofy commented Jul 5, 2021

In my experience, if you want high-quality lexicon entries there is no way around checking them manually. In general I would use speech_lex_edit to add new entries to the dictionary (either directly or through speech_editor while reviewing samples). Inside that tool you have options to generate pronunciations via espeak, Mary TTS and Sequitur G2P. Usually I would listen to all three and pick the best one, sometimes with manual improvements (like fixing stress marks etc.).
