
Training ForwardTacotron on a dataset comprised of multiple male voices as a single speaker dataset? #59

Open
tomsabanov opened this issue Aug 2, 2021 · 10 comments

Comments

@tomsabanov

Hi,

I was wondering whether it would be possible to train on a dataset that has, let's say, 2-3 male voices, each with about 10 hours of data.

Will the end result of this be a good neutral male voice?

@tomsabanov tomsabanov changed the title Training ForwardTacotron on a dataset comprised of multiple (2-3) male voices? Training ForwardTacotron on a dataset comprised of multiple male voices as a single speaker dataset? Aug 2, 2021
@cschaefer26

Hi, the short answer is that the voice is going to be rubbish, as the model will average the speakers. I will probably implement a multispeaker version soon. The idea is to condition each voice on a speaker embedding, e.g. from https://github.com/resemble-ai/Resemblyzer, and provide a reference embedding at inference. I had some success with exactly that in a previous branch of this repo, but it's outdated already (the branch was done before pitch and energy conditioning were implemented).
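For illustration, extracting such a reference embedding with Resemblyzer could look roughly like this (file paths are placeholders, and averaging one speaker's utterances via embed_speaker is just one way to build the reference vector):

```python
# Sketch only: build a per-speaker reference embedding with Resemblyzer.
from pathlib import Path

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# Placeholder path: a folder containing wavs of a single speaker.
wav_paths = sorted(Path('data/speaker_1').glob('*.wav'))
wavs = [preprocess_wav(p) for p in wav_paths]

# Average over utterances to get a stable reference vector.
speaker_embedding = encoder.embed_speaker(wavs)
np.save('speaker_1_embedding.npy', speaker_embedding)
```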

@tomsabanov
Author

I've read a master's thesis on Finnish speech synthesis in which the author reported good results with a "warm start" method: he trained a base model on multiple voices totalling about 20 hours and then trained a single voice on top of that model.

Would this idea work with ForwardTacotron?

@cschaefer26

I still think this makes much more sense if you have the voice conditioning. Do the authors share their model architecture? I suspect they are using some speaker embedding.

@tomsabanov
Author

The author used Nvidia's implementation of Tacotron. They didn't change anything in the code.
The following is extracted from the paper.

"Using a warm-starting training schema yielded better results. First, a general
model was trained using all available data. The model had no information about
the speaker, even though the targets consisted of mel-spectrograms created from
multiple speakers’ voices. During training, the model had to generate utterances with
many different voices for the same input. This prohibited the model from converging
after a certain point. In the end, the general model produced very unnatural yet
understandable speech. When creating an utterance, the model seemed to randomly
"choose" a speaker from the training set and produce the rest of the utterance with
that voice. Even though the speech sounded unnatural, it always clearly resembled a
specific speaker’s voice from the training set.
The weights of the general model were then used to initialize weights for an actual
single speaker model. Experimenting with different ways of creating the initial model
showed that using data from speakers of the same gender gave better results than
having speech from both genders in the training data. In addition, letting the model
train until the training error started to plateau worked better than stopping the
training early."
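
Mechanically, the warm start just means initializing the single-speaker model from the general model's checkpoint instead of from scratch. A generic PyTorch sketch (the checkpoint key and the build_model helper are assumptions, not the repo's actual API):

```python
# Generic warm-start sketch, not ForwardTacotron's actual checkpoint format.
import torch

model = build_model()  # hypothetical helper: same architecture as the general model

# Assumes the general model's weights are stored under the 'model' key.
checkpoint = torch.load('general_model.pt', map_location='cpu')
model.load_state_dict(checkpoint['model'])

# ...then continue training as usual on the single-speaker dataset.
```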

@cschaefer26

cschaefer26 commented Aug 3, 2021

Ah, very interesting. Could well be tried with this repo then. If there is enough data for each speaker, it could work. Just try it out and throw everything in. Carefully watch the tacotron training to see if the attention score jumps above 0.5 between 3k-10k steps. If it's successful, you can wait until the alignments are extracted (after 40k tacotron training steps), then train your multispeaker forward tacotron until 50k steps or so, and then start messing with the data (replace it with single-speaker data).
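
As a rough sketch of that schedule using the repo's scripts (the preprocessing step and the paths are assumptions, check the README for the exact invocations):

```python
# Rough sketch of the suggested schedule; paths and the preprocess flag
# are assumptions -- check the repo's README for the exact commands.
import subprocess

# 1. Preprocess all speakers thrown together into one dataset.
subprocess.run(['python', 'preprocess.py', '--path', 'data/all_speakers'], check=True)

# 2. Train Tacotron and watch the attention score in the logs: it should
#    jump above 0.5 somewhere between ~3k and ~10k steps; alignments are
#    extracted after ~40k steps.
subprocess.run(['python', 'train_tacotron.py'], check=True)

# 3. Train the multispeaker ForwardTacotron to ~50k steps, then swap the
#    training data for the single-speaker subset and resume.
subprocess.run(['python', 'train_forward.py'], check=True)
```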

@tomsabanov
Author

I will report my findings.

Thank you for your help.

@cschaefer26

Good luck, lmk how it goes!

@m-toman

m-toman commented Aug 3, 2021

I haven't tried it, but I found that speaker selection isn't random; it's usually driven by similarity to sentences in the training data. Unfortunately it often overrides the speaker embedding in my case: pick a sentence from one speaker's training data and the embedding vector of another speaker, and you usually still get output in the first speaker's voice, even if you slightly modify the sentence. For very long sentences it sometimes switches mid-sentence.
I tried reinforcing the speaker ID at multiple positions in the network, but it didn't really help.

@tomsabanov
Author

I have another question regarding the fine-tuning of an existing model.
Do I have to save both resulting models from train_tacotron.py and train_forward.py and then load them when I want to fine-tune them in their respective scripts?

How would I go about this?

@cschaefer26

The tacotron is only used to extract phoneme durations from the dataset. Once you have preprocessed all voices together, you can simply use the latest forward model to fine-tune. You probably need to manually filter the data according to the speaker.
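
A hypothetical sketch of that filtering step, assuming the preprocessed dataset is a pickled list of (item_id, length) pairs and the speaker is encoded in the item id prefix (the actual data layout may differ):

```python
# Hypothetical filtering sketch -- adjust to the actual preprocessed data layout.
import pickle
from pathlib import Path

data_path = Path('data/train_dataset.pkl')  # placeholder path
dataset = pickle.loads(data_path.read_bytes())

# Keep only items whose id starts with the target speaker's prefix.
single_speaker = [(item_id, length) for item_id, length in dataset
                  if item_id.startswith('speaker1_')]

data_path.write_bytes(pickle.dumps(single_speaker))
print(f'Kept {len(single_speaker)} of {len(dataset)} items.')
```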
