
Training ForwardTacotron on a dataset comprised of multiple male voices as a single speaker dataset? #59

Open
tomsabanov opened this issue Aug 2, 2021 · 10 comments

Comments

@tomsabanov

Hi,

I was wondering whether it would be possible to train on a dataset that has, let's say, 2-3 male voices, each with about 10 hours of data.

Will the end result of this be a good neutral male voice?

@tomsabanov tomsabanov changed the title Training ForwardTacotron on a dataset comprised of multiple (2-3) male voices? Training ForwardTacotron on a dataset comprised of multiple male voices as a single speaker dataset? Aug 2, 2021
@cschaefer26

Hi, the short answer is that the voice is going to be rubbish, as the model will average the speakers. I will probably implement a multispeaker version soon. The idea is to condition each voice on a speaker embedding, e.g. from https://github.com/resemble-ai/Resemblyzer, and provide a reference embedding at inference. I had some success with exactly that in a previous branch of this repo, but it's outdated already (the branch was done before pitch and energy conditioning were implemented).
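For illustration, extracting such a reference embedding with Resemblyzer could look roughly like this (file paths are placeholders, and averaging one speaker's utterances via embed_speaker is just one way to build the reference vector):

```python
# Sketch only: build a per-speaker reference embedding with Resemblyzer.
from pathlib import Path

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# Placeholder path: a folder containing wavs of a single speaker.
wav_paths = sorted(Path('data/speaker_1').glob('*.wav'))
wavs = [preprocess_wav(p) for p in wav_paths]

# Average over utterances to get a stable reference vector.
speaker_embedding = encoder.embed_speaker(wavs)
np.save('speaker_1_embedding.npy', speaker_embedding)
```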

@tomsabanov
Author

I've read a master's thesis on Finnish speech synthesis in which the author reported good results with a "warm start" method: he trained a base model on multiple voices totalling about 20 hours and then trained a single voice on top of that model.

Would this idea work with ForwardTacotron?

@cschaefer26

I still think this makes much more sense if you have the voice conditioning. Do the authors share their model architecture? I suspect they are using some speaker embedding.

@tomsabanov
Author

The author used Nvidia's implementation of Tacotron. They didn't change anything in the code.
The following is extracted from the paper.

"Using a warm-starting training schema yielded better results. First, a general
model was trained using all available data. The model had no information about
the speaker, even though the targets consisted of mel-spectrograms created from
multiple speakers’ voices. During training, the model had to generate utterances with
many different voices for the same input. This prohibited the model from converging
after a certain point. In the end, the general model produced very unnatural yet
understandable speech. When creating an utterance, the model seemed to randomly
"choose" a speaker from the training set and produce the rest of the utterance with
that voice. Even though the speech sounded unnatural, it always clearly resembled a
specific speaker’s voice from the training set.
The weights of the general model were then used to initialize weights for an actual
single speaker model. Experimenting with different ways of creating the initial model
showed that using data from speakers of the same gender gave better results than
having speech from both genders in the training data. In addition, letting the model
train until the training error started to plateau worked better than stopping the
training early."
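
Mechanically, the warm start just means initializing the single-speaker model from the general model's checkpoint instead of from scratch. A generic PyTorch sketch (the checkpoint key and the build_model helper are assumptions, not the repo's actual API):

```python
# Generic warm-start sketch, not ForwardTacotron's actual checkpoint format.
import torch

model = build_model()  # hypothetical helper: same architecture as the general model

# Assumes the general model's weights are stored under the 'model' key.
checkpoint = torch.load('general_model.pt', map_location='cpu')
model.load_state_dict(checkpoint['model'])

# ...then continue training as usual on the single-speaker dataset.
```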

@cschaefer26

cschaefer26 commented Aug 3, 2021

Ah, very interesting. Could well be tried with this repo then. If there is enough data for each speaker, it could work. Just try it out and throw everything in. Carefully watch the tacotron training to see if the attention score jumps above 0.5 between 3k-10k steps. If it's successful, you can wait until the alignments are extracted (after 40k tacotron training steps), then train your multispeaker forward tacotron until 50k steps or so, and then start messing with the data (replace it with single-speaker data).
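
As a rough sketch of that schedule using the repo's scripts (the preprocessing step and the paths are assumptions, check the README for the exact invocations):

```python
# Rough sketch of the suggested schedule; paths and the preprocess flag
# are assumptions -- check the repo's README for the exact commands.
import subprocess

# 1. Preprocess all speakers thrown together into one dataset.
subprocess.run(['python', 'preprocess.py', '--path', 'data/all_speakers'], check=True)

# 2. Train Tacotron and watch the attention score in the logs: it should
#    jump above 0.5 somewhere between ~3k and ~10k steps; alignments are
#    extracted after ~40k steps.
subprocess.run(['python', 'train_tacotron.py'], check=True)

# 3. Train the multispeaker ForwardTacotron to ~50k steps, then swap the
#    training data for the single-speaker subset and resume.
subprocess.run(['python', 'train_forward.py'], check=True)
```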

@tomsabanov
Author

I will report my findings.

Thank you for your help.

@cschaefer26

Good luck, lmk how it goes!

@m-toman

m-toman commented Aug 3, 2021

I haven't tried it, but I found that speaker selection isn't random; it's usually driven by similarity to sentences in the training data. Unfortunately it often overrides the speaker embedding in my case: pick a sentence from one speaker's training data and the embedding vector of another speaker, and you usually still get output in the first speaker's voice, even if you slightly modify the sentence. For very long sentences it sometimes switches mid-sentence.
I tried reinforcing the speaker ID at multiple positions in the network, but it didn't really help.

@tomsabanov
Author

I have another question regarding the fine-tuning of an existing model.
Do I have to save both resulting models from train_tacotron.py and train_forward.py and then load them when I want to fine-tune them in their respective scripts?

How would I go about this?

@cschaefer26

The tacotron is only used to extract phoneme durations from the dataset. Once you have preprocessed all voices together, you can simply use the latest forward model to fine-tune. You probably need to manually filter the data according to the speaker.
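
A hypothetical sketch of that filtering step, assuming the preprocessed dataset is a pickled list of (item_id, length) pairs and the speaker is encoded in the item id prefix (the actual data layout may differ):

```python
# Hypothetical filtering sketch -- adjust to the actual preprocessed data layout.
import pickle
from pathlib import Path

data_path = Path('data/train_dataset.pkl')  # placeholder path
dataset = pickle.loads(data_path.read_bytes())

# Keep only items whose id starts with the target speaker's prefix.
single_speaker = [(item_id, length) for item_id, length in dataset
                  if item_id.startswith('speaker1_')]

data_path.write_bytes(pickle.dumps(single_speaker))
print(f'Kept {len(single_speaker)} of {len(dataset)} items.')
```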
