
A couple of questions #37

Open
jmasterx opened this issue Nov 10, 2020 · 10 comments

@jmasterx

Hi!

I have tried the latest version and I am quite pleased with the results; there is some great progress happening on this repository!

I am using 7,000 samples of my own voice recorded at 48 kHz.

I am very happy with pronunciation.

I had a couple questions:
When I have many sentences together, it does not seem to take a pause and sounds like it is rushing through the sentences. Is this normal, and is there a workaround? My current one is to add '...' instead of '.'.

My other question is: are there plans for tokenizable pitch, to be able to do things like emphasize a specific word, or to give a particular word a specific tone (in the text input, not automatic)?

Thanks!

@cschaefer26

cschaefer26 commented Nov 11, 2020

Hi, glad you like it. How much data do you have?

  1. If you synth across multiple sentences then you would have to produce training data with multiple sentences as well (just concat them with the desired pause, for example). Another option would be to manually mess with the mels and phoneme durations of the data (e.g. add some silence to the mel specs and make sure the dot gets a lot of duration), because the tacotron-extracted durations are not really reliable in this regard. We simply synth each sentence individually and concatenate the wavs with some hard-coded pause (see the sketch after this list); that's probably the easiest and also quality-wise the best.

  2. You could actually do this already using the pitch_function in the colab: https://colab.research.google.com/github/as-ideas/ForwardTacotron/blob/master/notebooks/synthesize.ipynb. For example pitch_func = lambda x: torch.cat([x[:, :, :6] + 1.3, x[:, :, 6:]], dim=-1) to raise the pitch for the first 6 chars.
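
For option 1, here is a minimal sketch of the per-sentence synthesis plus concatenation; `synth_fn` and the pause length are placeholders for whatever synthesis pipeline and gap you actually use:

```python
import numpy as np

def synthesize_with_pauses(sentences, synth_fn, sample_rate=22050, pause_sec=0.4):
    """Synthesize each sentence separately and join the wavs with silence.

    `synth_fn` is a placeholder for whatever call produces a 1-D float
    waveform for a single sentence (ForwardTacotron + vocoder, for example).
    """
    pause = np.zeros(int(sample_rate * pause_sec), dtype=np.float32)
    pieces = []
    for i, sentence in enumerate(sentences):
        pieces.append(synth_fn(sentence))
        if i < len(sentences) - 1:
            pieces.append(pause)  # hard-coded gap between sentences
    return np.concatenate(pieces)
```

You can also vary `pause_sec` per punctuation mark if you want a longer break after a full stop than after a comma.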

@jmasterx

jmasterx commented Nov 11, 2020

Thanks for the info, I will check that out!

I have around 7,000 samples of my voice (11 hours 45 minutes) and I trained at 48 kHz.

It is my own dataset: about 4,500 sentences from the LJSpeech corpus, 500 from Alice in Wonderland, and 2,000 questions from Wikipedia.

Here are some examples using WaveRNN
https://vocaroo.com/1ox9ak6O3dHd

One thing that I find strange is why my sentences go flat toward the end:
https://vocaroo.com/16ifuiJhlKVB
the 'that has never gone out of style' part loses all inflection. I do not understand why...

Most sentences suffer from this.
And if I do it twice:
https://vocaroo.com/1e2atNcuDSIN
I start to sound more and more sad.
It is quite evident here https://vocaroo.com/1esyrbcG2B8L
I'm not sure why this happens.

@cschaefer26

cschaefer26 commented Nov 12, 2020

Nice. Sounds quite good already, but IMO the WaveRNN could still improve a bit (the gnarling/hissing) - how many steps is this for the vocoder and the TTS model? The hissing could also come from not-so-great durations if the tacotron attention is off.

I've seen some problems with ending pitch for some datasets, mainly male. Did you look at the pitch loss? Maybe it's overfitting. Also, it could be a problem of trailing durations being a bit off; maybe trimming some silence would help with this (I just fixed the missing trimming functions in master preprocessing). If that doesn't help, you could try to mess around with the pitch loss function and scale it up towards the end of the sequence (e.g. multiply the loss by an increasing factor); we tried this already and it seemed to help with ending pitch.
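
To be concrete, that loss scaling could look roughly like this, assuming an elementwise L1 pitch loss over [batch, time] tensors; the linear ramp and its maximum weight are illustrative choices, not what is actually in the repo:

```python
import torch
import torch.nn.functional as F

def end_weighted_pitch_loss(pitch_pred, pitch_target, max_weight=3.0):
    """L1 pitch loss where positions near the end of the sequence count more.

    pitch_pred / pitch_target: [batch, time]. The weights ramp linearly
    from 1.0 at the first position to `max_weight` at the last.
    """
    _, t = pitch_target.shape
    weights = torch.linspace(1.0, max_weight, steps=t, device=pitch_target.device)
    per_step = F.l1_loss(pitch_pred, pitch_target, reduction='none')
    return (per_step * weights.unsqueeze(0)).mean()
```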

@jmasterx

Hi

I retrained using the latest repo and it was a bit better, but it still wound up getting the end pitches wrong by the end of training. It starts out all right but eventually, I guess, overfits or something.

However, I did try something interesting.

I modified the scripts a bit so that the pitches for each phoneme came from the LJSpeech model, and got great results like this! I think it could be interesting to have the option to use different models for duration and pitch predictions!

This is using my pitch conditioning:
https://vocaroo.com/19bny42DEx2d

Notice the endings become very monotone.

Now here is the same thing but I fed in the pitches from LJSpeech

https://vocaroo.com/19ltwQ1gBJOJ

To me it sounds much better!

I think this could have some interesting applications. I think it could allow high-quality voices with potentially less forced-alignment data!

In any case, I think adding the option to use a different model for duration and/or pitch prediction could be interesting!
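
What I did is roughly the following; `phonemize`, `predict_pitch`, `generate`, and the `pitch_function` keyword are placeholders for whatever the repo and the synthesize notebook actually expose, so treat this as a sketch of the idea rather than code against the current API:

```python
import torch

def synth_with_external_pitch(text, voice_model, pitch_model, phonemize, vocoder):
    """Condition one voice's synthesis on another model's pitch prediction.

    All callables are placeholders: `phonemize` maps text to the model input,
    `pitch_model.predict_pitch` returns a per-phoneme pitch contour, and
    `voice_model.generate` accepts a pitch_function hook (as in the notebook)
    that here simply swaps in the external pitch.
    """
    x = phonemize(text)
    with torch.no_grad():
        external_pitch = pitch_model.predict_pitch(x)   # e.g. the LJSpeech model
        mel = voice_model.generate(x, pitch_function=lambda _: external_pitch)
    return vocoder(mel)
```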

@cschaefer26

Hi, very cool. This is something on my list; I will also try to train multi-speaker models, which I hope will improve the pitch prediction. I am pretty sure that some transfer learning will benefit the pitch prediction. So far it seems to me that the pitches of male speakers are harder for the models to pick up; maybe the pitch is harder to extract in the first place (to me the female mel specs are much clearer than the male ones).

@jmasterx

jmasterx commented Jan 9, 2021

Hi

One thing you mentioned was adding silence to the mel spectrogram. I thought I could add silence by playing with the duration of spaces, but it turns out most words don't actually contain 'silence' phonemes.

However, if I insert something like '...' between words, it completely messes up / changes the spectrogram.

Is there a token I can insert that adds in silence without altering the mels in any other way than to add silence? If not, would you be able to point me to the part of the code where I could inject my own silence after a phoneme?

Thanks!

I'm working on a little program that will allow me to insert pauses and alter the length and pitch of words/phonemes through a user interface, and this would be very helpful!
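
The approach I have in mind is roughly the following, assuming the mel is an [n_mels, n_frames] array and the durations are per-phoneme frame counts; the -11.5 silence value is only a guess at the log-mel floor, so use whatever the preprocessing actually pads silence with:

```python
import numpy as np

def insert_pause_after_phoneme(mel, durations, phoneme_idx, pause_frames, mel_floor=-11.5):
    """Splice `pause_frames` of near-silence into the mel right after one phoneme.

    mel: [n_mels, n_frames] spectrogram; durations: per-phoneme frame counts.
    The cut point is the frame where phoneme `phoneme_idx` ends.
    """
    cut = int(np.sum(durations[:phoneme_idx + 1]))
    silence = np.full((mel.shape[0], pause_frames), mel_floor, dtype=mel.dtype)
    return np.concatenate([mel[:, :cut], silence, mel[:, cut:]], axis=1)
```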

I was also wondering if you knew the meaning of the duration values. That is, is there a way to convert those values to milliseconds? For example, if I want a word to last exactly 2 seconds and I know what a duration of 1.0 corresponds to in ms, I can easily figure out what constant to multiply that word's durations by so it lasts the length of time I want.
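
For reference, my working assumption is that the duration values are per-phoneme mel-frame counts, so one duration unit spans hop_length samples; if that is right, the conversion is just the following (the hop_length and sample_rate defaults below are common values, not necessarily the ones in my hparams):

```python
def frames_to_ms(n_frames, hop_length=256, sample_rate=22050):
    """Milliseconds covered by `n_frames` mel frames (one frame = hop_length samples)."""
    return n_frames * hop_length / sample_rate * 1000.0

def duration_scale_for_target(current_frames, target_ms, hop_length=256, sample_rate=22050):
    """Constant to multiply a word's durations by so it lasts roughly `target_ms`."""
    return target_ms / frames_to_ms(current_frames, hop_length, sample_rate)
```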

Same question for pitch; is it possible to target a specific fundamental frequency for a given phoneme? (which would require knowing the base fundamental frequency generated by the network)

Update: I managed to be able to align the phonemes to a grid: https://vocaroo.com/1d2EZ8aXR8AF

@joseluismoreira

joseluismoreira commented Feb 15, 2021

> Thanks for the info, I will check that out!
>
> I have around 7,000 samples of my voice (11 hours 45 minutes) and I trained at 48 kHz.
>
> It is my own dataset: about 4,500 sentences from the LJSpeech corpus, 500 from Alice in Wonderland, and 2,000 questions from Wikipedia.

Hey @jmasterx, amazing work here. Thanks for the insights; your results look great. I am wondering how your dataset was collected? I am a beginner in the TTS area, so I'm looking for some best practices. Could you describe it, please? How many samples do you think are enough? For many TTS datasets, except LJSpeech, I couldn't find so many hours from the same speaker. I directed the question at @jmasterx, but please, anyone feel free to contribute. Thanks!

@jmasterx

@joseluismoreira Hi

My dataset was collected by me speaking into a Rode NT1 microphone.

I used a tool that I wrote to make it easier to record the samples.
You can find it in this repo, https://github.com/jmasterx/TextToSpeech/tree/main/TextToSpeechTools/Metadata, along with the JoshCustom CSV, which is my own metadata file for this corpus.

The data was recorded at 16-bit, 96 kHz, then downsampled to 48 kHz.
When I trained this iteration, the samples were peak-normalized; however, I am now getting better results by not doing this.

However, the model you hear here is very noisy.

The new one I am training uses the same samples, which I have processed as follows (a rough sketch of the last two steps is below the list):

Noise suppression
3 dB compression
Normalize to -16 LUFS
EQ out all frequencies below 70 Hz.
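
For reference, the last two steps can be done in Python roughly like this; scipy and pyloudnorm are just the libraries I'd reach for, not anything this repo uses, and the noise suppression and compression happen in an external editor:

```python
import soundfile as sf
import pyloudnorm as pyln
from scipy.signal import butter, sosfilt

def highpass_and_normalize(in_path, out_path, cutoff_hz=70, target_lufs=-16.0):
    """High-pass the recording at ~70 Hz, then normalize to the target integrated loudness."""
    data, rate = sf.read(in_path)
    # 4th-order Butterworth high-pass to remove rumble below the cutoff
    sos = butter(4, cutoff_hz, btype='highpass', fs=rate, output='sos')
    data = sosfilt(sos, data)
    # Measure integrated loudness (LUFS) and scale to the target
    meter = pyln.Meter(rate)
    loudness = meter.integrated_loudness(data)
    data = pyln.normalize.loudness(data, loudness, target_lufs)
    sf.write(out_path, data, rate)
```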

I have attached my hparams for the new way I'm training, which addresses hop size, max frequency of the spectrograms, etc., for 48 kHz.

hparams.zip

@joseluismoreira

@jmasterx Thank you very much for the detailed answer. It will be very useful for me :)

@lukacupic

@jmasterx Have you been able to insert pauses into the text? If so, could you please point me in some direction?
