I've been trying to train voice models with the Colab notebook. I don't have 8 hours (!) of transcribed audio, but I do have 30 minutes split into multiple 20-second files, maybe 150 of them, so I am fine-tuning rather than training from scratch. Because I want a High model, it seems my only option is to fine-tune from the US Lessac model.
I'm watching TensorBoard, where loss_disc_all fluctuates wildly. With smoothing applied, it curves downward for the first hour, then goes roughly flat, and arguably starts rising again. My understanding is that the point where the curve goes flat is a good time to stop training, but it never looks flat unless smoothing is at maximum. Does this imply that I need to train for much longer?
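To make the "has it gone flat?" check less dependent on where the smoothing slider sits, the same idea can be sketched offline. This is only a minimal sketch: it assumes the loss_disc_all values have already been exported to a plain list of floats, it uses a debiased exponential moving average that is similar in spirit to TensorBoard's smoothing (not guaranteed to match it exactly), and the names `ema_smooth`, `relative_change`, `weight`, and `window` are hypothetical, chosen for illustration.

```python
def ema_smooth(values, weight=0.9):
    """Debiased exponential moving average of a sequence of floats.

    weight is in [0, 1): 0 means no smoothing, values near 1 mean heavy smoothing.
    """
    if not 0.0 <= weight < 1.0:
        raise ValueError("weight must be in [0, 1)")
    smoothed = []
    last = 0.0
    for i, value in enumerate(values, start=1):
        last = last * weight + (1.0 - weight) * value
        # Divide by (1 - weight**i) to remove the bias toward the 0.0 start value.
        smoothed.append(last / (1.0 - weight ** i))
    return smoothed


def relative_change(smoothed, window=100):
    """Relative change between the means of the last two windows of points.

    A value close to 0 suggests the smoothed curve is roughly flat;
    a positive value suggests it is rising again.
    """
    if len(smoothed) < 2 * window:
        raise ValueError("need at least 2 * window points")
    previous = sum(smoothed[-2 * window:-window]) / window
    current = sum(smoothed[-window:]) / window
    if previous == 0.0:
        raise ValueError("previous window mean is zero; relative change undefined")
    return (current - previous) / abs(previous)
```

Calling `relative_change(ema_smooth(losses, weight=0.99), window=200)` on the exported values gives a single number to compare between sessions. Whether a flat discriminator loss is actually the right stopping signal for this kind of model is a separate question that this sketch does not answer.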
I'm also checking the Audio tab and can hear the voice improving over time.
After a couple of hours I usually end the training, because Colab complains that I am using it too much and kicks me off soon after. I then export the model and test it. The exported model seems to sound worse than the samples in the Audio tab, with more vibrato effects.
The next day I go back, choose "Continue Training", and leave it for another couple of hours. Nothing much changes in the TensorBoard graph, which is still fluctuating wildly. The voice model I download is much the same, with a vibrato effect.
I'm training on good-quality audio from an audiobook, and the Whisper transcript looks pretty good to me.
So my main question is: how long should I train a voice for? I've done 3 "continues" for maybe 6 hours in total, and I still hear artifacts in the voice. Is 30 minutes of audio enough to fine-tune, or do I need to gather a lot more than that?
Also, if I want a High model for a British voice, do I have to fine-tune from Lessac, or is there some other option?
Basically: any tips for making the models? I absolutely love the latency of Piper compared to other TTS, but I also need some fidelity in the audio output, which I am struggling to get.