Multivoice finetune test. What can I do better? #189
StoryHack started this conversation in Show and tell
-
Proper database annotation is critical for good TTS quality. You'd be better off using https://google.github.io/df-conformer/librittsr/ instead.
-
I've been attempting to create a high-quality multi-speaker voice by finetuning the HQ Lessac voice. I wasn't really happy with my first test, so I decided to make a dataset with 4 voices, each with about an hour of recordings. I found 4 narrators on LibriVox with good-sounding recordings (low ambient noise, decent mixing, etc.).
To make the dataset, I used Audacity's "Analyze -> Label Sounds" function to mark sections based on silences. I looked for silences at least 0.7 seconds long and sections at least 3 seconds long. This did a fair job of finding sentences. I then manually reviewed the labels and adjusted any sections that looked too long to me (anything over 20 seconds). Finally, I used Export -> Export Multiple to get a directory full of utterance wav files.
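For anyone who'd rather script that step than click through Audacity, a rough equivalent with pydub might look like this (a sketch, not what I actually ran; the paths and silence threshold are placeholders you'd need to tune):

```python
# Sketch: silence-based splitting of long narrations into utterance clips.
# Paths and thresholds are assumptions; tune them per recording.
from pathlib import Path

from pydub import AudioSegment
from pydub.silence import split_on_silence

SOURCE_DIR = Path("librivox_raw")     # assumed folder of full-length narrator recordings
OUT_DIR = Path("dataset/wavs")        # assumed output folder for utterance clips
OUT_DIR.mkdir(parents=True, exist_ok=True)

for wav_path in sorted(SOURCE_DIR.glob("*.wav")):
    audio = AudioSegment.from_wav(str(wav_path))
    chunks = split_on_silence(
        audio,
        min_silence_len=700,             # ~0.7 s of silence marks a break
        silence_thresh=audio.dBFS - 16,  # relative threshold; adjust per recording
        keep_silence=200,                # keep a little padding at the clip edges
    )
    for i, chunk in enumerate(chunks):
        secs = len(chunk) / 1000
        if secs < 3 or secs > 20:        # same 3-20 second bounds I used in Audacity
            continue                     # anything outside that still needs manual review
        chunk.export(str(OUT_DIR / f"{wav_path.stem}_{i:04d}.wav"), format="wav")
```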
Next, I used Whisper-faster (a fork of OpenAI's Whisper) to create transcripts of the wav files.
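As a sketch, the transcription step boils down to something like this with the faster-whisper library (the model size, device, and paths are assumptions, not my exact settings):

```python
# Sketch: transcribe each utterance clip and write a plain-text transcript next to it.
from pathlib import Path

from faster_whisper import WhisperModel

WAV_DIR = Path("dataset/wavs")  # assumed clip folder from the previous step

model = WhisperModel("medium.en", device="cuda", compute_type="float16")

for wav_path in sorted(WAV_DIR.glob("*.wav")):
    segments, _info = model.transcribe(str(wav_path))
    # Each clip is a single utterance, so join the segment texts and drop the timestamps.
    text = " ".join(seg.text.strip() for seg in segments)
    wav_path.with_suffix(".txt").write_text(text, encoding="utf-8")
```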
I scripted up something to scrub the timestamps and build the dataset's metadata.csv file.
I didn't do much correcting of the transcripts, just cleaned up a bit of punctuation weirdness.
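The metadata step is roughly this (again a sketch; the paths and speaker label are placeholders, and I believe the multi-speaker format is id|speaker|text, but double-check Piper's training guide):

```python
# Sketch: assemble a pipe-delimited metadata.csv from the per-clip transcripts above.
from pathlib import Path

WAV_DIR = Path("dataset/wavs")
SPEAKER = "narrator_01"   # assumed label; use a distinct one per narrator

with open("dataset/metadata.csv", "w", encoding="utf-8") as f:
    for txt_path in sorted(WAV_DIR.glob("*.txt")):
        text = " ".join(txt_path.read_text(encoding="utf-8").split())  # collapse whitespace
        text = text.replace("|", ",")   # keep the delimiter out of the text column
        f.write(f"{txt_path.stem}|{SPEAKER}|{text}\n")
```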
Then I moved on to the actual training.
After some frustration getting a Colab notebook to work (I may return to that method someday...), I decided to get a spare computer I had kicking around up and running. It now has Ubuntu 22.04 installed and an RTX 3060 GPU. I did add "torchmetrics==0.11.4" to the requirements file when installing Piper. I simply followed the directions in the training guide, lowering my batch size to 12 and my max phoneme ids to 350 to make sure training would fit into my GPU's VRAM. I set it to train for 1000 epochs, and a couple of times along the way I copied out the latest .ckpt file so I could run tests.
I'm not crazy about the results due to a little weirdness with some pronunciations, stuff the Lessac voice gets right. Some vowel sounds are spoken with an "arrr": for instance, "water" sounds like "warter", "loch" sounds like "lark", and "phenomenon" sounds like "phenarmenon." That pronunciation is carried across all 4 voices. Mostly, though, the speakers sound pretty good.
I made and uploaded a video showing how each voice progresses from the base voice as training completes more epochs. I took checkpoints from epochs 208, 475, 738, and the final 1000. To my ear, there wasn't much change from 475 to the higher two. I used 2 sets of sentences for the comparison.
Video on YouTube
I have built another dataset using more recordings from one of the speakers (about 12 hours, almost 7k utterances), made with the same tools. I'm hesitant to start training it, either as a finetune or from scratch, if something in my process or my dataset is causing those weird pronunciations.
Do I need to edit metadata.csv more carefully to make sure Whisper got the text right? Is there a better way to make those transcriptions? All the text for the novels I used is on Project Gutenberg, but checking the transcripts against it for a large set sounds like a ton of work.
Do I need larger datasets? Should I figure out how to rent a GPU with more VRAM so I can run larger batches?
What do you think?
If you are interested, here is: