Multivoice finetune test. What can I do better? #189
StoryHack started this conversation in Show and tell
-
Proper database annotation is critical for good TTS quality. You'd be better off using https://google.github.io/df-conformer/librittsr/ instead.
-
I've been attempting to create a high-quality multi-speaker voice by finetuning the HQ Lessac voice. I wasn't really happy with my first test, so I decided to make a dataset with 4 voices, each with about an hour of recordings. I found 4 narrators on LibriVox with good-sounding recordings (low ambient noise, decent mixing, etc.).
To make the dataset, I used Audacity's "Analyze -> Label Sounds" function to mark sections based on silences. I looked for silences at least 0.7 seconds long and sections at least 3 seconds long. This did a fair job of finding sentences. I then manually reviewed the labels and adjusted any sections that looked too long to me (anything over 20 seconds). Finally, I used Export -> Export Multiple to get a directory full of utterance wav files.
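For anyone who'd rather script that step than click through Audacity, a rough equivalent with pydub might look like this (a sketch, not what I actually ran; the paths and silence threshold are placeholders you'd need to tune):

```python
# Sketch: silence-based splitting of long narrations into utterance clips.
# Paths and thresholds are assumptions; tune them per recording.
from pathlib import Path

from pydub import AudioSegment
from pydub.silence import split_on_silence

SOURCE_DIR = Path("librivox_raw")     # assumed folder of full-length narrator recordings
OUT_DIR = Path("dataset/wavs")        # assumed output folder for utterance clips
OUT_DIR.mkdir(parents=True, exist_ok=True)

for wav_path in sorted(SOURCE_DIR.glob("*.wav")):
    audio = AudioSegment.from_wav(str(wav_path))
    chunks = split_on_silence(
        audio,
        min_silence_len=700,             # ~0.7 s of silence marks a break
        silence_thresh=audio.dBFS - 16,  # relative threshold; adjust per recording
        keep_silence=200,                # keep a little padding at the clip edges
    )
    for i, chunk in enumerate(chunks):
        secs = len(chunk) / 1000
        if secs < 3 or secs > 20:        # same 3-20 second bounds I used in Audacity
            continue                     # anything outside that still needs manual review
        chunk.export(str(OUT_DIR / f"{wav_path.stem}_{i:04d}.wav"), format="wav")
```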
Next, I used Whisper-faster (a fork of OpenAI's Whisper) to create transcripts of the wav files.
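As a sketch, the transcription step boils down to something like this with the faster-whisper library (the model size, device, and paths are assumptions, not my exact settings):

```python
# Sketch: transcribe each utterance clip and write a plain-text transcript next to it.
from pathlib import Path

from faster_whisper import WhisperModel

WAV_DIR = Path("dataset/wavs")  # assumed clip folder from the previous step

model = WhisperModel("medium.en", device="cuda", compute_type="float16")

for wav_path in sorted(WAV_DIR.glob("*.wav")):
    segments, _info = model.transcribe(str(wav_path))
    # Each clip is a single utterance, so join the segment texts and drop the timestamps.
    text = " ".join(seg.text.strip() for seg in segments)
    wav_path.with_suffix(".txt").write_text(text, encoding="utf-8")
```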
I scripted up something to scrub the timestamps and build the dataset's metadata.csv file.
I didn't do much correcting of the transcripts, just cleaned up a bit of punctuation weirdness.
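The metadata step is roughly this (again a sketch; the paths and speaker label are placeholders, and I believe the multi-speaker format is id|speaker|text, but double-check Piper's training guide):

```python
# Sketch: assemble a pipe-delimited metadata.csv from the per-clip transcripts above.
from pathlib import Path

WAV_DIR = Path("dataset/wavs")
SPEAKER = "narrator_01"   # assumed label; use a distinct one per narrator

with open("dataset/metadata.csv", "w", encoding="utf-8") as f:
    for txt_path in sorted(WAV_DIR.glob("*.txt")):
        text = " ".join(txt_path.read_text(encoding="utf-8").split())  # collapse whitespace
        text = text.replace("|", ",")   # keep the delimiter out of the text column
        f.write(f"{txt_path.stem}|{SPEAKER}|{text}\n")
```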
Then I moved on to the actual training.
After some frustration getting a Colab notebook to work (I may return to that method someday...), I decided to get a spare computer I had kicking around up and running. It now has Ubuntu 22.04 installed and an RTX 3060 GPU. I did add "torchmetrics==0.11.4" to the requirements file when installing Piper. I simply followed the directions in the training guide, lowering my batch size to 12 and my max phoneme ids to 350 to make sure training would fit into my GPU's VRAM. I set it to train for 1000 epochs, and a couple of times along the way I copied out the latest .ckpt file so I could run tests.
I'm not crazy about the results due to a little weirdness with some pronunciations, stuff the Lessac voice gets right. Some vowel sounds are spoken with an "arrr": for instance, "water" sounds like "warter", "loch" sounds like "lark", and "phenomenon" sounds like "phenarmenon." That pronunciation is carried across all 4 voices. Mostly, though, the speakers sound pretty good.
I made and uploaded a video showing how each voice progresses from the base voice as training completes more epochs. I took checkpoints from epochs 208, 475, 738, and the final 1000. To my ear, there wasn't much change from 475 to the higher two. I used 2 sets of sentences for the comparison.
Video on YouTube
I have built another dataset using more recordings from one of the speakers (about 12 hours, almost 7k utterances), made with the same tools. I'm hesitant to start training it, either as a finetune or from scratch, if something in my process or my dataset is causing those weird pronunciations.
Do I need to edit metadata.csv more carefully to make sure Whisper got the text right? Is there a better way to make those transcriptions? All the text for the novels I used is on Project Gutenberg, but checking the transcripts against it for a large set sounds like a ton of work.
Do I need larger datasets? Should I figure out how to rent a GPU with more VRAM so I can run larger batches?
What do you think?
If you are interested, here is: