Questions #1
Hi, I didn't do any further experiments. It could be improved, but it would need hundreds of voices. Would it be possible to use voices available on the web (e.g. Hugging Face)?
The result from the original downloaded model (the agent-47 voice from https://huggingface.co/chameleon-ai/so-vits-svc-models/tree/main): a.mp4
The result obtained by transferring the voice embedding to G_38_speakers_0_v74.pth: b.mp4
The result is not good. It seems that the voice embeddings are intimately tied to the rest of the model. If true, this means there is no other way than training a multispeaker model with hundreds of speakers, starting from D_0.pth, G_0.pth and the audio data. And, since it (probably) doesn't work between two SVC models, I don't think you can transfer voice embeddings between so-vits-svc and RVC.
Best regards,
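A minimal sketch of what "transferring the voice embedding" means here, assuming the checkpoints keep the speaker table under model['emb_g.weight']; the file names and slot indices are placeholders, not a tested recipe:

```python
import torch

donor = torch.load("G_agent47.pth", map_location="cpu")           # single-voice model
target = torch.load("G_38_speakers_0_v74.pth", map_location="cpu")  # multispeaker model

donor_emb = donor["model"]["emb_g.weight"]    # [n_speakers_donor, gin_channels]
target_emb = target["model"]["emb_g.weight"]  # [n_speakers_target, gin_channels]

src_id, dst_id = 0, 5                         # hypothetical speaker slots
target_emb[dst_id] = donor_emb[src_id]        # overwrite one row in place

torch.save(target, "G_38_speakers_transferred.pth")
```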
I understand that you plan to train an RVC multispeaker model. Actually, when you look at an RVC model, there is also an 'emb_g.weight' tensor in the 'weights' entry of the model. That tensor looks suspiciously similar to the one we find in SVC models. Assuming they are the same thing, it will be easy to do PCA on your multispeaker RVC model. Now, for your figure, I'm not sure what I'm looking at. What do you mean by "original pretrained model"? Do you mean the G_0.pth generator model? And what are these 'single' speaker embeddings? Does each red point come from a separately trained model? Sorry, I don't get it.
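A small sketch of how one could inspect the two speaker tables side by side to check that claim, assuming the key names mentioned above ('weights' for RVC, 'model' for SVC); both names are taken from the discussion, not verified here:

```python
import torch

rvc = torch.load("rvc_voice.pth", map_location="cpu")
svc = torch.load("G_svc_voice.pth", map_location="cpu")

# fall back to 'weight' in case the RVC checkpoint uses the singular key
rvc_emb = rvc.get("weights", rvc.get("weight"))["emb_g.weight"]
svc_emb = svc["model"]["emb_g.weight"]

print(rvc_emb.shape, svc_emb.shape)                 # same [n_speakers, channels] layout?
print(rvc_emb.float().std().item(), svc_emb.float().std().item())  # similar scale?
```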
Yes, the original pretrained models are a collection of D and G models from which almost all further trained models are derived. These models are trained on the VCTK dataset. That dataset has 109 speakers, so there are 109 embeddings. The green dots are the embeddings of each of those speakers. When you train an RVC model on a new voice, it starts from these pretrained models to cut down on compute and make it easier to train with less data. The first embedding of the model becomes the new voice's embedding, leaving us with 108 from the original, since those are not deleted but changed in some way. The red dots are these 108 embeddings from the 'single' voice model, and the blue dot is the embedding of the trained voice from this model. I hope this made it clearer.
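One way to check how much each of the 109 rows actually moves between the pretrained model and a fine-tuned single-voice model is to compare the tables row by row; a sketch, with placeholder file names and the same key assumptions as above:

```python
import torch
import torch.nn.functional as F

pretrained = torch.load("G_0.pth", map_location="cpu")["model"]["emb_g.weight"]
finetuned = torch.load("G_single_voice.pth", map_location="cpu")["model"]["emb_g.weight"]

# cosine similarity per speaker row: ~1.0 means the row barely moved during fine-tuning
cos = F.cosine_similarity(pretrained.float(), finetuned.float(), dim=1)
for spk_id, c in enumerate(cos.tolist()):
    if c < 0.99:
        print(f"speaker {spk_id} drifted, cosine similarity {c:.3f}")
```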
Aaaah! These points come from an RVC model. OK! So, it looks like you've already figured it all out by yourself. Great! I'm eager to listen to the result. Having good-quality voices that you could tune with a few sliders corresponding to the main principal directions would be great. I hope this will work.
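A rough sketch of what the slider idea could look like, assuming the speaker table lives under model['emb_g.weight'] and that writing a synthetic row into it is enough to get a new voice (both are assumptions, not a tested recipe):

```python
import numpy as np
import torch
from sklearn.decomposition import PCA

ckpt = torch.load("G_multispeaker.pth", map_location="cpu")
emb = ckpt["model"]["emb_g.weight"].float().numpy()        # [n_speakers, dim]

pca = PCA(n_components=8).fit(emb)
print(pca.explained_variance_ratio_)                       # how much each direction captures

sliders = np.array([0.8, -0.3, 0.1, 0, 0, 0, 0, 0], dtype=np.float32)  # one slider per direction
new_voice = pca.mean_ + sliders @ pca.components_          # map slider values back to embedding space

ckpt["model"]["emb_g.weight"][0] = torch.from_numpy(new_voice)  # overwrite hypothetical slot 0
torch.save(ckpt, "G_custom_voice.pth")
```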
Is it like with SVC, where you can listen to generated samples during training? Because I think this is the only pertinent metric. When it comes to generative adversarial networks, losses are not such a good indicator because the generator and the discriminator co-evolve (they fight against each other), so the losses tend to stay rather constant.
You get the models at each epoch but no samples in TensorBoard. Sample audio (not in the dataset), epoch 1:
For Id 0, the result is impressive. Now, it looks like there is a problem with Id 2: the output is very much like the voice you hear in your sample audio, not at all the one from the sample in the dataset (which is a male voice). Probably a confusion somewhere. Anyway, the generated outputs are very good. Do you really have to train more than that?
I haven't tried RVC, so I can't really help you with that.
I'm sorry I can't help you more but, again, I've never used RVC. If you face problems training with RVC, you really should ask the RVC community.
After my problems with RVC I have switched over to SVC. I used the same dataset and started training from the pretrained G_0 and D_0 files. This went much better than RVC. It has been trained for 61 epochs and outputs decent audio for most new voices. Here are some samples of random voices: The embeddings don't seem to cluster based on pitch or other voice characteristics, but rather based on the datasets used. I have made a minimal webui: https://github.com/ZegerUser/so-vits-svc-voice-lab
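For anyone who wants to eyeball that clustering themselves, a minimal sketch of projecting the speaker table to 2D (file name and key names are placeholders under the same assumptions as above):

```python
import torch
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

emb = torch.load("G_epoch_61.pth", map_location="cpu")["model"]["emb_g.weight"].float().numpy()
coords = PCA(n_components=2).fit_transform(emb)            # [n_speakers, 2]

plt.scatter(coords[:, 0], coords[:, 1], s=12)
for spk_id, (x, y) in enumerate(coords):
    plt.annotate(str(spk_id), (x, y), fontsize=6)          # label each point with its speaker id
plt.title("speaker embeddings, first two principal components")
plt.show()
```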
Hi!
So, unfortunately, I couldn't try it. UPDATE:
Best regards, |
Hello,
I recently came across your experiments via the so-vits-svc project link, since I wanted a way to generate unique voices without the degradation that comes with merging models.
Your results are promising.
But I have a few questions:
Did you do any further experiments on this?
Could this also work on RVC?
This is maybe a crazy idea, but after some tinkering with the emb_g weights from RVC and so-vits-svc, I saw that almost all of the single-speaker models use the same pretrained multispeaker model, yet their emb_g weights are different. Could it be possible to "reconstruct" the original pretrained embeddings from the trained single-speaker models, and then scale the embedding of the single speaker? That way we could just scrape models from sites and create a big dataset of embeddings, and also create a very general model by summing all the generators together. This would reduce the compute cost immensely for creating a model robust enough to generate new voices. A rough sketch of what I mean is below.
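Roughly what I have in mind, as a sketch only; the key names, the assumption that slot 0 holds the fine-tuned voice in every checkpoint, and the plain averaging are all guesses:

```python
import torch

paths = ["model_a.pth", "model_b.pth", "model_c.pth"]      # scraped single-speaker checkpoints
models = [torch.load(p, map_location="cpu")["model"] for p in paths]

# estimate the original embeddings: average each non-trained row across models,
# skipping slot 0, which is assumed to hold the fine-tuned voice everywhere
tables = torch.stack([m["emb_g.weight"].float() for m in models])  # [n_models, n_spk, dim]
recovered = tables[:, 1:].mean(dim=0)                              # estimate of original rows 1..n

# naive generator merge: plain average of every weight tensor
merged = {k: torch.stack([m[k].float() for m in models]).mean(dim=0).to(models[0][k].dtype)
          for k in models[0]}
```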
I don't know enough about math and how these models work to know if this could work.
Zeger