
Questions #1

Open
ZegerUser opened this issue Jun 5, 2024 · 14 comments

@ZegerUser

Hello,

I recently came across your experiments via the so-vits project. I was looking for a way to generate unique voices without the degradation that comes with merging models, and your results are promising.

But I have a few questions:
Did you do any further experiments on this?
Could this also work on RVC?

This is maybe a crazy idea, but after some tinkering with the emb_g weights from RVC and so-vits models, I noticed that almost all single-speaker models are derived from the same pretrained multispeaker model, yet their emb_g weights are different. Could it be possible to "reconstruct" the original pretrained embeddings from the trained single-speaker models and then scale the embedding of the single speaker? We could then scrape models from sites, build a big dataset of embeddings, and create a very general model by summing all the generators together. This would immensely reduce the compute cost of creating a model robust enough to generate new voices.
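For concreteness, the kind of peeking at emb_g I mean looks roughly like this (just a sketch; the file name and the "model"/"emb_g.weight" key layout are assumptions about how these checkpoints are organized):

```python
import torch

# Load a (single-speaker) generator checkpoint and inspect its speaker
# embedding table. Keys and file name are assumed, not guaranteed.
ckpt = torch.load("G_single_speaker.pth", map_location="cpu")
state = ckpt.get("model", ckpt)      # some checkpoints nest the weights
emb_g = state["emb_g.weight"]        # shape: (num_speakers, embedding_dim)
print(emb_g.shape)
```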

I don't know enough about math and how these models work to know if this could work.

Zeger

@sbersier
Owner

sbersier commented Jun 6, 2024

Hi,
I didn't respond immediately since I wanted to take a moment to think about your questions.

I didn't do any further experiments. The approach could be improved, but it would need hundreds of voices.

Would it be possible to use voices available on the web (i.e. Hugging Face)?
Honestly, I can't give you a definitive answer. But, out of curiosity, I downloaded a model from Hugging Face (after checking that the configuration files were compatible) and gave it a try. That is, I copied the voice embedding from the downloaded model and put it into the G_38_speakers_0_v74.pth model.
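In code, the transplant amounts to roughly the following (a sketch; the checkpoint key layout and the speaker indices are assumptions):

```python
import torch

# Copy one speaker embedding row from a downloaded model into another model.
src = torch.load("downloaded_model.pth", map_location="cpu")
dst = torch.load("G_38_speakers_0_v74.pth", map_location="cpu")

# Overwrite speaker slot 0 of the target with speaker slot 0 of the source.
dst["model"]["emb_g.weight"][0] = src["model"]["emb_g.weight"][0]
torch.save(dst, "G_38_speakers_transplanted.pth")
```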

The result from the original downloaded model (voice of agent-47 in https://huggingface.co/chameleon-ai/so-vits-svc-models/tree/main):

[video: a.mp4]

The result obtained by transferring the voice embedding to G_38_speakers_0_v74.pth:

[video: b.mp4]

The result is not good. It seems that the voice embeddings are intimately related to the rest of the model. If true, this means that there is no other way than to train a multispeaker model including hundreds of speakers, starting from D_0.pth, G_0.pth and the audio data.

And, since it (probably) doesn't work between two SVC models, I don't think you can transfer voice embeddings between so-vits-svc and RVC.

Best regards,
S. Bersier

@ZegerUser
Author

Hello, thanks for your response.

After my comment yesterday I also did some small experiments.

I tried transferring embeddings from a trained model to the original pretrained one, without luck; as you said, they are somehow tied to the whole model.

In green, the original pretrained model's embeddings; in red, the 'single'-speaker model's remaining embeddings; in blue, the embedding of the single speaker in that model:
[image: scatter plot of speaker embeddings]
They only seem to shift a little bit, so maybe a NN could be trained to reverse this, or some other way could be found.
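The shift can be quantified with something like this (a sketch; the paths, the key layout and the "slot 0 holds the new voice" convention are assumptions):

```python
import torch

# Compare the leftover embedding rows of a fine-tuned single-speaker model
# against the same rows of the pretrained generator.
pre = torch.load("G_0.pth", map_location="cpu")["model"]["emb_g.weight"]
tuned = torch.load("G_single.pth", map_location="cpu")["model"]["emb_g.weight"]

shift = (tuned[1:] - pre[1:]).norm(dim=1)   # drift of the 108 remaining rows
print(shift.mean().item(), shift.max().item())
```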

I also tried straight-up copy-pasting embeddings; it produced something, but not the voices it was supposed to.

I am now in the process of creating/training a new RVC model with 278 English speakers, to test this on a more robust model.
Each speaker has between 30 min and 1 h of data. The dataset includes VCTK, VocalSet, GenshinVoice and a self-made one; I tried to include lots of accents to create a diverse dataset. I will share more about this once it has started training.

I am also trying to create a small gradio webapp to make it easier to create new voices.

Zeger

@sbersier
Owner

sbersier commented Jun 6, 2024

I understand that you plan to train an RVC multispeaker model. Actually, when you look at an RVC model, there is also an 'emb_g.weight' tensor in the 'weights' entry of the model. That tensor looks suspiciously similar to the one we find in SVC models. Assuming they are the same thing, it will be easy to do PCA on your multispeaker RVC model.
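The PCA step itself would be short, something along these lines (a sketch, with the same assumed key layout as before):

```python
import torch
from sklearn.decomposition import PCA

# Fit PCA on the (num_speakers, dim) speaker embedding matrix.
emb = torch.load("G_multi.pth", map_location="cpu")["model"]["emb_g.weight"]
pca = PCA(n_components=8)
coords = pca.fit_transform(emb.numpy())        # per-speaker coordinates
print(pca.explained_variance_ratio_)

# A new voice is then a point in PCA space mapped back to embedding space.
new_emb = pca.inverse_transform(coords.mean(axis=0, keepdims=True))
```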

Now, for your figure, I'm not sure what I'm looking at. What do you mean by "original pretrained model"? Do you mean the G_0.pth generator model? And what are these 'single' speaker embeddings? Does each red point come from a separately trained model? Sorry, I don't get it.

@ZegerUser
Author

Yes, the original pretrained models are a collection of D and G models from which almost all further trained models are derived. These models are trained on the VCTK dataset, which has 109 speakers, so there are 109 embeddings; the green dots are the embeddings of each of these speakers. When you train an RVC model on a new voice, it starts from these models to cut down on compute and make training easier with less data. The first embedding of the model becomes the new voice's embedding, leaving us with 108 from the original; these are not deleted but changed in some way. The red dots are these 108 embeddings from the 'single'-voice model, and the blue dot is the embedding of the trained voice from that model.

I hope this made it clearer.

@sbersier
Owner

sbersier commented Jun 7, 2024

Aaaah! These points come from an RVC model. OK! So, it looks like you've already figured it all out by yourself. Great! I'm eager to listen to the result. Having good-quality voices that you could tune with a few sliders corresponding to the main principal directions would be great. I hope this will work.

@ZegerUser
Author

An hour ago I started training the model. It will probably take a while, since a single epoch takes 1 h 30 min on my 3090.

Below are some graphs:
[image: training curves]
[image: training curves]

@sbersier
Owner

sbersier commented Jun 7, 2024

Is it like with SVC, where you can listen to generated samples during training? I think this is the only pertinent metric. When it comes to generative adversarial networks, losses are not such a good indicator, because the generator and the discriminator co-evolve (they fight against each other), so the losses tend to stay rather constant.

@ZegerUser
Author

You get the models at each epoch, but no samples in TensorBoard.

sample audio (not in dataset):
https://voca.ro/1icXuEQ0xVBm

Epoch 1:
id 0
sample in dataset: https://voca.ro/11XeosxXvSb1
model output: https://voca.ro/17kT0LKi205y
id 2
sample in dataset: https://voca.ro/1m4gXpnQUpzm
model output: https://voca.ro/1n8F6A0Adi4o

@sbersier
Owner

sbersier commented Jun 7, 2024

For id 0, the result is impressive. Now, it looks like there is a problem with id 2: the output is very much like the voice you hear in your sample audio, not at all the one from the sample in the dataset (which is a male voice). Probably a confusion somewhere. Anyway, the generated outputs are very good. Do you really have to train more than that?

@ZegerUser
Author

I think it is because the embeddings and the model don't line up; I set the embeddings to all zeros, which was probably not the best choice. If you look at the progression of the embedding PCA components, they seem to diverge.

The blue ones are from the original pretrained models.
[image: PCA components over training]
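For reference, the zero init versus a more conventional alternative (a sketch; 278 speakers and dimension 256 are assumptions):

```python
import torch.nn as nn

emb_g = nn.Embedding(278, 256)               # num_speakers x embedding_dim
nn.init.zeros_(emb_g.weight)                 # what I tried: all zeros
# nn.init.normal_(emb_g.weight, std=0.01)    # small random init instead
```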

@ZegerUser
Author

I may have found a problem with the training. My first attempts were fine-tuning from the provided RVC pretrained models; these failed because the gradients kept exploding or vanishing, or the loss went NaN. After this I switched over to a fully randomly initialized model, which has now been training stably for 4 days and 12 hours. After doing some more analysis of the embeddings, I found that mixing datasets may not be a good idea: when I take the cosine similarity of all the embeddings (a sketch of this check is below the images), you can see four distinct squares, and these match exactly the datasets I used. I hope it will fix itself if I train longer.
Epoch 70:
[image: cosine-similarity matrix]
Epoch 1:
[image: cosine-similarity matrix]
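The similarity check behind these figures is essentially (a sketch; path and key layout assumed):

```python
import torch
import torch.nn.functional as F

emb = torch.load("G_epoch70.pth", map_location="cpu")["model"]["emb_g.weight"]
normed = F.normalize(emb, dim=1)
sim = normed @ normed.T   # (num_speakers, num_speakers) cosine similarities
# With the datasets concatenated in order, per-dataset clusters show up as
# bright squares along the diagonal of sim.
```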

@sbersier
Owner

I haven't tried RVC, so I can't really help you with that.

  1. "I switched to a fully randomized model":
    With so-vits-svc, the program comes with two pre-trained models: D_0.pth and G_0.pth (the discriminator and the generator). Pre-training these models took a LOT of time and effort and you shouldn't try to train them from scratch, assuming this is what you mean by "fully randomized model".

  2. "Gradients kept exploding and vanishing or loss that went nan":
    Again, I don't know RVC but this is weird... This kind of model should be rather stable. Are you sure that everything is correct with your audio samples (duration, sampling rate, bit depth)?

  3. "Mixing datasets maybe not a good idea."
    I don't see why this should be a problem. Of course, if you have a dataset with very good audio and a dataset with crapy/saturated/noisy audio this will probably show up in the result.

I'm sorry I can't help you more but, again, I've never used RVC. If you face problems training with RVC, you really should ask the RVC community.

@ZegerUser
Author

ZegerUser commented Jun 19, 2024

After my problems with RVC I have switched over to SVC. I used the same dataset and started training from the pretrained G_0 and D_0 files. This went much better than RVC. The model has been trained for 61 epochs and outputs decent audio for most new voices.

Here are some samples of random voices:
https://github.com/sbersier/pca_svc/assets/90973852/018a2bb6-ede0-4c10-8056-eb7ea83957bf

The embeddings don't seem to cluster based on pitch or other voice characteristics, but rather based on the datasets used.
[image: dataset_components]
[image: male_female]

I have made a minimal webui: https://github.com/ZegerUser/so-vits-svc-voice-lab
This is the model link: https://huggingface.co/Zeger56644/voice-lab-v1

@sbersier
Owner

sbersier commented Jun 19, 2024

Hi!

  1. The samples generated with a random voice are very good. Congrats!

  2. I wouldn't bother too much about this dataset clustering thing as long as the end result is useful and good enough. I had a look at the publicly available datasets you used and indeed they seem quite different. The Genshin dataset is very particular (probably a lot of young, high-pitched teen girls in it...) whereas the VCTK dataset is probably more balanced in that respect.

  3. After downloading your model (.pth and .json) I tried your webui.py

  • The model doesn't load automatically. I had to duplicate the model files in order to be able to load it properly; otherwise, it says that the model is not loaded (I checked that loaded_model is indeed None).
  • I had to replace line 47 (in your webui.py) with:
    return "Model loaded successfully", gr.Dropdown(choices=sample_names)
    because the "update" method is deprecated.
  • Your model is loaded with:
    loaded_model = torch.load(models[index], map_location="cpu")
    which is unsafe.
    It should be loaded with:
    loaded_model = torch.load(models[index], map_location="cpu", weights_only=True)
    But when I modify that, in order to be safe, it says that it can't load the model without the 'weights_only=False' option.
    I don't know why this is the case for your model. When you look at my "random_voices.py" file, on line 32, you see that the 'weights_only=True' option is set. And that's how it should be. You shouldn't have to disable a security feature in order to run your model. (A possible workaround is sketched below.)

So, unfortunately, I couldn't try it.
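The workaround, if you trust the file enough to load it once: re-save a tensors-only copy, which then loads cleanly with weights_only=True (a sketch; the "model" key layout is an assumption):

```python
import torch

# One unsafe load (only for files you trust), then keep just the tensors.
ckpt = torch.load("model.pth", map_location="cpu", weights_only=False)
state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
clean = {k: v for k, v in state.items() if isinstance(v, torch.Tensor)}
torch.save({"model": clean}, "model_safe.pth")

# From now on, the safe option works.
safe = torch.load("model_safe.pth", map_location="cpu", weights_only=True)
```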

UPDATE:
My torch version: '2.0.1+cu117'
My gradio version: '4.28.3'

  1. Setting the number of sliders to 32 looks a bit like overkill... I'm not sure you have enough data to be that granular. When it comes to user experience, it's better to have a few sliders that each have a clear, audible effect.

  2. It would be nice to add a "seed" value, since the whole point of this is to be able to generate voices such that nobody can come and say "You stole my voice!". Then, given your model, the slider positions and a seed value, anyone would be able to "prove" that they didn't clone an existing voice.
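Something as simple as this would do (a sketch; the embedding dimension is an assumption):

```python
import torch

# Same seed -> same embedding, every time: that's what makes the
# "I didn't clone anyone" argument reproducible.
def random_voice(seed: int, dim: int = 256) -> torch.Tensor:
    g = torch.Generator().manual_seed(seed)
    return torch.randn(dim, generator=g)

print(random_voice(42)[:4])  # deterministic across runs
```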

Best regards,
Stéphane
