
Questions #1

Open
ZegerUser opened this issue Jun 5, 2024 · 14 comments

@ZegerUser

Hello,

I recently came across your experiments via the so-vits project. I was looking for a way to generate unique voices without the degradation that comes with merging models, and your results are promising.

But I have a few questions:
Did you do any further experiments on this?
Could this also work on RVC?

This is maybe a crazy idea, but after some tinkering with the emb_g weights from RVC and so-vits models, I noticed that almost all single-speaker models are derived from the same pretrained multispeaker model, yet their emb_g weights are different. Could it be possible to "reconstruct" the original pretrained embeddings from the trained single-speaker models and then scale the embedding of the single speaker? We could then scrape models from sites, build a big dataset of embeddings, and create a very general model by summing all the generators together. This would immensely reduce the compute cost of creating a model robust enough to generate new voices.
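For concreteness, the kind of peeking at emb_g I mean looks roughly like this (just a sketch; the file name and the "model"/"emb_g.weight" key layout are assumptions about how these checkpoints are organized):

```python
import torch

# Load a (single-speaker) generator checkpoint and inspect its speaker
# embedding table. Keys and file name are assumed, not guaranteed.
ckpt = torch.load("G_single_speaker.pth", map_location="cpu")
state = ckpt.get("model", ckpt)      # some checkpoints nest the weights
emb_g = state["emb_g.weight"]        # shape: (num_speakers, embedding_dim)
print(emb_g.shape)
```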

I don't know enough about math and how these models work to know if this could work.

Zeger

@sbersier
Owner

sbersier commented Jun 6, 2024

Hi,
I didn't respond immediately since I wanted to take a moment to think about your questions.

I didn't do any further experiments. The approach could be improved, but it would need hundreds of voices.

Would it be possible to use voices available on the web (i.e. Hugging Face)?
Honestly, I can't give you a definitive answer. But, out of curiosity, I downloaded a model from Hugging Face (after checking that the configuration files were compatible) and gave it a try. That is, I copied the voice embedding from the downloaded model and put it into the G_38_speakers_0_v74.pth model.
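In code, the transplant amounts to roughly the following (a sketch; the checkpoint key layout and the speaker indices are assumptions):

```python
import torch

# Copy one speaker embedding row from a downloaded model into another model.
src = torch.load("downloaded_model.pth", map_location="cpu")
dst = torch.load("G_38_speakers_0_v74.pth", map_location="cpu")

# Overwrite speaker slot 0 of the target with speaker slot 0 of the source.
dst["model"]["emb_g.weight"][0] = src["model"]["emb_g.weight"][0]
torch.save(dst, "G_38_speakers_transplanted.pth")
```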

The result from the original downloaded model (voice of agent-47 in https://huggingface.co/chameleon-ai/so-vits-svc-models/tree/main):

[video: a.mp4]

The result obtained by transferring the voice embedding to G_38_speakers_0_v74.pth:

[video: b.mp4]

The result is not good. It seems that the voice embeddings are intimately related to the rest of the model. If true, this means that there is no other way than to train a multispeaker model including hundreds of speakers, starting from D_0.pth, G_0.pth and the audio data.

And, since it (probably) doesn't work between two SVC models, I don't think you can transfer voice embeddings between so-vits-svc and RVC.

Best regards,
S. Bersier

@ZegerUser
Author

Hello, thanks for your response.

After my comment yesterday I also did some small experiments.

I tried transferring embeddings from a trained model to the original pretrained one, without luck; as you said, they are somehow tied to the whole model.

In green, the original pretrained model's embeddings; in red, the 'single'-speaker model's remaining embeddings; in blue, the embedding of the single speaker in that model:
[image: scatter plot of speaker embeddings]
They only seem to shift a little bit, so maybe a NN could be trained to reverse this, or some other way could be found.
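The shift can be quantified with something like this (a sketch; the paths, the key layout and the "slot 0 holds the new voice" convention are assumptions):

```python
import torch

# Compare the leftover embedding rows of a fine-tuned single-speaker model
# against the same rows of the pretrained generator.
pre = torch.load("G_0.pth", map_location="cpu")["model"]["emb_g.weight"]
tuned = torch.load("G_single.pth", map_location="cpu")["model"]["emb_g.weight"]

shift = (tuned[1:] - pre[1:]).norm(dim=1)   # drift of the 108 remaining rows
print(shift.mean().item(), shift.max().item())
```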

I also tried straight-up copy-pasting embeddings; it produced something, but not the voices it was supposed to.

I am now in the process of creating/training a new RVC model with 278 English speakers, to test this on a more robust model.
Each speaker has between 30 min and 1 h of data. The dataset includes VCTK, VocalSet, GenshinVoice and a self-made one; I tried to include lots of accents to create a diverse dataset. I will share more about this once it has started training.

I am also trying to create a small gradio webapp to make it easier to create new voices.

Zeger

@sbersier
Owner

sbersier commented Jun 6, 2024

I understand that you plan to train an RVC multispeaker model. Actually, when you look at an RVC model, there is also an 'emb_g.weight' tensor in the 'weights' entry of the model. That tensor looks suspiciously similar to the one we find in SVC models. Assuming they are the same thing, it will be easy to do PCA on your multispeaker RVC model.
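The PCA step itself would be short, something along these lines (a sketch, with the same assumed key layout as before):

```python
import torch
from sklearn.decomposition import PCA

# Fit PCA on the (num_speakers, dim) speaker embedding matrix.
emb = torch.load("G_multi.pth", map_location="cpu")["model"]["emb_g.weight"]
pca = PCA(n_components=8)
coords = pca.fit_transform(emb.numpy())        # per-speaker coordinates
print(pca.explained_variance_ratio_)

# A new voice is then a point in PCA space mapped back to embedding space.
new_emb = pca.inverse_transform(coords.mean(axis=0, keepdims=True))
```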

Now, for your figure, I'm not sure what I'm looking at. What do you mean by "original pretrained model"? Do you mean the G_0.pth generator model? And what are these 'single' speaker embeddings? Does each red point come from a separately trained model? Sorry, I don't get it.

@ZegerUser
Author

Yes, the original pretrained models are a collection of D and G models from which almost all further trained models are derived. These models are trained on the VCTK dataset, which has 109 speakers, so there are 109 embeddings; the green dots are the embeddings of each of these speakers. When you train an RVC model on a new voice, it starts from these models to cut down on compute and make training easier with less data. The first embedding of the model becomes the new voice's embedding, leaving us with 108 from the original; these are not deleted but changed in some way. The red dots are these 108 embeddings from the 'single'-voice model, and the blue dot is the embedding of the trained voice from that model.

I hope this made it clearer.

@sbersier
Owner

sbersier commented Jun 7, 2024

Aaaah! These points come from an RVC model. OK! So, it looks like you've already figured it all out by yourself. Great! I'm eager to listen to the result. Having good-quality voices that you could tune with a few sliders corresponding to the main principal directions would be great. I hope this will work.

@ZegerUser
Author

An hour ago I started training the model. It will probably take a while, since a single epoch takes 1 h 30 min on my 3090.

Below are some graphs:
[image: training curves]
[image: training curves]

@sbersier
Owner

sbersier commented Jun 7, 2024

Is it like with SVC, where you can listen to generated samples during training? I think this is the only pertinent metric. When it comes to generative adversarial networks, losses are not such a good indicator, because the generator and the discriminator co-evolve (they fight against each other), so the losses tend to stay rather constant.

@ZegerUser
Author

You get the models at each epoch, but no samples in TensorBoard.

sample audio (not in dataset):
https://voca.ro/1icXuEQ0xVBm

Epoch 1:
id 0
sample in dataset: https://voca.ro/11XeosxXvSb1
model output: https://voca.ro/17kT0LKi205y
id 2
sample in dataset: https://voca.ro/1m4gXpnQUpzm
model output: https://voca.ro/1n8F6A0Adi4o

@sbersier
Owner

sbersier commented Jun 7, 2024

For id 0, the result is impressive. Now, it looks like there is a problem with id 2: the output is very much like the voice you hear in your sample audio, not at all the one from the sample in the dataset (which is a male voice). Probably a confusion somewhere. Anyway, the generated outputs are very good. Do you really have to train more than that?

@ZegerUser
Author

I think it is because the embeddings and the model don't line up; I set the embeddings to all zeros, which was probably not the best choice. If you look at the progression of the embedding PCA components, they seem to diverge.

The blue ones are from the original pretrained models.
[image: PCA components over training]
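For reference, the zero init versus a more conventional alternative (a sketch; 278 speakers and dimension 256 are assumptions):

```python
import torch.nn as nn

emb_g = nn.Embedding(278, 256)               # num_speakers x embedding_dim
nn.init.zeros_(emb_g.weight)                 # what I tried: all zeros
# nn.init.normal_(emb_g.weight, std=0.01)    # small random init instead
```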

@ZegerUser
Author

I may have found a problem with the training. My first attempts were fine-tuning from the provided RVC pretrained models; these failed because the gradients kept exploding or vanishing, or the loss went NaN. After this I switched over to a fully randomly initialized model, which has now been training stably for 4 days and 12 hours. After doing some more analysis of the embeddings, I found that mixing datasets may not be a good idea: when I take the cosine similarity of all the embeddings (a sketch of this check is below the images), you can see four distinct squares, and these match exactly the datasets I used. I hope it will fix itself if I train longer.
Epoch 70:
[image: cosine-similarity matrix]
Epoch 1:
[image: cosine-similarity matrix]
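The similarity check behind these figures is essentially (a sketch; path and key layout assumed):

```python
import torch
import torch.nn.functional as F

emb = torch.load("G_epoch70.pth", map_location="cpu")["model"]["emb_g.weight"]
normed = F.normalize(emb, dim=1)
sim = normed @ normed.T   # (num_speakers, num_speakers) cosine similarities
# With the datasets concatenated in order, per-dataset clusters show up as
# bright squares along the diagonal of sim.
```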

@sbersier
Owner

I haven't tried RVC, so I can't really help you with that.

  1. "I switched to a fully randomized model":
    With so-vits-svc, the program comes with two pre-trained models: D_0.pth and G_0.pth (the discriminator and the generator). Pre-training these models took a LOT of time and effort and you shouldn't try to train them from scratch, assuming this is what you mean by "fully randomized model".

  2. "Gradients kept exploding and vanishing or loss that went nan":
    Again, I don't know RVC but this is weird... This kind of model should be rather stable. Are you sure that everything is correct with your audio samples (duration, sampling rate, bit depth)?

  3. "Mixing datasets maybe not a good idea."
    I don't see why this should be a problem. Of course, if you have a dataset with very good audio and a dataset with crapy/saturated/noisy audio this will probably show up in the result.

I'm sorry I can't help you more but, again, I've never used RVC. If you face problems training with RVC, you really should ask the RVC community.

@ZegerUser
Author

ZegerUser commented Jun 19, 2024

After my problems with RVC I have switched over to SVC. I used the same dataset and started training from the pretrained G_0 and D_0 files. This went much better than RVC. The model has been trained for 61 epochs and outputs decent audio for most new voices.

Here are some samples of random voices:
https://github.com/sbersier/pca_svc/assets/90973852/018a2bb6-ede0-4c10-8056-eb7ea83957bf

The embeddings don't seem to cluster based on pitch or other voice characteristics, but rather based on the datasets used.
[image: dataset_components]
[image: male_female]

I have made a minimal webui: https://github.com/ZegerUser/so-vits-svc-voice-lab
This is the model link: https://huggingface.co/Zeger56644/voice-lab-v1

@sbersier
Owner

sbersier commented Jun 19, 2024

Hi!

  1. The samples generated with a random voice are very good. Congrats!

  2. I wouldn't bother too much about this dataset clustering thing as long as the end result is useful and good enough. I had a look at the publicly available datasets you used and indeed they seem quite different. The Genshin dataset is very particular (probably a lot of young, high-pitched teen girls in it...) whereas the VCTK dataset is probably more balanced in that respect.

  3. After downloading your model (.pth and .json) I tried your webui.py

  • The model doesn't load automatically. I had to duplicate the model files in order to be able to load it properly; otherwise, it says that the model is not loaded (I checked that loaded_model is indeed None).
  • I had to replace line 47 (in your webui.py) with:
    return "Model loaded successfully", gr.Dropdown(choices=sample_names)
    because the "update" method is deprecated.
  • Your model is loaded with:
    loaded_model = torch.load(models[index], map_location="cpu")
    which is unsafe.
    It should be loaded with:
    loaded_model = torch.load(models[index], map_location="cpu", weights_only=True)
    But when I modify that, in order to be safe, it says that it can't load the model without the 'weights_only=False' option.
    I don't know why this is the case for your model. When you look at my "random_voices.py" file, on line 32, you see that the 'weights_only=True' option is set. And that's how it should be. You shouldn't have to disable a security feature in order to run your model. (A possible workaround is sketched below.)

So, unfortunately, I couldn't try it.
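The workaround, if you trust the file enough to load it once: re-save a tensors-only copy, which then loads cleanly with weights_only=True (a sketch; the "model" key layout is an assumption):

```python
import torch

# One unsafe load (only for files you trust), then keep just the tensors.
ckpt = torch.load("model.pth", map_location="cpu", weights_only=False)
state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
clean = {k: v for k, v in state.items() if isinstance(v, torch.Tensor)}
torch.save({"model": clean}, "model_safe.pth")

# From now on, the safe option works.
safe = torch.load("model_safe.pth", map_location="cpu", weights_only=True)
```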

UPDATE:
My torch version: '2.0.1+cu117'
My gradio version: '4.28.3'

  1. Setting the number of sliders to 32 looks a bit like overkill... I'm not sure you have enough data to be that granular. When it comes to user experience, it's better to have a few sliders that each have a clear, audible effect.

  2. It would be nice to add a "seed" value, since the whole point of this is to be able to generate voices such that nobody can come and say "You stole my voice!". Then, given your model, the slider positions and a seed value, anyone would be able to "prove" that they didn't clone an existing voice.
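Something as simple as this would do (a sketch; the embedding dimension is an assumption):

```python
import torch

# Same seed -> same embedding, every time: that's what makes the
# "I didn't clone anyone" argument reproducible.
def random_voice(seed: int, dim: int = 256) -> torch.Tensor:
    g = torch.Generator().manual_seed(seed)
    return torch.randn(dim, generator=g)

print(random_voice(42)[:4])  # deterministic across runs
```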

Best regards,
Stéphane
