
Speaker_id during inference #5

Open
Srija616 opened this issue Jan 17, 2024 · 4 comments

Comments

@Srija616
Hi @ylacombe! I have multi-speaker data with which I have trained the Hindi checkpoint. I want to generate a particular speaker's voice during inference. Is there any way to do that using the inference code given in the README?

Here is how my current code looks:
```python
import scipy
from transformers import pipeline

model_id = "./vits_finetuned_hindi"
synthesiser = pipeline("text-to-speech", model_id, device=0)  # add device=0 if you want to use a GPU
speech = synthesiser("वहीं पंजाब सरकार ने सरबत खालसा के आयोजन के लिए, पंजाब के भठिंडा ज़िले में, तलवंडी साबो में, जगह देने से मना कर दिया है।")
scipy.io.wavfile.write("hindi_1.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
```

@ylacombe
Owner

Hey @Srija616, you can use the kwarg speaker_id like this:

```python
forward_params = {"speaker_id": XXXX}
text = "वहीं पंजाब सरकार ने सरबत खालसा के आयोजन के लिए, पंजाब के भठिंडा ज़िले में, तलवंडी साबो में, जगह देने से मना कर दिया है।"

speech = synthesiser(text, forward_params=forward_params)
```
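For reference, a minimal end-to-end sketch putting the two snippets together; the checkpoint path, output filename, and `speaker_id` value are placeholders to replace with your own:

```python
import scipy
from transformers import pipeline

# placeholder checkpoint path and speaker index; use your finetuned model and the speaker you want
synthesiser = pipeline("text-to-speech", "./vits_finetuned_hindi", device=0)

text = "वहीं पंजाब सरकार ने सरबत खालसा के आयोजन के लिए, पंजाब के भठिंडा ज़िले में, तलवंडी साबो में, जगह देने से मना कर दिया है।"
speech = synthesiser(text, forward_params={"speaker_id": 1})

scipy.io.wavfile.write("hindi_speaker_1.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
```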

Did you finetune using the multi-speaker feature from the training code?
Also, I'm quite curious how you feel about the quality of the model, so don't hesitate to let me know.
Best

@Srija616
Author

@ylacombe Yes, we have two speakers for Hindi (male and female), and these are the two params we tweaked to enable multi-speaker training. Just wondering if there are other params that need to be defined for multi-speaker training.
[screenshot of training config]

We are also facing two issues:

  1. During finetuning, train_loss_kl and val_loss_kl both go to infinity. We tested this with English finetuning using the ylacombe/mms-tts-eng-train model and hit the same problem there: train_loss_disc has NaN values, and the mel loss is not converging for either train or validation.
    The synthesized samples nevertheless sound good in terms of pronunciation and naturalness for English. For Hindi, we have pronunciation and naturalness issues.

  2. The speaker generated by the model does not sound like the speaker in the dataset, even though I passed the speaker_id as you suggested in your previous comment.
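(A minimal sketch of one way to check whether the finetuned checkpoint actually carries more than one speaker embedding, assuming it loads with transformers' `VitsModel`; the path is a placeholder:)

```python
from transformers import VitsModel

# placeholder path; point this at the finetuned checkpoint you run inference with
model = VitsModel.from_pretrained("./vits_finetuned_hindi")

# if this prints 1, the checkpoint has no per-speaker embeddings and speaker_id has no effect
print(model.config.num_speakers)
```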

Adding the wandb charts for our Hindi and English runs:

  1. Hindi
    [wandb loss charts for the Hindi run]

  2. English
    [wandb loss charts for the English run]

@ylacombe Was wondering if you have any thoughts on why these losses are going to infinity or NaN. It is possible that we are missing something trivial.

I can share the generated samples over email, if you'd like to hear them.

@ylacombe
Owner

Hey @Srija616, sorry for the late response! Nice project!
If you have two speakers, I'd recommend finetuning two separate times, since the original model only had one speaker and the speaker embeddings must be learned from scratch.

Can you send me your training config? I do have some great results from single-speaker fine-tuning.
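A minimal sketch of one way to split the data for two separate single-speaker runs, assuming a Hugging Face dataset with a `speaker_id` column; the dataset names, column name, and speaker indices are placeholders:

```python
from datasets import load_dataset

# hypothetical dataset name and column; adapt to your own data
dataset = load_dataset("your_username/hindi_tts_dataset", split="train")

# create one subset per speaker and finetune the single-speaker checkpoint separately on each
male_ds = dataset.filter(lambda example: example["speaker_id"] == 0)
female_ds = dataset.filter(lambda example: example["speaker_id"] == 1)

male_ds.push_to_hub("your_username/hindi_tts_male")
female_ds.push_to_hub("your_username/hindi_tts_female")
```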

@gsyllas

gsyllas commented Jul 4, 2024

Hello @ylacombe, as I am currently finetuning mms_tts_ell on a single-speaker dataset, would it be possible for you to assist me with the training configuration? My dataset consists of ~4 hours.
