
Speaker_id during inference #5

Open
Srija616 opened this issue Jan 17, 2024 · 4 comments

Comments

@Srija616
Hi @ylacombe! I have multi-speaker data with which I have trained the Hindi checkpoint. I want to generate a particular speaker's voice during inference. Is there any way to do that using the inference code given in the README?

Here is how my current code looks:
```python
import scipy
from transformers import pipeline

model_id = "./vits_finetuned_hindi"
synthesiser = pipeline("text-to-speech", model_id, device=0)  # add device=0 if you want to use a GPU
speech = synthesiser("वहीं पंजाब सरकार ने सरबत खालसा के आयोजन के लिए, पंजाब के भठिंडा ज़िले में, तलवंडी साबो में, जगह देने से मना कर दिया है।")
scipy.io.wavfile.write("hindi_1.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
```

@ylacombe
Owner

Hey @Srija616, you can use the kwarg speaker_id like this:

```python
forward_params = {"speaker_id": XXXX}
text = "वहीं पंजाब सरकार ने सरबत खालसा के आयोजन के लिए, पंजाब के भठिंडा ज़िले में, तलवंडी साबो में, जगह देने से मना कर दिया है।"

speech = synthesiser(text, forward_params=forward_params)
```
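For reference, a minimal end-to-end sketch putting the two snippets together; the checkpoint path, output filename, and `speaker_id` value are placeholders to replace with your own:

```python
import scipy
from transformers import pipeline

# placeholder checkpoint path and speaker index; use your finetuned model and the speaker you want
synthesiser = pipeline("text-to-speech", "./vits_finetuned_hindi", device=0)

text = "वहीं पंजाब सरकार ने सरबत खालसा के आयोजन के लिए, पंजाब के भठिंडा ज़िले में, तलवंडी साबो में, जगह देने से मना कर दिया है।"
speech = synthesiser(text, forward_params={"speaker_id": 1})

scipy.io.wavfile.write("hindi_speaker_1.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
```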

Did you finetune using the multi-speaker feature from the training code?
Also, I'm quite curious how you feel about the quality of the model, so don't hesitate to let me know.
Best

@Srija616
Author

@ylacombe Yes, we have two speakers for Hindi (male and female), and these are the two params we tweaked to enable multi-speaker training. Just wondering if there are other params that need to be defined for multi-speaker training.
[screenshot of training config]

We are also facing two issues:

  1. During finetuning, train_loss_kl and val_loss_kl both go to infinity. We tested this with English finetuning using the ylacombe/mms-tts-eng-train model and hit the same problem there: train_loss_disc has NaN values, and the mel loss is not converging for either train or validation.
    The synthesized samples nevertheless sound good in terms of pronunciation and naturalness for English. For Hindi, we have pronunciation and naturalness issues.

  2. The speaker generated by the model does not sound like the speaker in the dataset, even though I passed the speaker_id as you suggested in your previous comment.
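(A minimal sketch of one way to check whether the finetuned checkpoint actually carries more than one speaker embedding, assuming it loads with transformers' `VitsModel`; the path is a placeholder:)

```python
from transformers import VitsModel

# placeholder path; point this at the finetuned checkpoint you run inference with
model = VitsModel.from_pretrained("./vits_finetuned_hindi")

# if this prints 1, the checkpoint has no per-speaker embeddings and speaker_id has no effect
print(model.config.num_speakers)
```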

Adding the wandb charts for our Hindi and English runs:

  1. Hindi
    [wandb loss charts for the Hindi run]

  2. English
    [wandb loss charts for the English run]

@ylacombe Was wondering if you have any thoughts on why these losses are going to infinity or NaN. It is possible that we are missing something trivial.

I can share the generated samples over email, if you'd like to hear them.

@ylacombe
Owner

Hey @Srija616, sorry for the late response! Nice project!
If you have two speakers, I'd recommend finetuning two separate times, since the original model only had one speaker and the speaker embeddings must be learned from scratch.

Can you send me your training config? I do have some great results from single-speaker fine-tuning.
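A minimal sketch of one way to split the data for two separate single-speaker runs, assuming a Hugging Face dataset with a `speaker_id` column; the dataset names, column name, and speaker indices are placeholders:

```python
from datasets import load_dataset

# hypothetical dataset name and column; adapt to your own data
dataset = load_dataset("your_username/hindi_tts_dataset", split="train")

# create one subset per speaker and finetune the single-speaker checkpoint separately on each
male_ds = dataset.filter(lambda example: example["speaker_id"] == 0)
female_ds = dataset.filter(lambda example: example["speaker_id"] == 1)

male_ds.push_to_hub("your_username/hindi_tts_male")
female_ds.push_to_hub("your_username/hindi_tts_female")
```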

@gsyllas

gsyllas commented Jul 4, 2024

Hello @ylacombe, as I am currently finetuning mms_tts_ell on a single-speaker dataset, would it be possible for you to assist me with the training configuration? My dataset consists of ~4 hours.
