
about n_spks #30

Open
0913ktg opened this issue Feb 19, 2024 · 6 comments


0913ktg commented Feb 19, 2024

Hello p0p4k,

I am deeply grateful for the code you have provided.

I have a question while adapting it to a Korean version. I am preparing to use a speech-to-text dataset with approximately 2000 speakers.

However, the dataset does not contain speaker labels for the voice data. From what I understand in the paper, it seems that speaker information is learned using a 3-second prompt without explicitly using speaker labels.

If my understanding is correct, this suggests that speaker labels are not necessary for training the model, and thus, not required in the filelist for multi-speaker synthesis.

Yet, I noticed that the code is designed to use speaker labels at the first index of the filelist when n_spks is greater than 1.
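
For concreteness, this is the format I mean (the paths and text are made up; I'm assuming the pipe-separated filelist convention this repo appears to inherit from Matcha-TTS):

```
# when n_spks > 1: speaker id at the first index after the path
data/spk0/utt1.wav|0|안녕하세요
# when n_spks == 1: no speaker column
data/utt1.wav|안녕하세요
```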

I would be extremely grateful if you could clarify this part for me.

My understanding of this paper is still quite limited, and I apologize if my question seems naive.

Thank you.


p0p4k commented Feb 19, 2024

Just ignore the speaker list and anything regarding n_spks. Send audio|text pairs to the dataloader as if it's one big dataset; pflow will take care of it. Your understanding is correct. n_spks exists for future use.
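
Roughly, the trick is that during training the prompt is just a random ~3-second slice of the same utterance's mel, so no speaker label is needed. A minimal sketch of the idea (hypothetical helper, not the repo's actual code):

```python
import torch

def random_prompt(mel: torch.Tensor, prompt_frames: int = 264):
    """Cut a random fixed-length window from a mel spectrogram of
    shape [n_mels, T] to act as the speaker prompt.

    Illustrative only: 264 frames is roughly 3 s at a ~11.6 ms hop,
    and utterances shorter than the prompt just return what's there.
    """
    n_mels, T = mel.shape
    start = torch.randint(0, max(T - prompt_frames, 0) + 1, (1,)).item()
    return mel[:, start:start + prompt_frames], start
```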


0913ktg commented Feb 20, 2024

Thank you for your response.
Have a great day.


0913ktg commented Mar 1, 2024

Hello p0p4k,

I've been testing single-speaker training using your pflow code. Unfortunately, because the data quality was poor and I was training multi-speaker data with n_spks set to 1, the model did not generate mel-spectrograms well.

I plan to experiment with a multi-speaker dataset that includes spk_id labels, hoping to train a robust Korean model. The preprocessing has begun, and the training is underway.

Additionally, I noticed you forked the naturalspeech2 code. I'm curious if you have any development plans related to it. I'm quite interested in training with naturalspeech2 myself.

Thank you.


p0p4k commented Mar 1, 2024

  • I want to add text|audio pair training to the ns2 repo and see if it works.
  • In my head, for a model like pflow, even if the data is poor we can augment the pitch and generate effectively unlimited data for the same text (the spoken content stays the same, but the pitch change alters the style). Since the model doesn't take a spk_id and instead takes a prompt, we should be able to train it on even a single speaker and still generate any voice based on the prompt style (a rough sketch is below). What do you think about this?
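
A rough sketch of that augmentation (file names and step sizes are arbitrary; assumes librosa and soundfile are installed):

```python
import librosa
import soundfile as sf

def augment_pitch(wav_path: str, out_prefix: str, steps=(-2, -1, 1, 2)):
    """Write pitch-shifted copies of an utterance so the same text
    appears under several prompt styles. Purely illustrative."""
    y, sr = librosa.load(wav_path, sr=None)
    for n in steps:
        shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n)
        sf.write(f"{out_prefix}_shift{n:+d}.wav", shifted, sr)
```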


0913ktg commented Mar 4, 2024

I changed the dataset and experimented again.
The model didn't generate mel-spectrograms well on the old data, but on this data it seems to generate them quite well.
I think the previous data consisted of recorded conversations of ordinary people, so the transcripts and speech were inconsistent and the speakers' styles were too idiosyncratic.
The new data is much less noisy, the pronunciation is clearer, and the speakers are consistent and faithful to the script in their recordings.

[attached image: generated mel-spectrogram samples]


p0p4k commented Mar 4, 2024

Good news! Thanks for the update. Train for longer, and please share some samples.
