Can I train the Chinese model? #111

Tsangchi-Lam · 2023-11-23T07:35:19Z

Tsangchi-Lam
Nov 23, 2023

I want to train the Chinese model. Do you support mixed input in Chinese and English?

Kreevoz · 2023-11-23T18:32:54Z

Kreevoz
Nov 23, 2023

Look at issue #41 to check the current progress.

0 replies

yl4579 · 2023-11-24T03:21:30Z

yl4579
Nov 24, 2023
Maintainer

You can, but with the current English PL-BERT the quality won't be as good as originally proposed. I'm working on a multilingual PL-BERT now, and it may take one or two months to finish.

1 reply

PleinAcces Nov 29, 2023

Hi @yl4579
Would it also be necessary to train new WavLM models for languages other than English?

yl4579 · 2023-11-24T03:22:09Z

yl4579
Nov 24, 2023
Maintainer

See yl4579/StyleTTS#10 for more details.

0 replies

hermanseu · 2023-11-24T06:50:14Z

hermanseu
Nov 24, 2023

@yl4579 I trained StyleTTS2 successfully using Chinese data, and it sounds very good. As wavlm-base-plus only supports English, I used a Chinese HuBERT model as the SLM. Now I want to train one model for both Chinese and English, but I cannot find a pre-trained model that supports Chinese and English at the same time. Do you have any suggestions for the SLM?

3 replies

zhouyong64 Dec 14, 2023

Could you tell me which HuBERT model you used? Also, what is your SLM configuration in StyleTTS's config.yml for this SLM model? For reference, the default is:

slm:
    model: 'microsoft/wavlm-base-plus'
    sr: 16000 # sampling rate of SLM
    hidden: 768 # hidden size of SLM
    nlayers: 13 # number of layers of SLM
    initial_channel: 64 # initial channels of SLM discriminator head

hermanseu Dec 15, 2023

I use the HuBERT model from https://github.com/TencentGameMate/chinese_speech_pretrain as the SLM for Chinese, and the Whisper encoder as the SLM for multiple languages (currently Chinese and English).

Respaired Mar 4, 2024

@hermanseu Can you tell me how you used the Whisper encoder for this purpose? Any code snippets would help, thanks.

yl4579 · 2023-11-24T06:53:04Z

yl4579
Nov 24, 2023
Maintainer

You can try the Whisper encoder, which was trained on multiple languages. You can also try the multilingual wav2vec 2.0: https://huggingface.co/facebook/wav2vec2-large-xlsr-53
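When swapping in a different SLM, the `slm` block of config.yml has to match the new encoder. The sketch below is only an illustration: the helper `slm_config` is hypothetical, and it assumes that `nlayers` counts the hidden states returned with `output_hidden_states=True` (the embedding output plus one per transformer layer), which is consistent with 13 for the 12-layer wavlm-base-plus. The XLSR-53 numbers (hidden size 1024, 24 layers) should be checked against that model's own config before use.

```python
def slm_config(model, hidden_size, num_hidden_layers, sr=16000, initial_channel=64):
    """Build an `slm` config dict (hypothetical helper, not part of StyleTTS2)."""
    return {
        "model": model,
        "sr": sr,                          # sampling rate expected by the SLM
        "hidden": hidden_size,             # hidden size of the SLM
        "nlayers": num_hidden_layers + 1,  # assumption: embedding output + each layer
        "initial_channel": initial_channel,
    }

# Reproduces the default values quoted earlier in this thread.
wavlm = slm_config("microsoft/wavlm-base-plus", hidden_size=768, num_hidden_layers=12)

# Candidate values for the multilingual model suggested above (verify before training).
xlsr = slm_config("facebook/wav2vec2-large-xlsr-53", hidden_size=1024, num_hidden_layers=24)
```

Since the SLM discriminator input grows with both `hidden` and `nlayers`, a larger encoder also means a larger discriminator head.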

0 replies

zhouyong64 · 2023-11-24T09:36:45Z

zhouyong64
Nov 24, 2023

> @yl4579 I trained StyleTTS2 successfully using Chinese data, and it sounds very good.

Did you use the English PL-BERT or did you train PL-BERT with Chinese data?

0 replies

hermanseu · 2023-11-24T09:59:42Z

hermanseu
Nov 24, 2023

I trained PL-BERT with Chinese data.

1 reply

11721206 Dec 14, 2023

Would it be convenient for you to share your Chinese PL-BERT model? Thanks very much.

Moonmore · 2023-11-27T10:35:27Z

Moonmore
Nov 27, 2023

> I trained StyleTTS2 successfully using Chinese data, and it sounds very good. As wavlm-base-plus only supports English, I used a Chinese HuBERT model as the SLM. Now I want to train one model for both Chinese and English, but I cannot find a pre-trained model that supports Chinese and English at the same time. Do you have any suggestions for the SLM?

What is your modeling unit? IPA or Pinyin?

0 replies

hermanseu · 2023-11-28T01:54:19Z

hermanseu
Nov 28, 2023

@Moonmore The modeling unit is pinyin.

test.zip is a synth sample.

0 replies

zhouyong64 · 2023-11-28T03:29:04Z

zhouyong64
Nov 28, 2023

> @Moonmore The modeling unit is pinyin.
>
> test.zip is a synth sample.

Did you include the pinyin tones when training the Chinese PL-BERT? I believe StyleTTS uses F0 for Chinese tones. Can a PL-BERT trained with tones work with StyleTTS?

0 replies

hermanseu · 2023-11-28T03:38:52Z

hermanseu
Nov 28, 2023

I trained Chinese PL-BERT without pinyin tones. But maybe PL-BERT with tones will also work normally, so you can try.

0 replies

zhouyong64 · 2023-11-28T04:24:07Z

zhouyong64
Nov 28, 2023

> I trained Chinese PL-BERT without pinyin tones. But maybe PL-BERT with tones will also work normally, so you can try.

How many samples did you use to train Chinese PL-BERT?

0 replies

hermanseu · 2023-11-29T01:04:50Z

hermanseu
Nov 29, 2023

@zhouyong64 I used about 84,000,000 text sentences to train the Chinese PL-BERT model.

0 replies

Moonmore · 2023-11-29T02:12:28Z

Moonmore
Nov 29, 2023

> @Moonmore The modeling unit is pinyin.
>
> test.zip is a synth sample.

Sounds really good. May I ask whether the pinyin units you mentioned are not broken down into phones? And how do you align the PL-BERT input with the text input?

0 replies

hermanseu · 2023-11-29T03:06:30Z

hermanseu
Nov 29, 2023

@Moonmore
I used the same pinyin phonemes (sheng1 mu3 yun4 mu3, i.e. initials and finals) to train all the models, except that for ASR I used the phonemes without tones. If a pinyin unit cannot be broken down, the whole pinyin can perhaps be regarded as a single phoneme.

@zhouyong64 Sorry for the wrong information yesterday: I trained PL-BERT with tones, and trained ASR without tones.

> I trained Chinese PL-BERT without pinyin tones. But maybe PL-BERT with tones will also work normally, so you can try.
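To make the unit scheme concrete, here is a minimal sketch of splitting toned pinyin into initials and finals, with an option to drop tones for the ASR model. The initial list, the choice to treat `y` and `w` as initials, and the function names are assumptions for illustration; the thread does not show the actual phoneme inventory that was used.

```python
import re

# Two-letter initials come first so "zh"/"ch"/"sh" win over "z"/"c"/"s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]
VOWELS = "aeiouvü"

def split_syllable(syllable):
    """Split one toned syllable, e.g. 'hao3' -> ['h', 'ao3']."""
    for initial in INITIALS:
        rest = syllable[len(initial):]
        # Only split when a vowel-initial final remains after the initial.
        if syllable.startswith(initial) and rest and rest[0] in VOWELS:
            return [initial, rest]
    return [syllable]  # no initial (e.g. 'an1', 'er2'): keep as one unit

def to_units(text, keep_tones=True):
    """Convert space-separated toned pinyin into a flat list of units."""
    units = []
    for syllable in text.split():
        for unit in split_syllable(syllable):
            units.append(unit if keep_tones else re.sub(r"\d", "", unit))
    return units

toned = to_units("ni3 hao3")                                    # PL-BERT / TTS side
toneless = to_units("sheng1 mu3 yun4 mu3", keep_tones=False)    # ASR side
```

With this scheme the toned and toneless sequences always have the same length, so the ASR alignment can be reused for the toned units.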

0 replies

Moonmore · 2023-11-29T03:22:22Z

Moonmore
Nov 29, 2023

> @Moonmore I used the same pinyin phonemes (sheng1 mu3 yun4 mu3, i.e. initials and finals) to train all the models, except that for ASR I used the phonemes without tones. If a pinyin unit cannot be broken down, the whole pinyin can perhaps be regarded as a single phoneme.
>
> @zhouyong64 Sorry for the wrong information yesterday: I trained PL-BERT with tones, and trained ASR without tones.
>
> > I trained Chinese PL-BERT without pinyin tones. But maybe PL-BERT with tones will also work normally, so you can try.

So, do I understand correctly that all text-related models (the text encoder and the BERT model) are trained with the same phoneme units, so that features are obtained for each minimal pronunciation unit? For example, ni3 hao3 -> n i3 h ao3: the input length is 4, and the output length of the model is also 4. Also, how do you construct the PL-BERT labels?

0 replies

hermanseu · 2023-11-29T05:35:29Z

hermanseu
Nov 29, 2023

@Moonmore
Yes, the output lengths of the text encoder and BERT are the same as the input lengths.
As for the PL-BERT labels, you can read the logic in dataloader.py in the PL-BERT repo. It explains this clearly.
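As a rough illustration of the length alignment discussed here, the sketch below gives every phoneme unit the id of the word it came from, so inputs and labels have the same length. It is a simplified stand-in and not the code in the PL-BERT repo's dataloader.py (which, as far as the PL-BERT training objective goes, also masks phonemes); the function name, the toy lexicon, and the word ids are all hypothetical.

```python
def build_aligned_labels(words, word_to_id, phonemize, sep=None, sep_id=0):
    """Return (phonemes, word_labels) with exactly one word label per unit.

    `phonemize` maps a word to its list of phoneme units. If `sep` is given,
    a separator unit labelled `sep_id` is placed between consecutive words.
    """
    phonemes, labels = [], []
    for index, word in enumerate(words):
        units = phonemize(word)
        phonemes.extend(units)
        labels.extend([word_to_id[word]] * len(units))  # repeat the word id per unit
        if sep is not None and index < len(words) - 1:
            phonemes.append(sep)
            labels.append(sep_id)
    return phonemes, labels

# Toy lexicon and vocabulary, made up for this example.
lexicon = {"你": ["n", "i3"], "好": ["h", "ao3"]}
vocab = {"你": 1, "好": 2}

phonemes, labels = build_aligned_labels(["你", "好"], vocab, lexicon.__getitem__)
```

Here `phonemes` has four units and `labels` has four ids, which matches the "input length 4, output length 4" example above.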

0 replies

Moonmore · 2023-11-29T06:38:47Z

Moonmore
Nov 29, 2023

> @Moonmore Yes, the output lengths of the text encoder and BERT are the same as the input lengths. As for the PL-BERT labels, you can read the logic in dataloader.py in the PL-BERT repo. It explains this clearly.

@hermanseu Thank you for your reply.

0 replies

georgedei · 2024-01-21T01:03:21Z

georgedei
Jan 21, 2024

How can the above be applied to StyleTTS2? Is there already a complete repo I could look at that is specialized for Mandarin using this G2PW? As a non-expert I am looking at the puzzle pieces but don't see the entire picture. Perhaps it's too early in the development.

0 replies

yijingshihenxiule · 2024-05-06T03:05:29Z

yijingshihenxiule
May 6, 2024

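@hermanseu Hi, may I ask: when you trained the ASR module, do you mean that the phonemes were split without tones? For example, sheng1 mu3 yun4 mu3 --> sheng mu yun mu?
I tried keeping the tones, e.g. sheng1 mu3 yun4 mu3 --> sheng1 mu3 yun4 mu3, and got a negative CTC loss when training the ASR part. I don't know what the cause is.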

1 reply

RoversCode Jun 11, 2024

A negative CTC loss is normal; the cause is that the prediction outputs too many blank labels.
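For anyone debugging this, a small pure-Python CTC forward pass may serve as a sanity check. It is a textbook sketch, not the loss used in StyleTTS2's ASR training, and the function name is made up. When each frame's probabilities sum to 1, the loss is -log p(target) with p at most 1, so it cannot go below zero; a negative value in a real run may therefore point at inputs that are not normalized log-probabilities. That last point is an assumption about this case and is not confirmed anywhere in this thread.

```python
import math

def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC loss for one sequence.

    probs: list of per-frame probability lists (each frame sums to 1).
    target: list of label ids, none equal to `blank`.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = probs[0][blank]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for frame in probs[1:]:
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * frame[ext[s]]
        alpha = new

    p = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return math.inf if p == 0.0 else -math.log(p)

# Two frames, three classes, uniform probabilities, target = [1].
loss = ctc_neg_log_likelihood([[1 / 3] * 3, [1 / 3] * 3], [1])
```

In this toy case three alignments ("1 1", "1 blank", "blank 1") each have probability 1/9, so the loss equals log 3.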

RoversCode · 2024-06-13T03:19:41Z

RoversCode
Jun 13, 2024

@hermanseu hi, I have a question about using the Whisper encoder as part of the Speech Language Model (SLM). The Whisper encoder requires preprocessing of the audio, which in the forward computation of the WavLMLoss, seems to necessitate detaching the gradients for y_rec. Will this not impact the training process, or have I misunderstood something? I look forward to your response.

7 replies

hermanseu Jun 13, 2024

I took the code from Hugging Face and wrapped it, keeping only the encoder part. All of the front-end processing is done in torch, with no conversion to numpy.
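One way to read "front-end in torch, no numpy" is to compute Whisper-style log-mel features with differentiable torch ops, so that gradients reach `y_rec` without detaching. The sketch below is an assumption about how this could look, not hermanseu's actual code: the function name is hypothetical, the mel filterbank is supplied by the caller (for example taken from an existing Whisper implementation), and the input is assumed to be already resampled to 16 kHz.

```python
import torch
import torch.nn.functional as F

# Whisper front-end constants: 25 ms window, 10 ms hop, 30 s of audio at 16 kHz.
N_FFT, HOP, N_SAMPLES = 400, 160, 480000

def whisper_log_mel(wave, mel_filters):
    """Differentiable Whisper-style log-mel features.

    wave: (batch, samples) float tensor at 16 kHz.
    mel_filters: (n_mels, N_FFT // 2 + 1) tensor, supplied by the caller.
    Returns: (batch, n_mels, 3000).
    """
    # Pad with zeros or trim to exactly 30 s.
    wave = F.pad(wave, (0, max(0, N_SAMPLES - wave.shape[-1])))[..., :N_SAMPLES]
    window = torch.hann_window(N_FFT, device=wave.device, dtype=wave.dtype)
    stft = torch.stft(wave, N_FFT, HOP, window=window, return_complex=True)
    power = stft[..., :-1].abs() ** 2            # drop last frame: (batch, 201, 3000)
    mel = mel_filters.to(power) @ power          # (batch, n_mels, 3000)
    log_spec = torch.clamp(mel, min=1e-10).log10()
    # Per-sample dynamic range compression to 8 decades, then rescale.
    floor = log_spec.amax(dim=(-2, -1), keepdim=True) - 8.0
    log_spec = torch.maximum(log_spec, floor)
    return (log_spec + 4.0) / 4.0
```

The resulting features could then be passed to a Whisper encoder, for example the `encoder` of a Hugging Face `WhisperModel` called with `output_hidden_states=True`, and its hidden states used in place of the WavLM ones.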

RoversCode Jun 13, 2024

Thanks, understood.

weidezhang Jun 25, 2024

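@hermanseu Hello, I'd like to ask: is it enough to just replace the SLM model with HuBERT Chinese large (if I only train on Chinese)? My understanding is that the preprocessing in the data loader has already been switched to log_mel. In wavloss, the hidden_states are taken; what exactly do you mean above by keeping only the encoder part? Does the code need to be changed? I'd also like to ask: with only 1 hour of Chinese single-speaker data, can fine-tuning generally succeed? Does it need to be trained together with English audio?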

RoversCode Jul 4, 2024

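@hermanseu Hello, does the Chinese-English model you trained show the following behavior? On long text the speech rate at inference is abnormally fast, and on short text it is abnormally slow.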

hermanseu Jul 5, 2024

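When I previously trained on audio at a 24 kHz sampling rate, with about 3,000 hours of data, this problem did not occur. Recently I increased the data to 25,000 hours and switched to training at 16 kHz, and a similar problem appeared. For now it is unclear what causes it.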

RoversCode · 2024-07-03T09:53:27Z

RoversCode
Jul 3, 2024

I have successfully implemented StyleTTS for Chinese and English, but I'm encountering an issue with the speech rate: the shorter the sentence, the slower the speech, and the longer the sentence, the faster the speech. Does anyone else have the same problem?

0 replies

Replies: 22 comments · 13 replies
