Training 🍵 Matcha‐TTS with different dataset & languages
Hello! Thank you for your interest in 🍵 Matcha-TTS.
For training with a different dataset, most parameters will be the same as in ljspeech.yaml, so you can essentially just copy that file. Generally, I prefer resampling all my audio files to a 22050 Hz sampling rate instead of changing the audio parameters, as this solves the problem of finding a different vocoder.
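In practice you would resample with a proper tool such as sox, ffmpeg (`ffmpeg -i in.wav -ar 22050 out.wav`), or `librosa.resample`. Purely to illustrate what the resampling step does, here is a minimal linear-interpolation sketch in numpy — a real pipeline should use a band-limited resampler to avoid aliasing:

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Toy resampler: linear interpolation onto the new sample grid.

    Illustrative only -- real pipelines should use a band-limited
    resampler (ffmpeg, sox, librosa) to avoid aliasing artifacts.
    """
    # number of output samples after changing the sampling rate
    n_out = int(round(len(audio) * target_sr / orig_sr))
    # positions of the new samples on the original time axis
    x_old = np.arange(len(audio))
    x_new = np.linspace(0, len(audio) - 1, num=n_out)
    return np.interp(x_new, x_old, audio)

# e.g. a 1-second clip at 44.1 kHz becomes 22050 samples at 22.05 kHz
```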
Then you can generate the mean and standard deviation for your dataset (for better standardisation) using the steps I have added in README.md.
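Conceptually, what those README steps compute is the global mean and standard deviation over every mel-spectrogram frame in the dataset. A minimal numpy sketch of that idea, where `mel_files` is a hypothetical list of precomputed `.npy` mel spectrograms (the actual repo script works from the dataset config instead):

```python
import numpy as np

def mel_statistics(mel_files):
    """Accumulate the global mean/std over all mel values in a dataset.

    Uses running sums so arbitrarily large datasets fit in memory.
    `mel_files` is a hypothetical list of paths to .npy mel arrays.
    """
    total, total_sq, count = 0.0, 0.0, 0
    for path in mel_files:
        mel = np.load(path)            # assumed shape: (n_mels, n_frames)
        total += mel.sum()
        total_sq += (mel ** 2).sum()
        count += mel.size
    mean = total / count
    # Var[X] = E[X^2] - E[X]^2
    std = np.sqrt(total_sq / count - mean ** 2)
    return mean, std
```

The resulting values are what go into `mel_mean` and `mel_std` in the config below.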
The major changes you would require:
In `YOUR_DATASET.yaml`:

```yaml
name: NAME_YOUR_DATASET_ANYTHING_ARBITRARY
train_filelist_path: NEW_FILEPATHS
valid_filelist_path: NEW_FILEPATHS
data_statistics:
  mel_mean: <generate (better) or use lj_speech's value>
  mel_std: <generate (better) or use lj_speech's value>
cleaners: [?chinese_cleaner?]  # you will need to set up text normalisation rules as stated below
```
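The filelist paths point at plain-text filelists in the LJ Speech convention the repo's existing filelists use: one utterance per line, `wav_path|transcript` (for multi-speaker data, `wav_path|speaker_id|transcript`). A sketch with hypothetical paths and Mandarin transcripts:

```
data/your_dataset/wavs/utt_0001.wav|这是第一句话的文本。
data/your_dataset/wavs/utt_0002.wav|这是第二句话的文本。
```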
You can take a look at vctk.yaml and do something similar: use the defaults from ljspeech.yaml and override what you need for your specific dataset.
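As a sketch of that override pattern (file name and paths here are hypothetical — check vctk.yaml for the exact keys the repo uses), Hydra lets the new config inherit everything from ljspeech.yaml and then override only what changes:

```yaml
# configs/data/your_dataset.yaml (hypothetical path)
defaults:
  - ljspeech   # inherit the audio/feature parameters from ljspeech.yaml
  - _self_     # then let the overrides below take precedence

name: your_dataset
train_filelist_path: data/filelists/your_dataset_train.txt
valid_filelist_path: data/filelists/your_dataset_valid.txt
```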
For phonemisation (again, I have no experience training with the majority of other datasets, but you can change the phonemizer language here): I think for Mandarin it is `zh` with the `espeak` backend and `cmn` with the `espeak-ng` backend.
Is a relatively small dataset (like a 20-min dataset) okay?
I haven't tested it, but the monotonic alignment is very useful for jointly learning to align and to speak, even with a small dataset (especially if it is a studio-recorded dataset of read speech). I feel it depends largely on the dataset quality and on whether it can be aligned monotonically. However, fine-tuning should mostly work better than training from scratch in such scenarios, so you can first train on a larger dataset and then fine-tune on your specific one — similar to what we did for OverFlow, where it worked very well.
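Since the training setup follows the lightning-hydra-template, one way that warm-start workflow could look from the command line (experiment names and the checkpoint path are hypothetical; `ckpt_path` is the template's resume hook):

```shell
# 1. Train on the large dataset first
python matcha/train.py experiment=big_dataset

# 2. Fine-tune on the small dataset, initialising from the
#    large-dataset checkpoint
python matcha/train.py experiment=your_dataset \
    ckpt_path=logs/train/runs/<big_dataset_run>/checkpoints/last.ckpt
```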