
Training 🍵 Matcha‐TTS with different dataset & languages

Shivam Mehta edited this page Dec 6, 2023 · 1 revision

Hello! Thank you for your interest in 🍵 Matcha-TTS.


For training with a different dataset, most parameters stay the same as in ljspeech.yaml, so you can essentially just copy that file. Generally, I prefer resampling all my audio files to a 22050 Hz sampling rate instead of changing the audio parameters, as this avoids having to find a different vocoder. Then you can generate the mean and standard deviation for your dataset (for better standardisation) using the steps I have added in README.md. The major changes you would require are in YOUR_DATASET.yaml:

name: NAME_YOUR_DATASET_ANYTHING_ARBITRARY
train_filelist_path: NEW_FILEPATHS
valid_filelist_path: NEW_FILEPATHS
data_statistics:
  mel_mean: <generate (better) or use lj_speech's value> 
  mel_std: <generate (better) or use lj_speech's value> 
cleaners: [YOUR_CLEANER] # e.g. a chinese_cleaner; you will need to set up text normalisation rules as stated below
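The two filelist paths in the config point at plain-text files listing your training and validation utterances. Here is a minimal sketch of producing such a split. It assumes a single-speaker layout of one `wav_path|transcript` entry per line, modelled on the LJSpeech filelists; please check the filelists shipped with the repository before relying on that format. The function names and the 10% validation fraction are my own illustrative choices, not part of Matcha-TTS.

```python
import random


def split_filelist(entries, valid_fraction=0.1, seed=1234):
    """Shuffle (wav_path, transcript) pairs and split them into train/valid lines.

    Assumption: each output line has the form "wav_path|transcript".
    """
    lines = [f"{path}|{text.strip()}" for path, text in entries]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(lines)
    n_valid = max(1, int(len(lines) * valid_fraction))  # keep at least one validation line
    return lines[n_valid:], lines[:n_valid]


def write_filelist(lines, out_path):
    """Write one entry per line, UTF-8 encoded."""
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

You would then call `write_filelist` once for each returned list and put the two output paths into `train_filelist_path` and `valid_filelist_path`.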

You can take a look at vctk.yaml and do something similar: use the defaults from ljspeech.yaml and override what you need for your specific dataset.
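To make the `mel_mean` and `mel_std` fields less mysterious, here is a rough sketch of what such dataset-wide statistics amount to. It assumes they are a single scalar mean and standard deviation taken over every value of every mel-spectrogram, which is what the scalar config fields suggest. The steps in README.md remain the recommended way to get the real numbers; `global_mean_std` is a hypothetical helper written only for illustration.

```python
import math


def global_mean_std(mels):
    """Mean and standard deviation over every value of every mel-spectrogram.

    `mels` is an iterable of 2-D nested sequences (mel bins x frames).
    Sums are accumulated so that no spectrogram has to stay in memory.
    """
    count = 0
    total = 0.0
    total_sq = 0.0
    for mel in mels:
        for row in mel:
            for value in row:
                count += 1
                total += value
                total_sq += value * value
    if count == 0:
        raise ValueError("no mel values given")
    mean = total / count
    # Population variance via E[x^2] - E[x]^2, clamped against rounding error.
    variance = max(total_sq / count - mean * mean, 0.0)
    return mean, math.sqrt(variance)
```

Utterances of different lengths are fine here, because the statistics are computed per value rather than per utterance.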

For phonemisation (again, I have no experience training with the majority of other datasets), you can change the phonemizer language here. I think for Mandarin it is zh with the espeak backend and cmn with the espeak-ng backend:

https://github.com/shivammehta25/Matcha-TTS/blob/c8d0d60f87147fe340f4627b84588e812e5fbb00/matcha/text/cleaners.py#L28
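As a hedged illustration of what a cleaner for another language might look like, the sketch below builds an espeak phonemizer for a chosen language code and wraps it in a cleaner function. It assumes the `phonemizer` package with an espeak-ng installation, and the `cmn` code is the unverified guess mentioned above. `build_espeak_phonemizer` and `make_cleaner` are hypothetical names; how a new cleaner is wired into Matcha-TTS is not shown here, so follow the linked cleaners.py for that.

```python
import re

_whitespace_re = re.compile(r"\s+")


def build_espeak_phonemizer(language="cmn"):
    """Create an espeak backend for `language`.

    Assumption: the third-party `phonemizer` package and espeak-ng are installed.
    The import is lazy so the rest of this sketch works without them.
    """
    from phonemizer.backend import EspeakBackend

    return EspeakBackend(language=language, preserve_punctuation=True, with_stress=True)


def make_cleaner(phonemize_batch):
    """Return a cleaner that collapses whitespace and then phonemises the text.

    `phonemize_batch` takes a list of strings and returns a list of phoneme strings.
    """

    def cleaner(text):
        text = _whitespace_re.sub(" ", text).strip()
        return phonemize_batch([text])[0]

    return cleaner


# Example wiring (needs espeak-ng, so it is left commented out):
# backend = build_espeak_phonemizer("cmn")
# my_cleaner = make_cleaner(lambda texts: backend.phonemize(texts, strip=True, njobs=1))
```

Passing the phonemiser in as a function keeps the text normalisation rules testable on their own, and lets you add language-specific steps (number expansion, punctuation mapping) before the phonemisation call.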


Is a relatively small dataset (like a 20-min dataset) okay?

I haven't tested it, but monotonic alignment is very useful for jointly learning to speak and align, even with a small dataset (especially if it is a studio-recorded dataset of read speech). I feel it depends largely on the dataset quality and the possibility of aligning it monotonically. However, fine-tuning should mostly work better than training from scratch in such scenarios, so you can first train on a larger dataset and then fine-tune on your specific one. We did something similar for OverFlow and it worked very well.