This is an adapted FastSpeech 2 PyTorch implementation accompanying the paper Towards Robust FastSpeech 2 by Modelling Residual Multimodality. It is based on the PyTorch implementations of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech by Chien 2020 and Liu 2020. The HiFi-GAN implementation is taken from Kong 2020.
Along with bugfixes and code reorganization, two major changes have been introduced to the standard FastSpeech 2:
- Adapted Variance Predictors (`model/modules.py`)
- Trivariate-Chain Gaussian Mixture Modelling (Loss: `model/loss.py`, Sampling: `interactive_tts.py`)
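For intuition, here is a minimal, hedged sketch of a trivariate Gaussian mixture negative log-likelihood built directly on `torch.distributions`. It is not a copy of `model/loss.py` (consult that file for the exact parameterization and triplet construction); the triplet layout below, pairing each mel bin with its time and frequency neighbour, is an assumption for illustration.

```python
# Hedged sketch only -- not the repository's model/loss.py. It illustrates a
# k-component trivariate Gaussian mixture NLL over mel-bin triplets; the real
# parameterization and triplet construction may differ.
import torch
import torch.nn.functional as F
from torch.distributions import Categorical, MixtureSameFamily, MultivariateNormal


def chain_triplets(mel):
    """mel: (T, F) -> (T-1, F-1, 3): each bin paired with its time and frequency neighbour (assumed layout)."""
    return torch.stack([mel[:-1, :-1], mel[1:, :-1], mel[:-1, 1:]], dim=-1)


def tvc_gmm_nll(logits, means, scale_raw, tril_raw, targets):
    """
    logits:    (..., k)     unnormalized mixture weights
    means:     (..., k, 3)  component means for one triplet
    scale_raw: (..., k, 3)  unconstrained diagonal scales
    tril_raw:  (..., k, 3)  unconstrained strictly-lower-triangular entries
    targets:   (..., 3)     observed mel-bin triplets
    """
    scale = F.softplus(scale_raw) + 1e-4                       # positive standard deviations
    tril = torch.diag_embed(scale)                             # (..., k, 3, 3)
    rows, cols = torch.tril_indices(3, 3, offset=-1, device=scale.device)
    tril[..., rows, cols] = tril_raw                           # fill the strict lower triangle
    components = MultivariateNormal(loc=means, scale_tril=tril)
    mixture = MixtureSameFamily(Categorical(logits=logits), components)
    return -mixture.log_prob(targets).mean()


if __name__ == "__main__":
    k, T, F_ = 5, 40, 80
    mel = torch.randn(T, F_)
    targets = chain_triplets(mel)                              # (T-1, F-1, 3)
    shape = targets.shape[:-1]
    loss = tvc_gmm_nll(
        torch.randn(*shape, k),
        torch.randn(*shape, k, 3),
        torch.randn(*shape, k, 3),
        torch.randn(*shape, k, 3),
        targets,
    )
    print(loss)
```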
Audio samples of this implementation can be found on its project page.
After downloading the checkpoints and putting them into `output/ckpt/<checkpoint name>`, an interactive demo server can be started with `python demo.py --checkpoint libritts_tvcgmm_k5 --device cpu --port 9000` and accessed in the browser at `localhost:9000`.
We recommend installing the dependencies in a Python 3.8 conda environment (or venv) using:
`conda create -n tvcgmm python=3.8`
`conda activate tvcgmm`
`pip install -r requirements.txt`
To run inference in our interactive demo, it suffices to download our pre-trained model checkpoints. The downloaded `<checkpoint name>.zip` file contains a `40000.pth.tar` and a `used_config.yaml`, which should both be placed together in the directory `output/ckpt/<checkpoint name>`. You can then import and use the class in `interactive_tts.py` to generate samples.
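As a rough usage sketch only (the class name `InteractiveTTS`, its constructor arguments, and the `synthesize` method below are placeholders, not the actual API; check `interactive_tts.py` and `demo.py` for the real names and signatures):

```python
# Placeholder sketch -- class/method names and signatures are assumptions,
# not the actual API of interactive_tts.py; see demo.py for the real usage.
import soundfile as sf  # assumed to be available for writing the waveform

from interactive_tts import InteractiveTTS  # hypothetical class name

tts = InteractiveTTS(checkpoint="libritts_tvcgmm_k5", device="cpu")       # hypothetical signature
wav, sample_rate = tts.synthesize("TVC-GMM reduces spectral artifacts.")  # hypothetical method

sf.write("sample.wav", wav, sample_rate)  # listen to the result
```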
The supported datasets are:
- LJSpeech: a single-speaker English dataset consisting of 13k short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
- VCTK: a multi-speaker English dataset consisting of 88k samples from 109 English speakers with various accents, deliberately selected for contextual and phonetic coverage, approximately 44 hours in total.
- LibriTTS: a multi-speaker English dataset containing 585 hours of speech by 2456 speakers. We use the train-clean-360 split with 115k samples from 904 speakers over 192 hours.
We take LJSpeech as an example hereafter.
There are three config files for every dataset in `config/`, containing the default parameters. Please edit or copy them there and then run your experiments. When training a new model, the used configuration is copied next to the checkpoint in `output/ckpt/<experiment name>`. Training with TVC-GMM can be enabled, and the number of mixture components set, in `model.yaml`.
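For orientation, a small sketch of how the three per-dataset YAML files can be loaded and inspected (LJSpeech paths shown; the key names inside the files, including the TVC-GMM switch and mixture count, are not reproduced here and should be looked up in `model.yaml` itself):

```python
# Sketch for inspecting the default configs (LJSpeech shown). The exact key
# names, including the TVC-GMM toggle, live in the YAML files themselves.
import yaml

configs = {}
for name in ("preprocess", "model", "train"):
    with open(f"config/ljspeech/{name}.yaml") as f:
        configs[name] = yaml.safe_load(f)

print(configs["model"])  # look here for the TVC-GMM options
```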
First, run
`python prepare_align.py config/ljspeech/preprocess.yaml`
to process the source datasets into a unified format and prepare them for alignment.
Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Download the official MFA package and run
`./montreal-forced-aligner/bin/mfa_align raw_data/ljspeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/ljspeech`
to align the corpus, and then run the preprocessing script, which will generate the mel-spectrogram, pitch, energy, and duration targets for training:
`python preprocess.py config/ljspeech/preprocess.yaml`
Train your model with
`python train.py --experiment <experiment name> -p config/ljspeech/preprocess.yaml -m config/ljspeech/model.yaml -t config/ljspeech/train.yaml`
The model takes fewer than 10k training steps to generate audio samples of acceptable quality and converges around 40k steps. Total training time in our tests was around 2-3 hours on an NVIDIA GeForce RTX 2080 Ti.
Use
`tensorboard --logdir output/log`
to serve TensorBoard on your localhost. Loss curves, synthesized mel-spectrograms, and audio samples can be inspected there.
If you use TVC-GMM or any of our code in your project, please cite:
@inproceedings{koegel23_interspeech,
  author={Fabian Kögel and Bac Nguyen and Fabien Cardinaux},
  title={{Towards Robust FastSpeech 2 by Modelling Residual Multimodality}},
  year={2023},
  booktitle={Proc. Interspeech 2023}
}