Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion
Code for the paper "Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion".
Shaojin Ding, Ricardo Gutierrez-Osuna
In INTERSPEECH 2019
This is a PyTorch implementation, based on the VQ-VAE-WaveRNN implementation at https://github.com/mkotha/WaveRNN. The preparation steps are similar to those in that repository; we repeat them here for convenience.
- Python 3.6 or newer
- PyTorch with CUDA enabled
- librosa
- apex if you want to use FP16 (it probably doesn't work that well).
Create your own `config.py` by copying the example:
cp config.py.example config.py
You can skip this section if you don't need a multi-speaker dataset.
- Download and uncompress the VCTK dataset.
- `python preprocess_multispeaker.py /path/to/dataset/VCTK-Corpus/wav48 /path/to/output/directory`
- In `config.py`, set `multi_speaker_data_path` to point to the output directory (a sketch of this setting is shown below).
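As a reference, here is a minimal sketch of what that setting could look like; the path is only a placeholder, not an actual location:

```python
# config.py (excerpt) -- point multi_speaker_data_path at the output
# directory produced by preprocess_multispeaker.py.
# The path below is only a placeholder.
multi_speaker_data_path = '/path/to/output/directory'
```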
To run Group Latent Embedding:
$ python wavernn.py -m vqvae_group --num-group 41 --num-sample 10
The `-m` option tells the script which model to train. By default, it trains a vanilla VQ-VAE model.
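For example, omitting `-m` should train the vanilla VQ-VAE model (this assumes the script's defaults are otherwise sufficient):
$ python wavernn.py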
Trained models are saved under the `model_checkpoints` directory.
By default, the script takes the latest snapshot and continues training from there. To train a new model from scratch, use the `--scratch` option.
Every 50k steps, the model is run to generate test audio outputs. The outputs go under the `model_outputs` directory.
When the `-g` option is given, the script produces output using the saved model, rather than training it.
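For example, generation with the model trained above could be invoked as follows (the exact flag combination is an assumption based on the options described in this README):
$ python wavernn.py -m vqvae_group -g --num-group 41 --num-sample 10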
`--num-group` specifies the number of groups, and `--num-sample` specifies the number of atoms in each group. Note that `num-group` times `num-sample` should equal the total number of atoms in the embedding dictionary (`n_classes` in class `VectorQuantGroup` in `vector_quant.py`).
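To make this constraint concrete, here is a small illustrative sketch (not the actual `VectorQuantGroup` code; the variable names and the embedding dimension are assumptions):

```python
import torch

num_group, num_sample, embedding_dim = 41, 10, 128  # embedding_dim is illustrative
n_classes = num_group * num_sample                   # 41 * 10 = 410 atoms in total

# One flat embedding dictionary with n_classes atoms ...
codebook = torch.randn(n_classes, embedding_dim)

# ... which can be viewed as num_group groups of num_sample atoms each.
grouped_codebook = codebook.view(num_group, num_sample, embedding_dim)
assert grouped_codebook.shape == (num_group, num_sample, embedding_dim)
```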
The code is based on mkotha/WaveRNN.

If you use this code, please cite:
@inproceedings{Ding2019,
author={Shaojin Ding and Ricardo Gutierrez-Osuna},
title={{Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion}},
year=2019,
booktitle={Proc. Interspeech 2019},
pages={724--728},
doi={10.21437/Interspeech.2019-1198},
url={http://dx.doi.org/10.21437/Interspeech.2019-1198}
}