Zero-shot voice conversion trained according to the scheme described in SEED-TTS.
The VC quality is surprisingly good in terms of both audio quality and timbre similarity. We have decided to continue along this path to see how far it can go.
TODO:
- Release code
- Release v0.1 pretrained model:
- Huggingface space demo:
- HTML demo page (maybe with comparisons to other VC models): Demo
- Code for training on custom data
- Streaming inference
- Potential architecture improvements
- More to be added