
Multispeaker and new neural voice creation #88

Open
kafan1986 opened this issue Nov 7, 2022 · 12 comments

Comments

@kafan1986

I used the FastPitch model to generate TTS for a known speaker. Can I extend this model to multiple speakers using speaker embeddings? If yes, can that solution then be fine-tuned to mimic a new voice on limited audio data? Has anyone experimented along this path?

@cschaefer26

Hi, just to let you know I am currently working on a multispeaker implementation that will be live soon. Fine-tuning is possible with about 5 minutes of fresh data.

@kafan1986
Author

@cschaefer26 I can see you are actively developing the multi-speaker implementation in one of the branches. Is it at a stage where I can experiment with it, or should I wait some more?

@cschaefer26

cschaefer26 commented Jan 5, 2023

Hi, yeah, I am currently implementing it in this branch:

https://github.com/as-ideas/ForwardTacotron/tree/feature/multispeaker

It's probably going to be ready in two weeks or so. I am currently testing it on the VCTK dataset and cannot guarantee it is working properly, but it could be worth a try if you like: training is implemented, inference will come soon. Use the multispeaker.yaml config; it supports VCTK and a variant of the LJSpeech format (set via preprocessing.audio_format). For the LJSpeech format it expects rows as: id|speaker_id|text
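A minimal sketch of reading the LJSpeech-variant metadata format described above (each row is `id|speaker_id|text`). This is illustrative only, not the repo's actual preprocessing code; the function name and return shape are assumptions:

```python
def parse_metadata(lines):
    """Parse ljspeech-variant rows of the form id|speaker_id|text.

    Returns a list of (file_id, speaker_id, text) tuples.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first two pipes only, in case the text itself contains '|'.
        file_id, speaker_id, text = line.split('|', 2)
        entries.append((file_id, speaker_id, text))
    return entries

rows = [
    'LJ001-0001|speaker_0|Printing, in the only sense with which we are concerned.',
    'VCTK-p225-001|speaker_1|Please call Stella.',
]
print(parse_metadata(rows)[0][1])  # -> speaker_0
```

Splitting with `split('|', 2)` keeps any pipes inside the transcript text intact, which plain `split('|')` would break.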

@kafan1986
Author

@cschaefer26 Thanks for the update. I will wait another two weeks before experimenting with it, since GPU time is expensive on my end. From what I can tell, you have only made the multi-speaker TTS work with ForwardTacotron, not with FastPitch. Is that so? In my previous experiments FastPitch gave slightly better output quality than ForwardTacotron, so could we get a FastPitch version as well? Thanks again for all your work.

@cschaefer26

Hi, yeah, I'm going to implement both (ForwardTaco first, then FastPitch). In my experience ForwardTaco actually performs better, but it may depend on the dataset...

@debasish-mihup

@cschaefer26 I can see you are still experimenting across multiple branches. Could you add a provision for passing emotion as a parameter, so that apart from providing the speaker embedding during training I could also provide the emotion type of each audio segment? Where this emotion information is not available, it could be assumed to be "neutral".

@kafan1986
Author

@cschaefer26 Is the multispeaker branch ready for testing? Also, can you create a branch with FastPitch?

@cschaefer26

cschaefer26 commented Feb 5, 2023

Hi, multispeaker is merged and ready for testing. I tested it on a custom dataset, but as always with such large merges there may be bugs - please let me know if you find anything fishy. My colleague @alexteua will work on implementing FastPitch from next week.

@debasish-mihup Currently there is no plan to support emotion conditioning in the vanilla version, but it should be easy to add in a branch if you like. Hint: you can simply concatenate it to the speaker embedding.
I would be curious whether you are experimenting with an annotated dataset?

@rmcpantoja

Hi @cschaefer26, congratulations on completing the multispeaker work.
I would like to try this new multispeaker ForwardTacotron to make a pretrained model with more than 15 Spanish speakers. Each speaker has their own dataset, so merging them all into one seems like a good idea. The datasets range from a minimum of 10 minutes to a maximum of one hour and 30 minutes each. How many hours of audio are needed, at a minimum, to make a decent model?

@kafan1986
Author

kafan1986 commented Feb 18, 2023

@cschaefer26 @alexteua Thanks for the multispeaker variant. Is there any progress on the FastPitch version? I could not find a working branch for it. Also, if I want to train the model so it works decently for unseen speakers, what would be the usual number of speakers (of both genders) in the training data, and how many hours per speaker? Any idea based on your experiments?

@alexteua

Hi @kafan1986, the FastPitch version is coming in the next few days.

@alexteua

alexteua commented Mar 23, 2023

@kafan1986 multispeaker fastpitch is ready to use ( #95 )
