Simplified upsampling #4

Open
bshall opened this issue Mar 19, 2019 · 3 comments

Comments

@bshall

bshall commented Mar 19, 2019

Hi @geneing, thanks for all your hard work! I was wondering why you decided to abandon the simplified upsampling in your model_simplification branch. Was the audio quality significantly worse?

@geneing
Owner

geneing commented Mar 19, 2019

@bshall Well, the main reason for simplified upsampling was to improve data flow. The upsampling part contains a 5-tap convolution, which requires padding the input mels with at least 2 empty frames on each side. This adds a significant amount of work when doing parallel synthesis (splitting the input mels in time and synthesizing the pieces in parallel, since each piece has to be padded), and one has to be very careful when stitching the padded waveform pieces together.
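To make the padding requirement concrete, here is a minimal sketch in plain Python, not the repository's code. It runs a 5-tap convolution over a single toy mel channel, once over the whole utterance and once in chunks. The kernel values, the chunk size, and the function names are all hypothetical. In this sketch each chunk borrows 2 frames of context from its neighbours (or zero frames at the utterance edges), which is one way to make the stitched result match the unsplit one.

```python
def conv_valid(padded, kernel):
    """Slide `kernel` over `padded`; the output is len(kernel) - 1 samples shorter."""
    taps = len(kernel)
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(padded) - taps + 1)
    ]


def conv_full_utterance(x, kernel):
    """Convolve the whole sequence, zero-padding both ends so the length is kept."""
    half = len(kernel) // 2  # 2 frames per side for a 5-tap kernel
    padded = [0.0] * half + list(x) + [0.0] * half
    return conv_valid(padded, kernel)


def conv_in_chunks(x, kernel, chunk_size):
    """Convolve chunk by chunk, giving every chunk `half` frames of context."""
    half = len(kernel) // 2
    n = len(x)
    out = []
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Context comes from neighbouring frames, or zeros past the utterance edges.
        left = [x[i] if i >= 0 else 0.0 for i in range(start - half, start)]
        right = [x[i] if i < n else 0.0 for i in range(stop, stop + half)]
        out.extend(conv_valid(left + list(x[start:stop]) + right, kernel))
    return out


kernel = [0.1, 0.2, 0.4, 0.2, 0.1]  # hypothetical 5-tap kernel
mel_channel = [float((i * i) % 7) for i in range(23)]  # toy data, one mel channel

full = conv_full_utterance(mel_channel, kernel)
chunked = conv_in_chunks(mel_channel, kernel, chunk_size=8)
```

With the extra context frames, `chunked` has the same length and values as `full`. Without them, the frames next to every chunk boundary would be computed from incomplete input, which is where the stitching care comes in.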

It turned out that the network-based upsampling actually shifts the mels in time a bit, which simple interpolation wasn't doing. As a result, the simplified version produced slightly lower-quality speech.
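For reference, "simple interpolation" could look like the sketch below. This is an assumption: the thread does not say which interpolation the model_simplification branch used, and `upsample_linear` and `hop_length` are hypothetical names. It upsamples one mel channel linearly, with output sample `i * hop_length` landing exactly on frame `i`, so by construction it applies no time shift of the kind a learned upsampler can.

```python
def upsample_linear(frames, hop_length):
    """Linearly interpolate one mel channel from frame rate to sample rate.

    `frames` holds one value per mel frame; the result has
    len(frames) * hop_length values. The last frame is held constant.
    """
    n = len(frames)
    out = []
    for t in range(n * hop_length):
        i, r = divmod(t, hop_length)  # frame index and offset inside the frame
        nxt = min(i + 1, n - 1)       # clamp at the final frame
        a = r / hop_length            # interpolation weight in [0, 1)
        out.append((1.0 - a) * frames[i] + a * frames[nxt])
    return out


upsampled = upsample_linear([0.0, 4.0], hop_length=4)
```

Unlike the 5-tap convolution, this needs only the current and the next frame, so no 2-frame padding on each side is involved.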

Keep in mind that upsampling is a tiny part of the overall run time. Most of the work is done in the RNN and post-net FC layers.

I'm starting to think about implementing streaming synthesis for the C++ library (i.e. don't wait for all the mel frames to be ready, but generate audio as mel frames are added), so I may take another look at upsampling to avoid doing convolutions.

@bshall
Author

bshall commented Mar 20, 2019

Thanks for the response @geneing. Yeah, streaming synthesis would be really cool. I was wondering whether simple "nearest" upsampling would be good enough to replace the upsampling network.
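As a rough illustration of the "nearest" idea, here is a minimal sketch, not code from the repository. `upsample_nearest_stream` and `hop_length` are hypothetical names. Each mel frame is simply repeated `hop_length` times, and because it is written as a generator it can emit conditioning for a frame as soon as that frame arrives, with no padding or look-ahead.

```python
def upsample_nearest_stream(mel_frames, hop_length):
    """Yield each incoming mel frame `hop_length` times (nearest-neighbour upsampling).

    `mel_frames` can be any iterable, including one that produces frames lazily.
    """
    for frame in mel_frames:
        for _ in range(hop_length):
            yield frame


# Two toy frames with two mel bins each, upsampled by a factor of 3.
frames = [[1.0, 2.0], [3.0, 4.0]]
conditioning = list(upsample_nearest_stream(iter(frames), hop_length=3))
```

Whether this is good enough in quality terms is exactly the open question in this thread; the sketch only shows that it removes the convolution and its padding from the data flow.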

@kilolgupta

Hi @geneing
I was wondering if you made any progress with the streaming synthesis. I am trying to do something similar to get a better estimate of the inference time and time to first response, and of the improvements achieved with the very helpful techniques you suggested.
