Confusion/Question about SpeechT5SpeechDecoderPostnet output #79

Open
Student204161 opened this issue Apr 28, 2024 · 0 comments


Hi there,

I have a question regarding the output of SpeechT5SpeechDecoderPostnet.

The pretrained SpeechT5Model from Hugging Face ('microsoft/speecht5_tts') returns an output of shape (B, 6274, 80), since the last layer it passes through is the SpeechT5SpeechDecoderPostnet. I understand that the 80 corresponds to the mel bins, and that the paper, the code, and Hugging Face all describe the result as a mel-spectrogram. What confuses me is the 6274. This is the time dimension, no? But when I run 2 s of 16 kHz audio through the pretrained SpeechT5Processor, I get a mel-spectrogram of shape (B, 126, 80)...
I would very much appreciate it if someone could tell me what is going on here.
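
As a sanity check on the two time dimensions, here is a minimal arithmetic sketch. It assumes the feature extractor uses a 16 ms hop (256 samples at 16 kHz) with centered framing, which is consistent with the 126 frames reported above; the function names `mel_frames` and `frames_to_seconds` are hypothetical helpers, not part of any library. The sketch does not explain where 6274 comes from; it only converts between samples and frames under that assumption.

```python
def mel_frames(num_samples: int, sampling_rate: int = 16_000, hop_ms: int = 16) -> int:
    """Number of spectrogram frames for a centered STFT (assumed framing)."""
    hop = sampling_rate * hop_ms // 1000  # 256 samples under the assumed 16 ms hop
    return 1 + num_samples // hop


def frames_to_seconds(num_frames: int, sampling_rate: int = 16_000, hop_ms: int = 16) -> float:
    """Approximate audio duration covered by num_frames centered frames."""
    hop = sampling_rate * hop_ms // 1000
    return (num_frames - 1) * hop / sampling_rate


# 2 s of 16 kHz audio is 32000 samples.
print(mel_frames(2 * 16_000))    # 126, matching the processor output above
# Duration that 6274 frames would span at the same hop (assumption: same framing).
print(frames_to_seconds(6274))   # roughly 100 seconds
```

If the assumed hop is right, 6274 frames would correspond to roughly 100 s of audio rather than 2 s, so the two shapes cannot describe the same 2 s clip at the same frame rate.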

Sincerely, Khalil
