Confusion/Question about SpeechT5SpeechDecoderPostnet output #79

Student204161 · 2024-04-28T00:20:47Z

Hi there,

I have a question regarding the output of SpeechT5SpeechDecoderPostnet.

The pretrained SpeechT5Model from Hugging Face ('microsoft/speecht5_tts') returns an output of shape (B, 6274, 80), since the last layer it forwards through is the SpeechT5SpeechDecoderPostnet. I understand that the 80 corresponds to the mel bins, and that the paper, the code, and Hugging Face all describe the result as a mel-spectrogram. What confuses me is the 6274. This is the time dimension, no? But when I run 2 s of 16 kHz audio through the pretrained SpeechT5Processor, I get a mel-spectrogram of shape (B, 126, 80).
I would very much appreciate it if someone could tell me what is going on here.
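
To make the mismatch concrete, here is a rough sketch of the frame-count arithmetic I would expect. It assumes a hop of 256 samples (16 ms at 16 kHz) and centered framing; both values are my assumptions, not something I read from the model config, and the helper names are made up for illustration.

```python
# Rough frame-count arithmetic for a mel-spectrogram.
# ASSUMPTIONS (not taken from the model config): hop of 256 samples,
# centered framing, so n_frames = 1 + n_samples // hop_samples.

def expected_mel_frames(duration_s, sample_rate=16_000, hop_samples=256):
    """Number of spectrogram frames for a clip of the given duration."""
    n_samples = int(duration_s * sample_rate)
    return 1 + n_samples // hop_samples


def implied_duration_s(n_frames, sample_rate=16_000, hop_samples=256):
    """Invert the formula above: clip duration implied by a frame count."""
    return (n_frames - 1) * hop_samples / sample_rate


print(expected_mel_frames(2.0))   # 126, matching the processor output
print(implied_duration_s(6274))   # duration that 6274 frames would imply
```

Under these assumptions, 2 s of audio gives the 126 frames I see from the processor, while 6274 frames would correspond to roughly 100 s of audio, which is why the model output shape confuses me.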

Sincerely, Khalil
