Hi there,
I have a question regarding the output of SpeechT5SpeechDecoderPostnet.
The pretrained SpeechT5Model from Hugging Face ('microsoft/speecht5_tts') returns an output of shape (B, 6274, 80), since the last layer it forwards through is the SpeechT5SpeechDecoderPostnet. I understand that we get 80 mel bins, and the paper, the code, and Hugging Face all say that the result is a mel-spectrogram. What confuses me is the 6274. This is the time dimension, no? But when I run 2 s of 16 kHz audio through the pretrained SpeechT5Processor, I get a mel-spectrogram of shape (B, 126, 80).
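To make the mismatch concrete, here is a rough sanity check of the frame arithmetic. It is only a sketch: it assumes a centered STFT with a 16 ms hop (256 samples at 16 kHz), and the constant and helper names are hypothetical, chosen for illustration rather than taken from the library.

```python
SAMPLE_RATE = 16_000  # Hz, as stated in the question
HOP_SAMPLES = 256     # assumption: 16 ms hop at 16 kHz


def frames_for_duration(seconds, sample_rate=SAMPLE_RATE, hop=HOP_SAMPLES):
    """Number of mel frames for a clip, assuming centered framing."""
    n_samples = int(round(seconds * sample_rate))
    return 1 + n_samples // hop


def duration_for_frames(n_frames, sample_rate=SAMPLE_RATE, hop=HOP_SAMPLES):
    """Approximate clip length in seconds for a given number of mel frames."""
    return (n_frames - 1) * hop / sample_rate


print(frames_for_duration(2.0))    # frames expected for 2 s of audio
print(duration_for_frames(6274))   # seconds implied by 6274 frames
```

Under these assumptions, 2 s of audio gives 126 frames, which matches the processor output, whereas 6274 frames would correspond to roughly 100 s of audio. So the 6274 does not look like it is tied to the length of a 2 s input.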
I would very much appreciate it if someone could tell me what is going on here.
Sincerely, Khalil