Hello, first thank you for this amazing model!
I have a few small questions about the model. If I understand correctly, the model uses a text encoder to encode the prompt and an embedding layer to encode the lyrics, then prepends the lyrics tokens to the Parler-TTS decoder input to predict the codebook indices (audio tokens).
In this code: https://github.com/huggingface/parler-tts/blob/8e465f1b5fcd223478e07175cb40494d19ffbe17/INFERENCE.md
you feed the attention_mask for the prompt and the lyrics, so I deduce that the model applies attention masks to these two parts. But when training the model, how do you create the attention mask for the audio tokens?
If you use an attention mask, do you treat a column of tokens containing only padding as equivalent to a single pad token, and therefore exclude that position from the attention mechanism? I ask because, in the model architecture, the embeddings are summed across codebooks (column-wise) before being fed to the transformer.
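To make the question concrete, here is a minimal sketch of the mask I would expect for the audio tokens. The `PAD_TOKEN_ID` value and the `(batch, num_codebooks, seq_len)` layout are my assumptions, not taken from the actual code:

```python
import torch

# Hypothetical pad id; the real value depends on the codec configuration.
PAD_TOKEN_ID = 1024

# Assumed shape: (batch, num_codebooks, seq_len) of codebook indices.
audio_tokens = torch.tensor([
    [[5, 9, PAD_TOKEN_ID, PAD_TOKEN_ID],
     [3, 7, PAD_TOKEN_ID, PAD_TOKEN_ID]],
])

# A time step (column) whose tokens are padding in every codebook is
# masked out; any non-pad token in the column keeps it attendable.
attention_mask = (audio_tokens != PAD_TOKEN_ID).any(dim=1).long()
print(attention_mask)  # tensor([[1, 1, 0, 0]])
```

Is this roughly what happens, i.e. the mask is derived per column after the per-codebook embeddings are summed?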
I have some difficulty understanding this point in the architecture of the model. Thanks for reading :)