Question about the Attention mask for audio tokens (codebooks) #121

Open
Jourdelune opened this issue Aug 27, 2024 · 0 comments

Jourdelune commented Aug 27, 2024

Hello, and first of all, thank you for this amazing model!
I have a few small questions about the model. If I understand correctly, the model uses a text encoder to encode the prompt and an embedding layer to encode the lyrics, and the lyric tokens are prepended to the Parler-TTS decoder input so that it can predict the codebook indices (audio tokens).

In this code: https://github.com/huggingface/parler-tts/blob/8e465f1b5fcd223478e07175cb40494d19ffbe17/INFERENCE.md
you feed the attention_mask for both the prompt and the lyrics, so I deduce that the model applies attention masks to these two parts. But when training the model, how do you create the attention mask for the audio tokens?
If you use an attention mask, do you consider a column of tokens that contains only padding to be equivalent to a single pad token, and therefore exclude it from the attention mechanism? I ask because, in the model architecture, the embeddings are summed column-wise (across codebooks) before the transformer makes its predictions.
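For instance, is the mask derived roughly like this? (Again, just a sketch of my assumption, with a placeholder pad id; not the repository's actual code.)

```python
import torch

pad_token_id = 1024  # placeholder value; the real pad id depends on the audio codec

# codebook_ids: (batch, num_codebooks, audio_len) integer audio tokens
codebook_ids = torch.randint(0, 1025, (2, 9, 100))

# Treat a frame (column) as padding only if *every* codebook at that
# timestep holds the pad token; such frames get mask value 0.
audio_attention_mask = (codebook_ids != pad_token_id).any(dim=1).long()
# shape: (batch, audio_len) -- 1 = attend to this frame, 0 = ignore it
```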

I have some difficulty understanding this part of the model's architecture. Thanks for reading :)
