Question about the Attention mask for audio tokens (codebooks) #121

Open
Jourdelune opened this issue Aug 27, 2024 · 0 comments

Jourdelune commented Aug 27, 2024

Hello, and first of all, thank you for this amazing model!
I have a few small questions about the model. If I understand correctly, the model uses a text encoder to encode the prompt and an embedding layer to encode the lyrics, and the lyric tokens are prepended to the Parler-TTS decoder input so that it can predict the codebook indices (audio tokens).

In this code: https://github.com/huggingface/parler-tts/blob/8e465f1b5fcd223478e07175cb40494d19ffbe17/INFERENCE.md
you feed the attention_mask for both the prompt and the lyrics, so I deduce that the model applies attention masks to these two parts. But when training the model, how do you create the attention mask for the audio tokens?
If you use an attention mask, do you consider a column of tokens that contains only padding to be equivalent to a single pad token, and therefore exclude it from the attention mechanism? I ask because, in the model architecture, the embeddings are summed column-wise (across codebooks) before the transformer makes its predictions.
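For instance, is the mask derived roughly like this? (Again, just a sketch of my assumption, with a placeholder pad id; not the repository's actual code.)

```python
import torch

pad_token_id = 1024  # placeholder value; the real pad id depends on the audio codec

# codebook_ids: (batch, num_codebooks, audio_len) integer audio tokens
codebook_ids = torch.randint(0, 1025, (2, 9, 100))

# Treat a frame (column) as padding only if *every* codebook at that
# timestep holds the pad token; such frames get mask value 0.
audio_attention_mask = (codebook_ids != pad_token_id).any(dim=1).long()
# shape: (batch, audio_len) -- 1 = attend to this frame, 0 = ignore it
```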

I have some difficulty understanding this part of the model's architecture. Thanks for reading :)
