Why is the first dimension sequence_length as opposed to batch_size in the input tensor? #195
Comments
Having either batch size or sequence length as the first dimension of the input tensor can be beneficial for performance, depending on the attention implementation and some other details. We chose to stick with sequence length as the first dimension so that we could add support for sequence parallelism. That being said, we will soon add a …
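A minimal sketch (plain PyTorch, not TransformerEngine internals) of why a sequence-first [seq, batch, hidden] layout is convenient for sequence parallelism, under the assumption that each rank should receive a contiguous slice of the sequence dimension:

```python
import torch

seq_len, batch_size, hidden = 8, 4, 16
world_size = 2  # hypothetical number of sequence-parallel ranks

# Sequence-first layout: splitting dim 0 gives each rank a contiguous chunk.
x_sbh = torch.randn(seq_len, batch_size, hidden)      # [s, b, h]
shards_sbh = torch.chunk(x_sbh, world_size, dim=0)    # [s / world_size, b, h] each
assert all(s.is_contiguous() for s in shards_sbh)     # views, no copy needed

# Batch-first layout: the sequence split lands on dim 1 and is not contiguous,
# so a copy would be required before communicating the shards.
x_bsh = torch.randn(batch_size, seq_len, hidden)      # [b, s, h]
shards_bsh = torch.chunk(x_bsh, world_size, dim=1)
assert not any(s.is_contiguous() for s in shards_bsh)
```

Gathering outputs back along the sequence is likewise a simple `torch.cat(..., dim=0)` in the sequence-first case.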
That makes sense, thank you!
Any updates on the …
The attention module has logic to handle multiple formats (e.g. SBHD, BSHD); see the attention module source.
However, we haven't exposed this in the Transformer layer yet. Pinging @cyanguwa.
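For reference, a plain-PyTorch sketch (not the TE attention module itself) of the two layouts named above, taking SBHD to mean [seq, batch, heads, head_dim] and BSHD to mean [batch, seq, heads, head_dim]; converting between them is a single transpose of the first two dimensions:

```python
import torch

def bshd_to_sbhd(x: torch.Tensor) -> torch.Tensor:
    """[batch, seq, heads, head_dim] -> [seq, batch, heads, head_dim]."""
    return x.transpose(0, 1).contiguous()

def sbhd_to_bshd(x: torch.Tensor) -> torch.Tensor:
    """[seq, batch, heads, head_dim] -> [batch, seq, heads, head_dim]."""
    return x.transpose(0, 1).contiguous()

q_bshd = torch.randn(4, 128, 16, 64)   # batch=4, seq=128, heads=16, head_dim=64
q_sbhd = bshd_to_sbhd(q_bshd)
assert q_sbhd.shape == (128, 4, 16, 64)
```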
@cyanguwa just following up, would appreciate an update! |
Why does the library use sequence_length as the first dimension of the input tensor, as opposed to batch_size? Is this just a convention carried over from RNNs, or is the difference performance related?

From the example code, I also see two successive transpose(0, 1) operations?

Thanks!
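For illustration, a hedged sketch of the pattern behind those two transpose(0, 1) calls, using torch.nn.TransformerEncoderLayer (which also defaults to a sequence-first layout) as a stand-in for the TransformerEngine layer: the data is batch-first [batch, seq, hidden], the layer expects [seq, batch, hidden], so there is one transpose on the way in and one on the way out.

```python
import torch
import torch.nn as nn

# Stand-in for a sequence-first transformer layer (batch_first=False is the default).
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=False)

x_bsh = torch.randn(8, 32, 64)        # [batch, seq, hidden] as a dataloader would yield
y_sbh = layer(x_bsh.transpose(0, 1))  # first transpose: -> [seq, batch, hidden]
y_bsh = y_sbh.transpose(0, 1)         # second transpose: back to [batch, seq, hidden]
assert y_bsh.shape == x_bsh.shape
```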