
Why is the first dimension sequence_length as opposed to batch_size in the input tensor #195

Closed
vgoklani opened this issue May 3, 2023 · 5 comments

vgoklani commented May 3, 2023

Why does the library use sequence_length as the first dimension of the input tensor, as opposed to batch_size?

Is this just a convention carried over from RNNs, or is the difference performance-related?

From the example code:

bmm1 = torch.bmm(query.transpose(0, 1), key.transpose(0, 1).transpose(1, 2)) / self.norm_factor
https://github.com/NVIDIA/TransformerEngine/blob/main/docs/examples/quickstart_utils.py#L93

I also see two successive transpose(0, 1) operations in this line; why is that?

Thanks!
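
For reference, here is a minimal sketch of the shape flow through that line, assuming query and key are laid out as [seq_len, batch_size * num_heads, head_dim] (the usual layout for torch.bmm-based attention) and using sqrt(head_dim) in place of self.norm_factor; the exact sizes in quickstart_utils.py may differ.

```python
import torch

seq_len, batch_size, num_heads, head_dim = 10, 2, 4, 16

# Sequence-first layout with batch and heads merged into one axis for torch.bmm.
query = torch.randn(seq_len, batch_size * num_heads, head_dim)
key = torch.randn(seq_len, batch_size * num_heads, head_dim)

q = query.transpose(0, 1)                     # [batch*heads, seq_len, head_dim]
k_t = key.transpose(0, 1).transpose(1, 2)     # [batch*heads, head_dim, seq_len]
scores = torch.bmm(q, k_t) / head_dim ** 0.5  # [batch*heads, seq_len, seq_len]

print(scores.shape)  # torch.Size([8, 10, 10])
```

Each transpose(0, 1) moves the merged batch*heads axis to the front so that torch.bmm can batch over it, while the extra transpose(1, 2) on key produces the K^T needed for the Q @ K^T product.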

@ksivaman ksivaman self-assigned this May 3, 2023

ksivaman commented May 4, 2023

Having either batch size or sequence length as the first dimension of the input tensor can be beneficial for performance, depending on the attention implementation and some other details. We chose sequence length as the first dimension so that we can add support for sequence parallelism. That said, we will soon add a batch_first argument to our APIs that will optionally support input tensors with batch_size as their first dimension.
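
Until a batch_first option lands, one workaround is to transpose at the module boundary. A minimal sketch, assuming transformer_engine.pytorch.TransformerLayer with its default sequence-first input of shape [seq_len, batch_size, hidden_size]; the sizes here are illustrative, so check the TE docs for your version:

```python
import torch
import transformer_engine.pytorch as te

hidden_size, ffn_hidden_size, num_heads = 1024, 4096, 16
layer = te.TransformerLayer(hidden_size, ffn_hidden_size, num_heads).cuda()

batch_size, seq_len = 8, 128
x_bsh = torch.randn(batch_size, seq_len, hidden_size, device="cuda")  # batch-first data

# TE expects [seq_len, batch_size, hidden_size], so transpose on the way in
# and transpose the output back to batch-first.
y_sbh = layer(x_bsh.transpose(0, 1).contiguous())
y_bsh = y_sbh.transpose(0, 1)
```

transpose(0, 1) only returns a view; the .contiguous() call materializes the sequence-first copy that the underlying kernels typically expect.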

vgoklani commented May 4, 2023

That makes sense, thank you!

@vgoklani vgoklani closed this as completed May 8, 2023
@bryangopal

> Having either batch size or sequence length as the first dimension of the input tensor can be beneficial for performance, depending on the attention implementation and some other details. We chose sequence length as the first dimension so that we can add support for sequence parallelism. That said, we will soon add a batch_first argument to our APIs that will optionally support input tensors with batch_size as their first dimension.

Any updates on the batch_first addition?

@timmoon10 (Collaborator)

The attention module has logic to handle multiple formats (e.g. SBHD, BSHD); see the attention module source. However, we haven't exposed this in the Transformer layer yet. Pinging @cyanguwa.
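
For example, if your TE version's DotProductAttention exposes a qkv_format argument (an assumption here; check the documentation for your release), BSHD query/key/value tensors can be passed in directly. A minimal sketch:

```python
import torch
import transformer_engine.pytorch as te

batch_size, seq_len, num_heads, head_dim = 2, 128, 16, 64

# "bshd" = [batch, seq, heads, head_dim]; the default format is "sbhd".
attn = te.DotProductAttention(num_heads, head_dim, qkv_format="bshd").cuda()

q = torch.randn(batch_size, seq_len, num_heads, head_dim,
                device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = attn(q, k, v)  # expected: [batch_size, seq_len, num_heads * head_dim]
```

Until this is plumbed through te.TransformerLayer, batch-first data has to be handled at the attention-module level like this, or by transposing at the layer boundary as sketched earlier in the thread.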

bryangopal commented Dec 27, 2023

@cyanguwa just following up, would appreciate an update!
