
Why is the first dimension sequence_length as opposed to batch_size in the input tensor #195

Closed
vgoklani opened this issue May 3, 2023 · 5 comments

vgoklani commented May 3, 2023

Why does the library use sequence_length as the first dimension of the input tensor, as opposed to batch_size?

Is this just a convention carried over from RNNs, or is the difference performance-related?

From the example code:

bmm1 = torch.bmm(query.transpose(0, 1), key.transpose(0, 1).transpose(1, 2)) / self.norm_factor
https://github.com/NVIDIA/TransformerEngine/blob/main/docs/examples/quickstart_utils.py#L93

I also see two successive transpose(0, 1) operations in this line; why is that?

Thanks!
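
For reference, here is a minimal sketch of the shape flow through that line, assuming query and key are laid out as [seq_len, batch_size * num_heads, head_dim] (the usual layout for torch.bmm-based attention) and using sqrt(head_dim) in place of self.norm_factor; the exact sizes in quickstart_utils.py may differ.

```python
import torch

seq_len, batch_size, num_heads, head_dim = 10, 2, 4, 16

# Sequence-first layout with batch and heads merged into one axis for torch.bmm.
query = torch.randn(seq_len, batch_size * num_heads, head_dim)
key = torch.randn(seq_len, batch_size * num_heads, head_dim)

q = query.transpose(0, 1)                     # [batch*heads, seq_len, head_dim]
k_t = key.transpose(0, 1).transpose(1, 2)     # [batch*heads, head_dim, seq_len]
scores = torch.bmm(q, k_t) / head_dim ** 0.5  # [batch*heads, seq_len, seq_len]

print(scores.shape)  # torch.Size([8, 10, 10])
```

Each transpose(0, 1) moves the merged batch*heads axis to the front so that torch.bmm can batch over it, while the extra transpose(1, 2) on key produces the K^T needed for the Q @ K^T product.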

@ksivaman ksivaman self-assigned this May 3, 2023

ksivaman commented May 4, 2023

Having either batch size or sequence length as the first dimension of the input tensor can be beneficial for performance, depending on the attention implementation and some other details. We chose sequence length as the first dimension so that we can add support for sequence parallelism. That said, we will soon add a batch_first argument to our APIs that will optionally support input tensors with batch_size as their first dimension.
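
Until a batch_first option lands, one workaround is to transpose at the module boundary. A minimal sketch, assuming transformer_engine.pytorch.TransformerLayer with its default sequence-first input of shape [seq_len, batch_size, hidden_size]; the sizes here are illustrative, so check the TE docs for your version:

```python
import torch
import transformer_engine.pytorch as te

hidden_size, ffn_hidden_size, num_heads = 1024, 4096, 16
layer = te.TransformerLayer(hidden_size, ffn_hidden_size, num_heads).cuda()

batch_size, seq_len = 8, 128
x_bsh = torch.randn(batch_size, seq_len, hidden_size, device="cuda")  # batch-first data

# TE expects [seq_len, batch_size, hidden_size], so transpose on the way in
# and transpose the output back to batch-first.
y_sbh = layer(x_bsh.transpose(0, 1).contiguous())
y_bsh = y_sbh.transpose(0, 1)
```

transpose(0, 1) only returns a view; the .contiguous() call materializes the sequence-first copy that the underlying kernels typically expect.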

vgoklani commented May 4, 2023

That makes sense, thank you!

@vgoklani vgoklani closed this as completed May 8, 2023
@bryangopal

> Having either batch size or sequence length as the first dimension of the input tensor can be beneficial for performance, depending on the attention implementation and some other details. We chose sequence length as the first dimension so that we can add support for sequence parallelism. That said, we will soon add a batch_first argument to our APIs that will optionally support input tensors with batch_size as their first dimension.

Any updates on the batch_first addition?

@timmoon10 (Collaborator)

The attention module has logic to handle multiple formats (e.g. SBHD, BSHD); see the attention module source. However, we haven't exposed this in the Transformer layer yet. Pinging @cyanguwa.
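
For example, if your TE version's DotProductAttention exposes a qkv_format argument (an assumption here; check the documentation for your release), BSHD query/key/value tensors can be passed in directly. A minimal sketch:

```python
import torch
import transformer_engine.pytorch as te

batch_size, seq_len, num_heads, head_dim = 2, 128, 16, 64

# "bshd" = [batch, seq, heads, head_dim]; the default format is "sbhd".
attn = te.DotProductAttention(num_heads, head_dim, qkv_format="bshd").cuda()

q = torch.randn(batch_size, seq_len, num_heads, head_dim,
                device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = attn(q, k, v)  # expected: [batch_size, seq_len, num_heads * head_dim]
```

Until this is plumbed through te.TransformerLayer, batch-first data has to be handled at the attention-module level like this, or by transposing at the layer boundary as sketched earlier in the thread.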

bryangopal commented Dec 27, 2023

@cyanguwa just following up, would appreciate an update!
