[PyTorch] Refactor caching of cumulative sequence lengths #630
Conversation
Signed-off-by: Tim Moon <[email protected]>
/te-ci pytorch
if cu_seqlens_q is None or cu_seqlens_kv is None:
    assert (attention_mask is not None
            ), "Please provide attention_mask for padding!"
    cu_seqlens_q, indices_q = get_cu_seqlens_and_indices(
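For context, a minimal sketch of what a helper like this computes from a padding mask. This is an assumed illustration, not Transformer Engine's actual get_cu_seqlens_and_indices implementation; the function name cu_seqlens_from_mask is made up for the example.

```python
import torch

def cu_seqlens_from_mask(mask: torch.Tensor):
    """mask: [batch, seqlen] bool tensor, True marks real (non-padded) tokens."""
    seqlens = mask.sum(dim=1, dtype=torch.int32)                 # per-sequence lengths
    cu_seqlens = torch.cat(
        [seqlens.new_zeros(1), torch.cumsum(seqlens, dim=0).to(torch.int32)]
    )                                                            # prefix sum with a leading zero
    indices = mask.flatten().nonzero(as_tuple=False).squeeze(1)  # flat positions of kept tokens
    return cu_seqlens, indices

# Example: batch of 2, max length 4, second sequence padded to length 2
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 0, 0]], dtype=torch.bool)
cu_seqlens, indices = cu_seqlens_from_mask(mask)
print(cu_seqlens)  # tensor([0, 4, 6], dtype=torch.int32)
print(indices)     # tensor([0, 1, 2, 3, 4, 5])
```

Because this depends on the contents of the mask, the result can change from batch to batch, which is why caching it is only safe when the sequence lengths are known to be fixed.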
Ok, so here we are basically not doing caching anymore, right? This makes sense I guess since it depends on the contents of the mask.
Yea, I think we'll need a new API to safely support this optimization.
Adding that MLPerf LLM training is currently using the …
LGTM.
/te-ci pytorch
I just saw @ptrendx's comment on #635 regarding the issue of pipeline parallelism and how …
@ksivaman The problem is that NeMo is setting the layer number, but it counts layers "globally" across the full model, which is then cut up by pipeline parallelism. This behavior makes sense, so we should not assume that layer_number==1 will ever be true on a given GPU.
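A small illustration of that numbering (the layer and stage counts below are made up for the example): with global layer numbers, only the first pipeline stage ever sees layer_number == 1.

```python
# 24 layers split evenly across a 4-way pipeline, numbered globally from 1.
num_layers, pp_size = 24, 4
layers_per_stage = num_layers // pp_size
for rank in range(pp_size):
    layer_numbers = [rank * layers_per_stage + i + 1 for i in range(layers_per_stage)]
    print(rank, layer_numbers)
# rank 0 -> [1, 2, 3, 4, 5, 6]       (the caching trigger fires here)
# rank 1 -> [7, 8, 9, 10, 11, 12]    (layer_number == 1 never occurs)
# rank 2 -> [13, 14, 15, 16, 17, 18]
# rank 3 -> [19, 20, 21, 22, 23, 24]
```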
To add another observation on why layer_number-based caching does not work: there are model implementations where attention is called multiple times within a transformer block, with different sequence-length and batch-size shapes. Relying on layer-number caching breaks this, because the cache is set by the last attention call in layer_number=1. In the next layer, the first attention call then ends up with incorrect cu_seqlens_q/kv.
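A minimal sketch of that failure mode, assuming a module-level cache gated on layer_number == 1; the helper below is illustrative, not Transformer Engine code.

```python
import torch

_cu_seqlens_cache = None

def attention_with_layer_number_cache(mask: torch.Tensor, layer_number: int) -> torch.Tensor:
    """Return cu_seqlens, recomputing them only when layer_number == 1."""
    global _cu_seqlens_cache
    if layer_number == 1 or _cu_seqlens_cache is None:
        seqlens = mask.sum(dim=1, dtype=torch.int32)
        _cu_seqlens_cache = torch.cat(
            [seqlens.new_zeros(1), torch.cumsum(seqlens, dim=0).to(torch.int32)]
        )
    return _cu_seqlens_cache

mask_a = torch.ones(2, 8, dtype=torch.bool)  # first attention: batch 2, seqlen 8
mask_b = torch.ones(4, 4, dtype=torch.bool)  # second attention: batch 4, seqlen 4

attention_with_layer_number_cache(mask_a, layer_number=1)  # caches values for mask_a
attention_with_layer_number_cache(mask_b, layer_number=1)  # overwrites cache with mask_b's values
stale = attention_with_layer_number_cache(mask_a, layer_number=2)
print(stale)  # tensor([0, 4, 8, 12, 16]) -- wrong for mask_a, which needs [0, 8, 16]
```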
To avoid redundantly calculating cumulative sequence lengths for attention, we only compute when layer_number == 1 and otherwise use a cached value. However, pipeline parallelism breaks this optimization since most ranks will never have a layer with layer_number == 1. This PR removes the caching logic's dependency on layer_number, which unfortunately reintroduces redundant calculation except in cases where the sequence lengths are fixed.

This is a quick bugfix, but discussion is welcomed on how best to avoid the redundant calculation. Maybe we could split layer_number into two things: the local layer number and the scaling factor used at:

TransformerEngine/transformer_engine/pytorch/attention.py, line 1318 (commit 6c1a8bb)
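A rough sketch of what such a split might look like; the constructor arguments below (local_layer_number, global_layer_number) are hypothetical and not existing Transformer Engine parameters.

```python
class AttentionLayerSketch:
    def __init__(self, local_layer_number: int, global_layer_number: int):
        # The local index (1-based within this pipeline stage) drives the
        # caching decision, so every stage has a layer that refreshes the cache.
        self.recompute_cu_seqlens = local_layer_number == 1
        # The global layer number keeps acting as the scale factor, so the
        # attention numerics match the current behavior.
        self.layer_scale = float(global_layer_number)

# Example: second stage of a 4-way pipeline over 24 layers. Its first local
# layer is global layer 7, so caching is refreshed while scaling still uses 7.
layer = AttentionLayerSketch(local_layer_number=1, global_layer_number=7)
print(layer.recompute_cu_seqlens, layer.layer_scale)  # True 7.0
```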