fix global cu_seqlens setting #537
Conversation
Signed-off-by: Xiaowei Ren <[email protected]>
/te-ci pytorch
Won't this only update _cu_seqlens_{q,kv} during the first batch?
I'm wondering if we should start thinking about better ways to implement this caching approach. For example, it doesn't support the case where different layers have different attention masks. Perhaps it should be a kwarg that's passed into forward.
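A minimal sketch of that kwarg-based idea, assuming a varlen-style attention module and a [batch, seq] padding mask with True meaning "token kept"; the class, helper, and argument names here are hypothetical, not TE's actual API:

```python
import torch
import torch.nn.functional as F


def _cu_seqlens_from_padding_mask(mask: torch.Tensor) -> torch.Tensor:
    # mask: [batch, seq] bool, True = token kept (an assumed convention, not TE's).
    seqlens = mask.sum(dim=1, dtype=torch.int32)                       # tokens per sample
    return F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))  # prepend 0


class VarlenAttention(torch.nn.Module):
    """Sketch: cu_seqlens arrive as forward kwargs instead of cached module state."""

    def forward(self, qkv, padding_mask=None, cu_seqlens_q=None, cu_seqlens_kv=None):
        # Layers with different masks simply receive different cu_seqlens, and
        # nothing is cached on the module, so no stale state survives across batches.
        if cu_seqlens_q is None:
            cu_seqlens_q = _cu_seqlens_from_padding_mask(padding_mask)
        if cu_seqlens_kv is None:
            cu_seqlens_kv = cu_seqlens_q
        # A real implementation would hand cu_seqlens to the varlen attention kernel here.
        return qkv, cu_seqlens_q, cu_seqlens_kv
```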
Closes #535.
@timmoon10 This will update _cu_seqlens_{q,kv}. I wonder, do you want to merge this PR? Or close this PR and fix it later with another PR?
The current behavior of
#535 caches cu_seqlens. This does not address the case when we use interleaved pipeline parallelism with variable sequence lengths and/or the mask is not causal. I think that is best handled outside of TE by passing in the correct value for cu_seqlens.
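For example, the caller could compute cu_seqlens once per batch and hand the same tensor to every layer; this is only an illustrative call pattern under the assumption of a forward kwarg like the one sketched above (`layer` is hypothetical):

```python
import torch
import torch.nn.functional as F

# Per-sample sequence lengths known to the framework driving the pipeline.
seqlens = torch.tensor([5, 3, 7], dtype=torch.int32)
cu_seqlens = F.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))   # [0, 5, 8, 15]

# The same tensor is passed to every layer / virtual-pipeline model chunk,
# so TE would never need to cache it internally:
# out = layer(hidden_states, cu_seqlens_q=cu_seqlens, cu_seqlens_kv=cu_seqlens)
```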
When virtual_pipeline_parallel_size > 1, the model chunk in each VP stage usually has only 1 transformer layer (i.e., the layer number is always 1). Hence, we cannot rely on layer_number to avoid redundant cu_seqlens computation. Instead, we should check whether the global cu_seqlens is None and only compute cu_seqlens while it is None.
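A minimal sketch of that None-check caching, assuming a module-level cache and a [batch, seq] padding mask with True meaning "token kept"; the variable names mirror the idea of the "global" cu_seqlens in this PR but are not TE's actual code:

```python
import torch
import torch.nn.functional as F

# Illustrative module-level cache shared by all layers and VP model chunks.
_cu_seqlens_q = None
_cu_seqlens_kv = None


def get_cu_seqlens(padding_mask: torch.Tensor):
    """Return cached cu_seqlens, computing them only while the cache is still None."""
    global _cu_seqlens_q, _cu_seqlens_kv
    # Guard on the cache itself rather than on layer_number: with virtual pipeline
    # parallelism each model chunk may hold a single layer whose layer number is 1,
    # so a "layer_number == 1" check would recompute cu_seqlens in every chunk.
    if _cu_seqlens_q is None:
        seqlens = padding_mask.sum(dim=1, dtype=torch.int32)               # tokens per sequence
        cu = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))  # prepend 0
        _cu_seqlens_q = cu
        _cu_seqlens_kv = cu
    return _cu_seqlens_q, _cu_seqlens_kv
```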