
fix global cu_seqlens setting #537

Closed
wants to merge 2 commits from xren/global_cu_seqlens

Conversation

@xrennvidia (Collaborator) commented Nov 24, 2023

When virtual_pipeline_parallel_size > 1, the model chunk in each VP stage usually has only 1 transformer layer (i.e., layer_number is always 1). Hence, we cannot rely on layer_number to avoid redundant cu_seqlens computation. Instead, we should check whether the global cu_seqlens is None and only compute cu_seqlens when it is None.
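For illustration, here is a minimal sketch of the proposed check; the names (`_cu_seqlens_q`, `get_cu_seqlens`) are hypothetical and not the actual TransformerEngine internals:

```python
import torch
import torch.nn.functional as F

_cu_seqlens_q = None  # module-level cache shared by every layer / VP stage


def get_cu_seqlens(seqlens: torch.Tensor) -> torch.Tensor:
    """Return cached cu_seqlens, computing it only when the cache is empty."""
    global _cu_seqlens_q
    if _cu_seqlens_q is None:
        # cu_seqlens is the cumulative sum of per-sample sequence lengths,
        # padded with a leading zero, as varlen attention kernels expect.
        _cu_seqlens_q = F.pad(
            torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0)
        )
    return _cu_seqlens_q
```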

@ksivaman (Member)

/te-ci pytorch

@timmoon10 (Collaborator) left a comment

Won't this only update _cu_seqlens_{q,kv} during the first batch?

I'm wondering if we should start thinking about better ways to implement this caching approach. For example, it doesn't support the case where different layers have different attention masks. Perhaps it should be a kwarg that's passed into forward.

@timmoon10 (Collaborator)

Closes #535.

@xrennvidia (Collaborator, Author)

@timmoon10 This will update _cu_seqlens_{q,kv} at the very beginning of training, i.e., at the first step and first microbatch.
In the pretraining we are doing, _cu_seqlens_{q,kv} is constant, so generating it once at the very beginning makes sense. If you want a more general fix that can handle a non-constant _cu_seqlens_{q,kv}, this PR cannot provide it. But then again, if it's not constant, does it make sense to use a global variable at all?

Do you want to merge this PR, or close it and fix the issue later in another PR?

@timmoon10 (Collaborator)

The current behavior of _cu_seqlens_{q,kv} is intended: the first layer in the model updates the global variables, and later layers in the same microbatch reuse them. It is a bit of a hack, and the "right" answer is to turn them into kwargs of the forward function. Alternatively, could we change Megatron-core so that it passes a layer_number based on the total number of layers in the model, not just the number of layers in the current VP stage?
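As a rough illustration of that alternative, here is a sketch of deriving a globally unique layer_number from an interleaved pipeline layout; all names and the offset formula are assumptions, not actual Megatron-core code:

```python
def global_layer_number(
    local_layer_idx: int,   # 0-based layer index within the current model chunk
    pp_rank: int,           # pipeline-parallel rank
    pp_size: int,           # pipeline-parallel world size
    vp_stage: int,          # 0-based virtual-pipeline (model chunk) index
    layers_per_chunk: int,  # transformer layers per model chunk
) -> int:
    # In an interleaved schedule, chunk `vp_stage` on rank `pp_rank` typically
    # holds layers starting at (vp_stage * pp_size + pp_rank) * layers_per_chunk.
    offset = (vp_stage * pp_size + pp_rank) * layers_per_chunk
    return offset + local_layer_idx + 1  # 1-based, like layer_number
```

With a numbering like this, a check such as `layer_number == 1` would again identify only the true first layer of the model rather than the first layer of every VP chunk.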

@timmoon10 (Collaborator)

#535 caches cu_seqlens_{q,kv} between microbatches in the case where the sequence length is fixed (SBHD/BSHD layouts) and the mask is causal. I think that fixes the bug this PR intended to fix.

It does not address the case where we use interleaved pipeline parallelism with variable sequence lengths and/or a non-causal mask. I think that is best handled outside of TE by passing in the correct value for layer_number, as discussed in #537 (comment).

@timmoon10 timmoon10 closed this Jan 2, 2024
@xrennvidia xrennvidia deleted the xren/global_cu_seqlens branch January 2, 2024 18:31