fix global cu_seqlens setting #537
Conversation
Signed-off-by: Xiaowei Ren <[email protected]>
/te-ci pytorch
Won't this only update _cu_seqlens_{q,kv} during the first batch?
I'm wondering if we should start thinking about better ways to implement this caching approach. For example, it doesn't support the case where different layers have different attention masks. Perhaps it should be a kwarg that's passed into forward.
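A minimal sketch of that kwarg-based idea, assuming a varlen-style attention module and a [batch, seq] padding mask with True meaning "token kept"; the class, helper, and argument names here are hypothetical, not TE's actual API:

```python
import torch
import torch.nn.functional as F


def _cu_seqlens_from_padding_mask(mask: torch.Tensor) -> torch.Tensor:
    # mask: [batch, seq] bool, True = token kept (an assumed convention, not TE's).
    seqlens = mask.sum(dim=1, dtype=torch.int32)                       # tokens per sample
    return F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))  # prepend 0


class VarlenAttention(torch.nn.Module):
    """Sketch: cu_seqlens arrive as forward kwargs instead of cached module state."""

    def forward(self, qkv, padding_mask=None, cu_seqlens_q=None, cu_seqlens_kv=None):
        # Layers with different masks simply receive different cu_seqlens, and
        # nothing is cached on the module, so no stale state survives across batches.
        if cu_seqlens_q is None:
            cu_seqlens_q = _cu_seqlens_from_padding_mask(padding_mask)
        if cu_seqlens_kv is None:
            cu_seqlens_kv = cu_seqlens_q
        # A real implementation would hand cu_seqlens to the varlen attention kernel here.
        return qkv, cu_seqlens_q, cu_seqlens_kv
```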
Closes #535.
@timmoon10 This will update _cu_seqlens_{q,kv}. I wonder, do you want to merge this PR? Or close this PR and fix it later with another PR?
The current behavior of
#535 caches cu_seqlens. This does not address the case when we use interleaved pipeline parallelism with variable sequence lengths and/or the mask is not causal. I think that is best handled outside of TE by passing in the correct value for cu_seqlens.
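For example, the caller could compute cu_seqlens once per batch and hand the same tensor to every layer; this is only an illustrative call pattern under the assumption of a forward kwarg like the one sketched above (`layer` is hypothetical):

```python
import torch
import torch.nn.functional as F

# Per-sample sequence lengths known to the framework driving the pipeline.
seqlens = torch.tensor([5, 3, 7], dtype=torch.int32)
cu_seqlens = F.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))   # [0, 5, 8, 15]

# The same tensor is passed to every layer / virtual-pipeline model chunk,
# so TE would never need to cache it internally:
# out = layer(hidden_states, cu_seqlens_q=cu_seqlens, cu_seqlens_kv=cu_seqlens)
```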
When virtual_pipeline_parallel_size > 1, the model chunk in each VP stage usually has only 1 transformer layer (i.e., the layer number is always 1). Hence, we cannot rely on layer_number to avoid redundant cu_seqlens computation. Instead, we should check whether the global cu_seqlens is None and only compute cu_seqlens while it is None.
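A minimal sketch of that None-check caching, assuming a module-level cache and a [batch, seq] padding mask with True meaning "token kept"; the variable names mirror the idea of the "global" cu_seqlens in this PR but are not TE's actual code:

```python
import torch
import torch.nn.functional as F

# Illustrative module-level cache shared by all layers and VP model chunks.
_cu_seqlens_q = None
_cu_seqlens_kv = None


def get_cu_seqlens(padding_mask: torch.Tensor):
    """Return cached cu_seqlens, computing them only while the cache is still None."""
    global _cu_seqlens_q, _cu_seqlens_kv
    # Guard on the cache itself rather than on layer_number: with virtual pipeline
    # parallelism each model chunk may hold a single layer whose layer number is 1,
    # so a "layer_number == 1" check would recompute cu_seqlens in every chunk.
    if _cu_seqlens_q is None:
        seqlens = padding_mask.sum(dim=1, dtype=torch.int32)               # tokens per sequence
        cu = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))  # prepend 0
        _cu_seqlens_q = cu
        _cu_seqlens_kv = cu
    return _cu_seqlens_q, _cu_seqlens_kv
```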