
Fix pipeline parallelism with FusedAttn #635

Merged
1 commit merged on Jan 26, 2024

Conversation

@ptrendx (Member) commented on Jan 26, 2024

This is a temporary workaround, just for release 1.3, for the issue introduced in #497 (the actual fix going into main is #630). The issue is that the code currently assumes it can use the layer_number (specifically, layer_number being equal to 1) to decide when to cache the cu_seqlens needed for fused attention. There are a few problems with this assumption (e.g. it does not check that the sequence length actually matches the one used in the first layer), but with pipeline parallelism it breaks completely inside NeMo, since there the layers are numbered globally and some ranks never assign layer_number == 1 to any of their layers.

In this commit I did not remove the global variables used for caching in order to minimize the changes introduced.
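To make the failure mode concrete, here is a minimal sketch of the caching pattern described above. All names (`_cu_seqlens_cache`, `get_cu_seqlens_*`) and the mask shape are illustrative assumptions, not the actual TransformerEngine source; the second function only conveys the spirit of the workaround (stop keying the computation off layer_number), while the real commit keeps the global cache variables in place.

```python
# Illustrative sketch of the layer_number-keyed caching issue.
# Assumptions (not from the actual code): attention_mask is a [batch, seq]
# bool tensor with True at padded positions; cu_seqlens are the cumulative
# valid-token counts that fused attention kernels expect.
import torch

_cu_seqlens_cache = None  # hypothetical module-level cache, filled only by "layer 1"


def get_cu_seqlens_buggy(attention_mask: torch.Tensor, layer_number: int) -> torch.Tensor:
    """Pre-fix behavior: only the layer with layer_number == 1 computes and
    caches cu_seqlens; every other layer blindly reuses the cache."""
    global _cu_seqlens_cache
    if layer_number == 1:
        seqlens = attention_mask.logical_not().sum(dim=-1).to(torch.int32)
        _cu_seqlens_cache = torch.nn.functional.pad(
            torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0)
        )
    # With pipeline parallelism in NeMo, layers are numbered globally, so a
    # pipeline stage owning, say, layers 5..8 never sees layer_number == 1
    # and _cu_seqlens_cache is still None here.
    return _cu_seqlens_cache


def get_cu_seqlens_workaround(attention_mask: torch.Tensor, layer_number: int) -> torch.Tensor:
    """Spirit of the release-1.3 workaround: do not rely on layer_number == 1,
    so every rank and layer obtains valid cu_seqlens."""
    seqlens = attention_mask.logical_not().sum(dim=-1).to(torch.int32)
    return torch.nn.functional.pad(
        torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0)
    )
```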

@ptrendx requested review from timmoon10 and cyanguwa on January 26, 2024, 17:42
@cyanguwa (Collaborator) left a comment


LGTM

@ptrendx merged commit e7319f5 into NVIDIA:release_v1.3 on Jan 26, 2024
9 checks passed