Fix pipeline parallelism with FusedAttn #635

ptrendx · 2024-01-26T17:42:20Z

This is a temporary workaround, for release 1.3 only, for the issue introduced in #497 (the actual fix going into main is #630). The code currently assumes that layer_number (specifically, layer_number == 1) can be used to decide when to cache the cu_seqlens needed for fused attention. This assumption has a few problems (e.g. nothing checks that the seqlen is still the same as it was in the first layer), but with pipeline parallelism inside NeMo it breaks completely: layers there are numbered globally, so on some ranks no layer has layer_number == 1.

To minimize the changes introduced, this commit does not remove the global variables used for caching.
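For illustration, here is a minimal pure-Python sketch of the failure mode described above. It is not the Transformer Engine code: `_cu_seqlens_cache`, `cached_cu_seqlens`, `layers_on_rank` and `run_rank` are hypothetical names, and plain lists stand in for tensors. It assumes a cache that is refreshed only by the layer numbered 1 and a 1-based global layer numbering across pipeline stages.

```python
from itertools import accumulate

# Hypothetical module-level cache, standing in for the global variables
# used for caching (not the real Transformer Engine globals).
_cu_seqlens_cache = None


def cu_seqlens_from_lengths(seqlens):
    """Cumulative sequence lengths with a leading 0, e.g. [3, 5] -> [0, 3, 8]."""
    return [0] + list(accumulate(seqlens))


def cached_cu_seqlens(layer_number, seqlens):
    """Problematic pattern: only the layer numbered 1 refreshes the cache."""
    global _cu_seqlens_cache
    if layer_number == 1:
        _cu_seqlens_cache = cu_seqlens_from_lengths(seqlens)
    # Every other layer trusts whatever is cached, without checking that
    # the sequence lengths are still the same.
    return _cu_seqlens_cache


def layers_on_rank(rank, layers_per_rank):
    """Assumed global (1-based) layer numbering across pipeline stages."""
    start = rank * layers_per_rank + 1
    return list(range(start, start + layers_per_rank))


def run_rank(rank, layers_per_rank, seqlens):
    """Simulate one pipeline rank; each rank is its own process, so the
    cache starts out empty."""
    global _cu_seqlens_cache
    _cu_seqlens_cache = None
    return [
        cached_cu_seqlens(layer_number, seqlens)
        for layer_number in layers_on_rank(rank, layers_per_rank)
    ]


if __name__ == "__main__":
    # Rank 0 owns layers 1 and 2, so layer 1 fills the cache.
    print(run_rank(0, 2, [3, 5]))  # [[0, 3, 8], [0, 3, 8]]
    # Rank 1 owns layers 3 and 4: no layer has layer_number == 1,
    # so the cache is never populated.
    print(run_rank(1, 2, [3, 5]))  # [None, None]
```

The same sketch also shows the secondary problem: calling `cached_cu_seqlens(1, [3, 5])` followed by `cached_cu_seqlens(2, [4, 4])` returns the stale `[0, 3, 8]` for the second call, because nothing compares the new sequence lengths against the cached ones.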

Signed-off-by: Przemek Tredak <[email protected]>

cyanguwa

LGTM

Fix pipeline parallelism with FusedAttn

ad58ec0

Signed-off-by: Przemek Tredak <[email protected]>

ptrendx requested review from timmoon10 and cyanguwa January 26, 2024 17:42

cyanguwa approved these changes Jan 26, 2024

ptrendx merged commit e7319f5 into NVIDIA:release_v1.3 Jan 26, 2024
9 checks passed

ksivaman mentioned this pull request Feb 5, 2024

[PyTorch] Refactor caching of cumulative sequence lengths #630

Merged
