[PyTorch] TransformerLayer: add support for Falcon architecture #513
Conversation
/te-ci
LGTM
Thanks for proceeding so quickly with this PR 🥳
@Marks101 Could you add a test to test_numerics, similar to https://github.com/NVIDIA/TransformerEngine/blob/main/tests/pytorch/test_numerics.py#L622? Otherwise LGTM :-). The other frameworks' tests seem to have failed because of machine issues, so they are not related to this PR.
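For reference, a minimal smoke-test sketch along those lines. The names, shapes, and structure here are hypothetical; the actual test in test_numerics.py compares against a reference implementation and uses the suite's fixtures and tolerances, and the sketch assumes the new keyword argument is spelled `parallel_attention_mlp` as introduced in this PR.

```python
# Sketch only: a forward/backward smoke test for the new option.
# Requires a CUDA device and an installed transformer_engine.
import torch
import transformer_engine.pytorch as te


def test_transformer_layer_parallel_attention_mlp():
    hidden_size, ffn_hidden_size, num_heads = 128, 512, 8
    seq_len, batch_size = 32, 2

    layer = te.TransformerLayer(
        hidden_size,
        ffn_hidden_size,
        num_heads,
        parallel_attention_mlp=True,  # option introduced by this PR
    ).cuda()

    # Default TransformerLayer input layout is (seq, batch, hidden).
    x = torch.randn(seq_len, batch_size, hidden_size, device="cuda", requires_grad=True)
    y = layer(x)
    assert y.shape == x.shape  # the layer preserves the hidden shape

    y.sum().backward()
    assert x.grad is not None
```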
Now uses nn.functional.dropout because depending on the path there are one or two dropouts. Signed-off-by: Markus Schnoes <[email protected]>
/te-ci pytorch
Thanks for fixing my spelling mistakes ... sorry for that. The tests failed because there was one last occurrence of the old spelling.
/te-ci pytorch
Tim's attempt to run the CI apparently failed due to a network issue; I just retried it.
Merged. Thank you @Marks101 for the contribution!
Great, thank you for the support!
Falcon-40B and Falcon-180B are two exciting publicly available models. Currently, transformer-engine does not support their architecture because in their implementation self-attention and the MLP are not computed in sequence. Instead, the blocks (layer norm -> self-attention) and (layer norm -> MLP) are both fed with the layer input, so in the computational graph these operations run in parallel (see the sketch below). In the Falcon configs this is denoted as `new_decoder_architecture`. This PR introduces this feature and thus makes it possible to finetune Falcon models with transformer-engine. We would be really happy if this feature finds its way into transformer-engine.

Two notes on the implementation:
1. The new option is called `parallel_attention_mlp`. I am not sure if this is a perfect name; this is up for discussion.
2. We did not modify `_bias_dropout_add()` in order to keep the new code clear; instead we use `return_bias=False` for attention and MLP and then use bias dropout add. For models using `parallel_attention_mlp` and `bias`, this might not be optimal.
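To make the parallel graph shape concrete, here is a minimal PyTorch sketch of a Falcon-style block with the new decoder architecture. It is illustrative only, not the TE implementation: the module and parameter names are made up, and Falcon details such as multi-query attention, rotary embeddings, and the exact dropout placement are simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelAttentionMLPBlock(nn.Module):
    """Both branches read the block input; their outputs are summed with the residual."""

    def __init__(self, hidden_size, ffn_hidden_size, num_heads, dropout=0.1):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size)  # layer norm -> self-attention branch
        self.ln_mlp = nn.LayerNorm(hidden_size)   # layer norm -> MLP branch
        self.attn = nn.MultiheadAttention(hidden_size, num_heads)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, ffn_hidden_size),
            nn.GELU(),
            nn.Linear(ffn_hidden_size, hidden_size),
        )
        self.dropout = dropout

    def forward(self, x, attn_mask=None):
        # Unlike a sequential GPT-style block, the MLP does not see the
        # attention output: both branches consume the same input x.
        h_attn = self.ln_attn(x)
        attn_out, _ = self.attn(h_attn, h_attn, h_attn,
                                attn_mask=attn_mask, need_weights=False)
        mlp_out = self.mlp(self.ln_mlp(x))
        # Dropout applied functionally on the combined branch output
        # (placement simplified; see the commit note on nn.functional.dropout above).
        return x + F.dropout(attn_out + mlp_out, p=self.dropout, training=self.training)
```

In contrast, a sequential block would compute `x = x + attn(ln1(x))` first and only then `x = x + mlp(ln2(x))`, which is why enabling `parallel_attention_mlp` changes the residual structure rather than just reordering modules.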