[PyTorch] TransformerLayer: add support for Falcon architecture #513
Conversation
/te-ci
LGTM
Thanks for proceeding so quickly with this PR 🥳
@Marks101 Could you add a test to test_numerics, similar to https://github.com/NVIDIA/TransformerEngine/blob/main/tests/pytorch/test_numerics.py#L622? Otherwise LGTM :-). The other frameworks' tests seem to have failed because of machine issues, so they are not related to this PR.
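For reference, a minimal smoke-test sketch along those lines. The names, shapes, and structure here are hypothetical; the actual test in test_numerics.py compares against a reference implementation and uses the suite's fixtures and tolerances, and the sketch assumes the new keyword argument is spelled `parallel_attention_mlp` as introduced in this PR.

```python
# Sketch only: a forward/backward smoke test for the new option.
# Requires a CUDA device and an installed transformer_engine.
import torch
import transformer_engine.pytorch as te


def test_transformer_layer_parallel_attention_mlp():
    hidden_size, ffn_hidden_size, num_heads = 128, 512, 8
    seq_len, batch_size = 32, 2

    layer = te.TransformerLayer(
        hidden_size,
        ffn_hidden_size,
        num_heads,
        parallel_attention_mlp=True,  # option introduced by this PR
    ).cuda()

    # Default TransformerLayer input layout is (seq, batch, hidden).
    x = torch.randn(seq_len, batch_size, hidden_size, device="cuda", requires_grad=True)
    y = layer(x)
    assert y.shape == x.shape  # the layer preserves the hidden shape

    y.sum().backward()
    assert x.grad is not None
```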
Now uses nn.functional.dropout because depending on the path there are one or two dropouts. Signed-off-by: Markus Schnoes <[email protected]>
/te-ci pytorch
Thanks for fixing my spelling mistakes ... sorry for that. The tests failed because there was one last occurrence of the old spelling.
/te-ci pytorch
Tim's attempt to run the CI apparently failed due to a network issue; I just retried it.
Merged. Thank you @Marks101 for the contribution!
Great, thank you for the support!
Falcon-40B and Falcon-180B are two exciting publicly available models. Currently, transformer-engine does not support their architecture because in their implementation self-attention and the MLP are not computed in sequence. Instead, the blocks (layer norm -> self-attention) and (layer norm -> MLP) are both fed with the layer input, so in the computational graph these operations run in parallel (see the sketch below). In the Falcon configs this is denoted as `new_decoder_architecture`. This PR introduces this feature and thus makes it possible to finetune Falcon models with transformer-engine. We would be really happy if this feature finds its way into transformer-engine.

Two notes on the implementation:
1. The new option is called `parallel_attention_mlp`. I am not sure if this is a perfect name; this is up for discussion.
2. We did not modify `_bias_dropout_add()` in order to keep the new code clear; instead we use `return_bias=False` for attention and MLP and then use bias dropout add. For models using `parallel_attention_mlp` and `bias`, this might not be optimal.
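To make the parallel graph shape concrete, here is a minimal PyTorch sketch of a Falcon-style block with the new decoder architecture. It is illustrative only, not the TE implementation: the module and parameter names are made up, and Falcon details such as multi-query attention, rotary embeddings, and the exact dropout placement are simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelAttentionMLPBlock(nn.Module):
    """Both branches read the block input; their outputs are summed with the residual."""

    def __init__(self, hidden_size, ffn_hidden_size, num_heads, dropout=0.1):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size)  # layer norm -> self-attention branch
        self.ln_mlp = nn.LayerNorm(hidden_size)   # layer norm -> MLP branch
        self.attn = nn.MultiheadAttention(hidden_size, num_heads)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, ffn_hidden_size),
            nn.GELU(),
            nn.Linear(ffn_hidden_size, hidden_size),
        )
        self.dropout = dropout

    def forward(self, x, attn_mask=None):
        # Unlike a sequential GPT-style block, the MLP does not see the
        # attention output: both branches consume the same input x.
        h_attn = self.ln_attn(x)
        attn_out, _ = self.attn(h_attn, h_attn, h_attn,
                                attn_mask=attn_mask, need_weights=False)
        mlp_out = self.mlp(self.ln_mlp(x))
        # Dropout applied functionally on the combined branch output
        # (placement simplified; see the commit note on nn.functional.dropout above).
        return x + F.dropout(attn_out + mlp_out, p=self.dropout, training=self.training)
```

In contrast, a sequential block would compute `x = x + attn(ln1(x))` first and only then `x = x + mlp(ln2(x))`, which is why enabling `parallel_attention_mlp` changes the residual structure rather than just reordering modules.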