Enabling FP8 all-gather for TE Float8Tensor when using Torch FSDP2 #1358

youngeunkwon0405 · 2024-12-05T01:04:49Z

Description

This PR enables FP8 all-gather for TE Float8Tensor parameters when using Torch FSDP2 (a.k.a. per-parameter-sharding FSDP).
The feature is enabled automatically when a user creates a module under transformer_engine.pytorch.fp8_model_init.
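
As a rough illustration of why gathering in FP8 is attractive, the sketch below estimates how many bytes each rank receives when a sharded weight is all-gathered. It is not code from this PR: the helper name and the element sizes table are hypothetical, the 2-byte BF16 baseline is an assumption (the PR does not state what the weights were gathered as before), shards are assumed to be equal in size, and any scaling-factor metadata that accompanies FP8 data is ignored.

```python
# Hypothetical helper for illustration; not part of Transformer Engine or PyTorch.
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2, "fp8": 1}


def all_gather_recv_bytes(num_params: int, world_size: int, dtype: str) -> int:
    """Bytes one rank receives to rebuild a full parameter from equal shards."""
    if num_params % world_size != 0:
        raise ValueError("this sketch assumes evenly divisible shards")
    shard_elems = num_params // world_size
    # A rank already holds its own shard and receives the other world_size - 1.
    return shard_elems * (world_size - 1) * BYTES_PER_ELEMENT[dtype]


if __name__ == "__main__":
    num_params, world_size = 4096 * 4096, 8  # assumed example sizes
    for dtype in ("bf16", "fp8"):
        print(dtype, all_gather_recv_bytes(num_params, world_size, dtype))
```

Under these assumptions the FP8 payload is half the BF16 payload for the same weight, which is the communication saving that an FP8 all-gather targets.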

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactor

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

youngeunkwon0405 · 2024-12-05T06:14:09Z

/te-ci pytorch L0 L1

denera

LGTM!

youngeunkwon0405 and others added 17 commits November 30, 2024 19:51

draft implementation of fsdp2 fp8 all gather

47ad24a

Signed-off-by: Youngeun Kwon <[email protected]>

fix the convergence issue

2f4c102

Signed-off-by: Youngeun Kwon <[email protected]>

Merge branch 'main' into fsdp2

03a98c0

Add warning

76ff010

Signed-off-by: Youngeun Kwon <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

aed545b

for more information, see https://pre-commit.ci

Merge branch 'main' into fsdp2

8d81f56

disable lint error

38e060d

Signed-off-by: Youngeun Kwon <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

f01245e

for more information, see https://pre-commit.ci

fix the lint error

4e7694d

Signed-off-by: Youngeun Kwon <[email protected]>

fix lint error

ff6d1d6

Signed-off-by: Youngeun Kwon <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

16eb7b3

for more information, see https://pre-commit.ci

fix lint error

74f4d17

Signed-off-by: Youngeun Kwon <[email protected]>

Merge branch 'fsdp2' of github.com:youngeunkwon0405/TransformerEngine…

fb7690d

… into fsdp2

[pre-commit.ci] auto fixes from pre-commit.com hooks

aeb851f

for more information, see https://pre-commit.ci

fix lint error

941dbcb

Signed-off-by: Youngeun Kwon <[email protected]>

add comments

daba5a6

Signed-off-by: Youngeun Kwon <[email protected]>

add ref

689e30a

Signed-off-by: Youngeun Kwon <[email protected]>

youngeunkwon0405 self-assigned this Dec 6, 2024

youngeunkwon0405 requested a review from denera December 9, 2024 18:14

youngeunkwon0405 and others added 2 commits December 10, 2024 19:28

add related tests

7ecfe04

Signed-off-by: Youngeun Kwon <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

e4cf960

for more information, see https://pre-commit.ci

denera approved these changes Dec 16, 2024

Merge branch 'main' into fsdp2

5c1f189

youngeunkwon0405 merged commit 0196ed4 into NVIDIA:main Dec 16, 2024
14 checks passed
