[JAX] Fix failure on pattern matching of FP8 GEMM when enabling FSDP. #547

Merged: 8 commits merged into NVIDIA:main on Jan 12, 2024

Conversation

@mingxu1067 (Collaborator)

  • Add a custom call, cast.
  • Replace cast_and_transpose with cast for the kernel of layernorm_fp8_dot and for kernel_1 of layernorm_geglu_fp8_mlp, letting XLA handle the transpose and avoiding an unnecessary copy that breaks FP8 GEMM pattern matching.
  • Replace cast_and_transpose with a native XLA cast for the x and kernel of fp8_dot and for kernel_2 of layernorm_geglu_fp8_mlp, again letting XLA handle the transpose so no unnecessary copy breaks FP8 GEMM pattern matching (see the sketch below).
  • Fix a bug when enabling layernorm_geglu_fp8_mlp in flax.LayerNormMLP.
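
For reference, here is a minimal JAX sketch of the dequantize-then-dot shape that the changes above try to preserve for XLA's FP8 GEMM rewriter. The function name and the exact rewrite pattern are illustrative assumptions, not the TE implementation; the idea is simply that the FP8 operands are upcast, rescaled, and fed straight into the dot, with no copy or fused custom call in between.

```python
import jax.numpy as jnp
from jax import lax

# Hypothetical sketch (not the TE kernels): XLA can rewrite an upcast + rescale
# of FP8 operands feeding a dot into a single FP8 GEMM, but only if no stray
# copy sits between the converts and the dot; the unnecessary copy mentioned
# above is what breaks that match.
def fp8_dot_sketch(x_fp8, kernel_fp8, x_scale_inv, kernel_scale_inv):
    # scale_inv values are assumed to be Python float scalars, so the
    # dequantized operands stay in bfloat16 under JAX's weak-type promotion.
    x = x_fp8.astype(jnp.bfloat16) * x_scale_inv
    k = kernel_fp8.astype(jnp.bfloat16) * kernel_scale_inv
    # Contract x's last dim with the kernel's first dim; no explicit transpose
    # or copy is inserted, leaving XLA free to fuse this into one FP8 GEMM.
    return lax.dot_general(x, k, dimension_numbers=(((1,), (0,)), ((), ())))
```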

@mingxu1067 self-assigned this Dec 1, 2023
@mingxu1067 force-pushed the mingh/fix_failure_of_xla_fp8_with_fsdp branch from a63ec56 to 591e21c on December 1, 2023 06:03
@kaixih (Contributor) left a comment

Basically, it seems cast_transpose is replaced with either the custom cast or the native quantize. The cast and the quantize seem functionally the same. Do we have a rule or guideline on when to use which?

Review thread on transformer_engine/jax/layernorm.py (outdated, resolved)
@nouiz (Collaborator) commented Dec 1, 2023

@denera to review.
@mingxu1067 Can you extend all the current FP8 sharding tests to verify that the GEMM runs in FP8?
This will cover the failure you fixed, make sure it doesn't regress, and make sure no equivalent issue appears elsewhere.

@mingxu1067 (Collaborator, Author)

> Basically, it seems cast_transpose is replaced with either the custom cast or the native quantize. The cast and the quantize seem functionally the same. Do we have a rule or guideline on when to use which?

We apply the native quantize and transpose when a tensor is sharded along its columns, for example a tensor of shape (M, N) that is sharded along N. In this case, the native quantize and transpose give XLA more flexibility to schedule the all-gather and the transpose, avoiding the unnecessary copy.

However, a few cases in the backward pass, like x of fp8_dot, fall outside the above rule yet still introduce an unexpected copy. For these cases we apply the native quantize and transpose as well.
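
As a rough illustration of that rule (a hypothetical sketch, not the TE code; the mesh, shapes, and function name are made up, and jnp.float8_e4m3fn is assumed to be available in the installed JAX), quantizing with plain jnp ops and transposing with jnp.transpose keeps both steps as ordinary XLA HLO, so the compiler can schedule them around the all-gather instead of materializing an extra copy:

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical native quantize + transpose for a (M, N) tensor sharded along N.
def native_quantize_transpose(x, scale, dtype=jnp.float8_e4m3fn):
    fmax = float(jnp.finfo(dtype).max)
    q = jnp.clip(x * scale, -fmax, fmax).astype(dtype)  # saturating FP8 cast
    # jnp.transpose lowers to a plain XLA transpose, which the compiler can
    # move across the FSDP all-gather, unlike a fused cast_transpose call.
    return jnp.transpose(q)

# Example: a kernel sharded along its column axis, as described above.
mesh = Mesh(np.array(jax.devices()), ("fsdp",))
kernel = jax.device_put(
    jnp.ones((128, 256), jnp.bfloat16),
    NamedSharding(mesh, P(None, "fsdp")),  # (M, N) sharded along N
)
kernel_t_fp8 = jax.jit(native_quantize_transpose)(kernel, 1.0)
```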

We also aim to replace all of the custom cast_fp8 and cast_transpose calls with the native implementation in the future. We will schedule time to evaluate the performance gap and the effort required.

@mingxu1067 (Collaborator, Author)

> @denera to review. @mingxu1067 Can you extend all the current FP8 sharding tests to verify that the GEMM runs in FP8? This will cover the failure you fixed, make sure it doesn't regress, and make sure no equivalent issue appears elsewhere.

Currently, the unit tests do not include FSDP-related tests, so they cannot catch this kind of failure. Extending the UTs to cover a wider range of cases will take some time. We had an internal discussion and will add this to the TODO list.

@denera (Collaborator) left a comment

LGTM!

@mingxu1067 (Collaborator, Author)

/te-ci jax

@nouiz (Collaborator) commented Dec 6, 2023

The PR description says that this fixes a bug.
We need a test that makes sure we don't regress.

@mingxu1067 (Collaborator, Author)

/te-ci jax

@mingxu1067 force-pushed the mingh/fix_failure_of_xla_fp8_with_fsdp branch from 39ca75d to 60a0eab on December 14, 2023 07:28
@mingxu1067 force-pushed the mingh/fix_failure_of_xla_fp8_with_fsdp branch from 60a0eab to 2ce5724 on December 15, 2023 01:39
@mingxu1067 (Collaborator, Author)

/te-ci jax

@denera merged commit 2ae121d into NVIDIA:main on Jan 12, 2024
16 checks passed