Support CUDA Graph for MoE models #1233

Merged
16 commits merged into NVIDIA:main on Nov 25, 2024

Conversation

Contributor

@buptzyb commented Oct 9, 2024

Description

Unlike non-MoE models such as Llama 2, MoE models have dynamic-shaped activations in their FFN layers, so a single CUDA graph can only capture part of a transformer layer rather than the whole layer. We call this the "breaking-layer" CUDA graph mode. This PR adds breaking-layer CUDA graph support for MoE models on the TE side and fixes several related bugs in TE.
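
For readers new to this mode, here is a minimal standalone sketch of the breaking-layer idea using plain torch.cuda.make_graphed_callables rather than TE's wrapper; static_part and moe_ffn are hypothetical stand-ins, not modules from this PR. The static-shaped portion of a layer is captured and replayed as a CUDA graph, while the dynamic-shaped MoE FFN keeps running eagerly.

import torch

hidden = 1024
# Stand-in for the static-shaped portion of a transformer layer (e.g. attention).
static_part = torch.nn.Sequential(
    torch.nn.LayerNorm(hidden), torch.nn.Linear(hidden, hidden)
).cuda()
# Stand-in for the dynamic-shaped MoE FFN; in a real MoE layer the number of
# tokens routed to each expert changes every step, so it cannot be graphed as-is.
moe_ffn = torch.nn.Linear(hidden, hidden).cuda()

sample = torch.randn(8, hidden, device="cuda", requires_grad=True)
graphed_static = torch.cuda.make_graphed_callables(static_part, (sample,))

def layer_forward(x):
    x = graphed_static(x)  # replayed from the captured CUDA graph (fixed shapes)
    return moe_ffn(x)      # eager region: shapes may change across iterations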

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

Please list the changes introduced in this PR:

  • Fix the wrong ordering of per_callable_module_params in _make_graphed_callables when _order is given.
  • Fix a warmup argument mismatch bug in _make_graphed_callables when _order is given.
  • Fix an FP8 accuracy issue by adding an fp8_group argument to make_graphed_callables() (see the sketch after this list) and modifying the is_first_microbatch, skip_fp8_weight_update, and fp8_meta code.
  • Support CUDA graphs for MoE models by filtering the graphed TE modules during warmup.
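
A minimal sketch of how the new fp8_group argument is expected to be used, mirroring the existing fp8_group argument of fp8_autocast(). This is not code from the PR's diff: the process-group setup, layer configuration, and sample inputs are illustrative assumptions, and the other make_graphed_callables arguments shown may differ from the final signature.

import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

dist.init_process_group(backend="nccl")
# Assumed: the group over which FP8 amax/scale statistics should be reduced
# (e.g. the data-parallel group in an Mcore setup).
amax_group = dist.new_group(ranks=list(range(dist.get_world_size())))

layer = te.TransformerLayer(
    hidden_size=1024, ffn_hidden_size=4096, num_attention_heads=16
).cuda()
# One tuple of sample inputs per graphed callable; (seq, batch, hidden) layout.
sample_args = (torch.randn(128, 2, 1024, device="cuda", requires_grad=True),)

graphed_layer = te.make_graphed_callables(
    layer,
    sample_args,
    fp8_enabled=True,
    fp8_recipe=DelayedScaling(),
    fp8_group=amax_group,  # added by this PR: group for FP8 amax reduction
)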

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Collaborator

@timmoon10 left a comment

Technically this seems mostly reasonable, although I have questions and stylistic suggestions. Have you tested that it works with Mcore?

@ptrendx @ksivaman @sbhavani What is our priority for this feature? The custom Mcore logic in make_graphed_callables is already messy and fragile, and this PR does exacerbate those problems.

transformer_engine/pytorch/module/layernorm_linear.py (review comment thread, resolved)
Comment on lines +176 to +184
for m_chunk in range(num_model_chunks):
    for _ in range(num_microbatches):
        for l_no in range(num_layers):
            per_callable_module_params.append(
                tuple(callables[m_chunk * num_layers + l_no].parameters())
                if isinstance(callables[m_chunk * num_layers + l_no], torch.nn.Module)
                else ()
            )
Collaborator

This change seems correct to me, but it's odd if the Mcore integration was working before. @ksivaman Have we run this with Mcore, or did we run with num_microbatches=1?

This changes the interpretation of per_callable_module_params from (num_chunks, layers_per_chunk, num_microbatches) to (num_chunks, num_microbatches, layers_per_chunk). This matches the interpretation of per_callable_* lists when capturing graphs:

per_callable_fwd_idx = (m_chunk * num_microbatches * num_layers) + (
    fwd_idx[m_chunk] * num_layers + l_no
)
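
A small standalone check of that indexing (hypothetical sizes, not code from the diff): enumerating callables in (chunk, microbatch, layer) order reproduces exactly the per_callable_fwd_idx formula quoted above.

num_model_chunks, num_microbatches, num_layers = 2, 4, 3

# Order in which per_callable_module_params is now built.
flat_order = [
    (m_chunk, microbatch, l_no)
    for m_chunk in range(num_model_chunks)
    for microbatch in range(num_microbatches)
    for l_no in range(num_layers)
]

# Flat index used when capturing graphs (fwd_idx plays the microbatch role).
def per_callable_fwd_idx(m_chunk, fwd_idx, l_no):
    return m_chunk * num_microbatches * num_layers + fwd_idx * num_layers + l_no

for idx, (m_chunk, microbatch, l_no) in enumerate(flat_order):
    assert per_callable_fwd_idx(m_chunk, microbatch, l_no) == idx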

transformer_engine/pytorch/graph.py (4 review comment threads, resolved)
@timmoon10
Collaborator

/te-ci pytorch

1 similar comment
@yaox12
Collaborator

yaox12 commented Oct 11, 2024

/te-ci pytorch

@buptzyb
Contributor Author

buptzyb commented Oct 11, 2024

Have you tested that it works with Mcore?

Yes, we also made some changes in Mcore, together with TE changes in this PR, to enable MoE cudagraph. You can refer to issue 193 in our Megatron-LM repo.

@yaox12
Collaborator

yaox12 commented Oct 11, 2024

/te-ci pytorch

@yifeis-nv force-pushed the cudagraph_moe branch 2 times, most recently from bb1c160 to 66748b9 on November 20, 2024
buptzyb and others added 12 commits November 20, 2024 06:50
Signed-off-by: Robin Zhang <[email protected]>
Co-authored-by: Yifei Song <[email protected]>
Signed-off-by: Robin Zhang <[email protected]>
Co-authored-by: Yifei Song <[email protected]>
Signed-off-by: Robin Zhang <[email protected]>
Signed-off-by: Robin Zhang <[email protected]>
Co-authored-by: Yifei Song <[email protected]>
Signed-off-by: Robin Zhang <[email protected]>
Signed-off-by: Xin Yao <[email protected]>
Signed-off-by: Yifei Song <[email protected]>
This reverts commit 73a22e2.

Signed-off-by: Yifei Song <[email protected]>
@yaox12
Collaborator

yaox12 commented Nov 20, 2024

/te-ci pytorch

@yaox12
Collaborator

yaox12 commented Nov 21, 2024

/te-ci pytorch

@buptzyb
Contributor Author

buptzyb commented Nov 22, 2024

Hi @timmoon10, do you have any more suggestions on this PR?

@timmoon10
Collaborator

/te-ci pytorch L1

Collaborator

@timmoon10 left a comment

LGTM

@yaox12
Collaborator

yaox12 commented Nov 25, 2024

Merging this PR since pipeline 20710146 passed and Tim approved.

@yaox12 merged commit ae393e8 into NVIDIA:main on Nov 25, 2024
14 of 15 checks passed