Support other NVLink scenarios #218

fzyzcjy · 2025-06-17T10:28:21Z

Tests with 4, 8, and 24 ranks all pass. The code is messy and I will refactor it later; I will also try to improve performance a bit if possible.

When reviewing, please subtract the code diff of #217.


pkuleo · 2025-06-21T10:16:51Z

Is this code for NVL72?

vinjn · 2025-06-29T02:53:51Z

> Is this code for NVL72?

Very likely yes.

vinjn · 2025-06-29T19:23:47Z

csrc/kernels/configs.cuh

+#define NVLINK_DOMAIN_LARGE
+
+#ifdef NVLINK_DOMAIN_LARGE
+#define NUM_MAX_NVL_PEERS 24


Can we increase it to 72?

fzyzcjy · 2025-06-29

I'm wondering about the use case. IIRC, large-scale EP for prefill across 72 GPUs does not seem to bring any benefit.

vinjn · 2025-06-30

It's for training on NVL72.

fzyzcjy · 2025-06-30

Oh, that sounds pretty reasonable! I think it is implementable, but since many PRs are already pending until LyricZhao has time to review and merge them, I may continue this PR a bit later.

DorianZi · 2025-07-17

Correct me if I'm wrong: NVL72 is 18 nodes of 4 GPUs each, so the number of intra-node NVLink peers is at most 4, while inter-node NVSHMEM can discover cross-node NVLink by itself. Why do we need to extend the intra-node NVLink peer count to 24 or more?

vinjn · 2025-07-17

My understanding:

Cross-node NVLink (MNNVL) is implemented as intra-node.

DeepEP uses the low-level NVSHMEM InfiniBand API for inter-node communication, so it doesn't benefit from NVSHMEM's MNNVL feature.
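For reference, the topology arithmetic behind this exchange can be sketched as follows. This is an illustrative sketch, not DeepEP code: `kNodes`, `kGpusPerNode`, `kNvlDomainSize`, `node_of`, and `nvl_peer_count` are hypothetical names, the 18 x 4 layout is taken from the question above, and node-by-node rank numbering is an assumption.

```cpp
// Hypothetical constants taken from the discussion: NVL72 as 18 nodes of 4 GPUs each.
constexpr int kNodes = 18;
constexpr int kGpusPerNode = 4;
constexpr int kNvlDomainSize = kNodes * kGpusPerNode;  // 72 GPUs in one NVLink domain

// Node that hosts a global rank, assuming ranks are numbered node by node.
constexpr int node_of(int rank) { return rank / kGpusPerNode; }

// Number of ranks a kernel has to address as NVLink peers.
// If cross-node NVLink (MNNVL) is handled by the intra-node path, the peer
// count is the whole NVLink domain rather than the GPUs of a single node.
constexpr int nvl_peer_count(bool mnnvl_as_intranode) {
    return mnnvl_as_intranode ? kNvlDomainSize : kGpusPerNode;
}
```

Under these assumptions, a per-node view yields 4 peers, while treating MNNVL as intra-node yields up to 72, which is why a compile-time peer limit sized for a single node becomes the bottleneck.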

DorianZi · 2025-07-18

@vinjn Thanks for the reply. Without the changes here, how did SGLang run DeepEP with EP48 on an NVLink-only NVL72?

shifangx · 2025-09-28

> @vinjn Thanks for the reply. Without the changes here, how did SGLang run DeepEP with EP48 on an NVLink-only NVL72?

For SGLang decoding, a large EP size such as EP48 gives a performance gain. Decoding uses the low-latency dispatch/combine kernels, which already support NVL72 for any EP size.

For SGLang prefill, the intranode/internode dispatch/combine kernels are used, and these are the kernels under discussion here.

Without PR #218, intranode dispatch/combine cannot support an EP size larger than 4. Internode dispatch/combine supports any EP size, but it uses a two-hop transfer, so it is not the best solution for NVL72.
With PR #218, intranode dispatch/combine can scale up to EP24.
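The routing described in this comment can be summarized in a short sketch. It is illustrative only and not DeepEP's API: `Phase`, `KernelPath`, and `choose_path` are hypothetical names, and the peer limits used in the example (4 without the PR, 24 with it) are taken from the comment above.

```cpp
// Hypothetical model of which dispatch/combine kernels serve a given workload.
enum class Phase { Decode, Prefill };
enum class KernelPath { LowLatency, Intranode, Internode };

// max_nvl_peers is the largest EP size the intranode kernels can address
// (4 on NVL72 without the PR, 24 with it, per the discussion).
constexpr KernelPath choose_path(Phase phase, int ep_size, int max_nvl_peers) {
    // Decoding uses the low-latency kernels, which work for any EP size.
    if (phase == Phase::Decode) return KernelPath::LowLatency;
    // Prefill uses the single-hop intranode kernels when the EP group fits,
    // and otherwise falls back to the two-hop internode kernels.
    return ep_size <= max_nvl_peers ? KernelPath::Intranode : KernelPath::Internode;
}
```

For example, `choose_path(Phase::Prefill, 24, 4)` falls back to the internode path, while `choose_path(Phase::Prefill, 24, 24)` stays on the intranode path.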

fzyzcjy · 2025-09-29T02:29:13Z

Thanks to @shifangx, who has offered to do the code cleanup for this PR! I really have had no time recently to do it...

fzyzcjy added 30 commits June 17, 2025 11:03

more

be2eed8

more

9ecb941

more

9683d94

more

8cf6bd8

more

acf108a

Revert "more"

3e2cede

This reverts commit acf108a.

Revert "more"

45fa1af

This reverts commit 8cf6bd8.

more

443bfa8

more

b986cce

more

3ea6f58

more

5d3513b

more

bda5695

more

3740762

more

ad4aee8

more

b5e4aad

more

240d058

more

5379d59

more

4fc8e79

more

2e90afe

more

3639a57

more

4ef8f05

more

047656e

more

c21f36d

more

7f3e4c0

more

92fb573

more

29f86f3

more

5557e70

more

9fd34e7

more

6417393

more

faaeaad

fzyzcjy added 21 commits June 17, 2025 16:44

more

0525f8f

Merge branch 'feat/cu_mem_api' into feat/deepep_normal_update

3032ede

more

dc652ea

more

151993b

more

06169d5

more

4b54c98

more

dec3315

more

04f6a5b

more

0613b1f

Revert "more"

b0ba0ea

This reverts commit 0613b1f.

more

01f0f90

more

b80e0d4

more

26130b2

moew

e395621

more

5b7e55a

temp

a8c6df8

more

e895366

more

af060e6

more

378f9b2

more

0fc2a30

more

1b14ad6

augustinjujutsu approved these changes Jun 19, 2025


fzyzcjy mentioned this pull request Jun 20, 2025

Optimize get_dispatch_layout from 170us to 45us #232

Closed

vinjn reviewed Jun 29, 2025


LyricZhao force-pushed the main branch from 6a7e456 to 7705f53 Compare July 2, 2025 10:37

sphish force-pushed the main branch from 8ff19f5 to bdd119f Compare July 22, 2025 03:33

wangdong1991 mentioned this pull request Sep 22, 2025

Add HybridEP solution for normal mode intranode dispatch/combine #420

Merged
