Skip to content

Conversation

fzyzcjy
Copy link
Contributor

@fzyzcjy fzyzcjy commented Jun 17, 2025

test with 4 rank, 8 rank, 24 rank all pass. code is messy and I will refactor later, and also will try to improve a bit of performance if possible.

Please subtract code diff from #217.

@pkuleo
Copy link

pkuleo commented Jun 21, 2025

Are these codes for NVL72?

@vinjn
Copy link

vinjn commented Jun 29, 2025

Are these codes for NVL72?

Very likely yes.

#define NVLINK_DOMAIN_LARGE

#ifdef NVLINK_DOMAIN_LARGE
#define NUM_MAX_NVL_PEERS 24
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we increase it to 72?

Copy link
Contributor Author

@fzyzcjy fzyzcjy Jun 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering the use case of it - it seems large scale EP on prefill with 72 gpus does not have benefits iirc

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for training in NVL72.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that looks pretty reasonable! I think it is implementable, but since there are already a lot of PRs pending waiting for LyricZhao to have time to review and merge, I may continue this PR a bit later.

Copy link

@DorianZi DorianZi Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong. NVL72 is 18 nodes of 4-GPU, so the intra-node nvlink peer number is no more than 4, while the inter-node nvshmem can itself find cross-node nvlink. Why do we need extend the intra-node nvlink peer to 24 or larger?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding

  1. cross-node nvlink / MNNVL is implemented as intra-node.
  2. DeepEP uses nvshmem low level infiniband API in inter-node, so it doesn't benefit from nvshmem MNNVL feature.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinjn Thanks for the reply. Wondering without changes here , how did sglang run DeepEP with EP48 on nvlink-only NVL72 ?

Copy link
Contributor

@shifangx shifangx Sep 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinjn Thanks for the reply. Wondering without changes here , how did sglang run DeepEP with EP48 on nvlink-only NVL72 ?

  • For SGLang Decoding, we can get performance gain with large EP size, such as EP48. It uses low latency dispatch/combine, which already support NVL72 for any EP size.

  • For SGLang Prefill, it uses intranode/internode dispatch/combine, which is the kernels we are talking about.

Without the pr-218, intranode dispatch/combine cannot support EP size larger than 4. Internode dispatch/combine supports any EP size, but it uses two hops transition, so it is not the best solution for NVL72.
With the pr-218, intranode dispatch/combine can expand to EP24.

@fzyzcjy
Copy link
Contributor Author

fzyzcjy commented Sep 29, 2025

Thanks @shifangx who says can do code cleanup for this PR! I really have no time recently to do these...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants