-
Notifications
You must be signed in to change notification settings - Fork 946
Support other NVLink scenarios #218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This reverts commit 0613b1f.
Are these codes for NVL72? |
Very likely yes. |
#define NVLINK_DOMAIN_LARGE | ||
|
||
#ifdef NVLINK_DOMAIN_LARGE | ||
#define NUM_MAX_NVL_PEERS 24 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we increase it to 72?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering the use case of it - it seems large scale EP on prefill with 72 gpus does not have benefits iirc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for training in NVL72.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that looks pretty reasonable! I think it is implementable, but since there are already a lot of PRs pending waiting for LyricZhao to have time to review and merge, I may continue this PR a bit later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong. NVL72 is 18 nodes of 4-GPU, so the intra-node nvlink peer number is no more than 4, while the inter-node nvshmem can itself find cross-node nvlink. Why do we need extend the intra-node nvlink peer to 24 or larger?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding
- cross-node nvlink / MNNVL is implemented as intra-node.
- DeepEP uses nvshmem low level infiniband API in inter-node, so it doesn't benefit from nvshmem MNNVL feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinjn Thanks for the reply. Wondering without changes here , how did sglang run DeepEP with EP48 on nvlink-only NVL72 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinjn Thanks for the reply. Wondering without changes here , how did sglang run DeepEP with EP48 on nvlink-only NVL72 ?
-
For SGLang Decoding, we can get performance gain with large EP size, such as EP48. It uses low latency dispatch/combine, which already support NVL72 for any EP size.
-
For SGLang Prefill, it uses intranode/internode dispatch/combine, which is the kernels we are talking about.
Without the pr-218, intranode dispatch/combine cannot support EP size larger than 4. Internode dispatch/combine supports any EP size, but it uses two hops transition, so it is not the best solution for NVL72.
With the pr-218, intranode dispatch/combine can expand to EP24.
Thanks @shifangx who says can do code cleanup for this PR! I really have no time recently to do these... |
test with 4 rank, 8 rank, 24 rank all pass. code is messy and I will refactor later, and also will try to improve a bit of performance if possible.
Please subtract code diff from #217.