yifeizhang-c
Contributor

Support CUDA Graph for the internode dispatch kernels, using the same logic already applied to the intranode dispatch kernels.

while (ld_volatile_global(moe_recv_rdma_counter_mapped) != -1);
*moe_recv_rdma_counter_mapped = sum;
if (num_worst_tokens == 0) {
while (ld_volatile_global(moe_recv_rdma_counter_mapped) != -1);
Contributor Author

@yifeizhang-c yifeizhang-c Sep 30, 2025

I would like to double-check the design here. Is the while (ld_volatile_global(...)) loop there for cache coherency? That is, does the device side need to confirm that the host-side update of the value has already been written back before the device side makes its own update?
I ask because intranode dispatch does not have such logic.
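
To make the question concrete, here is a minimal sketch of the handshake as I understand it, assuming the host resets a host-mapped counter to -1 and the device spins until it observes that reset before publishing the count. All names here (ld_volatile_sketch, publish_count, run_handshake) are hypothetical stand-ins for illustration, not the actual kernels or helpers in this PR.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for ld_volatile_global: a volatile load, so the
// value is re-read from memory on every iteration instead of being cached.
__device__ __forceinline__ int ld_volatile_sketch(const volatile int* ptr) {
    return *ptr;
}

// Device side: wait until the host's reset value (-1) is visible,
// and only then overwrite the counter with the real count.
__global__ void publish_count(volatile int* counter_mapped, int sum) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        while (ld_volatile_sketch(counter_mapped) != -1) {}
        *counter_mapped = sum;
    }
}

// Host side: reset the mapped counter, launch the kernel, then poll
// until the device has replaced -1 with the published count.
// Returns the count, or -2 if a CUDA call fails (sketch-level handling).
int run_handshake(int sum) {
    int* host_counter = nullptr;
    int* dev_counter = nullptr;
    if (cudaHostAlloc((void**)&host_counter, sizeof(int), cudaHostAllocMapped) != cudaSuccess)
        return -2;
    if (cudaHostGetDevicePointer((void**)&dev_counter, host_counter, 0) != cudaSuccess) {
        cudaFreeHost(host_counter);
        return -2;
    }

    volatile int* polled = host_counter;
    *polled = -1;                                // host-side reset
    publish_count<<<1, 1>>>(dev_counter, sum);   // device waits for -1, then writes
    if (cudaGetLastError() != cudaSuccess) {
        cudaFreeHost(host_counter);
        return -2;
    }
    while (*polled == -1) {}                     // host waits for the device's value
    int received = *polled;

    cudaDeviceSynchronize();
    cudaFreeHost(host_counter);
    return received;
}

int main() {
    printf("received count: %d\n", run_handshake(42));
    return 0;
}
```

If this reading is right, the device-side spin only guards against the kernel writing the count before the host's reset to -1 has landed, since the host could otherwise overwrite the count with its late reset. Whether the internode path needs this guard while the intranode path does not is exactly what I would like to confirm.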
