Support CUDA Graph for internode dispatch normal kernel #438

yifeizhang-c · 2025-09-30T03:11:37Z

Support CUDA Graph for the internode dispatch kernels, using the same logic already applied to the intranode dispatch kernels.

yifeizhang-c · 2025-09-30T05:23:08Z

csrc/kernels/internode.cu

-            while (ld_volatile_global(moe_recv_rdma_counter_mapped) != -1);
-            *moe_recv_rdma_counter_mapped = sum;
+            if (num_worst_tokens == 0) {
+                while (ld_volatile_global(moe_recv_rdma_counter_mapped) != -1);


I'd like to double-check the design here. Is the while (ld_volatile_global(...)) loop intended to handle cache coherency? That is, does the device side need to verify that the host-side update of the value has already been written back before the device side makes its own update?
I ask because intranode dispatch does not have such logic.
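To make the question concrete, here is a minimal sketch of the handshake pattern being discussed. It is not the DeepEP code: `load_volatile`, `publish_count`, and `run_handshake` are hypothetical names, and the sketch assumes the host resets a mapped counter to -1, the device waits until it observes that reset before writing the real value, and the whole exchange is skipped when `num_worst_tokens` is nonzero so that no host-side polling is needed.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for ld_volatile_global: a volatile load, so every
// loop iteration re-reads memory instead of reusing a register value.
__device__ __forceinline__ int load_volatile(const int* ptr) {
    return *reinterpret_cast<const volatile int*>(ptr);
}

// Publishes `sum` to the host through a host-mapped counter.
// Assumption: when num_worst_tokens != 0 the handshake is skipped entirely,
// so the host never has to poll the counter.
__global__ void publish_count(int* counter_mapped, int sum, int num_worst_tokens) {
    if (threadIdx.x == 0 && num_worst_tokens == 0) {
        // Spin until the host's reset to -1 is visible on the device.
        while (load_volatile(counter_mapped) != -1);
        *counter_mapped = sum;
        // Make the write visible to the host before the kernel moves on.
        __threadfence_system();
    }
}

// Host side of the handshake. Returns the value received from the device,
// or -2 if any CUDA call fails.
int run_handshake(int sum) {
    int* counter_host = nullptr;
    int* counter_dev = nullptr;
    if (cudaHostAlloc(reinterpret_cast<void**>(&counter_host), sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess)
        return -2;
    if (cudaHostGetDevicePointer(reinterpret_cast<void**>(&counter_dev),
                                 counter_host, 0) != cudaSuccess) {
        cudaFreeHost(counter_host);
        return -2;
    }

    volatile int* counter = counter_host;
    *counter = -1;  // host-side reset that the device waits for

    publish_count<<<1, 32>>>(counter_dev, sum, /*num_worst_tokens=*/0);
    if (cudaGetLastError() != cudaSuccess) {
        cudaFreeHost(counter_host);
        return -2;
    }

    // Host polls until the device has replaced -1 with the real value.
    while (*counter == -1);
    int received = *counter;

    cudaDeviceSynchronize();
    cudaFreeHost(counter_host);
    return received;
}
```

Calling `run_handshake(42)` from a host program exercises the full round trip. The host polling loop is the part that cannot run inside a captured CUDA Graph, which is presumably why the changed code only performs the exchange when `num_worst_tokens == 0`.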

Enable CUDA Graph for internode dispatch

1c8420b
