Conversation

@dongmin-ra commented Oct 2, 2025

Motivation

Fixed an intermittent issue where the dispatch results were corrupted when dispatch/combine were run repeatedly.

  • The issue occurs only when there is no global barrier (e.g. torch.distributed.barrier()) between dispatch and combine (see the sketch below).
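For context, the triggering pattern looks roughly like the sketch below. op.dispatch and op.combine are hypothetical placeholders for mori-EP's dispatch/combine entry points, not the library's actual API names; only the barrier call is real PyTorch.

import torch.distributed as dist

def moe_forward(op, hidden_states, topk_ids):
    # Hypothetical wrapper; the real mori-EP op/handle names may differ.
    dispatched_tokens, dispatched_expert_ids = op.dispatch(hidden_states, topk_ids)

    # ... expert computation on the dispatched tokens ...

    # dist.barrier()  # inserting a global barrier here masks the issue
    # Without it, a fast rank can enter combine's send phase while a slower rank
    # is still running dispatch's recv phase, which is the condition under which
    # the corruption appears.
    return op.combine(dispatched_tokens, topk_ids)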

Technical Details

Issue

  • After integrating mori-EP into vLLM (refer), an intermittent GPU memory access fault occurred during multi-node EP.
    • This happened because the expert indices produced by dispatch were corrupted on some ranks.
  • After investigating, I found that the dispatch result can be corrupted in internode EP.

Cause

  • Combine and dispatch share the same input buffer, shmemInpTokMemObj.
    • In internode dispatch, the last warp, after collecting data from the remaining warps, sends it to the shmemInpTokMemObj buffer on the remote GPU.
      • During the recv phase, data is copied from the local shmemInpTokMemObj to the local shmemOutTokMemObj.
    • In combine's send phase, similarly to dispatch, the last warp sends data to shmemInpTokMemObj on the remote GPU.
  • If combine starts immediately after dispatch on a fast GPU, its send phase may perform RDMA writes to shmemInpTokMemObj before dispatch's recv phase has finished its memcpy (illustrated in the sketch after this list).
    • This can overwrite the buffer and corrupt the data.
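To make the ordering concrete, here is a minimal single-process model of that reuse hazard. The variable names mirror the buffers above, but the plain list operations only stand in for the RDMA write and the local memcpy; this is not mori code.

# shmem_inp models shmemInpTokMemObj, shmem_out models shmemOutTokMemObj.
shmem_inp = ["tokA", "tokB"]   # filled by the remote rank during dispatch's send phase
shmem_out = [None, None]       # destination of dispatch's recv-phase memcpy

# Dispatch's recv phase starts copying, but has only finished slot 0 so far.
shmem_out[0] = shmem_inp[0]

# A fast remote rank has already entered combine's send phase and RDMA-writes
# into the *same* buffer, because dispatch and combine reuse shmemInpTokMemObj.
shmem_inp[1] = "combine payload"

# Dispatch's recv phase resumes and copies slot 1 -- it now reads combine's data.
shmem_out[1] = shmem_inp[1]

print(shmem_out)  # ['tokA', 'combine payload']  -> corrupted dispatch output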

Fix

  • Separated the input buffers used by dispatch and combine (a sketch of the idea follows below).
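Continuing the model above, giving combine its own staging buffer removes the overlap. shmem_combine_inp is an illustrative name, not necessarily the identifier introduced by this patch.

shmem_inp = ["tokA", "tokB"]          # now used only by dispatch
shmem_combine_inp = [None, None]      # separate buffer used only by combine (illustrative name)
shmem_out = [None, None]

shmem_out[0] = shmem_inp[0]
shmem_combine_inp[1] = "combine payload"   # combine's early RDMA write lands in its own buffer
shmem_out[1] = shmem_inp[1]                # dispatch's recv phase still sees its own data

print(shmem_out)  # ['tokA', 'tokB']  -> dispatch output stays intact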

Test Plan

pytest ./tests/python/ops/test_dispatch_combine_internode_inconsistency.py -s

This test should be run on a single node. Internally, it enables the MORI_DISABLE_P2P environment variable so that communication goes over RDMA even within a single node.
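For reference, forcing the same RDMA path manually should amount to setting the variable before the worker processes are spawned; this is a sketch based on the PR description, not a documented configuration API.

import os

# Disable intra-node P2P so traffic goes over RDMA even on a single node
# (the test above sets this internally before spawning its ranks).
os.environ["MORI_DISABLE_P2P"] = "1"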

Test Result

  • Before modification: incorrect expert index values are produced as the dispatch result.
=================================================================================================================================================== test session starts ====================================================================================================================================================
platform linux -- Python 3.12.11, pytest-8.4.1, pluggy-1.6.0
rootdir: /app/mori
plugins: assume-2.4.3, anyio-4.9.0, asyncio-1.0.0
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collecting ...
collected 1 item

tests/python/ops/test_dispatch_combine_internode_inconsistency.py Multiprocessing start method set to spawn

rank 0 RDMA devices: mlx5_0, mlx5_2, mlx5_3, mlx5_4
rank 0 rankInNode 0 select device [0] mlx5_0
rank 3 rankInNode 3 select device [3] mlx5_4
rank 6 rankInNode 6 select device [0] mlx5_0
rank 5 rankInNode 5 select device [3] mlx5_4
rank 7 rankInNode 7 select device [1] mlx5_2
rank 2 rankInNode 2 select device [2] mlx5_3
rank 1 rankInNode 1 select device [1] mlx5_2
rank 4 rankInNode 4 select device [2] mlx5_3
Passed 0/2048
Passed 1/2048
Passed 2/2048
...
Passed 33/2048
Passed 34/2048
Passed 35/2048
Invalid expert id: 1261946812
  • After modification: no error occurs.

@jhchouuu (Collaborator) commented Oct 9, 2025

Hi @dongmin-ra, many thanks for your PR.

We also discovered this problem when integrating mori-EP into vLLM. Our initial idea was to add synchronization across all devices inside the kernel at the end of the dispatch recv phase, which would ensure that the buffer is not overwritten by combine. But that sync operation is expensive.
Your approach seems more practical. Although it consumes more memory, we will find ways to further optimize memory utilization. We're going to merge it later and look forward to more contributions from the MOREH team!

num_experts_per_token=num_experts_per_token,
max_token_type_size=2,
block_num=16,
warp_num_per_block=1,
A collaborator commented on the snippet above:

Could you please change "warp_num_per_block" to 16 (warp_num_per_block=16)? Otherwise the test will not pass. However, this might be caused by the new changes incorporated into the kernel.

A collaborator commented:

After our testing on vLLM, this fix also resolves the issue. We will merge it soon.
@dongmin-ra, thanks to the MOREH team again.
