@erieaton-amd
Contributor

This enables the RDMA/GDA support that was added to rocshmem. It's working on a single node with an MLX5 card. This WIP has not yet been tested on separate machines.

@CLAassistant

CLAassistant commented Nov 10, 2025

CLA assistant check
All committers have signed the CLA.

@wenlei-bao
Collaborator

wenlei-bao commented Dec 16, 2025

Thanks for the contribution. @erieaton-amd @drprajap Is this PR ready?
BTW, we just updated Triton-distributed/Triton, so you may want to rebase.

@KnowingNothing
Collaborator

We have updated the code. If you run into any problems refactoring your PR, feel free to leave a comment :)

@KnowingNothing KnowingNothing self-assigned this Dec 25, 2025
This updates the initialization code for test_ag_gemm_intra_node.py.

Signed-off-by: Eric Eaton <[email protected]>
@erieaton-amd
Contributor Author

I refactored the PR. However, I am having some trouble getting even the main branch to work now.

@erieaton-amd
Contributor Author

I have the tests test_put_signal.py and 03a-inter-node-allgather.py reporting that they passed, but they crash immediately afterward. There appears to be some memory corruption issue.

@wenlei-bao
Collaborator

@erieaton-amd @drprajap It seems CI failed; can you please take a look?

@erieaton-amd
Contributor Author

Right now the code only works if the machine has a Mellanox card set up. This is because the rocshmem backend is hard-coded into the bitcode, which has to be cleaned up somehow so that it can also use the IPC backend. I'm currently trying to figure out why the tests I wrote are now failing on the machine that does have a Mellanox card.

Also, adjusted test_put_signal.py to be more consistent with the original
test.

Signed-off-by: Eric Eaton <[email protected]>
Passes now with no errors, with and without ROCSHMEM_DISABLE_MIXED_IPC=1

Signed-off-by: Eric Eaton <[email protected]>
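
The commit message above reports a pass both with and without `ROCSHMEM_DISABLE_MIXED_IPC=1`. A minimal sketch of how one might run the same command under both settings is shown here. The helper name `run_both_ways` is hypothetical, and the actual tests may need a distributed launcher that this thread does not describe.

```python
import os
import subprocess

# Environment variable named in the commit message above.
ENV_VAR = "ROCSHMEM_DISABLE_MIXED_IPC"


def run_both_ways(cmd):
    """Run `cmd` once with ENV_VAR unset and once with ENV_VAR=1.

    `cmd` is a list of strings, as accepted by subprocess.run.
    Returns a dict mapping "unset" / "set" to the process exit code.
    """
    results = {}
    for label, value in (("unset", None), ("set", "1")):
        env = os.environ.copy()
        env.pop(ENV_VAR, None)  # start from a state where the variable is absent
        if value is not None:
            env[ENV_VAR] = value
        proc = subprocess.run(cmd, env=env)
        results[label] = proc.returncode
    return results
```

Calling `run_both_ways` with whatever command launches a given test returns both exit codes, so a regression that appears under only one setting is visible in a single run.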
@erieaton-amd
Contributor Author

OK, the tests seem to be working again. I've enabled a dispatch feature in rocshmem that should select the right backend in the bitcode, so maybe the CI will work now.
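
As a rough illustration of what backend dispatch means here, the sketch below chooses between a GDA path and an IPC path depending on whether an MLX5 device is visible. This is not rocshmem's actual mechanism, which the thread does not describe. The function `pick_backend`, the labels `"gda"` and `"ipc"`, and the use of the Linux sysfs directory `/sys/class/infiniband` (where MLX5 devices typically appear as `mlx5_0`, `mlx5_1`, and so on) are all assumptions made for illustration.

```python
import os


def pick_backend(ib_dir="/sys/class/infiniband"):
    """Return "gda" if an mlx5 device is listed under ib_dir, else "ipc".

    Hypothetical host-side check; the real dispatch happens inside
    rocshmem and may use entirely different criteria.
    """
    try:
        devices = os.listdir(ib_dir)
    except OSError:
        # Directory missing or unreadable: treat as "no RDMA NIC present".
        devices = []
    has_mlx5 = any(name.startswith("mlx5_") for name in devices)
    return "gda" if has_mlx5 else "ipc"
```

The point of the sketch is only that the choice is made at run time rather than fixed when the bitcode is built, which is what lets the same build run on machines with and without a Mellanox card.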

@erieaton-amd
Contributor Author

@wenlei-bao I am investigating an issue where the function sleep_async doesn't work in newer versions of ROCm. This doesn't appear to block this work, though, so the patch can be run through the CI and reviewed.

@erieaton-amd erieaton-amd changed the title from "(WIP) Update rocshmem and enable GDA" to "Update rocshmem and enable GDA" Jan 27, 2026
@erieaton-amd
Contributor Author

The bitcode was missing a file; please try the CI again.

@wenlei-bao wenlei-bao merged commit 36cfb4b into ByteDance-Seed:main Jan 30, 2026
7 of 8 checks passed
4 participants