
TEST/MPI: Added MPI+CUDA example. #10601


Open
wants to merge 6 commits into master from topic/test-mpi-cuda

Conversation

Contributor

@rakhmets commented Apr 3, 2025

What?

Added MPI+CUDA example.

@rakhmets force-pushed the topic/test-mpi-cuda branch 2 times, most recently from e50b97e to ff5b829 on April 3, 2025 18:55
@rakhmets force-pushed the topic/test-mpi-cuda branch 5 times, most recently from e0995a6 to e3ccd4c on April 4, 2025 16:00
@rakhmets force-pushed the topic/test-mpi-cuda branch from e3ccd4c to 9011fcd on April 4, 2025 16:33
@rakhmets marked this pull request as ready for review on April 4, 2025 16:51
@rakhmets force-pushed the topic/test-mpi-cuda branch from 9011fcd to a15d178 on April 7, 2025 09:37
@rakhmets added the WIP-DNM (Work in progress / Do not review) label on Apr 7, 2025
@rakhmets force-pushed the topic/test-mpi-cuda branch from a15d178 to 4641483 on April 7, 2025 11:29
@rakhmets removed the WIP-DNM (Work in progress / Do not review) label on Apr 7, 2025
Contributor
@Akshay-Venkatesh left a comment

@rakhmets Overall the tests stress multi-GPU support in a good way.

I see the following not being tested:

  1. The case where one thread has a device context bound to it and allocates device memory, and another thread later issues the MPI operations with the allocated memory.
  2. Also, use the cudaSetDevice/cudaDeviceReset runtime API instead of explicitly using ctxRetain/Release. This is the API more commonly exercised by high-level applications, so it would be good to ensure that no cases break that testing only the driver API would miss.

The above two cases are supported by multi-GPU support, right? If so, will they be covered in separate tests?
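
For reference, a minimal sketch of the first case, not code from this PR: one thread retains the primary context and allocates the buffer, and a second thread, with no context current to it, then drives the MPI transfer. The CUDA_CALL checking macro is the one used elsewhere in the test; the helper names, buffer size, and device index are assumptions.

#include <cuda.h>
#include <mpi.h>
#include <pthread.h>
#include <stdint.h>

static CUdeviceptr  sketch_buf;
static const size_t sketch_size = 4 * 1024 * 1024;

/* Thread A: bind the primary context to this thread only and allocate. */
static void *sketch_alloc_thread(void *arg)
{
    CUdevice  dev;
    CUcontext ctx;

    CUDA_CALL(cuDeviceGet(&dev, 0));
    CUDA_CALL(cuDevicePrimaryCtxRetain(&ctx, dev));
    CUDA_CALL(cuCtxSetCurrent(ctx));
    CUDA_CALL(cuMemAlloc(&sketch_buf, sketch_size));
    return NULL;
}

/* Thread B: no CUDA context is current here, so the MPI library has to
 * resolve the buffer's owning device/context on its own. Run this after
 * thread A has joined, and initialize MPI with at least
 * MPI_THREAD_SERIALIZED since the calls come from a non-main thread. */
static void *sketch_mpi_thread(void *arg)
{
    int rank = *(int*)arg;

    if (rank == 0) {
        MPI_Send((void*)(uintptr_t)sketch_buf, (int)sketch_size, MPI_BYTE,
                 1, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv((void*)(uintptr_t)sketch_buf, (int)sketch_size, MPI_BYTE,
                 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    return NULL;
}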

for (dev_idx = dev_count - 1; dev_idx > -1; --dev_idx) {
    CUDA_CALL(cuDeviceGet(&cu_dev, dev_idx));
    CUDA_CALL(cuDevicePrimaryCtxRetain(&cu_ctx, cu_dev));
}
Contributor

@rakhmets Minor comment: by this logic, if the MPI job were launched on a single node, both ranks would end up initializing cu_dev to 0 by the end of the above loop. If there are multiple GPUs on the node, I think it would be good to check that the ranks manage different GPUs. As an example, the OSU benchmarks use the LOCAL_RANK environment variable for this reason.
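
For illustration, a node-local rank could be derived roughly as follows; this is a sketch rather than code from the PR. With Open MPI the OMPI_COMM_WORLD_LOCAL_RANK environment variable is one option, and MPI_Comm_split_type is a portable fallback; the function name is an assumption and CUDA_CALL is the test's checking macro.

#include <stdlib.h>
#include <mpi.h>
#include <cuda.h>

/* Pick the CUDA device from the node-local rank so that two ranks launched
 * on the same node do not both end up managing device 0. */
static CUdevice select_device_by_local_rank(int dev_count)
{
    CUdevice    cu_dev;
    int         local_rank;
    const char *env = getenv("OMPI_COMM_WORLD_LOCAL_RANK");

    if (env != NULL) {
        local_rank = atoi(env);
    } else {
        /* Portable fallback: split the communicator by shared-memory node. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &local_rank);
        MPI_Comm_free(&node_comm);
    }

    CUDA_CALL(cuDeviceGet(&cu_dev, local_rank % dev_count));
    return cu_dev;
}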

Contributor Author

Since the test is only for two processes, I fixed this by calling cuDeviceGet(&cu_dev, 1) for the 2nd rank.
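
Roughly what that change amounts to, as a sketch rather than the exact PR diff, reusing rank, cu_dev, and CUDA_CALL from the test:

/* rank 0 keeps cu_dev == 0 from the loop above; rank 1 switches to device 1
 * so the two processes manage different GPUs when run on the same node. */
if (rank == 1) {
    CUDA_CALL(cuDeviceGet(&cu_dev, 1));
}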

Contributor Author
@rakhmets commented Apr 10, 2025

@rakhmets Overall the tests stress multi-GPU support in a good way.

I see the following not being tested:

  1. The case where one thread has a device context bound to it and allocates device memory, and another thread later issues the MPI operations with the allocated memory.
  2. Also, use the cudaSetDevice/cudaDeviceReset runtime API instead of explicitly using ctxRetain/Release. This is the API more commonly exercised by high-level applications, so it would be good to ensure that no cases break that testing only the driver API would miss.

The above two cases are supported by multi-GPU support, right? If so, will they be covered in separate tests?

  1. test_alloc_prim_send_no does pretty much the same thing (from the perspective of the feature implementation): there is an active (retained) primary device context, but it is not bound to the thread at the moment of the MPI send/recv.
  2. I think it would be better to have a separate test for the CUDA Runtime API only, because some scenarios (e.g. creating a user context) are only valid for the Driver API.

I will add a separate test using the CUDA Runtime API (probably in another PR), and I will include the test case described in the first bullet in that new test.
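
For reference, a minimal sketch of what such a Runtime-API-only round trip could look like; this is an assumption about the future test, not code from this PR, and CUDART_CALL is a hypothetical checking macro for runtime-API calls.

#include <cuda_runtime.h>
#include <mpi.h>

/* Sketch: select the device with cudaSetDevice() (which activates the primary
 * context implicitly), allocate with cudaMalloc(), exchange the buffer over
 * MPI, and tear the device down with cudaDeviceReset(). */
static void sketch_runtime_api_roundtrip(int rank, int peer, size_t size)
{
    void *buf;

    CUDART_CALL(cudaSetDevice(rank % 2));
    CUDART_CALL(cudaMalloc(&buf, size));

    if (rank == 0) {
        MPI_Send(buf, (int)size, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, (int)size, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(buf, (int)size, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(buf, (int)size, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
    }

    CUDART_CALL(cudaFree(buf));
    CUDART_CALL(cudaDeviceReset());  /* releases the primary context */
}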
