
TEST/MPI: Added MPI+CUDA example. #10601


Open
wants to merge 6 commits into master from topic/test-mpi-cuda

Conversation

Contributor

@rakhmets commented Apr 3, 2025

What?

Added MPI+CUDA example.

@rakhmets force-pushed the topic/test-mpi-cuda branch 2 times, most recently from e50b97e to ff5b829 on April 3, 2025 18:55
@rakhmets force-pushed the topic/test-mpi-cuda branch 5 times, most recently from e0995a6 to e3ccd4c on April 4, 2025 16:00
@rakhmets force-pushed the topic/test-mpi-cuda branch from e3ccd4c to 9011fcd on April 4, 2025 16:33
@rakhmets marked this pull request as ready for review on April 4, 2025 16:51
@rakhmets force-pushed the topic/test-mpi-cuda branch from 9011fcd to a15d178 on April 7, 2025 09:37
@rakhmets added the WIP-DNM (Work in progress / Do not review) label on Apr 7, 2025
@rakhmets force-pushed the topic/test-mpi-cuda branch from a15d178 to 4641483 on April 7, 2025 11:29
@rakhmets removed the WIP-DNM (Work in progress / Do not review) label on Apr 7, 2025
Contributor
@Akshay-Venkatesh left a comment

@rakhmets Overall the tests stress multi-GPU support in a good way.

I see the following not being tested:

  1. The case where one thread has a device context bound to it and allocates device memory, and another thread later issues the MPI operations with the allocated memory.
  2. Also, use the cudaSetDevice/cudaDeviceReset runtime API instead of explicitly using ctxRetain/Release. This is the API more commonly exercised by high-level applications, so it would be good to ensure that no cases break that testing only the driver API would miss.

The above two cases are supported by multi-GPU support, right? If so, will they be covered in separate tests?
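
For reference, a minimal sketch of the first case, not code from this PR: one thread retains the primary context and allocates the buffer, and a second thread, with no context current to it, then drives the MPI transfer. The CUDA_CALL checking macro is the one used elsewhere in the test; the helper names, buffer size, and device index are assumptions.

#include <cuda.h>
#include <mpi.h>
#include <pthread.h>
#include <stdint.h>

static CUdeviceptr  sketch_buf;
static const size_t sketch_size = 4 * 1024 * 1024;

/* Thread A: bind the primary context to this thread only and allocate. */
static void *sketch_alloc_thread(void *arg)
{
    CUdevice  dev;
    CUcontext ctx;

    CUDA_CALL(cuDeviceGet(&dev, 0));
    CUDA_CALL(cuDevicePrimaryCtxRetain(&ctx, dev));
    CUDA_CALL(cuCtxSetCurrent(ctx));
    CUDA_CALL(cuMemAlloc(&sketch_buf, sketch_size));
    return NULL;
}

/* Thread B: no CUDA context is current here, so the MPI library has to
 * resolve the buffer's owning device/context on its own. Run this after
 * thread A has joined, and initialize MPI with at least
 * MPI_THREAD_SERIALIZED since the calls come from a non-main thread. */
static void *sketch_mpi_thread(void *arg)
{
    int rank = *(int*)arg;

    if (rank == 0) {
        MPI_Send((void*)(uintptr_t)sketch_buf, (int)sketch_size, MPI_BYTE,
                 1, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv((void*)(uintptr_t)sketch_buf, (int)sketch_size, MPI_BYTE,
                 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    return NULL;
}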

for (dev_idx = dev_count - 1; dev_idx > -1; --dev_idx) {
    CUDA_CALL(cuDeviceGet(&cu_dev, dev_idx));
    CUDA_CALL(cuDevicePrimaryCtxRetain(&cu_ctx, cu_dev));
}
Contributor

@rakhmets Minor comment: by this logic, if the MPI job were launched on a single node, both ranks would end up initializing cu_dev to 0 by the end of the above loop. If there are multiple GPUs on the node, I think it would be good to check that the ranks manage different GPUs. As an example, the OSU benchmarks use the LOCAL_RANK environment variable for this reason.
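
For illustration, a node-local rank could be derived roughly as follows; this is a sketch rather than code from the PR. With Open MPI the OMPI_COMM_WORLD_LOCAL_RANK environment variable is one option, and MPI_Comm_split_type is a portable fallback; the function name is an assumption and CUDA_CALL is the test's checking macro.

#include <stdlib.h>
#include <mpi.h>
#include <cuda.h>

/* Pick the CUDA device from the node-local rank so that two ranks launched
 * on the same node do not both end up managing device 0. */
static CUdevice select_device_by_local_rank(int dev_count)
{
    CUdevice    cu_dev;
    int         local_rank;
    const char *env = getenv("OMPI_COMM_WORLD_LOCAL_RANK");

    if (env != NULL) {
        local_rank = atoi(env);
    } else {
        /* Portable fallback: split the communicator by shared-memory node. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &local_rank);
        MPI_Comm_free(&node_comm);
    }

    CUDA_CALL(cuDeviceGet(&cu_dev, local_rank % dev_count));
    return cu_dev;
}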

Contributor Author

Since the test is only for two processes, I fixed this by calling cuDeviceGet(&cu_dev, 1) for the 2nd rank.
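
Roughly what that change amounts to, as a sketch rather than the exact PR diff, reusing rank, cu_dev, and CUDA_CALL from the test:

/* rank 0 keeps cu_dev == 0 from the loop above; rank 1 switches to device 1
 * so the two processes manage different GPUs when run on the same node. */
if (rank == 1) {
    CUDA_CALL(cuDeviceGet(&cu_dev, 1));
}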

Contributor Author
@rakhmets commented Apr 10, 2025

@rakhmets Overall the tests stress multi-GPU support in a good way.

I see the following not being tested:

  1. The case where one thread has a device context bound to it and allocates device memory, and another thread later issues the MPI operations with the allocated memory.
  2. Also, use the cudaSetDevice/cudaDeviceReset runtime API instead of explicitly using ctxRetain/Release. This is the API more commonly exercised by high-level applications, so it would be good to ensure that no cases break that testing only the driver API would miss.

The above two cases are supported by multi-GPU support, right? If so, will they be covered in separate tests?

  1. test_alloc_prim_send_no does pretty much the same thing (from the perspective of the feature implementation): there is an active (retained) primary device context, but it is not bound to the thread at the moment of the MPI send/recv.
  2. I think it would be better to have a separate test for the CUDA Runtime API only, because some scenarios (e.g. creating a user context) are only valid for the Driver API.

I will add a separate test using the CUDA Runtime API (probably in another PR), and I will include the test case described in the first bullet in that new test.
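
For reference, a minimal sketch of what such a Runtime-API-only round trip could look like; this is an assumption about the future test, not code from this PR, and CUDART_CALL is a hypothetical checking macro for runtime-API calls.

#include <cuda_runtime.h>
#include <mpi.h>

/* Sketch: select the device with cudaSetDevice() (which activates the primary
 * context implicitly), allocate with cudaMalloc(), exchange the buffer over
 * MPI, and tear the device down with cudaDeviceReset(). */
static void sketch_runtime_api_roundtrip(int rank, int peer, size_t size)
{
    void *buf;

    CUDART_CALL(cudaSetDevice(rank % 2));
    CUDART_CALL(cudaMalloc(&buf, size));

    if (rank == 0) {
        MPI_Send(buf, (int)size, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, (int)size, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(buf, (int)size, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(buf, (int)size, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
    }

    CUDART_CALL(cudaFree(buf));
    CUDART_CALL(cudaDeviceReset());  /* releases the primary context */
}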
