TEST/MPI: Added MPI+CUDA example. #10601
base: master
Conversation
@rakhmets Overall the tests stress multi-GPU support in a good way.
I see the following not being tested:
- The case where one thread has a device context bound to it and allocates device memory, and another thread later issues the MPI operations with that memory.
- Using the cudaSetDevice/cudaDeviceReset runtime API instead of explicitly calling ctxRetain/Release. The runtime API is the one most commonly exercised by high-level applications, so it would be good to ensure that no cases break that would be missed by testing only the driver API (see the sketch after this list).
The above two cases are supported by multi-GPU support, right? If so, will they be covered by separate tests?
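For reference, a minimal sketch of the runtime-API flow the second bullet describes; the function name, device index handling, and buffer size are placeholders, not code from this PR:

```c
/* Hypothetical sketch: bind a device with the runtime API instead of
 * cuDevicePrimaryCtxRetain/Release, then allocate the buffer that another
 * thread may later pass to MPI_Send/MPI_Recv. */
#include <cuda_runtime.h>
#include <stddef.h>

static void *alloc_device_buffer(int dev_idx, size_t size)
{
    void *buf = NULL;

    cudaSetDevice(dev_idx); /* implicitly initializes the primary context */
    cudaMalloc(&buf, size);
    return buf;
}

/* At teardown, cudaDeviceReset() on the owning thread releases the
 * primary context. */
```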
/* Retain the primary context on every device; since the loop counts down
 * to 0, cu_dev and cu_ctx end up referring to device 0. */
for (dev_idx = dev_count - 1; dev_idx > -1; --dev_idx) {
    CUDA_CALL(cuDeviceGet(&cu_dev, dev_idx));
    CUDA_CALL(cuDevicePrimaryCtxRetain(&cu_ctx, cu_dev));
}
@rakhmets Minor comment -- by this logic, if the MPI job were launched on a single node, both ranks would end up initializing cu_dev to 0 by the end of the above loop. If there are multiple GPUs on the node, I think it would be good to check that the ranks manage different GPUs. As an example, the OSU benchmarks use a LOCAL_RANK environment variable for this reason (a sketch of that approach follows).
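One possible shape of that approach, assuming a launcher that exports a local-rank variable (the variable name OMPI_COMM_WORLD_LOCAL_RANK is an assumption and differs between launchers):

```c
/* Hypothetical sketch: pick a GPU per local rank so ranks on the same node
 * do not all end up managing device 0. The environment variable name is an
 * assumption; it depends on the MPI launcher. */
#include <stdlib.h>
#include <cuda.h>

static CUdevice select_device_by_local_rank(int dev_count)
{
    const char *env = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank  = (env != NULL) ? atoi(env) : 0;
    CUdevice cu_dev;

    /* Wrap around when there are more local ranks than GPUs. */
    cuDeviceGet(&cu_dev, local_rank % dev_count);
    return cu_dev;
}
```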
Since the test is only for two processes, I fixed this by calling cuDeviceGet(&cu_dev, 1) for the 2nd rank (illustrated by the sketch below).
I will add a separate test using the CUDA Runtime API (probably in another PR), and I will include the test case described in the first bullet in that new test.
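For illustration only (a hedged sketch, not the exact change in the PR), the per-rank device selection described above could look like:

```c
/* Hypothetical sketch of the fix described above: rank 0 keeps device 0
 * and rank 1 switches to device 1. Assumes exactly two ranks and at least
 * two visible GPUs. */
#include <mpi.h>
#include <cuda.h>

static void select_device_for_rank(CUdevice *cu_dev)
{
    int rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cuDeviceGet(cu_dev, (rank == 1) ? 1 : 0);
}
```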
What?
Added MPI+CUDA example.
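Below is a minimal sketch of what an MPI+CUDA example of this kind typically exercises; the buffer size, device selection, and use of blocking MPI_Send/MPI_Recv are simplifying assumptions and not the code added by this PR (a CUDA-aware MPI library is assumed):

```c
/* Hypothetical minimal MPI+CUDA example: two ranks exchange a device
 * buffer with MPI_Send/MPI_Recv, relying on a CUDA-aware MPI library. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define COUNT 1024

int main(int argc, char **argv)
{
    int rank, dev_count;
    int *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Spread ranks over the GPUs visible on the node. */
    cudaGetDeviceCount(&dev_count);
    cudaSetDevice(rank % dev_count);
    cudaMalloc((void **)&buf, COUNT * sizeof(*buf));

    if (rank == 0) {
        cudaMemset(buf, 1, COUNT * sizeof(*buf));
        MPI_Send(buf, COUNT, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, COUNT, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank %d received the device buffer\n", rank);
    }

    cudaFree(buf);
    cudaDeviceReset();
    MPI_Finalize();
    return 0;
}
```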