[Proof of Concept] Multi-GPU prototype (single node) #89
Conversation
Update: With help from the Julia Slack, I found out that a synchronization happens whenever an array is accessed from another stream. I disabled both the prefetching (see my first post) and this synchronization, and I now get amazing results.
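For context, here is a minimal sketch of when that synchronization triggers (the guard itself lives in CUDA.jl's internals; this only reproduces the access pattern). CUDA.jl gives each Julia task its own stream, so touching an array from a second task counts as access from another stream:

```julia
using CUDA

a = CUDA.rand(Float32, 1024)
a .+= 1f0   # `a` is now associated with this task's stream

# A new task runs on its own stream, so this access would ordinarily make
# CUDA.jl synchronize the stream `a` was last used on before proceeding.
t = Threads.@spawn sum(a)
fetch(t)
```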
Codecov Report

@@            Coverage Diff             @@
##    ef/localmem-kernel     #89   +/- ##
=========================================
  Coverage          ?   70.12%
=========================================
  Files             ?       15
  Lines             ?      626
  Branches          ?        0
=========================================
  Hits              ?      439
  Misses            ?      187
  Partials          ?        0
There are some major obstacles in the way of this being merged.
Amazing - great work!
This is a quick and dirty prototype to run code on multiple GPUs of a single node.
I just made everything use unified memory and split the ndrange by the number of GPUs. The particles are ordered, so this should partition the domain into blocks. Each GPU works on one of these blocks, with only limited communication between them. Nvidia should take care of optimizing where the memory lives.
Note that I had to change some internals of CUDA.jl, as unified memory is otherwise pre-fetched to device memory, which might make sense when sharing memory between CPU and GPU, but certainly not when sharing memory between two GPUs that work on it simultaneously.
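To illustrate what that prefetching does, here is a sketch using CUDA.jl's driver-level memory wrappers (an assumption on my side: the exact module path and signatures vary across CUDA.jl versions, so treat this as illustrative rather than the internals this PR patches):

```julia
using CUDA

# Allocate unified (managed) memory at the driver level.
buf = CUDA.Mem.alloc(CUDA.Mem.Unified, 1024 * sizeof(Float32))

# Migrate the pages to the current GPU's device memory -- helpful when one
# device owns the data, counterproductive when several GPUs share it.
CUDA.Mem.prefetch(buf; device = CUDA.device())

CUDA.Mem.free(buf)
```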
It is indeed using all 4 GPUs.
Here are some results using the WCSPH benchmark with 65M particles.
As you can see, there is no difference between device and unified memory on a single GPU.
Device memory with 4 GPUs is unsurprisingly very slow. Unified memory with 4 GPUs is slightly faster than a single GPU, but the difference is a bit underwhelming.
On 2 GPUs, things get interesting. Most of the time, the runtime is very similar to 1 GPU, but in about 1 out of 20 runs, it's almost twice as fast.
So it seems that the two GPUs only sometimes work in parallel, while most of the time only one is active.
I have no idea why this is happening.
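One way to quantify the bimodal behavior: time a batch of runs end to end and look at the spread. A sketch, where `run_step!` is a hypothetical placeholder for one full WCSPH benchmark run:

```julia
using CUDA, Statistics

run_step!() = nothing   # placeholder: substitute one run of the actual benchmark

times = map(1:20) do _
    @elapsed begin
        run_step!()
        for dev in CUDA.devices()   # make sure every GPU has finished
            CUDA.device!(dev)
            CUDA.synchronize()
        end
    end
end
println("min = ", minimum(times), "  median = ", median(times))
```

If the fast runs cluster near half the median, that matches the interpretation that the two GPUs only sometimes overlap.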
CC @vchuravy