
[Proof of Concept] Multi-GPU prototype (single node) #89

Draft · wants to merge 5 commits into base: main

Conversation

@efaulhaber (Member) commented Dec 23, 2024

This is a quick and dirty prototype to run code on multiple GPUs of a single node.
I just made everything use unified memory and split the ndrange by the number of GPUs.
The particles are ordered, so this should partition the domain into blocks. Each GPU works on one of these blocks, with only limited communication between them. The NVIDIA driver should take care of optimizing where the memory lives.
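The general pattern is roughly the following. This is a hedged, minimal sketch based on CUDA.jl's multi-device documentation, not the actual PR code; the kernel, array, and chunking here are made up for illustration:

```julia
using CUDA

# Toy kernel standing in for the real SPH kernels: each thread scales
# one element of its device's chunk.
function scale_kernel!(A, factor, offset, len)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= len
        @inbounds A[offset + i] *= factor
    end
    return nothing
end

# Allocate in unified (managed) memory so every GPU can access the array.
A = cu(ones(Float32, 1_000_000); unified = true)

gpus = collect(devices())
chunk = cld(length(A), length(gpus))  # split the ndrange by the number of GPUs

# One task per GPU; each task switches to its device and launches the
# kernel on its own chunk of the range.
@sync for (d, dev) in enumerate(gpus)
    Threads.@spawn begin
        device!(dev)
        offset = (d - 1) * chunk
        len = min(chunk, length(A) - offset)
        @cuda threads=256 blocks=cld(len, 256) scale_kernel!(A, 2f0, offset, len)
        synchronize()
    end
end
```

Because the particles are sorted, consecutive index ranges correspond to spatially contiguous blocks, so each GPU mostly touches "its own" pages of the unified array.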

Note that I had to change some internals of CUDA.jl, as unified memory is otherwise pre-fetched to device memory, which might make sense when sharing memory between CPU and GPU, but certainly not when sharing memory between two GPUs that work on it simultaneously.

As you can see here, it is indeed using all 4 GPUs.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100                    On  | 00000000:26:00.0 Off |                    0 |
| N/A   44C    P0             244W / 700W |  24350MiB / 95830MiB |     44%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100                    On  | 00000000:46:00.0 Off |                    0 |
| N/A   40C    P0             173W / 700W |   2660MiB / 95830MiB |      1%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100                    On  | 00000000:A6:00.0 Off |                    0 |
| N/A   45C    P0             247W / 700W |   2658MiB / 95830MiB |     43%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100                    On  | 00000000:C6:00.0 Off |                    0 |
| N/A   47C    P0             222W / 700W |   2654MiB / 95830MiB |     25%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

Here are some results using the WCSPH benchmark with 65M particles.

| | 1 GPU device memory (main) | 1 GPU unified memory | 4 GPUs device memory | 4 GPUs unified memory | 2 GPUs unified memory |
|---|---|---|---|---|---|
| FP64 | 827.574 ms | 827.062 ms | 6.665 s | 642.866 ms | 417.144 ms ~ 836.871 ms |
| FP32 | 421.291 ms | 420.933 ms | 4.191 s | 328.196 ms | 426.163 ms |

As you can see, there is no difference between device and unified memory on a single GPU.
Device memory with 4 GPUs is unsurprisingly very slow. Unified memory with 4 GPUs is slightly faster than a single GPU, but the speedup is underwhelming.

On 2 GPUs, things get interesting. Most of the time, the runtime is very similar to 1 GPU, but in about 1 out of 20 runs it's almost twice as fast:

BenchmarkTools.Trial: 30 samples with 1 evaluation.
 Range (min … max):  416.990 ms … 851.266 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     838.658 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   811.535 ms ± 107.264 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                             █   
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃█▄ ▁
  417 ms           Histogram: frequency by time          851 ms <

Memory estimate: 68.70 KiB, allocs estimate: 543.

So it seems that the two GPUs only sometimes work in parallel, while most of the time only one is active.
I have no idea why this is happening.

CC @vchuravy

@efaulhaber efaulhaber added the gpu label Dec 23, 2024
@efaulhaber efaulhaber self-assigned this Dec 24, 2024
@efaulhaber (Member, Author) commented:

Update

With help from the Julia Slack, I found out that a synchronization happens when an array is accessed from another stream. I have now disabled both the prefetching (see my first post) and this synchronization, and I get the following amazing results:

| | 1 GPU | 2 GPUs | 4 GPUs |
|---|---|---|---|
| FP64 | 827.574 ms (1x) | 425.970 ms (1.94x) | 220.332 ms (3.76x) |
| FP32 | 421.291 ms (1x) | 219.901 ms (1.92x) | 115.364 ms (3.65x) |


codecov bot commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.

Please upload report for BASE (ef/localmem-kernel@9560a82). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/util.jl | 0.00% | 3 Missing ⚠️ |
Additional details and impacted files
@@                  Coverage Diff                  @@
##             ef/localmem-kernel      #89   +/-   ##
=====================================================
  Coverage                      ?   70.12%           
=====================================================
  Files                         ?       15           
  Lines                         ?      626           
  Branches                      ?        0           
=====================================================
  Hits                          ?      439           
  Misses                        ?      187           
  Partials                      ?        0           
| Flag | Coverage Δ |
|---|---|
| unit | 70.12% <0.00%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.


@efaulhaber efaulhaber changed the base branch from ef/localmem-kernel to main January 10, 2025 23:11
@efaulhaber (Member, Author) commented:

There are some major obstacles in the way of this being merged.

  1. "Disable or make automatic prefetching of unified memory optional" (JuliaGPU/CUDA.jl#2618) and "Ability to opt out of / improved automatic synchronization between tasks for shared array usage" (JuliaGPU/CUDA.jl#2617) need to be solved to get any benefit from using multiple GPUs. I used the branch https://github.com/efaulhaber/CUDA.jl/tree/disable-prefetch as a workaround to produce the results above.
  2. The atomic neighborhood search update does not work on multiple GPUs, so even with these two issues solved, we can't run a full simulation. Atomics with unified memory seem to be a bigger obstacle than the two issues above: "Atomics: configurable scope (for multi-device unified memory)" (JuliaGPU/CUDA.jl#2619).
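For context on obstacle 2, here is a hedged sketch of the kind of atomic update a cell-list neighborhood search does (the names and layout are hypothetical, not the actual neighborhood-search code):

```julia
using CUDA

# Each particle atomically increments the counter of the grid cell it
# belongs to. `cellindex[i]` maps particle i to its cell.
function count_particles!(cellcount, cellindex)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(cellindex)
        # CUDA.@atomic lowers to a device-scope atomic. When two GPUs
        # increment the same unified-memory counter concurrently,
        # device-scope atomics do not guarantee that updates from the
        # other device are seen; a system-scope atomic would be needed,
        # which is what JuliaGPU/CUDA.jl#2619 asks for.
        @inbounds CUDA.@atomic cellcount[cellindex[i]] += 1
    end
    return nothing
end
```

With only one GPU updating the counters this is fine; the problem appears exactly when the update is split across devices sharing the same unified array.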

@sloede (Member) commented Jan 16, 2025:

> Update
>
> With help in the Julia Slack, I found out that there is a synchronization happening when an array is accessed from another stream. I now disabled both the prefetching (see my first post) and this synchronization, and I get the following amazing results:
>
> | | 1 GPU | 2 GPUs | 4 GPUs |
> |---|---|---|---|
> | FP64 | 827.574 ms (1x) | 425.970 ms (1.94x) | 220.332 ms (3.76x) |
> | FP32 | 421.291 ms (1x) | 219.901 ms (1.92x) | 115.364 ms (3.65x) |

Amazing - great work!
