
[Proof of Concept] Multi-GPU prototype (single node) #89

Draft · wants to merge 5 commits into base: main

Conversation

@efaulhaber (Member) commented Dec 23, 2024

This is a quick and dirty prototype to run code on multiple GPUs of a single node.
I just made everything use unified memory and split the ndrange by the number of GPUs.
The particles are ordered, so this should partition the domain into blocks. Each GPU works on one of these blocks, with only limited communication between them. The NVIDIA driver should take care of optimizing where the memory lives.
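The general pattern is roughly the following. This is a hedged, minimal sketch based on CUDA.jl's multi-device documentation, not the actual PR code; the kernel, array, and chunking here are made up for illustration:

```julia
using CUDA

# Toy kernel standing in for the real SPH kernels: each thread scales
# one element of its device's chunk.
function scale_kernel!(A, factor, offset, len)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= len
        @inbounds A[offset + i] *= factor
    end
    return nothing
end

# Allocate in unified (managed) memory so every GPU can access the array.
A = cu(ones(Float32, 1_000_000); unified = true)

gpus = collect(devices())
chunk = cld(length(A), length(gpus))  # split the ndrange by the number of GPUs

# One task per GPU; each task switches to its device and launches the
# kernel on its own chunk of the range.
@sync for (d, dev) in enumerate(gpus)
    Threads.@spawn begin
        device!(dev)
        offset = (d - 1) * chunk
        len = min(chunk, length(A) - offset)
        @cuda threads=256 blocks=cld(len, 256) scale_kernel!(A, 2f0, offset, len)
        synchronize()
    end
end
```

Because the particles are sorted, consecutive index ranges correspond to spatially contiguous blocks, so each GPU mostly touches "its own" pages of the unified array.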

Note that I had to change some internals of CUDA.jl, as unified memory is otherwise pre-fetched to device memory, which might make sense when sharing memory between CPU and GPU, but certainly not when sharing memory between two GPUs that work on it simultaneously.

As you can see here, it is indeed using all 4 GPUs.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100                    On  | 00000000:26:00.0 Off |                    0 |
| N/A   44C    P0             244W / 700W |  24350MiB / 95830MiB |     44%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100                    On  | 00000000:46:00.0 Off |                    0 |
| N/A   40C    P0             173W / 700W |   2660MiB / 95830MiB |      1%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100                    On  | 00000000:A6:00.0 Off |                    0 |
| N/A   45C    P0             247W / 700W |   2658MiB / 95830MiB |     43%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100                    On  | 00000000:C6:00.0 Off |                    0 |
| N/A   47C    P0             222W / 700W |   2654MiB / 95830MiB |     25%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

Here are some results using the WCSPH benchmark with 65M particles.

| | 1 GPU device memory (main) | 1 GPU unified memory | 4 GPUs device memory | 4 GPUs unified memory | 2 GPUs unified memory |
|---|---|---|---|---|---|
| FP64 | 827.574 ms | 827.062 ms | 6.665 s | 642.866 ms | 417.144 ms ~ 836.871 ms |
| FP32 | 421.291 ms | 420.933 ms | 4.191 s | 328.196 ms | 426.163 ms |

As you can see, there is no difference between device and unified memory on a single GPU.
Device memory with 4 GPUs is unsurprisingly very slow. Unified memory with 4 GPUs is slightly faster than a single GPU, but the speedup is underwhelming.

On 2 GPUs, things get interesting. Most of the time, the runtime is very similar to 1 GPU, but in about 1 out of 20 runs it's almost twice as fast:

BenchmarkTools.Trial: 30 samples with 1 evaluation.
 Range (min … max):  416.990 ms … 851.266 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     838.658 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   811.535 ms ± 107.264 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                             █   
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃█▄ ▁
  417 ms           Histogram: frequency by time          851 ms <

Memory estimate: 68.70 KiB, allocs estimate: 543.

So it seems that the two GPUs only sometimes work in parallel, while most of the time only one is active.
I have no idea why this is happening.

CC @vchuravy

@efaulhaber efaulhaber added the gpu label Dec 23, 2024
@efaulhaber efaulhaber self-assigned this Dec 24, 2024
@efaulhaber (Member, Author) commented:

Update

With help from the Julia Slack, I found out that a synchronization happens when an array is accessed from another stream. I have now disabled both the prefetching (see my first post) and this synchronization, and I get the following amazing results:

| | 1 GPU | 2 GPUs | 4 GPUs |
|---|---|---|---|
| FP64 | 827.574 ms (1x) | 425.970 ms (1.94x) | 220.332 ms (3.76x) |
| FP32 | 421.291 ms (1x) | 219.901 ms (1.92x) | 115.364 ms (3.65x) |


codecov bot commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.

Please upload report for BASE (ef/localmem-kernel@9560a82). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/util.jl | 0.00% | 3 Missing ⚠️ |
Additional details and impacted files
@@                  Coverage Diff                  @@
##             ef/localmem-kernel      #89   +/-   ##
=====================================================
  Coverage                      ?   70.12%           
=====================================================
  Files                         ?       15           
  Lines                         ?      626           
  Branches                      ?        0           
=====================================================
  Hits                          ?      439           
  Misses                        ?      187           
  Partials                      ?        0           
| Flag | Coverage Δ |
|---|---|
| unit | 70.12% <0.00%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.


@efaulhaber efaulhaber changed the base branch from ef/localmem-kernel to main January 10, 2025 23:11
@efaulhaber (Member, Author) commented:

There are some major obstacles in the way of this being merged.

  1. "Disable or make automatic prefetching of unified memory optional" (JuliaGPU/CUDA.jl#2618) and "Ability to opt out of / improved automatic synchronization between tasks for shared array usage" (JuliaGPU/CUDA.jl#2617) need to be solved to get any benefit from using multiple GPUs. I used the branch https://github.com/efaulhaber/CUDA.jl/tree/disable-prefetch as a workaround to produce the results above.
  2. The atomic neighborhood search update does not work on multiple GPUs, so even with these two issues solved, we can't run a full simulation. Atomics with unified memory seem to be a bigger obstacle than the two issues above: "Atomics: configurable scope (for multi-device unified memory)" (JuliaGPU/CUDA.jl#2619).
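For context on obstacle 2, here is a hedged sketch of the kind of atomic update a cell-list neighborhood search does (the names and layout are hypothetical, not the actual neighborhood-search code):

```julia
using CUDA

# Each particle atomically increments the counter of the grid cell it
# belongs to. `cellindex[i]` maps particle i to its cell.
function count_particles!(cellcount, cellindex)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(cellindex)
        # CUDA.@atomic lowers to a device-scope atomic. When two GPUs
        # increment the same unified-memory counter concurrently,
        # device-scope atomics do not guarantee that updates from the
        # other device are seen; a system-scope atomic would be needed,
        # which is what JuliaGPU/CUDA.jl#2619 asks for.
        @inbounds CUDA.@atomic cellcount[cellindex[i]] += 1
    end
    return nothing
end
```

With only one GPU updating the counters this is fine; the problem appears exactly when the update is split across devices sharing the same unified array.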

@sloede (Member) commented Jan 16, 2025:

> Update
>
> With help in the Julia Slack, I found out that there is a synchronization happening when an array is accessed from another stream. I now disabled both the prefetching (see my first post) and this synchronization, and I get the following amazing results:
>
> | | 1 GPU | 2 GPUs | 4 GPUs |
> |---|---|---|---|
> | FP64 | 827.574 ms (1x) | 425.970 ms (1.94x) | 220.332 ms (3.76x) |
> | FP32 | 421.291 ms (1x) | 219.901 ms (1.92x) | 115.364 ms (3.65x) |

Amazing - great work!
