Support System Allocated Memory (SAM) #701
base: branch-24.10
Conversation
Thanks for the PR! Exciting to have SAM. Minor questions.
```diff
@@ -98,7 +98,10 @@ cat <<EOF
     --spark_confs spark.python.worker.reuse=true \
     --spark_confs spark.master=local[$local_threads] \
     --spark_confs spark.driver.memory=128g \
-    --spark_confs spark.rapids.ml.uvm.enabled=true
+    --spark_confs spark.rapids.ml.uvm.enabled=false \
+    --spark_confs spark.rapids.ml.sam.enabled=true \
```
Does sam work on non-GH machines? Will it fall back to uvm?
Only when HMM is supported:
```console
$ nvidia-smi -q | grep Addressing
    Addressing Mode : HMM
```
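As a concrete illustration of that check, here is a minimal Python sketch that shells out to nvidia-smi and looks for the addressing mode. The command and the HMM/ATS strings come from the discussion in this thread; the function name and fallback behavior are illustrative, not part of the PR.

```python
import subprocess


def hmm_or_ats_supported() -> bool:
    """Best-effort check: does `nvidia-smi -q` report HMM or ATS addressing mode?"""
    try:
        out = subprocess.run(
            ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return False
    for line in out.splitlines():
        # Expect a line like "Addressing Mode : HMM" (or "ATS" on Grace Hopper).
        if "Addressing Mode" in line:
            return line.split(":")[-1].strip() in ("HMM", "ATS")
    return False


if __name__ == "__main__":
    print("SAM usable:", hmm_or_ats_supported())
```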
What will happen if HMM is not supported but we've enabled SAM?
Invoking the RMM system memory resource would cause a crash. I guess we should figure out which default value is most convenient for us.
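To make the fallback idea concrete, here is a hedged sketch of what a guarded default could look like with the rmm Python bindings. It assumes the installed RMM version exposes SystemMemoryResource and SamHeadroomMemoryResource with these constructors; the policy itself, falling back to managed (UVM) memory when HMM/ATS is absent, only illustrates the suggestion above and is not the PR's actual behavior.

```python
from typing import Optional

import rmm.mr


def configure_memory_resource(sam_enabled: bool, headroom_bytes: Optional[int] = None) -> None:
    """Illustrative policy: use the system (SAM) memory resource only when the
    platform reports HMM/ATS support; otherwise fall back to managed (UVM) memory."""
    if sam_enabled and hmm_or_ats_supported():  # helper sketched earlier in this thread
        if headroom_bytes:
            # Reserve some GPU memory as headroom for other CUDA calls.
            mr = rmm.mr.SamHeadroomMemoryResource(headroom_bytes)
        else:
            mr = rmm.mr.SystemMemoryResource()
    else:
        # Managed memory works without HMM, avoiding the crash described above.
        mr = rmm.mr.ManagedMemoryResource()
    rmm.mr.set_current_device_resource(mr)
```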
Agree.
Does our CI machine support SAM? Nightly CI executes run_benchmark.sh with an A100 40GB GPU, I think.
It should work as long as we install the open source driver. These are the requirements:
- NVIDIA CUDA 12.2 with the open-source r535_00 driver or newer.
- A sufficiently recent Linux kernel: 6.1.24+, 6.2.11+, or 6.3+.
- A GPU with one of the following supported architectures: NVIDIA Turing, NVIDIA Ampere, NVIDIA Ada Lovelace, NVIDIA Hopper, or newer.
- A 64-bit x86 CPU.
docs/site/configuration.md
| spark.rapids.ml.uvm.enabled | false | if set to true, enables [CUDA unified virtual memory](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) (aka managed memory) during estimator.fit() operations to allow processing of larger datasets than would fit in GPU memory |
| spark.rapids.ml.sam.enabled | false | if set to true, enables System Allocated Memory (SAM) on [HMM](https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/) or [ATS](https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/) systems during estimator.fit() operations to allow processing of larger datasets than would fit in GPU memory |
| spark.rapids.ml.sam.headroom | None | when using System Allocated Memory (SAM) and GPU memory is oversubscribed, we may need to reserve some GPU memory as headroom to allow other CUDA calls to function without running out memory. Set a size appropriate for your application |
| spark.executorEnv.CUPY_ENABLE_SAM | 0 | if set to 1, enables System Allocated Memory (SAM) for CuPy operations. This enabled CuPy to work with SAM, and also avoid unnecessary memory coping |
Probably we also need to add spark.driverEnv.CUPY_ENABLE_SAM when Spark is in local mode?
Done.
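For readers wiring this up in local mode, here is a small sketch of the resolved configuration on the PySpark side. The property names come from the table and the comment above; the session-building code itself is only an example.

```python
from pyspark.sql import SparkSession

# Property names taken from docs/site/configuration.md and this thread; values are examples.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.rapids.ml.uvm.enabled", "false")
    .config("spark.rapids.ml.sam.enabled", "true")
    # Let CuPy use SAM on the executors, and on the driver as well for local mode.
    .config("spark.executorEnv.CUPY_ENABLE_SAM", "1")
    .config("spark.driverEnv.CUPY_ENABLE_SAM", "1")
    .getOrCreate()
)
```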
docs/site/configuration.md
| spark.executorEnv.CUPY_ENABLE_SAM | 0 | if set to 1, enables System Allocated Memory (SAM) for CuPy operations. This enabled CuPy to work with SAM, and also avoid unnecessary memory coping |
enabled Cupy -> enables Cupy
coping -> copying
Done.
Allows running benchmarks using System Allocated Memory (SAM) on HMM and Grace Hopper systems. This depends on cupy/cupy#8442.
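To round out the picture, here is a hedged end-to-end sketch of fitting a spark-rapids-ml estimator with SAM enabled (not part of this PR). The KMeans import, the dataset path, and the "4g" headroom value format are assumptions; the configuration keys come from the documentation table quoted above.

```python
from pyspark.sql import SparkSession
from spark_rapids_ml.clustering import KMeans  # assumed estimator; any spark-rapids-ml estimator applies

spark = (
    SparkSession.builder
    .config("spark.rapids.ml.sam.enabled", "true")
    # Headroom reserves GPU memory for other CUDA calls; the "4g" value format is an assumption.
    .config("spark.rapids.ml.sam.headroom", "4g")
    .config("spark.executorEnv.CUPY_ENABLE_SAM", "1")
    .getOrCreate()
)

df = spark.read.parquet("path/to/features.parquet")  # placeholder dataset
model = KMeans(k=8, featuresCol="features").fit(df)  # fit() is where the SAM/UVM settings take effect
```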