Support System Allocated Memory (SAM) #701
base: branch-24.10
Conversation
Thanks for the PR! Exciting to have SAM. Minor questions.
```diff
@@ -98,7 +98,10 @@ cat <<EOF
     --spark_confs spark.python.worker.reuse=true \
     --spark_confs spark.master=local[$local_threads] \
     --spark_confs spark.driver.memory=128g \
-    --spark_confs spark.rapids.ml.uvm.enabled=true
+    --spark_confs spark.rapids.ml.uvm.enabled=false \
+    --spark_confs spark.rapids.ml.sam.enabled=true \
```
Does sam work on non-GH machines? Will it fall back to uvm?
Only when HMM is supported:
```console
$ nvidia-smi -q | grep Addressing
    Addressing Mode : HMM
```
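As a concrete illustration of that check, here is a minimal Python sketch that shells out to nvidia-smi and looks for the addressing mode. The command and the HMM/ATS strings come from the discussion in this thread; the function name and fallback behavior are illustrative, not part of the PR.

```python
import subprocess


def hmm_or_ats_supported() -> bool:
    """Best-effort check: does `nvidia-smi -q` report HMM or ATS addressing mode?"""
    try:
        out = subprocess.run(
            ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return False
    for line in out.splitlines():
        # Expect a line like "Addressing Mode : HMM" (or "ATS" on Grace Hopper).
        if "Addressing Mode" in line:
            return line.split(":")[-1].strip() in ("HMM", "ATS")
    return False


if __name__ == "__main__":
    print("SAM usable:", hmm_or_ats_supported())
```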
What will happen if HMM is not supported but we've enabled SAM?
Invoking the RMM system memory resource would cause a crash. I guess we should figure out which default value is most convenient for us.
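To make the fallback idea concrete, here is a hedged sketch of what a guarded default could look like with the rmm Python bindings. It assumes the installed RMM version exposes SystemMemoryResource and SamHeadroomMemoryResource with these constructors; the policy itself, falling back to managed (UVM) memory when HMM/ATS is absent, only illustrates the suggestion above and is not the PR's actual behavior.

```python
from typing import Optional

import rmm.mr


def configure_memory_resource(sam_enabled: bool, headroom_bytes: Optional[int] = None) -> None:
    """Illustrative policy: use the system (SAM) memory resource only when the
    platform reports HMM/ATS support; otherwise fall back to managed (UVM) memory."""
    if sam_enabled and hmm_or_ats_supported():  # helper sketched earlier in this thread
        if headroom_bytes:
            # Reserve some GPU memory as headroom for other CUDA calls.
            mr = rmm.mr.SamHeadroomMemoryResource(headroom_bytes)
        else:
            mr = rmm.mr.SystemMemoryResource()
    else:
        # Managed memory works without HMM, avoiding the crash described above.
        mr = rmm.mr.ManagedMemoryResource()
    rmm.mr.set_current_device_resource(mr)
```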
Agree.
Does our CI machine support SAM? Nightly CI executes run_benchmark.sh with an A100 40GB GPU, I think.
It should work as long as we install the open source driver. These are the requirements:
- NVIDIA CUDA 12.2 with the open-source r535_00 driver or newer.
- A sufficiently recent Linux kernel: 6.1.24+, 6.2.11+, or 6.3+.
- A GPU with one of the following supported architectures: NVIDIA Turing, NVIDIA Ampere, NVIDIA Ada Lovelace, NVIDIA Hopper, or newer.
- A 64-bit x86 CPU.
docs/site/configuration.md
| spark.rapids.ml.uvm.enabled | false | if set to true, enables [CUDA unified virtual memory](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) (aka managed memory) during estimator.fit() operations to allow processing of larger datasets than would fit in GPU memory |
| spark.rapids.ml.sam.enabled | false | if set to true, enables System Allocated Memory (SAM) on [HMM](https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/) or [ATS](https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/) systems during estimator.fit() operations to allow processing of larger datasets than would fit in GPU memory |
| spark.rapids.ml.sam.headroom | None | when using System Allocated Memory (SAM) and GPU memory is oversubscribed, we may need to reserve some GPU memory as headroom to allow other CUDA calls to function without running out memory. Set a size appropriate for your application |
| spark.executorEnv.CUPY_ENABLE_SAM | 0 | if set to 1, enables System Allocated Memory (SAM) for CuPy operations. This enabled CuPy to work with SAM, and also avoid unnecessary memory coping |
Probably we also need to add spark.driverEnv.CUPY_ENABLE_SAM when Spark is in local mode?
Done.
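For readers wiring this up in local mode, here is a small sketch of the resolved configuration on the PySpark side. The property names come from the table and the comment above; the session-building code itself is only an example.

```python
from pyspark.sql import SparkSession

# Property names taken from docs/site/configuration.md and this thread; values are examples.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.rapids.ml.uvm.enabled", "false")
    .config("spark.rapids.ml.sam.enabled", "true")
    # Let CuPy use SAM on the executors, and on the driver as well for local mode.
    .config("spark.executorEnv.CUPY_ENABLE_SAM", "1")
    .config("spark.driverEnv.CUPY_ENABLE_SAM", "1")
    .getOrCreate()
)
```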
docs/site/configuration.md
| spark.executorEnv.CUPY_ENABLE_SAM | 0 | if set to 1, enables System Allocated Memory (SAM) for CuPy operations. This enabled CuPy to work with SAM, and also avoid unnecessary memory coping |
enabled Cupy -> enables Cupy
coping -> copying
Done.
Allows running benchmarks using System Allocated Memory (SAM) on HMM and Grace Hopper systems. This depends on cupy/cupy#8442.
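To round out the picture, here is a hedged end-to-end sketch of fitting a spark-rapids-ml estimator with SAM enabled (not part of this PR). The KMeans import, the dataset path, and the "4g" headroom value format are assumptions; the configuration keys come from the documentation table quoted above.

```python
from pyspark.sql import SparkSession
from spark_rapids_ml.clustering import KMeans  # assumed estimator; any spark-rapids-ml estimator applies

spark = (
    SparkSession.builder
    .config("spark.rapids.ml.sam.enabled", "true")
    # Headroom reserves GPU memory for other CUDA calls; the "4g" value format is an assumption.
    .config("spark.rapids.ml.sam.headroom", "4g")
    .config("spark.executorEnv.CUPY_ENABLE_SAM", "1")
    .getOrCreate()
)

df = spark.read.parquet("path/to/features.parquet")  # placeholder dataset
model = KMeans(k=8, featuresCol="features").fit(df)  # fit() is where the SAM/UVM settings take effect
```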