Support System Allocated Memory (SAM) #701

Draft
wants to merge 16 commits into
base: branch-24.10

rongou
Collaborator

@rongou rongou commented Aug 5, 2024

Allow running benchmarks with System Allocated Memory (SAM) on HMM and Grace Hopper systems. This depends on cupy/cupy#8442.

Signed-off-by: Rong Ou <[email protected]>
Collaborator

@lijinf2 lijinf2 left a comment

Thanks for the PR! Exciting to have SAM. Minor questions.

@@ -98,7 +98,10 @@ cat <<EOF
 --spark_confs spark.python.worker.reuse=true \
 --spark_confs spark.master=local[$local_threads] \
 --spark_confs spark.driver.memory=128g \
---spark_confs spark.rapids.ml.uvm.enabled=true
+--spark_confs spark.rapids.ml.uvm.enabled=false \
+--spark_confs spark.rapids.ml.sam.enabled=true \
Collaborator

Does SAM work on non-GH machines? Will it fall back to UVM?

Collaborator Author

Only when HMM is supported:

$ nvidia-smi -q | grep Addressing
    Addressing Mode                       : HMM
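
The same check can be scripted, for example to guard a benchmark launch. This is a minimal sketch, assuming the `Addressing Mode` line keeps the format shown above; `parse_addressing_mode` and `query_addressing_mode` are hypothetical helper names and are not part of this PR:

```python
import subprocess
from typing import Optional


def parse_addressing_mode(nvidia_smi_output: str) -> Optional[str]:
    """Return the value of the first 'Addressing Mode' line, or None if absent."""
    for line in nvidia_smi_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Addressing Mode":
            return value.strip()
    return None


def query_addressing_mode() -> Optional[str]:
    """Run `nvidia-smi -q` and parse its output; None if the tool is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_addressing_mode(result.stdout)


# Parsing the line quoted in the comment above.
sample = "    Addressing Mode                       : HMM\n"
print(parse_addressing_mode(sample))  # HMM
```

Keeping the parsing separate from the subprocess call makes the logic testable on machines without a GPU.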

Collaborator

@wbo4958 wbo4958 Aug 5, 2024

What will happen if HMM is not supported but we've enabled SAM?

Collaborator Author

Invoking the RMM system memory resource (mr) would cause a crash. I guess we should figure out which default value is the most convenient for us.
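
One way to avoid that crash is to decide the memory mode before any allocator is touched. The sketch below assumes SAM is usable only when the reported addressing mode is `HMM` or `ATS`. Falling back to UVM is just one possible default, not something this thread settles, and `resolve_memory_confs` is a hypothetical helper:

```python
from typing import Dict, Optional

# Assumption: addressing modes under which SAM can be used safely.
SAM_CAPABLE_MODES = {"HMM", "ATS"}


def resolve_memory_confs(
    sam_requested: bool, addressing_mode: Optional[str]
) -> Dict[str, str]:
    """Enable SAM only when requested and supported; otherwise fall back to UVM."""
    sam_ok = sam_requested and addressing_mode in SAM_CAPABLE_MODES
    # UVM is switched on only as a fallback for an unsupported SAM request.
    uvm_fallback = sam_requested and not sam_ok
    return {
        "spark.rapids.ml.sam.enabled": str(sam_ok).lower(),
        "spark.rapids.ml.uvm.enabled": str(uvm_fallback).lower(),
    }


print(resolve_memory_confs(True, "HMM"))
print(resolve_memory_confs(True, None))
```

The first call keeps SAM on; the second, where no supported addressing mode is reported, turns SAM off and UVM on instead of letting the job crash.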

Collaborator

Agree.
Does our CI machine support RMM? I think the nightly CI runs run_benchmark.sh on an A100 40G GPU.

Collaborator Author

It should work as long as we install the open source driver. These are the requirements:

  • NVIDIA CUDA 12.2 with the open-source r535_00 driver or newer.
  • A sufficiently recent Linux kernel: 6.1.24+, 6.2.11+, or 6.3+.
  • A GPU with one of the following supported architectures: NVIDIA Turing, NVIDIA Ampere, NVIDIA Ada Lovelace, NVIDIA Hopper, or newer.
  • A 64-bit x86 CPU.
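
The kernel requirement above can be checked mechanically on a CI host. This sketch reads "6.1.24+, 6.2.11+, or 6.3+" as a minimum patch level within the 6.1 and 6.2 series and any release from 6.3 onward. That reading, and the helper names, are assumptions rather than something stated in this thread:

```python
import re
from typing import Tuple


def parse_kernel_release(release: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a release string such as '6.2.11-generic'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch component (e.g. '6.3') is treated as patch level 0.
    return int(major), int(minor), int(patch or 0)


def kernel_meets_hmm_requirement(release: str) -> bool:
    """True for 6.1.24+ in the 6.1 series, 6.2.11+ in the 6.2 series, or 6.3 and later."""
    major, minor, patch = parse_kernel_release(release)
    if (major, minor) == (6, 1):
        return patch >= 24
    if (major, minor) == (6, 2):
        return patch >= 11
    return (major, minor) >= (6, 3)


for release in ["5.15.0-91-generic", "6.1.23", "6.1.24", "6.2.11", "6.5.0-14-generic"]:
    print(release, kernel_meets_hmm_requirement(release))
```

On a Linux host, `platform.release()` from the standard library returns a string in this format and can be passed in directly. The driver, GPU architecture, and CPU requirements would still need separate checks.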

python/run_benchmark.sh
| spark.rapids.ml.uvm.enabled | false | if set to true, enables [CUDA unified virtual memory](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) (aka managed memory) during estimator.fit() operations to allow processing of larger datasets than would fit in GPU memory |
| spark.rapids.ml.sam.enabled | false | if set to true, enables System Allocated Memory (SAM) on [HMM](https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/) or [ATS](https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/) systems during estimator.fit() operations to allow processing of larger datasets than would fit in GPU memory |
| spark.rapids.ml.sam.headroom | None | when using System Allocated Memory (SAM) and GPU memory is oversubscribed, we may need to reserve some GPU memory as headroom to allow other CUDA calls to function without running out memory. Set a size appropriate for your application |
| spark.executorEnv.CUPY_ENABLE_SAM | 0 | if set to 1, enables System Allocated Memory (SAM) for CuPy operations. This enabled CuPy to work with SAM, and also avoid unnecessary memory coping |
Collaborator

Do we also need to add spark.driverEnv.CUPY_ENABLE_SAM when Spark runs in local mode?

Collaborator Author

Done.
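
For illustration, the SAM-related settings discussed in this thread can be collected in one place and turned into `--spark_confs` flags like those in `run_benchmark.sh`. This is a sketch, not the script's actual code. The configuration keys are taken from the table and the comment above, `to_spark_conf_flags` is a hypothetical helper, and `spark.rapids.ml.sam.headroom` is left out because the thread gives no example value for it:

```python
from typing import Dict, List

# Keys as they appear in this thread; values switch from UVM to SAM.
sam_confs: Dict[str, str] = {
    "spark.rapids.ml.uvm.enabled": "false",
    "spark.rapids.ml.sam.enabled": "true",
    "spark.executorEnv.CUPY_ENABLE_SAM": "1",
    # Suggested in the review for local mode, where work runs in the driver process.
    "spark.driverEnv.CUPY_ENABLE_SAM": "1",
}


def to_spark_conf_flags(confs: Dict[str, str]) -> List[str]:
    """Render each key/value pair as a '--spark_confs key=value' argument pair."""
    flags: List[str] = []
    for key, value in confs.items():
        flags += ["--spark_confs", f"{key}={value}"]
    return flags


print(" ".join(to_spark_conf_flags(sam_confs)))
```

Keeping the settings in a single mapping makes it harder to enable SAM for the estimator while forgetting the matching CuPy environment variable.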

| spark.rapids.ml.uvm.enabled | false | if set to true, enables [CUDA unified virtual memory](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) (aka managed memory) during estimator.fit() operations to allow processing of larger datasets than would fit in GPU memory |
| spark.rapids.ml.sam.enabled | false | if set to true, enables System Allocated Memory (SAM) on [HMM](https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/) or [ATS](https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/) systems during estimator.fit() operations to allow processing of larger datasets than would fit in GPU memory |
| spark.rapids.ml.sam.headroom | None | when using System Allocated Memory (SAM) and GPU memory is oversubscribed, we may need to reserve some GPU memory as headroom to allow other CUDA calls to function without running out memory. Set a size appropriate for your application |
| spark.executorEnv.CUPY_ENABLE_SAM | 0 | if set to 1, enables System Allocated Memory (SAM) for CuPy operations. This enabled CuPy to work with SAM, and also avoid unnecessary memory coping |
Collaborator

enabled Cupy -> enables Cupy
coping -> copying

Collaborator Author

Done.

4 participants