# Add RMM User Guide #2087

Draft · bdice wants to merge 44 commits into rapidsai:main from bdice:docs-overhaul

## Commits
- 3ad1eec Add RMM User Guide
- d48c2bd Sort extensions
- c7cb90a Add myst-parser
- a7abcfa Fix Notes section
- 1a13b9e Fix warnings in docs builds
- f069318 Move content around
- e75e6ab Restructure docs
- 23eac43 Use eval-rst for Python autodoc.
- 0815903 Restructure Python docs
- 3fa3e65 Initial rmm.librmm / rmm.pylibrmm docs
- 7823303 Overhauling C++ docs
- 377f709 Merge remote-tracking branch 'upstream/main' into docs-overhaul
- 32da0ed Cleanup
- 3583068 Fix wrapping
- 423b1a4 Ignore latex docs
- 06168d2 More overhaul
- 6036060 Update docstrings
- e1e8ffa Improve docstrings
- c1fc01c Merge remote-tracking branch 'upstream/main' into docs-overhaul
- bc49e92 Merge remote-tracking branch 'upstream/main' into docs-overhaul
- eb40dd8 Fix warnings, add changes from 2137
- bc38482 Merge remote-tracking branch 'upstream/main' into docs-overhaul
- 3c36640 Merge remote-tracking branch 'upstream/main' into docs-overhaul
- c06a09f Merge remote-tracking branch 'upstream/main' into docs-overhaul
- b49da3e Merge remote-tracking branch 'upstream/main' into docs-overhaul
- f0f6761 Remove "-c nvidia"
- 8682ecf Remove host memory resources
- 2d136ea Update introduction
- 6e2812a Avoid teaching rmm.reinitialize
- 81d7229 Improve introduction
- 153ded7 Remove stream-ordered allocation from introduction.
- 5162c1b Containers, not data structures
- 558a159 Introduction revisions
- 90b0652 Update installation page
- e76c9c8 Merge remote-tracking branch 'upstream/main' into docs-overhaul
- 9ccbdb6 Temporary consolidation of C++ docs
- 3d69641 Add sphinx-tabs
- 6bd80e7 Fuse C++ and Python programming guides
- 53c03d4 Align code snippets
- 8eb4339 Require CUDA 12.2
- c987aa3 Improve code snippets
- 59ac950 Improve integration descriptions
- c56554a Improve imports/includes
- 770ee19 Refactoring choosing memory resources
## Changes from all commits

The diff below adds a new documentation page (+301 lines):
# Choosing a Memory Resource

One of the most common questions when using RMM is: "Which memory resource should I use?"

This guide provides recommendations for selecting the appropriate memory resource based on your application's needs.

## Recommended Defaults

For most applications, use the CUDA async memory pool.
`````{tabs}
````{code-tab} c++
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

rmm::mr::cuda_async_memory_resource mr;
rmm::mr::set_current_device_resource_ref(mr);
````
````{code-tab} python
import rmm

mr = rmm.mr.CudaAsyncMemoryResource()
rmm.mr.set_current_device_resource(mr)
````
`````
For applications exceeding GPU memory limits, use a pooled managed memory resource with prefetching. Note: managed memory is not supported on WSL2 systems.
`````{tabs}
````{code-tab} c++
#include <rmm/mr/device/managed_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/device/prefetch_resource_adaptor.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/cuda_device.hpp>

// Use 80% of GPU memory, rounded down to the nearest 256 bytes
auto [free_memory, total_memory] = rmm::available_device_memory();
std::size_t pool_size = (static_cast<std::size_t>(total_memory * 0.8) / 256) * 256;

rmm::mr::managed_memory_resource managed_mr;
rmm::mr::pool_memory_resource pool_mr{managed_mr, pool_size};
rmm::mr::prefetch_resource_adaptor prefetch_mr{pool_mr};
rmm::mr::set_current_device_resource_ref(prefetch_mr);
````
````{code-tab} python
import rmm

# Use 80% of GPU memory, rounded down to the nearest 256 bytes
free_memory, total_memory = rmm.mr.available_device_memory()
pool_size = int(total_memory * 0.8) // 256 * 256

mr = rmm.mr.PrefetchResourceAdaptor(
    rmm.mr.PoolMemoryResource(
        rmm.mr.ManagedMemoryResource(),
        initial_pool_size=pool_size,
    )
)
rmm.mr.set_current_device_resource(mr)
````
`````
## Memory Resource Considerations

It is usually best to use resources that allow the CUDA driver to manage pool suballocation via `cudaMallocFromPoolAsync`.

### CudaAsyncMemoryResource

The `CudaAsyncMemoryResource` uses CUDA's driver-managed memory pool (via `cudaMallocAsync`). This is the **recommended default** for most applications.

**Advantages:**
- **Driver-managed pool**: Uses efficient suballocation with virtual addressing to avoid fragmentation
- **Cross-library sharing**: The pool can be shared across multiple applications and libraries, even those not using RMM directly
- **Stream-ordered semantics**: Allocations and deallocations are stream-ordered by default
- **Performance**: Similar or better performance compared to RMM's pool implementations

**When to use:**
- Default choice for GPU-accelerated applications
- Multi-stream or multi-threaded applications
- Applications using multiple GPU libraries (e.g., cuDF + PyTorch)
- Most production workloads
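The async resource works well with no arguments, but the driver pool can be tuned at construction. A minimal sketch, assuming the optional `initial_pool_size` and `release_threshold` keyword arguments of the Python class:

```python
import rmm

# Hedged example: both keyword arguments are assumed optional tuning knobs.
mr = rmm.mr.CudaAsyncMemoryResource(
    initial_pool_size=2**30,   # assumption: pre-populate the driver pool with 1 GiB
    release_threshold=2**32,   # assumption: keep up to 4 GiB cached in the pool
)
rmm.mr.set_current_device_resource(mr)
```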
### CudaMemoryResource

The `CudaMemoryResource` uses `cudaMalloc` directly for each allocation, with no pooling.

**Advantages:**
- Simple and predictable
- No fragmentation concerns
- Memory is immediately returned to the system on deallocation

**Disadvantages:**
- Slower than pooled allocators due to synchronization overhead

**Example:**
```python
import rmm

rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())
```

**When to use:**
- Simple applications with infrequent allocations
- Debugging memory issues
- Testing or benchmarking baseline performance
### PoolMemoryResource

The `PoolMemoryResource` maintains a pool of memory allocated from an upstream resource.

**Advantages:**
- Fast suballocation from a pre-allocated pool
- Configurable initial and maximum pool sizes

**Disadvantages:**
- Can suffer from fragmentation (unlike the async MR)
- The pool is not shared across applications
- Requires careful tuning of pool sizes

**Example:**
```python
import rmm

pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),  # upstream resource
    initial_pool_size=2**30,      # 1 GiB
    maximum_pool_size=2**32,      # 4 GiB
)
rmm.mr.set_current_device_resource(pool)
```

**When to use:**
- Legacy applications (prefer `CudaAsyncMemoryResource` for new code)
- Specific tuning requirements not met by the async MR
- Wrapping non-CUDA memory sources

**Important**: If using `PoolMemoryResource`, prefer wrapping `CudaAsyncMemoryResource` as the upstream rather than `CudaMemoryResource`:

```python
# Better: pool wrapping the async MR
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaAsyncMemoryResource(),
    initial_pool_size=2**30,
)
```

This combines the benefits of both: fast suballocation from RMM's pool and the driver's virtual addressing capabilities.
### ManagedMemoryResource

The `ManagedMemoryResource` uses CUDA unified memory (via `cudaMallocManaged`), allowing memory to be accessed from both the CPU and the GPU.

**Advantages:**
- Enables working with datasets larger than GPU memory
- Automatic page migration between CPU and GPU
- Simplifies memory management for host/device code

**Disadvantages:**
- Performance overhead due to page faults and migration
- Requires careful prefetching for optimal performance

**Example:**
```python
import rmm

rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
```

> **Review comment (on this example):** I'm hesitant to include this because people will copy-paste it and see the poor performance that usually causes people to reject managed memory.

**When to use:**
- Datasets larger than available GPU memory
- Prototyping, or applications where performance is not critical
- Always combine with prefetching strategies (see the [Managed Memory guide](managed_memory.md))
### ArenaMemoryResource

The `ArenaMemoryResource` divides a large allocation into size-binned arenas, reducing fragmentation.

**Advantages:**
- Better fragmentation characteristics than the basic pool
- Good for mixed allocation sizes
- Predictable performance

**Disadvantages:**
- More complex configuration
- May waste memory if bin sizes don't match allocation patterns

**Example:**
```python
import rmm

arena = rmm.mr.ArenaMemoryResource(
    rmm.mr.CudaMemoryResource(),
    arena_size=2**28,  # 256 MiB global arena
)
rmm.mr.set_current_device_resource(arena)
```

**When to use:**
- Applications with diverse allocation sizes
- Long-running services with complex allocation patterns
- When fragmentation is observed with pool allocators
## Composing Memory Resources

Memory resources can be composed (wrapped) to combine their properties. The general pattern is:

```python
# An adaptor wrapping a base resource (`SomeAdaptor` is a placeholder name)
adaptor = rmm.mr.SomeAdaptor(base_resource)
```
### Common Compositions

**Prefetching with managed memory:**
```python
import rmm

# Prefetch adaptor wrapping a managed-memory pool
base = rmm.mr.ManagedMemoryResource()
pool = rmm.mr.PoolMemoryResource(base, initial_pool_size=2**30)
prefetch = rmm.mr.PrefetchResourceAdaptor(pool)
rmm.mr.set_current_device_resource(prefetch)
```

**Statistics tracking:**
```python
import rmm

# Track allocation statistics
base = rmm.mr.CudaAsyncMemoryResource()
stats = rmm.mr.StatisticsResourceAdaptor(base)
rmm.mr.set_current_device_resource(stats)
```
## Multi-Library Applications

When using RMM with multiple GPU libraries (e.g., cuDF, PyTorch, CuPy), `CudaAsyncMemoryResource` is especially important because:

1. The driver-managed pool is shared automatically across all libraries
2. You don't need to configure every library to use RMM
3. Memory is not artificially partitioned between libraries

**Example: RMM + PyTorch**
```python
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator

# Use the async MR as the base
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

# Configure PyTorch to use RMM
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
```

With this setup, both PyTorch and any other RMM-using code (like cuDF) will share the same driver-managed pool.
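CuPy can be routed through the same resource; a short sketch using RMM's CuPy allocator hook:

```python
import cupy
import rmm
from rmm.allocators.cupy import rmm_cupy_allocator

# Share the driver-managed pool with CuPy as well.
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
cupy.cuda.set_allocator(rmm_cupy_allocator)

x = cupy.arange(10)  # this array is allocated through RMM
```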
## Performance Considerations

### Async MR vs. Pool MR

In most cases, `CudaAsyncMemoryResource` provides similar or better performance than `PoolMemoryResource`:

- Both use pooling for fast suballocation
- The async MR uses virtual addressing to avoid fragmentation
- The async MR shares memory across applications

**When Pool MR might be faster:**
- Very specific allocation patterns that align well with the pool design
- Custom upstream resources (not CUDA memory)
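If you suspect your workload falls into one of these cases, measure it rather than guessing. A rough micro-benchmark sketch (host-side timing only; every buffer is freed within its own phase before the resource is switched):

```python
import time
import rmm

def bench(mr, n=1000, size=1 << 20):
    # Install the resource, then time n allocate/free cycles of `size` bytes.
    rmm.mr.set_current_device_resource(mr)
    start = time.perf_counter()
    for _ in range(n):
        buf = rmm.DeviceBuffer(size=size)
        del buf  # freed immediately, before the next allocation
    return time.perf_counter() - start

print("async:", bench(rmm.mr.CudaAsyncMemoryResource()))
print("pool: ", bench(rmm.mr.PoolMemoryResource(rmm.mr.CudaAsyncMemoryResource(),
                                                initial_pool_size=2**30)))
```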
### Multi-stream Applications

For applications using multiple CUDA streams or threads:

- `CudaAsyncMemoryResource` is **strongly recommended**
- Pool allocators can create "pipeline bubbles" where streams wait for allocations
- The async MR handles stream synchronization efficiently
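As a concrete illustration, building on the CuPy setup shown earlier (standard CuPy stream API only), each allocation below is stream-ordered on the stream that is current when it is made:

```python
import cupy
import rmm
from rmm.allocators.cupy import rmm_cupy_allocator

rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
cupy.cuda.set_allocator(rmm_cupy_allocator)

s1, s2 = cupy.cuda.Stream(), cupy.cuda.Stream()
with s1:
    a = cupy.zeros(1_000_000)  # allocated stream-ordered on s1
with s2:
    b = cupy.ones(1_000_000)   # allocated stream-ordered on s2
s1.synchronize()
s2.synchronize()
```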
## Best Practices

1. **Set the memory resource before any allocations**: Once memory is allocated, changing the resource can lead to crashes

```python
import rmm

# Do this first, before any GPU allocations
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
```

2. **Prefer the async MR by default**: Unless you have specific requirements, start with `CudaAsyncMemoryResource`

3. **Use statistics for tuning**: If you need to understand allocation patterns, wrap your resource with `StatisticsResourceAdaptor`, as sketched below
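A sketch of reading the numbers back, assuming the adaptor's `allocation_counts` property in the Python API:

```python
import rmm

stats_mr = rmm.mr.StatisticsResourceAdaptor(rmm.mr.CudaAsyncMemoryResource())
rmm.mr.set_current_device_resource(stats_mr)

buf = rmm.DeviceBuffer(size=2**20)  # a 1 MiB allocation to observe

# assumption: allocation_counts reports current/peak/total bytes and counts
print(stats_mr.allocation_counts)
```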
4. **Don't over-engineer**: Start simple, profile, and optimize only if needed

## See Also

- [Pool Allocators](pool_allocators.md) - Detailed guide on pool and arena allocators
- [Managed Memory](managed_memory.md) - Guide to using managed memory and prefetching
- [Stream-Ordered Allocation](stream_ordered_allocation.md) - Understanding stream-ordered semantics
> **Review comment:** Especially in multi-stream workloads. Single-stream seems to be less impacted.