Draft

Changes from all commits (44 commits)
3ad1eec
Add RMM User Guide
bdice Oct 11, 2025
d48c2bd
Sort extensions
bdice Oct 13, 2025
c7cb90a
Add myst-parser
bdice Oct 13, 2025
a7abcfa
Fix Notes section
bdice Oct 13, 2025
1a13b9e
Fix warnings in docs builds
bdice Oct 13, 2025
f069318
Move content around
bdice Oct 13, 2025
e75e6ab
Restructure docs
bdice Oct 13, 2025
23eac43
Use eval-rst for Python autodoc.
bdice Oct 13, 2025
0815903
Restructure Python docs
bdice Oct 15, 2025
3fa3e65
Initial rmm.librmm / rmm.pylibrmm docs
bdice Oct 15, 2025
7823303
Overhauling C++ docs
bdice Nov 4, 2025
377f709
Merge remote-tracking branch 'upstream/main' into docs-overhaul
bdice Nov 4, 2025
32da0ed
Cleanup
bdice Nov 4, 2025
3583068
Fix wrapping
bdice Nov 4, 2025
423b1a4
Ignore latex docs
bdice Nov 4, 2025
06168d2
More overhaul
bdice Nov 4, 2025
6036060
Update docstrings
bdice Nov 4, 2025
e1e8ffa
Improve docstrings
bdice Nov 4, 2025
c1fc01c
Merge remote-tracking branch 'upstream/main' into docs-overhaul
bdice Nov 10, 2025
bc49e92
Merge remote-tracking branch 'upstream/main' into docs-overhaul
bdice Nov 11, 2025
eb40dd8
Fix warnings, add changes from 2137
bdice Nov 11, 2025
bc38482
Merge remote-tracking branch 'upstream/main' into docs-overhaul
bdice Nov 11, 2025
3c36640
Merge remote-tracking branch 'upstream/main' into docs-overhaul
bdice Nov 13, 2025
c06a09f
Merge remote-tracking branch 'upstream/main' into docs-overhaul
bdice Nov 20, 2025
b49da3e
Merge remote-tracking branch 'upstream/main' into docs-overhaul
bdice Nov 26, 2025
f0f6761
Remove "-c nvidia"
bdice Nov 26, 2025
8682ecf
Remove host memory resources
bdice Nov 26, 2025
2d136ea
Update introduction
bdice Nov 26, 2025
6e2812a
Avoid teaching rmm.reinitialize
bdice Nov 26, 2025
81d7229
Improve introduction
bdice Nov 26, 2025
153ded7
Remove stream-ordered allocation from introduction.
bdice Nov 26, 2025
5162c1b
Containers, not data structures
bdice Nov 26, 2025
558a159
Introduction revisions
bdice Nov 26, 2025
90b0652
Update installation page
bdice Nov 26, 2025
e76c9c8
Merge remote-tracking branch 'upstream/main' into docs-overhaul
bdice Dec 2, 2025
9ccbdb6
Temporary consolidation of C++ docs
bdice Dec 2, 2025
3d69641
Add sphinx-tabs
bdice Dec 2, 2025
6bd80e7
Fuse C++ and Python programming guides
bdice Dec 2, 2025
53c03d4
Align code snippets
bdice Dec 2, 2025
8eb4339
Require CUDA 12.2
bdice Dec 2, 2025
c987aa3
Improve code snippets
bdice Dec 2, 2025
59ac950
Improve integration descriptions
bdice Dec 2, 2025
c56554a
Improve imports/includes
bdice Dec 2, 2025
770ee19
Refactoring choosing memory resources
bdice Dec 5, 2025
2 changes: 1 addition & 1 deletion README.md
@@ -36,7 +36,7 @@ RMM can be installed with conda. You can get a minimal conda installation with [
Install RMM with:

```bash
-conda install -c rapidsai -c conda-forge -c nvidia rmm cuda-version=13.0
+conda install -c rapidsai -c conda-forge rmm cuda-version=13.0
```

We also provide [nightly conda packages](https://anaconda.org/rapidsai-nightly) built from the HEAD
1 change: 1 addition & 0 deletions conda/environments/all_cuda-129_arch-aarch64.yaml
@@ -41,6 +41,7 @@ dependencies:
- sphinx
- sphinx-copybutton
- sphinx-markdown-tables
+- sphinx-tabs
- sphinxcontrib-jquery
- sysroot_linux-aarch64==2.28
name: all_cuda-129_arch-aarch64
1 change: 1 addition & 0 deletions conda/environments/all_cuda-129_arch-x86_64.yaml
@@ -41,6 +41,7 @@ dependencies:
- sphinx
- sphinx-copybutton
- sphinx-markdown-tables
+- sphinx-tabs
- sphinxcontrib-jquery
- sysroot_linux-64==2.28
name: all_cuda-129_arch-x86_64
1 change: 1 addition & 0 deletions conda/environments/all_cuda-130_arch-aarch64.yaml
@@ -41,6 +41,7 @@ dependencies:
- sphinx
- sphinx-copybutton
- sphinx-markdown-tables
+- sphinx-tabs
- sphinxcontrib-jquery
- sysroot_linux-aarch64==2.28
name: all_cuda-130_arch-aarch64
1 change: 1 addition & 0 deletions conda/environments/all_cuda-130_arch-x86_64.yaml
@@ -41,6 +41,7 @@ dependencies:
- sphinx
- sphinx-copybutton
- sphinx-markdown-tables
+- sphinx-tabs
- sphinxcontrib-jquery
- sysroot_linux-64==2.28
name: all_cuda-130_arch-x86_64
1 change: 1 addition & 0 deletions dependencies.yaml
@@ -268,6 +268,7 @@ dependencies:
- sphinx
- sphinx-copybutton
- sphinx-markdown-tables
+- sphinx-tabs
- sphinxcontrib-jquery
py_version:
specific:
1 change: 1 addition & 0 deletions docs/conf.py
@@ -57,6 +57,7 @@
"sphinx.ext.intersphinx",
"sphinx_copybutton",
"sphinx_markdown_tables",
"sphinx_tabs.tabs",
"sphinxcontrib.jquery",
]

2 changes: 1 addition & 1 deletion docs/index.md
@@ -6,7 +6,7 @@ RMM (RAPIDS Memory Manager) is a library for allocating and managing GPU memory
:maxdepth: 2
:caption: Contents

-user_guide/guide
+user_guide/index
cpp/index
python/index
```
301 changes: 301 additions & 0 deletions docs/user_guide/choosing_memory_resources.md
@@ -0,0 +1,301 @@
# Choosing a Memory Resource

One of the most common questions when using RMM is: "Which memory resource should I use?"

This guide provides recommendations for selecting the appropriate memory resource based on your application's needs.

## Recommended Defaults

For most applications, use the CUDA async memory pool.

`````{tabs}
````{code-tab} c++
#include <rmm/mr/cuda_async_memory_resource.hpp>
#include <rmm/mr/per_device_resource.hpp>

rmm::mr::cuda_async_memory_resource mr;
rmm::mr::set_current_device_resource_ref(mr);
````
````{code-tab} python
import rmm

mr = rmm.mr.CudaAsyncMemoryResource()
rmm.mr.set_current_device_resource(mr)
````
`````

For applications exceeding GPU memory limits, use a pooled managed memory resource with prefetching. Note: managed memory is not supported on WSL2 systems.

`````{tabs}
````{code-tab} c++
#include <rmm/mr/managed_memory_resource.hpp>
#include <rmm/mr/pool_memory_resource.hpp>
#include <rmm/mr/prefetch_resource_adaptor.hpp>
#include <rmm/mr/per_device_resource.hpp>
#include <rmm/cuda_device.hpp>

// Use 80% of GPU memory, rounded down to nearest 256 bytes
auto [free_memory, total_memory] = rmm::available_device_memory();
std::size_t pool_size = (static_cast<std::size_t>(total_memory * 0.8) / 256) * 256;

rmm::mr::managed_memory_resource managed_mr;
rmm::mr::pool_memory_resource pool_mr{managed_mr, pool_size};
rmm::mr::prefetch_resource_adaptor prefetch_mr{pool_mr};
rmm::mr::set_current_device_resource_ref(prefetch_mr);
````
````{code-tab} python
import rmm

# Use 80% of GPU memory, rounded down to nearest 256 bytes
free_memory, total_memory = rmm.mr.available_device_memory()
pool_size = int(total_memory * 0.8) // 256 * 256

mr = rmm.mr.PrefetchResourceAdaptor(
    rmm.mr.PoolMemoryResource(
        rmm.mr.ManagedMemoryResource(),
        initial_pool_size=pool_size,
    )
)
rmm.mr.set_current_device_resource(mr)
````
`````

## Memory Resource Considerations

It is usually best to use resources that allow the CUDA driver to manage pool suballocation via `cudaMallocFromPoolAsync`.

### CudaAsyncMemoryResource

The `CudaAsyncMemoryResource` uses CUDA's driver-managed memory pool (via `cudaMallocAsync`). This is the **recommended default** for most applications.

**Advantages:**
- **Driver-managed pool**: Uses efficient suballocation with virtual addressing to avoid fragmentation
- **Cross-library sharing**: The pool can be shared across multiple applications and libraries, even those not using RMM directly
- **Stream-ordered semantics**: Allocations and deallocations are stream-ordered by default
- **Performance**: Similar or better performance compared to RMM's pool implementations
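
**Example** (a minimal sketch; the `initial_pool_size` and `release_threshold` keyword arguments are assumptions based on recent RMM releases, so check your installed version's API reference):

```python
import rmm

# Optionally pre-reserve memory in the driver pool and raise the release
# threshold so the pool keeps memory cached across deallocations.
mr = rmm.mr.CudaAsyncMemoryResource(
    initial_pool_size=2**30,   # assumed kwarg: pre-reserve 1 GiB
    release_threshold=2**31,   # assumed kwarg: keep up to 2 GiB cached
)
rmm.mr.set_current_device_resource(mr)
```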

**When to use:**
- Default choice for GPU-accelerated applications
- Multi-stream or multi-threaded applications
- Applications using multiple GPU libraries (e.g., cuDF + PyTorch)
- Most production workloads

### CudaMemoryResource

The `CudaMemoryResource` uses `cudaMalloc` directly for each allocation, with no pooling.

**Advantages:**
- Simple and predictable
- No fragmentation concerns
- Memory is immediately returned to the system on deallocation

**Disadvantages:**
- Slower than pooled allocators due to synchronization overhead

**Example:**
```python
import rmm

rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())
```

**When to use:**
- Simple applications with infrequent allocations
- Debugging memory issues
- Testing or benchmarking baseline performance

### PoolMemoryResource

The `PoolMemoryResource` maintains a pool of memory allocated from an upstream resource.

**Advantages:**
- Fast suballocation from pre-allocated pool
- Configurable initial and maximum pool sizes

**Disadvantages:**
- Can suffer from fragmentation (unlike async MR)
- Pool is not shared across applications
- Requires careful tuning of pool sizes

> **Review comment** (on fragmentation): Especially in multi-stream workloads. Single-stream seems to be less impacted.

**Example:**
```python
import rmm

pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),  # upstream resource
    initial_pool_size=2**30,  # 1 GiB
    maximum_pool_size=2**32,  # 4 GiB
)
rmm.mr.set_current_device_resource(pool)
```

**When to use:**
- Legacy applications (prefer `CudaAsyncMemoryResource` for new code)
- Specific tuning requirements not met by async MR
- Wrapping non-CUDA memory sources

**Important**: If using `PoolMemoryResource`, prefer wrapping `CudaAsyncMemoryResource` as the upstream rather than `CudaMemoryResource`:

```python
# Better: Pool wrapping async MR
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaAsyncMemoryResource(),
    initial_pool_size=2**30,
)
```

This combines the benefits of both: fast suballocation from RMM's pool and the driver's virtual addressing capabilities.

### ManagedMemoryResource

The `ManagedMemoryResource` uses CUDA unified memory (via `cudaMallocManaged`), allowing memory to be accessible from both CPU and GPU.

**Advantages:**
- Enables working with datasets larger than GPU memory
- Automatic page migration between CPU and GPU
- Simplifies memory management for host/device code

**Disadvantages:**
- Performance overhead due to page faults and migration
- Requires careful prefetching for optimal performance

**Example:**
```python
import rmm

rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
```

> **Review comment:** I'm hesitant to include this because people will copy-paste it and see the poor performance that usually causes people to reject managed memory.

**When to use:**
- Datasets larger than available GPU memory
- Prototyping or applications where performance is not critical
- Always combine with prefetching strategies (see [Managed Memory guide](managed_memory.md))

### ArenaMemoryResource

The `ArenaMemoryResource` divides a large allocation into size-binned arenas, reducing fragmentation.

**Advantages:**
- Better fragmentation characteristics than basic pool
- Good for mixed allocation sizes
- Predictable performance

**Disadvantages:**
- More complex configuration
- May waste memory if bin sizes don't match allocation patterns

**Example:**
```python
import rmm

arena = rmm.mr.ArenaMemoryResource(
    rmm.mr.CudaMemoryResource(),
    arena_size=2**28,  # 256 MiB arenas
)
rmm.mr.set_current_device_resource(arena)
```

**When to use:**
- Applications with diverse allocation sizes
- Long-running services with complex allocation patterns
- When fragmentation is observed with pool allocators

## Composing Memory Resources

Memory resources can be composed (wrapped) to combine their properties. The general pattern is:

```python
# Adaptor wrapping a base resource
adaptor = rmm.mr.SomeAdaptor(base_resource)
```

### Common Compositions

**Prefetching with managed memory:**
```python
import rmm

# Prefetch adaptor wrapping managed memory pool
base = rmm.mr.ManagedMemoryResource()
pool = rmm.mr.PoolMemoryResource(base, initial_pool_size=2**30)
prefetch = rmm.mr.PrefetchResourceAdaptor(pool)
rmm.mr.set_current_device_resource(prefetch)
```

**Statistics tracking:**
```python
import rmm

# Track allocation statistics
base = rmm.mr.CudaAsyncMemoryResource()
stats = rmm.mr.StatisticsResourceAdaptor(base)
rmm.mr.set_current_device_resource(stats)
```
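
The adaptor exposes counters you can inspect at runtime. A short sketch continuing the snippet above (the `allocation_counts` property name follows recent RMM Python releases; confirm against your version's docs):

```python
# After running your workload, inspect the tracked statistics,
# e.g. current/peak/total bytes and allocation counts.
print(stats.allocation_counts)
```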

## Multi-Library Applications

When using RMM with multiple GPU libraries (e.g., cuDF, PyTorch, CuPy), `CudaAsyncMemoryResource` is especially important because:

1. The driver-managed pool is shared automatically across all libraries
2. You don't need to configure every library to use RMM
3. Memory is not artificially partitioned between libraries

**Example: RMM + PyTorch**
```python
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator

# Use async MR as the base
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

# Configure PyTorch to use RMM
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
```

With this setup, both PyTorch and any other RMM-using code (like cuDF) will share the same driver-managed pool.
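
**Example: RMM + CuPy** (a parallel sketch using RMM's CuPy integration):
```python
import cupy
import rmm
from rmm.allocators.cupy import rmm_cupy_allocator

# Use async MR as the base
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

# Route CuPy allocations through RMM's current device resource
cupy.cuda.set_allocator(rmm_cupy_allocator)
```

With this, CuPy arrays draw from the same driver-managed pool as the PyTorch and cuDF allocations above.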

## Performance Considerations

### Async MR vs. Pool MR

In most cases, `CudaAsyncMemoryResource` provides similar or better performance than `PoolMemoryResource`:

- Both use pooling for fast suballocation
- Async MR uses virtual addressing to avoid fragmentation
- Async MR shares memory across applications

**When Pool MR might be faster:**
- Very specific allocation patterns that align well with pool design
- Custom upstream resources (not CUDA memory)

### Multi-stream Applications

For applications using multiple CUDA streams or threads:

- `CudaAsyncMemoryResource` is **strongly recommended**
- Pool allocators can create "pipeline bubbles" where streams wait for allocations
- The async MR handles stream synchronization efficiently

## Best Practices

1. **Set the memory resource before any allocations**: memory must be deallocated by the same resource that allocated it, so changing the resource while allocations exist can lead to crashes

```python
import rmm

# Do this first, before any GPU allocations
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
```

2. **Prefer async MR by default**: Unless you have specific requirements, start with `CudaAsyncMemoryResource`

3. **Use statistics for tuning**: If you need to understand allocation patterns, wrap with `StatisticsResourceAdaptor`

4. **Don't over-engineer**: Start simple, profile, and optimize only if needed

## See Also

- [Pool Allocators](pool_allocators.md) - Detailed guide on pool and arena allocators
- [Managed Memory](managed_memory.md) - Guide to using managed memory and prefetching
- [Stream-Ordered Allocation](stream_ordered_allocation.md) - Understanding stream-ordered semantics