Ensure correct handling of buffers allocated with LegacyPinnedMemoryResource.allocate as kernel parameters #717
Merged
Commits (11)
0784b39  Add memory ops example (shwina)
e1d1f9e  Fix handling of buffer with int handle (shwina)
fb65ab8  pre-commit fixes (shwina)
55bd3b1  Simplify pinned memory example (shwina)
70f0da6  Copy pinned memory tests to test_launcher.py (shwina)
40df59e  Remove dlpack assertions and address other review comments (shwina)
0b2e207  Try addressing issues that may be causing CI failures (shwina)
2699ff1  Use per device device MR, add numpy requirement to test (shwina)
246e8a1  Use SynchronousMemoryResource if memory pools are not supported (shwina)
7e3c468  apply nit (leofang)
c2ff8cc  [pre-commit.ci] auto code formatting (pre-commit-ci[bot])
@@ -0,0 +1,137 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. ALL RIGHTS RESERVED.
#
# SPDX-License-Identifier: Apache-2.0

# ################################################################################
#
# This demo illustrates:
#
# 1. How to use different memory resources to allocate and manage memory
# 2. How to copy data between different memory types
# 3. How to use DLPack to interoperate with other libraries
#
# ################################################################################

import sys

import cupy as cp
import numpy as np

from cuda.core.experimental import (
    Device,
    LaunchConfig,
    LegacyPinnedMemoryResource,
    Program,
    ProgramOptions,
    launch,
)

if np.__version__ < "2.1.0":
    print("This example requires NumPy 2.1.0 or later", file=sys.stderr)
    sys.exit(0)

# Kernel for memory operations
code = """
extern "C"
__global__ void memory_ops(float* device_data,
                           float* pinned_data,
                           size_t N) {
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N) {
        // Access device memory
        device_data[tid] = device_data[tid] + 1.0f;

        // Access pinned memory (zero-copy from GPU)
        pinned_data[tid] = pinned_data[tid] * 3.0f;
    }
}
"""

dev = Device()
dev.set_current()
stream = dev.create_stream()

# tell CuPy to use our stream as the current stream:
cp.cuda.ExternalStream(int(stream.handle)).use()

# Compile kernel
arch = "".join(f"{i}" for i in dev.compute_capability)
program_options = ProgramOptions(std="c++17", arch=f"sm_{arch}")
prog = Program(code, code_type="c++", options=program_options)
mod = prog.compile("cubin")
kernel = mod.get_kernel("memory_ops")

# Create different memory resources
device_mr = dev.memory_resource
pinned_mr = LegacyPinnedMemoryResource()

# Allocate different types of memory
size = 1024
dtype = cp.float32
element_size = dtype().itemsize
total_size = size * element_size

# 1. Device Memory (GPU-only)
device_buffer = device_mr.allocate(total_size, stream=stream)
device_array = cp.from_dlpack(device_buffer).view(dtype=dtype)

# 2. Pinned Memory (CPU memory, GPU accessible)
pinned_buffer = pinned_mr.allocate(total_size, stream=stream)
pinned_array = np.from_dlpack(pinned_buffer).view(dtype=dtype)

# Initialize data
rng = cp.random.default_rng()
device_array[:] = rng.random(size, dtype=dtype)
pinned_array[:] = rng.random(size, dtype=dtype).get()

# Store original values for verification
device_original = device_array.copy()
pinned_original = pinned_array.copy()

# Sync before kernel launch
stream.sync()

# Launch kernel
block = 256
grid = (size + block - 1) // block
config = LaunchConfig(grid=grid, block=block)

launch(stream, config, kernel, device_buffer, pinned_buffer, cp.uint64(size))
stream.sync()

# Verify kernel operations
assert cp.allclose(device_array, device_original + 1.0), "Device memory operation failed"
assert cp.allclose(pinned_array, pinned_original * 3.0), "Pinned memory operation failed"

# Copy data between different memory types
print("\nCopying data between memory types...")

# Copy from device to pinned memory
device_buffer.copy_to(pinned_buffer, stream=stream)
stream.sync()

# Verify the copy operation
assert cp.allclose(pinned_array, device_array), "Device to pinned copy failed"

# Create a new device buffer and copy from pinned
new_device_buffer = device_mr.allocate(total_size, stream=stream)
new_device_array = cp.from_dlpack(new_device_buffer).view(dtype=dtype)

pinned_buffer.copy_to(new_device_buffer, stream=stream)
stream.sync()

# Verify the copy operation
assert cp.allclose(new_device_array, pinned_array), "Pinned to device copy failed"

# Clean up
device_buffer.close(stream)
pinned_buffer.close(stream)
new_device_buffer.close(stream)
stream.close()

cp.cuda.Stream.null.use()  # reset CuPy's current stream to the null stream

# Verify buffers are properly closed
assert device_buffer.handle == 0, "Device buffer should be closed"
assert pinned_buffer.handle == 0, "Pinned buffer should be closed"
assert new_device_buffer.handle == 0, "New device buffer should be closed"

print("Memory management example completed!")
Conversations
Can we stomach the cost of an `isinstance` check here?

One alternative is to use a `try..except`, where entering the `try` block is cheap, but entering the `except` block is expensive.

Another alternative, which will eliminate the need to make any changes to the kernel arg handling logic here:

- Introduce a `HostPtr` type that wraps an integer representing a pointer and exposes a `getPtr()` method to get it.
- Change `Buffer.handle` to `DevicePtrT | HostPtr`.
- Change `LegacyPinnedMemoryResource` to return a buffer whose handle is a `HostPtr`.
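A minimal sketch of that alternative, using the names from the comment (hypothetical; not the route the PR actually takes):

```python
# Hypothetical sketch of the HostPtr alternative floated above; the PR did
# not adopt this. The names HostPtr and getPtr come from the comment.
class HostPtr:
    """Wraps an integer representing a host (pinned) memory pointer."""

    def __init__(self, ptr: int):
        self._ptr = ptr

    def getPtr(self) -> int:
        # Mirrors the getPtr() accessor on driver handle types such as
        # CUdeviceptr, so the kernel arg handler needs no special casing.
        return self._ptr
```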
I think `isinstance` in Cython is cheap and what you have here is good. I don't want to introduce more types than needed, partly because we want MR providers to focus on the MR properties (`is_host_accessible`, etc.), which is nicer for programmatic checks. I actually think that `Buffer.handle` should be of `Any` type so as to not get in the way of the MR providers. From both the CUDA and cccl-rt perspectives they should all be `void*`. We don't want to encode the memory space information as part of the type.
If we did type it as `Any`, how would `_kernel_arg_handler` know how to grab the pointer from underneath the `Buffer`?
Well Python does not care about type annotations, right? 🙂
My concern wasn't so much about the type annotation, but more that the kernel handler won't know what to do with a `Buffer` whose `.handle` is any arbitrary type.

Prior to this PR it could only handle the case when `.handle` is a `CUdeviceptr`, or something that has a `.getPtr()` method (cuda-python/cuda_core/cuda/core/experimental/_kernel_arg_handler.pyx, lines 213 to 215 in 24fde17).

This PR adds the ability to handle `int`. Technically, `.handle` is also allowed to be `None` (cuda-python/cuda_core/cuda/core/experimental/_memory.py, line 22 in 24fde17).
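In plain Python terms, the dispatch under discussion is roughly the following (an illustrative sketch; the real code is Cython in `_kernel_arg_handler.pyx`, and the helper name here is invented):

```python
# Illustrative Python sketch of the Buffer-handle dispatch discussed above;
# the actual implementation is Cython, and pointer_from_handle is made up.
def pointer_from_handle(handle) -> int:
    if handle is None:
        # A closed Buffer carries no handle.
        raise ValueError("buffer has no handle (already closed?)")
    if isinstance(handle, int):
        # New in this PR: pinned buffers from LegacyPinnedMemoryResource
        # carry a raw integer address.
        return handle
    # Pre-existing path: driver types such as CUdeviceptr expose getPtr().
    return handle.getPtr()
```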
Ahh, I see, you meant the mini dispatcher here needs to enumerate all possible types. Let me think about it. What you have is good and a generic treatment can follow later.

Most likely with #564 we could rewrite the dispatcher to check for a handle protocol instead of enumerating concrete types. On the MR provider side, we just need them to implement a protocol if they are not using generic cuda.bindings or Python types. (FWIW we already have `IsStreamT`.) So maybe eventually `Buffer.handle` can be typed as `Union[IsHandleT, int]`.
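A minimal sketch of such a protocol-based dispatcher in the spirit of `IsStreamT` (hypothetical; the eventual design tied to #564 may differ):

```python
# Hypothetical sketch of a protocol-based dispatcher modeled on the existing
# IsStreamT pattern; the design actually proposed alongside #564 may differ.
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class IsHandleT(Protocol):
    def __int__(self) -> int: ...


def pointer_from_handle(handle: Any) -> int:
    # Anything convertible to an integer address satisfies the protocol,
    # including plain int and driver types like CUdeviceptr.
    if isinstance(handle, IsHandleT):
        return int(handle)
    raise TypeError(f"unsupported handle type: {type(handle)}")
```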
It seems like a reasonable approach and I agree it would simplify the handling here. A couple of comments:

- Could we rename `DevicePointerT` to just `PointerT`? In the case of pinned memory, for instance, it doesn't actually represent a device pointer AFAIU.
- `Union[IsHandleT, int]` is equivalent to just `IsHandleT` (the `int` type implements `__int__`). The protocol would also allow types like `float` or `bool`, so perhaps it should require a `__cuda_handle__()` method or something, rather than `__int__()`.
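For instance (again hypothetical), a dedicated-method protocol would keep `float` and `bool` from satisfying it by accident:

```python
# Hypothetical stricter variant: requiring a dedicated __cuda_handle__()
# method means float/bool no longer satisfy the protocol accidentally.
from typing import Protocol, runtime_checkable


@runtime_checkable
class IsHandleT(Protocol):
    def __cuda_handle__(self) -> int: ...
```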