Ensure correct handling of buffers allocated with LegacyPinnedMemoryResource.allocate as kernel parameters #717


Merged: 11 commits merged into NVIDIA:main on Jun 26, 2025

Conversation

Contributor

@shwina shwina commented Jun 18, 2025

Description

Buffers allocated using the LegacyPinnedMemoryResource have plain Python int handles, but the kernel param handler doesn't know how to handle such buffers. This PR fixes that.

Closes #715

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Contributor

copy-pr-bot bot commented Jun 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@@ -212,7 +212,13 @@ cdef class ParamHolder:
        for i, arg in enumerate(kernel_args):
            if isinstance(arg, Buffer):
                # we need the address of where the actual buffer address is stored
                self.data_addresses[i] = <void*><intptr_t>(arg.handle.getPtr())
                if isinstance(arg.handle, int):
Contributor Author

@shwina shwina Jun 18, 2025


Can we stomach the cost of an isinstance check here?

  • One alternative is to use a try..except, where entering the try block is cheap, but entering the except block is expensive.

  • Another alternative, which will eliminate the need to make any changes to the kernel arg handling logic here:

    • introduce a new type HostPtr which wraps an integer representing a pointer, and exposes a getPtr() method to get it.
    • Expand the return type of Buffer.handle to DevicePtrT | HostPtr
    • Change LegacyPinnedMemoryResource to return a buffer whose handle is a HostPtr.
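The HostPtr alternative in the bullets above could be sketched like this (a hypothetical sketch: the HostPtr name and getPtr() shape come from the proposal, everything else is illustrative):

```python
class HostPtr:
    """Hypothetical wrapper for an integer host pointer, mirroring the
    getPtr() interface of driver.CUdeviceptr so the kernel-arg handler
    can treat device and pinned-host handles uniformly."""

    def __init__(self, ptr: int):
        self._ptr = ptr

    def getPtr(self) -> int:
        return self._ptr

# The handler could then call arg.handle.getPtr() unconditionally,
# whether the buffer came from a device MR or LegacyPinnedMemoryResource.
```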

Member

@leofang leofang Jun 18, 2025


I think isinstance in Cython is cheap and what you have here is good. I don't want to introduce more types than needed, partly because we want MR providers to focus on the MR properties (is_host_accessible etc), which is nicer for programmatic checks. I actually think that Buffer.handle should be of Any type so as to not get in the way of the MR providers. From both CUDA and cccl-rt perspectives they should be all void*. We don't want to encode the memory space information as part of the type.

Contributor Author

I actually think that Buffer.handle should be of Any type so as to not get in the way of the MR providers.

If we did type it as Any, how would _kernel_arg_handler know how to grab the pointer from underneath the Buffer?

Member

Well Python does not care about type annotations, right? 🙂

Contributor Author

@shwina shwina Jun 18, 2025


My concern wasn't so much about the type annotation, but more that the kernel handler won't know what to do with a Buffer whose .handle is any arbitrary type.

Prior to this PR it could only handle the case when .handle is a CUdeviceptr, or something that has a .getPtr() method.

if isinstance(arg, Buffer):
    # we need the address of where the actual buffer address is stored
    self.data_addresses[i] = <void*><intptr_t>(arg.handle.getPtr())

This PR adds the ability to handle int.

Technically, .handle is also allowed to be None:

DevicePointerT = Union[driver.CUdeviceptr, int, None]
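Put together, a plain-Python sketch of what the mini dispatcher has to enumerate (the real code is Cython; FakeCUdeviceptr is an illustrative stand-in for driver.CUdeviceptr):

```python
from typing import Optional, Union

class FakeCUdeviceptr:
    """Illustrative stand-in for driver.CUdeviceptr: exposes getPtr()."""
    def __init__(self, ptr: int):
        self._ptr = ptr
    def getPtr(self) -> int:
        return self._ptr

def extract_ptr(handle: Optional[Union[FakeCUdeviceptr, int]]) -> int:
    """Enumerate the cases the DevicePointerT annotation allows."""
    if isinstance(handle, int):      # pinned-host buffers (the case this PR adds)
        return handle
    if handle is None:               # also permitted by the annotation
        raise ValueError("buffer has no backing allocation")
    return handle.getPtr()           # CUdeviceptr-like handles
```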

Member

Ahh, I see, you meant the mini dispatcher here needs to enumerate all possible types.

Let me think about it. What you have is good and a generic treatment can follow later.

Most likely with #564 we could rewrite the dispatcher that looks like this

if isinstance(arg, Buffer):
    prepare_arg[intptr_t](self.data, self.data_addresses, get_cuda_native_handle(arg.handle), i)

On the MR provider side, we just need them to implement a protocol

class IsHandleT(Protocol):
    def __int__(self) -> int: ...

if they are not using generic cuda.bindings or Python types. (FWIW we already have IsStreamT.) So maybe eventually Buffer.handle can be typed as

DevicePointerT = Optional[Union[IsHandleT, int]] 
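A runnable sketch of the protocol idea above (assuming the IsHandleT name from the comment; CustomHandle is an illustrative MR-provider type):

```python
from typing import Optional, Protocol, Union, runtime_checkable

@runtime_checkable
class IsHandleT(Protocol):
    """Anything convertible to a raw pointer value via __int__."""
    def __int__(self) -> int: ...

DevicePointerT = Optional[Union[IsHandleT, int]]

class CustomHandle:
    """An MR provider's handle type: it only needs to implement __int__."""
    def __init__(self, ptr: int):
        self._ptr = ptr
    def __int__(self) -> int:
        return self._ptr

# A generic dispatcher could then obtain the pointer with int(handle),
# with no per-type branches.
```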

Contributor Author

@shwina shwina Jun 19, 2025


It seems like a reasonable approach and I agree it would simplify the handling here. A couple of comments:

  • Perhaps we should rename DevicePointerT to just PointerT? In the case of pinned memory for instance, it doesn't actually represent a device pointer AFAIU.
  • If we use the protocol as written, then Union[IsHandleT, int] is equivalent to just IsHandleT (int type implements __int__). The protocol would also allow types like float or bool.
    • I feel like this discussion has been had before, but it might be worth considering a protocol with a __cuda_handle__() method or something, rather than __int__()
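The second bullet can be checked directly: with a runtime-checkable version of the __int__-only protocol, int, float, and bool all satisfy it (a small demonstration; IsHandleT is as defined in the earlier comment):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class IsHandleT(Protocol):
    def __int__(self) -> int: ...

# int, float, and bool all implement __int__, so all satisfy the protocol.
# This is why Union[IsHandleT, int] collapses to IsHandleT, and why a
# dedicated method such as __cuda_handle__() would be more discriminating.
```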

Member

@leofang leofang left a comment


Thanks for catching the bug, Ashwin! I left some comments.

It seems to be a miss (due to the way we implicitly test LegacyPinnedMemoryResource through NumPy today) that we do not test passing Buffer from various MRs to launch(). Could you add some tests to test_launcher.py? In a comment below, I suggest we could move part of what you have there.

@github-project-automation github-project-automation bot moved this from Todo to In Progress in CCCL Jun 18, 2025
@leofang leofang added this to the cuda.core beta 5 milestone Jun 18, 2025
@leofang leofang added bug Something isn't working P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Jun 18, 2025
Member

@leofang leofang left a comment

Thanks, Ashwin! The added test LGTM. Left a few comments mainly for the code sample correctness.

@leofang
Member

leofang commented Jun 24, 2025

/ok to test 40df59e

leofang previously approved these changes Jun 24, 2025
@github-project-automation github-project-automation bot moved this from In Progress to In Review in CCCL Jun 24, 2025
@leofang leofang enabled auto-merge (squash) June 24, 2025 13:32

@leofang
Member

leofang commented Jun 24, 2025

I think the memory_ops.py sample has at least one bug that causes later tests to segfault (pytest runs the sample tests first).

We should early-return in the sample if NumPy is <2.1.0, because np.from_dlpack had a bug. In one of the samples I did this:

if cuda_path is None:
    print("this demo requires a valid CUDA_PATH environment variable set", file=sys.stderr)
    sys.exit(0)
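An analogous early-return guard for the NumPy version could look like this (a sketch: numpy_too_old is a hypothetical helper, and the 2.1.0 cutoff is the one named above):

```python
import sys

def numpy_too_old(version: str) -> bool:
    """Return True if the given NumPy version predates the 2.1.0 cutoff
    mentioned above for the np.from_dlpack bug."""
    major, minor = (int(x) for x in version.split(".")[:2])
    return (major, minor) < (2, 1)

# In the sample, the guard would mirror the CUDA_PATH check (assuming
# `np` is imported there):
# if numpy_too_old(np.__version__):
#     print("this demo requires NumPy >= 2.1.0", file=sys.stderr)
#     sys.exit(0)
```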

@shwina
Contributor Author

shwina commented Jun 24, 2025

I'm also seeing this failure mode in addition to the segfault:

>           assert cp.allclose(array, original * 3.0), f"{memory_resource_class.__name__} operation failed"
E           AssertionError: DeviceMemoryResource operation failed
E           assert array(False)
E            +  where array(False) = <function allclose at 0xf5ce6f977740>(array([0.30819342, 0.78738075, 0.71034026, ..., 0.3719493 , 0.7937206 ,\n       0.90285766], shape=(1024,), dtype=float32), (array([0.30819342, 0.78738075, 0.71034026, ..., 0.3719493 , 0.7937206 ,\n       0.90285766], shape=(1024,), dtype=float32) * 3.0))
E            +    where <function allclose at 0xf5ce6f977740> = <module 'cupy' from '/opt/hostedtoolcache/Python/3.13.5/arm64/lib/python3.13/site-packages/cupy/__init__.py'>.allclose

tests/test_launcher.py:296: AssertionError

@shwina
Contributor Author

shwina commented Jun 24, 2025

/ok to test 0b2e207

@leofang
Member

leofang commented Jun 24, 2025

/ok to test 0b2e207

@shwina
Contributor Author

shwina commented Jun 24, 2025

OK, now I'm seeing:

>           array[:] = rng.random(size, dtype=dtype)
E           ValueError: assignment destination is read-only

This one I can reproduce with older numpy.

@shwina
Contributor Author

shwina commented Jun 25, 2025

/ok to test 2699ff1

leofang previously approved these changes Jun 25, 2025
@leofang
Member

leofang commented Jun 26, 2025

/ok to test 7e3c468

leofang previously approved these changes Jun 26, 2025
@leofang
Member

leofang commented Jun 26, 2025

pre-commit.ci autofix

@leofang
Member

leofang commented Jun 26, 2025

/ok to test c2ff8cc

@leofang leofang merged commit fd8e07b into NVIDIA:main Jun 26, 2025
53 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Jun 26, 2025

Labels
bug Something isn't working cuda.core Everything related to the cuda.core module P0 High priority - Must do!
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG]: Cannot pass pinned memory buffer to kernels