
Conversation

@leofang (Member) commented Oct 2, 2025

Description

Close #1065
Close #1063

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@leofang self-assigned this Oct 2, 2025
copy-pr-bot bot commented Oct 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang added labels on Oct 2, 2025: enhancement (Any code-related improvements), triage (Needs the team's attention), cuda.core (Everything related to the cuda.core module), blocked (This task is currently blocked by other tasks)
@leofang

This comment was marked as resolved.

@leofang added the P0 (High priority - Must do!) label and removed the blocked (This task is currently blocked by other tasks) label on Oct 3, 2025
@leofang (Member Author) commented Oct 3, 2025

/ok to test b713c5c

github-actions bot commented Oct 3, 2025

@leofang (Member Author) commented Oct 4, 2025

/ok to test bf712bf

@rparolin (Collaborator) commented Oct 6, 2025

@leofang Please update the PR description to include a description of the solution you are proposing.

Comment on lines +195 to +196
cdef cydriver.CUdevice get_device_from_ctx(
        cydriver.CUcontext target_ctx, cydriver.CUcontext curr_ctx) except?cydriver.CU_DEVICE_INVALID nogil:
Contributor

When I see this level of complexity in Cython, approaching that of C++, I wonder whether it would be simpler to directly write C++ and expose it through something like pybind11. Not sure if I'm the only one who thinks this.

Contributor
You are not alone in that ;) It's also possible to write C++ and just call it from Cython -- we have a couple small .cpp files in this repo already.

Collaborator

I've also been thinking about this. Basically it comes down to answering this question: at what point on the trade-off curve does using a DSL that generates C++ code stop paying off, such that it's easier to just write the C++ manually?

Contributor

It's also possible to inline C++ in Cython if needed with:

cdef extern from *:
    """
    C++ code goes here
    """

Contributor

I'd love to see an exploration into this. Given well-defined criteria, shave off a small piece of cuda.core, try a few implementation strategies, and evaluate the results.

@leofang (Member Author) commented Oct 6, 2025

The ship has sailed. Let's create an issue to track this discussion. The next time we will revisit it is when we start looking into cccl-runtime as the underlying implementation of cuda.core. I am not sure anyone actually understands the implications of going through C++ now, after all the discussions in recent Friday co-design meetings. There are things that we need and get from cuda-bindings that are not yet available in native C++.

"""
self._shutdown_safe_close(stream, is_shutting_down=None)
if self._ptr and self._mr is not None:
    if isinstance(self._mr, _cyMemoryResource):
Contributor

Can we implement __reduce__ on cyMemoryResource and avoid the isinstance check here?

Member Author

With this PR, we have two kinds of memory resource implementations:

  1. A pure Python one that inherits the MemoryResource ABC:
    • In this case, we need to ensure the Python deallocate() is invoked
  2. A cuda.core-provided one that inherits both _cyMemoryResource (to get a C layout & methods) and MemoryResource (to meet Python requirements)
    • In this case, we can call the Cython _dealloc()

It was unclear to me while working on this whether isinstance could be replaced by a better approach, but since it is only a pointer comparison, I assume it is OK to keep it.
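
A minimal sketch of the resulting dispatch (names follow the discussion above; the exact _dealloc signature is assumed, not taken from the diff):

if isinstance(self._mr, _cyMemoryResource):
    # cuda.core-provided resource: take the fast Cython path
    (<_cyMemoryResource>self._mr)._dealloc(self._ptr, self._size, stream)
else:
    # user-defined resource inheriting the MemoryResource ABC:
    # the Python-level deallocate() must be invoked
    self._mr.deallocate(self._ptr, self._size, stream)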

Contributor

If you have __reduce__ implemented in cyMemoryResource, then wouldn't classes like DeviceMemoryResource automatically use that one without the need for this isinstance check?

isinstance checks are notoriously expensive when ABCs are involved, but it also seems cleaner for cyMemoryResource to define its own __reduce__.
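
For a rough sense of the cost (illustrative only; numbers are machine-dependent), an ABC check dispatches through __instancecheck__ while a concrete-class check is a direct type test:

import timeit
from collections.abc import Sequence

# ABC path: goes through Sequence.__instancecheck__ and the ABC cache
print(timeit.timeit("isinstance([], Sequence)", globals=globals()))
# Concrete path: a plain type check, typically several times faster
print(timeit.timeit("isinstance([], list)"))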

Member Author

@shwina I might have missed something. How does __reduce__ help close/__dealloc__?

Contributor

Sorry, the diff appeared to show that this was within the __reduce__ method. Please replace all instances of __reduce__ in my comments with close.

Member Author

Ah OK. Then I have a follow-up question: how do we know for certain that we'd call the Cython close and not the Python close, without an explicit isinstance check? Do we just inspect the generated code?

@Andy-Jost (Contributor) commented Oct 6, 2025

isinstance checks are notoriously expensive when ABCs are involved,

I've run into this also. Hugely expensive.

Member Author

To be discussed tomorrow... I raised a thread offline.

@rparolin (Collaborator) left a comment

I left a comment asking about pointer data type changes and pointer value conversions.



cdef int HANDLE_RETURN(supported_error_type err) except?-1:
cdef int HANDLE_RETURN(supported_error_type err) except?-1 nogil:
Collaborator

Should we explicitly release the GIL in this function?
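
(For context, a minimal sketch of the distinction, reusing the signature from the diff above: nogil on the signature only declares the function safe to call without the GIL; it does not release the GIL by itself.)

cdef int HANDLE_RETURN(supported_error_type err) except?-1 nogil:
    # callable without the GIL, but nothing here releases it
    ...

# releasing is an explicit, separate step, e.g. at a call site:
with nogil:
    HANDLE_RETURN(err)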

Comment on lines +30 to 33
# TODO: I prefer to type these as "cdef object" and avoid accessing them from within Python,
# but it seems it is very convenient to expose them for testing purposes...
_tls = threading.local()
_lock = threading.Lock()
Collaborator

We could cdef object these and then create private functions to return them for use in testing, but we could do that in a follow-up.
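
A sketch of that follow-up (the accessor names are hypothetical):

import threading

cdef object _tls = threading.local()
cdef object _lock = threading.Lock()

def _get_tls_for_testing():
    # module-level cdef objects are not visible from Python,
    # so expose them through private accessors
    return _tls

def _get_lock_for_testing():
    return _lock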

Comment on lines 1114 to 1115
name = name.split(b"\0")[0]
return name.decode()
Collaborator

nitpick: there's probably a way to do this without the GIL?
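
One possible sketch, assuming name can be kept as a C char buffer rather than a Python bytes object (helper name is hypothetical): scan for the NUL terminator without the GIL, then decode while holding it:

from libc.string cimport strlen

cdef str _decode_c_name(const char* name):
    cdef size_t n
    with nogil:
        n = strlen(name)          # no Python objects touched here
    return name[:n].decode()      # building the str still requires the GIL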

Comment on lines +859 to +860
# Note: This is Linux only (int for file descriptor)
cdef int alloc_handle
Collaborator

Should we consider putting this and the below function call behind a compilation conditional?

Member Author

Alternatively we could revisit this when we start working on #1028?
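
If we did want a build-time guard, one possible sketch (the macro name is made up; the verbatim block lets the C preprocessor do the platform check):

cdef extern from * nogil:
    """
    #ifdef __linux__
    #define HAS_POSIX_FD_HANDLE 1
    #else
    #define HAS_POSIX_FD_HANDLE 0
    #endif
    """
    bint HAS_POSIX_FD_HANDLE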

cdef int _get_device_and_context(self) except?-1:
    cdef cydriver.CUcontext curr_ctx
    if self._device_id == cydriver.CU_DEVICE_INVALID:
        # TODO: It is likely faster/safer to call cuCtxGetCurrent?
Collaborator

I don't think this is universally true: a context can be non-current without being destroyed, and retrieving the device and context from that stream object would still be valid.

Member Author

I did not bother profiling here; I just thought it might be good to save a few CUDA calls... 😛

with nogil:
    HANDLE_RETURN(cydriver.cuMemAllocFromPoolAsync(&devptr, size, self._mempool_handle, s))
cdef Buffer buf = Buffer.__new__(Buffer)
buf._ptr = <intptr_t>(devptr)
Collaborator

We are sure we want intptr_t and not uintptr_t?

@oleksandr-pavlyk (Contributor) commented Oct 6, 2025

We definitely want uintptr_t to make sure we can represent all device pointers, even those whose highest (64th) bit is set. Such pointers are conceivably possible on systems with several large-memory-capacity GPUs.

$ cat c.pyx
from libc.stdint cimport uintptr_t, intptr_t

def roundtrip_u(ptr : int) -> int:
    cdef uintptr_t _ptr = <uintptr_t>(int(ptr))
    return _ptr

def roundtrip_i(ptr : int) -> int:
    cdef intptr_t _ptr = <intptr_t>(int(ptr))
    return _ptr

$ cython -+ c.pyx 
$ g++ c.cpp -shared -fPIC $(python3-config --includes) $(python3-config --libs) -o c$(python3-config --extension-suffix)

$ python
Python 3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import c
>>> import cuda.core.experimental as cc
>>> cc.Device().set_current()
>>> stream = cc.Device().create_stream()
>>> buf = cc.DeviceMemoryResource(cc.Device()).allocate(2048, stream)

>>> c.roundtrip_u(buf.handle) # pointer here does not have the highest bit set, so both roundtrips work
17213423616
>>> c.roundtrip_i(buf.handle)
17213423616

>>> c.roundtrip_u(2**63 + 5)  # contriving an example where it breaks
9223372036854775813
>>> c.roundtrip_i(2**63 + 5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c.pyx", line 8, in c.roundtrip_i
    cdef intptr_t _ptr = <intptr_t>(int(ptr))
OverflowError: Python int too large to convert to C ssize_t

Contributor

What was the rationale for switching from uintptr_t in the first place?

Member Author

Discussed this during the review meeting, but wouldn't hurt to summarize here:

  • uintptr_t is used for all CUDA object addresses
  • intptr_t is used only for Buffer in case we need to compute pointer offsets
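
For illustration only (hypothetical names, not from this PR): the signed type lets offset arithmetic go negative without wrapping through unsigned values:

cdef intptr_t base = buf._ptr           # Buffer stores a signed address
cdef intptr_t diff = other._ptr - base  # pointer differences may legitimately be negative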

Member Author

@oleksandr-pavlyk good question. I think that by casting a Python int to intptr_t, your example constitutes a lossy conversion that Cython protected you against. If you were to do the same in C, the C/C++ compiler would catch it too:

/local/home/leof/dev/round_trip_cast.c:2584:29: warning: integer constant is so large that it is unsigned
 2584 |   __pyx_v__ptr = ((intptr_t)9223372036854775813LL);

intptr_t only guarantees that it can be round-trip converted to/from void* (which is what we need, even with multi-device in mind). It does not offer any guarantees beyond that, to the best of my knowledge.

We could do the following, but whether it makes much sense depends on the use case:

# distutils: language = c++

from libc.stdint cimport uintptr_t, intptr_t

def roundtrip_u(ptr) -> int:
    cdef uintptr_t _ptr = <uintptr_t>(int(ptr))
    return _ptr


def roundtrip_i(ptr) -> int:
    cdef intptr_t _ptr = <intptr_t>(int(ptr))
    return _ptr


def roundtrip_i_v2(ptr) -> int:
    cdef intptr_t _ptr = <intptr_t>(<uintptr_t>(int(ptr)))
    return _ptr


cdef extern from * nogil:
    """
    intptr_t convert_safe(void* ptr) {
        return reinterpret_cast<intptr_t>(ptr);
    }
    """
    intptr_t convert_safe(void* ptr)


def roundtrip_i_v3(ptr) -> int:
    cdef intptr_t _ptr = convert_safe(<void*><uintptr_t>(int(ptr)))
    return _ptr

Output:

>>> import round_trip_cast
>>> round_trip_cast.roundtrip_u(2**63 + 5)
9223372036854775813
>>> round_trip_cast.roundtrip_i(2**63 + 5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "round_trip_cast.pyx", line 11, in round_trip_cast.roundtrip_i
    cdef intptr_t _ptr = <intptr_t>(int(ptr))
OverflowError: Python int too large to convert to C ssize_t
>>> round_trip_cast.roundtrip_i_v2(2**63 + 5)
-9223372036854775803
>>> round_trip_cast.roundtrip_i_v3(2**63 + 5)
-9223372036854775803
>>> 

@leofang (Member Author) commented Oct 6, 2025

/ok to test a5e6bcf

Labels
  • cuda.core (Everything related to the cuda.core module)
  • enhancement (Any code-related improvements)
  • P0 (High priority - Must do!)
Development

Successfully merging this pull request may close these issues.

  • cuda-core: Release GIL when calling cimport'd CUDA APIs
  • [FEA]: Update the memory module to use __dealloc__
8 participants