
Conversation

@leofang (Member) commented Oct 2, 2025

Description

Close #1065
Close #1063

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@leofang self-assigned this Oct 2, 2025
copy-pr-bot bot commented Oct 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang added labels on Oct 2, 2025: enhancement (Any code-related improvements), triage (Needs the team's attention), cuda.core (Everything related to the cuda.core module), blocked (This task is currently blocked by other tasks)
@leofang

This comment was marked as resolved.

@leofang added the P0 (High priority - Must do!) label and removed the blocked (This task is currently blocked by other tasks) label on Oct 3, 2025
@leofang (Member Author) commented Oct 3, 2025

/ok to test b713c5c

github-actions bot commented Oct 3, 2025

@leofang (Member Author) commented Oct 4, 2025

/ok to test bf712bf

@rparolin (Collaborator) commented Oct 6, 2025

@leofang Please update the PR description to include a description of the solution you are proposing.

Comment on lines +195 to +196
cdef cydriver.CUdevice get_device_from_ctx(
        cydriver.CUcontext target_ctx, cydriver.CUcontext curr_ctx) except?cydriver.CU_DEVICE_INVALID nogil:
Contributor

When I see this level of complexity in Cython, approaching that of C++, I wonder whether it would be simpler to directly write C++ and expose it through something like pybind11. Not sure if I'm the only one who thinks this.

Contributor
You are not alone in that ;) It's also possible to write C++ and just call it from Cython -- we have a couple small .cpp files in this repo already.

Collaborator

I've also been thinking about this. Basically it comes down to answering this question: at what point on the trade-off curve does using a DSL that generates C++ code stop paying off, such that it's easier to just write the C++ manually?

Contributor

It's also possible to inline C++ in Cython if needed with:

cdef extern from *:
    """
    C++ code goes here
    """

Contributor

I'd love to see an exploration into this. Given well-defined criteria, shave off a small piece of cuda.core, try a few implementation strategies, and evaluate the results.

@leofang (Member Author) commented Oct 6, 2025

The ship has sailed. Let's create an issue to track this discussion. The next time we will revisit it is when we start looking into cccl-runtime as the underlying implementation of cuda.core. I am not sure anyone actually understands the implications of going through C++ now, after all the discussions in recent Friday co-design meetings. There are things that we need and get from cuda-bindings that are not yet available in native C++.

"""
self._shutdown_safe_close(stream, is_shutting_down=None)
if self._ptr and self._mr is not None:
    if isinstance(self._mr, _cyMemoryResource):
Contributor

Can we implement __reduce__ on cyMemoryResource and avoid the isinstance check here?

Member Author

With this PR, we have two kinds of memory resource implementations:

  1. A pure Python one that inherits the MemoryResource ABC:
    • In this case, we need to ensure the Python deallocate() is invoked
  2. A cuda.core-provided one that inherits both _cyMemoryResource (to get a C layout & methods) and MemoryResource (to meet Python requirements)
    • In this case, we can call the Cython _dealloc()

It was unclear to me while working on this whether isinstance could be replaced by a better approach, but since it is only a pointer comparison, I assume it is OK to keep it.
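
A minimal sketch of the resulting dispatch (names follow the discussion above; the exact _dealloc signature is assumed, not taken from the diff):

if isinstance(self._mr, _cyMemoryResource):
    # cuda.core-provided resource: take the fast Cython path
    (<_cyMemoryResource>self._mr)._dealloc(self._ptr, self._size, stream)
else:
    # user-defined resource inheriting the MemoryResource ABC:
    # the Python-level deallocate() must be invoked
    self._mr.deallocate(self._ptr, self._size, stream)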

Contributor

If you have __reduce__ implemented in cyMemoryResource, then wouldn't classes like DeviceMemoryResource automatically use that one without the need for this isinstance check?

isinstance checks are notoriously expensive when ABCs are involved, but it also seems cleaner for cyMemoryResource to define its own __reduce__.
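
For a rough sense of the cost (illustrative only; numbers are machine-dependent), an ABC check dispatches through __instancecheck__ while a concrete-class check is a direct type test:

import timeit
from collections.abc import Sequence

# ABC path: goes through Sequence.__instancecheck__ and the ABC cache
print(timeit.timeit("isinstance([], Sequence)", globals=globals()))
# Concrete path: a plain type check, typically several times faster
print(timeit.timeit("isinstance([], list)"))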

Member Author

@shwina I might have missed something. How does __reduce__ help close/__dealloc__?

Contributor

Sorry, the diff appeared to show that this was within the __reduce__ method. Please replace all instances of __reduce__ in my comments with close.

Member Author

Ah OK. Then I have a follow-up question: how do we know for certain that we'd call the Cython close and not the Python close, without an explicit isinstance check? Do we just inspect the generated code?

@Andy-Jost (Contributor) commented Oct 6, 2025

isinstance checks are notoriously expensive when ABCs are involved,

I've run into this also. Hugely expensive.

Member Author

To be discussed tomorrow... I raised a thread offline.

@rparolin (Collaborator) left a comment

I left a comment asking about pointer data type changes and pointer value conversions.



cdef int HANDLE_RETURN(supported_error_type err) except?-1:
cdef int HANDLE_RETURN(supported_error_type err) except?-1 nogil:
Collaborator

Should we explicitly release the GIL in this function?
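
(For context, a minimal sketch of the distinction, reusing the signature from the diff above: nogil on the signature only declares the function safe to call without the GIL; it does not release the GIL by itself.)

cdef int HANDLE_RETURN(supported_error_type err) except?-1 nogil:
    # callable without the GIL, but nothing here releases it
    ...

# releasing is an explicit, separate step, e.g. at a call site:
with nogil:
    HANDLE_RETURN(err)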

Comment on lines +30 to 33
# TODO: I prefer to type these as "cdef object" and avoid accessing them from within Python,
# but it seems it is very convenient to expose them for testing purposes...
_tls = threading.local()
_lock = threading.Lock()
Collaborator

We could cdef object these and then create private functions to return them for use in testing, but we could do that in a follow-up.
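
A sketch of that follow-up (the accessor names are hypothetical):

import threading

cdef object _tls = threading.local()
cdef object _lock = threading.Lock()

def _get_tls_for_testing():
    # module-level cdef objects are not visible from Python,
    # so expose them through private accessors
    return _tls

def _get_lock_for_testing():
    return _lock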

Comment on lines 1114 to 1115
name = name.split(b"\0")[0]
return name.decode()
Collaborator

nitpick: there's probably a way to do this without the GIL?
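
One possible sketch, assuming name can be kept as a C char buffer rather than a Python bytes object (helper name is hypothetical): scan for the NUL terminator without the GIL, then decode while holding it:

from libc.string cimport strlen

cdef str _decode_c_name(const char* name):
    cdef size_t n
    with nogil:
        n = strlen(name)          # no Python objects touched here
    return name[:n].decode()      # building the str still requires the GIL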

Comment on lines +859 to +860
# Note: This is Linux only (int for file descriptor)
cdef int alloc_handle
Collaborator

Should we consider putting this and the below function call behind a compilation conditional?

Member Author

Alternatively we could revisit this when we start working on #1028?
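
If we did want a build-time guard, one possible sketch (the macro name is made up; the verbatim block lets the C preprocessor do the platform check):

cdef extern from * nogil:
    """
    #ifdef __linux__
    #define HAS_POSIX_FD_HANDLE 1
    #else
    #define HAS_POSIX_FD_HANDLE 0
    #endif
    """
    bint HAS_POSIX_FD_HANDLE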

cdef int _get_device_and_context(self) except?-1:
    cdef cydriver.CUcontext curr_ctx
    if self._device_id == cydriver.CU_DEVICE_INVALID:
        # TODO: It is likely faster/safer to call cuCtxGetCurrent?
Collaborator

I don't think this is universally true: a context can be non-current without being destroyed, and retrieving the device and context from that stream object would still be valid.

Member Author

I did not bother profiling here; I just thought it might be good to save a few CUDA calls... 😛

with nogil:
    HANDLE_RETURN(cydriver.cuMemAllocFromPoolAsync(&devptr, size, self._mempool_handle, s))
cdef Buffer buf = Buffer.__new__(Buffer)
buf._ptr = <intptr_t>(devptr)
Collaborator

We are sure we want intptr_t and not uintptr_t?

@oleksandr-pavlyk (Contributor) commented Oct 6, 2025

We definitely want uintptr_t to make sure we can represent all device pointers, even those whose highest (64th) bit is set. Such pointers are conceivably possible on systems with several large-memory-capacity GPUs.

$ cat c.pyx
from libc.stdint cimport uintptr_t, intptr_t

def roundtrip_u(ptr : int) -> int:
    cdef uintptr_t _ptr = <uintptr_t>(int(ptr))
    return _ptr

def roundtrip_i(ptr : int) -> int:
    cdef intptr_t _ptr = <intptr_t>(int(ptr))
    return _ptr

$ cython -+ c.pyx 
$ g++ c.cpp -shared -fPIC $(python3-config --includes) $(python3-config --libs) -o c$(python3-config --extension-suffix)

$ python
Python 3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import c
>>> import cuda.core.experimental as cc
>>> cc.Device().set_current()
>>> stream = cc.Device().create_stream()
>>> buf = cc.DeviceMemoryResource(cc.Device()).allocate(2048, stream)

>>> c.roundtrip_u(buf.handle) # pointer here does not have the highest bit set, so both roundtrips work
17213423616
>>> c.roundtrip_i(buf.handle)
17213423616

>>> c.roundtrip_u(2**63 + 5)  # contriving an example where it breaks
9223372036854775813
>>> c.roundtrip_i(2**63 + 5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c.pyx", line 8, in c.roundtrip_i
    cdef intptr_t _ptr = <intptr_t>(int(ptr))
OverflowError: Python int too large to convert to C ssize_t

Contributor

What was the rationale for switching from uintptr_t in the first place?

Member Author

Discussed this during the review meeting, but wouldn't hurt to summarize here:

  • uintptr_t is used for all CUDA object addresses
  • intptr_t is used only for Buffer in case we need to compute pointer offsets
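
For illustration only (hypothetical names, not from this PR): the signed type lets offset arithmetic go negative without wrapping through unsigned values:

cdef intptr_t base = buf._ptr           # Buffer stores a signed address
cdef intptr_t diff = other._ptr - base  # pointer differences may legitimately be negative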

Member Author

@oleksandr-pavlyk good question. I think that by casting a Python int to intptr_t, your example constitutes a lossy conversion that Cython protected you against. If you were to do the same in C, the C/C++ compiler would catch it too:

/local/home/leof/dev/round_trip_cast.c:2584:29: warning: integer constant is so large that it is unsigned
 2584 |   __pyx_v__ptr = ((intptr_t)9223372036854775813LL);

intptr_t only guarantees that it can be round-trip converted to/from void* (which is what we need, even with multi-device in mind). It does not offer any guarantees beyond that, to the best of my knowledge.

We could do the following, but whether it makes much sense depends on the use case:

# distutils: language = c++

from libc.stdint cimport uintptr_t, intptr_t

def roundtrip_u(ptr) -> int:
    cdef uintptr_t _ptr = <uintptr_t>(int(ptr))
    return _ptr


def roundtrip_i(ptr) -> int:
    cdef intptr_t _ptr = <intptr_t>(int(ptr))
    return _ptr


def roundtrip_i_v2(ptr) -> int:
    cdef intptr_t _ptr = <intptr_t>(<uintptr_t>(int(ptr)))
    return _ptr


cdef extern from * nogil:
    """
    intptr_t convert_safe(void* ptr) {
        return reinterpret_cast<intptr_t>(ptr);
    }
    """
    intptr_t convert_safe(void* ptr)


def roundtrip_i_v3(ptr) -> int:
    cdef intptr_t _ptr = convert_safe(<void*><uintptr_t>(int(ptr)))
    return _ptr

Output:

>>> import round_trip_cast
>>> round_trip_cast.roundtrip_u(2**63 + 5)
9223372036854775813
>>> round_trip_cast.roundtrip_i(2**63 + 5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "round_trip_cast.pyx", line 11, in round_trip_cast.roundtrip_i
    cdef intptr_t _ptr = <intptr_t>(int(ptr))
OverflowError: Python int too large to convert to C ssize_t
>>> round_trip_cast.roundtrip_i_v2(2**63 + 5)
-9223372036854775803
>>> round_trip_cast.roundtrip_i_v3(2**63 + 5)
-9223372036854775803
>>> 

@leofang (Member Author) commented Oct 6, 2025

/ok to test a5e6bcf

Labels
  • cuda.core (Everything related to the cuda.core module)
  • enhancement (Any code-related improvements)
  • P0 (High priority - Must do!)
Development

Successfully merging this pull request may close these issues.

  • cuda-core: Release GIL when calling cimport'd CUDA APIs
  • [FEA]: Update the memory module to use __dealloc__
8 participants