Cythonize cuda.core #1070
Conversation
Force-pushed from 7641f18 to fa88b28
/ok to test b713c5c

/ok to test bf712bf

@leofang Please update the PR description to include a description of the solution you are proposing.
```cython
cdef cydriver.CUdevice get_device_from_ctx(
        cydriver.CUcontext target_ctx, cydriver.CUcontext curr_ctx) except?cydriver.CU_DEVICE_INVALID nogil:
```
When I see this level of complexity in Cython, approaching that of C++, I wonder whether it would be simpler to directly write C++ and expose it through something like pybind11. Not sure if I'm the only one who thinks this.
You are not alone in that ;) It's also possible to write C++ and just call it from Cython -- we have a couple of small `.cpp` files in this repo already.
I've also been thinking this. Basically it comes down to this question: at what point on the trade-off curve does using a DSL that generates C++ become harder than just writing the C++ manually?
Also, it's possible to inline C++ in Cython if needed with:

```cython
cdef extern from *:
    """
    C++ code goes here
    """
```
I'd love to see an exploration into this. Given well-defined criteria, shave off a small piece of cuda.core, try a few implementation strategies, and evaluate the results.
The ship has sailed. Let's create an issue to track this discussion. The next time we will revisit it is when we start looking into cccl-runtime as the underlying implementation of cuda.core. I am not sure if anyone actually understands the implication of going through C++ now, after all the discussions in recent Friday co-design meetings. There are things that we need and get from cuda-bindings not yet available in native C++.
""" | ||
self._shutdown_safe_close(stream, is_shutting_down=None) | ||
if self._ptr and self._mr is not None: | ||
if isinstance(self._mr, _cyMemoryResource): |
Can we implement `__reduce__` on `cyMemoryResource` and avoid the `isinstance` check here?
With this PR, we have two kinds of memory resource implementations:

- A pure Python one that inherits the `MemoryResource` ABC:
  - In this case, we need to ensure the Python `deallocate()` is invoked
- A `cuda.core`-provided one that inherits both `_cyMemoryResource` (to get a C layout & methods) and `MemoryResource` (to meet Python requirements):
  - In this case, we can call the Cython `_dealloc()`

It was unclear to me when I worked on this whether the `isinstance` check can be replaced by a better approach, but since it is only a pointer comparison, I assume it is OK if we keep it.
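A minimal pure-Python sketch of the two-path dispatch described above (the class and function names below are illustrative stand-ins, not the actual cuda.core API):

```python
class MemoryResource:
    """Stand-in for the Python ABC."""
    def deallocate(self, ptr, size):
        raise NotImplementedError

class _cyMemoryResource:
    """Stand-in for the Cython base class with a fast C-level path."""
    def _dealloc(self, ptr, size):
        return ("cython_fast_path", ptr)

class DeviceMemoryResource(_cyMemoryResource, MemoryResource):
    """cuda.core-provided resource: gets the C layout and the ABC."""

class UserMemoryResource(MemoryResource):
    """Pure-Python user subclass: only the Python deallocate() exists."""
    def deallocate(self, ptr, size):
        return ("python_path", ptr)

def release(mr, ptr, size):
    # The isinstance check selects the C-level fast path when available.
    if isinstance(mr, _cyMemoryResource):
        return mr._dealloc(ptr, size)
    return mr.deallocate(ptr, size)
```

Here `release(DeviceMemoryResource(), ptr, size)` takes the Cython path while `release(UserMemoryResource(), ptr, size)` falls back to the Python `deallocate()`, mirroring the two cases listed above.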
If you have `__reduce__` implemented in `cyMemoryResource`, then wouldn't classes like `DeviceMemoryResource` automatically use that one without the need for this `isinstance` check?

`isinstance` checks are notoriously expensive when `abc`s are involved, but also it seems cleaner for `cyMemoryResource` to define its own `__reduce__`.
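A small self-contained illustration (using only the standard `abc` module, not cuda.core types) of why ABC `isinstance` checks are more than a pointer comparison: virtual subclass registration routes the check through the ABC's `__subclasscheck__` machinery rather than a plain type check.

```python
import abc

class Resource(abc.ABC):
    @abc.abstractmethod
    def close(self): ...

class Unrelated:
    """Does not inherit from Resource at all."""
    def close(self):
        return "closed"

# Virtual subclass registration: isinstance() must now consult the ABC
# machinery (__subclasscheck__ plus its caches) instead of walking the MRO.
Resource.register(Unrelated)

print(isinstance(Unrelated(), Resource))  # True, despite no inheritance
```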
@shwina I might have missed something. How does `__reduce__` help `close`/`__dealloc__`?
Sorry, the diff appeared to show that this was within the `__reduce__` method. Please replace all instances of `__reduce__` in my comments with `close`.
Ah OK. Then I have a follow-up question: how do we know for certain that we'd call the Cython `close` and not the Python `close`, without an explicit `isinstance` check? Do we just inspect the generated code?
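For context on the suggestion being discussed: ordinary Python attribute lookup already dispatches to the most-derived override via the MRO, so no `isinstance` is needed when the subclass defines its own `close`. A minimal sketch with illustrative names:

```python
class PyResource:
    """Stand-in for a pure-Python resource."""
    def close(self):
        return "python close"

class CyResource(PyResource):
    """Stand-in for a Cython subclass that overrides close()."""
    def close(self):
        # Normal method resolution picks this override for CyResource
        # instances -- the caller never needs an isinstance check.
        return "cython close"

def shutdown(resource):
    return resource.close()  # dynamic dispatch selects the right close()
```

`shutdown(CyResource())` reaches the override and `shutdown(PyResource())` the base implementation, without the caller inspecting types.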
> `isinstance` checks are notoriously expensive when `abc`s are involved

I've run into this also. Hugely expensive.
To be discussed tomorrow... I raised a thread offline.
I left a comment asking about pointer data type changes and pointer value conversions.
```diff
-cdef int HANDLE_RETURN(supported_error_type err) except?-1:
+cdef int HANDLE_RETURN(supported_error_type err) except?-1 nogil:
```
Should we explicitly release the GIL in this function?
```cython
# TODO: I prefer to type these as "cdef object" and avoid accessing them from within Python,
# but it seems it is very convenient to expose them for testing purposes...
_tls = threading.local()
_lock = threading.Lock()
```
We could `cdef object` these and then create private functions to return them for usage in testing, but we could do that in a follow-up.
```cython
name = name.split(b"\0")[0]
return name.decode()
```
nitpick: there's probably a way to do this without the GIL?
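For reference, the lines being discussed truncate a NUL-padded byte buffer at the first NUL before decoding; a quick pure-Python illustration with a made-up device name:

```python
raw = b"NVIDIA H100\x00\x00\x00"   # hypothetical NUL-padded name buffer
name = raw.split(b"\0")[0].decode()  # keep only the bytes before the first NUL
print(name)  # NVIDIA H100
```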
```cython
# Note: This is Linux only (int for file descriptor)
cdef int alloc_handle
```
Should we consider putting this and the below function call behind a compilation conditional?
Alternatively we could revisit this when we start working on #1028?
```cython
cdef int _get_device_and_context(self) except?-1:
    cdef cydriver.CUcontext curr_ctx
    if self._device_id == cydriver.CU_DEVICE_INVALID:
        # TODO: It is likely faster/safer to call cuCtxGetCurrent?
```
I don't think this is universally true: a context can be non-current without being destroyed. Retrieving the device and context from that stream object would still be valid.
I did not bother doing a profiling here, just thought that it might be good to reduce a few CUDA calls... 😛
```cython
with nogil:
    HANDLE_RETURN(cydriver.cuMemAllocFromPoolAsync(&devptr, size, self._mempool_handle, s))
cdef Buffer buf = Buffer.__new__(Buffer)
buf._ptr = <intptr_t>(devptr)
```
Are we sure we want `intptr_t` and not `uintptr_t`?
We definitely want `uintptr_t` to make sure we can represent all device pointers, even those whose highest bit is set. Such pointers are conceivably possible on systems with several large-memory-capacity GPUs.
```console
$ cat c.pyx
from libc.stdint cimport uintptr_t, intptr_t

def roundtrip_u(ptr: int) -> int:
    cdef uintptr_t _ptr = <uintptr_t>(int(ptr))
    return _ptr

def roundtrip_i(ptr: int) -> int:
    cdef intptr_t _ptr = <intptr_t>(int(ptr))
    return _ptr
$ cython -+ c.pyx
$ g++ c.cpp -shared -fPIC $(python3-config --includes) $(python3-config --libs) -o c$(python3-config --extension-suffix)
$ python
Python 3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import c
>>> import cuda.core.experimental as cc
>>> cc.Device().set_current()
>>> stream = cc.Device().create_stream()
>>> buf = cc.DeviceMemoryResource(cc.Device()).allocate(2048, stream)
>>> c.roundtrip_u(buf.handle)  # pointer here does not have the highest bit set, so both roundtrips work
17213423616
>>> c.roundtrip_i(buf.handle)
17213423616
>>> c.roundtrip_u(2**63 + 5)  # contriving an example where it breaks
9223372036854775813
>>> c.roundtrip_i(2**63 + 5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c.pyx", line 8, in c.roundtrip_i
    cdef intptr_t _ptr = <intptr_t>(int(ptr))
OverflowError: Python int too large to convert to C ssize_t
```
What was the rationale for switching from `uintptr_t` in the first place?
Discussed this during the review meeting, but it wouldn't hurt to summarize here:

- `uintptr_t` is used for all CUDA object addresses
- `intptr_t` is used only for `Buffer`, in case we need to compute pointer offsets
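To make the signed/unsigned distinction concrete, here is a small Python sketch using `ctypes` fixed-width integers to mimic the two C casts (illustrative only, not cuda.core code; ctypes integer types wrap modulo 2^64 rather than raising):

```python
import ctypes

addr = 2**63 + 5  # a hypothetical address with the highest bit set

# uintptr_t-like: the full 64-bit address round-trips unchanged.
print(ctypes.c_uint64(addr).value)  # 9223372036854775813

# intptr_t-like: the same bit pattern reinterpreted as signed goes negative.
print(ctypes.c_int64(addr).value)   # -9223372036854775803
```

The negative value is exactly the output seen in the `roundtrip_i_v2`/`roundtrip_i_v3` experiments below: the bits survive, but the Python-level integer changes sign.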
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandr-pavlyk good question. I think by casting a Python `int` to `intptr_t`, your example constitutes a lossy conversion that Cython protected you against. If you were to do the same in C, the C/C++ compiler would catch it too:
```
/local/home/leof/dev/round_trip_cast.c:2584:29: warning: integer constant is so large that it is unsigned
 2584 |     __pyx_v__ptr = ((intptr_t)9223372036854775813LL);
      |
```
`intptr_t` only guarantees that it can be round-trip converted to/from `void*` (which is what we need, even with multi-device in mind). It does not offer any guarantees beyond that, to the best of my knowledge.
We could do this, but it does not make much sense depending on the use case:

```cython
# distutils: language = c++
from libc.stdint cimport uintptr_t, intptr_t

def roundtrip_u(ptr) -> int:
    cdef uintptr_t _ptr = <uintptr_t>(int(ptr))
    return _ptr

def roundtrip_i(ptr) -> int:
    cdef intptr_t _ptr = <intptr_t>(int(ptr))
    return _ptr

def roundtrip_i_v2(ptr) -> int:
    cdef intptr_t _ptr = <intptr_t>(<uintptr_t>(int(ptr)))
    return _ptr

cdef extern from * nogil:
    """
    intptr_t convert_safe(void* ptr) {
        return reinterpret_cast<intptr_t>(ptr);
    }
    """
    intptr_t convert_safe(void* ptr)

def roundtrip_i_v3(ptr) -> int:
    cdef intptr_t _ptr = convert_safe(<void*><uintptr_t>(int(ptr)))
    return _ptr
```
Output:

```pycon
>>> import round_trip_cast
>>> round_trip_cast.roundtrip_u(2**63 + 5)
9223372036854775813
>>> round_trip_cast.roundtrip_i(2**63 + 5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "round_trip_cast.pyx", line 11, in round_trip_cast.roundtrip_i
    cdef intptr_t _ptr = <intptr_t>(int(ptr))
OverflowError: Python int too large to convert to C ssize_t
>>> round_trip_cast.roundtrip_i_v2(2**63 + 5)
-9223372036854775803
>>> round_trip_cast.roundtrip_i_v3(2**63 + 5)
-9223372036854775803
```
/ok to test a5e6bcf
Description
Close #1065
Close #1063
Checklist