Skip to content

Conversation

@Andy-Jost
Copy link
Contributor

@Andy-Jost Andy-Jost commented Oct 16, 2025

Description

Adds construction keywords to Event to allow creation of IPC-enabled events. Adds reduce methods to allow events to be sent to other processes. Updates tests.

Closes #1040

…x_size` memory resource attribute to size_t from int32. Various updates and additions to test helpers.
@Andy-Jost Andy-Jost self-assigned this Oct 16, 2025
@copy-pr-bot
Copy link
Contributor

copy-pr-bot bot commented Oct 16, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Andy-Jost Andy-Jost marked this pull request as draft October 16, 2025 20:07
@Andy-Jost
Copy link
Contributor Author

/ok to test 3d16913

@github-actions

This comment has been minimized.

…CBufferTestHelper` to `PatternGen` and combine `flipped` and `starting_from` arguments to just `seed`. Rename `compare_buffers` to `compare_equal_buffers` and have it return a Boolean.
@leofang leofang added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Oct 17, 2025
@leofang leofang self-requested a review October 17, 2025 04:04
@leofang leofang linked an issue Oct 17, 2025 that may be closed by this pull request
@Andy-Jost Andy-Jost force-pushed the ipc_events branch 2 times, most recently from fedc205 to 8a84c51 Compare October 17, 2025 17:32
@Andy-Jost Andy-Jost requested review from cpcloud and rwgk October 17, 2025 17:36
@Andy-Jost Andy-Jost marked this pull request as ready for review October 17, 2025 17:41
@Andy-Jost
Copy link
Contributor Author

/ok to test 5ad48ca

Copy link
Contributor Author

@Andy-Jost Andy-Jost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is ready for review.

self._busy_waited = ipc_descriptor._busy_waited
self._ipc_enabled = True
self._ipc_descriptor = ipc_descriptor
self._device_id = -1 # ??
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Events hold a device and context handle. I could not find these being used anywhere except the property getters in this class. For imported events, these are not available. The current implementation returns None for these properties for IPC-imported events.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about the intended use of the device and context. Do we need setters for these?

self._device_id = device_id
self._ctx_handle = ctx_handle
if opts.ipc_enabled:
self.get_ipc_descriptor()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multiprocessing serializes arguments in a separate thread. If get_ipc_descriptor is called for the first time during serialization, then cuIpcGetEventHandle raises an error saying no CUDA context is bound to that thread. The best solution I found is this: if an event is created with IPC support, then create the descriptor right away (in the main thread) and cache it.


cdef:
bytes _reserved
bint _busy_waited
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I send the _busy_waited property to the child because there is no driver API to query it. I failed to devise a test that actually checks whether an event blocks or busy-waits in the child process, so I cannot confirm whether the property is accurate.

except ImportError:
# Import shared platform helpers for tests across repos
sys.path.insert(0, str(pathlib.Path(__file__).resolve().parents[2] / "cuda_python_test_helpers"))
sys.path.insert(0, str(pathlib.Path(__file__).resolve().parents[3] / "cuda_python_test_helpers"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made helpers.py into a package and merged the IPC utility.py into it.

log("done")


def test_event_is_monadic(ipc_device):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can discuss this design decision. My view is that it's better for users to potentially trip over these limitations (~5 minute fix) rather than risk creating a race with mutable events (hours or days).

# Set up the IPC-enabled memory pool and share it.
device = ipc_device
mr = ipc_memory_resource
pgen = PatternGen(device, NBYTES)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are lots of changes coming from the rename of IPCBufferTestHelper to PatternGen, but the new class is much more sane and useful.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it make sense to break the rename out as a separate PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to it. I think there's not much risk of this change causing trouble for being overly complex, because the actual changes to cuda.core are so small. IMO, moving to the next project has more value but I'm not set in stone.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the actual changes to cuda.core are so small.

That's very difficult for me to see at the moment. Let's discuss offline. I don't want to generate extra work.


# This kernel is designed to busy loop until a signal is received
code = """
#include <cuda/atomic>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Factored this out to helpers/latch.py

raise RuntimeError("the pinned memory resource is not bound to any GPU")


class DummyUnifiedMemoryResource(MemoryResource):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to helpers/buffers.py

@Andy-Jost
Copy link
Contributor Author

/ok to test 0fe70cf

@Andy-Jost
Copy link
Contributor Author

/ok to test 820bf1b

from cuda.core.experimental._stream cimport default_stream



Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Necessary whitespace?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not. I assume the pre-commit hooks will reformat as needed, but maybe not in this case?

cdef class Event:

cdef:
cydriver.CUevent _handle
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does cython throw warnings when data members inject padding overhead that could be eliminated by reordering members?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. FYI, CUevent is 32 bytes IIRC

"""Export an event allocated for sharing between processes."""
if self._ipc_descriptor is not None:
return self._ipc_descriptor
if not self.is_ipc_enabled:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if is_ipc_enabled is false would we want this error to fire even if has some value in the _ipc_descriptor field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not be possible to fill the _ipc_descriptor field in that case. It is only ever set non-None below, in this function, after the check.

def from_ipc_descriptor(cls, ipc_descriptor: IPCEventDescriptor) -> Event:
"""Import an event that was exported from another process."""
cdef cydriver.CUipcEventHandle data
memcpy(data.reserved, <const void*><const char*>(ipc_descriptor._reserved), sizeof(data.reserved))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the explicit case to void* required here? It should be implicitly convertible for pointer types.


def __init__(self, device):
if helpers.CUDA_INCLUDE_PATH is None:
pytest.skip("need CUDA header")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we provide more information to the user if they hit this? What would their next steps be?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, I moved this out of the existing test_events.py test.

@Andy-Jost Andy-Jost enabled auto-merge (squash) October 21, 2025 17:56
@Andy-Jost
Copy link
Contributor Author

/ok to test 6ef6b03

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

20 files reviewed, 6 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +48 to +50
mr = LegacyPinnedMemoryResource()
self.buffer = mr.allocate(4)
self.busy_wait_flag[0] = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Property busy_wait_flag (defined on line 62) is accessed before the property decorator is evaluated. This will raise AttributeError at runtime

Suggested change
mr = LegacyPinnedMemoryResource()
self.buffer = mr.allocate(4)
self.busy_wait_flag[0] = 0
mr = LegacyPinnedMemoryResource()
self.buffer = mr.allocate(4)
ctypes.cast(int(self.buffer.handle), ctypes.POINTER(ctypes.c_int32))[0] = 0

Comment on lines +39 to 42
pgen = PatternGen(device, NBYTES)
for buffer in buffers:
IPCBufferTestHelper(device, buffer).verify_buffer(flipped=True)
pgen.verify_buffer(buffer, seed=True)
buffer.close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: PatternGen instance created outside the worker pool context but used to verify buffers modified by child processes - verify that PatternGen.verify_buffer generates the same pattern across processes when seed=True. Does PatternGen use a deterministic seed that produces identical patterns in parent and child processes?

Comment on lines +39 to +40
def device_id(self) -> int:
return self.device
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: device_id property returns self.device, but the type is unclear – if device is a Device object (line 20-21), this returns the wrong type. Should return self.device.device_id or cast to int. Is self.device an integer device ID or a Device object?

Comment on lines +193 to +194
self._device_id = -1 # ??
self._ctx_handle = None # ??
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Device ID and context handle left uninitialized for imported events. May cause AttributeError or incorrect behavior if device or context properties accessed. Should imported events query the current device/context, or is it expected that these properties will return None?

Comment on lines +286 to +288
def __eq__(self, IPCEventDescriptor rhs):
# No need to check self._busy_waited.
return self._reserved == rhs._reserved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Equality check ignores _busy_waited flag. Two descriptors with different sync behavior will compare equal. Is the busy_waited flag intentionally excluded from equality to allow interoperability, or should it be part of the comparison?

Comment on lines +60 to +61
latch.release()
process.join()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Latch is released after waiting for the child acknowledgment (line56), but the child's copy (line 83) is enqueued on stream2 which waits on the event from stream1 (line 81). If the latch blocks stream1, the event may never complete and the child will hang. Verify that the event is recorded before the latch blocks or reorder the release to before the child wait. Does stream1.record guarantee that the event timestamp is captured before any subsequent kernel on stream1 starts executing, or does the event wait for the kernel to finish?

@Andy-Jost Andy-Jost merged commit 20f29e9 into NVIDIA:main Oct 22, 2025
66 of 71 checks passed
@github-actions
Copy link

Doc Preview CI
Preview removed because the pull request was closed or merged.

Andy-Jost added a commit to Andy-Jost/cuda-python that referenced this pull request Oct 22, 2025
Andy-Jost added a commit that referenced this pull request Oct 22, 2025
Andy-Jost added a commit to Andy-Jost/cuda-python that referenced this pull request Oct 22, 2025
Andy-Jost added a commit that referenced this pull request Oct 23, 2025
* Restore "Implement IPC-enabled events. (#1145)"

This reverts commit bcd40ff.

* Add timeout to LatchKernel.

* Fix test_latchkernel crash on Windows.

* Add timeout to child process join in tests.

* Minor test fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda.core Everything related to the cuda.core module feature New feature or request P0 High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enable IPC events and update tests to test stream-ordered IPC

4 participants