-
Notifications
You must be signed in to change notification settings - Fork 221
Implement IPC-enabled events. #1145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…x_size` memory resource attribute to size_t from int32. Various updates and additions to test helpers.
|
/ok to test 3d16913 |
This comment has been minimized.
This comment has been minimized.
…CBufferTestHelper` to `PatternGen` and combine `flipped` and `starting_from` arguments to just `seed`. Rename `compare_buffers` to `compare_equal_buffers` and have it return a Boolean.
fedc205 to
8a84c51
Compare
|
/ok to test 5ad48ca |
Andy-Jost
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is ready for review.
| self._busy_waited = ipc_descriptor._busy_waited | ||
| self._ipc_enabled = True | ||
| self._ipc_descriptor = ipc_descriptor | ||
| self._device_id = -1 # ?? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Events hold a device and context handle. I could not find these being used anywhere except the property getters in this class. For imported events, these are not available. The current implementation returns None for these properties for IPC-imported events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about the intended use of the device and context. Do we need setters for these?
| self._device_id = device_id | ||
| self._ctx_handle = ctx_handle | ||
| if opts.ipc_enabled: | ||
| self.get_ipc_descriptor() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multiprocessing serializes arguments in a separate thread. If get_ipc_descriptor is called for the first time during serialization, then cuIpcGetEventHandle raises an error saying no CUDA context is bound to that thread. The best solution I found is this: if an event is created with IPC support, then create the descriptor right away (in the main thread) and cache it.
|
|
||
| cdef: | ||
| bytes _reserved | ||
| bint _busy_waited |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I send the _busy_waited property to the child because there is no driver API to query it. I failed to devise a test that actually checks whether an event blocks or busy-waits in the child process, so I cannot confirm whether the property is accurate.
| except ImportError: | ||
| # Import shared platform helpers for tests across repos | ||
| sys.path.insert(0, str(pathlib.Path(__file__).resolve().parents[2] / "cuda_python_test_helpers")) | ||
| sys.path.insert(0, str(pathlib.Path(__file__).resolve().parents[3] / "cuda_python_test_helpers")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made helpers.py into a package and merged the IPC utility.py into it.
| log("done") | ||
|
|
||
|
|
||
| def test_event_is_monadic(ipc_device): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can discuss this design decision. My view is that it's better for users to potentially trip over these limitations (~5 minute fix) rather than risk creating a race with mutable events (hours or days).
| # Set up the IPC-enabled memory pool and share it. | ||
| device = ipc_device | ||
| mr = ipc_memory_resource | ||
| pgen = PatternGen(device, NBYTES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are lots of changes coming from the rename of IPCBufferTestHelper to PatternGen, but the new class is much more sane and useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it make sense to break the rename out as a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to it. I think there's not much risk of this change causing trouble for being overly complex, because the actual changes to cuda.core are so small. IMO, moving to the next project has more value but I'm not set in stone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because the actual changes to cuda.core are so small.
That's very difficult for me to see at the moment. Let's discuss offline. I don't want to generate extra work.
|
|
||
| # This kernel is designed to busy loop until a signal is received | ||
| code = """ | ||
| #include <cuda/atomic> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Factored this out to helpers/latch.py
| raise RuntimeError("the pinned memory resource is not bound to any GPU") | ||
|
|
||
|
|
||
| class DummyUnifiedMemoryResource(MemoryResource): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to helpers/buffers.py
|
/ok to test 0fe70cf |
|
/ok to test 820bf1b |
| from cuda.core.experimental._stream cimport default_stream | ||
|
|
||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Necessary whitespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not. I assume the pre-commit hooks will reformat as needed, but maybe not in this case?
| cdef class Event: | ||
|
|
||
| cdef: | ||
| cydriver.CUevent _handle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does cython throw warnings when data members inject padding overhead that could be eliminated by reordering members?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. FYI, CUevent is 32 bytes IIRC
| """Export an event allocated for sharing between processes.""" | ||
| if self._ipc_descriptor is not None: | ||
| return self._ipc_descriptor | ||
| if not self.is_ipc_enabled: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if is_ipc_enabled is false would we want this error to fire even if has some value in the _ipc_descriptor field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not be possible to fill the _ipc_descriptor field in that case. It is only ever set non-None below, in this function, after the check.
| def from_ipc_descriptor(cls, ipc_descriptor: IPCEventDescriptor) -> Event: | ||
| """Import an event that was exported from another process.""" | ||
| cdef cydriver.CUipcEventHandle data | ||
| memcpy(data.reserved, <const void*><const char*>(ipc_descriptor._reserved), sizeof(data.reserved)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the explicit case to void* required here? It should be implicitly convertible for pointer types.
|
|
||
| def __init__(self, device): | ||
| if helpers.CUDA_INCLUDE_PATH is None: | ||
| pytest.skip("need CUDA header") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we provide more information to the user if they hit this? What would their next steps be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, I moved this out of the existing test_events.py test.
|
/ok to test 6ef6b03 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
20 files reviewed, 6 comments
| mr = LegacyPinnedMemoryResource() | ||
| self.buffer = mr.allocate(4) | ||
| self.busy_wait_flag[0] = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Property busy_wait_flag (defined on line 62) is accessed before the property decorator is evaluated. This will raise AttributeError at runtime
| mr = LegacyPinnedMemoryResource() | |
| self.buffer = mr.allocate(4) | |
| self.busy_wait_flag[0] = 0 | |
| mr = LegacyPinnedMemoryResource() | |
| self.buffer = mr.allocate(4) | |
| ctypes.cast(int(self.buffer.handle), ctypes.POINTER(ctypes.c_int32))[0] = 0 |
| pgen = PatternGen(device, NBYTES) | ||
| for buffer in buffers: | ||
| IPCBufferTestHelper(device, buffer).verify_buffer(flipped=True) | ||
| pgen.verify_buffer(buffer, seed=True) | ||
| buffer.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: PatternGen instance created outside the worker pool context but used to verify buffers modified by child processes - verify that PatternGen.verify_buffer generates the same pattern across processes when seed=True. Does PatternGen use a deterministic seed that produces identical patterns in parent and child processes?
| def device_id(self) -> int: | ||
| return self.device |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: device_id property returns self.device, but the type is unclear – if device is a Device object (line 20-21), this returns the wrong type. Should return self.device.device_id or cast to int. Is self.device an integer device ID or a Device object?
| self._device_id = -1 # ?? | ||
| self._ctx_handle = None # ?? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Device ID and context handle left uninitialized for imported events. May cause AttributeError or incorrect behavior if device or context properties accessed. Should imported events query the current device/context, or is it expected that these properties will return None?
| def __eq__(self, IPCEventDescriptor rhs): | ||
| # No need to check self._busy_waited. | ||
| return self._reserved == rhs._reserved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Equality check ignores _busy_waited flag. Two descriptors with different sync behavior will compare equal. Is the busy_waited flag intentionally excluded from equality to allow interoperability, or should it be part of the comparison?
| latch.release() | ||
| process.join() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Latch is released after waiting for the child acknowledgment (line56), but the child's copy (line 83) is enqueued on stream2 which waits on the event from stream1 (line 81). If the latch blocks stream1, the event may never complete and the child will hang. Verify that the event is recorded before the latch blocks or reorder the release to before the child wait. Does stream1.record guarantee that the event timestamp is captured before any subsequent kernel on stream1 starts executing, or does the event wait for the kernel to finish?
|
This reverts commit 20f29e9.
This reverts commit bcd40ff.
Description
Adds construction keywords to
Eventto allow creation of IPC-enabled events. Adds reduce methods to allow events to be sent to other processes. Updates tests.Closes #1040