cuda.core: Cythonize GraphBuilder and Graph with handle-layer cleanup#2008
cuda.core: Cythonize GraphBuilder and Graph with handle-layer cleanup#2008Andy-Jost wants to merge 8 commits into
Conversation
…hine Refactor GraphBuilder from a Python class using _MembersNeededForFinalize to a cdef class with explicit _BuilderKind (PRIMARY/FORKED/CONDITIONAL_BODY) and _CaptureState (NOT_STARTED/CAPTURING/ENDED) tracking. Cleanup moves into __dealloc__/close, and the builder now uses GraphHandle/StreamHandle from _resource_handles instead of holding raw driver objects. Drop the is_stream_owner flag now that StreamHandle owns the lifetime. End-capture paths in __dealloc__ and close guard on _h_stream so cleanup is safe even if _init* fails before completing assignment. Made-with: Cursor
Add a GraphExecHandle to the resource-handle layer (parallel to GraphHandle) wrapping CUgraphExec with RAII cleanup via cuGraphExecDestroy on shared_ptr release. Convert Graph from a Python class using _MembersNeededForFinalize to a cdef class holding a typed _h_graph_exec attribute, dropping the weakref.finalize machinery. update/upload/launch move to nogil cydriver paths consistent with the GraphBuilder rewrite. Also drop quoted forward-reference annotations on create_graph_builder and _instantiate_graph/complete now that GraphBuilder is cimported in _device.pyx and _stream.pyx and Cython accepts the in-module forward reference to Graph. Clears the related "Strings should no longer be used for type declarations" warnings. Made-with: Cursor
The cdef-class member declarations live in the .pxd, so the .pyx does not need to re-cimport GraphExecHandle, GraphHandle, or StreamHandle. Made-with: Cursor
… cycle cimport-ing GraphBuilder at the top of _stream.pyx and _device.pyx made Cython emit a Python-level import of cuda.core.graph._graph_builder during _stream module init. That triggered the chain graph -> _graph_node -> _kernel_arg_handler -> _memory._buffer -> _device, which then re-entered the still-initializing _stream module via "from cuda.core._stream import IsStreamT", failing with ImportError: cannot import name IsStreamT. Restore the original lazy "import GraphBuilder" inside create_graph_builder (Stream and Device) and Stream_accept. The return annotations stay as bare names; "from __future__ import annotations" in both files defers their evaluation, so they need not resolve at function-definition time. Made-with: Cursor
|
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test |
The previous import-cycle fix changed _stream/_device.create_graph_builder to a lazy Python "import GraphBuilder" instead of a module-level cimport. With _init declared as @staticmethod cdef, Python attribute lookup cannot find it, so every test that builds a graph failed with "AttributeError: type object 'GraphBuilder' has no attribute '_init'" at _device.pyx:1376 / _stream.pyx:376. Convert _init from @staticmethod cdef to @staticmethod def (matches the Stream._init pattern) and drop the cdef declaration from the .pxd. _init runs once per builder creation, so the loss of cdef-level dispatch is irrelevant. Graph._init stays cdef; it is only called intra-module. Made-with: Cursor
|
/ok to test |
Every graph-builder test failed with CUDA_ERROR_INVALID_VALUE on the new ``GraphBuilder.begin_building`` path. The driver rejects ``cuStreamGetCaptureInfo`` when ``captureStatus_out`` is NULL, but the new ``_get_capture_info`` helper accepted a NULL status pointer and ``begin_building`` was calling it that way (it just wanted the freshly captured graph handle and assumed the status was implied by the preceding ``cuStreamBeginCapture``). Pass a stack-local ``CUstreamCaptureStatus`` and document the helper's requirement that ``status`` be non-NULL. ``graph`` is still allowed to be NULL (``is_building`` calls it that way and the driver accepts it). Co-authored-by: Cursor <cursoragent@cursor.com>
|
/ok to test |
Completes step 3 of NVIDIA#1330 by exposing the captured graph as an explicit `GraphDefinition` view that shares ownership of the underlying `CUgraph`. The handle-layer plumbing landed in PR NVIDIA#2008; this commit wires up the user-facing surface and locks in the state-guard rules. State semantics: - PRIMARY builder: only valid after `end_building()`. Before `begin_building()` no graph exists; during capture the driver is the sole writer, so explicit access is unsafe. - CONDITIONAL_BODY builder: valid both before `begin_building()` (the body graph is allocated at conditional-node creation time) and after `end_building()`. This enables a hybrid flow where a conditional body is populated entirely via the explicit API, with no capture at all. - FORKED builder: never valid. Forked builders share the primary's graph; access through the primary instead. Tests cover the happy path, both hybrid flows on conditional bodies (populate-via-explicit-API and capture-then-augment), the three error states (forked, capturing, primary pre-capture), and the shared-ownership guarantee (the `GraphDefinition` survives the builder's `close()`). Co-authored-by: Cursor <cursoragent@cursor.com>
f0b2e78 to
3c6c387
Compare
Completes step 3 of NVIDIA#1330 by exposing the captured graph as an explicit `GraphDefinition` view that shares ownership of the underlying `CUgraph`. The handle-layer plumbing landed in PR NVIDIA#2008; this commit wires up the user-facing surface and locks in the state-guard rules. State semantics: - PRIMARY builder: only valid after `end_building()`. Before `begin_building()` no graph exists; during capture the driver is the sole writer, so explicit access is unsafe. - CONDITIONAL_BODY builder: valid both before `begin_building()` (the body graph is allocated at conditional-node creation time) and after `end_building()`. This enables a hybrid flow where a conditional body is populated entirely via the explicit API, with no capture at all. - FORKED builder: never valid. Forked builders share the primary's graph; access through the primary instead. Tests cover the happy path, both hybrid flows on conditional bodies (populate-via-explicit-API and capture-then-augment), the three error states (forked, capturing, primary pre-capture), and the shared-ownership guarantee (the `GraphDefinition` survives the builder's `close()`). Co-authored-by: Cursor <cursoragent@cursor.com>
| if self._h_stream and self._state == CAPTURING and self._kind != FORKED: | ||
| with nogil: | ||
| cydriver.cuStreamEndCapture(as_cu(self._h_stream), NULL) |
There was a problem hiding this comment.
- Shouldn't we set
_stateand_kindto a guarding value to prevent the destructor from being invoked again? - If a builder owns a stream, shouldn't we free the stream too (which is the old behavior)?
There was a problem hiding this comment.
- Is this non-public behavior that
cuStreamEndCaptureaccepts a nullptr forCUgraph*? I don't think it's documented.
| def close(self): | ||
| """Destroy the graph builder.""" | ||
| if self._h_stream and self._state == CAPTURING and self._kind != FORKED: | ||
| with nogil: | ||
| HANDLE_RETURN(cydriver.cuStreamEndCapture(as_cu(self._h_stream), NULL)) | ||
| self._h_graph.reset() | ||
| self._h_stream.reset() | ||
| self._state = CAPTURE_ENDED | ||
| self._stream = None |
There was a problem hiding this comment.
Let's share the same destructor for both close and __dealloc__ like what we do elsewhere in the codebase, instead of repeating.
leofang
left a comment
There was a problem hiding this comment.
A few more observations beyond the inline thread already in flight — see the new inline comments. The two I'd actually want answers to before merge are (a) the FORKED-parent / conditional-body handle dependency on L762, and (b) the partial-failure window in begin_building on L322–328. The rest are minor.
-- Leo's bot
| graph = Graph._init(handle_return(driver.cuGraphInstantiateWithParams(h_graph, params))) | ||
| py_exec = handle_return(driver.cuGraphInstantiateWithParams(h_graph, params)) | ||
| c_exec = <cydriver.CUgraphExec><intptr_t>int(py_exec) | ||
| graph = Graph._init(c_exec) |
There was a problem hiding this comment.
Pre-existing pattern (the original wrapped before checking too), but worth fixing while we're here: if params.result_out indicates an instantiate error, c_exec may be NULL or otherwise invalid — yet we've already wrapped it in Graph._init, whose new RAII deleter will call cuGraphExecDestroy on it during the exception unwind that follows. Suggest checking result_out first and only constructing the Graph on SUCCESS.
|
|
||
| def __init__(self): | ||
| raise NotImplementedError( | ||
| "directly creating a Graph object can be ambiguous. Please either " |
There was a problem hiding this comment.
Nit: this is GraphBuilder.__init__, so the message should say "directly creating a GraphBuilder object" — currently mirrors the wording from the old class, but it's confusing now that there's a distinct Graph class right below (line 830) with its own message saying "directly constructing a Graph instance".
| HANDLE_RETURN(cydriver.cuStreamEndCapture(as_cu(self._h_stream), NULL)) | ||
| self._h_graph.reset() | ||
| self._h_stream.reset() | ||
| self._state = CAPTURE_ENDED |
There was a problem hiding this comment.
close() unconditionally lands every kind of builder in _state = CAPTURE_ENDED with _h_graph reset — including FORKED, which never legitimately reaches "ended" via the normal end_building path. After forked.close() (or join(), which calls close() on the non-root builders), a subsequent some_graph.update(forked_builder) then passes the _state != CAPTURE_ENDED check in Graph.update (line ~870) and dereferences a null _h_graph into cuGraphExecUpdate. Two options:
- Introduce a distinct
CLOSED(orINVALIDATED) state value soGraph.update/complete/debug_dot_printcan reject closed builders explicitly. - Or guard in
Graph.update'sGraphBuilderbranch onsource._h_graphbeing non-null with a clearer error than the raw CUDA failure.
Related to leofang's guarding-state question on L250.
| with nogil: | ||
| HANDLE_RETURN(cydriver.cuStreamBeginCapture(c_stream, c_mode)) | ||
| # The driver rejects NULL captureStatus_out, so we pass a | ||
| # stack-local even though begin_capture just succeeded and we | ||
| # only care about the resulting graph handle. | ||
| _get_capture_info(c_stream, &c_status, &c_graph) | ||
| self._h_graph = create_graph_handle(c_graph) |
There was a problem hiding this comment.
Partial-failure window: if cuStreamBeginCapture succeeds but the immediately-following _get_capture_info raises (or create_graph_handle throws under memory pressure), the stream is now mid-capture from the driver's POV but self._state is still CAPTURE_NOT_STARTED. __dealloc__'s guard _state == CAPTURING and _kind != FORKED then won't end the orphaned capture, and the stream stays poisoned for the rest of its lifetime.
This is new in this PR (the original begin_building was a single driver call, so no intermediate failure window existed). Two cheap fixes:
- Set
self._state = CAPTURINGbefore the_get_capture_infocall, so__dealloc__will clean up. - Or wrap the whole block in a
try:that callscuStreamEndCaptureon failure.
|
|
||
| cdef inline GraphBuilder _init_conditional(Stream stream, cydriver.CUgraph cond_graph, GraphBuilder parent): | ||
| cdef GraphBuilder gb = GraphBuilder.__new__(GraphBuilder) | ||
| gb._h_graph = create_graph_handle_ref(cond_graph, parent._h_graph) |
There was a problem hiding this comment.
This is correct for the common case where parent is a PRIMARY builder whose _h_graph actually owns the top-level CUgraph. But _init_forked (line 750) explicitly leaves _h_graph empty for FORKED builders ("captures to primary graph"), so if a user calls forked_builder.if_then(...) (or switch/while_loop), _cond_with_params passes the FORKED's _h_graph (null) here as the parent reference.
The result is a CONDITIONAL_BODY builder whose _h_graph is a create_graph_handle_ref to the body sub-graph with no parent pinning — the actual owning handle (the PRIMARY's _h_graph) is never grabbed. So while the physical conditional node is wired into the primary's CUgraph, the handle dependency chain is broken.
Original code didn't track parents either, so this is a pre-existing semantic gap rather than a regression. But the new structure makes it look like the dependency is being tracked, which is more misleading than no tracking. Either (a) walk up to the real owning handle (somehow — there's currently no link from FORKED back to its PRIMARY), or (b) add a comment here documenting that this ref can be null when the parent is FORKED.
| namespace { | ||
| struct GraphExecBox { | ||
| CUgraphExec resource; | ||
| }; | ||
| } // namespace |
There was a problem hiding this comment.
Design question, not a blocker: this is the only one of the graph-family boxes that holds no extra state — GraphBox tracks h_parent, GraphNodeBox tracks h_graph, but GraphExecBox has just the raw CUgraphExec. That's semantically correct today: after cuGraphInstantiateWithParams, the exec is independent of its source CUgraph, and cuGraphExecUpdate takes the new source at call time, so nothing needs to be held.
But the PR description calls this "groundwork for step 3 of #1330 (graph updates)". If that work ends up wanting to remember which GraphBuilder / CUgraph an exec was last updated from (for diagnostics, error messages, or to extend the source's lifetime through the exec), this Box is the natural place to grow a GraphHandle h_source_graph;. Worth either a one-line // TODO: noting that, or a confirmation that no such reference is planned and removing the asymmetry from any reviewer's mental model.
|
During the offline review with Leo we found a gap in the test coverage, see below. No need to address them in this PR, but we should track it after the PR is merged. The existing suite at
-- Leo's bot |
Summary
Convert
GraphBuilderandGraphfrom Python classes (using_MembersNeededForFinalize+weakref.finalize) to Cythoncdef classobjects backed by typed C++ resource handles.This does two things. First, it lays groundwork for step 3 of #1330 (graph updates) by giving graph objects the same handle-based ownership pattern as the rest of
cuda.core. Second, it clarifiesGraphBuilder's state machine: what used to be a tangle of implicit flags and conditional cleanup paths is now two orthogonal enums —_BuilderKind(PRIMARY/FORKED/CONDITIONAL_BODY) describing how the builder was created, and_CaptureState(CAPTURE_NOT_STARTED/CAPTURING/CAPTURE_ENDED) tracking the capture lifecycle. Methods can now check exactly the state they care about, illegal transitions are detectable, and__dealloc__has a single, well-defined condition for ending capture.Changes
GraphExecHandleto the resource-handle layer (_cpp/resource_handles.{hpp,cpp},_resource_handles.{pxd,pyx}), wrappingCUgraphExecwith acuGraphExecDestroy-based deleter run underGILReleaseGuard.GraphBuilderbecomes acdef classwith the explicit_BuilderKind/_CaptureStateenums described above. Live-API methods (begin_building,end_building,embed,split,join, etc.) move tonogilcydriverpaths where practical, and end-of-capture in__dealloc__runs against the cachedStreamHandlerather than reaching into a possibly-clearedStreamattribute.Graphbecomes acdef classholdingGraphExecHandle _h_graph_execdirectly;update/upload/launchmove tonogilcydriver.weakref.finalizeis gone.Device.create_graph_builderandStream.create_graph_buildercimportGraphBuilderand call its_initfactory; quoted forward-reference annotations are removed (clears Cython "Strings should no longer be used for type declarations" warnings).Related work