Skip to content

cuda.core: Cythonize GraphBuilder and Graph with handle-layer cleanup#2008

Open
Andy-Jost wants to merge 8 commits into
NVIDIA:mainfrom
Andy-Jost:graph-builder-refactor
Open

cuda.core: Cythonize GraphBuilder and Graph with handle-layer cleanup#2008
Andy-Jost wants to merge 8 commits into
NVIDIA:mainfrom
Andy-Jost:graph-builder-refactor

Conversation

@Andy-Jost

@Andy-Jost Andy-Jost commented May 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Convert GraphBuilder and Graph from Python classes (using _MembersNeededForFinalize + weakref.finalize) to Cython cdef class objects backed by typed C++ resource handles.

This does two things. First, it lays groundwork for step 3 of #1330 (graph updates) by giving graph objects the same handle-based ownership pattern as the rest of cuda.core. Second, it clarifies GraphBuilder's state machine: what used to be a tangle of implicit flags and conditional cleanup paths is now two orthogonal enums — _BuilderKind (PRIMARY/FORKED/CONDITIONAL_BODY) describing how the builder was created, and _CaptureState (CAPTURE_NOT_STARTED/CAPTURING/CAPTURE_ENDED) tracking the capture lifecycle. Methods can now check exactly the state they care about, illegal transitions are detectable, and __dealloc__ has a single, well-defined condition for ending capture.

Changes

  • Add GraphExecHandle to the resource-handle layer (_cpp/resource_handles.{hpp,cpp}, _resource_handles.{pxd,pyx}), wrapping CUgraphExec with a cuGraphExecDestroy-based deleter run under GILReleaseGuard.
  • GraphBuilder becomes a cdef class with the explicit _BuilderKind/_CaptureState enums described above. Live-API methods (begin_building, end_building, embed, split, join, etc.) move to nogil cydriver paths where practical, and end-of-capture in __dealloc__ runs against the cached StreamHandle rather than reaching into a possibly-cleared Stream attribute.
  • Graph becomes a cdef class holding GraphExecHandle _h_graph_exec directly; update/upload/launch move to nogil cydriver. weakref.finalize is gone.
  • Device.create_graph_builder and Stream.create_graph_builder cimport GraphBuilder and call its _init factory; quoted forward-reference annotations are removed (clears Cython "Strings should no longer be used for type declarations" warnings).

Related work

Andy-Jost added 3 commits May 1, 2026 14:52
…hine

Refactor GraphBuilder from a Python class using _MembersNeededForFinalize
to a cdef class with explicit _BuilderKind (PRIMARY/FORKED/CONDITIONAL_BODY)
and _CaptureState (NOT_STARTED/CAPTURING/ENDED) tracking. Cleanup moves
into __dealloc__/close, and the builder now uses GraphHandle/StreamHandle
from _resource_handles instead of holding raw driver objects. Drop the
is_stream_owner flag now that StreamHandle owns the lifetime.

End-capture paths in __dealloc__ and close guard on _h_stream so cleanup
is safe even if _init* fails before completing assignment.

Made-with: Cursor
Add a GraphExecHandle to the resource-handle layer (parallel to
GraphHandle) wrapping CUgraphExec with RAII cleanup via
cuGraphExecDestroy on shared_ptr release. Convert Graph from a Python
class using _MembersNeededForFinalize to a cdef class holding a typed
_h_graph_exec attribute, dropping the weakref.finalize machinery.
update/upload/launch move to nogil cydriver paths consistent with the
GraphBuilder rewrite.

Also drop quoted forward-reference annotations on create_graph_builder
and _instantiate_graph/complete now that GraphBuilder is cimported in
_device.pyx and _stream.pyx and Cython accepts the in-module forward
reference to Graph. Clears the related "Strings should no longer be
used for type declarations" warnings.

Made-with: Cursor
The cdef-class member declarations live in the .pxd, so the .pyx does
not need to re-cimport GraphExecHandle, GraphHandle, or StreamHandle.

Made-with: Cursor
@Andy-Jost Andy-Jost added this to the cuda.core v1.0.0 milestone May 1, 2026
@Andy-Jost Andy-Jost added enhancement Any code-related improvements P1 Medium priority - Should do cuda.core Everything related to the cuda.core module labels May 1, 2026
@Andy-Jost Andy-Jost self-assigned this May 1, 2026
… cycle

cimport-ing GraphBuilder at the top of _stream.pyx and _device.pyx made
Cython emit a Python-level import of cuda.core.graph._graph_builder
during _stream module init. That triggered the chain
graph -> _graph_node -> _kernel_arg_handler -> _memory._buffer
-> _device, which then re-entered the still-initializing _stream module
via "from cuda.core._stream import IsStreamT", failing with
ImportError: cannot import name IsStreamT.

Restore the original lazy "import GraphBuilder" inside
create_graph_builder (Stream and Device) and Stream_accept. The return
annotations stay as bare names; "from __future__ import annotations" in
both files defers their evaluation, so they need not resolve at
function-definition time.

Made-with: Cursor
@github-actions

github-actions Bot commented May 1, 2026

Copy link
Copy Markdown

@Andy-Jost Andy-Jost marked this pull request as draft May 4, 2026 17:12
@copy-pr-bot

copy-pr-bot Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@Andy-Jost

Copy link
Copy Markdown
Contributor Author

/ok to test

The previous import-cycle fix changed _stream/_device.create_graph_builder
to a lazy Python "import GraphBuilder" instead of a module-level cimport.
With _init declared as @staticmethod cdef, Python attribute lookup
cannot find it, so every test that builds a graph failed with
"AttributeError: type object 'GraphBuilder' has no attribute '_init'"
at _device.pyx:1376 / _stream.pyx:376.

Convert _init from @staticmethod cdef to @staticmethod def (matches the
Stream._init pattern) and drop the cdef declaration from the .pxd.
_init runs once per builder creation, so the loss of cdef-level
dispatch is irrelevant. Graph._init stays cdef; it is only called
intra-module.

Made-with: Cursor
@Andy-Jost

Copy link
Copy Markdown
Contributor Author

/ok to test

Andy-Jost and others added 2 commits May 5, 2026 10:35
Every graph-builder test failed with CUDA_ERROR_INVALID_VALUE on the
new ``GraphBuilder.begin_building`` path. The driver rejects
``cuStreamGetCaptureInfo`` when ``captureStatus_out`` is NULL, but the
new ``_get_capture_info`` helper accepted a NULL status pointer and
``begin_building`` was calling it that way (it just wanted the freshly
captured graph handle and assumed the status was implied by the
preceding ``cuStreamBeginCapture``).

Pass a stack-local ``CUstreamCaptureStatus`` and document the helper's
requirement that ``status`` be non-NULL. ``graph`` is still allowed to
be NULL (``is_building`` calls it that way and the driver accepts it).

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost

Copy link
Copy Markdown
Contributor Author

/ok to test

@Andy-Jost Andy-Jost requested a review from leofang May 5, 2026 17:55
@Andy-Jost Andy-Jost marked this pull request as ready for review May 5, 2026 17:55
Andy-Jost added a commit to Andy-Jost/cuda-python that referenced this pull request May 5, 2026
Completes step 3 of NVIDIA#1330 by exposing the captured graph as an explicit
`GraphDefinition` view that shares ownership of the underlying `CUgraph`.
The handle-layer plumbing landed in PR NVIDIA#2008; this commit wires up the
user-facing surface and locks in the state-guard rules.

State semantics:

- PRIMARY builder: only valid after `end_building()`. Before
  `begin_building()` no graph exists; during capture the driver is the
  sole writer, so explicit access is unsafe.
- CONDITIONAL_BODY builder: valid both before `begin_building()` (the
  body graph is allocated at conditional-node creation time) and after
  `end_building()`. This enables a hybrid flow where a conditional body
  is populated entirely via the explicit API, with no capture at all.
- FORKED builder: never valid. Forked builders share the primary's
  graph; access through the primary instead.

Tests cover the happy path, both hybrid flows on conditional bodies
(populate-via-explicit-API and capture-then-augment), the three error
states (forked, capturing, primary pre-capture), and the
shared-ownership guarantee (the `GraphDefinition` survives the
builder's `close()`).

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost Andy-Jost force-pushed the graph-builder-refactor branch from f0b2e78 to 3c6c387 Compare May 5, 2026 19:15
Andy-Jost added a commit to Andy-Jost/cuda-python that referenced this pull request May 5, 2026
Completes step 3 of NVIDIA#1330 by exposing the captured graph as an explicit
`GraphDefinition` view that shares ownership of the underlying `CUgraph`.
The handle-layer plumbing landed in PR NVIDIA#2008; this commit wires up the
user-facing surface and locks in the state-guard rules.

State semantics:

- PRIMARY builder: only valid after `end_building()`. Before
  `begin_building()` no graph exists; during capture the driver is the
  sole writer, so explicit access is unsafe.
- CONDITIONAL_BODY builder: valid both before `begin_building()` (the
  body graph is allocated at conditional-node creation time) and after
  `end_building()`. This enables a hybrid flow where a conditional body
  is populated entirely via the explicit API, with no capture at all.
- FORKED builder: never valid. Forked builders share the primary's
  graph; access through the primary instead.

Tests cover the happy path, both hybrid flows on conditional bodies
(populate-via-explicit-API and capture-then-augment), the three error
states (forked, capturing, primary pre-capture), and the
shared-ownership guarantee (the `GraphDefinition` survives the
builder's `close()`).

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost Andy-Jost requested a review from rwgk May 28, 2026 16:47
Comment on lines +248 to +250
if self._h_stream and self._state == CAPTURING and self._kind != FORKED:
with nogil:
cydriver.cuStreamEndCapture(as_cu(self._h_stream), NULL)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Shouldn't we set _state and _kind to a guarding value to prevent the destructor from being invoked again?
  2. If a builder owns a stream, shouldn't we free the stream too (which is the old behavior)?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Is this non-public behavior that cuStreamEndCapture accepts a nullptr for CUgraph*? I don't think it's documented.

Comment on lines +262 to +270
def close(self):
"""Destroy the graph builder."""
if self._h_stream and self._state == CAPTURING and self._kind != FORKED:
with nogil:
HANDLE_RETURN(cydriver.cuStreamEndCapture(as_cu(self._h_stream), NULL))
self._h_graph.reset()
self._h_stream.reset()
self._state = CAPTURE_ENDED
self._stream = None

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's share the same destructor for both close and __dealloc__ like what we do elsewhere in the codebase, instead of repeating.

@leofang leofang left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more observations beyond the inline thread already in flight — see the new inline comments. The two I'd actually want answers to before merge are (a) the FORKED-parent / conditional-body handle dependency on L762, and (b) the partial-failure window in begin_building on L322–328. The rest are minor.

-- Leo's bot

graph = Graph._init(handle_return(driver.cuGraphInstantiateWithParams(h_graph, params)))
py_exec = handle_return(driver.cuGraphInstantiateWithParams(h_graph, params))
c_exec = <cydriver.CUgraphExec><intptr_t>int(py_exec)
graph = Graph._init(c_exec)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing pattern (the original wrapped before checking too), but worth fixing while we're here: if params.result_out indicates an instantiate error, c_exec may be NULL or otherwise invalid — yet we've already wrapped it in Graph._init, whose new RAII deleter will call cuGraphExecDestroy on it during the exception unwind that follows. Suggest checking result_out first and only constructing the Graph on SUCCESS.


def __init__(self):
raise NotImplementedError(
"directly creating a Graph object can be ambiguous. Please either "

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this is GraphBuilder.__init__, so the message should say "directly creating a GraphBuilder object" — currently mirrors the wording from the old class, but it's confusing now that there's a distinct Graph class right below (line 830) with its own message saying "directly constructing a Graph instance".

HANDLE_RETURN(cydriver.cuStreamEndCapture(as_cu(self._h_stream), NULL))
self._h_graph.reset()
self._h_stream.reset()
self._state = CAPTURE_ENDED

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

close() unconditionally lands every kind of builder in _state = CAPTURE_ENDED with _h_graph reset — including FORKED, which never legitimately reaches "ended" via the normal end_building path. After forked.close() (or join(), which calls close() on the non-root builders), a subsequent some_graph.update(forked_builder) then passes the _state != CAPTURE_ENDED check in Graph.update (line ~870) and dereferences a null _h_graph into cuGraphExecUpdate. Two options:

  1. Introduce a distinct CLOSED (or INVALIDATED) state value so Graph.update / complete / debug_dot_print can reject closed builders explicitly.
  2. Or guard in Graph.update's GraphBuilder branch on source._h_graph being non-null with a clearer error than the raw CUDA failure.

Related to leofang's guarding-state question on L250.

Comment on lines +322 to +328
with nogil:
HANDLE_RETURN(cydriver.cuStreamBeginCapture(c_stream, c_mode))
# The driver rejects NULL captureStatus_out, so we pass a
# stack-local even though begin_capture just succeeded and we
# only care about the resulting graph handle.
_get_capture_info(c_stream, &c_status, &c_graph)
self._h_graph = create_graph_handle(c_graph)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial-failure window: if cuStreamBeginCapture succeeds but the immediately-following _get_capture_info raises (or create_graph_handle throws under memory pressure), the stream is now mid-capture from the driver's POV but self._state is still CAPTURE_NOT_STARTED. __dealloc__'s guard _state == CAPTURING and _kind != FORKED then won't end the orphaned capture, and the stream stays poisoned for the rest of its lifetime.

This is new in this PR (the original begin_building was a single driver call, so no intermediate failure window existed). Two cheap fixes:

  • Set self._state = CAPTURING before the _get_capture_info call, so __dealloc__ will clean up.
  • Or wrap the whole block in a try: that calls cuStreamEndCapture on failure.


cdef inline GraphBuilder _init_conditional(Stream stream, cydriver.CUgraph cond_graph, GraphBuilder parent):
cdef GraphBuilder gb = GraphBuilder.__new__(GraphBuilder)
gb._h_graph = create_graph_handle_ref(cond_graph, parent._h_graph)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct for the common case where parent is a PRIMARY builder whose _h_graph actually owns the top-level CUgraph. But _init_forked (line 750) explicitly leaves _h_graph empty for FORKED builders ("captures to primary graph"), so if a user calls forked_builder.if_then(...) (or switch/while_loop), _cond_with_params passes the FORKED's _h_graph (null) here as the parent reference.

The result is a CONDITIONAL_BODY builder whose _h_graph is a create_graph_handle_ref to the body sub-graph with no parent pinning — the actual owning handle (the PRIMARY's _h_graph) is never grabbed. So while the physical conditional node is wired into the primary's CUgraph, the handle dependency chain is broken.

Original code didn't track parents either, so this is a pre-existing semantic gap rather than a regression. But the new structure makes it look like the dependency is being tracked, which is more misleading than no tracking. Either (a) walk up to the real owning handle (somehow — there's currently no link from FORKED back to its PRIMARY), or (b) add a comment here documenting that this ref can be null when the parent is FORKED.

Comment on lines +960 to +964
namespace {
struct GraphExecBox {
CUgraphExec resource;
};
} // namespace

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design question, not a blocker: this is the only one of the graph-family boxes that holds no extra state — GraphBox tracks h_parent, GraphNodeBox tracks h_graph, but GraphExecBox has just the raw CUgraphExec. That's semantically correct today: after cuGraphInstantiateWithParams, the exec is independent of its source CUgraph, and cuGraphExecUpdate takes the new source at call time, so nothing needs to be held.

But the PR description calls this "groundwork for step 3 of #1330 (graph updates)". If that work ends up wanting to remember which GraphBuilder / CUgraph an exec was last updated from (for diagnostics, error messages, or to extend the source's lifetime through the exec), this Box is the natural place to grow a GraphHandle h_source_graph;. Worth either a one-line // TODO: noting that, or a confirmation that no such reference is planned and removing the asymmetry from any reviewer's mental model.

@leofang

leofang commented Jun 8, 2026

Copy link
Copy Markdown
Member

During the offline review with Leo we found a gap in the test coverage, see below. No need to address them in this PR, but we should track it after the PR is merged.

The existing suite at cuda_core/tests/graph/ covers two of the illegal state transitions (test_graph_repeat_capture, test_graph_capture_errors). The following scenarios — most introduced or behaviorally changed by this PR — are not exercised today:

  1. begin_building() called twice in a row. Pre-PR this fell through to a cryptic CUDA error from cuStreamBeginCapture; post-PR the new _state == CAPTURING guard converts it to a clear RuntimeError("Graph builder is already building."). This is the headline case for the "illegal transitions are detectable" claim in the PR description and is currently unverified.

  2. complete() on a FORKED builder after it has been close()d or join()ed. close() unconditionally sets _state = CAPTURE_ENDED and resets _h_graph, so complete() passes the state guard and dereferences a null graph handle (see inline comment on L269).

  3. if_then / switch / while_loop invoked on a FORKED builder. _init_conditional stores create_graph_handle_ref(cond_graph, parent._h_graph), but parent._h_graph is null for FORKED, so the resulting CONDITIONAL_BODY's keep-alive chain to the actual owning primary handle is broken (see inline comment on L762).

  4. GC of a builder mid-capture (no close() / end_building() called). This is the path that exercises cuStreamEndCapture(stream, NULL) (the L250 concern about the undocumented NULL phGraph behavior) and the partial-failure window in begin_building between cuStreamBeginCapture and _get_capture_info (inline comment on L328). __dealloc__ swallows all driver return codes, so a regression here would be silent.

  5. Graph.update(source) after source.close(). Same root cause as Docs #2_state is CAPTURE_ENDED post-close, the type-check passes, and cuGraphExecUpdate is called with a null _h_graph.

  6. Nested conditionals (if_then inside a CONDITIONAL_BODY body). This is the only path that exercises the multi-level create_graph_handle_ref chain introduced by this PR (nested-body → outer-body → primary). Expected to work, but untested.

  7. embed(non_GraphBuilder). The new def embed(self, GraphBuilder child) Cython signature raises TypeError immediately on a non-GraphBuilder argument; pre-PR this AttributeErrored on the first attribute access. Minor error-class change worth pinning down.

-- Leo's bot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda.core Everything related to the cuda.core module enhancement Any code-related improvements P1 Medium priority - Should do

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants