
Conversation

@rwgk (Collaborator) commented Sep 30, 2025

RuntimeError needs to be caught to resolve #951

While we're at it:

The rest of the changes are enhancements for clarity (specific error messages) and safety (ModuleNotFoundError instead of the less specific ImportError).
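For orientation, the shape of the fix is roughly the following (a minimal sketch, not the actual diff; the probe via nvjitlink.version() and the exact warning text are illustrative):

import warnings


def _decide_nvjitlink_or_driver():
    """Return True if the culink (driver) APIs should be used instead of nvJitLink."""
    try:
        from cuda.bindings import nvjitlink
        nvjitlink.version()  # probe that libnvJitLink.so* is loadable and new enough
    except ModuleNotFoundError:
        # More selective than ImportError: fires only when the module itself is
        # absent (e.g. an older cuda-bindings), not on arbitrary nested import errors.
        warnings.warn(
            "cuda.bindings.nvjitlink is not available, therefore the culink APIs "
            "will be used instead."
        )
        return True
    except RuntimeError as e:
        # The failure mode behind #951: the bindings import succeeds, but the
        # underlying libnvJitLink.so* cannot be located or loaded.
        warnings.warn(
            f"cuda.bindings.nvjitlink is not usable ({e}), therefore the culink "
            "APIs will be used instead."
        )
        return True
    return False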

copy-pr-bot bot commented Sep 30, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@rwgk (Collaborator, Author) commented Sep 30, 2025

/ok to test

…rtlib + ModuleNotFoundError (more selective than ImportError) and produce specific error messages
@rwgk force-pushed the linker_decide_nvjitlink_or_driver_fix branch from ee32f67 to 9ad6bcd on October 1, 2025 00:04
@rwgk (Collaborator, Author) commented Oct 1, 2025

/ok to test


@rwgk (Collaborator, Author) commented Oct 1, 2025

@leofang @mdboom for awareness

This PR passes CI testing, but before marking it as ready for review, I still want to do more local testing.

@mdboom (Contributor) left a comment

I like this solution, not just for fixing the specific bug, but for providing a more detailed explanation of what is missing.

I think we should add a unit test. In my earlier PR, I had written one that monkey-patches pathfinder to always raise an import error. With this solution, I think we would then want to call _decide_nvjitlink_or_driver() and make sure it returns True and doesn't raise. That new test worked fine locally when I ran python -m pytest cuda_bindings/tests, but failed in CI, which runs pytest cuda_bindings/tests (without python -m). It was somehow not loading the correct import paths in the subprocess (probably missing some of the magic that pytest does). So we'd have to work that out, but otherwise I think that test is a good approach to testing this kind of thing.

EDIT: Forgot to link to my unit test: https://github.com/NVIDIA/cuda-python/pull/1052/files#diff-d4ea6e71dd3f7e00e889a8d8736b221da91704799da0361af1c112bd23ffbb6d
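For concreteness, a minimal sketch of that kind of test (the module path cuda.core.experimental._linker and the patched attribute name are assumptions, not the code from #1052):

def test_decide_nvjitlink_or_driver_falls_back(monkeypatch):
    """If probing nvJitLink fails, fall back to culink (return True) instead of raising."""
    from cuda.core.experimental import _linker  # assumed module path

    def _always_fails(*args, **kwargs):
        # Mimic the failure mode from #951: the underlying library is missing.
        raise RuntimeError("libnvJitLink.so.12: cannot open shared object file")

    # "_probe_nvjitlink" is a stand-in name; point this at whatever hook the real
    # code uses to check nvJitLink availability.
    monkeypatch.setattr(_linker, "_probe_nvjitlink", _always_fails, raising=False)

    assert _linker._decide_nvjitlink_or_driver() is True  # and, importantly, no exception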

leofang previously approved these changes Oct 1, 2025

@leofang (Member) left a comment

Thanks, Ralf! I think your analysis is correct.

else:
    return False

warn(
Contributor:

We should always be asking ourselves: "Is this warning actionable by our users?"

If the answer is "no" or "not really", then there's no reason to spam people with "informative" warnings.

This one looks like it is not actionable.

Member:

It is actionable (and more so once we figure out how to address #851) because users can update cuda-bindings and/or nvJitLink.

Contributor:

Assuming it is actionable, the warning needs to contain the action that a user can take instead of just outputting the current state of their installation.

Collaborator Author:

It'd be much easier if we decided not to produce a warning at all. I was just following the established path.

We could make the errors actionable in this way (for the three situations):

  • cuda.bindings.nvjitlink is not available, therefore the culink APIs will be used instead. For best results, please upgrade to a more recent cuda-bindings version.

  • libnvJitLink.so* is not available. Therefore cuda.bindings.nvjitlink is not usable and the culink APIs will be used instead. For best results, please install a recent version of nvJitLink.

  • libnvJitLink.so* is too old (<12.3). Therefore cuda.bindings.nvjitlink is not usable and the culink APIs will be used instead. For best results, please install a recent version of nvJitLink.

@leofang which way should I go?

A: Simply remove the warnings (and explain in source code comments instead)?

B: Append the "For best results" sentences?

Contributor:

I'm now thinking, based on this discussion, that as a user of this I would 100% not care about this warning if my software is working. Why would I want to be warned to upgrade every time I run my program? We should avoid pestering our users with warnings.

Contributor:

One way I find useful to frame the utility of warnings is: if I were in a production on-call setting and I saw this warning, would I ignore it, or would I do something?

In that setting, there's tons of information that you have to use to act (usually quickly), and warnings that don't indicate "something bad is about to happen" are an enormous nuisance.

Collaborator Author:

Ack

I want to defer to @leofang for this, because he has significantly more relevant background.

Member:

The biggest difference is that the driver APIs do not support minor version compatibility or linking LTO IRs, unlike nvJitLink, which should cover the majority of the use cases. It is useful to warn as a hint for debugging potential errors raised in later operations due to driver limitations, because the driver error message is not super clear.

Collaborator Author:

Thanks! I worked "do not support minor version compatibility or linking LTO IRs" into the warning messages: commit 7e9fce4

I added two more commits: a tiny one (4724769) to make testing easier, and then the new tests, which are almost completely LLM-generated (6630d8e).
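For illustration, the resulting warning could look roughly like this (message text paraphrased from the discussion above, not copied from commit 7e9fce4):

from warnings import warn

warn(
    "libnvJitLink.so* is not available (or is older than 12.3), therefore "
    "cuda.bindings.nvjitlink is not usable and the culink APIs will be used instead. "
    "Note that the driver APIs do not support minor version compatibility or linking "
    "LTO IRs. For best results, please install a recent version of nvJitLink."
)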

@leofang leofang added bug Something isn't working P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Oct 1, 2025
@leofang leofang added this to the cuda.core beta 7 milestone Oct 1, 2025
rwgk added 5 commits October 1, 2025 15:14

  • This aids unit testing by allowing localized stubbing of the version-symbol check, without needing to patch the full inner nvjitlink module.

  • As generated by ChatGPT 5, with minor manual tweaks.
@rwgk (Collaborator, Author) commented Oct 2, 2025

/ok to test

@rwgk rwgk marked this pull request as ready for review October 2, 2025 06:11
copy-pr-bot bot commented Oct 2, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


@rwgk (Collaborator, Author) commented Oct 2, 2025

@leofang @kkraus14 This slightly tangential question keeps coming back to me: When can we declare that we don't support CTK <12.3 anymore?

Because at that point we could just rip out all the fallback (to culink) code completely.

@kkraus14 (Collaborator) commented Oct 2, 2025

> @leofang @kkraus14 This slightly tangential question keeps coming back to me: When can we declare that we don't support CTK <12.3 anymore?
>
> Because at that point we could just rip out all the fallback (to culink) code completely.

When we drop all of CUDA 12.x support, so quite a while away.

@leofang (Member) commented Oct 2, 2025

> > @leofang @kkraus14 This slightly tangential question keeps coming back to me: When can we declare that we don't support CTK <12.3 anymore?
> > Because at that point we could just rip out all the fallback (to culink) code completely.
>
> When we drop all of CUDA 12.x support, so quite a while away.

We already documented it: https://nvidia.github.io/cuda-python/cuda-bindings/latest/release/12.8.0-notes.html#highlights

@rwgk (Collaborator, Author) commented Oct 3, 2025

> I think we should add a unit test.

@mdboom there is a comprehensive new test now (LLM-generated).

I think this PR is ready for merging, but I'll hold off re-running the CI until there is an approval.

@leofang (Member) left a comment

Thanks, Ralf. The changes to _linker.py LGTM.

I have, however, mixed feelings about the new test module (test_linker_warning.py). We are testing a single function of only ~70 lines of code (_decide_nvjitlink_or_driver) with almost 200 lines of new test code of nontrivial complexity and limited readability (I find it hard to follow). I am not sure we want to add the test now. If there is no simpler solution, can we include only the bug fix and defer the test addition to another PR?

Comment on lines +37 to +47
# Ensure a clean sys.modules slate for our synthetic packages
for modname in list(sys.modules):
    if modname.startswith("cuda.bindings.nvjitlink") or modname == "cuda.bindings" or modname == "cuda":
        sys.modules.pop(modname, None)

yield

# Cleanup any stubs we added
for modname in list(sys.modules):
    if modname.startswith("cuda.bindings.nvjitlink") or modname == "cuda.bindings" or modname == "cuda":
        sys.modules.pop(modname, None)
Member:

Please let us not meddle with sys.modules. If this code is wrong for any reason, it'd be very hard to track down why the Python state becomes inconsistent.

It seems to me the whole test module is written with the assumption (e.g., _install_public_nvjitlink_stub below) that it is run standalone. The import side effects on other test modules are not considered.

Don't try to replace this with monkeypatch.{setitem,delitem}. I haven't tried it myself, but pytest's monkeypatch does not seem to be particularly good at scoped tests: pytest-dev/pytest#2229, pytest-dev/pytest#945.

Contributor:

Maybe it would be possible to run the tests in subprocesses, which would reduce the problem of side effects affecting other tests...
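A sketch of that approach (assuming the function lives at cuda.core.experimental._linker; not tied to the actual test code in this PR): run the import-sensitive check in a fresh interpreter so stubbed modules cannot leak into the parent pytest session.

import subprocess
import sys


def test_decide_nvjitlink_or_driver_in_clean_interpreter():
    # A child interpreter gets its own sys.modules, so any stubbing done there
    # cannot affect other test modules in this pytest session.
    code = (
        "from cuda.core.experimental._linker import _decide_nvjitlink_or_driver; "
        "print(_decide_nvjitlink_or_driver())"
    )
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, check=True
    )
    assert result.stdout.strip() in ("True", "False")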

Collaborator Author:

The root problem here is the code-to-be-tested. Given that @kkraus14 said we'll need it for a while (I'm assuming until CTK 14 is released), I don't want to spend effort on complex and unusual testing. With probably similar effort, the code could be refactored for better practices (no globals); then it would be easy to test without excessive mocking or subprocesses (which tend to be brittle).
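As one possible direction for such a refactor (purely illustrative, not what this PR does): if the decision function took its probe as a parameter, a test could inject a failing probe without touching globals or sys.modules.

from typing import Callable


def decide_nvjitlink_or_driver(probe_nvjitlink: Callable[[], None]) -> bool:
    """Return True to fall back to culink; the probe raises if nvJitLink is unusable."""
    try:
        probe_nvjitlink()
    except (ModuleNotFoundError, RuntimeError):
        return True
    return False


# In a test, no mocking machinery is needed:
def _broken_probe() -> None:
    raise RuntimeError("libnvJitLink.so* not found")


assert decide_nvjitlink_or_driver(_broken_probe) is True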

sys.modules.pop(modname, None)


def _install_public_nvjitlink_stub():
Member:

This test modifies sys.modules but does not ensure restoration.

Successfully merging this pull request may close these issues.

Library bindings are not importable if the underlying library is not installed