
Conversation

@rwgk (Collaborator) commented Sep 30, 2025

RuntimeError needs to be caught to resolve #951

While we're at it:

The rest of the changes are enhancements for clarity (specific error messages) and safety (ModuleNotFoundError instead of the less specific ImportError).
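For orientation, the shape of the fix is roughly the following (a minimal sketch, not the actual diff; the probe via nvjitlink.version() and the exact warning text are illustrative):

import warnings


def _decide_nvjitlink_or_driver():
    """Return True if the culink (driver) APIs should be used instead of nvJitLink."""
    try:
        from cuda.bindings import nvjitlink
        nvjitlink.version()  # probe that libnvJitLink.so* is loadable and new enough
    except ModuleNotFoundError:
        # More selective than ImportError: fires only when the module itself is
        # absent (e.g. an older cuda-bindings), not on arbitrary nested import errors.
        warnings.warn(
            "cuda.bindings.nvjitlink is not available, therefore the culink APIs "
            "will be used instead."
        )
        return True
    except RuntimeError as e:
        # The failure mode behind #951: the bindings import succeeds, but the
        # underlying libnvJitLink.so* cannot be located or loaded.
        warnings.warn(
            f"cuda.bindings.nvjitlink is not usable ({e}), therefore the culink "
            "APIs will be used instead."
        )
        return True
    return False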

copy-pr-bot bot commented Sep 30, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@rwgk (Collaborator, Author) commented Sep 30, 2025

/ok to test

…rtlib + ModuleNotFoundError (more selective than ImportError) and produce specific error messages
@rwgk force-pushed the linker_decide_nvjitlink_or_driver_fix branch from ee32f67 to 9ad6bcd on October 1, 2025 00:04
@rwgk (Collaborator, Author) commented Oct 1, 2025

/ok to test


@rwgk (Collaborator, Author) commented Oct 1, 2025

@leofang @mdboom for awareness

This PR passes CI testing, but before marking it as ready for review, I still want to do more local testing.

@mdboom (Contributor) left a comment

I like this solution, not just for fixing the specific bug, but for providing a more detailed explanation of what is missing.

I think we should add a unit test. In my earlier PR, I had written one that monkey-patches pathfinder to always raise an import error. With this solution, I think we would then want to call _decide_nvjitlink_or_driver() and make sure it returns True and doesn't raise. That new test worked fine locally when I ran python -m pytest cuda_bindings/tests, but failed in CI, which runs pytest cuda_bindings/tests (without python -m). It was somehow not loading the correct import paths in the subprocess (probably missing some of the magic that pytest does). So we'd have to work that out, but otherwise I think that test is a good approach to testing this kind of thing.

EDIT: Forgot to link to my unit test: https://github.com/NVIDIA/cuda-python/pull/1052/files#diff-d4ea6e71dd3f7e00e889a8d8736b221da91704799da0361af1c112bd23ffbb6d
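For concreteness, a minimal sketch of that kind of test (the module path cuda.core.experimental._linker and the patched attribute name are assumptions, not the code from #1052):

def test_decide_nvjitlink_or_driver_falls_back(monkeypatch):
    """If probing nvJitLink fails, fall back to culink (return True) instead of raising."""
    from cuda.core.experimental import _linker  # assumed module path

    def _always_fails(*args, **kwargs):
        # Mimic the failure mode from #951: the underlying library is missing.
        raise RuntimeError("libnvJitLink.so.12: cannot open shared object file")

    # "_probe_nvjitlink" is a stand-in name; point this at whatever hook the real
    # code uses to check nvJitLink availability.
    monkeypatch.setattr(_linker, "_probe_nvjitlink", _always_fails, raising=False)

    assert _linker._decide_nvjitlink_or_driver() is True  # and, importantly, no exception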

leofang previously approved these changes Oct 1, 2025

@leofang (Member) left a comment

Thanks, Ralf! I think your analysis is correct.

else:
    return False

warn(
Contributor:

We should always be asking ourselves: "Is this warning actionable by our users?"

If the answer is "no" or "not really", then there's no reason to spam people with "informative" warnings.

This one looks like it is not actionable.

Member:

It is actionable (and more so once we figure out how to address #851) because users can update cuda-bindings and/or nvJitLink.

Contributor:

Assuming it is actionable, the warning needs to contain the action that a user can take instead of just outputting the current state of their installation.

Collaborator Author:

It'd be much easier if we decided not to produce a warning at all. I was just following the established path.

We could make the errors actionable in this way (for the three situations):

  • cuda.bindings.nvjitlink is not available, therefore the culink APIs will be used instead. For best results, please upgrade to a more recent cuda-bindings version.

  • libnvJitLink.so* is not available. Therefore cuda.bindings.nvjitlink is not usable and the culink APIs will be used instead. For best results, please install a recent version of nvJitLink.

  • libnvJitLink.so* is too old (<12.3). Therefore cuda.bindings.nvjitlink is not usable and the culink APIs will be used instead. For best results, please install a recent version of nvJitLink.

@leofang which way should I go?

A: Simply remove the warnings (and explain in source code comments instead)?

B: Append the "For best results" sentences?

Contributor:

I'm now thinking, based on this discussion, that as a user of this I would 100% not care about this warning if my software is working. Why would I want to be warned to upgrade every time I run my program? We should avoid pestering our users with warnings.

Contributor:

One way I find useful to frame the utility of warnings is: if I were in a production on-call setting and I saw this warning, would I ignore it, or would I do something?

In that setting, there's tons of information that you have to use to act (usually quickly), and warnings that don't indicate "something bad is about to happen" are an enormous nuisance.

Collaborator Author:

Ack

I want to defer to @leofang for this, because he has significantly more relevant background.

Member:

The biggest difference is that the driver APIs do not support minor version compatibility or linking LTO IRs, unlike nvJitLink, which should cover the majority of the use cases. It is useful to warn as a hint for debugging potential errors raised in later operations due to driver limitations, because the driver error message is not super clear.

Collaborator Author:

Thanks! I worked "do not support minor version compatibility or linking LTO IRs" into the warning messages: commit 7e9fce4

I added two more commits: a tiny one (4724769) to make testing easier, and then the new tests, which are almost completely LLM-generated (6630d8e).
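For illustration, the resulting warning could look roughly like this (message text paraphrased from the discussion above, not copied from commit 7e9fce4):

from warnings import warn

warn(
    "libnvJitLink.so* is not available (or is older than 12.3), therefore "
    "cuda.bindings.nvjitlink is not usable and the culink APIs will be used instead. "
    "Note that the driver APIs do not support minor version compatibility or linking "
    "LTO IRs. For best results, please install a recent version of nvJitLink."
)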

@leofang leofang added bug Something isn't working P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Oct 1, 2025
@leofang leofang added this to the cuda.core beta 7 milestone Oct 1, 2025
rwgk added 5 commits October 1, 2025 15:14

  • This aids unit testing by allowing localized stubbing of the version-symbol check, without needing to patch the full inner nvjitlink module.

  • As generated by ChatGPT 5, with minor manual tweaks.
@rwgk (Collaborator, Author) commented Oct 2, 2025

/ok to test

@rwgk rwgk marked this pull request as ready for review October 2, 2025 06:11
copy-pr-bot bot commented Oct 2, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


@rwgk (Collaborator, Author) commented Oct 2, 2025

@leofang @kkraus14 This slightly tangential question keeps coming back to me: When can we declare that we don't support CTK <12.3 anymore?

Because at that point we could just rip out all the fallback (to culink) code completely.

@kkraus14 (Collaborator) commented Oct 2, 2025

> @leofang @kkraus14 This slightly tangential question keeps coming back to me: When can we declare that we don't support CTK <12.3 anymore?
>
> Because at that point we could just rip out all the fallback (to culink) code completely.

When we drop all of CUDA 12.x support, so quite a while away.

@leofang (Member) commented Oct 2, 2025

> > @leofang @kkraus14 This slightly tangential question keeps coming back to me: When can we declare that we don't support CTK <12.3 anymore?
> > Because at that point we could just rip out all the fallback (to culink) code completely.
>
> When we drop all of CUDA 12.x support, so quite a while away.

We already documented it: https://nvidia.github.io/cuda-python/cuda-bindings/latest/release/12.8.0-notes.html#highlights

@rwgk (Collaborator, Author) commented Oct 3, 2025

> I think we should add a unit test.

@mdboom there is a comprehensive new test now (LLM-generated).

I think this PR is ready for merging, but I'll hold off re-running the CI until there is an approval.

@leofang (Member) left a comment

Thanks, Ralf. The changes to _linker.py LGTM.

I have, however, mixed feelings about the new test module (test_linker_warning.py). We are testing a single function of only ~70 lines of code (_decide_nvjitlink_or_driver) with almost 200 lines of new test code of nontrivial complexity and limited readability (I find it hard to follow). I am not sure we want to add the test now. If there is no simpler solution, can we include only the bug fix and defer the test addition to another PR?

Comment on lines +37 to +47
# Ensure a clean sys.modules slate for our synthetic packages
for modname in list(sys.modules):
    if modname.startswith("cuda.bindings.nvjitlink") or modname == "cuda.bindings" or modname == "cuda":
        sys.modules.pop(modname, None)

yield

# Cleanup any stubs we added
for modname in list(sys.modules):
    if modname.startswith("cuda.bindings.nvjitlink") or modname == "cuda.bindings" or modname == "cuda":
        sys.modules.pop(modname, None)
Member:

Please let us not meddle with sys.modules. If this code is wrong for any reason, it'd be very hard to track down why the Python state becomes inconsistent.

It seems to me the whole test module is written with the assumption (e.g., _install_public_nvjitlink_stub below) that it is run standalone. The import side effects on other test modules are not considered.

Don't try to replace this with monkeypatch.{setitem,delitem}. I haven't tried it myself, but pytest's monkeypatch does not seem to be particularly good at scoped tests: pytest-dev/pytest#2229, pytest-dev/pytest#945.

Contributor:

Maybe it would be possible to run the tests in subprocesses, which would reduce the problem of side effects affecting other tests...
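A sketch of that approach (assuming the function lives at cuda.core.experimental._linker; not tied to the actual test code in this PR): run the import-sensitive check in a fresh interpreter so stubbed modules cannot leak into the parent pytest session.

import subprocess
import sys


def test_decide_nvjitlink_or_driver_in_clean_interpreter():
    # A child interpreter gets its own sys.modules, so any stubbing done there
    # cannot affect other test modules in this pytest session.
    code = (
        "from cuda.core.experimental._linker import _decide_nvjitlink_or_driver; "
        "print(_decide_nvjitlink_or_driver())"
    )
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, check=True
    )
    assert result.stdout.strip() in ("True", "False")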

Collaborator Author:

The root problem here is the code-to-be-tested. Given that @kkraus14 said we'll need it for a while (I'm assuming until CTK 14 is released), I don't want to spend effort on complex and unusual testing. With probably similar effort, the code could be refactored for better practices (no globals); then it would be easy to test without excessive mocking or subprocesses (which tend to be brittle).
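As one possible direction for such a refactor (purely illustrative, not what this PR does): if the decision function took its probe as a parameter, a test could inject a failing probe without touching globals or sys.modules.

from typing import Callable


def decide_nvjitlink_or_driver(probe_nvjitlink: Callable[[], None]) -> bool:
    """Return True to fall back to culink; the probe raises if nvJitLink is unusable."""
    try:
        probe_nvjitlink()
    except (ModuleNotFoundError, RuntimeError):
        return True
    return False


# In a test, no mocking machinery is needed:
def _broken_probe() -> None:
    raise RuntimeError("libnvJitLink.so* not found")


assert decide_nvjitlink_or_driver(_broken_probe) is True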

sys.modules.pop(modname, None)


def _install_public_nvjitlink_stub():
Member:

This test modifies sys.modules but does not ensure restoration.

Successfully merging this pull request may close these issues.

Library bindings are not importable if the underlying library is not installed