-
Notifications
You must be signed in to change notification settings - Fork 212
_decide_nvjitlink_or_driver()
bug fix and enhancements
#1055
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
/ok to test |
…rtlib + ModuleNotFoundError (more selective than ImportError) and produce specific error messages
ee32f67
to
9ad6bcd
Compare
…k._inspect_function_pointer()
/ok to test |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this solution, not just for fixing the specific bug, but for providing a more detailed explanation of what is missing.
I think we should add a unit test. In my earlier PR, I had written one that monkey patches pathfinder
to always return an import error. I think with this solution we would want to then call _decide_nvjitlink_or_driver()
and make sure it returns True
and doesn't raise. That new test worked fine locally when I did python -m pytest cuda_bindings/tests
, but failed when run in CI which does pytest cuda_bindings/tests
. It was somehow not loading the correct import paths in the subprocess (probably missing some of the magic that pytest does). So we'd have to work that out, but I think that test is otherwise a good approach to testing this kind of thing.
EDIT: Forgot to link to my unit test: https://github.com/NVIDIA/cuda-python/pull/1052/files#diff-d4ea6e71dd3f7e00e889a8d8736b221da91704799da0361af1c112bd23ffbb6d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Ralf! I think your analysis is correct.
else: | ||
return False | ||
|
||
warn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should always be asking ourselves: "Is this warning actionable by our users?"
If the answer is "no" or "not really", then there's no reason to spam people with "informative" warnings.
This one looks like it is not actionable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is actionable (and more so once we figure out how to address #851) because users can update cuda-bindings and/or nvJitLink.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming it is actionable, the warning needs to contain the action that a user can take instead of just outputting the current state of their installation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be much easier if we don't actually want to produce a warning. I was just following the established path.
We could make the errors actionable in this way (for the three situations):
-
cuda.bindings.nvjitlink is not available, therefore the culink APIs will be used instead. For best results, please upgrade to a more recent cuda-bindings version.
-
libnvJitLink.so* is not available. Therefore cuda.bindings.nvjitlink is not usable and the culink APIs will be used instead. For best results, please install a recent version of nvJitLink.
-
libnvJitLink.so* is too old (<12.3). Therefore cuda.bindings.nvjitlink is not usable and the culink APIs will be used instead. For best results, please install a recent version of nvJitLink.
@leofang which way should I go?
A: Simply remove the warnings (and explain in source code comments instead)?
B: Append the "For best results" sentences?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now thinking based on this discussion that as a user of this I would 100% not care about this warning if my software is working. Why would I want to be warned to upgrade every time I run my program? We should avoid pestering our users with warnings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way I find useful to frame the utility of warnings is: if I were in a production on call setting and I saw this warning, would I ignore it, or would I do something?
In that setting, there's tons of information that you have to use to act (usually quickly) and warnings that don't indicate "something bad is about to happen" are an enormous nuisance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
I want to defer to @leofang for this, because he has significantly more relevant background.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The biggest difference is that the driver APIs do not support minor version compatibility or link LTO IRs, unlike nvJitLink which should be the majority of the use cases. It is useful to warn as a hint for debugging potential errors that are raised in later operations due to driver limitations, because the driver error msg is not super clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…s also raise ModuleNotFoundError)
This aids unit testing by allowing localized stubbing of the version-symbol check, without needing to patch the full inner nvjitlink module.
As generated by ChatGPT 5, with minor manual tweaks.
/ok to test |
Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
We already documented it: https://nvidia.github.io/cuda-python/cuda-bindings/latest/release/12.8.0-notes.html#highlights |
@mdboom there is a comprehensive new test now (LLM-generated). I think this PR is ready for merging, but I'll hold off re-running the CI until there is an approval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Ralf. The changes to _linker.py
LGTM.
I have, however, a mixed feeling about the new test module (test_linker_warning.py
). We are testing a single function with only ~70 lines of code (_decide_nvjitlink_or_driver
) with almost 200 lines of new test code with nontrivial complexity and readability (I find it hard to follow). I am not sure if we want the test to be added now. If there is no simpler solution, can we only include the bug fix and defer the test addition to another PR?
# Ensure a clean sys.modules slate for our synthetic packages | ||
for modname in list(sys.modules): | ||
if modname.startswith("cuda.bindings.nvjitlink") or modname == "cuda.bindings" or modname == "cuda": | ||
sys.modules.pop(modname, None) | ||
|
||
yield | ||
|
||
# Cleanup any stubs we added | ||
for modname in list(sys.modules): | ||
if modname.startswith("cuda.bindings.nvjitlink") or modname == "cuda.bindings" or modname == "cuda": | ||
sys.modules.pop(modname, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please let us not meddle with sys.modules
. If this code is wrong for any reason, it'd be very hard to track down why the Python state becomes inconsistent.
It seems to me the whole test module is written in mind (ex: _install_public_nvjitlink_stub
below) that it is run standalone. The import side effects to other test modules are not considered.
Don't try to replace this with monkeypath.{setitem,delitem}
. I haven't tried it myself but pytest.monkeypatch
does not seem to be particularly good at scoped tests: pytest-dev/pytest#2229, pytest-dev/pytest#945.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be possible to run the tests in subprocesses, which would reduce the problem of side effects affecting other tests...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The root problem here is the code-to-be-tested. Given that @kkraus14 said we'll need it for a while (I'm assuming until CTK 14 is released), I don't want to spend effort on complex and unusual testing. Probably with a similar effort, the code can be refactored for better practices (no globals), then it'll be easy to test without excessive mocking or setting up subprocesses (which tends to be brittle).
sys.modules.pop(modname, None) | ||
|
||
|
||
def _install_public_nvjitlink_stub(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test modifies sys.modules
but does not ensure restoration.
RuntimeError
needs to be caught to resolve #951While we're at it:
The rest of the changes are enhancements for clarity (specific error messages) and safety (
ModuleNotFoundError
instead of the less specificImportError
).