CUDA_ERROR_SYSTEM_DRIVER_MISMATCH #2170
Replies: 3 comments
-
Looking at #126, I can see that upgrades have been made to use pre-existing drivers on the host. The GPU Operator install guide says the following: What if we specify
-
Running NVIDIA GPU commands with LD_DEBUG shows that the user-mode driver library libcuda.so.560.x is loaded from /usr/local/cuda/compat, the forward-compatibility default shipped by NVIDIA. But the library matching the installed driver (libcuda.so.570.133.20) lives in /usr/lib/cuda/lib64. This mismatch is what produces the CUDA_ERROR_SYSTEM_DRIVER_MISMATCH error. Root cause: when building the Triton Inference Server image, NVIDIA sets the LD_LIBRARY_PATH environment variable so that /usr/local/cuda/compat is searched before /usr/lib/cuda/lib64. Remediation:
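The shadowing described above can be sketched in pure Python: the dynamic loader walks LD_LIBRARY_PATH left to right and takes the first directory containing the requested library. The directory layout below is a stand-in built in a temp dir, not the real container filesystem, and `resolve_library` is a simplified model of the loader's search, not its actual algorithm (it ignores the cache, RPATH, and default paths).

```python
import os
import tempfile

def resolve_library(name, search_dirs):
    """Return the first match for a shared library, mimicking the
    loader's left-to-right scan of LD_LIBRARY_PATH (simplified)."""
    for d in search_dirs:
        candidate = os.path.join(d, name)
        if os.path.exists(candidate):
            return candidate
    return None

# Recreate the two directories from the diagnosis with empty stub files.
root = tempfile.mkdtemp()
compat_dir = os.path.join(root, "usr/local/cuda/compat")   # stale 560.x copy
driver_dir = os.path.join(root, "usr/lib/cuda/lib64")      # host 570.133.20 copy
os.makedirs(compat_dir)
os.makedirs(driver_dir)
open(os.path.join(compat_dir, "libcuda.so.1"), "w").close()
open(os.path.join(driver_dir, "libcuda.so.1"), "w").close()

# Ordering baked into the image: the compat dir comes first, so the
# stale library wins and the driver-matched one is never reached.
baked = resolve_library("libcuda.so.1", [compat_dir, driver_dir])
# Remediated ordering: the driver dir comes first.
fixed = resolve_library("libcuda.so.1", [driver_dir, compat_dir])
print(baked)
print(fixed)
```

The fix is therefore an ordering change, not a file change: both libraries stay on disk, but LD_LIBRARY_PATH decides which one cuInit() ends up talking to.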
-
This issue is stale because it has been open for 90 days with no activity. It will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.
-
Hello
We have a K8S node with 8 GPUs of type H200 on RHEL 9.4. CUDA version is 12.8 and Driver version is 570.133.20.
We have configured GPU operator v25.3.0. All daemons are working fine.
We have three pods running with different versions of cuda-python installed - 12.5.1, 12.6.1 and 12.8.1.
When we run the same test program in all of them, only the pod with cuda-python 12.8.1 completes successfully. The other two fail with CUDA_ERROR_SYSTEM_DRIVER_MISMATCH.
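For context, CUDA_ERROR_SYSTEM_DRIVER_MISMATCH is CUresult value 803 in cuda.h, raised by cuInit() when the user-mode libcuda that got loaded does not belong to the running kernel-mode driver. The helper below is a hypothetical sketch of that comparison, not the real check (which happens inside libcuda and is relaxed by forward-compatibility packages); the 560.35.03 version string is an illustrative stand-in for a stale compat library.

```python
# CUresult value from cuda.h for CUDA_ERROR_SYSTEM_DRIVER_MISMATCH.
CUDA_ERROR_SYSTEM_DRIVER_MISMATCH = 803
CUDA_SUCCESS = 0

def check_driver_stack(kernel_driver: str, user_mode_libcuda: str) -> int:
    """Return CUDA_SUCCESS when the user-mode libcuda version matches
    the kernel-mode driver, else the mismatch error code. Illustrative
    only: the real check is internal to cuInit()."""
    if kernel_driver != user_mode_libcuda:
        return CUDA_ERROR_SYSTEM_DRIVER_MISMATCH
    return CUDA_SUCCESS

# A stale 560-series library against the 570.133.20 kernel driver fails;
# the matching pair from the host driver package succeeds.
print(check_driver_stack("570.133.20", "560.35.03"))   # 803
print(check_driver_stack("570.133.20", "570.133.20"))  # 0
```

This is why the cuda-python version alone does not decide success: all three pods talk to the same 570.133.20 kernel driver, and what differs is which user-mode libcuda each container image resolves first.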
I have chosen the option driver.enabled=true so that containers do not rely on what is installed on the system. I have gone through the RKE2 NVIDIA Operator documentation and everything seems alright. I can see the label nvidia.com/gpu.deploy.driver=pre-installed on the node, so the operator sees the pre-existing driver.
So, what should I be doing to successfully invoke different versions of cuda-python on the same node? This problem does not exist in AWS EKS, so what am I missing while configuring it on on-prem nodes?