CUDA_ERROR_SYSTEM_DRIVER_MISMATCH #2170
Replies: 3 comments
-
Looking at #126, I can see that upgrades have been made to use pre-existing drivers on the host. The GPU Operator install guide says the following: What if we specify
-
Running NVIDIA GPU commands with LD_DEBUG shows that the user-mode driver library libcuda.so.560.x is loaded from /usr/local/cuda/compat, the forward-compatibility default shipped by NVIDIA. But the library matching the installed driver (libcuda.so.570.133.20) lives in /usr/lib/cuda/lib64. This mismatch is what produces the CUDA_ERROR_SYSTEM_DRIVER_MISMATCH error. Root cause: when building the Triton Inference Server image, NVIDIA sets the LD_LIBRARY_PATH environment variable so that /usr/local/cuda/compat is searched before /usr/lib/cuda/lib64. Remediation:
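The shadowing described above can be sketched in pure Python: the dynamic loader walks LD_LIBRARY_PATH left to right and takes the first directory containing the requested library. The directory layout below is a stand-in built in a temp dir, not the real container filesystem, and `resolve_library` is a simplified model of the loader's search, not its actual algorithm (it ignores the cache, RPATH, and default paths).

```python
import os
import tempfile

def resolve_library(name, search_dirs):
    """Return the first match for a shared library, mimicking the
    loader's left-to-right scan of LD_LIBRARY_PATH (simplified)."""
    for d in search_dirs:
        candidate = os.path.join(d, name)
        if os.path.exists(candidate):
            return candidate
    return None

# Recreate the two directories from the diagnosis with empty stub files.
root = tempfile.mkdtemp()
compat_dir = os.path.join(root, "usr/local/cuda/compat")   # stale 560.x copy
driver_dir = os.path.join(root, "usr/lib/cuda/lib64")      # host 570.133.20 copy
os.makedirs(compat_dir)
os.makedirs(driver_dir)
open(os.path.join(compat_dir, "libcuda.so.1"), "w").close()
open(os.path.join(driver_dir, "libcuda.so.1"), "w").close()

# Ordering baked into the image: the compat dir comes first, so the
# stale library wins and the driver-matched one is never reached.
baked = resolve_library("libcuda.so.1", [compat_dir, driver_dir])
# Remediated ordering: the driver dir comes first.
fixed = resolve_library("libcuda.so.1", [driver_dir, compat_dir])
print(baked)
print(fixed)
```

The fix is therefore an ordering change, not a file change: both libraries stay on disk, but LD_LIBRARY_PATH decides which one cuInit() ends up talking to.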
-
This issue is stale because it has been open for 90 days with no activity. It will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.
-
Hello
We have a K8S node with 8 GPUs of type H200 on RHEL 9.4. CUDA version is 12.8 and Driver version is 570.133.20.
We have configured GPU operator v25.3.0. All daemons are working fine.
We have three pods running with different versions of cuda-python installed - 12.5.1, 12.6.1 and 12.8.1.
When we run the same test program in all of them, only the pod with cuda-python 12.8.1 completes successfully. The other two fail with CUDA_ERROR_SYSTEM_DRIVER_MISMATCH.
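For context, CUDA_ERROR_SYSTEM_DRIVER_MISMATCH is CUresult value 803 in cuda.h, raised by cuInit() when the user-mode libcuda that got loaded does not belong to the running kernel-mode driver. The helper below is a hypothetical sketch of that comparison, not the real check (which happens inside libcuda and is relaxed by forward-compatibility packages); the 560.35.03 version string is an illustrative stand-in for a stale compat library.

```python
# CUresult value from cuda.h for CUDA_ERROR_SYSTEM_DRIVER_MISMATCH.
CUDA_ERROR_SYSTEM_DRIVER_MISMATCH = 803
CUDA_SUCCESS = 0

def check_driver_stack(kernel_driver: str, user_mode_libcuda: str) -> int:
    """Return CUDA_SUCCESS when the user-mode libcuda version matches
    the kernel-mode driver, else the mismatch error code. Illustrative
    only: the real check is internal to cuInit()."""
    if kernel_driver != user_mode_libcuda:
        return CUDA_ERROR_SYSTEM_DRIVER_MISMATCH
    return CUDA_SUCCESS

# A stale 560-series library against the 570.133.20 kernel driver fails;
# the matching pair from the host driver package succeeds.
print(check_driver_stack("570.133.20", "560.35.03"))   # 803
print(check_driver_stack("570.133.20", "570.133.20"))  # 0
```

This is why the cuda-python version alone does not decide success: all three pods talk to the same 570.133.20 kernel driver, and what differs is which user-mode libcuda each container image resolves first.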
I have chosen the option driver.enabled=true so that containers do not rely on what is installed on the system. I have gone through the RKE2 NVIDIA Operator documentation and everything seems alright. I can see the label nvidia.com/gpu.deploy.driver=pre-installed on the node, so the operator sees the pre-existing driver.
So, what should I be doing to successfully invoke different versions of cuda-python on the same node? This problem does not exist in AWS EKS, so what am I missing while configuring it on on-prem nodes?