Description
Hello
We have a Kubernetes node with eight H200 GPUs running RHEL 9.4. The CUDA version is 12.8 and the driver version is 570.133.20.
We have configured GPU Operator v25.3.0, and all of its daemons are working fine.
We have three pods running with different versions of cuda-python installed/configured: 12.5.1, 12.6.1, and 12.8.1.
When we run the same test program in all three pods, only the pod with cuda-python v12.8.1 completes successfully. The other two fail with a CUDA_ERROR_SYSTEM_DRIVER_MISMATCH error.
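For context on why I expected this to work: as I understand CUDA's compatibility rules, a driver reporting CUDA 12.8 should serve any 12.y user-mode stack with y ≤ 8 without needing compat libraries. A minimal sketch of that rule (the helper name and the simplified major/minor check are my own illustration, not part of any CUDA API):

```python
def driver_supports_toolkit(driver_cuda: str, toolkit: str) -> bool:
    """Simplified compatibility rule: a driver supports a toolkit
    when the major versions match and the toolkit's minor version
    does not exceed the driver's. (Real CUDA minor-version
    compatibility has more nuance, e.g. cuda-compat packages for
    newer minors; this only sketches the baseline rule.)"""
    dmaj, dmin = (int(x) for x in driver_cuda.split(".")[:2])
    tmaj, tmin = (int(x) for x in toolkit.split(".")[:2])
    return dmaj == tmaj and tmin <= dmin

# By this rule, a 12.8 driver should cover all three pod stacks:
for tk in ("12.5", "12.6", "12.8"):
    print(tk, driver_supports_toolkit("12.8", tk))
```

By this reasoning all three pods should run, which is why the mismatch error on the two older stacks surprised me.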
I have set driver.enabled=true so that containers do not rely on the driver installed on the host.
I have gone through the RKE2 NVIDIA Operator documentation and everything seems alright. I can see the label nvidia.com/gpu.deploy.driver=pre-installed on the node, so the operator sees the pre-existing driver.
So, what should I do to successfully run different versions of cuda-python on the same node? This problem does not exist in AWS EKS, so what am I missing when configuring on-prem nodes?
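In case it helps with diagnosis, here is a small probe I can run in each pod (a hypothetical diagnostic of my own, not from any official tooling): it asks the dynamic linker which CUDA driver library the container resolves, so I can compare whether all three pods see the same libcuda or whether some pick up a different (e.g. compat) library.

```python
import ctypes.util
from typing import Optional

def locate_libcuda() -> Optional[str]:
    """Return the name the dynamic linker resolves for the CUDA
    driver library in this container, or None if none is visible.
    Running this in each pod shows whether they all load the same
    driver stack."""
    return ctypes.util.find_library("cuda")

print(locate_libcuda())
```

On a node without the driver library visible this simply prints None, so it is safe to run anywhere.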