
CUDA_ERROR_SYSTEM_DRIVER_MISMATCH #1428

@RangaSamudrala

Description


Hello,
We have a Kubernetes node with 8 GPUs of type H200 on RHEL 9.4. The CUDA version is 12.8 and the driver version is 570.133.20.

We have configured GPU Operator v25.3.0, and all of its daemons are running fine.

We have three Pods running with different versions of cuda-python installed: 12.5.1, 12.6.1, and 12.8.1.

When we run the same test program in all three Pods, only the Pod with cuda-python v12.8.1 completes successfully. The other two fail with CUDA_ERROR_SYSTEM_DRIVER_MISMATCH.
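For reference, a minimal no-GPU sketch of where this error sits in the CUDA driver API's error-code space. The numeric values come from cuda.h; the lookup helper is a hypothetical stand-in for cuda-python's own `cuGetErrorName`, since the actual test program isn't shown here:

```python
# Illustrative only: map CUDA driver-API error codes (from cuda.h) to names.
# 803 is the code behind CUDA_ERROR_SYSTEM_DRIVER_MISMATCH, which indicates
# a mismatch between the user-mode CUDA driver stack a container sees and
# the kernel-mode driver on the host.
CUDA_ERRORS = {
    0: "CUDA_SUCCESS",
    802: "CUDA_ERROR_SYSTEM_NOT_READY",
    803: "CUDA_ERROR_SYSTEM_DRIVER_MISMATCH",
    804: "CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE",
}

def error_name(code: int) -> str:
    """Return the symbolic name for a CUDA driver-API error code."""
    return CUDA_ERRORS.get(code, f"unknown CUDA error {code}")

print(error_name(803))  # CUDA_ERROR_SYSTEM_DRIVER_MISMATCH
```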

I have chosen the option driver.enabled=true so that containers do not rely on what is installed on the system.
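For context, this is roughly how that option was passed (a sketch assuming the standard gpu-operator Helm chart; the release and namespace names are assumptions, `driver.enabled` is the documented chart value):

```shell
# Illustrative: install the GPU Operator with its own driver container,
# so pods do not depend on the driver installed on the host.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=true
```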

I have gone through the RKE2 NVIDIA Operator documentation and everything seems alright. I can see the label nvidia.com/gpu.deploy.driver=pre-installed on the node, so the operator sees the pre-existing drivers.

So, what should I be doing to successfully invoke different versions of cuda-python on the same node? This problem does not exist on AWS EKS, so what am I missing when configuring on-prem nodes?
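One way to narrow this down is to compare the kernel-module driver version on the host with what each Pod sees from its user-mode CUDA libraries; a disagreement between the two is one known trigger for CUDA_ERROR_SYSTEM_DRIVER_MISMATCH. A small sketch of that check (the /proc line is a sample modeled on this node's 570.133.20 driver; the parser is a hypothetical diagnostic helper):

```python
import re

# Sample of the first line of /proc/driver/nvidia/version on a host
# running the driver reported in this issue (truncated for illustration).
PROC_LINE = "NVRM version: NVIDIA UNIX x86_64 Kernel Module  570.133.20  ..."

def kernel_driver_version(proc_text: str) -> str:
    """Extract the kernel-module driver version from /proc/driver/nvidia/version text."""
    m = re.search(r"Kernel Module\s+(\d+\.\d+(?:\.\d+)?)", proc_text)
    if not m:
        raise ValueError("could not parse /proc/driver/nvidia/version")
    return m.group(1)

# Compare this value, read on the host, against the driver version that
# nvidia-smi reports inside each Pod; they should match.
print(kernel_driver_version(PROC_LINE))  # 570.133.20
```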

Metadata

Assignees: No one assigned

Labels: needs-triage (issue or PR has not been assigned a priority-px label), question (Categorizes issue or PR as a support question)
