Description
Hello
We have a Kubernetes node with eight H200 GPUs running RHEL 9.4. The CUDA version is 12.8 and the driver version is 570.133.20.
We have configured GPU Operator v25.3.0, and all of its daemons are working fine.
We have three pods running with different versions of cuda-python installed/configured: 12.5.1, 12.6.1, and 12.8.1.
When we run the same test program in all three pods, only the pod with cuda-python v12.8.1 completes successfully. The other two fail with a CUDA_ERROR_SYSTEM_DRIVER_MISMATCH error.
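For context on why I expected this to work: as I understand CUDA's compatibility rules, a driver reporting CUDA 12.8 should serve any 12.y user-mode stack with y ≤ 8 without needing compat libraries. A minimal sketch of that rule (the helper name and the simplified major/minor check are my own illustration, not part of any CUDA API):

```python
def driver_supports_toolkit(driver_cuda: str, toolkit: str) -> bool:
    """Simplified compatibility rule: a driver supports a toolkit
    when the major versions match and the toolkit's minor version
    does not exceed the driver's. (Real CUDA minor-version
    compatibility has more nuance, e.g. cuda-compat packages for
    newer minors; this only sketches the baseline rule.)"""
    dmaj, dmin = (int(x) for x in driver_cuda.split(".")[:2])
    tmaj, tmin = (int(x) for x in toolkit.split(".")[:2])
    return dmaj == tmaj and tmin <= dmin

# By this rule, a 12.8 driver should cover all three pod stacks:
for tk in ("12.5", "12.6", "12.8"):
    print(tk, driver_supports_toolkit("12.8", tk))
```

By this reasoning all three pods should run, which is why the mismatch error on the two older stacks surprised me.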
I have set driver.enabled=true so that containers do not rely on the driver installed on the host.
I have gone through the RKE2 NVIDIA Operator documentation and everything seems alright. I can see the label nvidia.com/gpu.deploy.driver=pre-installed on the node, so the operator sees the pre-existing driver.
So, what should I do to successfully run different versions of cuda-python on the same node? This problem does not exist in AWS EKS, so what am I missing when configuring on-prem nodes?
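In case it helps with diagnosis, here is a small probe I can run in each pod (a hypothetical diagnostic of my own, not from any official tooling): it asks the dynamic linker which CUDA driver library the container resolves, so I can compare whether all three pods see the same libcuda or whether some pick up a different (e.g. compat) library.

```python
import ctypes.util
from typing import Optional

def locate_libcuda() -> Optional[str]:
    """Return the name the dynamic linker resolves for the CUDA
    driver library in this container, or None if none is visible.
    Running this in each pod shows whether they all load the same
    driver stack."""
    return ctypes.util.find_library("cuda")

print(locate_libcuda())
```

On a node without the driver library visible this simply prints None, so it is safe to run anywhere.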