nvidia-fabricmanager.service can not start due to CUDA Version mismatch #544
Comments
I am having this exact issue as well. What I ended up doing was purging "libnvidia*" and "nvidia*", then reinstalling the NVIDIA driver and nvidia-fabricmanager. The difference this time was that I installed the driver and fabric manager with "apt install nvidia-headless-525-server nvidia-fabricmanager-525" rather than going the generic nvidia-driver-535 install route. After a reboot, nvidia-fabricmanager ran properly. Hope this helps. I'm still learning this stuff, but this is what I found and what worked for me.
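The steps in this comment can be sketched roughly as below. The helper name `print_reinstall_plan` is made up for illustration, and it only prints the commands rather than running them, so they can be reviewed before anything is purged. The package names follow the pattern quoted in the comment; whether that pattern holds for other driver branches is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) the reinstall steps described above.
# $1 is the driver branch, e.g. 525. The package naming pattern is assumed
# from the packages quoted in the comment.
print_reinstall_plan() {
    branch="$1"
    echo "sudo apt-get purge -y 'libnvidia*' 'nvidia*'"
    echo "sudo apt-get install -y nvidia-headless-${branch}-server nvidia-fabricmanager-${branch}"
    echo "sudo reboot"
}

print_reinstall_plan 525
```

The point of the approach is that the driver and fabric manager come from the same branch in one install command, instead of the driver arriving through a generic metapackage.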
@fame346 will check with Canonical on the mismatch of these packages, as they should be aligned with the driver version. Meanwhile, you can also install fabric manager from the NVIDIA CUDA repos: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/
I was also facing the same fabric manager mismatch. I think the Ubuntu package manager updates fabric manager on its own, which causes the NVIDIA driver and fabric manager versions to drift apart. One way to solve this is to install the older fabric manager from the link shared by @shivamerla.
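To check whether a package update has pulled the two versions apart, a comparison along these lines may help. This is only a sketch: the function names are hypothetical, and the assumption that the fabric manager package version carries a Debian-style revision suffix such as "-1" is mine, not something stated in this thread.

```shell
#!/bin/sh
# Drop everything from the first "-" onward, so "530.30.02-1" becomes "530.30.02".
# The suffix format is an assumption about how the package is versioned.
upstream_version() {
    printf '%s\n' "${1%%-*}"
}

# Returns 0 when the driver and fabric manager versions are identical, 1 otherwise.
versions_match() {
    [ "$(upstream_version "$1")" = "$(upstream_version "$2")" ]
}

# Example using the two versions reported in this issue.
if versions_match "530.41.03" "530.30.02-1"; then
    echo "match"
else
    echo "mismatch: consider pinning, e.g. sudo apt-mark hold nvidia-fabricmanager-530"
fi

# On a real node the inputs could come from (not run here):
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
#   dpkg-query -W -f='${Version}' nvidia-fabricmanager-530
```

Once a matching pair is installed, holding the fabric manager package with `apt-mark hold` is one way to keep unattended upgrades from reintroducing the mismatch.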
Thank you @shivamerla. Whoever needs this, below is the fix:
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
i2c_core and ipmi_msghandler loaded on the nodes?
kubectl describe clusterpolicies --all-namespaces
1. Issue or feature description
nv-fabricmanager[3228]: fabric manager NVIDIA GPU driver interface version 530.30.02 don't match with driver version 530.41.03. Please update with matching NVIDIA driver package.
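For scripting around this error, the two version numbers can be pulled out of the log line with `sed`. The sketch below is matched against the exact message wording shown above; `parse_fm_mismatch` is a hypothetical name, and other fabric manager releases may word the message differently.

```shell
#!/bin/sh
# Extract the two versions from a fabric manager mismatch message.
# Prints "<fabric manager interface version> <driver version>",
# or nothing if the line does not match the expected wording.
parse_fm_mismatch() {
    printf '%s\n' "$1" | sed -n \
        's/.*interface version \([0-9.]*\) don.t match with driver version \([0-9][0-9.]*[0-9]\).*/\1 \2/p'
}

line="nv-fabricmanager[3228]: fabric manager NVIDIA GPU driver interface version 530.30.02 don't match with driver version 530.41.03. Please update with matching NVIDIA driver package."
parse_fm_mismatch "$line"
```

The first number is the version fabric manager was built for and the second is the installed driver, so the output shows directly which package needs to move.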
2. Steps to reproduce the issue
Install NVIDIA-530 driver (CUDA 12.1)
Install fabricmanager-530 from apt-get
Then run systemctl enable nvidia-fabricmanager, systemctl start nvidia-fabricmanager
It will fail
3. Information to attach (optional if deemed irrelevant)
My question is simple: where can I find the CUDA 530.30.02 driver, or fabric manager 530.41.03? Both are 530, so they should have matched, right?
kubernetes pods status:
kubectl get pods --all-namespaces
kubernetes daemonset status:
kubectl get ds --all-namespaces
If a pod/ds is in an error state or pending state
kubectl describe pod -n NAMESPACE POD_NAME
If a pod/ds is in an error state or pending state
kubectl logs -n NAMESPACE POD_NAME
Output of running a container on the GPU machine:
docker run -it alpine echo foo
Docker configuration file:
cat /etc/docker/daemon.json
Docker runtime configuration:
docker info | grep runtime
NVIDIA shared directory:
ls -la /run/nvidia
NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
NVIDIA driver directory:
ls -la /run/nvidia/driver
kubelet logs
journalctl -u kubelet > kubelet.logs