nvidia-fabricmanager.service can not start due to CUDA Version mismatch #544

bo-zeng-ml · 2023-06-28T23:40:22Z

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

Are you running on an Ubuntu 18.04 node?
Are you running Kubernetes v1.13+?
Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
Do you have i2c_core and ipmi_msghandler loaded on the nodes?
Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

1. Issue or feature description

nv-fabricmanager[3228]: fabric manager NVIDIA GPU driver interface version 530.30.02 don't match with driver version 530.41.03. Please update with matching NVIDIA driver package.

2. Steps to reproduce the issue

Install NVIDIA-530 driver (CUDA 12.1)

Install fabricmanager-530 from apt-get

Then run systemctl enable nvidia-fabricmanager, systemctl start nvidia-fabricmanager

It will fail
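A quick way to confirm the mismatch reported in the log is to compare the two versions directly. The following is a minimal sketch, not an official NVIDIA tool. It assumes the loaded driver exposes its version at /sys/module/nvidia/version and that Fabric Manager was installed as the Debian package nvidia-fabricmanager-530; the helper names upstream_version and versions_match and the FM_PKG variable are made up for illustration.

```shell
#!/bin/sh
# Sketch: check that the NVIDIA driver and Fabric Manager versions are identical.

# Reduce a Debian package version to its upstream part,
# e.g. "1:530.30.02-1" -> "530.30.02".
upstream_version() {
    printf '%s\n' "$1" | sed -e 's/^[0-9]*://' -e 's/-.*$//'
}

# Succeed only when both upstream versions are identical.
versions_match() {
    [ "$(upstream_version "$1")" = "$(upstream_version "$2")" ]
}

# Assumption: the loaded driver exposes its version at this sysfs path,
# and Fabric Manager was installed as the package named in FM_PKG.
FM_PKG=${FM_PKG:-nvidia-fabricmanager-530}
if [ -r /sys/module/nvidia/version ] && command -v dpkg-query >/dev/null 2>&1; then
    driver=$(cat /sys/module/nvidia/version)
    fm=$(dpkg-query -W -f='${Version}' "$FM_PKG" 2>/dev/null || true)
    if versions_match "$driver" "$fm"; then
        echo "OK: driver and $FM_PKG are both $driver"
    else
        echo "MISMATCH: driver $driver, $FM_PKG ${fm:-not installed}"
    fi
fi
```

On a machine without the NVIDIA kernel module loaded, the script only defines the helpers and prints nothing. Set FM_PKG to check a different driver branch.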

3. Information to attach (optional if deemed irrelevant)

My question is simple: where can I find the CUDA 530.30.02 driver, or Fabric Manager 530.41.03? Both are 530 packages, so they should have matched, right?

yzhao-2023 · 2023-10-14T01:00:51Z

Any update?

The above problem seems relevant to mine, but the author did not post any details.

My failure looks like this:

fame346 · 2023-10-16T16:32:01Z

I am having this exact same issue. What I ended up doing was purging "libnvidia*" and "nvidia*", then reinstalling the NVIDIA driver and nvidia-fabricmanager. The difference this time was that I installed the driver and Fabric Manager together with "apt install nvidia-headless-525-server nvidia-fabricmanager-525" rather than going the generic nvidia-driver-535 install route. After a reboot, nvidia-fabricmanager ran properly. Hope this helps. I am still learning this stuff, but this is what worked for me.
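The purge-and-reinstall steps described above can be sketched as a small script. This is only an illustration of that comment, not a tested procedure: the reinstall_pair function and the RUN variable are hypothetical names, and by default the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: purge all NVIDIA packages, then install the driver and Fabric Manager
# from the same branch in a single apt transaction.

# Dry run by default: every command is passed to echo. Set RUN=sudo to execute.
RUN=${RUN:-echo}

reinstall_pair() {
    branch=$1   # driver branch, e.g. 525
    $RUN apt-get purge -y 'libnvidia*' 'nvidia*'
    $RUN apt-get install -y "nvidia-headless-${branch}-server" "nvidia-fabricmanager-${branch}"
    # A reboot is still needed afterwards, as the comment above notes.
}
```

Calling reinstall_pair 525 prints the two apt-get commands for review. Running the script with RUN=sudo in the environment executes them instead.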

shivamerla · 2023-10-16T16:50:34Z

@fame346 will check with Canonical on the mismatch of these packages as they should be aligned with the driver version. Meanwhile you can install fabric manager from NVIDIA CUDA repos as well: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/

saxenarohit · 2023-11-03T19:50:32Z

I was also facing the same Fabric Manager mismatch.

I think the Ubuntu package manager updates Fabric Manager on its own, which leaves the NVIDIA driver and Fabric Manager on different versions.
Once Fabric Manager has been updated, you cannot reinstall the previous version, because it has been removed from the Ubuntu repositories.

The only ways to solve this are to either install the old Fabric Manager from the link shared by @shivamerla, or reinstall the latest NVIDIA driver that matches the Fabric Manager available on Ubuntu (535.129.03 worked for me).
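For the first option, one way to find a Fabric Manager build that matches the running driver is to filter the output of apt-cache madison. The sketch below assumes the NVIDIA CUDA repository is already configured and that the package is called nvidia-fabricmanager-535; pick_version is a hypothetical helper, and it assumes the repository versions carry no epoch prefix.

```shell
#!/bin/sh
# Sketch: from `apt-cache madison` output on stdin, print the first candidate
# version that starts with the wanted driver version.
pick_version() {
    awk -F'|' -v want="$1" '{
        gsub(/ /, "", $2)                       # madison pads its columns with spaces
        if (index($2, want) == 1) { print $2; exit }
    }'
}

# Example wiring (assumptions: sysfs version path, 535 branch); not executed here:
#   v=$(apt-cache madison nvidia-fabricmanager-535 | pick_version "$(cat /sys/module/nvidia/version)")
#   apt-get install "nvidia-fabricmanager-535=$v"
#   apt-mark hold nvidia-fabricmanager-535
```

If the matching version is older than the one installed, apt-get may also need --allow-downgrades. The apt-mark hold step keeps apt from moving Fabric Manager again until the hold is released.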

OrenLeung · 2024-04-27T02:38:18Z

This happens when you install cuda-drivers-535 and nvidia-fabricmanager-535 as separate packages and apt unattended-upgrades updates ONLY nvidia-fabricmanager-535 when a new version is available.

To resolve this, either:

1. Uninstall unattended-upgrades, or put the NVIDIA packages on its denylist.
2. Install ONLY the wrapper package cuda-drivers-fabricmanager-535, which keeps cuda-drivers-535 and nvidia-fabricmanager-535 in sync, instead of installing them separately.

Or probably do option 1 AND option 2 to prevent this from happening again.
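Option 1 can be sketched as an apt configuration drop-in that tells unattended-upgrades to skip NVIDIA packages. This is a sketch under assumptions: the write_denylist helper and the choice of patterns are illustrative, and the entries are regular expressions matched against package names, so adjust them to the packages actually installed.

```shell
#!/bin/sh
# Sketch: write an apt.conf.d drop-in that excludes NVIDIA packages
# from unattended-upgrades. The target path is passed as the first argument.
write_denylist() {
    cat > "$1" <<'EOF'
// Keep unattended-upgrades away from NVIDIA driver packages.
Unattended-Upgrade::Package-Blacklist {
    "nvidia-";
    "libnvidia-";
    "cuda-drivers";
};
EOF
}
```

As root, something like write_denylist /etc/apt/apt.conf.d/51nvidia-denylist would install it (the file name is a made-up example). Manual apt upgrades are not affected, so the driver and Fabric Manager can still be upgraded together by hand.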

andy108369 · 2024-11-05T14:46:15Z

Thank you @shivamerla
This also helped in my situation where all the nvidia-* packages were a version behind the nvidia-fabricmanager offered by the Ubuntu repos:

For whoever needs this, the fix is below:

Running dist-upgrade with the official NVIDIA repo bumps the nvidia packages along with nvidia-fabricmanager, without any version mismatch.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
apt-key add 3bf863cc.pub 

echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /" > /etc/apt/sources.list.d/nvidia-official-repo.list
apt update
apt dist-upgrade
apt autoremove

# `dpkg -l | grep nvidia` -- make sure to remove any version you don't expect
# and reboot
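The manual `dpkg -l | grep nvidia` check can be narrowed to only the packages that are off-version. The following is a rough sketch: unexpected_versions is a hypothetical helper, EXPECTED is an assumed environment variable holding the driver version you want (for example 535.129.03), and packages that are versioned independently of the driver will show up too and need a human decision.

```shell
#!/bin/sh
# Sketch: read "package version" pairs on stdin and print every pair whose
# upstream version differs from the expected one given as the first argument.
unexpected_versions() {
    awk -v want="$1" '{
        if (NF < 2) next            # known to dpkg but not installed
        v = $2
        sub(/^[0-9]+:/, "", v)      # drop a Debian epoch, if any
        sub(/-.*$/, "", v)          # drop the Debian revision
        if (v != want) print $1, $2
    }'
}

# Only query dpkg when an expected version was supplied.
if [ -n "${EXPECTED:-}" ] && command -v dpkg-query >/dev/null 2>&1; then
    dpkg-query -W -f='${Package} ${Version}\n' 'nvidia-*' 'libnvidia-*' 2>/dev/null \
        | unexpected_versions "$EXPECTED"
fi
```

An empty result means every installed nvidia-* and libnvidia-* package is on the expected version.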
