Nvidia-fabricmanager failed to start with “NV_WARN_NOTHING_TO_DO” #610

Open
dongkuang opened this issue Nov 15, 2023 · 5 comments

dongkuang commented Nov 15, 2023

I have one NVLink device connecting two NVIDIA A40 graphics cards on Ubuntu 20.04. I downloaded and installed the driver nvidia-driver-local-repo-ubuntu2004-515.105.01_1.0-1_amd64.deb from the official website, then installed CUDA 11.8 (cuda-repo-ubuntu2004-11-8-local_11.8.0-520.61.05-1_amd64.deb), also from the official website. After installing nvidia-fabricmanager-520_520.61.05-1_amd64.deb and nvidia-fabricmanager-dev-520_520.61.05-1_amd64.deb, I started the fabricmanager service
(sudo systemctl start nvidia-fabricmanager) and the following error message was reported:

Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xe" for details.

The error details in the journal show an NV_WARN_NOTHING_TO_DO error, as follows:

11月 15 14:15:47 leon-NF5468M6 systemd[1]: Starting NVIDIA fabric manager service...
11月 15 14:15:47 leon-NF5468M6 nv-fabricmanager[4177]: request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
11月 15 14:15:47 leon-NF5468M6 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
11月 15 14:15:47 leon-NF5468M6 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
11月 15 14:15:47 leon-NF5468M6 systemd[1]: Failed to start NVIDIA fabric manager service.

Please help me, Thanks!

@tariq1890
Contributor

@dongkuang Just to be clear, you are not installing the driver via the container?

Can you confirm if this exists in your system?

/proc/driver/nvidia-nvswitch/devices

@dongkuang
Author

dongkuang commented Nov 17, 2023


@dongkuang Just to be clear, you are not installing the driver via the container?

Can you confirm if this exists in your system?

/proc/driver/nvidia-nvswitch/devices

@tariq1890 Thank you! This directory exists, but it is empty: there are no files or any other content inside. I am also sure that I did not install the driver via the container.

@tariq1890
Contributor

Since /proc/driver/nvidia-nvswitch/devices is an empty directory, it is most likely that there are no NVSwitches for these GPUs.

The Fabric Manager service will not work if there are no NVSwitch devices.
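
The check described above can be sketched in a few lines of Python. This is only an illustration, not an official NVIDIA tool: the function names `list_nvswitch_entries` and `fabric_manager_applicable` are made up for this example, and it assumes the directory holds one entry per NVSwitch when any are present.

```python
from pathlib import Path

# Path quoted in this thread. Assumption: the NVSwitch driver places one
# entry per detected switch in this directory.
NVSWITCH_PROC_DIR = Path("/proc/driver/nvidia-nvswitch/devices")


def list_nvswitch_entries(proc_dir: Path = NVSWITCH_PROC_DIR) -> list[str]:
    """Return entry names under proc_dir, or [] if it is missing or empty."""
    if not proc_dir.is_dir():
        return []
    return sorted(entry.name for entry in proc_dir.iterdir())


def fabric_manager_applicable(proc_dir: Path = NVSWITCH_PROC_DIR) -> bool:
    """Heuristic from this thread: no entries means nothing for Fabric Manager to manage."""
    return len(list_nvswitch_entries(proc_dir)) > 0


if __name__ == "__main__":
    entries = list_nvswitch_entries()
    if entries:
        print("NVSwitch entries found:", ", ".join(entries))
    else:
        print("No NVSwitch entries; nvidia-fabricmanager is not expected to start.")
```

On a machine like the one in this issue, where the directory exists but is empty, the script would take the second branch.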

@dongkuang
Author

Today I saw on the official NVIDIA website that, for the A40, the ultra-fast GDDR6 memory can be expanded to 96 GB through NVLink. How do I install NVSwitch?

@ifourier

ifourier commented Jun 4, 2024

According to the Fabric Manager User Guide, NVSwitches are supported starting with DGX-2, and only V100, A100, and H100 GPUs support them.

https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
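
For a setup like the one in this issue, the direct NVLink connection between the two GPUs can be inspected without Fabric Manager. The sketch below is a hedged illustration that is not taken from the thread: it assumes the two A40s are joined by an NVLink bridge, that `nvidia-smi topo -m` is available, and that the first non-empty line of its output is the header row of GPU column names. The helper names `nvlink_pairs` and `read_topology` are hypothetical.

```python
import re
import subprocess

ANSI_CODES = re.compile(r"\x1b\[[0-9;]*m")  # some versions underline the header
GPU_NAME = re.compile(r"GPU\d+")
NVLINK_CELL = re.compile(r"NV\d+")  # e.g. NV4 = bonded set of 4 NVLinks


def nvlink_pairs(topo_text: str) -> list[tuple[str, str]]:
    """Return (row_gpu, column_gpu) pairs whose matrix cell looks like NV<n>."""
    text = ANSI_CODES.sub("", topo_text)
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    # Assumption: the header is the first non-empty line, GPU columns first.
    columns = GPU_NAME.findall(lines[0])
    pairs = []
    for ln in lines[1:]:
        cells = ln.split()
        if not cells or not GPU_NAME.fullmatch(cells[0]):
            continue  # skip NIC rows, the legend, and anything else
        for col_name, cell in zip(columns, cells[1:]):
            if NVLINK_CELL.fullmatch(cell):
                pairs.append((cells[0], col_name))
    return pairs


def read_topology() -> str:
    """Run `nvidia-smi topo -m` and return its text output."""
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    try:
        print(nvlink_pairs(read_topology()) or "No NVLink connections reported.")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("Could not query nvidia-smi:", exc)
```

If the bridge is seated and recognized, each GPU should appear paired with the other; an empty result suggests the GPUs are communicating over PCIe only.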
