The NVIDIA ICD JSON occasionally goes missing from 'nvidia-ctk cdi generate' #767

Open
debarshiray opened this issue Oct 31, 2024 · 6 comments

debarshiray commented Oct 31, 2024

I have been playing with the NVIDIA Container Toolkit on Fedora 39 Workstation and the proprietary NVIDIA driver from RPM Fusion. I have noticed that the NVIDIA installable client driver (or ICD) JSON for Vulkan occasionally goes missing from nvidia-ctk cdi generate:

$ nvidia-ctk cdi generate --format yaml 2>/dev/null | grep vulkan
 - containerPath: /etc/vulkan/implicit_layer.d/nvidia_layers.json
   hostPath: /usr/share/vulkan/implicit_layer.d/nvidia_layers.json

... even though the file is present on the host operating system at /usr/share/vulkan/icd.d/nvidia_icd.x86_64.json and Vulkan support on the host is confirmed by:

$ vulkaninfo --summary
...
...
Devices:
========
GPU0:
	apiVersion         = 1.3.280
	driverVersion      = 560.35.3.0
	vendorID           = 0x10de
	deviceID           = 0x1cbc
	deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
	deviceName         = Quadro P600
	driverID           = DRIVER_ID_NVIDIA_PROPRIETARY
	driverName         = NVIDIA
	driverInfo         = 560.35.3.0
	conformanceVersion = 1.3.8.2
	deviceUUID         = 2efa4848-ba99-ccd3-0a19-f497b31331ca
	driverUUID         = c3ca0510-c7e6-5f1c-86a1-dc0ed4ea4e21
...
...

This means that Podman containers don't have Vulkan support through the proprietary NVIDIA driver, and can only use LLVMpipe.
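A quick way to tell the two states apart is to scan the generated spec for an ICD entry instead of reading the grep output by eye. The following is only a minimal sketch: `cdi_icd_status` is a hypothetical helper name, and the pattern assumes that the ICD would appear as a `hostPath` under a `vulkan/icd.d` directory with a file name starting with `nvidia_icd`.

```shell
# Hypothetical helper: reads a CDI spec (YAML) on stdin and prints
# "listed" if any hostPath points at an NVIDIA Vulkan ICD JSON,
# "missing" otherwise.
# Assumption: the ICD lives under .../vulkan/icd.d/ and its file name
# starts with "nvidia_icd".
cdi_icd_status() {
    if grep -Eq 'hostPath:.*/vulkan/icd\.d/nvidia_icd[^/]*\.json[[:space:]]*$'; then
        echo listed
    else
        echo missing
    fi
}

# Example (needs nvidia-ctk on the host):
#   nvidia-ctk cdi generate --format yaml 2>/dev/null | cdi_icd_status
```

With the output shown above, which only contains the implicit layer JSON, this would print `missing`.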

Right now, I am observing this problem with:

$ uname --kernel-release
6.11.4-101.fc39.x86_64
$ rpm -q kernel
kernel-6.5.6-300.fc39.x86_64
kernel-6.11.4-101.fc39.x86_64
$ rpm -q kmod-nvidia
kmod-nvidia-560.35.03-1.fc39.x86_64
@debarshiray
Author

I forgot to mention the NVIDIA Container Toolkit version:

$ nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.16.1
$ rpm -qf $(which nvidia-ctk)
golang-github-nvidia-container-toolkit-1.16.1-1.fc39.x86_64

Note that the NVIDIA Container Toolkit version was the same when the NVIDIA ICD JSON for Vulkan was listed and when it went missing. What changed was that I pulled in the RPM updates for the rest of the Fedora host.

@elezar
Member

elezar commented Nov 1, 2024

@debarshiray the host path you mention, /usr/share/vulkan/icd.d/nvidia_icd.x86_64.json, is not one that we explicitly search for. Could you please confirm which package provides that file? It could be that the 560.35.3.0 driver that you're using now ships the file with the architecture string in its name.

(Looking at some older internal documentation it seems as if this has been the case for a while).

@debarshiray
Author

Thanks for looking into it, @elezar !

Meanwhile, I reinstalled different versions of Fedora a few times to see if the problem is specific to a particular combination of package versions. I could reproduce it reliably on Fedora 40 and 41, which was surprising because this used to work. :)

Now with Fedora 41 Workstation and the proprietary NVIDIA driver from RPM Fusion, I see:

$ rpm --query --file /usr/share/vulkan/icd.d/nvidia_icd.x86_64.json
xorg-x11-drv-nvidia-libs-560.35.03-5.fc41.x86_64

If I force /usr/share/vulkan/icd.d/nvidia_icd.x86_64.json to be present inside the container through an explicit bind mount, then I do get Vulkan support through the proprietary NVIDIA driver.

In all cases, Vulkan support is available through the proprietary driver on the host operating system, as shown in the vulkaninfo --summary snippet above.
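To see at a glance which NVIDIA ICD files a host actually ships, something like the sketch below could help. `list_nvidia_icds` is a hypothetical helper, and the tags it prints rest on an assumption that has not been confirmed at this point in the thread: that the tooling looks for the unsuffixed name `nvidia_icd.json`.

```shell
# Hypothetical helper: list NVIDIA Vulkan ICD files in a directory and
# tag each one by whether it carries the unsuffixed file name.
# Assumption: nvidia_icd.json is the name the tooling looks for.
list_nvidia_icds() {
    dir=${1:-/usr/share/vulkan/icd.d}
    for f in "$dir"/nvidia_icd*.json; do
        [ -e "$f" ] || continue   # the glob matched nothing
        case ${f##*/} in
            nvidia_icd.json) echo "expected-name $f" ;;
            *)               echo "other-name $f" ;;
        esac
    done
}
```

Each reported path can then be passed to `rpm --query --file`, as above, to find the package that owns it.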

@elezar
Member

elezar commented Nov 4, 2024

Who is the publisher of the xorg-x11-drv-nvidia-libs-560.35.03-5.fc41.x86_64 package above?

@debarshiray
Author

> Who is the publisher of the xorg-x11-drv-nvidia-libs-560.35.03-5.fc41.x86_64 package above?

It's RPM Fusion. That's where I got the proprietary NVIDIA driver from.

@elezar
Member

elezar commented Nov 29, 2024

The issue is that the RPM Fusion driver package definition (see https://pkgs.rpmfusion.org/cgit/nonfree/xorg-x11-drv-nvidia.git/tree/xorg-x11-drv-nvidia.spec#n294) changes the ICD file name, so the file no longer has the name that the NVIDIA tooling expects. This means that the NVIDIA Container Toolkit can't locate the expected ICD.

My suggestion would be to file a bug against RPM Fusion so that the package maintains the behaviour of the NVIDIA driver.

As a workaround, you could rename or copy the nvidia_icd.x86_64.json file to nvidia_icd.json, or ensure that you start your containers with:

-v /usr/share/vulkan/icd.d/nvidia_icd.x86_64.json:/usr/share/vulkan/icd.d/nvidia_icd.json

or add an additional mount to your CDI spec.

I will look into what would be required for a more stable workaround, but can't commit to a specific timeline.
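The mount-based workarounds can be sketched as two small shell helpers. Both function names are hypothetical. The first builds the `-v` flag for a given host ICD path. The second prints a mount entry that could be added by hand to a CDI spec: the `containerPath`/`hostPath` fields mirror the generated spec quoted earlier in this issue, while the `options` list and the placement under `containerEdits.mounts` are assumptions about the CDI format and should be checked against your own generated spec.

```shell
# Hypothetical helper: build the volume flag that exposes a host ICD
# file under the unsuffixed name nvidia_icd.json, in the same directory,
# inside the container.
icd_mount_arg() {
    host_path=$1
    printf '%s\n' "-v ${host_path}:${host_path%/*}/nvidia_icd.json"
}

# Hypothetical helper: print a CDI mount entry for the same mapping.
# Assumption: the options list below is illustrative; compare it with
# the other mounts in your generated spec before using it.
cdi_mount_entry() {
    host_path=$1
    cat <<EOF
- containerPath: ${host_path%/*}/nvidia_icd.json
  hostPath: ${host_path}
  options:
    - ro
    - nosuid
    - nodev
    - bind
EOF
}

# Example (IMAGE is a placeholder for an image that ships vulkaninfo):
#   podman run --rm --device nvidia.com/gpu=all \
#       $(icd_mount_arg /usr/share/vulkan/icd.d/nvidia_icd.x86_64.json) \
#       IMAGE vulkaninfo --summary
```

Deriving the container path from the host path keeps the two workarounds consistent, so whichever one is used, the file appears inside the container under the same name.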
