nvidia-container-toolkit fails to create /etc/vulkan/icd.d/nvidia_icd.json #811

Open
buzmeg opened this issue Nov 25, 2024 · 5 comments

buzmeg commented Nov 25, 2024

When nvidia-container-toolkit is installed inside a toolbox container on Fedora Silverblue 41, it doesn't create /etc/vulkan/icd.d/nvidia_icd.json, which Vulkan needs in order to find the NVIDIA card. Without that file, vulkaninfo reports only llvmpipe. If you copy a working nvidia_icd.json from elsewhere into place, the card is found.

Probably related to #767

Installation transcript:

$ toolbox create -i registry.fedoraproject.org/fedora-toolbox:41 vk_new
Created container: vk_new
Enter with: toolbox enter vk_new
foo@fedora:~$ toolbox enter vk_new
⬢ [foo@toolbx ~]$ # Directions from https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
⬢ [foo@toolbx ~]$ curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[nvidia-container-toolkit-experimental]
name=nvidia-container-toolkit-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
⬢ [foo@toolbx ~]$ sudo dnf install nvidia-container-toolkit
Updating and loading repositories:
 nvidia-container-toolkit                                                                                                                                                      100% |  49.0 KiB/s |   2.9 KiB |  00m00s
>>> Librepo error: repomd.xml GPG signature verification error: Signing key not found
 Fedora 41 openh264 (From Cisco) - x86_64                                                                                                                                      100% |   8.7 KiB/s |   4.8 KiB |  00m01s
 Fedora 41 - x86_64 - Updates                                                                                                                                                  100% |   2.4 MiB/s |   5.4 MiB |  00m02s
 Fedora 41 - x86_64                                                                                                                                                            100% |  18.9 MiB/s |  35.4 MiB |  00m02s
 https://nvidia.github.io/libnvidia-container/gpgkey                                                                                                                           100% | 124.8 KiB/s |   3.1 KiB |  00m00s
Importing PGP key 0xF796ECB0:
 UserID     : "NVIDIA CORPORATION (Open Source Projects) <[email protected]>"
 Fingerprint: C95B321B61E88C1809C4F759DDCAE044F796ECB0
 From       : https://nvidia.github.io/libnvidia-container/gpgkey
Is this ok [y/N]: y
The key was successfully imported.

 nvidia-container-toolkit                                                                                                                                                      100% | 231.0 KiB/s |  19.4 KiB |  00m00s
Repositories loaded.
Package                                                                  Arch            Version                                                                  Repository                                       Size
Installing:
 nvidia-container-toolkit                                                x86_64          1.17.2-1                                                                 nvidia-container-toolkit                      3.9 MiB
Installing dependencies:
 libnvidia-container-tools                                               x86_64          1.17.2-1                                                                 nvidia-container-toolkit                    104.4 KiB
 libnvidia-container1                                                    x86_64          1.17.2-1                                                                 nvidia-container-toolkit                      3.1 MiB
 nvidia-container-toolkit-base                                           x86_64          1.17.2-1                                                                 nvidia-container-toolkit                     19.3 MiB

Transaction Summary:
 Installing:         4 packages

Total size of inbound packages is 8 MiB. Need to download 8 MiB.
After this operation, 26 MiB extra will be used (install 26 MiB, remove 0 B).
Is this ok [y/N]: y
[1/4] libnvidia-container-tools-0:1.17.2-1.x86_64                                                                                                                              100% | 325.4 KiB/s |  39.4 KiB |  00m00s
[2/4] nvidia-container-toolkit-0:1.17.2-1.x86_64                                                                                                                               100% |   5.4 MiB/s |   1.2 MiB |  00m00s
[3/4] nvidia-container-toolkit-base-0:1.17.2-1.x86_64                                                                                                                          100% |  22.9 MiB/s |   5.6 MiB |  00m00s
[4/4] libnvidia-container1-0:1.17.2-1.x86_64                                                                                                                                   100% |   2.0 MiB/s |   1.0 MiB |  00m00s
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[4/4] Total                                                                                                                                                                    100% |  13.0 MiB/s |   7.9 MiB |  00m01s
Running transaction
[1/6] Verify package files                                                                                                                                                     100% | 222.0   B/s |   4.0   B |  00m00s
[2/6] Prepare transaction                                                                                                                                                      100% | 102.0   B/s |   4.0   B |  00m00s
[3/6] Installing libnvidia-container1-0:1.17.2-1.x86_64                                                                                                                        100% |  35.7 MiB/s |   3.1 MiB |  00m00s
[4/6] Installing libnvidia-container-tools-0:1.17.2-1.x86_64                                                                                                                   100% |  12.9 MiB/s | 105.5 KiB |  00m00s
[5/6] Installing nvidia-container-toolkit-base-0:1.17.2-1.x86_64                                                                                                               100% |  85.9 MiB/s |  19.3 MiB |  00m00s
[6/6] Installing nvidia-container-toolkit-0:1.17.2-1.x86_64                                                                                                                    100% |  21.9 MiB/s |   3.9 MiB |  00m00s
Warning: skipped PGP checks for 4 packages from repository: nvidia-container-toolkit
Complete!
⬢ [foo@toolbx ~]$ nvidia-smi
Mon Nov 25 17:16:17 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070        Off |   00000000:01:00.0  On |                  N/A |
|  0%   40C    P8             19W /  240W |    1858MiB /   8192MiB |     17%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2310      G   /usr/bin/gnome-shell                          438MiB |
|    0   N/A  N/A      3560    C+G   /usr/bin/ptyxis                               248MiB |
|    0   N/A  N/A      3672      G   /usr/bin/Xwayland                               4MiB |
|    0   N/A  N/A     13845      G   /usr/lib64/firefox/firefox                   1088MiB |
+-----------------------------------------------------------------------------------------+
⬢ [foo@toolbx ~]$ vulkaninfo --summary
bash: vulkaninfo: command not found
⬢ [foo@toolbx ~]$ sudo dnf install vulkan-tools
Updating and loading repositories:
Repositories loaded.
Package                                                                   Arch             Version                                                                   Repository                                    Size
Installing:
 vulkan-tools                                                             x86_64           1.3.296.0-2.fc41                                                          updates                                    1.3 MiB

Transaction Summary:
 Installing:         1 package

Total size of inbound packages is 322 KiB. Need to download 322 KiB.
After this operation, 1 MiB extra will be used (install 1 MiB, remove 0 B).
Is this ok [y/N]: y
[1/1] vulkan-tools-0:1.3.296.0-2.fc41.x86_64                                                                                                                                   100% | 826.9 KiB/s | 322.5 KiB |  00m00s
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[1/1] Total                                                                                                                                                                    100% | 487.1 KiB/s | 322.5 KiB |  00m01s
Running transaction
[1/3] Verify package files                                                                                                                                                     100% |   0.0   B/s |   1.0   B |  00m00s
[2/3] Prepare transaction                                                                                                                                                      100% |  50.0   B/s |   1.0   B |  00m00s
[3/3] Installing vulkan-tools-0:1.3.296.0-2.fc41.x86_64                                                                                                                        100% |  10.2 MiB/s |   1.3 MiB |  00m00s
Complete!
⬢ [foo@toolbx ~]$ vulkaninfo --summary
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.296


Instance Extensions: count = 24
-------------------------------
VK_EXT_acquire_drm_display             : extension revision 1
VK_EXT_acquire_xlib_display            : extension revision 1
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_direct_mode_display             : extension revision 1
VK_EXT_display_surface_counter         : extension revision 1
VK_EXT_headless_surface                : extension revision 1
VK_EXT_surface_maintenance1            : extension revision 1
VK_EXT_swapchain_colorspace            : extension revision 4
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_display                         : extension revision 23
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2         : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_portability_enumeration         : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_surface_protected_capabilities  : extension revision 1
VK_KHR_wayland_surface                 : extension revision 6
VK_KHR_xcb_surface                     : extension revision 6
VK_KHR_xlib_surface                    : extension revision 6
VK_LUNARG_direct_driver_loading        : extension revision 1

Instance Layers: count = 2
--------------------------
VK_LAYER_MESA_device_select Linux device selection layer 1.3.211  version 1
VK_LAYER_NV_optimus         NVIDIA Optimus layer         1.3.289  version 1

Devices:
========
GPU0:
	apiVersion         = 1.3.289
	driverVersion      = 0.0.1
	vendorID           = 0x10005
	deviceID           = 0x0000
	deviceType         = PHYSICAL_DEVICE_TYPE_CPU
	deviceName         = llvmpipe (LLVM 19.1.0, 256 bits)
	driverID           = DRIVER_ID_MESA_LLVMPIPE
	driverName         = llvmpipe
	driverInfo         = Mesa 24.2.6 (LLVM 19.1.0)
	conformanceVersion = 1.3.1.1
	deviceUUID         = 6d657361-3234-2e32-2e36-000000000000
	driverUUID         = 6c6c766d-7069-7065-5555-494400000000
⬢ [foo@toolbx ~]$ cp nvidia_icd.json /etc/vulkan/
explicit_layer.d/ icd.d/            implicit_layer.d/ 
⬢ [foo@toolbx ~]$ sudo cp nvidia_icd.json /etc/vulkan/icd.d/
⬢ [foo@toolbx ~]$ vulkaninfo --summary
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.296


Instance Extensions: count = 24
-------------------------------
VK_EXT_acquire_drm_display             : extension revision 1
VK_EXT_acquire_xlib_display            : extension revision 1
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_direct_mode_display             : extension revision 1
VK_EXT_display_surface_counter         : extension revision 1
VK_EXT_headless_surface                : extension revision 1
VK_EXT_surface_maintenance1            : extension revision 1
VK_EXT_swapchain_colorspace            : extension revision 4
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_display                         : extension revision 23
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2         : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_portability_enumeration         : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_surface_protected_capabilities  : extension revision 1
VK_KHR_wayland_surface                 : extension revision 6
VK_KHR_xcb_surface                     : extension revision 6
VK_KHR_xlib_surface                    : extension revision 6
VK_LUNARG_direct_driver_loading        : extension revision 1

Instance Layers: count = 2
--------------------------
VK_LAYER_MESA_device_select Linux device selection layer 1.3.211  version 1
VK_LAYER_NV_optimus         NVIDIA Optimus layer         1.3.289  version 1

Devices:
========
GPU0:
	apiVersion         = 1.3.289
	driverVersion      = 565.57.1.0
	vendorID           = 0x10de
	deviceID           = 0x2488
	deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
	deviceName         = NVIDIA GeForce RTX 3070
	driverID           = DRIVER_ID_NVIDIA_PROPRIETARY
	driverName         = NVIDIA
	driverInfo         = 565.57.01
	conformanceVersion = 1.3.8.2
	deviceUUID         = 3c4b693b-0187-c02a-92f8-3fb533a66f2c
	driverUUID         = a40eb34f-a796-5990-89ac-95d78eb83699
GPU1:
	apiVersion         = 1.3.289
	driverVersion      = 0.0.1
	vendorID           = 0x10005
	deviceID           = 0x0000
	deviceType         = PHYSICAL_DEVICE_TYPE_CPU
	deviceName         = llvmpipe (LLVM 19.1.0, 256 bits)
	driverID           = DRIVER_ID_MESA_LLVMPIPE
	driverName         = llvmpipe
	driverInfo         = Mesa 24.2.6 (LLVM 19.1.0)
	conformanceVersion = 1.3.1.1
	deviceUUID         = 6d657361-3234-2e32-2e36-000000000000
	driverUUID         = 6c6c766d-7069-7065-5555-494400000000
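
To compare the two `vulkaninfo --summary` runs above without reading the full output, a small filter can help. This is only a sketch: the function names `list_vulkan_devices` and `has_nvidia_device` are made up for illustration, and it assumes the summary format shown in this transcript (`deviceName = ...` and `driverID = ...` lines).

```shell
# Hypothetical helpers; both read `vulkaninfo --summary` output on stdin.

# Print the deviceName value of every device in the summary.
list_vulkan_devices() {
    sed -n 's/^[[:space:]]*deviceName[[:space:]]*=[[:space:]]*//p'
}

# Succeed only if a device using the NVIDIA proprietary driver is reported.
has_nvidia_device() {
    grep -q 'DRIVER_ID_NVIDIA_PROPRIETARY'
}
```

Usage would be along the lines of `vulkaninfo --summary | list_vulkan_devices`. Going by the transcript, the first run should list only llvmpipe, and the run after copying the manifest should also list the RTX 3070.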

buzmeg commented Nov 26, 2024

It seems the problem is that nvidia-ctk isn't recognizing the Vulkan ICD and layer files on the host, rather than anything in the container.

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Hmm, the issue seems to be that command. I get:

WARN[0000] Could not locate vulkan/icd.d/nvidia_icd.json: pattern vulkan/icd.d/nvidia_icd.json not found
pattern vulkan/icd.d/nvidia_icd.json not found 
WARN[0000] Could not locate vulkan/icd.d/nvidia_layers.json: pattern vulkan/icd.d/nvidia_layers.json not found
pattern vulkan/icd.d/nvidia_layers.json not found 
foo@fedora:~$ rpm -qa | grep -i nvidia | sort
akmod-nvidia-565.57.01-1.fc41.x86_64
libnvidia-container1-1.17.2-1.x86_64
libnvidia-container1-debuginfo-1.17.2-1.x86_64
libnvidia-container-devel-1.17.2-1.x86_64
libnvidia-container-tools-1.17.2-1.x86_64
libva-nvidia-driver-0.0.13^20241108git259b7b7-2.fc41.x86_64
nvidia-container-runtime-3.14.0-1.noarch
nvidia-container-toolkit-1.17.2-1.x86_64
nvidia-container-toolkit-base-1.17.2-1.x86_64
nvidia-gpu-firmware-20241110-1.fc41.noarch
nvidia-modprobe-565.57.01-1.fc41.x86_64
nvidia-persistenced-565.57.01-1.fc41.x86_64
nvidia-settings-565.57.01-1.fc41.x86_64
nvidia-xconfig-565.57.01-1.fc41.x86_64
xorg-x11-drv-nvidia-565.57.01-3.fc41.x86_64
xorg-x11-drv-nvidia-cuda-565.57.01-3.fc41.x86_64
xorg-x11-drv-nvidia-cuda-libs-565.57.01-3.fc41.x86_64
xorg-x11-drv-nvidia-devel-565.57.01-3.fc41.x86_64
xorg-x11-drv-nvidia-kmodsrc-565.57.01-3.fc41.x86_64
xorg-x11-drv-nvidia-libs-565.57.01-3.fc41.x86_64
xorg-x11-drv-nvidia-power-565.57.01-3.fc41.x86_64
xorg-x11-drv-nvidia-xorg-libs-565.57.01-3.fc41.x86_64
foo@fedora:~$ rpm -qa | grep -i vulkan | sort
mesa-vulkan-drivers-24.2.7-1.fc41.x86_64
vulkan-loader-1.3.296.0-1.fc41.x86_64
vulkan-tools-1.3.296.0-2.fc41.x86_64
foo@fedora:~$ rpm-ostree status -v
State: idle
AutomaticUpdates: disabled
Deployments:
● fedora:fedora/41/x86_64/silverblue (index: 0)
                  Version: 41.20241125.0 (2024-11-25T00:38:31Z)
               BaseCommit: 5e5f81ac9327ab9192f2317c406a4e7014679bac6c7d5de89a32e741a1092725
                           ├─ repo-0 (2024-10-24T13:55:59Z)
                           ├─ repo-1 (2024-11-25T00:16:55Z)
                           └─ repo-2 (2024-11-25T00:21:12Z)
                   Commit: 8b20eb4ced1ef0882cf204f747a024ac29eeac5e6b308223e3eb932169e7ee94
                           ├─ copr:copr.fedorainfracloud.org:phracek:PyCharm (2024-08-12T11:59:47Z)
                           ├─ fedora (2024-10-25T08:41:19Z)
                           ├─ fedora-cisco-openh264 (2024-03-11T19:22:31Z)
                           ├─ google-chrome (2024-11-24T19:58:38Z)
                           ├─ nvidia-container-toolkit (2024-11-15T23:44:44Z)
                           ├─ rpmfusion-free (2024-10-27T07:49:25Z)
                           ├─ rpmfusion-free-updates (2024-11-23T12:56:46Z)
                           ├─ rpmfusion-nonfree (2024-10-27T07:58:23Z)
                           ├─ rpmfusion-nonfree-nvidia-driver (2024-11-23T13:28:40Z)
                           ├─ rpmfusion-nonfree-steam (2024-11-23T13:28:51Z)
                           ├─ rpmfusion-nonfree-updates (2024-11-23T13:18:45Z)
                           ├─ updates (2024-11-25T01:51:23Z)
                           └─ updates-archive (2024-11-25T02:38:28Z)
                   Staged: no
                StateRoot: fedora
             GPGSignature: 1 signature
                           Signature made Sun 24 Nov 2024 06:39:40 PM CST using RSA key ID D0622462E99D6AD1
                           Good signature from "Fedora <[email protected]>"
          LayeredPackages: akmod-nvidia libnvidia-container-devel libnvidia-container1-debuginfo libva-nvidia-driver nvidia-container-runtime nvidia-container-toolkit
                           nvidia-settings nvidia-xconfig vulkan-tools xorg-x11-drv-nvidia xorg-x11-drv-nvidia-cuda xorg-x11-drv-nvidia-devel xorg-x11-drv-nvidia-libs
                           xorg-x11-drv-nvidia-xorg-libs
            LocalPackages: rpmfusion-free-release-41-1.noarch rpmfusion-nonfree-release-41-1.noarch
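
To see at a glance which files `nvidia-ctk` failed to locate, the warnings can be reduced to just the missing patterns. The sketch below assumes the `WARN[...] Could not locate <pattern>: ...` format shown in the output above; `missing_patterns` is a made-up name, not part of the toolkit.

```shell
# Hypothetical helper: read nvidia-ctk log output on stdin and print each
# pattern from a "Could not locate <pattern>:" warning, one per line.
missing_patterns() {
    sed -n 's/^WARN\[[0-9]*] Could not locate \([^:]*\):.*/\1/p'
}
```

It could be used as `sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml 2>&1 | missing_patterns`, which merges both output streams so the warnings are captured wherever they are written.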

Member

elezar commented Nov 26, 2024

This may be related to #767

Could you confirm the locations of icd.d/nvidia*.json on your system?


buzmeg commented Nov 26, 2024

$ sudo find / -name "nvidia*.json" -print 2>/dev/null | grep -v /var/home
/sysroot/ostree/deploy/fedora/deploy/d2f41aed87b66324b4221827a28cc19be966e25c0481801927b1f84907e7cb13.0/usr/share/vulkan/icd.d/nvidia_icd.x86_64.json
/sysroot/ostree/deploy/fedora/deploy/d2f41aed87b66324b4221827a28cc19be966e25c0481801927b1f84907e7cb13.0/usr/share/vulkan/implicit_layer.d/nvidia_layers.json
/sysroot/ostree/deploy/fedora/deploy/d2f41aed87b66324b4221827a28cc19be966e25c0481801927b1f84907e7cb13.0/usr/share/vulkansc/icd.d/nvidia_icd_vksc.x86_64.json
/usr/share/vulkan/icd.d/nvidia_icd.x86_64.json
/usr/share/vulkan/implicit_layer.d/nvidia_layers.json
/usr/share/vulkansc/icd.d/nvidia_icd_vksc.x86_64.json
/var/lib/flatpak/runtime/org.freedesktop.Platform.GL.nvidia-565-57-01/x86_64/1.4/5e4d40cab2da58e0a8348232fa463aacfc593f469cb390482aca8931012d23fb/files/extra/vulkansc/icd.d/nvidia_icd_vksc.json
/var/lib/flatpak/runtime/org.freedesktop.Platform.GL.nvidia-565-57-01/x86_64/1.4/5e4d40cab2da58e0a8348232fa463aacfc593f469cb390482aca8931012d23fb/files/extra/vulkan/icd.d/nvidia_icd.json
/var/lib/flatpak/runtime/org.freedesktop.Platform.GL.nvidia-565-57-01/x86_64/1.4/5e4d40cab2da58e0a8348232fa463aacfc593f469cb390482aca8931012d23fb/files/extra/vulkan/implicit_layer.d/nvidia_layers.json
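
Until the toolkit handles this, the manual copy from the first comment could be scripted. This is a sketch under assumptions, not a tested fix: `install_icd` is a made-up helper, and it assumes the host's manifest is reachable from inside the toolbox (Toolbx normally exposes the host filesystem under `/run/host`) and that it is usable unmodified inside the container.

```shell
# Hypothetical helper: copy an existing ICD manifest (SRC) into DEST_DIR
# under the name nvidia_icd.json, the file reported missing in this issue.
install_icd() {
    icd_src=$1
    icd_dest_dir=$2
    if [ ! -f "$icd_src" ]; then
        echo "no manifest at $icd_src" >&2
        return 1
    fi
    mkdir -p "$icd_dest_dir" && cp "$icd_src" "$icd_dest_dir/nvidia_icd.json"
}
```

Inside the toolbox this might be called as `install_icd /run/host/usr/share/vulkan/icd.d/nvidia_icd.x86_64.json /etc/vulkan/icd.d` from a root shell, since writing under `/etc` needs root.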

Member

elezar commented Nov 29, 2024

This is the same issue as #767, where the file /usr/share/vulkan/icd.d/nvidia_icd.x86_64.json is not detected on the system. From my internal investigation, it seems this file may be created by the Fedora-specific driver build.
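
The mismatch is between the name in the warnings (`nvidia_icd.json`) and the architecture-suffixed name present on this host (`nvidia_icd.x86_64.json`). As an illustration only, and not how `nvidia-ctk` performs its lookup, a search that accepts either name could look like the sketch below. `find_nvidia_icd` is a made-up name, and the prefixes to search are passed in by the caller.

```shell
# Hypothetical helper: for each PREFIX given, print every manifest matching
# PREFIX/vulkan/icd.d/nvidia_icd*.json (with or without an arch suffix).
find_nvidia_icd() {
    for icd_prefix in "$@"; do
        for icd_file in "$icd_prefix"/vulkan/icd.d/nvidia_icd*.json; do
            # An unmatched glob is left as a literal string, so check existence.
            if [ -e "$icd_file" ]; then
                printf '%s\n' "$icd_file"
            fi
        done
    done
    return 0
}
```

A call such as `find_nvidia_icd /etc /usr/share` covers the two prefixes that appear in this thread. Judging from the `find` output above, it would be expected to report the `/usr/share` manifest on the affected host.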

Member

elezar commented Nov 29, 2024

See my response in #767 (comment).
