RKE2: [pre-installed drivers+container-toolkit] error creating symlinks #569

DevKyleS opened this issue Aug 17, 2023 · 14 comments

DevKyleS commented Aug 17, 2023

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.0/getting-started.html

1. Quick Debug Information

  • Deployment type: bare metal
  • OS/Version: Ubuntu 22.04.3 LTS
  • Container Runtime Type/Version: containerd
  • K8s Flavor/Version: Rancher RKE2 v1.25.12+rke2r1
  • GPU Operator Version: gpu-operator-v23.6.0

2. Issue or feature description

When deploying the gpu-operator on RKE2 with pre-installed drivers and the NVIDIA Container Toolkit, the nvidia-operator-validator pod fails with the error "Error: error validating driver installation: error creating symlinks".

level=info msg="Error: error validating driver installation: error creating symlinks: failed to get device nodes: failed to get GPU information: error getting all NVIDIA devices: error constructing NVIDIA PCI device 0000:01:00.1: unable to get device name: failed to find device with id '10fa'\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""
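
For reference, the lookup that fails here is for PCI function 0000:01:00.1 with device id 10fa, which on a GeForce card is most likely the GPU's companion HDMI/DP audio controller rather than the GPU itself; the validator simply cannot resolve a name for that id. A quick host-side check (a suggested diagnostic, not part of the original report):

$ lspci -nnk -s 01:00.1    # show vendor/device IDs and the bound driver for the failing function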

3. Steps to reproduce the issue

$ sudo apt-get install -y nvidia-driver-535-server nvidia-container-toolkit
$ sudo shutdown -r now
$ helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
	--set driver.enabled=false \
	--set toolkit.enabled=false \
	--set toolkit.env[0].name=CONTAINERD_CONFIG \
	--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
	--set toolkit.env[1].name=CONTAINERD_SOCKET \
	--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
	--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
	--set toolkit.env[2].value=nvidia \
	--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
	--set-string toolkit.env[3].value=true \
	--set psp.enabled=true
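
After the install, the failure shows up on the validator pod. A couple of commands to see it (a sketch; the label selector and init-container name match the daemonset description attached below):

$ kubectl get pods -n gpu-operator
$ kubectl logs -n gpu-operator -l app=nvidia-operator-validator -c driver-validation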

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n gpu-operator
  • kubernetes daemonset status: kubectl get ds -n gpu-operator
  • If a pod/ds is in an error state or pending state kubectl describe pod -n gpu-operator POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n gpu-operator POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n gpu-operator -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log
spectrum@spectrum:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy
spectrum@spectrum:~$ apt list --installed | grep nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-cfg1-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-common-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 all [installed,automatic]
libnvidia-compute-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-container-tools/bionic,bionic,now 1.13.5-1 amd64 [installed,automatic]
libnvidia-container1/bionic,bionic,now 1.13.5-1 amd64 [installed,automatic]
libnvidia-decode-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-encode-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-extra-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-fbc1-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-gl-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-compute-utils-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-container-toolkit-base/bionic,bionic,now 1.13.5-1 amd64 [installed,automatic]
nvidia-container-toolkit/bionic,bionic,now 1.13.5-1 amd64 [installed]
nvidia-dkms-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-driver-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed]
nvidia-firmware-535-server-535.54.03/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-kernel-common-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-kernel-source-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-utils-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
xserver-xorg-video-nvidia-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
spectrum@spectrum:~$ cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
spectrum@spectrum:~$ nvidia-container-cli info
NVRM version:   535.54.03
CUDA version:   12.2

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce GTX 1650
Brand:          GeForce
GPU UUID:       GPU-648ac414-633e-cf39-d315-eabd271dfad1
Bus Location:   00000000:01:00.0
Architecture:   7.5
spectrum@spectrum:~$ kubectl logs -n gpu-operator -p nvidia-operator-validator-j8kvt --all-containers=true
time="2023-08-17T05:24:02Z" level=info msg="Detected pre-installed driver on the host"
running command chroot with args [/host nvidia-smi]
Wed Aug 16 23:24:03 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650        Off | 00000000:01:00.0 Off |                  N/A |
| 30%   35C    P0              11W /  75W |      0MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
time="2023-08-17T05:24:03Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidiactl already exists"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidia-modeset already exists"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidia-uvm already exists"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidia-uvm-tools already exists"
time="2023-08-17T05:24:03Z" level=info msg="Error: error validating driver installation: error creating symlinks: failed to get device nodes: failed to get GPU information: error getting all NVIDIA devices: error constructing NVIDIA PCI device 0000:01:00.1: unable to get device name: failed to find device with id '10fa'\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""
Error from server (BadRequest): previous terminated container "toolkit-validation" in pod "nvidia-operator-validator-j8kvt" not found

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

# tree /run/nvidia
/run/nvidia
├── driver
└── validations

2 directories, 0 files
spectrum@spectrum:/tmp/nvidia-gpu-operator_20230816_2329 $ cat gpu_operand_ds_nvidia-operator-validator.descr
Name:           nvidia-operator-validator
Selector:       app=nvidia-operator-validator,app.kubernetes.io/part-of=gpu-operator
Node-Selector:  nvidia.com/gpu.deploy.operator-validator=true
Labels:         app=nvidia-operator-validator
                app.kubernetes.io/managed-by=gpu-operator
                app.kubernetes.io/part-of=gpu-operator
                helm.sh/chart=gpu-operator-v23.6.0
Annotations:    deprecated.daemonset.template.generation: 1
                nvidia.com/last-applied-hash: fa2bb82bef132a9a
Desired Number of Nodes Scheduled: 1
Current Number of Nodes Scheduled: 1
Number of Nodes Scheduled with Up-to-date Pods: 1
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  0 Running / 1 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=nvidia-operator-validator
                    app.kubernetes.io/managed-by=gpu-operator
                    app.kubernetes.io/part-of=gpu-operator
                    helm.sh/chart=gpu-operator-v23.6.0
  Service Account:  nvidia-operator-validator
  Init Containers:
   driver-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
   toolkit-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
      WITH_WAIT:               false
      COMPONENT:               toolkit
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
   cuda-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    Environment:
      WITH_WAIT:                    false
      COMPONENT:                    cuda
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:            (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
   plugin-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    Environment:
      COMPONENT:                    plugin
      WITH_WAIT:                    false
      WITH_WORKLOAD:                false
      MIG_STRATEGY:                 single
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:            (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
  Containers:
   nvidia-operator-validator:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      echo all validations are successful; sleep infinity
    Environment:  <none>
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
  Volumes:
   run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
   driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:
   host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
   host-dev-char:
    Type:               HostPath (bare host directory volume)
    Path:               /dev/char
    HostPathType:
  Priority Class Name:  system-node-critical
Events:
  Type    Reason            Age   From                  Message
  ----    ------            ----  ----                  -------
  Normal  SuccessfulCreate  11m   daemonset-controller  Created pod: nvidia-operator-validator-j8kvt
spectrum@spectrum:~$ sudo nvidia-ctk system create-dev-char-symlinks
INFO[0000] Creating link /dev/char/195:254 => /dev/nvidia-modeset
WARN[0000] Could not create symlink: symlink /dev/nvidia-modeset /dev/char/195:254: file exists
INFO[0000] Creating link /dev/char/507:0 => /dev/nvidia-uvm
WARN[0000] Could not create symlink: symlink /dev/nvidia-uvm /dev/char/507:0: file exists
INFO[0000] Creating link /dev/char/507:1 => /dev/nvidia-uvm-tools
WARN[0000] Could not create symlink: symlink /dev/nvidia-uvm-tools /dev/char/507:1: file exists
INFO[0000] Creating link /dev/char/195:0 => /dev/nvidia0
WARN[0000] Could not create symlink: symlink /dev/nvidia0 /dev/char/195:0: file exists
INFO[0000] Creating link /dev/char/195:255 => /dev/nvidiactl
WARN[0000] Could not create symlink: symlink /dev/nvidiactl /dev/char/195:255: file exists
INFO[0000] Creating link /dev/char/511:1 => /dev/nvidia-caps/nvidia-cap1
WARN[0000] Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap1 /dev/char/511:1: file exists
INFO[0000] Creating link /dev/char/511:2 => /dev/nvidia-caps/nvidia-cap2
WARN[0000] Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap2 /dev/char/511:2: file exists
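
(The warnings above just mean the links already exist. As a suggested follow-up check, not something run in the report, listing /dev/char confirms they point at the NVIDIA device nodes:)

$ ls -l /dev/char | grep nvidia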

elezar commented Aug 17, 2023

Hi @DevKyleS. We are aware of this issue. For the time being, please update the cluster policy and add:

      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"

to the validator.driver.env.
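
One way to apply this without re-running helm is to patch the ClusterPolicy directly (a sketch; "cluster-policy" is the default resource name created by the chart):

$ kubectl patch clusterpolicies.nvidia.com cluster-policy --type merge \
    -p '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'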

cc @cdesiniotis

@DevKyleS (Author)

I finally got it working after removing /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl (by renaming it).
I'll dig more into the differences later.

$ mv config.toml.tmpl config.toml.tmpl-nvidia
$ sudo service containerd restart
$ sudo service rke2-server restart
$ helm uninstall gpu-operator -n gpu-operator
$ helm install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator \
	--set driver.enabled=false \
	--set toolkit.enabled=false \
	--set toolkit.env[0].name=CONTAINERD_CONFIG \
	--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
	--set toolkit.env[1].name=CONTAINERD_SOCKET \
	--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
	--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
	--set toolkit.env[2].value=nvidia \
	--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
	--set-string toolkit.env[3].value=true \
	--set psp.enabled=true \
	--set validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION \
	--set-string validator.driver.env[0].value=true

Looks like the original RKE2 containerd config differs from what I had placed in the .tmpl file per the guidance at https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.0/getting-started.html#bare-metal-passthrough-with-pre-installed-drivers-and-nvidia-container-toolkit

But it still works...

$ sudo ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04 \
    cuda-22.2.0-base-ubuntu22.04 nvidia-smi
Thu Aug 17 17:36:54 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650        Off | 00000000:01:00.0 Off |                  N/A |
| 34%   36C    P8               7W /  75W |      1MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@spectrum:/var/lib/rancher/rke2/agent/etc/containerd# cat config.toml

# File generated by rke2. DO NOT EDIT. Use config.toml.tmpl instead.
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/rke2/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "index.docker.io/rancher/pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
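
A possible way to avoid losing these rke2 defaults (a suggestion, not something done in this thread) is to seed the template from the generated file, so the runc/SystemdCgroup sections are preserved and only the nvidia runtime gets added on top:

$ cd /var/lib/rancher/rke2/agent/etc/containerd
$ sudo cp config.toml config.toml.tmpl    # start the template from rke2's own generated config
$ sudo systemctl restart rke2-server      # rke2-agent on worker-only nodes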

@DevKyleS (Author)

Still investigating...
It looks like I somehow have multiple versions of nvidia-container-runtime installed. This doesn't appear to be fully working yet, but the node and its containers can at least start now (they couldn't before).


DevKyleS commented Sep 7, 2023

After upgrading to v23.6.1, I'm no longer able to reproduce this issue.

After reading https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.1/release-notes.html#fixed-issues I attempted this with the new version. My issue has been resolved.

@elezar I think this can be closed as it appears the v23.6.1 release has fixed this problem.

@cmontemuino

I can still reproduce this issue with version v23.9.0.

In our case the NVIDIA drivers come pre-installed, and I can see the /dev/nvidia* devices.


armaneshaghi commented Dec 7, 2023

> Hi @DevKyleS. We are aware of this issue. For the time being, please update the cluster policy and add:
>
>       - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
>         value: "true"
>
> to the validator.driver.env.
>
> cc @cdesiniotis

I tried this in my cluster policy and restarted the cluster, but I still get the same error. I am using version 23.9.0.

@armaneshaghi

> I can still reproduce this issue with version v23.9.0.
>
> In our case the NVIDIA drivers come pre-installed, and I can see the /dev/nvidia* devices.

Did you find a workaround?

@cmontemuino

>   - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
>     value: "true"

This is what we have in our values.yaml:

validator:
  driver:
    env:
      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"
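
Applied, for example, with (a sketch; release and namespace names follow the ones used earlier in this thread):

$ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
    --reuse-values -f values.yaml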


tinklern commented Dec 8, 2023

I hit this today on v23.9.1.

Adding DISABLE_DEV_CHAR_SYMLINK_CREATION resolved it in my case.

That said, the release notes say this should have been fixed in 23.6.1: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.9.1/release-notes.html#id8

And I am definitely still seeing it in 23.9.1.

Mostly consumer GPUs (RTX 2080s) on my nodes.


danieljkemp commented May 8, 2024

Just encountered this with a Tesla P4 on v23.9.1/rke2 v1.29


CoderTH commented May 13, 2024

We also encountered the same problem in v23.9.0. I manually set the DISABLE_DEV_CHAR_SYMLINK_CREATION parameter as prompted, and the container-toolkit works normally. However, the toolkit-validation check in nvidia-operator-validator still fails, and the following error message is displayed:

(screenshot of the error message)

gpu-operator version
(screenshot)

libnvidia-ml.so host path

(screenshot)

Inside the validator container (via exec), the library is under:

/usr/lib64/libnvidia-ml.so
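
(A possible host-side cross-check, not from the thread itself: confirm the driver's NVML library is also registered in the host loader cache.)

$ ldconfig -p | grep libnvidia-ml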

@riteshsonawane1372

Same issue with v23.9.1 on an RTX 3090. @elezar

@riteshsonawane1372

Also, setting DISABLE_DEV_CHAR_SYMLINK_CREATION to "true" doesn't work.


jpi-seb commented Dec 9, 2024

I had the same problem: the nvidia-operator-validator pod showed correct nvidia-smi output but logged lots of warnings about symlink creation. Moreover, I was not able to deploy a working GPU-enabled pod after the NVIDIA gpu-operator deployment, even though the Helm chart itself deployed without any problem.

The problem was apparently linked to a wrong default runtime class name used in the pod declaration (see the discussion in k3s-io/k3s#9231). I added the explicit declaration runtimeClassName: nvidia to the pod spec, and now it works without any problem.

This problem seems to be linked to K3s environments, but we experienced it in our Kubernetes environment deployed with RKE2.

Here is a full YAML configuration to test the pod deployment:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-version-check
spec:
  # Explicitly declare Nvidia runtime class: required by K3S (and by RKE2 ?)
  runtimeClassName: nvidia
  # ---
  restartPolicy: OnFailure
  containers:
  - name: nvidia-version-check
    image: "nvidia/cuda:12.5.0-base-ubuntu22.04"
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
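
To try it (a sketch; the file name is just an example):

$ kubectl apply -f nvidia-version-check.yaml
$ kubectl logs pod/nvidia-version-check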
