RKE2: [pre-installed drivers+container-toolkit] error creating symlinks #569

DevKyleS opened this issue Aug 17, 2023 · 14 comments

DevKyleS commented Aug 17, 2023

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.0/getting-started.html

1. Quick Debug Information

  • Deployment type: bare metal
  • OS/Version: Ubuntu 22.04.3 LTS
  • Container Runtime Type/Version: containerd
  • K8s Flavor/Version: Rancher RKE2 v1.25.12+rke2r1
  • GPU Operator Version: gpu-operator-v23.6.0

2. Issue or feature description

When deploying the gpu-operator on RKE2 with pre-installed drivers and the NVIDIA Container Toolkit, the nvidia-operator-validator pod fails with the error "Error: error validating driver installation: error creating symlinks".

level=info msg="Error: error validating driver installation: error creating symlinks: failed to get device nodes: failed to get GPU information: error getting all NVIDIA devices: error constructing NVIDIA PCI device 0000:01:00.1: unable to get device name: failed to find device with id '10fa'\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""
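
For reference, the lookup that fails here is for PCI function 0000:01:00.1 with device id 10fa, which on a GeForce card is most likely the GPU's companion HDMI/DP audio controller rather than the GPU itself; the validator simply cannot resolve a name for that id. A quick host-side check (a suggested diagnostic, not part of the original report):

$ lspci -nnk -s 01:00.1    # show vendor/device IDs and the bound driver for the failing function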

3. Steps to reproduce the issue

$ sudo apt-get install -y nvidia-driver-535-server nvidia-container-toolkit
$ sudo shutdown -r now
$ helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
	--set driver.enabled=false \
	--set toolkit.enabled=false \
	--set toolkit.env[0].name=CONTAINERD_CONFIG \
	--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
	--set toolkit.env[1].name=CONTAINERD_SOCKET \
	--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
	--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
	--set toolkit.env[2].value=nvidia \
	--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
	--set-string toolkit.env[3].value=true \
	--set psp.enabled=true
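
After the install, the failure shows up on the validator pod. A couple of commands to see it (a sketch; the label selector and init-container name match the daemonset description attached below):

$ kubectl get pods -n gpu-operator
$ kubectl logs -n gpu-operator -l app=nvidia-operator-validator -c driver-validation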

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n gpu-operator
  • kubernetes daemonset status: kubectl get ds -n gpu-operator
  • If a pod/ds is in an error state or pending state kubectl describe pod -n gpu-operator POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n gpu-operator POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n gpu-operator -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log
spectrum@spectrum:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy
spectrum@spectrum:~$ apt list --installed | grep nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-cfg1-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-common-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 all [installed,automatic]
libnvidia-compute-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-container-tools/bionic,bionic,now 1.13.5-1 amd64 [installed,automatic]
libnvidia-container1/bionic,bionic,now 1.13.5-1 amd64 [installed,automatic]
libnvidia-decode-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-encode-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-extra-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-fbc1-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-gl-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-compute-utils-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-container-toolkit-base/bionic,bionic,now 1.13.5-1 amd64 [installed,automatic]
nvidia-container-toolkit/bionic,bionic,now 1.13.5-1 amd64 [installed]
nvidia-dkms-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-driver-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed]
nvidia-firmware-535-server-535.54.03/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-kernel-common-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-kernel-source-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-utils-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
xserver-xorg-video-nvidia-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
spectrum@spectrum:~$ cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
spectrum@spectrum:~$ nvidia-container-cli info
NVRM version:   535.54.03
CUDA version:   12.2

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce GTX 1650
Brand:          GeForce
GPU UUID:       GPU-648ac414-633e-cf39-d315-eabd271dfad1
Bus Location:   00000000:01:00.0
Architecture:   7.5
spectrum@spectrum:~$ kubectl logs -n gpu-operator -p nvidia-operator-validator-j8kvt --all-containers=true
time="2023-08-17T05:24:02Z" level=info msg="Detected pre-installed driver on the host"
running command chroot with args [/host nvidia-smi]
Wed Aug 16 23:24:03 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650        Off | 00000000:01:00.0 Off |                  N/A |
| 30%   35C    P0              11W /  75W |      0MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
time="2023-08-17T05:24:03Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidiactl already exists"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidia-modeset already exists"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidia-uvm already exists"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidia-uvm-tools already exists"
time="2023-08-17T05:24:03Z" level=info msg="Error: error validating driver installation: error creating symlinks: failed to get device nodes: failed to get GPU information: error getting all NVIDIA devices: error constructing NVIDIA PCI device 0000:01:00.1: unable to get device name: failed to find device with id '10fa'\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""
Error from server (BadRequest): previous terminated container "toolkit-validation" in pod "nvidia-operator-validator-j8kvt" not found

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

# tree /run/nvidia
/run/nvidia
├── driver
└── validations

2 directories, 0 files
spectrum@spectrum:/tmp/nvidia-gpu-operator_20230816_2329 $ cat gpu_operand_ds_nvidia-operator-validator.descr
Name:           nvidia-operator-validator
Selector:       app=nvidia-operator-validator,app.kubernetes.io/part-of=gpu-operator
Node-Selector:  nvidia.com/gpu.deploy.operator-validator=true
Labels:         app=nvidia-operator-validator
                app.kubernetes.io/managed-by=gpu-operator
                app.kubernetes.io/part-of=gpu-operator
                helm.sh/chart=gpu-operator-v23.6.0
Annotations:    deprecated.daemonset.template.generation: 1
                nvidia.com/last-applied-hash: fa2bb82bef132a9a
Desired Number of Nodes Scheduled: 1
Current Number of Nodes Scheduled: 1
Number of Nodes Scheduled with Up-to-date Pods: 1
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  0 Running / 1 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=nvidia-operator-validator
                    app.kubernetes.io/managed-by=gpu-operator
                    app.kubernetes.io/part-of=gpu-operator
                    helm.sh/chart=gpu-operator-v23.6.0
  Service Account:  nvidia-operator-validator
  Init Containers:
   driver-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
   toolkit-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
      WITH_WAIT:               false
      COMPONENT:               toolkit
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
   cuda-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    Environment:
      WITH_WAIT:                    false
      COMPONENT:                    cuda
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:            (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
   plugin-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    Environment:
      COMPONENT:                    plugin
      WITH_WAIT:                    false
      WITH_WORKLOAD:                false
      MIG_STRATEGY:                 single
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:            (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
  Containers:
   nvidia-operator-validator:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      echo all validations are successful; sleep infinity
    Environment:  <none>
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
  Volumes:
   run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
   driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:
   host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
   host-dev-char:
    Type:               HostPath (bare host directory volume)
    Path:               /dev/char
    HostPathType:
  Priority Class Name:  system-node-critical
Events:
  Type    Reason            Age   From                  Message
  ----    ------            ----  ----                  -------
  Normal  SuccessfulCreate  11m   daemonset-controller  Created pod: nvidia-operator-validator-j8kvt
spectrum@spectrum:~$ sudo nvidia-ctk system create-dev-char-symlinks
INFO[0000] Creating link /dev/char/195:254 => /dev/nvidia-modeset
WARN[0000] Could not create symlink: symlink /dev/nvidia-modeset /dev/char/195:254: file exists
INFO[0000] Creating link /dev/char/507:0 => /dev/nvidia-uvm
WARN[0000] Could not create symlink: symlink /dev/nvidia-uvm /dev/char/507:0: file exists
INFO[0000] Creating link /dev/char/507:1 => /dev/nvidia-uvm-tools
WARN[0000] Could not create symlink: symlink /dev/nvidia-uvm-tools /dev/char/507:1: file exists
INFO[0000] Creating link /dev/char/195:0 => /dev/nvidia0
WARN[0000] Could not create symlink: symlink /dev/nvidia0 /dev/char/195:0: file exists
INFO[0000] Creating link /dev/char/195:255 => /dev/nvidiactl
WARN[0000] Could not create symlink: symlink /dev/nvidiactl /dev/char/195:255: file exists
INFO[0000] Creating link /dev/char/511:1 => /dev/nvidia-caps/nvidia-cap1
WARN[0000] Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap1 /dev/char/511:1: file exists
INFO[0000] Creating link /dev/char/511:2 => /dev/nvidia-caps/nvidia-cap2
WARN[0000] Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap2 /dev/char/511:2: file exists
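
(The warnings above just mean the links already exist. As a suggested follow-up check, not something run in the report, listing /dev/char confirms they point at the NVIDIA device nodes:)

$ ls -l /dev/char | grep nvidia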

elezar commented Aug 17, 2023

Hi @DevKyleS. We are aware of this issue. For the time being, please update the cluster policy and add:

      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"

to the validator.driver.env.
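
One way to apply this without re-running helm is to patch the ClusterPolicy directly (a sketch; "cluster-policy" is the default resource name created by the chart):

$ kubectl patch clusterpolicies.nvidia.com cluster-policy --type merge \
    -p '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'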

cc @cdesiniotis

@DevKyleS (Author)

I finally got it working after removing /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl (by renaming it).
I'll dig more into the differences later.

$ mv config.toml.tmpl config.toml.tmpl-nvidia
$ sudo service containerd restart
$ sudo service rke2-server restart
$ helm uninstall gpu-operator -n gpu-operator
$ helm install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator \
	--set driver.enabled=false \
	--set toolkit.enabled=false \
	--set toolkit.env[0].name=CONTAINERD_CONFIG \
	--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
	--set toolkit.env[1].name=CONTAINERD_SOCKET \
	--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
	--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
	--set toolkit.env[2].value=nvidia \
	--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
	--set-string toolkit.env[3].value=true \
	--set psp.enabled=true \
	--set validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION \
	--set-string validator.driver.env[0].value=true

Looks like the original RKE2 containerd config differs from what I had placed in the .tmpl file per the guidance at https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.0/getting-started.html#bare-metal-passthrough-with-pre-installed-drivers-and-nvidia-container-toolkit

But it still works...

$ sudo ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04 \
    cuda-22.2.0-base-ubuntu22.04 nvidia-smi
Thu Aug 17 17:36:54 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650        Off | 00000000:01:00.0 Off |                  N/A |
| 34%   36C    P8               7W /  75W |      1MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@spectrum:/var/lib/rancher/rke2/agent/etc/containerd# cat config.toml

# File generated by rke2. DO NOT EDIT. Use config.toml.tmpl instead.
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/rke2/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "index.docker.io/rancher/pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
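
A possible way to avoid losing these rke2 defaults (a suggestion, not something done in this thread) is to seed the template from the generated file, so the runc/SystemdCgroup sections are preserved and only the nvidia runtime gets added on top:

$ cd /var/lib/rancher/rke2/agent/etc/containerd
$ sudo cp config.toml config.toml.tmpl    # start the template from rke2's own generated config
$ sudo systemctl restart rke2-server      # rke2-agent on worker-only nodes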

@DevKyleS (Author)

Still investigating...
It looks like I somehow have multiple versions of nvidia-container-runtime installed. This doesn't appear to be fully working yet, but the node and its containers can at least start now (they couldn't before).


DevKyleS commented Sep 7, 2023

After upgrading to v23.6.1, I'm no longer able to reproduce this issue.

After reading https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.1/release-notes.html#fixed-issues I attempted this with the new version. My issue has been resolved.

@elezar I think this can be closed as it appears the v23.6.1 release has fixed this problem.

@cmontemuino

I can still reproduce this issue with version v23.9.0.

In our case the NVIDIA drivers come pre-installed, and I can see the /dev/nvidia* devices.


armaneshaghi commented Dec 7, 2023

> Hi @DevKyleS. We are aware of this issue. For the time being, please update the cluster policy and add:
>
>       - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
>         value: "true"
>
> to the validator.driver.env.
>
> cc @cdesiniotis

I tried this in my cluster policy and restarted the cluster, but I still get the same error. I am using version 23.9.0.

@armaneshaghi

> I can still reproduce this issue with version v23.9.0.
>
> In our case the NVIDIA drivers come pre-installed, and I can see the /dev/nvidia* devices.

Did you find a workaround?

@cmontemuino

>   - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
>     value: "true"

This is what we have in our values.yaml:

validator:
  driver:
    env:
      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"
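
Applied, for example, with (a sketch; release and namespace names follow the ones used earlier in this thread):

$ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
    --reuse-values -f values.yaml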


tinklern commented Dec 8, 2023

I hit this today on v23.9.1.

Adding DISABLE_DEV_CHAR_SYMLINK_CREATION resolved it in my case.

That said, the release notes say this should have been fixed in 23.6.1: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.9.1/release-notes.html#id8

And I am definitely still seeing it in 23.9.1.

Mostly consumer GPUs (RTX 2080s) on my nodes.


danieljkemp commented May 8, 2024

Just encountered this with a Tesla P4 on v23.9.1/rke2 v1.29


CoderTH commented May 13, 2024

We also encountered the same problem in v23.9.0. I manually set the DISABLE_DEV_CHAR_SYMLINK_CREATION parameter as prompted, and the container-toolkit works normally. However, the toolkit-validation check in nvidia-operator-validator still fails, and the following error message is displayed:

(screenshot of the error message)

gpu-operator version
(screenshot)

libnvidia-ml.so host path

(screenshot)

Inside the validator container (via exec), the library is under:

/usr/lib64/libnvidia-ml.so
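
(A possible host-side cross-check, not from the thread itself: confirm the driver's NVML library is also registered in the host loader cache.)

$ ldconfig -p | grep libnvidia-ml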

@riteshsonawane1372

Same issue with v23.9.1 on an RTX 3090. @elezar

@riteshsonawane1372

Also, setting DISABLE_DEV_CHAR_SYMLINK_CREATION to "true" doesn't work.


jpi-seb commented Dec 9, 2024

I had the same problem: the nvidia-operator-validator pod showed correct nvidia-smi output but logged lots of warnings about symlink creation. Moreover, I was not able to deploy a working GPU-enabled pod after the NVIDIA gpu-operator deployment, even though the Helm chart itself deployed without any problem.

The problem was apparently linked to a wrong default runtime class name used in the pod declaration (see the discussion in k3s-io/k3s#9231). I added the explicit declaration runtimeClassName: nvidia to the pod spec, and now it works without any problem.

This problem seems to be linked to K3s environments, but we experienced it in our Kubernetes environment deployed with RKE2.

Here is a full YAML configuration to test the pod deployment:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-version-check
spec:
  # Explicitly declare Nvidia runtime class: required by K3S (and by RKE2 ?)
  runtimeClassName: nvidia
  # ---
  restartPolicy: OnFailure
  containers:
  - name: nvidia-version-check
    image: "nvidia/cuda:12.5.0-base-ubuntu22.04"
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
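
To try it (a sketch; the file name is just an example):

$ kubectl apply -f nvidia-version-check.yaml
$ kubectl logs pod/nvidia-version-check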
