RKE2: [pre-installed drivers+container-toolkit] error creating symlinks #569
Comments
Hi @DevKyleS. We are aware of this issue. For the time being, please update the cluster policy and add the DISABLE_DEV_CHAR_SYMLINK_CREATION environment variable to the validator's driver env (see the snippet below).

cc @cdesiniotis
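A minimal sketch of that ClusterPolicy override, mirroring the validator.driver.env structure quoted in the error message and in the comments later in this thread:

validator:
  driver:
    env:
    - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
      value: "true"

On a running deployment the same setting could presumably be applied in place with a merge patch; the instance name cluster-policy below is an assumption (the usual default), adjust it to your deployment:

# "cluster-policy" is assumed to be the ClusterPolicy instance name
$ kubectl patch clusterpolicies.nvidia.com cluster-policy --type merge \
    -p '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'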
I finally got it working after moving my custom containerd template out of the way, restarting the services, and reinstalling the chart:

$ mv config.toml.tmpl config.toml.tmpl-nvidia
$ sudo service containerd restart
$ sudo service rke2-server restart
$ helm uninstall gpu-operator -n gpu-operator
$ helm install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value=true \
--set psp.enabled=true \
--set validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION \
--set-string validator.driver.env[0].value=true

Looks like the original rke2 containerd config differs from what I had placed in the tmpl file per guidance on https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.0/getting-started.html#bare-metal-passthrough-with-pre-installed-drivers-and-nvidia-container-toolkit But it still works...

$ sudo ctr run --rm -t --runc-binary=/usr/bin/nvidia-container-runtime --env NVIDIA_VISIBLE_DEVICES=all docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04 cuda-22.2.0-base-ubuntu22.04 nvidia-smi
Thu Aug 17 17:36:54 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1650 Off | 00000000:01:00.0 Off | N/A |
| 34% 36C P8 7W / 75W | 1MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
root@spectrum:/var/lib/rancher/rke2/agent/etc/containerd# cat config.toml
# File generated by rke2. DO NOT EDIT. Use config.toml.tmpl instead.
version = 2
[plugins."io.containerd.internal.v1.opt"]
path = "/var/lib/rancher/rke2/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
stream_server_address = "127.0.0.1"
stream_server_port = "10010"
enable_selinux = false
enable_unprivileged_ports = true
enable_unprivileged_icmp = true
sandbox_image = "index.docker.io/rancher/pause:3.6"
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
disable_snapshot_annotations = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime" |
After upgrading to v23.6.1, I'm no longer able to reproduce this issue. After reading https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.1/release-notes.html#fixed-issues I attempted this with the new version, and my issue has been resolved.

@elezar I think this can be closed, as it appears the v23.6.1 release has fixed this problem.
I can still reproduce this issue with version v23.9.0. In our case the NVIDIA drivers come pre-installed, and I can see the devices on the node.
I tried this on my cluster policy and restarted the cluster but still get the same error. I am using version 23.9.0.
Did you find a workaround?
This is what we have in our validator:

driver:
  env:
  - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
    value: "true"
I hit this today on v23.9.1. Adding DISABLE_DEV_CHAR_SYMLINK_CREATION works around it.

That said, the release notes say this should have been fixed in 23.6.1: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.9.1/release-notes.html#id8

And I am definitely still seeing it in 23.9.1. Mostly consumer GPUs (RTX2080s) on my nodes.
Just encountered this with a Tesla P4 on v23.9.1 / rke2 v1.29.
Same issue with v23.9.1 on an RTX 3090. @elezar
Also
I had the same problem. It was apparently linked to a wrong default runtime class name used in the pod declaration (see the discussion here: k3s-io/k3s#9231). I added the explicit declaration runtimeClassName: nvidia to the pod spec, which fixed it. This problem seems to be linked to K3S environments, but we experienced it in our Kubernetes environment deployed with RKE2.

Here is a full YAML configuration to test the pod deployment:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-version-check
spec:
  # Explicitly declare Nvidia runtime class: required by K3S (and by RKE2 ?)
  runtimeClassName: nvidia
  # ---
  restartPolicy: OnFailure
  containers:
  - name: nvidia-version-check
    image: "nvidia/cuda:12.5.0-base-ubuntu22.04"
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
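For completeness, the nvidia RuntimeClass that runtimeClassName refers to is normally created by the operator/toolkit (CONTAINERD_RUNTIME_CLASS=nvidia above); if it is missing on your cluster, a minimal definition would look like the sketch below, assuming the handler name matches the containerd runtime shown earlier in this thread:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
# handler must match the runtime registered in containerd
# ([plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"])
handler: nvidia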
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.0/getting-started.html
1. Quick Debug Information
2. Issue or feature description
When deploying the gpu-operator, the nvidia-operator-validator container fails on RKE2 with pre-installed drivers and container toolkit with the error:

Error: error validating driver installation: error creating symlinks

The full validator log message is:

level=info msg="Error: error validating driver installation: error creating symlinks: failed to get device nodes: failed to get GPU information: error getting all NVIDIA devices: error constructing NVIDIA PCI device 0000:01:00.1: unable to get device name: failed to find device with id '10fa'\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n validator:\n driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\""
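The validator is attempting to create symlinks under /dev/char that point at the NVIDIA character devices. As a quick sanity check (commands suggested here, not part of the original report), you can compare what exists on the affected node:

# NVIDIA device nodes present on the host
$ ls -l /dev/nvidia*
# /dev/char entries that already point at NVIDIA devices
$ ls -l /dev/char | grep -i nvidia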
3. Steps to reproduce the issue
4. Information to attach (optional if deemed irrelevant)
kubectl get pods -n gpu-operator
kubectl get ds -n gpu-operator
kubectl describe pod -n gpu-operator POD_NAME
kubectl logs -n gpu-operator POD_NAME --all-containers
nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n gpu-operator -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log
spectrum@spectrum:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy
spectrum@spectrum:~$ nvidia-container-cli info
NVRM version:   535.54.03
CUDA version:   12.2

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce GTX 1650
Brand:          GeForce
GPU UUID:       GPU-648ac414-633e-cf39-d315-eabd271dfad1
Bus Location:   00000000:01:00.0
Architecture:   7.5
Collecting full debug bundle (optional):
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]
# tree /run/nvidia
/run/nvidia
├── driver
└── validations

2 directories, 0 files