The out-of-the-box v23.9.0 release does not work on vanilla Ubuntu 22.04; it throws GLIBC compatibility errors:
Defaulted container "nvidia-sandbox-device-plugin-ctr" out of: nvidia-sandbox-device-plugin-ctr, vfio-pci-validation (init), vgpu-devices-validation (init)
nvidia-kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by nvidia-kubevirt-gpu-device-plugin)
nvidia-kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by nvidia-kubevirt-gpu-device-plugin)
After some debugging, the root cause was traced to the outdated kubevirt-gpu-device-plugin:v1.2.3 image.
The issue can be rectified manually by running kubectl edit ds -n gpu-operator nvidia-sandbox-device-plugin-daemonset
and changing the image tag from v1.2.3 to v1.2.4.
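For reference, a non-interactive equivalent of that workaround (a sketch only; the container name is taken from the pod description further below) would be:
# kubectl -n gpu-operator set image ds/nvidia-sandbox-device-plugin-daemonset \
    nvidia-sandbox-device-plugin-ctr=nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.4
Note that the operator may reconcile the DaemonSet back to the chart default, in which case the image version would have to be overridden through the chart values / ClusterPolicy instead.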
Although this issue appears to be fixed on the main branch by commit 5d46e5d from a couple of weeks ago,
it is not included in the latest release of the operator, which is confusing for regular users.
Please consider releasing a new version that includes the fix.
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
OS/Version (e.g. RHEL8.6, Ubuntu22.04):
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04 LTS
Release: 22.04
Codename: jammy
Kernel Version:
# uname -a
Linux cloud-gpu3 5.15.0-89-generic #99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
# kubectl get no
NAME STATUS ROLES AGE VERSION
cloud-gpu3 Ready control-plane 6d v1.27.7
cloud-gpu4 Ready control-plane 6d v1.27.7
cloud-gpu5 Ready control-plane 6d v1.27.7
GPU Operator Version:
# helm list -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
cilium kube-system 1 2023-11-29 02:50:18.53817559 +0000 UTC deployed cilium-1.13.4 1.13.4
gpu-operator-1701742661 gpu-operator 1 2023-12-05 02:17:43.14248693 +0000 UTC deployed gpu-operator-v23.9.0 v23.9.0
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
The nvidia-sandbox-device-plugin-daemonset keeps crashing after the initial deployment of operator v23.9.0. The pod log is as below:
Defaulted container "nvidia-sandbox-device-plugin-ctr" out of: nvidia-sandbox-device-plugin-ctr, vfio-pci-validation (init), vgpu-devices-validation (init)
nvidia-kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by nvidia-kubevirt-gpu-device-plugin)
nvidia-kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by nvidia-kubevirt-gpu-device-plugin)
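One way to confirm the mismatch is to check which glibc the v1.2.3 plugin image actually ships versus the symbols the binary requires. This is only a rough sketch: it assumes docker (or an equivalent nerdctl/ctr invocation) is available on a workstation and that the image contains a shell and ldd, which may not hold for minimal images.
# docker pull nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.3
# docker run --rm --entrypoint sh nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.3 -c 'ldd --version | head -1'
If the glibc reported inside the image is older than 2.34, a plugin binary built against a newer glibc will fail with exactly the errors above.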
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
After the helm install, the nvidia-sandbox-device-plugin-daemonset remains in this crash-loop state indefinitely.
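For completeness, a minimal sketch of the kind of install that reproduces this. The chart repo and the sandboxWorkloads.enabled flag are assumptions based on a typical GPU Operator setup for VM/vGPU workloads, not an exact record of the original command:
# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
# helm install gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --version v23.9.0 \
    --set sandboxWorkloads.enabled=true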
4. Information to attach (optional if deemed irrelevant)
kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
# kubectl -n gpu-operator describe pod nvidia-sandbox-device-plugin-daemonset-2pzvk
Name: nvidia-sandbox-device-plugin-daemonset-2pzvk
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-sandbox-device-plugin
Node: cloud-gpu3/192.168.8.23
Start Time: Tue, 05 Dec 2023 03:53:45 +0000
Labels: app=nvidia-sandbox-device-plugin-daemonset
app.kubernetes.io/managed-by=gpu-operator
controller-revision-hash=68f8974766
helm.sh/chart=gpu-operator-v23.9.0
pod-template-generation=3
Annotations: <none>
Status: Running
IP: 10.0.0.42
IPs:
IP: 10.0.0.42
Controlled By: DaemonSet/nvidia-sandbox-device-plugin-daemonset
Init Containers:
vfio-pci-validation:
Container ID: containerd://62b2550734b80ad5c243648c2adae37d08a121d921533cffa3eaa506c0e73b63
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
until [ -f /run/nvidia/validations/workload-type ]; do echo waiting for workload type status file; sleep 5; done; if [ "$(</run/nvidia/validations/workload-type)" != "vm-passthrough" ]; then echo vfio-pci not needed, skipping validation; exit 0; fi; until [ -f /run/nvidia/validations/vfio-pci-ready ]; do echo waiting for vfio-pci driver ...; sleep 5; done;
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 05 Dec 2023 03:53:48 +0000
Finished: Tue, 05 Dec 2023 03:53:48 +0000
Ready: True
Restart Count: 0
Environment:
NVIDIA_VISIBLE_DEVICES: void
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rl7xv (ro)
vgpu-devices-validation:
Container ID: containerd://fb8b131e1477d47582bfb432d06d6aeb2cfacbd9fa1b0610bbcec658197fbe69
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
until [ -f /run/nvidia/validations/workload-type ]; do echo waiting for workload type status file; sleep 5; done; if [ "$(</run/nvidia/validations/workload-type)" != "vm-vgpu" ]; then echo vgpu-devices not needed, skipping validation; exit 0; fi; until [ -f /run/nvidia/validations/vgpu-devices-ready ]; do echo waiting for vGPU devices...; sleep 5; done;
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 05 Dec 2023 03:53:50 +0000
Finished: Tue, 05 Dec 2023 03:53:50 +0000
Ready: True
Restart Count: 0
Environment:
NVIDIA_VISIBLE_DEVICES: void
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rl7xv (ro)
Containers:
nvidia-sandbox-device-plugin-ctr:
Container ID: containerd://94d85245ac917414ed675a0b717f294865154443e3fbf5f3c1d66da0001e0287
Image: nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.3
Image ID: nvcr.io/nvidia/kubevirt-gpu-device-plugin@sha256:1f2c9317858169d78638c2a7a4c0afa7a4e25cf5883bbbb7a79ee77fc6c832f9
Port: <none>
Host Port: <none>
Command:
nvidia-kubevirt-gpu-device-plugin
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 05 Dec 2023 03:55:25 +0000
Finished: Tue, 05 Dec 2023 03:55:25 +0000
Ready: False
Restart Count: 4
Environment: <none>
Mounts:
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rl7xv (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
kube-api-access-rl7xv:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.sandbox-device-plugin=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m48s default-scheduler Successfully assigned gpu-operator/nvidia-sandbox-device-plugin-daemonset-2pzvk to cloud-gpu3
Normal Pulled 2m47s kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0" already present on machine
Normal Created 2m46s kubelet Created container vfio-pci-validation
Normal Started 2m46s kubelet Started container vfio-pci-validation
Normal Pulled 2m46s kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0" already present on machine
Normal Created 2m45s kubelet Created container vgpu-devices-validation
Normal Started 2m44s kubelet Started container vgpu-devices-validation
Normal Pulled 2m2s (x4 over 2m44s) kubelet Container image "nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.3" already present on machine
Normal Created 2m2s (x4 over 2m44s) kubelet Created container nvidia-sandbox-device-plugin-ctr
Normal Started 2m1s (x4 over 2m43s) kubelet Started container nvidia-sandbox-device-plugin-ctr
Warning BackOff 84s (x7 over 2m42s) kubelet Back-off restarting failed container nvidia-sandbox-device-plugin-ctr in pod nvidia-sandbox-device-plugin-daemonset-2pzvk_gpu-operator(ec5044dc-a4e7-4b5c-b5dd-e8d73a2e7ec3)
If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
# kubectl -n gpu-operator logs -f nvidia-sandbox-device-plugin-daemonset-2pzvk --all-containers
vgpu-devices not needed, skipping validation
nvidia-kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by nvidia-kubevirt-gpu-device-plugin)
nvidia-kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by nvidia-kubevirt-gpu-device-plugin)
Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
Not really needed for this issue; the cause and fix were clearly identified.
@robertwenquan I encountered the same issue. I changed the build command in the Makefile of nvidia-kubevirt-gpu-device-plugin to
CGO_ENABLED=1 CC=musl-gcc go build -o nvidia-kubevirt-gpu-device-plugin --ldflags '-linkmode=external -extldflags=-static' kubevirt-gpu-device-plugin/cmd
and then rebuilt the image with make build-image, which solved my problem. If the musl packages are missing, you can install them with apt-get install -y --no-install-recommends musl-dev musl-tools. Hope it works for you.
@simonyangcj @robertwenquan this has been fixed with nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.4 as of operator version v23.9.1. Please update and verify.
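A hedged sketch of that update and verification; the release name and namespace are taken from the helm list output above, and the exact upgrade flags are assumptions rather than an official prescription:
# helm upgrade gpu-operator-1701742661 nvidia/gpu-operator -n gpu-operator --version v23.9.1
# kubectl -n gpu-operator get pods -l app=nvidia-sandbox-device-plugin-daemonset
# kubectl -n gpu-operator get ds nvidia-sandbox-device-plugin-daemonset \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
The DaemonSet should now report nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.4 and the pod should stay in Running.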