nvidia-sandbox-device-plugin-daemonset in CrashLoopBackOff with GLIBC compatibility errors #627

Closed
robertwenquan opened this issue Dec 5, 2023 · 2 comments

robertwenquan commented Dec 5, 2023

The out-of-the-box v23.9.0 release does not work on vanilla Ubuntu 22.04; it throws GLIBC compatibility errors.

Defaulted container "nvidia-sandbox-device-plugin-ctr" out of: nvidia-sandbox-device-plugin-ctr, vfio-pci-validation (init), vgpu-devices-validation (init)
nvidia-kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by nvidia-kubevirt-gpu-device-plugin)
nvidia-kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by nvidia-kubevirt-gpu-device-plugin)

After some debugging, the cause was found to be the outdated kubevirt-gpu-device-plugin:v1.2.3 image.
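
A quick way to confirm that the daemonset is still pinned to the outdated image is to query its pod template directly. A minimal sketch (the container index assumes the plugin is the only non-init container, as in the pod description further below; the expected output matches the Image field there):

# kubectl -n gpu-operator get ds nvidia-sandbox-device-plugin-daemonset -o jsonpath='{.spec.template.spec.containers[0].image}'
nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.3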

The issue can be manually rectified by running
kubectl edit ds -n gpu-operator nvidia-sandbox-device-plugin-daemonset
and changing the image tag from v1.2.3 to v1.2.4.
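
The same change can be applied non-interactively; a minimal sketch using kubectl set image (the container name nvidia-sandbox-device-plugin-ctr is taken from the pod description below). Note that the operator may reconcile the daemonset back to the chart's default image, so this is only a temporary workaround:

# kubectl -n gpu-operator set image daemonset/nvidia-sandbox-device-plugin-daemonset nvidia-sandbox-device-plugin-ctr=nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.4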

Although this issue appears to be fixed in the latest code by commit
5d46e5d
from a couple of weeks ago, the fix is not included in the latest release of the operator, which is confusing for users.

Please consider releasing a new version with the fix.

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04):
# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04 LTS
Release:	22.04
Codename:	jammy
  • Kernel Version:
# uname -a
Linux cloud-gpu3 5.15.0-89-generic #99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker):
# crictl version
Version:  0.1.0
RuntimeName:  containerd
RuntimeVersion:  v1.6.24
RuntimeApiVersion:  v1

  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS):
# kubectl get no
NAME            STATUS   ROLES           AGE   VERSION
cloud-gpu3   Ready    control-plane   6d    v1.27.7
cloud-gpu4   Ready    control-plane   6d    v1.27.7
cloud-gpu5   Ready    control-plane   6d    v1.27.7
  • GPU Operator Version:
# helm list -A
NAME                   	NAMESPACE   	REVISION	UPDATED                               	STATUS  	CHART               	APP VERSION
cilium                 	kube-system 	1       	2023-11-29 02:50:18.53817559 +0000 UTC	deployed	cilium-1.13.4       	1.13.4
gpu-operator-1701742661	gpu-operator	1       	2023-12-05 02:17:43.14248693 +0000 UTC	deployed	gpu-operator-v23.9.0	v23.9.0

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

The nvidia-sandbox-device-plugin-daemonset keeps crashing after the initial deployment of operator v23.9.0.

Pod state is shown below:

# kubectl get po -n gpu-operator
NAME                                                              READY   STATUS                  RESTARTS      AGE
gpu-operator-1701742661-node-feature-discovery-gc-667b55d6lbsxv   1/1     Running                 0             3m40s
gpu-operator-1701742661-node-feature-discovery-master-f99czbxrp   1/1     Running                 0             3m40s
gpu-operator-1701742661-node-feature-discovery-worker-62rtq       1/1     Running                 0             3m40s
gpu-operator-1701742661-node-feature-discovery-worker-nrxts       1/1     Running                 0             3m40s
gpu-operator-1701742661-node-feature-discovery-worker-pgp2q       1/1     Running                 0             3m40s
gpu-operator-c4f4cc76b-zs5hl                                      1/1     Running                 0             3m40s
nvidia-sandbox-device-plugin-daemonset-cfmpt                      0/1     Init:0/2                0             3m19s
nvidia-sandbox-device-plugin-daemonset-mrqtl                      0/1     Init:0/2                0             3m19s
nvidia-sandbox-device-plugin-daemonset-zz8ns                      0/1     Init:0/2                0             3m19s
nvidia-sandbox-validator-2kh5c                                    0/1     Init:CrashLoopBackOff   5 (15s ago)   3m19s
nvidia-sandbox-validator-55kv8                                    0/1     Init:CrashLoopBackOff   5 (29s ago)   3m19s
nvidia-sandbox-validator-qsh5w                                    0/1     Init:CrashLoopBackOff   5 (26s ago)   3m19s
nvidia-vfio-manager-llq5x                                         0/1     Init:0/1                2 (69s ago)   3m19s
nvidia-vfio-manager-n8vrd                                         0/1     Init:0/1                2 (71s ago)   3m19s
nvidia-vfio-manager-vtrmj                                         0/1     Init:0/1                2 (56s ago)   3m19s

Pod log is shown below:

nvidia-sandbox-device-plugin-ctr" out of: nvidia-sandbox-device-plugin-ctr, vfio-pci-validation (init), vgpu-devices-validation (init)
nvidia-kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by nvidia-kubevirt-gpu-device-plugin)
nvidia-kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by nvidia-kubevirt-gpu-device-plugin)

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

After the Helm install, the nvidia-sandbox-device-plugin-daemonset remains in that CrashLoopBackOff state indefinitely.

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
# kubectl get po -n gpu-operator
NAME                                                              READY   STATUS                  RESTARTS        AGE
gpu-operator-1701742661-node-feature-discovery-gc-667b55d6lbsxv   1/1     Running                 0               9m49s
gpu-operator-1701742661-node-feature-discovery-master-f99czbxrp   1/1     Running                 0               9m49s
gpu-operator-1701742661-node-feature-discovery-worker-62rtq       1/1     Running                 0               9m49s
gpu-operator-1701742661-node-feature-discovery-worker-nrxts       1/1     Running                 0               9m49s
gpu-operator-1701742661-node-feature-discovery-worker-pgp2q       1/1     Running                 0               9m49s
gpu-operator-c4f4cc76b-zs5hl                                      1/1     Running                 0               9m49s
nvidia-sandbox-device-plugin-daemonset-cfmpt                      0/1     Init:CrashLoopBackOff   0               9m28s
nvidia-sandbox-device-plugin-daemonset-mrqtl                      0/1     Init:CrashLoopBackOff   0               9m28s
nvidia-sandbox-device-plugin-daemonset-zz8ns                      0/1     Init:CrashLoopBackOff   0               9m28s
nvidia-sandbox-validator-2kh5c                                    1/1     Running                 6 (3m33s ago)   9m28s
nvidia-sandbox-validator-55kv8                                    1/1     Running                 6 (3m51s ago)   9m28s
nvidia-sandbox-validator-qsh5w                                    1/1     Running                 6 (3m50s ago)   9m28s
nvidia-vfio-manager-llq5x                                         1/1     Running                 5 (14s ago)     9m28s
nvidia-vfio-manager-n8vrd                                         1/1     Running                 5 (2m28s ago)   9m28s
nvidia-vfio-manager-vtrmj                                         1/1     Running                 5 (2m13s ago)   9m28s
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
# kubectl get ds -n gpu-operator
NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                                   0         0         0       0            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   97m
gpu-operator-1701742661-node-feature-discovery-worker   3         3         3       3            3           <none>                                             98m
nvidia-container-toolkit-daemonset                      0         0         0       0            0           nvidia.com/gpu.deploy.container-toolkit=true       97m
nvidia-dcgm-exporter                                    0         0         0       0            0           nvidia.com/gpu.deploy.dcgm-exporter=true           97m
nvidia-device-plugin-daemonset                          0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true           97m
nvidia-driver-daemonset                                 0         0         0       0            0           nvidia.com/gpu.deploy.driver=true                  97m
nvidia-mig-manager                                      0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             97m
nvidia-operator-validator                               0         0         0       0            0           nvidia.com/gpu.deploy.operator-validator=true      97m
nvidia-sandbox-device-plugin-daemonset                  3         3         0       3            0           nvidia.com/gpu.deploy.sandbox-device-plugin=true   97m
nvidia-sandbox-validator                                3         3         3       3            3           nvidia.com/gpu.deploy.sandbox-validator=true       97m
nvidia-vfio-manager                                     3         3         3       3            3           nvidia.com/gpu.deploy.vfio-manager=true            97m
nvidia-vgpu-device-manager                              0         0         0       0            0           nvidia.com/gpu.deploy.vgpu-device-manager=true     97m
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
# kubectl -n gpu-operator describe pod nvidia-sandbox-device-plugin-daemonset-2pzvk
Name:                 nvidia-sandbox-device-plugin-daemonset-2pzvk
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-sandbox-device-plugin
Node:                 cloud-gpu3/192.168.8.23
Start Time:           Tue, 05 Dec 2023 03:53:45 +0000
Labels:               app=nvidia-sandbox-device-plugin-daemonset
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=68f8974766
                      helm.sh/chart=gpu-operator-v23.9.0
                      pod-template-generation=3
Annotations:          <none>
Status:               Running
IP:                   10.0.0.42
IPs:
  IP:           10.0.0.42
Controlled By:  DaemonSet/nvidia-sandbox-device-plugin-daemonset
Init Containers:
  vfio-pci-validation:
    Container ID:  containerd://62b2550734b80ad5c243648c2adae37d08a121d921533cffa3eaa506c0e73b63
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      until [ -f /run/nvidia/validations/workload-type ]; do echo waiting for workload type status file; sleep 5; done; if [ "$(</run/nvidia/validations/workload-type)" != "vm-passthrough" ]; then echo vfio-pci not needed, skipping validation; exit 0; fi; until [ -f /run/nvidia/validations/vfio-pci-ready ]; do echo waiting for vfio-pci driver ...; sleep 5; done;
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 05 Dec 2023 03:53:48 +0000
      Finished:     Tue, 05 Dec 2023 03:53:48 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      NVIDIA_VISIBLE_DEVICES:  void
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rl7xv (ro)
  vgpu-devices-validation:
    Container ID:  containerd://fb8b131e1477d47582bfb432d06d6aeb2cfacbd9fa1b0610bbcec658197fbe69
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      until [ -f /run/nvidia/validations/workload-type ]; do echo waiting for workload type status file; sleep 5; done; if [ "$(</run/nvidia/validations/workload-type)" != "vm-vgpu" ]; then echo vgpu-devices not needed, skipping validation; exit 0; fi; until [ -f /run/nvidia/validations/vgpu-devices-ready ]; do echo waiting for vGPU devices...; sleep 5; done;
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 05 Dec 2023 03:53:50 +0000
      Finished:     Tue, 05 Dec 2023 03:53:50 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      NVIDIA_VISIBLE_DEVICES:  void
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rl7xv (ro)
Containers:
  nvidia-sandbox-device-plugin-ctr:
    Container ID:  containerd://94d85245ac917414ed675a0b717f294865154443e3fbf5f3c1d66da0001e0287
    Image:         nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.3
    Image ID:      nvcr.io/nvidia/kubevirt-gpu-device-plugin@sha256:1f2c9317858169d78638c2a7a4c0afa7a4e25cf5883bbbb7a79ee77fc6c832f9
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-kubevirt-gpu-device-plugin
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 05 Dec 2023 03:55:25 +0000
      Finished:     Tue, 05 Dec 2023 03:55:25 +0000
    Ready:          False
    Restart Count:  4
    Environment:    <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rl7xv (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  kube-api-access-rl7xv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.sandbox-device-plugin=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  2m48s                 default-scheduler  Successfully assigned gpu-operator/nvidia-sandbox-device-plugin-daemonset-2pzvk to cloud-gpu3
  Normal   Pulled     2m47s                 kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0" already present on machine
  Normal   Created    2m46s                 kubelet            Created container vfio-pci-validation
  Normal   Started    2m46s                 kubelet            Started container vfio-pci-validation
  Normal   Pulled     2m46s                 kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0" already present on machine
  Normal   Created    2m45s                 kubelet            Created container vgpu-devices-validation
  Normal   Started    2m44s                 kubelet            Started container vgpu-devices-validation
  Normal   Pulled     2m2s (x4 over 2m44s)  kubelet            Container image "nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.3" already present on machine
  Normal   Created    2m2s (x4 over 2m44s)  kubelet            Created container nvidia-sandbox-device-plugin-ctr
  Normal   Started    2m1s (x4 over 2m43s)  kubelet            Started container nvidia-sandbox-device-plugin-ctr
  Warning  BackOff    84s (x7 over 2m42s)   kubelet            Back-off restarting failed container nvidia-sandbox-device-plugin-ctr in pod nvidia-sandbox-device-plugin-daemonset-2pzvk_gpu-operator(ec5044dc-a4e7-4b5c-b5dd-e8d73a2e7ec3)
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
# kubectl -n gpu-operator logs -f nvidia-sandbox-device-plugin-daemonset-2pzvk --all-containers
vgpu-devices not needed, skipping validation
nvidia-kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by nvidia-kubevirt-gpu-device-plugin)
nvidia-kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by nvidia-kubevirt-gpu-device-plugin)
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi

Not really needed for this issue; the cause and fix were clearly identified.

  • containerd logs journalctl -u containerd > containerd.log
Dec 05 03:54:33 cloud-gpu3 containerd[1970]: time="2023-12-05T03:54:33.096715269Z" level=warning msg="cleanup warnings time=\"2023-12-05T03:54:33Z\" level=info msg=\"starting signal loop\" namespace=k8s>
Dec 05 03:55:25 cloud-gpu3 containerd[1970]: time="2023-12-05T03:55:25.171085449Z" level=warning msg="cleaning up after shim disconnected" id=94d85245ac917414ed675a0b717f294865154443e3fbf5f3c1d66da0001e>
Dec 05 03:55:25 cloud-gpu3 containerd[1970]: time="2023-12-05T03:55:25.187298224Z" level=warning msg="cleanup warnings time=\"2023-12-05T03:55:25Z\" level=info msg=\"starting signal loop\" namespace=k8s>
Dec 05 03:56:46 cloud-gpu3 containerd[1970]: time="2023-12-05T03:56:46.150372012Z" level=warning msg="cleaning up after shim disconnected" id=51de1c59df0454951dabd458a6cc8950f2baca0920572d165d1415307155>
Dec 05 03:56:46 cloud-gpu3 containerd[1970]: time="2023-12-05T03:56:46.168886231Z" level=warning msg="cleanup warnings time=\"2023-12-05T03:56:46Z\" level=info msg=\"starting signal loop\" namespace=k8s>
Dec 05 03:59:27 cloud-gpu3 containerd[1970]: time="2023-12-05T03:59:27.249762272Z" level=warning msg="cleaning up after shim disconnected" id=f91bf3143d2b889fedddd57e1aff0a81d220cd61adcbd78be7c7b7c29bcf>
Dec 05 03:59:27 cloud-gpu3 containerd[1970]: time="2023-12-05T03:59:27.268974332Z" level=warning msg="cleanup warnings time=\"2023-12-05T03:59:27Z\" level=info msg=\"starting signal loop\" namespace=k8s>
Dec 05 04:04:34 cloud-gpu3 containerd[1970]: time="2023-12-05T04:04:34.100494045Z" level=warning msg="cleaning up after shim disconnected" id=4266cec26a1bbaa79c1653b0ac194200796801850f50cacbc201c1a1ae00>
Dec 05 04:04:34 cloud-gpu3 containerd[1970]: time="2023-12-05T04:04:34.118684990Z" level=warning msg="cleanup warnings time=\"2023-12-05T04:04:34Z\" level=info msg=\"starting signal loop\" namespace=k8s>
Dec 05 04:08:39 cloud-gpu3 containerd[1970]: time="2023-12-05T04:08:39.712462348Z" level=warning msg="cleaning up after shim disconnected" id=6218ca5779511575691d52380a3b2e613380dd6299e5e6a114e143751c82>
Dec 05 04:08:39 cloud-gpu3 containerd[1970]: time="2023-12-05T04:08:39.799321861Z" level=warning msg="cleanup warnings time=\"2023-12-05T04:08:39Z\" level=info msg=\"starting signal loop\" namespace=k8s>
Dec 05 04:08:42 cloud-gpu3 containerd[1970]: time="2023-12-05T04:08:42.475190009Z" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.container>
Dec 05 04:08:42 cloud-gpu3 containerd[1970]: time="2023-12-05T04:08:42.475427765Z" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.contain>
Dec 05 04:08:42 cloud-gpu3 containerd[1970]: time="2023-12-05T04:08:42.475458432Z" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttr>
Dec 05 04:08:42 cloud-gpu3 containerd[1970]: time="2023-12-05T04:08:42.476068261Z" level=info msg="starting signal loop" namespace=k8s.io path=/run/containerd/io.containerd.runtime.v2.task/k8s.io/e5abb3>
Dec 05 04:08:43 cloud-gpu3 containerd[1970]: time="2023-12-05T04:08:43.345198648Z" level=warning msg="cleaning up after shim disconnected" id=009dcb0e7eecd9c03d096d6471190537b0d189a9daa390fb690e0129c478>
Dec 05 04:08:43 cloud-gpu3 containerd[1970]: time="2023-12-05T04:08:43.365039159Z" level=warning msg="cleanup warnings time=\"2023-12-05T04:08:43Z\" level=info msg=\"starting signal loop\" namespace=k8s>
Dec 05 04:08:44 cloud-gpu3 containerd[1970]: time="2023-12-05T04:08:44.827126955Z" level=warning msg="cleaning up after shim disconnected" id=3750807ba1b82c0a8323aebaaf938b3f18cd430a8c8290ec9409f4acf1ef>
Dec 05 04:08:44 cloud-gpu3 containerd[1970]: time="2023-12-05T04:08:44.842068813Z" level=warning msg="cleanup warnings time=\"2023-12-05T04:08:44Z\" level=info msg=\"starting signal loop\" namespace=k8s>
Dec 05 04:09:19 cloud-gpu3 containerd[1970]: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
Dec 05 04:09:19 cloud-gpu3 containerd[1970]: level=warning msg="Unable to enter namespace \"\", will not delete interface" error="failed to Statfs \"\": no such file or directory" subsys=cilium-cni
Dec 05 04:09:19 cloud-gpu3 containerd[1970]: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
Dec 05 04:09:19 cloud-gpu3 containerd[1970]: level=warning msg="Unable to enter namespace \"\", will not delete interface" error="failed to Statfs \"\": no such file or directory" subsys=cilium-cni

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

Not really needed for this issue; the cause and fix were clearly identified.

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

simonyangcj commented Dec 5, 2023

@robertwenquan I encountered the same issue. I changed the build command in the Makefile of nvidia-kubevirt-gpu-device-plugin to CGO_ENABLED=1 CC=musl-gcc go build -o nvidia-kubevirt-gpu-device-plugin --ldflags '-linkmode=external -extldflags=-static' kubevirt-gpu-device-plugin/cmd, then rebuilt the image with make build-image; that solved my problem. If you are missing packages, you can install them with apt-get install -y --no-install-recommends musl-dev musl-tools. Hope it works for you.
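
Roughly, the rebuild described above looks like the sketch below (the repository URL is an assumption, and the Makefile targets are only those named in the comment, so adjust to the actual kubevirt-gpu-device-plugin layout):

apt-get install -y --no-install-recommends musl-dev musl-tools
git clone https://github.com/NVIDIA/kubevirt-gpu-device-plugin.git
cd kubevirt-gpu-device-plugin
# edit the Makefile so the build step uses:
#   CGO_ENABLED=1 CC=musl-gcc go build -o nvidia-kubevirt-gpu-device-plugin --ldflags '-linkmode=external -extldflags=-static' kubevirt-gpu-device-plugin/cmd
make build
make build-image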

@shivamerla
Contributor

@simonyangcj @robertwenquan this has been fixed with nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.4 from operator version v23.9.1. Please update and verify.
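
For reference, a minimal upgrade sketch (the release name and namespace come from the helm list output above; the nvidia chart repo name is an assumption for a typical install):

helm repo update
helm upgrade gpu-operator-1701742661 nvidia/gpu-operator -n gpu-operator --version v23.9.1
kubectl -n gpu-operator get pods -l app=nvidia-sandbox-device-plugin-daemonset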
