
GPU-operator goes CrashLoopBackOff #624

Closed
urbaman opened this issue Nov 30, 2023 · 1 comment

Comments
urbaman commented Nov 30, 2023

1. Quick Debug Information

  • OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 22.04
  • Kernel Version: 5.15.0-89-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): kubeadm 1.28.4
  • GPU Operator Version: 23.9.0

2. Issue or feature description

The gpu-operator pod goes into CrashLoopBackOff.
Logs:

...
{"level":"info","ts":1701345521.9651258,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-device-plugin","status":"disabled"}
{"level":"info","ts":1701345522.0023656,"logger":"controllers.ClusterPolicy","msg":"Kata Manager disabled, deleting all Kata RuntimeClasses"}
{"level":"info","ts":1701345522.0023997,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-kata-manager","status":"disabled"}
{"level":"info","ts":1701345522.0334861,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-cc-manager","status":"disabled"}
{"level":"info","ts":1701345522.0335908,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy is ready as all resources have been successfully reconciled"}
{"level":"error","ts":1701345527.9915156,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
{"level":"error","ts":1701345537.9906301,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
{"level":"error","ts":1701345547.9917858,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
{"level":"error","ts":1701345557.9914362,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
{"level":"error","ts":1701345567.990562,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
{"level":"error","ts":1701345577.991085,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
{"level":"error","ts":1701345587.9907615,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
{"level":"error","ts":1701345597.9910758,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
kubectl get pods -n gpu-operator
NAME                                                         READY   STATUS             RESTARTS         AGE
gpu-feature-discovery-97z5s                                  1/1     Running            1 (75m ago)      13h
gpu-feature-discovery-wpj2m                                  1/1     Running            1 (100m ago)     13h
gpu-feature-discovery-wvvpn                                  1/1     Running            1 (41m ago)      13h
gpu-operator-657c5b798d-xgvgc                                0/1     CrashLoopBackOff   12 (4m24s ago)   67m
gpu-operator-node-feature-discovery-gc-7cc7ccfff8-7tlsl      1/1     Running            0                50m
gpu-operator-node-feature-discovery-master-d8597d549-cx849   1/1     Running            0                85m
gpu-operator-node-feature-discovery-worker-5nl2l             1/1     Running            2 (45m ago)      13h
gpu-operator-node-feature-discovery-worker-5tfwv             1/1     Running            1 (100m ago)     13h
gpu-operator-node-feature-discovery-worker-89hbs             1/1     Running            1 (76m ago)      13h
gpu-operator-node-feature-discovery-worker-8tqp8             1/1     Running            3 (11m ago)      13h
gpu-operator-node-feature-discovery-worker-g9vmn             1/1     Running            1 (41m ago)      13h
gpu-operator-node-feature-discovery-worker-h5srt             1/1     Running            2 (45m ago)      13h
gpu-operator-node-feature-discovery-worker-hs7th             1/1     Running            1 (75m ago)      13h
gpu-operator-node-feature-discovery-worker-rd4f7             1/1     Running            1 (75m ago)      13h
gpu-operator-node-feature-discovery-worker-rf4cm             1/1     Running            1 (41m ago)      13h
nvidia-container-toolkit-daemonset-7jbnb                     1/1     Running            1 (41m ago)      13h
nvidia-container-toolkit-daemonset-c5j45                     1/1     Running            1 (76m ago)      13h
nvidia-container-toolkit-daemonset-xzqk4                     1/1     Running            1 (100m ago)     13h
nvidia-cuda-validator-44zgk                                  0/1     Completed          0                74m
nvidia-cuda-validator-gkknl                                  0/1     Completed          0                39m
nvidia-cuda-validator-pxrxh                                  0/1     Completed          0                98m
nvidia-dcgm-exporter-bq4h4                                   1/1     Running            1 (41m ago)      13h
nvidia-dcgm-exporter-v7mtb                                   1/1     Running            1 (75m ago)      13h
nvidia-dcgm-exporter-xnp84                                   1/1     Running            1 (100m ago)     13h
nvidia-device-plugin-daemonset-gjq48                         1/1     Running            1 (75m ago)      13h
nvidia-device-plugin-daemonset-hhgqr                         1/1     Running            1 (100m ago)     13h
nvidia-device-plugin-daemonset-r49v4                         1/1     Running            1 (41m ago)      13h
nvidia-operator-validator-flvzv                              1/1     Running            1 (41m ago)      13h
nvidia-operator-validator-kjkrx                              1/1     Running            1 (75m ago)      13h
nvidia-operator-validator-r6nf8                              1/1     Running            1 (100m ago)     13h
kubectl describe pod -n gpu-operator gpu-operator-657c5b798d-xgvgc
Name:                 gpu-operator-657c5b798d-xgvgc
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      gpu-operator
Node:                 k8cp3/10.0.50.53
Start Time:           Thu, 30 Nov 2023 11:57:27 +0100
Labels:               app=gpu-operator
                      app.kubernetes.io/component=gpu-operator
                      app.kubernetes.io/instance=gpu-operator
                      app.kubernetes.io/managed-by=Helm
                      app.kubernetes.io/name=gpu-operator
                      app.kubernetes.io/version=v23.9.0
                      helm.sh/chart=gpu-operator-v23.9.0
                      nvidia.com/gpu-driver-upgrade-drain.skip=true
                      pod-template-hash=657c5b798d
Annotations:          cni.projectcalico.org/containerID: 6f426fd1e135a7590216f2345f1097454576d8214402e3938625013a0a417257
                      cni.projectcalico.org/podIP: 10.50.144.103/32
                      cni.projectcalico.org/podIPs: 10.50.144.103/32
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "k8s-pod-network",
                            "ips": [
                                "10.50.144.103"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      openshift.io/scc: restricted-readonly
Status:               Running
IP:                   10.50.144.103
IPs:
  IP:           10.50.144.103
Controlled By:  ReplicaSet/gpu-operator-657c5b798d
Containers:
  gpu-operator:
    Container ID:  containerd://d95017ccf650fa97e7266f431931f033d51f3b513a0b10b37b226287fe820afc
    Image:         nvcr.io/nvidia/gpu-operator:v23.9.0
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:3d76a3562ca957abca31a22dc13a32d0c2e5b03c29cca6dd1662106abdfb2e32
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      gpu-operator
    Args:
      --leader-elect
      --zap-time-encoding=epoch
      --zap-log-level=info
    State:          Running
      Started:      Thu, 30 Nov 2023 13:05:41 +0100
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 30 Nov 2023 12:58:18 +0100
      Finished:     Thu, 30 Nov 2023 13:00:38 +0100
    Ready:          True
    Restart Count:  13
    Limits:
      cpu:     500m
      memory:  350Mi
    Requests:
      cpu:      200m
      memory:   100Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      WATCH_NAMESPACE:
      OPERATOR_NAMESPACE:    gpu-operator (v1:metadata.namespace)
      DRIVER_MANAGER_IMAGE:  nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.2
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kdps2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:
  kube-api-access-kdps2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Normal   Pulled     57m (x5 over 68m)     kubelet  Container image "nvcr.io/nvidia/gpu-operator:v23.9.0" already present on machine
  Normal   Created    57m (x5 over 68m)     kubelet  Created container gpu-operator
  Normal   Started    57m (x5 over 68m)     kubelet  Started container gpu-operator
  Warning  Unhealthy  56m                   kubelet  Readiness probe failed: Get "http://10.50.144.103:8081/readyz": dial tcp 10.50.144.103:8081: connect: connection refused
  Warning  BackOff    3m9s (x190 over 63m)  kubelet  Back-off restarting failed container gpu-operator in pod gpu-operator-657c5b798d-xgvgc_gpu-operator(df000894-06cc-482a-a829-f755b6487161)

3. Steps to reproduce the issue

Just install the Helm chart.
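
For reference, this was the standard Helm install from the NVIDIA repository, roughly as follows (the release and namespace names here are assumed, not copied from the cluster):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace --wait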

urbaman (Author) commented Dec 9, 2023

Hi,

My problem was that I had upgraded the Helm chart to v23.9.1 without updating the CRDs.
Re-running the upgrade with upgradeCRD: true fixed the operator.
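
For reference, a rough sketch of that CRD-aware upgrade (the operator.upgradeCRD key and the --disable-openapi-validation flag follow the NVIDIA upgrade docs; the exact value name may differ between chart versions):

helm repo update
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --set operator.upgradeCRD=true --disable-openapi-validation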

Thank you.

urbaman closed this as completed Dec 9, 2023