
GPU-operator goes CrashLoopBackOff #624

Closed
urbaman opened this issue Nov 30, 2023 · 1 comment

Comments
urbaman commented Nov 30, 2023

1. Quick Debug Information

  • OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 22.04
  • Kernel Version: 5.15.0-89-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): kubeadm 1.28.4
  • GPU Operator Version: 23.9.0

2. Issue or feature description

The gpu-operator pod goes into CrashLoopBackOff.
Logs:

...
{"level":"info","ts":1701345521.9651258,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-device-plugin","status":"disabled"}
{"level":"info","ts":1701345522.0023656,"logger":"controllers.ClusterPolicy","msg":"Kata Manager disabled, deleting all Kata RuntimeClasses"}
{"level":"info","ts":1701345522.0023997,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-kata-manager","status":"disabled"}
{"level":"info","ts":1701345522.0334861,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-cc-manager","status":"disabled"}
{"level":"info","ts":1701345522.0335908,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy is ready as all resources have been successfully reconciled"}
{"level":"error","ts":1701345527.9915156,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
{"level":"error","ts":1701345537.9906301,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
{"level":"error","ts":1701345547.9917858,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
{"level":"error","ts":1701345557.9914362,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
{"level":"error","ts":1701345567.990562,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
{"level":"error","ts":1701345577.991085,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
{"level":"error","ts":1701345587.9907615,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
{"level":"error","ts":1701345597.9910758,"logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource"}
kubectl get pods -n gpu-operator
NAME                                                         READY   STATUS             RESTARTS         AGE
gpu-feature-discovery-97z5s                                  1/1     Running            1 (75m ago)      13h
gpu-feature-discovery-wpj2m                                  1/1     Running            1 (100m ago)     13h
gpu-feature-discovery-wvvpn                                  1/1     Running            1 (41m ago)      13h
gpu-operator-657c5b798d-xgvgc                                0/1     CrashLoopBackOff   12 (4m24s ago)   67m
gpu-operator-node-feature-discovery-gc-7cc7ccfff8-7tlsl      1/1     Running            0                50m
gpu-operator-node-feature-discovery-master-d8597d549-cx849   1/1     Running            0                85m
gpu-operator-node-feature-discovery-worker-5nl2l             1/1     Running            2 (45m ago)      13h
gpu-operator-node-feature-discovery-worker-5tfwv             1/1     Running            1 (100m ago)     13h
gpu-operator-node-feature-discovery-worker-89hbs             1/1     Running            1 (76m ago)      13h
gpu-operator-node-feature-discovery-worker-8tqp8             1/1     Running            3 (11m ago)      13h
gpu-operator-node-feature-discovery-worker-g9vmn             1/1     Running            1 (41m ago)      13h
gpu-operator-node-feature-discovery-worker-h5srt             1/1     Running            2 (45m ago)      13h
gpu-operator-node-feature-discovery-worker-hs7th             1/1     Running            1 (75m ago)      13h
gpu-operator-node-feature-discovery-worker-rd4f7             1/1     Running            1 (75m ago)      13h
gpu-operator-node-feature-discovery-worker-rf4cm             1/1     Running            1 (41m ago)      13h
nvidia-container-toolkit-daemonset-7jbnb                     1/1     Running            1 (41m ago)      13h
nvidia-container-toolkit-daemonset-c5j45                     1/1     Running            1 (76m ago)      13h
nvidia-container-toolkit-daemonset-xzqk4                     1/1     Running            1 (100m ago)     13h
nvidia-cuda-validator-44zgk                                  0/1     Completed          0                74m
nvidia-cuda-validator-gkknl                                  0/1     Completed          0                39m
nvidia-cuda-validator-pxrxh                                  0/1     Completed          0                98m
nvidia-dcgm-exporter-bq4h4                                   1/1     Running            1 (41m ago)      13h
nvidia-dcgm-exporter-v7mtb                                   1/1     Running            1 (75m ago)      13h
nvidia-dcgm-exporter-xnp84                                   1/1     Running            1 (100m ago)     13h
nvidia-device-plugin-daemonset-gjq48                         1/1     Running            1 (75m ago)      13h
nvidia-device-plugin-daemonset-hhgqr                         1/1     Running            1 (100m ago)     13h
nvidia-device-plugin-daemonset-r49v4                         1/1     Running            1 (41m ago)      13h
nvidia-operator-validator-flvzv                              1/1     Running            1 (41m ago)      13h
nvidia-operator-validator-kjkrx                              1/1     Running            1 (75m ago)      13h
nvidia-operator-validator-r6nf8                              1/1     Running            1 (100m ago)     13h
kubectl describe pod -n gpu-operator gpu-operator-657c5b798d-xgvgc
Name:                 gpu-operator-657c5b798d-xgvgc
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      gpu-operator
Node:                 k8cp3/10.0.50.53
Start Time:           Thu, 30 Nov 2023 11:57:27 +0100
Labels:               app=gpu-operator
                      app.kubernetes.io/component=gpu-operator
                      app.kubernetes.io/instance=gpu-operator
                      app.kubernetes.io/managed-by=Helm
                      app.kubernetes.io/name=gpu-operator
                      app.kubernetes.io/version=v23.9.0
                      helm.sh/chart=gpu-operator-v23.9.0
                      nvidia.com/gpu-driver-upgrade-drain.skip=true
                      pod-template-hash=657c5b798d
Annotations:          cni.projectcalico.org/containerID: 6f426fd1e135a7590216f2345f1097454576d8214402e3938625013a0a417257
                      cni.projectcalico.org/podIP: 10.50.144.103/32
                      cni.projectcalico.org/podIPs: 10.50.144.103/32
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "k8s-pod-network",
                            "ips": [
                                "10.50.144.103"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      openshift.io/scc: restricted-readonly
Status:               Running
IP:                   10.50.144.103
IPs:
  IP:           10.50.144.103
Controlled By:  ReplicaSet/gpu-operator-657c5b798d
Containers:
  gpu-operator:
    Container ID:  containerd://d95017ccf650fa97e7266f431931f033d51f3b513a0b10b37b226287fe820afc
    Image:         nvcr.io/nvidia/gpu-operator:v23.9.0
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:3d76a3562ca957abca31a22dc13a32d0c2e5b03c29cca6dd1662106abdfb2e32
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      gpu-operator
    Args:
      --leader-elect
      --zap-time-encoding=epoch
      --zap-log-level=info
    State:          Running
      Started:      Thu, 30 Nov 2023 13:05:41 +0100
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 30 Nov 2023 12:58:18 +0100
      Finished:     Thu, 30 Nov 2023 13:00:38 +0100
    Ready:          True
    Restart Count:  13
    Limits:
      cpu:     500m
      memory:  350Mi
    Requests:
      cpu:      200m
      memory:   100Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      WATCH_NAMESPACE:
      OPERATOR_NAMESPACE:    gpu-operator (v1:metadata.namespace)
      DRIVER_MANAGER_IMAGE:  nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.2
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kdps2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:
  kube-api-access-kdps2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Normal   Pulled     57m (x5 over 68m)     kubelet  Container image "nvcr.io/nvidia/gpu-operator:v23.9.0" already present on machine
  Normal   Created    57m (x5 over 68m)     kubelet  Created container gpu-operator
  Normal   Started    57m (x5 over 68m)     kubelet  Started container gpu-operator
  Warning  Unhealthy  56m                   kubelet  Readiness probe failed: Get "http://10.50.144.103:8081/readyz": dial tcp 10.50.144.103:8081: connect: connection refused
  Warning  BackOff    3m9s (x190 over 63m)  kubelet  Back-off restarting failed container gpu-operator in pod gpu-operator-657c5b798d-xgvgc_gpu-operator(df000894-06cc-482a-a829-f755b6487161)

3. Steps to reproduce the issue

Just install the Helm chart.
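
For reference, this was the standard Helm install from the NVIDIA repository, roughly as follows (the release and namespace names here are assumed, not copied from the cluster):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace --wait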

urbaman (Author) commented Dec 9, 2023

Hi,

My problem was that I had upgraded the Helm chart to v23.9.1 without updating the CRDs.
Re-running the upgrade with upgradeCRD: true fixed the operator.
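
For reference, a rough sketch of that CRD-aware upgrade (the operator.upgradeCRD key and the --disable-openapi-validation flag follow the NVIDIA upgrade docs; the exact value name may differ between chart versions):

helm repo update
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --set operator.upgradeCRD=true --disable-openapi-validation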

Thank you.

urbaman closed this as completed Dec 9, 2023