Failure to deploy on new containerd releases #39

Closed · ktsakalozos opened this issue Dec 3, 2022 · 2 comments

@ktsakalozos

When deploying the gpu-operator, the nvidia-container-toolkit pod is crashlooping. The container logs:

> sudo microk8s.kubectl logs -n gpu-operator-resources   pod/nvidia-container-toolkit-daemonset-2pthk
.....
time="2022-12-03T10:41:38Z" level=info msg="Setting up runtime"
time="2022-12-03T10:41:38Z" level=info msg="Starting 'setup' for containerd"
time="2022-12-03T10:41:38Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2022-12-03T10:41:38Z" level=info msg="Successfully parsed arguments"
time="2022-12-03T10:41:38Z" level=info msg="Loading config: /runtime/config-dir/containerd-template.toml"
time="2022-12-03T10:41:38Z" level=info msg="Successfully loaded config"
time="2022-12-03T10:41:38Z" level=info msg="Config version: 2"
time="2022-12-03T10:41:38Z" level=info msg="Updating config"
time="2022-12-03T10:41:38Z" level=info msg="Successfully updated config"
time="2022-12-03T10:41:38Z" level=info msg="Flushing config"
time="2022-12-03T10:41:38Z" level=info msg="Successfully flushed config"
time="2022-12-03T10:41:38Z" level=info msg="Sending SIGHUP signal to containerd"
time="2022-12-03T10:41:38Z" level=info msg="Successfully signaled containerd"
time="2022-12-03T10:41:38Z" level=info msg="Completed 'setup' for containerd"
time="2022-12-03T10:41:38Z" level=info msg="Waiting for signal"
time="2022-12-03T10:41:40Z" level=info msg="Cleaning up Runtime"
time="2022-12-03T10:41:40Z" level=info msg="Starting 'cleanup' for containerd"
time="2022-12-03T10:41:40Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2022-12-03T10:41:40Z" level=info msg="Successfully parsed arguments"
time="2022-12-03T10:41:40Z" level=info msg="Loading config: /runtime/config-dir/containerd-template.toml"
time="2022-12-03T10:41:40Z" level=info msg="Successfully loaded config"
time="2022-12-03T10:41:40Z" level=info msg="Config version: 2"
time="2022-12-03T10:41:40Z" level=info msg="Reverting config"
time="2022-12-03T10:41:40Z" level=info msg="Successfully reverted config"
time="2022-12-03T10:41:40Z" level=info msg="Flushing config"
time="2022-12-03T10:41:40Z" level=info msg="Successfully flushed config"
time="2022-12-03T10:41:40Z" level=info msg="Sending SIGHUP signal to containerd"
time="2022-12-03T10:41:40Z" level=info msg="Successfully signaled containerd"
time="2022-12-03T10:41:40Z" level=info msg="Completed 'cleanup' for containerd"
time="2022-12-03T10:41:40Z" level=info msg="Shutting Down"
time="2022-12-03T10:41:40Z" level=info msg="Completed nvidia-toolkit"

Full logs: https://paste.ubuntu.com/p/f3byGQ4kpJ/
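
For reference, the 'setup' and 'cleanup' steps above write the nvidia runtime into the containerd template, signal containerd, and then revert the change again. A quick way to see what is actually left on the host is something like the following; the path assumes a default MicroK8s snap install (it appears to be what gets mounted as /runtime/config-dir inside the pod), so adjust as needed:

# Check whether the nvidia runtime entries are still present in the containerd
# template on the host, or whether the 'cleanup' step has already reverted them.
# Path is an assumption based on a default MicroK8s snap layout.
sudo grep -n -A3 nvidia /var/snap/microk8s/current/args/containerd-template.toml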

Here are the pods currently running:

NAMESPACE                NAME                                                              READY   STATUS             RESTARTS      AGE
kube-system              pod/calico-node-mv8b4                                             1/1     Running            0             11m
gpu-operator-resources   pod/gpu-operator-node-feature-discovery-worker-89vr2              1/1     Running            0             8m37s
gpu-operator-resources   pod/nvidia-device-plugin-daemonset-bszjv                          0/1     Init:0/1           0             7m32s
kube-system              pod/coredns-d489fb88-5pbbx                                        1/1     Running            0             7m38s
kube-system              pod/calico-kube-controllers-7b476cc597-5jnrn                      1/1     Running            0             7m38s
gpu-operator-resources   pod/nvidia-operator-validator-9q2gx                               0/1     Init:0/4           0             7m24s
gpu-operator-resources   pod/gpu-operator-5dc6b8989b-bpkgs                                 1/1     Running            0             7m38s
gpu-operator-resources   pod/gpu-operator-node-feature-discovery-master-65c9bd48c4-29f7t   1/1     Running            0             7m38s
gpu-operator-resources   pod/nvidia-driver-daemonset-k7v56                                 1/1     Running            0             7m49s
gpu-operator-resources   pod/gpu-feature-discovery-l6zzx                                   0/1     PodInitializing    0             7m32s
gpu-operator-resources   pod/nvidia-dcgm-exporter-qbbzb                                    0/1     PodInitializing    0             7m32s
gpu-operator-resources   pod/nvidia-container-toolkit-daemonset-2pthk                      0/1     CrashLoopBackOff   4 (20s ago)   7m32s

On the containerd 1.6 release series this behavior started appearing with v1.6.9 (the v1.6.8 release seems to work fine). The offending change appears to be containerd/containerd@a91dd67, which changes how pod sandboxes are handled. It has also been backported to the 1.5.x releases, so the 1.5 containerd track should be affected as well.
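
To check whether a given node is actually running one of the affected containerd releases (v1.6.9 or later on the 1.6 track, or a 1.5.x release carrying the backport), something along these lines should work with the tooling MicroK8s already ships; the commands are standard, only the mapping onto the version ranges above is my reading of the issue:

# Print the client and server versions of the containerd bundled with MicroK8s.
microk8s ctr version

# The node's container runtime version is also shown in the CONTAINER-RUNTIME
# column of the wide node listing.
microk8s.kubectl get nodes -o wide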

I am attaching the debug logs from an older containerd where the deployment succeeds and from a newer one where it fails:
working-containerd.log
not-working-containerd.log

@klueska (Contributor) commented Dec 3, 2022

This is a known issue and will be resolved in the next release. See:
NVIDIA/gpu-operator#432 (comment)

@ktsakalozos (Author)

@klueska that's great news. Any ETA on the release?
