Failure to deploy on new containerd releases #39

Closed · ktsakalozos opened this issue Dec 3, 2022 · 2 comments

@ktsakalozos

When deploying the gpu-operator, the nvidia-container-toolkit pod is crashlooping. The container logs:

> sudo microk8s.kubectl logs -n gpu-operator-resources   pod/nvidia-container-toolkit-daemonset-2pthk
.....
time="2022-12-03T10:41:38Z" level=info msg="Setting up runtime"
time="2022-12-03T10:41:38Z" level=info msg="Starting 'setup' for containerd"
time="2022-12-03T10:41:38Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2022-12-03T10:41:38Z" level=info msg="Successfully parsed arguments"
time="2022-12-03T10:41:38Z" level=info msg="Loading config: /runtime/config-dir/containerd-template.toml"
time="2022-12-03T10:41:38Z" level=info msg="Successfully loaded config"
time="2022-12-03T10:41:38Z" level=info msg="Config version: 2"
time="2022-12-03T10:41:38Z" level=info msg="Updating config"
time="2022-12-03T10:41:38Z" level=info msg="Successfully updated config"
time="2022-12-03T10:41:38Z" level=info msg="Flushing config"
time="2022-12-03T10:41:38Z" level=info msg="Successfully flushed config"
time="2022-12-03T10:41:38Z" level=info msg="Sending SIGHUP signal to containerd"
time="2022-12-03T10:41:38Z" level=info msg="Successfully signaled containerd"
time="2022-12-03T10:41:38Z" level=info msg="Completed 'setup' for containerd"
time="2022-12-03T10:41:38Z" level=info msg="Waiting for signal"
time="2022-12-03T10:41:40Z" level=info msg="Cleaning up Runtime"
time="2022-12-03T10:41:40Z" level=info msg="Starting 'cleanup' for containerd"
time="2022-12-03T10:41:40Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2022-12-03T10:41:40Z" level=info msg="Successfully parsed arguments"
time="2022-12-03T10:41:40Z" level=info msg="Loading config: /runtime/config-dir/containerd-template.toml"
time="2022-12-03T10:41:40Z" level=info msg="Successfully loaded config"
time="2022-12-03T10:41:40Z" level=info msg="Config version: 2"
time="2022-12-03T10:41:40Z" level=info msg="Reverting config"
time="2022-12-03T10:41:40Z" level=info msg="Successfully reverted config"
time="2022-12-03T10:41:40Z" level=info msg="Flushing config"
time="2022-12-03T10:41:40Z" level=info msg="Successfully flushed config"
time="2022-12-03T10:41:40Z" level=info msg="Sending SIGHUP signal to containerd"
time="2022-12-03T10:41:40Z" level=info msg="Successfully signaled containerd"
time="2022-12-03T10:41:40Z" level=info msg="Completed 'cleanup' for containerd"
time="2022-12-03T10:41:40Z" level=info msg="Shutting Down"
time="2022-12-03T10:41:40Z" level=info msg="Completed nvidia-toolkit"

Full logs: https://paste.ubuntu.com/p/f3byGQ4kpJ/
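
For reference, the 'setup' and 'cleanup' steps above write the nvidia runtime into the containerd template, signal containerd, and then revert the change again. A quick way to see what is actually left on the host is something like the following; the path assumes a default MicroK8s snap install (it appears to be what gets mounted as /runtime/config-dir inside the pod), so adjust as needed:

# Check whether the nvidia runtime entries are still present in the containerd
# template on the host, or whether the 'cleanup' step has already reverted them.
# Path is an assumption based on a default MicroK8s snap layout.
sudo grep -n -A3 nvidia /var/snap/microk8s/current/args/containerd-template.toml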

Here are the pods currently running:

NAMESPACE                NAME                                                              READY   STATUS             RESTARTS      AGE
kube-system              pod/calico-node-mv8b4                                             1/1     Running            0             11m
gpu-operator-resources   pod/gpu-operator-node-feature-discovery-worker-89vr2              1/1     Running            0             8m37s
gpu-operator-resources   pod/nvidia-device-plugin-daemonset-bszjv                          0/1     Init:0/1           0             7m32s
kube-system              pod/coredns-d489fb88-5pbbx                                        1/1     Running            0             7m38s
kube-system              pod/calico-kube-controllers-7b476cc597-5jnrn                      1/1     Running            0             7m38s
gpu-operator-resources   pod/nvidia-operator-validator-9q2gx                               0/1     Init:0/4           0             7m24s
gpu-operator-resources   pod/gpu-operator-5dc6b8989b-bpkgs                                 1/1     Running            0             7m38s
gpu-operator-resources   pod/gpu-operator-node-feature-discovery-master-65c9bd48c4-29f7t   1/1     Running            0             7m38s
gpu-operator-resources   pod/nvidia-driver-daemonset-k7v56                                 1/1     Running            0             7m49s
gpu-operator-resources   pod/gpu-feature-discovery-l6zzx                                   0/1     PodInitializing    0             7m32s
gpu-operator-resources   pod/nvidia-dcgm-exporter-qbbzb                                    0/1     PodInitializing    0             7m32s
gpu-operator-resources   pod/nvidia-container-toolkit-daemonset-2pthk                      0/1     CrashLoopBackOff   4 (20s ago)   7m32s

On the containerd 1.6 release series this behavior started appearing with v1.6.9 (the v1.6.8 release seems to work fine). The offending change appears to be containerd/containerd@a91dd67, which changes how pod sandboxes are handled. It has also been backported to the 1.5.x releases, so the 1.5 containerd track should be affected as well.
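
To check whether a given node is actually running one of the affected containerd releases (v1.6.9 or later on the 1.6 track, or a 1.5.x release carrying the backport), something along these lines should work with the tooling MicroK8s already ships; the commands are standard, only the mapping onto the version ranges above is my reading of the issue:

# Print the client and server versions of the containerd bundled with MicroK8s.
microk8s ctr version

# The node's container runtime version is also shown in the CONTAINER-RUNTIME
# column of the wide node listing.
microk8s.kubectl get nodes -o wide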

I am attaching the debug logs from an older containerd where the deployment succeeds and from a newer one where it fails:
working-containerd.log
not-working-containerd.log

@klueska (Contributor) commented Dec 3, 2022

This is a known issue and will be resolved in the next release. See:
NVIDIA/gpu-operator#432 (comment)

@ktsakalozos (Author)

@klueska that's great news. Any ETA on the release?
