dfinit: restart-container-runtime restart loop #294

Open
kakkoyun opened this issue Jun 25, 2024 · 13 comments

@kakkoyun

Bug report:

The restart-container-runtime init container restarts the container runtime unconditionally. As a result, the pod stays unready (NotReady) indefinitely: the container runtime is restarted over and over, so the pod never reaches a stable, ready state.

The restart should happen only once, and only if the configuration has changed, so that the next loop iteration can be marked as ready.

Expected behavior:

Daemonset should start normally.

How to reproduce it:

values.yaml with

client:
  enable: true
  config:
    verbose: true
  dfinit:
    enable: true
    config:
      verbose: true
      containerRuntime:
        containerd:
          registries:
            - hostNamespace: docker.io
              serverAddr: https://index.docker.io
              capabilities: ["pull", "resolve"]
            - hostNamespace: ghcr.io
              serverAddr: https://ghcr.io
              capabilities: ["pull", "resolve"]

Environment:

  • Dragonfly version: v2.1.49 (chart v1.1.67)
  • OS: Linux
  • Kernel (e.g. uname -a): Linux jack-oneill 6.9.3-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 31 May 2024 15:14:45 +0000 x86_64 GNU/Linux
  • Others:

Logs:

kubectl describe pod:

Name:             dragonfly-client-bgw5s
Namespace:        dragonfly
Priority:         0
Service Account:  default
Node:             e2e/192.168.39.248
Start Time:       Tue, 25 Jun 2024 23:27:38 +0200
Labels:           app=dragonfly
                  component=client
                  controller-revision-hash=7745678fdd
                  pod-template-generation=3
                  release=dragonfly
Annotations:      checksum/config: ff55a474fbf9a76574ac381a461ce0b797d557fdf76759063600387a8eaf0831
                  kubectl.kubernetes.io/restartedAt: 2024-06-25T23:27:37+02:00
Status:           Pending
IP:               192.168.39.248
IPs:
  IP:           192.168.39.248
Controlled By:  DaemonSet/dragonfly-client
Init Containers:
  update-containerd-remove-registry-mirrors:
    Container ID:  containerd://bc64537fca42caecc1a78c1e9b3ae2e307ef1c9e27ef8876c6c34609367f2d6b
    Image:         python:3.12-slim
    Image ID:      docker.io/library/python@sha256:2fba8e70a87bcc9f6edd20dda0a1d4adb32046d2acbca7361bc61da5a106a914
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -cxe
      apt-get update && apt-get install -y jq
      pip install yq
      if tomlq -e '.plugins."io.containerd.grpc.v1.cri".registry.mirrors' /etc/containerd/config.toml > /dev/null; then
        tomlq -i -t 'del(.plugins."io.containerd.grpc.v1.cri".registry.mirrors)' /etc/containerd/config.toml
        nsenter -t 1 -m -- systemctl try-reload-or-restart containerd.service
        echo "containerd config updated"
      else
        echo "Entry does not exist, no changes made"
      fi
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 25 Jun 2024 23:27:38 +0200
      Finished:     Tue, 25 Jun 2024 23:27:42 +0200
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/containerd from containerd-config-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jj67n (ro)
  wait-for-scheduler:
    Container ID:  containerd://e79694fd393fd32ec9d161dbab25e1ff8cc023b5c92d227e096c849016f4fcd5
    Image:         docker.io/busybox:latest
    Image ID:      docker.io/library/busybox@sha256:9ae97d36d26566ff84e8893c64a6dc4fe8ca6d1144bf5b87b2b85a32def253c7
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      until nslookup dragonfly-scheduler.dragonfly.svc.cluster.local && nc -vz dragonfly-scheduler.dragonfly.svc.cluster.local 8002; do echo waiting for scheduler; sleep 2; done;
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 25 Jun 2024 23:27:43 +0200
      Finished:     Tue, 25 Jun 2024 23:27:43 +0200
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jj67n (ro)
  dfinit:
    Container ID:  containerd://935e0fe5c37bb824fc553fb717cbf40f80bf588b53fe1e01d1645b21ab1954c4
    Image:         docker.io/dragonflyoss/dfinit:v0.1.82
    Image ID:      docker.io/dragonflyoss/dfinit@sha256:4c793f262a9e1db6f55cedc2a7f322a1a01165fc50480b652637f5f7639b8192
    Port:          <none>
    Host Port:     <none>
    Args:
      --log-level=info
      --verbose
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 25 Jun 2024 23:27:44 +0200
      Finished:     Tue, 25 Jun 2024 23:27:44 +0200
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/containerd from containerd-config-dir (rw)
      /etc/dragonfly from dfinit-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jj67n (ro)
  restart-container-runtime:
    Container ID:  containerd://bd622dc89080b0d6d65e09078805f399a94bc7603123feae452cd463991441c9
    Image:         docker.io/busybox:latest
    Image ID:      docker.io/library/busybox@sha256:9ae97d36d26566ff84e8893c64a6dc4fe8ca6d1144bf5b87b2b85a32def253c7
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -cx
      nsenter -t 1 -m -- systemctl restart containerd.service
      echo "restart container"
    State:          Waiting
      Reason:       RunContainerError
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jj67n (ro)
Containers:
  client:
    Container ID:  
    Image:         docker.io/dragonflyoss/client:v0.1.82
    Image ID:      
    Ports:         4000/TCP, 4003/TCP, 4002/TCP, 4004/TCP
    Host Ports:    4000/TCP, 4003/TCP, 4002/TCP, 4004/TCP
    Args:
      --log-level=info
      --verbose
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  4Gi
    Requests:
      cpu:        0
      memory:     0
    Liveness:     exec [/bin/grpc_health_probe -addr=:4000] delay=15s timeout=1s period=10s #success=1 #failure=3
    Readiness:    exec [/bin/grpc_health_probe -addr=:4000] delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/dragonfly from config (rw)
      /var/lib/dragonfly/ from storage (rw)
      /var/log/dragonfly/dfdaemon/ from logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jj67n (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      dragonfly-client
    Optional:  false
  dfinit-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      dragonfly-dfinit
    Optional:  false
  containerd-config-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/containerd
    HostPathType:  DirectoryOrCreate
  storage:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/dragonfly/
    HostPathType:  DirectoryOrCreate
  logs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-jj67n:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              fal/group=default
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  41m   default-scheduler  Successfully assigned dragonfly/dragonfly-client-bgw5s to e2e
  Normal   Pulled     41m   kubelet            Container image "python:3.12-slim" already present on machine
  Normal   Created    41m   kubelet            Created container update-containerd-remove-registry-mirrors
  Normal   Started    41m   kubelet            Started container update-containerd-remove-registry-mirrors
  Normal   Pulled     41m   kubelet            Container image "docker.io/busybox:latest" already present on machine
  Normal   Created    41m   kubelet            Created container wait-for-scheduler
  Normal   Started    41m   kubelet            Started container wait-for-scheduler
  Normal   Pulled     41m   kubelet            Container image "docker.io/dragonflyoss/dfinit:v0.1.82" already present on machine
  Normal   Created    41m   kubelet            Created container dfinit
  Normal   Started    41m   kubelet            Started container dfinit
  Normal   Pulled     41m   kubelet            Container image "docker.io/busybox:latest" already present on machine
  Normal   Created    41m   kubelet            Created container restart-container-runtime
  Warning  Failed     41m   kubelet            Error: error reading from server: EOF

@kakkoyun kakkoyun added the bug Something isn't working label Jun 25, 2024
@gaius-qi
Member

@kakkoyun I will fix it. Thanks!

@kakkoyun
Author

> @kakkoyun I will fix it. Thanks!

Thank you 🙏

@kakkoyun
Author

This is my patch to make it work, but it's not production-worthy. Restarting the container runtime from inside a container is NOT a good idea, but I'm not sure there's any other way to do this.

Let's make sure there isn't any loop so that Kubernetes can schedule the pods eventually.

- op: remove
  path: /spec/template/spec/initContainers/3

- op: add
  path: /spec/template/spec/initContainers/-
  value:
    name: restart-container-runtime
    image: docker.io/busybox:latest
    command:
      - /bin/sh
      - -cx
      - |-
        if [ -f /var/lib/dragonfly/container-runtime-restarted ]; then
          echo "container runtime already restarted once"
          exit 0
        fi
        echo "restarting container runtime..."
        touch /var/lib/dragonfly/container-runtime-restarted
        nsenter -t 1 -m -- systemctl try-reload-or-restart containerd.service
        echo "restart container"
    securityContext:
      privileged: true
    volumeMounts:
    - name: storage
      mountPath: /var/lib/dragonfly
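
The marker file in the patch above prevents the loop, but because it lives on a hostPath volume it also suppresses every later restart, even after the configuration changes again. A minimal sketch of an alternative guard follows, assuming the init container can run Python (the `python:3.12-slim` image is already used by another init container in this pod). It stores a SHA-256 digest of the config and asks for a restart only when the digest differs. The names `restart_if_changed`, `stamp`, and `restart` are hypothetical and are not part of dfinit or the Helm chart.

```python
import hashlib
from pathlib import Path
from typing import Callable


def restart_if_changed(config: Path, stamp: Path, restart: Callable[[], None]) -> bool:
    """Call restart() only when the config's SHA-256 differs from the stored digest.

    Returns True if a restart was requested, False if it was skipped.
    """
    digest = hashlib.sha256(config.read_bytes()).hexdigest()
    if stamp.exists() and stamp.read_text().strip() == digest:
        return False  # config unchanged since the last recorded restart

    # Record the digest BEFORE restarting: restarting the runtime may kill
    # this container, so code placed after restart() might never run.
    stamp.parent.mkdir(parents=True, exist_ok=True)
    stamp.write_text(digest + "\n")
    restart()
    return True
```

In the init container, `restart` could wrap the same `nsenter -t 1 -m -- systemctl try-reload-or-restart containerd.service` command used in the patch, with `config` pointing at `/etc/containerd/config.toml` and `stamp` at a file under `/var/lib/dragonfly`.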

@kakkoyun
Author

One other issue is about registry.mirrors. I had to do this for GKE clusters.

# A Kubernetes DaemonSet patch to add initContainers to the dragonfly-client DaemonSet.
- op: add
  path: /spec/template/spec/initContainers/0
  value:
    name: update-containerd-remove-registry-mirrors
    image: python:3.12-slim
    securityContext:
      privileged: true
    volumeMounts:
      - name: containerd-config-dir
        mountPath: /etc/containerd
    # The command below is to remove the registry mirrors in the containerd config.toml file.
    # When config_path is defined, 'mirrors' cannot be specified for the registry entry.
    command:
      - /bin/sh
      - -cxe
      - |-
        apt-get update && apt-get install -y jq
        pip install yq
        if tomlq -e '.plugins."io.containerd.grpc.v1.cri".registry.mirrors' /etc/containerd/config.toml > /dev/null; then
          tomlq -i -t 'del(.plugins."io.containerd.grpc.v1.cri".registry.mirrors)' /etc/containerd/config.toml
          nsenter -t 1 -m -- systemctl try-reload-or-restart containerd.service
          echo "containerd config updated"
        else
          echo "Entry does not exist, no changes made"
        fi

@gaius-qi
Member

gaius-qi commented Jul 1, 2024

@kakkoyun Is it because containerd's configuration file has been changed incorrectly, causing containerd to fail to restart? If so, please provide me with the default configuration for GKE's containerd.

@gaius-qi
Member

gaius-qi commented Jul 4, 2024

> @kakkoyun Is it because containerd's configuration file has been changed incorrectly, causing containerd to fail to restart? If so, please provide me with the default configuration for GKE's containerd.

@kakkoyun Hey, can you provide the default configuration for GKE's containerd? I will fix it.

@kakkoyun
Author

kakkoyun commented Jul 8, 2024

Hey @gaius-qi, sorry for the delayed response. I was on PTO and away from the keyboard.

I can create another issue for the GKE-specific error if it would be clearer. Let me know.

But briefly: GKE injects a containerd entry that specifies registry.mirrors for docker.io and points to the Google Container Registry as a caching proxy. When dfinit (or previously dfdaemon) comes along and injects

[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"

containerd fails to start, which brings down the whole cluster, because apparently you can't specify both registry.mirrors and registry.config_path at the same time.

So, with the quick and dirty solution that I proposed in #294 (comment), I made it work. However, dfinit needs to be smarter and check for conflicts with the existing config.

Let me know if you need further explanation.

@gaius-qi
Member

gaius-qi commented Jul 9, 2024

@kakkoyun
Is the containerd configuration of GKE similar to Example A or Example B?

Example A:

[plugins."io.containerd.grpc.v1.cri".registry]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
      endpoint = ["https://registry-1.docker.io"]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."gcr.io"]
      endpoint = ["https://gcr.io"]

Example B:

[plugins."io.containerd.cri.v1.images".registry]
  [plugins."io.containerd.cri.v1.images".registry.mirrors]
    [plugins."io.containerd.cri.v1.images".registry.mirrors."docker.io"]
      endpoint = ["https://registry-1.docker.io"]
    [plugins."io.containerd.cri.v1.images".registry.mirrors."gcr.io"]
      endpoint = ["https://gcr.io"]

@kakkoyun
Author

kakkoyun commented Jul 9, 2024

@gaius-qi Yes, quite similar. Here is exactly how it looks:

[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  endpoint = ["https://mirror.gcr.io","https://registry-1.docker.io"]

@gaius-qi
Member

gaius-qi commented Jul 10, 2024

@kakkoyun
Can you provide your entire containerd config from before installing, the entire dfinit config, and the dfinit version?

If you don't know how to get the entire dfinit config and the dfinit version, you can give me the helm chart config.

@gaius-qi gaius-qi self-assigned this Jul 15, 2024
@gaius-qi
Member

gaius-qi commented Jul 15, 2024

> @kakkoyun Can you provide your entire containerd config from before installing, the entire dfinit config, and the dfinit version?
>
> If you don't know how to get the entire dfinit config and the dfinit version, you can give me the helm chart config.

@kakkoyun Can you help me by providing your entire containerd config from before installing, the entire dfinit config, and the dfinit version?

I want to fix the bug. Thanks!

@kakkoyun
Author

@gaius-qi I'll do it as soon as I've some free cycles.

@gaius-qi
Member

> @gaius-qi I'll do it as soon as I've some free cycles.

@kakkoyun Thanks!
