Keel container fails liveness probes, causing crash loop #790

Open
DrJosh9000 opened this issue Dec 30, 2024 · 2 comments

@DrJosh9000
What happens

When I install Keel using the latest Helm chart, it is repeatedly killed by kubelet because it does not respond to liveness probes (and also fails readiness probes).

Pod events:

Type     Reason     Age                    From               Message
----     ------     ----                   ----               -------
Normal   Scheduled  5m57s                  default-scheduler  Successfully assigned kube-system/keel-bd5775d8b-kwjwd to blackbox
Normal   Pulled     5m54s                  kubelet            Successfully pulled image "keelhq/keel:0.20.0" in 1.777s (1.777s including waiting). Image size: 59399804 bytes.
Normal   Pulling    4m56s (x2 over 5m56s)  kubelet            Pulling image "keelhq/keel:0.20.0"
Normal   Created    4m55s (x2 over 5m54s)  kubelet            Created container keel
Normal   Pulled     4m55s                  kubelet            Successfully pulled image "keelhq/keel:0.20.0" in 1.741s (1.741s including waiting). Image size: 59399804 bytes.
Normal   Started    4m54s (x2 over 5m54s)  kubelet            Started container keel
Warning  Unhealthy  3m57s (x6 over 5m17s)  kubelet            Liveness probe failed: Get "http://10.42.3.186:9300/healthz": dial tcp 10.42.3.186:9300: connect: connection refused
Normal   Killing    3m57s (x2 over 4m57s)  kubelet            Container keel failed liveness probe, will be restarted
Warning  Unhealthy  47s (x19 over 5m17s)   kubelet            Readiness probe failed: Get "http://10.42.3.186:9300/healthz": dial tcp 10.42.3.186:9300: connect: connection refused

Container log:

time="2024-12-30T06:10:47Z" level=info msg="extension.credentialshelper: helper registered" name=aws
time="2024-12-30T06:10:47Z" level=info msg="bot: registered" name=slack
time="2024-12-30T06:10:49Z" level=info msg="extension.credentialshelper: helper registered" name=gcr
time="2024-12-30T06:10:50Z" level=info msg="keel starting..." arch=amd64 build_date=2024-12-22T191328Z go_version=go1.23.4 os=linux revision= version=
time="2024-12-30T06:10:54Z" level=info msg="initializing database" database_path=/data/keel.db type=sqlite3
time="2024-12-30T06:10:54Z" level=info msg="extension.notification.auditor: audit logger configured" name=auditor
time="2024-12-30T06:10:54Z" level=info msg="notificationSender: sender configured" sender name=auditor
time="2024-12-30T06:10:54Z" level=info msg="provider.kubernetes: using in-cluster configuration"
time="2024-12-30T06:10:55Z" level=info msg="provider.defaultProviders: provider 'kubernetes' registered"
time="2024-12-30T06:10:55Z" level=info msg="extension.credentialshelper: helper registered" name=secrets
time="2024-12-30T06:10:56Z" level=info msg="trigger.poll.manager: polling trigger configured"
time="2024-12-30T06:10:56Z" level=info msg="bot.slack.Configure(): SLACK_BOT_TOKEN must have the prefix \"xoxb-\", skip bot configuration."
time="2024-12-30T06:10:56Z" level=error msg="bot.Run(): can not get configuration for bot [slack]"
time="2024-12-30T06:10:56Z" level=info msg=started context=buffer
time="2024-12-30T06:10:56Z" level=info msg=started context=watch resource=daemonsets
time="2024-12-30T06:10:56Z" level=info msg=started context=watch resource=cronjobs
time="2024-12-30T06:10:56Z" level=info msg=started context=watch resource=statefulsets
time="2024-12-30T06:10:56Z" level=info msg=started context=watch resource=deployments
time="2024-12-30T06:10:57Z" level=info msg="authentication is not enabled, admin HTTP handlers are not initialized"
time="2024-12-30T06:10:57Z" level=info msg="webhook trigger server starting..." port=9300
stream closed EOF for kube-system/keel-868bc4bd65-kck5r (keel)

With debug logging enabled:

time="2024-12-30T05:55:44Z" level=info msg="extension.credentialshelper: helper registered" name=aws
time="2024-12-30T05:55:45Z" level=info msg="bot: registered" name=slack
time="2024-12-30T05:55:47Z" level=info msg="extension.credentialshelper: helper registered" name=gcr
time="2024-12-30T05:55:47Z" level=info msg="keel starting..." arch=amd64 build_date=2024-12-22T191328Z go_version=go1.23.4 os=linux revision= version=
time="2024-12-30T05:55:51Z" level=info msg="initializing database" database_path=/data/keel.db type=sqlite3
time="2024-12-30T05:55:51Z" level=debug msg="extension.notification: sender registered" name=auditor
time="2024-12-30T05:55:51Z" level=info msg="extension.notification.auditor: audit logger configured" name=auditor
time="2024-12-30T05:55:51Z" level=info msg="notificationSender: sender configured" sender name=auditor
time="2024-12-30T05:55:52Z" level=info msg="provider.kubernetes: using in-cluster configuration"
time="2024-12-30T05:55:53Z" level=info msg="provider.defaultProviders: provider 'kubernetes' registered"
time="2024-12-30T05:55:53Z" level=info msg="extension.credentialshelper: helper registered" name=secrets
time="2024-12-30T05:55:53Z" level=info msg="trigger.poll.manager: polling trigger configured"
time="2024-12-30T05:55:53Z" level=info msg="bot.slack.Configure(): SLACK_BOT_TOKEN must have the prefix \"xoxb-\", skip bot configuration."
time="2024-12-30T05:55:53Z" level=error msg="bot.Run(): can not get configuration for bot [slack]"
time="2024-12-30T05:55:54Z" level=info msg=started context=watch resource=daemonsets
time="2024-12-30T05:55:54Z" level=info msg=started context=watch resource=deployments
time="2024-12-30T05:55:54Z" level=info msg=started context=watch resource=statefulsets
time="2024-12-30T05:55:54Z" level=info msg=started context=buffer
time="2024-12-30T05:55:54Z" level=info msg=started context=watch resource=cronjobs
time="2024-12-30T05:55:54Z" level=info msg="authentication is not enabled, admin HTTP handlers are not initialized"
time="2024-12-30T05:55:55Z" level=info msg="webhook trigger server starting..." port=9300
time="2024-12-30T05:56:09Z" level=debug msg="added daemonset svclb-keel-fca6ce95" context=translator
time="2024-12-30T05:56:09Z" level=debug msg="added statefulset redis-node" context=translator
time="2024-12-30T05:56:09Z" level=debug msg="added daemonset svclb-traefik-c45d4f2b" context=translator
time="2024-12-30T05:56:09Z" level=debug msg="added daemonset engine-image-ei-51cc7b9c" context=translator
time="2024-12-30T05:56:09Z" level=debug msg="added daemonset longhorn-csi-plugin" context=translator
time="2024-12-30T05:56:09Z" level=debug msg="added daemonset longhorn-manager" context=translator
time="2024-12-30T05:56:09Z" level=debug msg="added statefulset ts-atwarrior-tailscale-frontend-kpzgx" context=translator
time="2024-12-30T05:56:09Z" level=debug msg="added statefulset ts-longhorn-tailscale-frontend-9wkvn" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment agent-stack-k8s" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment cnpg-cloudnative-pg" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment atwarrior" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment coredns" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment keel" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment local-path-provisioner" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment metrics-server" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment traefik" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment csi-attacher" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment csi-provisioner" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment csi-resizer" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment csi-snapshotter" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment longhorn-driver-deployer" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment longhorn-ui" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment system-upgrade-controller" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="added deployment operator" context=translator
time="2024-12-30T05:56:10Z" level=debug msg="updated deployment keel" context=translator
stream closed EOF for kube-system/keel-bd5775d8b-kwjwd (keel)

How to replicate

I ran the following:

helm repo add keel https://keel-hq.github.io/keel/
helm repo update
helm upgrade --install keel --namespace=kube-system keel/keel

Other notes

To check whether this could be caused by mysteriously broken networking on my node/cluster, I ran an nginx deployment with 1 replica and a liveness config similar to what the Helm chart configures for Keel (probing /), and it started and ran successfully.
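
For reference, a minimal sketch of the control deployment I used (the name and probe timings here are illustrative, not copied from the Keel chart):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-probe-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-probe-test
  template:
    metadata:
      labels:
        app: nginx-probe-test
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
          # liveness probe shaped like the Keel one, but pointing at /
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 30
            timeoutSeconds: 10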

@david-garcia-garcia
Collaborator

@DrJosh9000 I'm running the exact same versions here without issues.

Can you try to manually remove the liveness probe in the deployment so that the pod keeps running, and then port forward 9300 to locally test the probe?
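
Something along these lines should do it (a sketch; it assumes the Deployment is named keel in kube-system and that keel is the first container in the pod spec):

# drop the liveness probe so kubelet stops restarting the container
kubectl -n kube-system patch deployment keel --type json \
  -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
# forward the webhook/probe port and hit the health endpoint directly
kubectl -n kube-system port-forward deploy/keel 9300:9300 &
curl -v http://localhost:9300/healthz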

You should be getting something like:

[screenshot of the /healthz response]

Also try enabling the admin UI and see if you can connect to it (sample values.yaml for the helm chart):

basicauth:
  enabled: "${chart_values_basicauth__enabled}"
  user: "${chart_values_basicauth__user}"
  password: "${chart_values_basicauth__password}"
image:
  repository: "keelhq/keel"
  tag: "${chart_values_image__tag}"
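
Then upgrade the release with those values (assuming they are saved as values.yaml):

helm upgrade --install keel --namespace=kube-system keel/keel -f values.yaml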

It looks to me more like a networking issue, but I can't think of a simple explanation for it. Networking can get as complex as you want.

@DrJosh9000
Author

I tried experimenting again. Enabling the admin UI didn't seem to make a difference, but I noticed the pod was hitting its resource limits. Bumping the resources let it start up quickly enough to pass the probes:

resources:
  limits:
    cpu: 1000m
    memory: 256Mi
  requests:
    cpu: 500m
    memory: 128Mi
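
These go under the chart's resources value; assuming the standard resources block, the equivalent --set flags would be something like:

helm upgrade --install keel --namespace=kube-system keel/keel \
  --set resources.requests.cpu=500m \
  --set resources.requests.memory=128Mi \
  --set resources.limits.cpu=1000m \
  --set resources.limits.memory=256Mi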

With these limits it used around 650m CPU while starting up, and nearly all 128Mi of the memory request. I started looking for the reason why, and that's when I noticed that the Keel container image is only built for amd64. It's probably too slow to start up within the default 100m CPU limit because I'm on ARM64 and have qemu-user-static + binfmt-support for transparent emulation!
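
If anyone else wants to confirm the same mismatch, a quick check (sketch; the second step assumes Docker with buildx available locally):

# node architectures
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.architecture}{"\n"}{end}'
# platforms published for the Keel image
docker buildx imagetools inspect keelhq/keel:0.20.0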
