
Startup probe failed: HTTP probe failed with statuscode: 500 #291

Open
robcharlwood opened this issue Sep 13, 2023 · 9 comments
@robcharlwood

robcharlwood commented Sep 13, 2023

Hi

We are seeing a problem with the latest version of rancher-webhook (0.3.5) running alongside the latest Rancher (2.7.6). In both the Rancher HA cluster and the imported K3s and GKE downstream clusters, the webhook pod reports a warning about startup probe checks failing with status code 500.

Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  15s               default-scheduler  Successfully assigned cattle-system/rancher-webhook-998454b77-nvch5 to <redacted>
  Normal   Pulled     14s               kubelet            Container image "rancher/rancher-webhook:v0.3.5" already present on machine
  Normal   Created    14s               kubelet            Created container rancher-webhook
  Normal   Started    14s               kubelet            Started container rancher-webhook
  Warning  Unhealthy  5s (x2 over 10s)  kubelet            Startup probe failed: HTTP probe failed with statuscode: 500

If left for long enough, it eventually starts failing with a liveness probe error:

Events:
  Type     Reason     Age                 From     Message
  ----     ------     ----                ----     -------
  Warning  Unhealthy  41m (x52 over 19h)  kubelet  Liveness probe failed: Get "https://XXX.XXX.XXX.XXX:9443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

This only ever surfaces as a warning and the pod never actually becomes unhealthy. The pod also does not produce any useful logs:

time="2023-09-13T10:22:52Z" level=info msg="Rancher-webhook version v0.3.5 (2e89c65) is starting"
time="2023-09-13T10:22:52Z" level=info msg="Active TLS secret cattle-system/cattle-webhook-tls (ver=5511970) (count 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=XXXXXXXXXXXXXXXXXXXXXXXXXXXX]"
time="2023-09-13T10:22:52Z" level=info msg="Listening on :9443"
time="2023-09-13T10:22:52Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=Cluster controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=ClusterRoleTemplateBinding controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=GlobalRole controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2023-09-13T10:22:52Z" level=info msg="Sleeping for 15 seconds then applying webhook config"
time="2023-09-13T10:22:52Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=RoleBinding controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=PodSecurityAdmissionConfigurationTemplate controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting provisioning.cattle.io/v1, Kind=Cluster controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting management.cattle.io/v3, Kind=ProjectRoleTemplateBinding controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=Role controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting management.cattle.io/v3, Kind=RoleTemplate controller"
time="2023-09-13T10:22:53Z" level=info msg="Updating TLS secret for cattle-system/cattle-webhook-tls (count: 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=XXXXXXXXXXXXXXXXXXXXXXXXXXXX]"

This Rancher instance is deployed as follows:

  • Private GKE cluster running in Google Cloud with etcd encryption using a custom KMS key
  • The cluster is running Kubernetes 1.26.4-gke.500
  • We allow GKE control plane ingress to the webhook on TCP port 9443 in our firewall rules, as per the docs (an example of the rule we use is shown below this list)
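
For reference, the rule we created is roughly of this shape (the rule name, network, control plane CIDR and node tag below are placeholders, not our real values):

# allow the GKE control plane to reach the webhook on TCP 9443
gcloud compute firewall-rules create allow-gke-master-to-webhook \
  --network my-vpc \
  --direction INGRESS \
  --source-ranges 172.16.0.0/28 \
  --target-tags my-gke-node-tag \
  --allow tcp:9443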

Any help or advice on this issue would be appreciated.

Many thanks!

@KevinJoiner
Contributor

I don't have any immediate solutions to your problem, but it looks like the root cause is that the kube-apiserver cannot communicate with the webhook container running on the cluster.

To verify that the problem is not with the webhook itself, you can check that the webhook configuration was created successfully:
kubectl get validatingwebhookconfigurations rancher.cattle.io -o yaml
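
If it exists, each webhook entry's clientConfig should point at the rancher-webhook service in cattle-system and carry a non-empty caBundle, roughly along these lines (the exact path and port can differ between versions):

webhooks:
  - clientConfig:
      caBundle: <non-empty base64 blob>
      service:
        name: rancher-webhook
        namespace: cattle-system
        path: /v1/webhook/validation
        port: 443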

The other option, which is available on Rancher v2.7-head but has not been released yet, would be to have the webhook run on port 443:
rancher/rancher#41142 (comment)

@robcharlwood
Author

@KevinJoiner Thanks! I will check this and get back to you.

@robcharlwood
Author

@KevinJoiner - So I ran the suggested command and YAML was returned successfully. I can't see anything problematic in the output. Is there anything specific I should be looking for?

@KevinJoiner
Contributor

@robcharlwood Nothing specific, no. If the resource exists and the webhook is not logging any errors, we can have higher confidence that the problem is with the connection between the kube-apiserver and the rancher-webhook pod.

  1. I would double-check the steps for adding the firewall rule to make sure it is configured correctly, since the symptoms seem to match https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#api_request_that_triggers_admission_webhook_timing_out

  2. You can try editing the webhook deployment to remove the startupProbe and livenessProbe and see if things start to work (a patch along those lines is sketched below). I don't expect this to fix the problem, since other requests will most likely still time out when you try to create a RoleTemplate, but if it does work, we might have a bug on our side.
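
A quick way to try the second point is a JSON patch against the deployment, something like this (it assumes the webhook container is the first container in the pod spec):

kubectl -n cattle-system patch deployment rancher-webhook --type=json -p '[
  {"op": "remove", "path": "/spec/template/spec/containers/0/startupProbe"},
  {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}
]'

Keep in mind that a later chart upgrade may put the probes back.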

@robcharlwood
Author

@KevinJoiner Thanks! I will investigate and report back!

@danipanz

danipanz commented Dec 14, 2023

We are experiencing the same issue.

  • Rancher v2.7.6 deployed on K3s 1.25.10
  • downstream cluster: vanilla Kubernetes 1.28.2 + Cilium (we also tried Calico)

Firewall rules allow all communication between the nodes (trusted network).

@danipanz

Adding some extra info:

  • We reconfigured and tried to import a new mini Kubernetes cluster (1 master + 1 worker) multiple times with different Kubernetes versions (1.28.2, 1.25.12, 1.24.4). All tests failed.
  • The two machines we ran our tests on had already been successfully imported previously (Rancher 2.5 and Kubernetes 1.24.4).
  • We created a custom RoleTemplate and assigned it to a user on the downstream cluster, and it seemed to work without any issue (an example of the kind of RoleTemplate we tried is shown after this list).
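
For context, the RoleTemplate we tested was along these lines, created against the Rancher management cluster (the name and rules here are just an example, not what we actually used):

apiVersion: management.cattle.io/v3
kind: RoleTemplate
metadata:
  name: test-view-pods
displayName: Test - View Pods
context: cluster
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]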

For the time being we will try removing the startupProbe and livenessProbe.

@Vox1984

Vox1984 commented Jan 17, 2024

I have the very same issue:

Rancher UI: 2.7.9
RKE version: v1.5.1   
K8s: v1.25.16

kubectl describe pod -n cattle-system rancher-webhook-7879bb6c5-vb7ss

Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  85s (x28071 over 2d1h)  kubelet  Startup probe failed: HTTP probe failed with statuscode: 500

The problem started after upgrading from the previous version.

@rahadiangg

I have the same issue:

Rancher chart: rancher-2.8.1
Rancher webhook chart: rancher-webhook-103.0.1+up0.4.2
Kubernetes: v1.27.13

When I try to hit the /healthz endpoint, I get this message:

[-]Config Applied failed: reason withheld
healthz check failed

I'm struggling with the "reason withheld" error because I can't find out what the root cause is.
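
For reference, I'm checking the endpoint roughly like this (port-forwarding to the webhook and skipping TLS verification):

kubectl -n cattle-system port-forward deploy/rancher-webhook 9443:9443
curl -ks https://localhost:9443/healthz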
