-
Notifications
You must be signed in to change notification settings - Fork 332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
webhook EOF errors #1509
Comments
cc @vaikas |
I already made two changed to try and mitigate this:
|
From the Go
I wondered if this might be our problem, but we call Lines 207 to 210 in f1b8240
ErrServerClosed
|
I started a thread in |
Alright, so one of the things I have been wondering is around Keep-Alives and whether that might be a reason the API server takes so long to realize an endpoint is no longer good. I noticed the following comment on Go's
... emphasis mine. Now it turns out that Go's http logic is already smart about disabling keep-alive when it is shutting down: func (s *Server) doKeepAlives() bool {
return atomic.LoadInt32(&s.disableKeepAlives) == 0 && !s.shuttingDown()
} However, when we lame duck we aren't yet I believe testing this should be as simple as calling Line 221 in 25f2aa6 However, if this works then we should probably consider doing something similar across our various dataplane components as well. cc @tcnghia |
We believe that @vaikas is still seeing some webhook failures that should be synced past the above change, so we aren't out of the woods yet. |
This implements a new `http.Handler` called `Drainer`, which is intended to wrap some inner `http.Handler` business logic with a new outer handler that can respond to Kubelet probes (successfully until told to "Drain()"). This takes over the webhook's relatively new probe handling and lame duck logic with one key difference. Previously the webhook waited for a fixed period after SIGTERM before exitting, but the new logic waits for this same grace period AFTER THE LAST REQUEST. So if the handler keeps getting (non-probe) requests, the timer will continually reset, and once it stops receiving requests for the configured grace period, "Drain()" will return and the webhook will exit. The goal of this work is to try to better cope with what we believe to be high tail latencies of the API server seeing that a webhook replica is shutting down. Related: knative#1509
This implements a new `http.Handler` called `Drainer`, which is intended to wrap some inner `http.Handler` business logic with a new outer handler that can respond to Kubelet probes (successfully until told to "Drain()"). This takes over the webhook's relatively new probe handling and lame duck logic with one key difference. Previously the webhook waited for a fixed period after SIGTERM before exitting, but the new logic waits for this same grace period AFTER THE LAST REQUEST. So if the handler keeps getting (non-probe) requests, the timer will continually reset, and once it stops receiving requests for the configured grace period, "Drain()" will return and the webhook will exit. The goal of this work is to try to better cope with what we believe to be high tail latencies of the API server seeing that a webhook replica is shutting down. Related: knative#1509
The new "Drainer" handler in the linked PR uses a dynamic drain timeout where it waits for at least |
* Implement a new shared "Drainer" handler. This implements a new `http.Handler` called `Drainer`, which is intended to wrap some inner `http.Handler` business logic with a new outer handler that can respond to Kubelet probes (successfully until told to "Drain()"). This takes over the webhook's relatively new probe handling and lame duck logic with one key difference. Previously the webhook waited for a fixed period after SIGTERM before exitting, but the new logic waits for this same grace period AFTER THE LAST REQUEST. So if the handler keeps getting (non-probe) requests, the timer will continually reset, and once it stops receiving requests for the configured grace period, "Drain()" will return and the webhook will exit. The goal of this work is to try to better cope with what we believe to be high tail latencies of the API server seeing that a webhook replica is shutting down. Related: #1509 * Switch to RWLock
This is attempting to try and combat the webhook Post EOF errors we have been seeing intermittently: knative/pkg#1509
This is attempting to try and combat the webhook Post EOF errors we have been seeing intermittently: knative/pkg#1509
One of the two flakes in eventing yesterday was this:
I need to track down whether that CI run had the drainer stuff in yet. |
Hmm looks like it was at Link to the full run: https://prow.knative.dev/view/gcs/knative-prow/logs/ci-knative-eventing-continuous/1284652424807583744 |
Tomorrow I'm going to add a workaround for this by using Retry loops, just like we do for when pod creates (or configmaps fail) with very specific error cases. This will be a workaround for now hopefully so that we'll be able to not get tests that fail if they hit this condition. Reconcilers should be resilient to this, but tests that use create*orFail will not, so this will at least cut down some noise. I'll make sure to log these so at least we see how often this happens. |
Still happening in serving too I believe: https://prow.knative.dev/view/gcs/knative-prow/logs/ci-knative-serving-continuous/1285454712115564545 |
It definitely is. Open to suggestions on how we might keep chipping away at this / instrument / debug / ... |
I had an idea chatting with @markusthoemmes a bit on slack. Right now there's a lot of machinery standing between us and the EOFs that it is making them harder to debug than they should be. The thought is: can we have a more dedicated probing process that we can use to reproduce this (maybe more consistently)? |
Interestingly one way to reproduce this consistently is to panic in the webhook when handling a request (ie. the defaulting logic) golang's http server recovers these panics and logs an error. We should potentially recover ourselves so we can return an 'internal server' type error |
As part of knative/serving#11225 I encountered EOF's/context deadline exceeded. After adding some tracing I've seen our web hooks respond <10ms but then the API server still returns a timeout. So I don't think this is isolated to just our web hooks. My next steps is to start testing with a non-managed k8s service to be able to get API server logs |
Serving has done this as well, the chaosduck killing the webhook causes errors being tracked in knative/pkg#1509
Mitigation for knative/pkg#1509. Same fix was used in eventing core to mitigate webhook EOF errors. Signed-off-by: Pierangelo Di Pilato <[email protected]>
Mitigation for knative/pkg#1509. Same fix was used in eventing core to mitigate webhook EOF errors. Signed-off-by: Pierangelo Di Pilato <[email protected]>
Mitigation for knative/pkg#1509. Same fix was used in eventing core to mitigate webhook EOF errors. Signed-off-by: Pierangelo Di Pilato <[email protected]>
Mitigation for knative/pkg#1509. Same fix was used in eventing core to mitigate webhook EOF errors. Signed-off-by: Pierangelo Di Pilato <[email protected]>
Mitigation for knative/pkg#1509. Same fix was used in eventing core to mitigate webhook EOF errors. Signed-off-by: Pierangelo Di Pilato <[email protected]>
This test sometimes fails due to [1]. [1] knative/pkg#1509 Signed-off-by: Pierangelo Di Pilato <[email protected]>
This test sometimes fails due to [1]. [1] knative/pkg#1509 Signed-off-by: Pierangelo Di Pilato <[email protected]>
This test sometimes fails due to [1]. [1] knative/pkg#1509 Signed-off-by: Pierangelo Di Pilato <[email protected]>
/area API
/area test-and-release
/kind bug
Expected Behavior
When we run our e2e tests with chaos there are no failures due to the webhook shutting down.
Actual Behavior
We intermittently see failures like this:
Post https://eventing-webhook.knative-eventing-qh1fjbnng8.svc:443/resource-conversion?timeout=30s: EOF
ever since we enabled webhook chaos.The text was updated successfully, but these errors were encountered: