release 1.10/ack improvement highcpu consumption #426
Conversation
Codecov Report

Attention: …

Additional details and impacted files

@@            Coverage Diff             @@
##        release-1.10     #426    +/-  ##
============================================
+ Coverage     45.57%   51.06%   +5.49%
============================================
  Files            29       30       +1
  Lines          1953     2197     +244
============================================
+ Hits            890     1122     +232
+ Misses         1008     1001       -7
- Partials         55       74      +19

☔ View full report in Codecov by Sentry. |
So I added this because we ran into an issue where the time to spin up a Knative Service from 0 replicas was longer than the default AckWait. The dispatcher doesn't know when the AckWait will expire. In an ideal scenario, the dispatcher would know about the AckWait and dispatch the actual HTTP request using a context with a timeout. I wouldn't want to merge this PR as-is until we have a solution for this 0-replica cold-start issue, but happy to hear any alternative solutions? |
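To make the "context with timeout" idea concrete, here is a minimal sketch, assuming the dispatcher is told the consumer's AckWait up front; the function name, the one-second headroom and the hard-coded content type are illustrative, not the dispatcher's actual API:

```go
package dispatcher

import (
	"bytes"
	"context"
	"net/http"
	"time"
)

// dispatchWithAckWait sends the subscriber request with a deadline derived from
// the consumer's AckWait, so the request is cancelled before JetStream treats
// the message as unacknowledged and redelivers it.
func dispatchWithAckWait(ctx context.Context, ackWait time.Duration, url string, body []byte) (*http.Response, error) {
	// Leave a little headroom to ack/nak after the response comes back.
	ctx, cancel := context.WithTimeout(ctx, ackWait-time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return http.DefaultClient.Do(req)
}
```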
Ah, another point... I'm not 100% sure, but I feel like … If this PR changes to my suggestion of setting a timeout on the request context, the jitter also needs to take into account that, from an … |
Regarding this part, you can set AckWait bigger and set a request timeout. E.g. if it takes 20 seconds to spin up a service and 30 seconds to process a request, then you set delivery.timeout to 35 sec and the Channel AckWait to 35 * (number of retries) + delta, e.g. 120 sec. |
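As a worked version of that arithmetic (the retry count of 3 and the 15-second delta are made-up values chosen to land on the 120-second figure above):

```go
package dispatcher

import "time"

// AckWait must outlive every delivery attempt the subscription may make:
// AckWait = delivery.timeout * retries + delta.
const (
	deliveryTimeout = 35 * time.Second
	retries         = 3
	delta           = 15 * time.Second
)

// 35s * 3 + 15s = 120s
const ackWait = deliveryTimeout*retries + delta
```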
@dan-j, overall, as I understand it, you need to calculate AckWait based on the delivery.retry and delivery.timeout you set on your channel's subscription, or just set it big enough, e.g. 300s, and control slow consumers with MaxAckPending. Having that timer, you basically set AckWait to unlimited and add additional load on the CPU. In my case I have a broker and around 30 triggers, which leads to 30 consumers; with MaxAckPending=1000 it becomes 30,000 timers in the worst case, which creates a very big load on the CPU. |
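For reference, the per-message keep-alive being described is roughly the following pattern (a sketch, assuming this is what the removed timer code did; names and the interval are illustrative):

```go
package dispatcher

import (
	"context"
	"time"

	"github.com/nats-io/nats.go"
)

// keepInProgress periodically calls InProgress() so the server resets the
// message's AckWait and does not redeliver it while the HTTP request is still
// in flight. One such loop per in-flight message is what adds up to ~30,000
// timers with 30 consumers at MaxAckPending=1000.
func keepInProgress(ctx context.Context, msg *nats.Msg, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = msg.InProgress()
		}
	}
}
```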
I'm not sure about that; it is a push-based consumer, so if I scale the dispatcher it round-robins messages one by one in turn. |
Yeah, this is what I'm getting at. We could do that and merge this PR, but it would still introduce bugs in our environment, because the request dispatcher doesn't set any timeout/deadline on the request's context. I'm happy to remove the ticker; your issue is a real one which I agree needs fixing, we just need a solution which works for both of us. One concern I've always had with the operator is how we handle retries. At the moment it's really confusing because we retry in multiple places: 1) at the JS layer via a consumer's … It would be nice to solve both issues here. The simplest option is to remove redelivery from the JS layer and reuse the … The more robust solution is to remove … |
* implemented update subscription
* do not call addstream if it is existing, to prevent error propagation
* added comments
* added reconciler test
* added reconciler tests
* removed unused types
* added check for err
…er are not come from Subscription
…ment-highcpu-consumption' into feature/release-1.10/ack-improvement-highcpu-consumption
Hey @dan-j, I've done an initial implementation (have not tested it though); your review is welcome. |
@dan-j, here are the results of the load test. The scenario is pretty simple: … |
Awesome stuff! Thanks for the graphs too 👍 It's a shame we've had to reimplement the whole message_dispatcher logic, but from our earlier conversations this was going to be the only way. I'm happy to merge this, but I need to fix my permissions on the GitHub org, so I can't do it just yet. |
@dan-j, now it is green, ready to merge. |
@pierDipi could we have this merged? I will try to sort my permissions on knative/org or wherever it is this week |
/assign @pierDipi |
Hi @astelmashenko, … |
@creydr, it is because we are using 1.10 in production. I'll backport it to later versions, e.g. with a cherry-pick, or manually merge it into main. |
@astelmashenko can we update the PR to go into main? |
@dan-j, I would not do that; main was updated and, AFAIK, is not directly compatible with this PR. |
Ah, fair enough. Let's try and get this all wrapped up this week. I've created knative/community#1479 to add us both as approvers. @zhaojizhuang, at the moment you're the only approver of this repo, or there's probably an eventing admin who can look. Once this is in, I'll get the changes onto latest too. |
/lgtm |
@astelmashenko and I are now approvers, but since this is being merged into knative-extensions:release-1.10, the approvers update isn't on that branch yet. Can someone from @knative-extensions/eventing-writers please approve this PR? |
@dan-j we can backport the approvers update, feel free to open a PR |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: astelmashenko, dan-j, pierDipi
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. |
Merged commit 711fe4f into knative-extensions:release-1.10
/cherrypick main |
@astelmashenko: #426 failed to apply on top of branch "main":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Fixes #
High CPU consumption, introduced by PR 41b2d76 to solve JetStream redelivery. Redelivery should be controlled by the AckWait consumer configuration instead.
cc @dan-j
Proposed Changes
Removed the code which created a timer for each incoming message to mark it InProgress and prevent redelivery from the JetStream stream.
Big change: implemented retries based on the JetStream MaxDeliver (deliveryMax) feature instead of the in-memory Go retry module.
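Roughly, letting JetStream own redelivery means the consumer config bounds the attempts itself; a minimal sketch, assuming the settings are derived from the subscription's retry count and request timeout (the helper, the five-second headroom and the MaxAckPending value are illustrative, not the exact code in this PR):

```go
package dispatcher

import (
	"time"

	"github.com/nats-io/nats.go"
)

// consumerConfig bounds redelivery in JetStream itself instead of an
// in-memory retry loop: MaxDeliver caps the attempts and AckWait covers a
// single delivery attempt.
func consumerConfig(durable string, retries int, requestTimeout time.Duration) *nats.ConsumerConfig {
	return &nats.ConsumerConfig{
		Durable:       durable,
		AckPolicy:     nats.AckExplicitPolicy,
		MaxDeliver:    retries + 1, // first attempt + retries
		AckWait:       requestTimeout + 5*time.Second,
		MaxAckPending: 1000,
	}
}
```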
Release Note