
release 1.10/ack improvement: high CPU consumption #426

Conversation

astelmashenko
Member

@astelmashenko astelmashenko commented Sep 5, 2023

Fixes #

High CPU consumption, introduced by 41b2d76 to solve JetStream redelivery. Redelivery should instead be controlled by the AckWait consumer configuration.

cc @dan-j

Proposed Changes

Removed the code that created a timer for each incoming message to mark it InProgress and prevent redelivery from the JetStream stream.
Big change: implemented retries based on the JetStream MaxDeliver feature instead of the in-memory Go retry module (see the sketch below).

  • 🐛 Fix bug
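
As a rough sketch (not the PR's actual code) of what pushing retries down to JetStream can look like with nats.go; the stream/consumer names and values are only illustrative:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Redelivery is handled by JetStream itself: AckWait bounds each attempt,
	// MaxDeliver bounds the number of attempts, so no in-memory timers are needed.
	_, err = js.AddConsumer("example-stream", &nats.ConsumerConfig{
		Durable:    "example-consumer",
		AckPolicy:  nats.AckExplicitPolicy,
		AckWait:    35 * time.Second,
		MaxDeliver: 4,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```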

Release Note

Fixes high CPU consumption of the dispatcher

@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 5, 2023
@knative-prow knative-prow bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 5, 2023
@codecov

codecov bot commented Sep 5, 2023

Codecov Report

Attention: 70 lines in your changes are missing coverage. Please review.

Comparison is base (38d85ad) 45.57% compared to head (e8ae33d) 51.06%.
Report is 1 commit behind head on release-1.10.

Files Patch % Lines
...channel/jetstream/dispatcher/message_dispatcher.go 77.37% 43 Missing and 19 partials ⚠️
pkg/channel/jetstream/dispatcher/consumer.go 0.00% 8 Missing ⚠️
Additional details and impacted files
@@               Coverage Diff                @@
##           release-1.10     #426      +/-   ##
================================================
+ Coverage         45.57%   51.06%   +5.49%     
================================================
  Files                29       30       +1     
  Lines              1953     2197     +244     
================================================
+ Hits                890     1122     +232     
+ Misses             1008     1001       -7     
- Partials             55       74      +19     


@dan-j
Contributor

dan-j commented Sep 5, 2023

So I added this because we ran into an issue where the time to spin up a Knative Service from 0 replicas was longer than the default AckWait. The dispatcher doesn't know when the *nats.Msg has expired, and the request is held by the Knative activator until the service is ready. By the time the request is actually handled, the AckWait has passed and the Ack fails. However, JetStream now thinks the message hasn't been delivered, so it attempts a redelivery.

In an ideal scenario, the dispatcher would know about the AckWait and dispatch the actual HTTP request using a context with timeout AckWait - jitter (some jitter to make sure there's time for the dispatcher to do its processing before/after the request). It doesn't look like you can access the AckWait/deadline on the *nats.Msg directly, but you could get the ConsumerInfo from the *nats.Subscription when creating the consumer during Dispatcher#subscribe(). The ConsumerInfo has a Config.AckWait field, but I'd double-check this is properly set if defaults are used when it's created.
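
A minimal sketch of that idea, assuming nats.go's JetStream API; dispatch and the jitter value are placeholders, not code from this repo:

```go
package dispatcher

import (
	"context"
	"time"

	"github.com/nats-io/nats.go"
)

// dispatch is a placeholder for the actual HTTP forwarding done by the dispatcher.
func dispatch(ctx context.Context, msg *nats.Msg) error { return nil }

// dispatchWithAckWaitDeadline bounds the request by the consumer's AckWait,
// leaving some jitter for the dispatcher's own processing.
func dispatchWithAckWaitDeadline(ctx context.Context, sub *nats.Subscription, msg *nats.Msg) error {
	info, err := sub.ConsumerInfo() // exposes Config.AckWait for JetStream subscriptions
	if err != nil {
		return err
	}
	jitter := 2 * time.Second // illustrative margin
	reqCtx, cancel := context.WithTimeout(ctx, info.Config.AckWait-jitter)
	defer cancel()
	return dispatch(reqCtx, msg)
}
```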

I wouldn't want to merge this PR as-is until we have a solution for this 0-replica cold-start issue, but I'm happy to hear any alternative solutions.

@dan-j
Contributor

dan-j commented Sep 5, 2023

Ah, another point...

I'm not 100% sure, but I feel like nats.QueueSubscribe() receives messages in batches and then calls the MsgHandler serially. I observed this when working on 41b2d76, which is why forwarding the actual event is done in another goroutine.

If this PR changes to my suggestion of setting a timeout on the request context, the jitter also needs to take into account that, from an AckWait perspective, the message is considered "delivered" the moment it's received by the client. I'm not sure there's a way to know what this duration is; I imagine it will typically be sub-millisecond but could spike higher under high load.

@astelmashenko
Member Author

astelmashenko commented Sep 6, 2023

So I added this because we ran into an issue where the time to spin up a Knative Service from 0 replicas was longer than the default AckWait. The dispatcher doesn't know when the *nats.Msg has expired, and the request is held by the Knative activator until the service is ready. By the time the request is actually handled, the AckWait has passed and the Ack fails. However, JetStream now thinks the message hasn't been delivered, so it attempts a redelivery.

Regarding this part, you can set a bigger AckWait and set a request timeout. E.g. if it takes 20s to spin up a service and 30s to process a request, then you set delivery.timeout to 35s and the channel AckWait to 35 * (number of retries) + delta, e.g. 120s.
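
Spelled out with the example numbers above (a sketch only; the 3 retries and the 15s delta are assumptions chosen so the arithmetic lands on 120s):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	deliveryTimeout := 35 * time.Second // per-attempt budget (delivery.timeout)
	retries := 3                        // assumed retry count for this example
	delta := 15 * time.Second           // slack for dispatcher overhead
	ackWait := time.Duration(retries)*deliveryTimeout + delta
	fmt.Println(ackWait) // 2m0s
}
```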

@astelmashenko
Member Author

astelmashenko commented Sep 6, 2023

@dan-j, overall, as I understand it, you need to calculate AckWait based on the delivery.retry and delivery.timeout you set on your channel's subscription. E.g. just set it big enough, say 300s, and control slow consumers with MaxAckPending.

With that timer you basically set AckWait to unlimited and add extra CPU load. In my case I have a broker and around 30 triggers, which leads to 30 consumers; with MaxAckPending=1000 that becomes 30,000 timers in the worst case, which creates a very big CPU load.

@astelmashenko
Member Author

I'm not 100% sure, but I feel like nats.QueueSubscribe() receives messages in batches and then calls the MsgHandler serially.

I'm not sure about that; it is a push-based consumer, and if I scale the dispatcher it round-robins messages one by one in turn.

@dan-j
Contributor

dan-j commented Sep 6, 2023

Regarding this part, you can set a bigger AckWait and set a request timeout. E.g. if it takes 20s to spin up a service and 30s to process a request, then you set delivery.timeout to 35s and the channel AckWait to 35 * (number of retries) + delta, e.g. 120s.

Yeah, this is what I'm getting at. We could do that and merge this PR, but it would still introduce bugs in our environment, because the request dispatcher doesn't set any timeout/deadline on the request's context.Context, which is what I mean about not wanting to merge this as-is.

I'm happy to remove the ticker; your issue is a real one which I agree needs fixing, we just need a solution which works for both of us.

One concern I've always had with the operator is how we handle retries. At the moment it's really confusing because we retry in multiple places: 1) at the JS layer via a consumer's maxDeliver and 2) at the in-memory layer via the DispatchMessageWithRetries() function, which is configured by the Subscription.

It would be nice to solve both issues here.

The simplest option is to remove redelivery from the JS layer and reuse the DispatchMessageWithRetries() functionality. However, this would result in message loss in the event of a pod failure.

The more robust solution is to remove NatsJetStreamChannel.spec.consumerConfigTemplate.ackWait and NatsJetStreamChannel.spec.consumerConfigTemplate.maxDeliver from the CRDs, and calculate the proper JS consumer options based on the Subscription.spec.delivery configuration. Then let JS do all the retrying and use DispatchMessage instead of DispatchMessageWithRetries, setting the context.Context properly to abort requests which time out. This might be a bit of a headache for exponential backoffs, because ackWait would need to be the maximum possible and the context's timeout would need to be calculated on each retry (which is possible because you can get the redelivery counter from meta, _ := msg.Metadata(); meta.NumDelivered).
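
A rough sketch of that last point, assuming nats.go; the exponential formula and base value are only illustrative, not this repository's code:

```go
package dispatcher

import (
	"time"

	"github.com/nats-io/nats.go"
)

// perAttemptTimeout derives the request deadline for the current delivery
// attempt from the JetStream redelivery counter. The consumer's ackWait has
// to be at least the sum of all attempts for this to stay safe.
func perAttemptTimeout(msg *nats.Msg, base time.Duration) (time.Duration, error) {
	meta, err := msg.Metadata() // only available on JetStream messages
	if err != nil {
		return 0, err
	}
	factor := time.Duration(1) << (meta.NumDelivered - 1) // 1, 2, 4, ... per redelivery
	return base * factor, nil
}
```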

* implemented update subscription

* do not call addstream if it is existing, to prevent error propagation

* added comments

* added reconciler test

* added reconciler tests

* removed unused types

* added check for err
@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 9, 2023
@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 9, 2023
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 9, 2023
@astelmashenko
Member Author

Hey @dan-j, I've done the initial implementation (have not tested it though); your review is welcome.

@astelmashenko astelmashenko marked this pull request as ready for review November 2, 2023 14:06
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 2, 2023
@astelmashenko
Member Author

@dan-j, here are the results of the load test. The scenario is pretty simple:
http workers -> broker (channel) -> trigger (sub) -> functionQ
functionQ processes a request with a random sleep.
The trigger has a request timeout, so functionQ fails to process in time 25% of the time.
Please pay attention to the blue line; it is the HTTP latency of the dispatcher.
Retries in-memory:
[image]
Retries based on JetStream:
[image]

@dan-j
Contributor

dan-j commented Dec 17, 2023

Awesome stuff! Thanks for the graphs too 👍

It's a shame we've had to reimplement the whole message_dispatcher logic, but from our earlier conversations this was going to be the only way.

I'm happy to merge this, but I need to fix my permissions on the GitHub org, so I can't do it just yet.

@astelmashenko
Member Author

@dan-j, it is green now and ready to merge.

@dan-j
Contributor

dan-j commented Dec 18, 2023

@pierDipi, could we have this merged? I will try to sort out my permissions on knative/org (or wherever it is) this week.

@astelmashenko
Member Author

/assign @pierDipi

@creydr
Contributor

creydr commented Dec 21, 2023

Hi @astelmashenko,
thanks for your PR!
Is there a reason why you're targeting the 1.10 branch directly instead of main and then backporting? That way it will only be fixed in 1.10 and not in any future releases :/

@astelmashenko
Member Author

astelmashenko commented Dec 21, 2023

@creydr, it is because we are using 1.10 in production. I'll backport it to later versions, e.g. with a cherry-pick, or manually merge it into main.

@dan-j
Contributor

dan-j commented Jan 2, 2024

@astelmashenko, can we update the PR to go into main and then cherry-pick afterwards?

@astelmashenko
Member Author

@dan-j, I would not do that; main was updated and, AFAIK, is not directly compatible with this PR.

@dan-j
Contributor

dan-j commented Jan 2, 2024

Ah, fair enough. Let's try and get this all wrapped up this week. I've created knative/community#1479 to add us both as approvers.

@zhaojizhuang, at the moment you're the only approver of this repo, or there's probably an eventing admin who can look.

Once this is in, I'll get the changes onto latest too

@dan-j
Contributor

dan-j commented Jan 4, 2024

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jan 4, 2024
@dan-j
Contributor

dan-j commented Jan 4, 2024

@astelmashenko and I are now approvers, but since this is being merged into knative-extensions:release-1.10 the OWNERS_ALIASES file isn't updated.

Can someone from @knative-extensions/eventing-writers please approve this PR?

@pierDipi
Member

pierDipi commented Jan 4, 2024

@dan-j we can backport the approvers update, feel free to open a PR

@pierDipi
Member

pierDipi commented Jan 4, 2024

/approve


knative-prow bot commented Jan 4, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astelmashenko, dan-j, pierDipi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 4, 2024
@knative-prow knative-prow bot merged commit 711fe4f into knative-extensions:release-1.10 Jan 4, 2024
21 checks passed
@astelmashenko
Member Author

/cherrypick main

@knative-prow-robot
Contributor

@astelmashenko: #426 failed to apply on top of branch "main":

Applying: removed timer which marked messages as InProgress
Applying: goimports
Applying: implemented update subscription (#427)
Using index info to reconstruct a base tree...
M	pkg/channel/jetstream/dispatcher/dispatcher.go
M	pkg/channel/jetstream/dispatcher/dispatcher_test.go
M	pkg/channel/jetstream/dispatcher/natsjetstreamchannel_test.go
M	pkg/channel/jetstream/dispatcher/reconciler.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/channel/jetstream/dispatcher/natsjetstreamchannel_test.go
Auto-merging pkg/channel/jetstream/dispatcher/dispatcher.go
No changes -- Patch already applied.
Applying: implemented retries based on JetStream; consumer ackWait and maxDeliver are not come from Subscription
Using index info to reconstruct a base tree...
M	pkg/channel/jetstream/dispatcher/consumer.go
M	pkg/channel/jetstream/dispatcher/dispatcher.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/channel/jetstream/dispatcher/dispatcher.go
Auto-merging pkg/channel/jetstream/dispatcher/consumer.go
CONFLICT (content): Merge conflict in pkg/channel/jetstream/dispatcher/consumer.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0004 implemented retries based on JetStream; consumer ackWait and maxDeliver are not come from Subscription
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick main

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.