
fix: prevent xdsIR updates from overwriting RateLimit configs from other xdsIR #3771

Merged

Conversation

@sanposhiho (Contributor)

What type of PR is this?

What this PR does / why we need it:

This PR prevents an xdsIR update from overwriting the RateLimit configs generated from other xdsIRs.

When there are multiple gateway classes, the Gateway API runner calls r.XdsIR.Store several times with different keys:

for _, resources := range *val {

But the rate limit config runner rebuilds the rate limit config from scratch every time it receives an XdsIR update:

message.HandleSubscription(message.Metadata{Runner: string(egv1a1.LogComponentGlobalRateLimitRunner), Message: "xds-ir"}, r.XdsIR.Subscribe(ctx),
    func(update message.Update[string, *ir.Xds], errChan chan error) {
        r.Logger.Info("received a notification")
        if update.Delete {
            if err := r.addNewSnapshot(ctx, nil); err != nil {
                r.Logger.Error(err, "failed to update the config snapshot")
                errChan <- err
            }
        } else {
            // Translate to ratelimit xDS Config.
            rvt, err := r.translate(update.Value)
            if err != nil {
                r.Logger.Error(err, err.Error())
            }
            // Update ratelimit xDS config cache.
            if rvt != nil {
                r.updateSnapshot(ctx, rvt.XdsResources)
            }
        }
    },
)

As a result, an update to one XdsIR could overwrite the rate limit configs generated from other XdsIRs in the rate limit config runner.
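To make the failure mode concrete, here is a minimal sketch of the sequence (the key names and the annotations are illustrative, not taken from the codebase):

// Two gateway classes cause two Store calls under different keys.
xdsIR.Store("gatewayclass-a", irA) // runner: translate(irA) -> snapshot = {A's rate limit configs}
xdsIR.Store("gatewayclass-b", irB) // runner: translate(irB) -> snapshot = {B's rate limit configs}
// After the second update, A's configs are gone: updateSnapshot replaced the
// whole snapshot with the resources translated from irB alone.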

Which issue(s) this PR fixes:

Fixes #3678 (Ratelimit does not work across multiple GatewayClasses)

@sanposhiho sanposhiho requested a review from a team as a code owner July 7, 2024 06:00
@sanposhiho sanposhiho force-pushed the bug-fix/prevent-ratelimitconfig-overwrite branch from 577b96f to b716627 Compare July 7, 2024 06:01

codecov bot commented Jul 7, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 69.10%. Comparing base (d5fde3e) to head (74281cf).

Files                                        Patch %   Lines
internal/globalratelimit/runner/runner.go    80.00%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3771      +/-   ##
==========================================
+ Coverage   68.97%   69.10%   +0.13%     
==========================================
  Files         176      176              
  Lines       21731    21737       +6     
==========================================
+ Hits        14989    15022      +33     
+ Misses       5662     5628      -34     
- Partials     1080     1087       +7     


@sanposhiho (Contributor Author)

/retest

@arkodg (Contributor) commented Jul 9, 2024

Thanks for picking this one up @sanposhiho!
The approach looks good; I added a comment.
Can you also add a test in runner_test.go where you instrument some creates and deletes for multiple keys, similar to:

func TestRunner(t *testing.T) {
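Such a test might look roughly like the following sketch (startRunner, snapshotContains, and irWithRateLimit are hypothetical helpers standing in for whatever setup runner_test.go already provides):

func TestRunnerWithMultipleKeys(t *testing.T) {
    // Hypothetical helper: boots the runner and returns the xdsIR map plus a
    // predicate over the resulting rate limit snapshot.
    xdsIR, snapshotContains := startRunner(t)

    // Creates under two different xdsIR keys: both must survive.
    xdsIR.Store("gatewayclass-a", irWithRateLimit("route-a"))
    xdsIR.Store("gatewayclass-b", irWithRateLimit("route-b"))
    require.Eventually(t, func() bool {
        return snapshotContains("route-a") && snapshotContains("route-b")
    }, time.Second, 10*time.Millisecond)

    // Deleting one key must remove only that key's configs.
    xdsIR.Delete("gatewayclass-a")
    require.Eventually(t, func() bool {
        return !snapshotContains("route-a") && snapshotContains("route-b")
    }, time.Second, 10*time.Millisecond)
}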

@zhaohuabing zhaohuabing added the cherrypick/release-v1.1 cherrypick to release/v1.1 label Jul 11, 2024
@sanposhiho (Contributor Author)

@arkodg Addressed your points.

@sanposhiho sanposhiho force-pushed the bug-fix/prevent-ratelimitconfig-overwrite branch from 061a071 to e6fbf2b Compare July 12, 2024 07:03
@sanposhiho sanposhiho force-pushed the bug-fix/prevent-ratelimitconfig-overwrite branch from e6fbf2b to e27bef7 Compare July 12, 2024 07:05
@sanposhiho sanposhiho force-pushed the bug-fix/prevent-ratelimitconfig-overwrite branch from e27bef7 to 118e965 Compare July 12, 2024 07:07
Signed-off-by: Kensei Nakada <[email protected]>
@sanposhiho sanposhiho force-pushed the bug-fix/prevent-ratelimitconfig-overwrite branch from 35345de to 63146e5 Compare July 12, 2024 07:33
func (r *Runner) subscribeAndTranslate(ctx context.Context) {
    // rateLimitConfigsCache is a cache of the rate limit config, which is keyed by the xdsIR key.
    rateLimitConfigsCache := map[string][]cachetype.Resource{}
@arkodg (Contributor)

Is there a way to not maintain this cache here, and instead create an API like rCache := r.getSnapshot(ctx)? That would let us avoid keeping one extra copy in memory.

@sanposhiho (Contributor Author) Jul 13, 2024

Let me try; depending on how much complexity it adds, that does look like a better idea.

@sanposhiho (Contributor Author) Jul 13, 2024

The change would look like sanposhiho@a9f0a19.

So, either way, we have to keep a cache tracking which update.Key generated which rate limit configurations. (That cache would be smaller than the current cache of entire configs, which is one good point.) The cache is required because otherwise we wouldn't know which part of the snapshot to update. (A message.Update coming from watchable.Snapshot doesn't carry the previous state of the IR.)

So I'm not very motivated to proceed that way; to me it looks like a small gain (less memory usage) at the cost of a certain downside (complexity).

@arkodg (Contributor) Jul 13, 2024

Hey @sanposhiho, thanks for writing up the quick diff. What I was suggesting was a little different: instead of holding the entire cache on the stack in rateLimitConfigsCache, which incurs a constant memory hit, can we retrieve the cache snapshot using an API like
https://github.com/envoyproxy/go-control-plane/blob/1da4500d00e270d803caefbe0c20e4d3d162e586/pkg/cache/v3/snapshot.go#L124
every time we get a new message, and then update it (adding or deleting the key based on the type of update)?
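A rough sketch of that flow, assuming the runner holds a go-control-plane SnapshotCache (the helper name and wiring are illustrative; GetSnapshot and GetResources are real go-control-plane APIs):

import (
    cachetypes "github.com/envoyproxy/go-control-plane/pkg/cache/types"
    cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
    resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

// currentRateLimitResources retrieves the previously published rate limit
// resources, keyed by resource name, so an update can start from them.
func currentRateLimitResources(c cachev3.SnapshotCache, nodeID string) map[string]cachetypes.Resource {
    snap, err := c.GetSnapshot(nodeID)
    if err != nil {
        // No snapshot published yet; start from empty.
        return map[string]cachetypes.Resource{}
    }
    return snap.GetResources(resourcev3.RateLimitConfigType)
}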

@sanposhiho (Contributor Author) Jul 13, 2024

Sorry, what's the difference from my diff sanposhiho@a9f0a19? It already fetches the previous snapshot with GetSnapshot -> GetResources and tries to rebuild a new snapshot from there.

My point is that, either way, we need the keyToRateLimitCfg cache to know which ir.Key generated which part of the snapshot, so that we can compute a correct snapshot after an update.

@arkodg (Contributor)

I missed that 🙈
Let's go ahead with your diff to keep the code complexity low; we can revisit this if/when we hit a memory bottleneck.
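For readers following the thread: the per-key cache approach that was kept boils down to flattening the cache into one resource list on every update. A minimal sketch (the helper name is illustrative, not necessarily the PR's exact code):

// buildXDSResourceFromCache flattens the per-xdsIR-key cache into the single
// resource list handed to the snapshot, so no key's configs are dropped.
func buildXDSResourceFromCache(cache map[string][]cachetype.Resource) []cachetype.Resource {
    xdsResources := []cachetype.Resource{}
    for _, resources := range cache {
        xdsResources = append(xdsResources, resources...)
    }
    return xdsResources
}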

@arkodg (Contributor) left a comment

LGTM, thanks!

@arkodg arkodg merged commit 762eb42 into envoyproxy:main Jul 13, 2024
23 of 24 checks passed