-
Notifications
You must be signed in to change notification settings - Fork 461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolve various data races in translation and validation #7207
Conversation
/skip-changelog for now |
This will definitely fix the problem with concurrent writes in plugins. I'm pretty confident that it won't deadlock. Are we also planning to share a single translator between validation and translation? |
@elcasteel I think that's a good idea and the right direction forward; torn on whether we want to make that specific change in our backport however. Still reasoning about the shared memory here |
Visit the preview URL for this PR (updated for commit 1ebcd00): https://gloo-edge--pr7207-lock-translation-vsvuq7kv.web.app (expires Fri, 30 Sep 2022 17:47:18 GMT) 🔥 via Firebase Hosting GitHub Action 🌎 |
We shared the translator before we started reusing plugins so I don't believe there are any other issues with shared memory. |
projects/gloo/pkg/syncer/sanitizer/route_replacing_sanitizer.go
Outdated
Show resolved
Hide resolved
Issues linked to changelog: |
@@ -68,6 +71,8 @@ func (t *translatorInstance) Translate( | |||
proxy *v1.Proxy, | |||
) (envoycache.Snapshot, reporter.ResourceReports, *validationapi.ProxyReport) { | |||
// setup tracing, logging | |||
t.lock.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how much time does translation take? could this be a problem if this take a long time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that and potential scaling concerns are why I don't want to rush this change out. I'd like to run this in your environment and benchmark more thoroughly before merging.
tldr; i don't know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we see a big slowdown after this, we can give the validation webhook one translator and translation+proxy_reconciler another translator which will separate the two truly independent uses of the lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I thought that's what translatorInstance
was for, to instantiate one for each.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes; although I think it's a bit better to have validation and core translation use the same if possible. For example, think about two plugins resolving DNS async .. imagine if one copy of translator has DNS entries and the other does not, or something to this effect. Validation translation may succeed using its cached DNS data but then fail on the real translation. This type of bug is why I think they should be the same.
6bb3b35
to
3989982
Compare
3989982
to
51b4f7b
Compare
routesProto = append(routesProto, resource.NewEnvoyResource(routeCfg)) | ||
// important to clone, ownership will be transferred to goroutine serving xds snapshot. | ||
// marshalling a proto to serve over grpc xds can mutate it, which can race with other reads (e.g. hashing last seen snapshot) | ||
routesProto = append(routesProto, resource.NewEnvoyResource(proto.Clone(routeCfg))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is called from two places and I don't think we need the clone when it's called from route_replacing_sanitizer so we could pass in a bool to decide whether or not to clone (had this in a previous iteration of the clone removal change).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yuval-k is considering another solution to this that updates hashstructure to be read-only so that we can keep these as they are, rather than add clones
Issues linked to changelog: |
@@ -49,6 +49,8 @@ var ( | |||
) | |||
|
|||
type RouteReplacingSanitizer struct { | |||
// note to devs: this can be called in parallel by the validation webhook and main translation loops at the same time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM on first pass!
* Add lock for translation * Share translator to reuse caches * Add lock for sanitizers * Add note for devs * No need to be too safe before we need it * No need for RWMutex * Add changelog * Avoid global scheme in broken skv2 function * Need to use global scheme, but register it before race can happen * Import the right scheme * Fix data race in with serving xds * Fix data race with status reporter * Adding changelog file to new location * Deleting changelog file from old location * Lock on validator sync * Remove extra lock * Add more locks * Need write lock * Copy write namespaces to avoid race * Update changelog * Revert "Fix data race with status reporter" This reverts commit 256767e. * Revert "Fix data race in with serving xds" This reverts commit 7df16e1. Co-authored-by: changelog-bot <changelog-bot>
* Resolve various data races in translation and validation (#7207) * Add lock for translation * Share translator to reuse caches * Add lock for sanitizers * Add note for devs * No need to be too safe before we need it * No need for RWMutex * Add changelog * Avoid global scheme in broken skv2 function * Need to use global scheme, but register it before race can happen * Import the right scheme * Fix data race in with serving xds * Fix data race with status reporter * Adding changelog file to new location * Deleting changelog file from old location * Lock on validator sync * Remove extra lock * Add more locks * Need write lock * Copy write namespaces to avoid race * Update changelog * Revert "Fix data race with status reporter" This reverts commit 256767e. * Revert "Fix data race in with serving xds" This reverts commit 7df16e1. Co-authored-by: changelog-bot <changelog-bot> * Fix merge * Optimize rate limit hash calculation (#7198) * gencode with unrelease solokit * update solokit * codegen * changelog * empty * Adding changelog file to new location * Deleting changelog file from old location Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com> Co-authored-by: changelog-bot <changelog-bot> Co-authored-by: Kevin Dorosh <[email protected]> * Use new hashstructure (#7227) * update protoc-gen-ext * code gen * remove remaining usages * make update-licenses * changelog * update solo-apis 9746539fc625fed45d1d146c4e0c1359e70f8036 - gloo-edge-safe-hasher * go mod tidy * fix merge solo kit v0.30.4 * Go mod tidy Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com> Co-authored-by: Kevin Dorosh <[email protected]> * Codegen * Consolidate changelogs * Prefer fix changelog type Co-authored-by: Yuval Kohavi <[email protected]> Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com>
Description
Ran Gloo in a large environment under race detector, which identified some races which are fixed in this PR:
RateLimitConfig
client construction to happen duringinit()
so it cannot race with global scheme access by k8s shared informersChecklist:
make -B install-go-tools generated-code
to ensure there will be no code diffBOT NOTES:
resolves solo-io#7202
resolves solo-io/skv2#375
resolves solo-io#7213