Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve various data races in translation and validation #7207

Merged
merged 25 commits into from
Sep 23, 2022

Conversation

kdorosh
Copy link
Contributor

@kdorosh kdorosh commented Sep 21, 2022

Description

Ran Gloo in a large environment under race detector, which identified some races which are fixed in this PR:

  • Updates validation server to use a shared translator that is synchronized which prevents data races across translation
  • Updates locking in validation server to prevent concurrent access to last snapshot
  • Updates RateLimitConfig client construction to happen during init() so it cannot race with global scheme access by k8s shared informers

Checklist:

  • I included a concise, user-facing changelog (for details, see https://github.com/solo-io/go-utils/tree/master/changelogutils) which references the issue that is resolved.
  • If I updated APIs (our protos) or helm values, I ran make -B install-go-tools generated-code to ensure there will be no code diff
  • I followed guidelines laid out in the Gloo Edge contribution guide
  • I opened a draft PR or added the work in progress label if my PR is not ready for review
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

BOT NOTES:
resolves solo-io#7202
resolves solo-io/skv2#375
resolves solo-io#7213

@kdorosh kdorosh added the work in progress signals bulldozer to keep pr open (don't auto-merge) label Sep 21, 2022
@github-actions github-actions bot added the keep pr updated signals bulldozer to keep pr up to date with base branch label Sep 21, 2022
@kdorosh
Copy link
Contributor Author

kdorosh commented Sep 21, 2022

/skip-changelog for now

@kdorosh kdorosh removed the keep pr updated signals bulldozer to keep pr up to date with base branch label Sep 21, 2022
@kdorosh
Copy link
Contributor Author

kdorosh commented Sep 21, 2022

@elcasteel
Copy link
Contributor

This will definitely fix the problem with concurrent writes in plugins. I'm pretty confident that it won't deadlock. Are we also planning to share a single translator between validation and translation?

@kdorosh
Copy link
Contributor Author

kdorosh commented Sep 21, 2022

@elcasteel I think that's a good idea and the right direction forward; torn on whether we want to make that specific change in our backport however. Still reasoning about the shared memory here

@github-actions
Copy link

github-actions bot commented Sep 21, 2022

Visit the preview URL for this PR (updated for commit 1ebcd00):

https://gloo-edge--pr7207-lock-translation-vsvuq7kv.web.app

(expires Fri, 30 Sep 2022 17:47:18 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

@elcasteel
Copy link
Contributor

@elcasteel I think that's a good idea and the right direction forward; torn on whether we want to make that specific change in our backport however. Still reasoning about the shared memory here

We shared the translator before we started reusing plugins so I don't believe there are any other issues with shared memory.
I think if we are trying to avoid waiting on locks the best set up would be to give the validation webhook its own translator (and own validator) and then to have a separate validator for the proxy reconciler that can share a translator with translation.
I think it's worth scale testing with a single translator for everything.

@solo-changelog-bot
Copy link

Issues linked to changelog:
solo-io#7202

@@ -68,6 +71,8 @@ func (t *translatorInstance) Translate(
proxy *v1.Proxy,
) (envoycache.Snapshot, reporter.ResourceReports, *validationapi.ProxyReport) {
// setup tracing, logging
t.lock.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how much time does translation take? could this be a problem if this take a long time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that and potential scaling concerns are why I don't want to rush this change out. I'd like to run this in your environment and benchmark more thoroughly before merging.

tldr; i don't know

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we see a big slowdown after this, we can give the validation webhook one translator and translation+proxy_reconciler another translator which will separate the two truly independent uses of the lock.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I thought that's what translatorInstance was for, to instantiate one for each.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; although I think it's a bit better to have validation and core translation use the same if possible. For example, think about two plugins resolving DNS async .. imagine if one copy of translator has DNS entries and the other does not, or something to this effect. Validation translation may succeed using its cached DNS data but then fail on the real translation. This type of bug is why I think they should be the same.

yuval-k added a commit that referenced this pull request Sep 21, 2022
@kdorosh
Copy link
Contributor Author

kdorosh commented Sep 22, 2022

routesProto = append(routesProto, resource.NewEnvoyResource(routeCfg))
// important to clone, ownership will be transferred to goroutine serving xds snapshot.
// marshalling a proto to serve over grpc xds can mutate it, which can race with other reads (e.g. hashing last seen snapshot)
routesProto = append(routesProto, resource.NewEnvoyResource(proto.Clone(routeCfg)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called from two places and I don't think we need the clone when it's called from route_replacing_sanitizer so we could pass in a bool to decide whether or not to clone (had this in a previous iteration of the clone removal change).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuval-k is considering another solution to this that updates hashstructure to be read-only so that we can keep these as they are, rather than add clones

@solo-changelog-bot
Copy link

Issues linked to changelog:
solo-io#7202
solo-io/skv2#375
solo-io#7213

@kdorosh kdorosh changed the title Add lock for translation Resolve various data races in translation and validation Sep 23, 2022
@kdorosh kdorosh added keep pr updated signals bulldozer to keep pr up to date with base branch and removed work in progress signals bulldozer to keep pr open (don't auto-merge) labels Sep 23, 2022
@@ -49,6 +49,8 @@ var (
)

type RouteReplacingSanitizer struct {
// note to devs: this can be called in parallel by the validation webhook and main translation loops at the same time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

Copy link
Contributor

@sam-heilbron sam-heilbron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on first pass!

@soloio-bulldozer soloio-bulldozer bot merged commit c780bcc into master Sep 23, 2022
@soloio-bulldozer soloio-bulldozer bot deleted the lock_translation branch September 23, 2022 18:53
kdorosh pushed a commit that referenced this pull request Sep 23, 2022
* Add lock for translation

* Share translator to reuse caches

* Add lock for sanitizers

* Add note for devs

* No need to be too safe before we need it

* No need for RWMutex

* Add changelog

* Avoid global scheme in broken skv2 function

* Need to use global scheme, but register it before race can happen

* Import the right scheme

* Fix data race in with serving xds

* Fix data race with status reporter

* Adding changelog file to new location

* Deleting changelog file from old location

* Lock on validator sync

* Remove extra lock

* Add more locks

* Need write lock

* Copy write namespaces to avoid race

* Update changelog

* Revert "Fix data race with status reporter"

This reverts commit 256767e.

* Revert "Fix data race in with serving xds"

This reverts commit 7df16e1.

Co-authored-by: changelog-bot <changelog-bot>
soloio-bulldozer bot added a commit that referenced this pull request Sep 23, 2022
* Resolve various data races in translation and validation (#7207)

* Add lock for translation

* Share translator to reuse caches

* Add lock for sanitizers

* Add note for devs

* No need to be too safe before we need it

* No need for RWMutex

* Add changelog

* Avoid global scheme in broken skv2 function

* Need to use global scheme, but register it before race can happen

* Import the right scheme

* Fix data race in with serving xds

* Fix data race with status reporter

* Adding changelog file to new location

* Deleting changelog file from old location

* Lock on validator sync

* Remove extra lock

* Add more locks

* Need write lock

* Copy write namespaces to avoid race

* Update changelog

* Revert "Fix data race with status reporter"

This reverts commit 256767e.

* Revert "Fix data race in with serving xds"

This reverts commit 7df16e1.

Co-authored-by: changelog-bot <changelog-bot>

* Fix merge

* Optimize rate limit hash calculation (#7198)

* gencode with unrelease solokit

* update solokit

* codegen

* changelog

* empty

* Adding changelog file to new location

* Deleting changelog file from old location

Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: changelog-bot <changelog-bot>
Co-authored-by: Kevin Dorosh <[email protected]>

* Use new hashstructure (#7227)

* update protoc-gen-ext

* code gen

* remove remaining usages

* make update-licenses

* changelog

* update solo-apis 9746539fc625fed45d1d146c4e0c1359e70f8036 - gloo-edge-safe-hasher

* go mod tidy

* fix merge solo kit v0.30.4

* Go mod tidy

Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Kevin Dorosh <[email protected]>

* Codegen

* Consolidate changelogs

* Prefer fix changelog type

Co-authored-by: Yuval Kohavi <[email protected]>
Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
keep pr updated signals bulldozer to keep pr up to date with base branch
Projects
None yet
5 participants