release-24.3: cli: actually drain after decommission #139556
Open
blathers-crl wants to merge 6 commits into release-24.3 from blathers/backport-release-24.3-138732
Conversation
I do not know how this was ever supposed to work, but the old code cannot have been intentional: it created a drain client but then did not consume from it. This had the effect of kicking off the drain, but ~immediately cancelling the context on the goroutine carrying it out on the decommissioning node. This PR actually waits for the drain to complete.

There is a related issue here, though. You can't drain a node that isn't live, so attempting to decommission a node that's down will fail on the drain step. This is certainly true as of this PR, but should have been true before as well. Our docs [1] do not mention this rather large caveat at all, and it seems strange anyway; if the node is down, why let the failing drain get in the way? Really, the code ought to distinguish between the case of a live and a dead node and react accordingly; this is not something this PR achieves.

Fixes #138265.

[1]: https://www.cockroachlabs.com/docs/v24.3/node-shutdown?filters=decommission#remove-nodes

Epic: none

Release note (ops change): the `node decommission` CLI command now waits until the target node is drained before marking it as fully decommissioned. Previously, it would start the drain but not wait, leaving the target node briefly in a state where it would be unable to communicate with the cluster but would still accept client requests (which would then hang or hit unexpected errors).
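To make the failure mode concrete, here is a minimal, hypothetical Go sketch (not the actual CockroachDB client code or API) of the difference between merely opening a server-streaming drain RPC and consuming it until it ends. With gRPC-style streams, abandoning the stream and letting the caller's context be cancelled is what interrupted the drain on the decommissioning node in the old code.

```go
// Hypothetical sketch only: drainStream, fakeStream, startDrainOnly and
// drainAndWait are illustrative stand-ins, not CockroachDB identifiers.
package main

import (
	"context"
	"fmt"
	"io"
)

// drainStream stands in for a gRPC server-stream client: each Recv reports the
// work still remaining, and io.EOF means the drain has finished.
type drainStream interface {
	Recv() (remaining int, err error)
}

// fakeStream counts down to simulate drain progress reports.
type fakeStream struct{ remaining int }

func (s *fakeStream) Recv() (int, error) {
	if s.remaining == 0 {
		return 0, io.EOF
	}
	s.remaining--
	return s.remaining, nil
}

// startDrainOnly mirrors the old behavior: the RPC is issued but the stream is
// never consumed, so the caller returns (and its context gets cancelled) while
// the server-side drain is still in flight.
func startDrainOnly(ctx context.Context, open func(context.Context) (drainStream, error)) error {
	_, err := open(ctx)
	return err // nobody waits for the drain to finish
}

// drainAndWait mirrors the fixed behavior: keep calling Recv until EOF so the
// decommission step only proceeds once the node reports the drain as complete.
func drainAndWait(ctx context.Context, open func(context.Context) (drainStream, error)) error {
	stream, err := open(ctx)
	if err != nil {
		return err
	}
	for {
		remaining, err := stream.Recv()
		if err == io.EOF {
			return nil // drain finished
		}
		if err != nil {
			return err
		}
		fmt.Printf("drain in progress: %d targets remaining\n", remaining)
	}
}

func main() {
	open := func(context.Context) (drainStream, error) { return &fakeStream{remaining: 3}, nil }
	_ = startDrainOnly(context.Background(), open) // returns immediately
	_ = drainAndWait(context.Background(), open)   // blocks until the fake drain reports done
}
```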
- rename `decommission/drains` to `decommission/drains/alive`
- add `decommission/drains/dead` flavor
- run these weekly instead of daily (we have enough other decom tests and since I'm adding one, we can also clamp down a bit)
- remove an old workaround that also accepted errors that should be impossible now that we properly wait for drain

Epic: none
blathers-crl bot force-pushed the blathers/backport-release-24.3-138732 branch from 9022557 to e92ee6f on January 22, 2025 08:13
blathers-crl bot added the blathers-backport (This is a backport that Blathers created automatically.) and O-robot (Originated from a bot.) labels on Jan 22, 2025
Thanks for opening a backport. Please check the backport criteria before merging:
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
Also, please add a brief release justification to the body of your PR to justify this backport.
blathers-crl bot added the backport (Label PR's that are backports to older release branches) label on Jan 22, 2025
arulajmani approved these changes on Jan 24, 2025
kvoli approved these changes on Jan 27, 2025
I set a reminder to come back to this in a month or two.
Backport 6/6 commits from #138732 on behalf of @tbg.
/cc @cockroachdb/release
I do not know how this was ever supposed to work, but the old code cannot have been intentional: it created a drain client but then did not consume from it. This had the effect of kicking off the drain, but ~immediately cancelling the context on the goroutine carrying it out on the decommissioning node. This PR actually waits for the drain to complete.

There is a related issue here, though. You can't drain a node that isn't live, so attempting to decommission a node that's down will fail on the drain step. This is certainly true as of this PR, but should have been true before as well. Our docs [1] do not mention this rather large caveat at all, and it seems strange anyway; if the node is down, why let the failing drain get in the way? Really, the code ought to distinguish between the case of a live and a dead node and react accordingly; this is not something this PR achieves.
Fixes #138265.
Fixes #137240.
Release justification:
Footnotes
[1]: https://www.cockroachlabs.com/docs/v24.3/node-shutdown?filters=decommission#remove-nodes
Epic: none
Release note (ops change): the `node decommission` CLI command now waits until the target node is drained before marking it as fully decommissioned. Previously, it would start the drain but not wait, leaving the target node briefly in a state where it would be unable to communicate with the cluster but would still accept client requests (which would then hang or hit unexpected errors).
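For illustration only, here is a sketch of the live-versus-dead distinction suggested above. This PR explicitly does not implement it, and `isLive`, `drainNode`, and `markDecommissioned` below are hypothetical stand-ins, not CockroachDB functions.

```go
// Hypothetical sketch: skip the drain step for a node that is not live instead
// of letting the inevitably failing drain block the decommission.
package main

import (
	"context"
	"fmt"
)

// decommissionNode drains only when the target is live; for a dead node the
// drain step is skipped rather than allowed to fail the whole operation.
func decommissionNode(ctx context.Context, nodeID int,
	isLive func(context.Context, int) (bool, error),
	drainNode func(context.Context, int) error,
	markDecommissioned func(context.Context, int) error,
) error {
	live, err := isLive(ctx, nodeID)
	if err != nil {
		return err
	}
	if live {
		// Live node: wait for the drain to complete before proceeding.
		if err := drainNode(ctx, nodeID); err != nil {
			return fmt.Errorf("draining n%d: %w", nodeID, err)
		}
	} else {
		// Dead node: there is nothing left to drain, so don't let the
		// failing drain step get in the way.
		fmt.Printf("n%d is not live; skipping drain\n", nodeID)
	}
	return markDecommissioned(ctx, nodeID)
}

func main() {
	// Stubs so the sketch runs standalone; a real caller would use the
	// cluster's liveness, drain, and decommission plumbing instead.
	isLive := func(context.Context, int) (bool, error) { return false, nil }
	drain := func(context.Context, int) error { return nil }
	mark := func(context.Context, int) error { return nil }
	if err := decommissionNode(context.Background(), 4, isLive, drain, mark); err != nil {
		fmt.Println(err)
	}
}
```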