release-24.3: cli: actually drain after decommission #139556

Open: wants to merge 6 commits into base: release-24.3


@blathers-crl blathers-crl bot commented Jan 22, 2025

Backport 6/6 commits from #138732 on behalf of @tbg.

/cc @cockroachdb/release


I do not know how this was ever supposed to work, but the old code cannot have
been intentional: it created a drain client but then did not consume from it.
This had the effect of kicking off the drain, but ~immediately cancelling the
context on the goroutine carrying it out on the decommissioning node. This PR
actually waits for the drain to complete.

There is a related issue here, though. You can't drain a node that isn't live,
so attempting to decommission a node that's down will fail on the drain step.
This is certainly true as of this PR, but should have been true before as well.

Our docs[1] do not mention this rather large caveat at all, and it seems
strange anyway: if the node is down, why would you let the failing drain get in
the way? Really the code ought to distinguish between the case of a live and
dead node and react accordingly; this is not something this PR achieves.

Fixes #138265.
Fixes #137240.


Release justification:

[1]: https://www.cockroachlabs.com/docs/v24.3/node-shutdown?filters=decommission#remove-nodes

Epic: none
Release note (ops change): the `node decommission` cli command now waits
until the target node is drained before marking it as fully
decommissioned. Previously, it would start drain but not wait, leaving
the target node briefly in a state where it would be unable to
communicate with the cluster but would still accept client requests
(which would then hang or hit unexpected errors).

tbg added 6 commits January 9, 2025 14:08
- rename `decommission/drains` to `decommission/drains/alive`
- add `decommission/drains/dead` flavor
- run these weekly instead of daily (we have enough other decom tests
  and since I'm adding one, we can also clamp down a bit)
- remove an old workaround that also accepted errors that
  should be impossible now that we properly wait for drain.

Epic: none
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-24.3-138732 branch from 9022557 to e92ee6f January 22, 2025 08:13
@blathers-crl blathers-crl bot added the blathers-backport and O-robot labels Jan 22, 2025

blathers-crl bot commented Jan 22, 2025

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious issues or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning area's TL. For more information as to how that review should be conducted, please consult the backport policy.
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this
backport.

@blathers-crl blathers-crl bot requested review from arulajmani and kvoli January 22, 2025 08:13
@blathers-crl blathers-crl bot added the backport label Jan 22, 2025

@tbg tbg self-requested a review January 29, 2025 09:02

tbg commented Jan 30, 2025

I set a reminder to come back to this in a month or two.

@tbg tbg removed their request for review January 30, 2025 13:06