release-24.3: cli: actually drain after decommission #139556
Open
blathers-crl wants to merge 6 commits into release-24.3 from blathers/backport-release-24.3-138732
Conversation
I do not know how this was ever supposed to work, but the old code cannot have been intentional: it created a drain client but then did not consume from it. This had the effect of kicking off the drain, but ~immediately cancelling the context on the goroutine carrying it out on the decommissioning node. This PR actually waits for the drain to complete.

There is a related issue here, though. You can't drain a node that isn't live, so attempting to decommission a node that's down will fail on the drain step. This is certainly true as of this PR, but should have been true before as well. Our docs [1] do not mention this rather large caveat at all, and it seems strange anyway; if the node is down, why let the failing drain get in the way? Really, the code ought to distinguish between the case of a live and a dead node and react accordingly; this is not something this PR achieves.

Fixes #138265.

[1]: https://www.cockroachlabs.com/docs/v24.3/node-shutdown?filters=decommission#remove-nodes

Epic: none

Release note (ops change): the `node decommission` CLI command now waits until the target node is drained before marking it as fully decommissioned. Previously, it would start the drain but not wait, leaving the target node briefly in a state where it would be unable to communicate with the cluster but would still accept client requests (which would then hang or hit unexpected errors).
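To make the failure mode concrete, here is a minimal, hypothetical Go sketch (not the actual CockroachDB client code or API) of the difference between merely opening a server-streaming drain RPC and consuming it until it ends. With gRPC-style streams, abandoning the stream and letting the caller's context be cancelled is what interrupted the drain on the decommissioning node in the old code.

```go
// Hypothetical sketch only: drainStream, fakeStream, startDrainOnly and
// drainAndWait are illustrative stand-ins, not CockroachDB identifiers.
package main

import (
	"context"
	"fmt"
	"io"
)

// drainStream stands in for a gRPC server-stream client: each Recv reports the
// work still remaining, and io.EOF means the drain has finished.
type drainStream interface {
	Recv() (remaining int, err error)
}

// fakeStream counts down to simulate drain progress reports.
type fakeStream struct{ remaining int }

func (s *fakeStream) Recv() (int, error) {
	if s.remaining == 0 {
		return 0, io.EOF
	}
	s.remaining--
	return s.remaining, nil
}

// startDrainOnly mirrors the old behavior: the RPC is issued but the stream is
// never consumed, so the caller returns (and its context gets cancelled) while
// the server-side drain is still in flight.
func startDrainOnly(ctx context.Context, open func(context.Context) (drainStream, error)) error {
	_, err := open(ctx)
	return err // nobody waits for the drain to finish
}

// drainAndWait mirrors the fixed behavior: keep calling Recv until EOF so the
// decommission step only proceeds once the node reports the drain as complete.
func drainAndWait(ctx context.Context, open func(context.Context) (drainStream, error)) error {
	stream, err := open(ctx)
	if err != nil {
		return err
	}
	for {
		remaining, err := stream.Recv()
		if err == io.EOF {
			return nil // drain finished
		}
		if err != nil {
			return err
		}
		fmt.Printf("drain in progress: %d targets remaining\n", remaining)
	}
}

func main() {
	open := func(context.Context) (drainStream, error) { return &fakeStream{remaining: 3}, nil }
	_ = startDrainOnly(context.Background(), open) // returns immediately
	_ = drainAndWait(context.Background(), open)   // blocks until the fake drain reports done
}
```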
- rename `decommission/drains` to `decommission/drains/alive`
- add `decommission/drains/dead` flavor
- run these weekly instead of daily (we have enough other decom tests and since I'm adding one, we can also clamp down a bit)
- remove an old workaround that also accepted errors that should be impossible now that we properly wait for drain

Epic: none
blathers-crl bot force-pushed the blathers/backport-release-24.3-138732 branch from 9022557 to e92ee6f on January 22, 2025 08:13
blathers-crl bot added the blathers-backport (This is a backport that Blathers created automatically.) and O-robot (Originated from a bot.) labels on Jan 22, 2025
Thanks for opening a backport. Please check the backport criteria before merging:
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
Also, please add a brief release justification to the body of your PR to justify this backport.
blathers-crl bot added the backport (Label PR's that are backports to older release branches) label on Jan 22, 2025
arulajmani approved these changes on Jan 24, 2025
kvoli approved these changes on Jan 27, 2025
I set a reminder to come back to this in a month or two.
Backport 6/6 commits from #138732 on behalf of @tbg.
/cc @cockroachdb/release
I do not know how this was ever supposed to work, but the old code cannot have been intentional: it created a drain client but then did not consume from it. This had the effect of kicking off the drain, but ~immediately cancelling the context on the goroutine carrying it out on the decommissioning node. This PR actually waits for the drain to complete.

There is a related issue here, though. You can't drain a node that isn't live, so attempting to decommission a node that's down will fail on the drain step. This is certainly true as of this PR, but should have been true before as well. Our docs [1] do not mention this rather large caveat at all, and it seems strange anyway; if the node is down, why let the failing drain get in the way? Really, the code ought to distinguish between the case of a live and a dead node and react accordingly; this is not something this PR achieves.
Fixes #138265.
Fixes #137240.
Release justification:
Footnotes
[1]: https://www.cockroachlabs.com/docs/v24.3/node-shutdown?filters=decommission#remove-nodes
Epic: none
Release note (ops change): the `node decommission` CLI command now waits until the target node is drained before marking it as fully decommissioned. Previously, it would start the drain but not wait, leaving the target node briefly in a state where it would be unable to communicate with the cluster but would still accept client requests (which would then hang or hit unexpected errors).
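For illustration only, here is a sketch of the live-versus-dead distinction suggested above. This PR explicitly does not implement it, and `isLive`, `drainNode`, and `markDecommissioned` below are hypothetical stand-ins, not CockroachDB functions.

```go
// Hypothetical sketch: skip the drain step for a node that is not live instead
// of letting the inevitably failing drain block the decommission.
package main

import (
	"context"
	"fmt"
)

// decommissionNode drains only when the target is live; for a dead node the
// drain step is skipped rather than allowed to fail the whole operation.
func decommissionNode(ctx context.Context, nodeID int,
	isLive func(context.Context, int) (bool, error),
	drainNode func(context.Context, int) error,
	markDecommissioned func(context.Context, int) error,
) error {
	live, err := isLive(ctx, nodeID)
	if err != nil {
		return err
	}
	if live {
		// Live node: wait for the drain to complete before proceeding.
		if err := drainNode(ctx, nodeID); err != nil {
			return fmt.Errorf("draining n%d: %w", nodeID, err)
		}
	} else {
		// Dead node: there is nothing left to drain, so don't let the
		// failing drain step get in the way.
		fmt.Printf("n%d is not live; skipping drain\n", nodeID)
	}
	return markDecommissioned(ctx, nodeID)
}

func main() {
	// Stubs so the sketch runs standalone; a real caller would use the
	// cluster's liveness, drain, and decommission plumbing instead.
	isLive := func(context.Context, int) (bool, error) { return false, nil }
	drain := func(context.Context, int) error { return nil }
	mark := func(context.Context, int) error { return nil }
	if err := decommissionNode(context.Background(), 4, isLive, drain, mark); err != nil {
		fmt.Println(err)
	}
}
```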