
Implement etcd member management in pre-terminate hook #435

Conversation

@Danil-Grigorev Danil-Grigorev commented Sep 10, 2024

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #431

Special notes for your reviewer:

Checklist:

  • squashed commits into logical changes
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

@Danil-Grigorev Danil-Grigorev requested a review from a team as a code owner September 10, 2024 11:26
@Danil-Grigorev (Contributor, Author) commented:

Currently testing. Getting errors in etcd after scaling down:

2024-09-10T11:56:31.901660199Z stderr F {"level":"warn","ts":"2024-09-10T11:56:31.901294Z","caller":"etcdserver/server.go:2133","msg":"failed to publish local member to cluster through raft","local-member-id":"aeb9bf27c9633dc9","local-member-attributes":"{Name:docker-dqwkg-9sdrq-e523dc15 ClientURLs:[https://192.168.32.7:2379]}","request-path":"/0/members/aeb9bf27c9633dc9/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}

The etcd cluster is not accessible, and the API server is not recovering because of it.

@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-memebers-on-pre-delete branch 3 times, most recently from ef72b5a to 452143d Compare September 10, 2024 14:18
@Danil-Grigorev (Contributor, Author) commented:

It seems the previous problem is fixed, but another problem appears on removal of the last machine.

I0910 14:37:55.029275       1 machine_controller.go:357] "Skipping deletion of Kubernetes Node associated with Machine as it is not allowed"

There is a discussion upstream about a potentially related issue https://kubernetes.slack.com/archives/C8TSNPY4T/p1725952675583209

@furkatgofurov7 (Contributor) commented Sep 10, 2024:

It seems the previous problem is fixed, but another problem appears on removal of the last machine.

I0910 14:37:55.029275       1 machine_controller.go:357] "Skipping deletion of Kubernetes Node associated with Machine as it is not allowed"

There is a discussion upstream about a potentially related issue https://kubernetes.slack.com/archives/C8TSNPY4T/p1725952675583209

Looks like a potential fix upstream will be available with the new v1.8.3 CAPI patch release scheduled for today; however, we have not yet bumped CAPI to the v1.8.x series in CAPRKE2.

@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-memebers-on-pre-delete branch 3 times, most recently from 47e23e5 to 2ed8d90 Compare September 11, 2024 11:50
@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-memebers-on-pre-delete branch 3 times, most recently from 6788009 to 5257503 Compare September 12, 2024 08:45
@Danil-Grigorev (Contributor, Author) commented:

Watch logs and metric collection cause failures in CI.

@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-memebers-on-pre-delete branch 15 times, most recently from a547119 to 9f985d0 Compare September 13, 2024 14:33
@Danil-Grigorev Danil-Grigorev added the kind/bug Something isn't working label Sep 13, 2024
@Danil-Grigorev (Contributor, Author) commented:

This PR depends on #440


var errs []error

for i := range machinesToDelete {
	m := machinesToDelete[i]
	logger := logger.WithValues("machine", m)

	// During RKE2CP deletion we don't care about forwarding etcd leadership or removing etcd members.
	// So we are removing the pre-terminate hook.
	// This is important because when deleting KCP we will delete all members of etcd and it's not possible
Suggested change:
- // This is important because when deleting KCP we will delete all members of etcd and it's not possible
+ // This is important because when deleting RKE2CP we will delete all members of etcd and it's not possible

}
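For illustration, here is a minimal sketch of what removing the pre-terminate hook from a Machine could look like, assuming the Cluster API hook-annotation prefix, a hypothetical `rke2-cleanup` suffix, and the controller-runtime client; the actual annotation name and patch flow in this PR may differ:

```go
// Hypothetical sketch: drop the pre-terminate hook annotation from a Machine so
// Cluster API no longer waits for the hook before terminating it.
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// The annotation key is an assumption for illustration; the provider may use a different suffix.
const preTerminateHookAnnotation = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/rke2-cleanup"

func removePreTerminateHook(ctx context.Context, c ctrlclient.Client, m *clusterv1.Machine) error {
	if _, ok := m.Annotations[preTerminateHookAnnotation]; !ok {
		return nil // hook already removed, nothing to do
	}

	patchBase := ctrlclient.MergeFrom(m.DeepCopy())
	delete(m.Annotations, preTerminateHookAnnotation)

	// Send only the annotation change back to the API server.
	return c.Patch(ctx, m, patchBase)
}
```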

// Note: Removing the etcd member will lead to the etcd and the kube-apiserver Pod on the Machine shutting down.
// If ControlPlaneKubeletLocalMode is used, the kubelet is communicating with the local apiserver and thus now

isn't this comment kubeadm-specific?

@Danil-Grigorev (Contributor, Author) replied:

Yes, this comment can be simplified.
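For context on that note, here is a minimal sketch of removing a Machine's etcd member with the go.etcd.io/etcd clientv3 API; matching members by node name and the simple error handling are simplifying assumptions, not the exact code in this PR:

```go
// Illustrative sketch only: remove the etcd member that backs a given node.
// Assumes etcd member names match node names, which may not hold in every setup.
package etcdutil

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func removeMemberForNode(ctx context.Context, cli *clientv3.Client, nodeName string) error {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return fmt.Errorf("listing etcd members: %w", err)
	}

	for _, member := range resp.Members {
		if member.Name != nodeName {
			continue
		}
		// Removing the member shuts down etcd (and with it the kube-apiserver) on
		// that machine, so the client must point at a remaining member's endpoint.
		if _, err := cli.MemberRemove(ctx, member.ID); err != nil {
			return fmt.Errorf("removing etcd member %q: %w", member.Name, err)
		}
		return nil
	}

	return nil // member already removed
}
```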

When the last machine is in a deleting state, the cluster is also being
removed. In such a scenario, waiting for draining is not feasible,
because draining is performed only when node deletion is allowed, which
it is not due to the cluster removal. Cluster API prevents draining with
the "cluster is being deleted" error.

Signed-off-by: Danil-Grigorev <[email protected]>
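As a rough illustration of the condition described in this commit message (a sketch under stated assumptions, not the exact code merged here), the pre-terminate handling could short-circuit when the owning Cluster is itself being deleted:

```go
// Hypothetical sketch: when the owning Cluster is being deleted, Cluster API
// does not allow draining its nodes ("cluster is being deleted"), so the
// pre-terminate flow should not wait for a drain that will never happen.
package controllers

import clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"

func shouldSkipDrainWait(cluster *clusterv1.Cluster) bool {
	return !cluster.DeletionTimestamp.IsZero()
}
```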
@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-memebers-on-pre-delete branch from 9f985d0 to fc6f21d Compare September 16, 2024 09:13
@alexander-demicev (Member) left a comment:

thanks a lot for taking care of this issue

@furkatgofurov7 (Contributor) left a comment:

Thanks for the fix @Danil-Grigorev!

@alexander-demicev alexander-demicev merged commit 7820d2d into rancher:main Sep 17, 2024
4 checks passed
Labels: kind/bug Something isn't working

Successfully merging this pull request may close these issues:

  • Rolling upgrades are blocked by nodes that are not properly drained