
Implement etcd member management in pre-terminate hook #435

Conversation

@Danil-Grigorev Danil-Grigorev commented Sep 10, 2024

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #431

Special notes for your reviewer:

Checklist:

  • squashed commits into logical changes
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

@Danil-Grigorev Danil-Grigorev requested a review from a team as a code owner September 10, 2024 11:26
@Danil-Grigorev (Contributor, Author) commented:

Currently testing. Getting errors in etcd after scaling down:

2024-09-10T11:56:31.901660199Z stderr F {"level":"warn","ts":"2024-09-10T11:56:31.901294Z","caller":"etcdserver/server.go:2133","msg":"failed to publish local member to cluster through raft","local-member-id":"aeb9bf27c9633dc9","local-member-attributes":"{Name:docker-dqwkg-9sdrq-e523dc15 ClientURLs:[https://192.168.32.7:2379]}","request-path":"/0/members/aeb9bf27c9633dc9/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}

The etcd cluster is not accessible, and the API server is not recovering because of it.

@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-memebers-on-pre-delete branch 3 times, most recently from ef72b5a to 452143d Compare September 10, 2024 14:18
@Danil-Grigorev (Contributor, Author) commented:

It seems the previous problem is fixed, but another problem appears on removal of the last machine.

I0910 14:37:55.029275       1 machine_controller.go:357] "Skipping deletion of Kubernetes Node associated with Machine as it is not allowed"

There is a discussion upstream about a potentially related issue https://kubernetes.slack.com/archives/C8TSNPY4T/p1725952675583209

@furkatgofurov7 (Contributor) commented Sep 10, 2024:

It seems the previous problem is fixed, but another problem appears on removal of the last machine.

I0910 14:37:55.029275       1 machine_controller.go:357] "Skipping deletion of Kubernetes Node associated with Machine as it is not allowed"

There is a discussion upstream about a potentially related issue https://kubernetes.slack.com/archives/C8TSNPY4T/p1725952675583209

Looks like a potential fix upstream will be available with the new v1.8.3 CAPI patch release scheduled for today; however, we have not yet bumped CAPI to the v1.8.x series in CAPRKE2.

@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-memebers-on-pre-delete branch 3 times, most recently from 47e23e5 to 2ed8d90 Compare September 11, 2024 11:50
@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-memebers-on-pre-delete branch 3 times, most recently from 6788009 to 5257503 Compare September 12, 2024 08:45
@Danil-Grigorev (Contributor, Author) commented:

Watch logs and metric collection cause failures in CI.

@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-memebers-on-pre-delete branch 15 times, most recently from a547119 to 9f985d0 Compare September 13, 2024 14:33
@Danil-Grigorev Danil-Grigorev added the kind/bug Something isn't working label Sep 13, 2024
@Danil-Grigorev (Contributor, Author) commented:

This PR depends on #440


var errs []error

for i := range machinesToDelete {
	m := machinesToDelete[i]
	logger := logger.WithValues("machine", m)

	// During RKE2CP deletion we don't care about forwarding etcd leadership or removing etcd members.
	// So we are removing the pre-terminate hook.
	// This is important because when deleting KCP we will delete all members of etcd and it's not possible
Suggested change:
- // This is important because when deleting KCP we will delete all members of etcd and it's not possible
+ // This is important because when deleting RKE2CP we will delete all members of etcd and it's not possible

}
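For illustration, here is a minimal sketch of what removing the pre-terminate hook from a Machine could look like, assuming the Cluster API hook-annotation prefix, a hypothetical `rke2-cleanup` suffix, and the controller-runtime client; the actual annotation name and patch flow in this PR may differ:

```go
// Hypothetical sketch: drop the pre-terminate hook annotation from a Machine so
// Cluster API no longer waits for the hook before terminating it.
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// The annotation key is an assumption for illustration; the provider may use a different suffix.
const preTerminateHookAnnotation = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/rke2-cleanup"

func removePreTerminateHook(ctx context.Context, c ctrlclient.Client, m *clusterv1.Machine) error {
	if _, ok := m.Annotations[preTerminateHookAnnotation]; !ok {
		return nil // hook already removed, nothing to do
	}

	patchBase := ctrlclient.MergeFrom(m.DeepCopy())
	delete(m.Annotations, preTerminateHookAnnotation)

	// Send only the annotation change back to the API server.
	return c.Patch(ctx, m, patchBase)
}
```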

// Note: Removing the etcd member will lead to the etcd and the kube-apiserver Pod on the Machine shutting down.
// If ControlPlaneKubeletLocalMode is used, the kubelet is communicating with the local apiserver and thus now

isn't this comment kubeadm-specific?

@Danil-Grigorev (Contributor, Author) replied:

Yes, this comment can be simplified.
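For context on that note, here is a minimal sketch of removing a Machine's etcd member with the go.etcd.io/etcd clientv3 API; matching members by node name and the simple error handling are simplifying assumptions, not the exact code in this PR:

```go
// Illustrative sketch only: remove the etcd member that backs a given node.
// Assumes etcd member names match node names, which may not hold in every setup.
package etcdutil

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func removeMemberForNode(ctx context.Context, cli *clientv3.Client, nodeName string) error {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return fmt.Errorf("listing etcd members: %w", err)
	}

	for _, member := range resp.Members {
		if member.Name != nodeName {
			continue
		}
		// Removing the member shuts down etcd (and with it the kube-apiserver) on
		// that machine, so the client must point at a remaining member's endpoint.
		if _, err := cli.MemberRemove(ctx, member.ID); err != nil {
			return fmt.Errorf("removing etcd member %q: %w", member.Name, err)
		}
		return nil
	}

	return nil // member already removed
}
```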

When the last machine is in a deleting state, the cluster is also being
removed. In such a scenario, waiting for draining is not feasible,
because draining is performed only when node deletion is allowed, which
it is not due to the cluster removal. Cluster API prevents draining with
the "cluster is being deleted" error.

Signed-off-by: Danil-Grigorev <[email protected]>
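As a rough illustration of the condition described in this commit message (a sketch under stated assumptions, not the exact code merged here), the pre-terminate handling could short-circuit when the owning Cluster is itself being deleted:

```go
// Hypothetical sketch: when the owning Cluster is being deleted, Cluster API
// does not allow draining its nodes ("cluster is being deleted"), so the
// pre-terminate flow should not wait for a drain that will never happen.
package controllers

import clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"

func shouldSkipDrainWait(cluster *clusterv1.Cluster) bool {
	return !cluster.DeletionTimestamp.IsZero()
}
```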
@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-memebers-on-pre-delete branch from 9f985d0 to fc6f21d Compare September 16, 2024 09:13
@alexander-demicev (Member) left a comment:

thanks a lot for taking care of this issue

@furkatgofurov7 (Contributor) left a comment:

Thanks for the fix @Danil-Grigorev!

@alexander-demicev alexander-demicev merged commit 7820d2d into rancher:main Sep 17, 2024
4 checks passed
Labels: kind/bug Something isn't working

Successfully merging this pull request may close these issues:

  • Rolling upgrades are blocked by nodes that are not properly drained