Skip to content

Conversation

@eggfoobar
Copy link
Contributor

added a cert check function to the monitor call to force a restart of etcd when the certs have been changed

@knet-jenkins
Copy link

knet-jenkins bot commented Oct 17, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2085/1/input

@clobrano clobrano self-requested a review October 17, 2025 06:30
Copy link
Collaborator

@clobrano clobrano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, it looks quite good, and I've added a few comments. My main concern now is that the nodes can act on certificate change at the same time, which results in the etcd-cluster being disrupted until both recover.
First of all, is it expected that both nodes can update certificate at the same time? If so, we might want to synchronize them, e.g. with CIB attributes. There are a few examples in the code, look for attribute_learner_node function for example.

@eggfoobar eggfoobar force-pushed the add-podman-etcd-file-monitor branch from 0a83ec4 to 06fd7a2 Compare October 17, 2025 14:26
@knet-jenkins
Copy link

knet-jenkins bot commented Oct 17, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2085/2/input

@eggfoobar
Copy link
Contributor Author

@clobrano I don't know about the CIB mechanism, looking into it now so I can port over the attribute_learner_node functionality for syncing this restart. I'll update the code soon, but the other suggestions should be up now

added a cert check function to the monitor call to force a restart of etcd when the certs have been changed

Signed-off-by: ehila <[email protected]>
@eggfoobar eggfoobar force-pushed the add-podman-etcd-file-monitor branch from 06fd7a2 to 6bfbe1d Compare October 21, 2025 04:26
@knet-jenkins
Copy link

knet-jenkins bot commented Oct 21, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2085/3/input

@eggfoobar eggfoobar changed the title WIP: feat: add support for podman-etcd cert rotation OCPEDGE-1502: feat: add support for podman-etcd cert rotation Oct 21, 2025
@eggfoobar
Copy link
Contributor Author

@clobrano I don't think we'll need the CIB, was able to validate this and the cert rotation is a rolling change, so the first node updates, then the second. I think we'll be adding extra complexity to have CIB in the mix. I haven't had a chance to look at the etcd-operator code yet, but I'll find this out so we can be sure that is the behavior. In my experimentation it was a rolling process from node to node.

@clobrano
Copy link
Collaborator

@clobrano I don't think we'll need the CIB, was able to validate this and the cert rotation is a rolling change, so the first node updates, then the second. I think we'll be adding extra complexity to have CIB in the mix. I haven't had a chance to look at the etcd-operator code yet, but I'll find this out so we can be sure that is the behavior. In my experimentation it was a rolling process from node to node.

If that so, I agree, there is no need to add complexity.

the cert rotation is a rolling change

does it mean the node also reboots, or that CEO just waits for one to be completed before moving to the next?

If the first, we need no logic at all, but I remember we reproduced a scenario when the node does not reboot after a certificate change.

If the latter, we should ensure the process is long enough for the agent to see (and signal) the file change, be stopped (which might require up to 80 seconds) and restart (which is usually much quicker), and rejoin the cluster, before the peer node is affected by the certificate change.

@oalbrigt oalbrigt changed the title OCPEDGE-1502: feat: add support for podman-etcd cert rotation OCPEDGE-1502: feat: podman-etcd: add support for cert rotation Oct 21, 2025
@clobrano
Copy link
Collaborator

We discussed this in another thread, as I was a little worried by the fact that Node==Ready doesn't equal to "Etcd member is voter", but given that this is a GNS scenario, and that we have the member auto-promotion patch in podman-etcd, the transition should be a lot smoother and quick

/lgtm

@oalbrigt oalbrigt merged commit fc3c646 into ClusterLabs:main Oct 23, 2025
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants