Slow VIP failover (~1 min) after control plane node power loss #11581
Replies: 1 comment
Talos uses etcd elections to decide which node holds the VIP. These are elections among etcd clients, a separate mechanism from etcd's own leader election, so a fast etcd leader election does not imply a fast VIP handover. The failover time is a trade-off: a shorter timeout moves the VIP sooner, but it also raises the risk that a brief interruption in etcd communication leads two nodes to claim the VIP at the same time. Failover therefore can't be "instant" in any way, and at the moment the timeout is not configurable.

Please keep in mind that with KubePrism and discovery enabled (the default configuration), VIP failover doesn't affect workloads within the cluster; it only affects external access to the cluster. In any case, clients need some time to catch up after a VIP failover, because existing TCP connections are broken and have to be re-established, which matters for long-lived HTTP/2 connections.
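To illustrate why a graceful shutdown and a power loss behave so differently, here is a minimal sketch of a lease-backed claim. It assumes the election is backed by a lease with a time-to-live (TTL), which is how etcd's client-side election recipe works. The `Lease` class, the `takeover_delay` function, and the 60-second TTL are hypothetical and chosen only to match the ~1 minute reported in this thread; they are not taken from the Talos source.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """A lease-backed claim on the VIP (illustrative model, not Talos code)."""
    holder: str
    ttl: float           # seconds the lease survives without a keep-alive (assumed value)
    last_renewed: float  # time of the last successful keep-alive

    def expires_at(self) -> float:
        return self.last_renewed + self.ttl


def takeover_delay(lease: Lease, failed_at: float, graceful: bool) -> float:
    """Seconds between the holder going away and the claim becoming free."""
    if graceful:
        # A clean shutdown can give up the claim explicitly, so it is free at once.
        return 0.0
    # After a crash nobody releases the claim; other nodes wait for the lease to expire.
    return max(0.0, lease.expires_at() - failed_at)


lease = Lease(holder="cp-1", ttl=60.0, last_renewed=100.0)
print(takeover_delay(lease, failed_at=105.0, graceful=True))   # 0.0
print(takeover_delay(lease, failed_at=105.0, graceful=False))  # 55.0
```

In this model a shorter TTL shortens the wait after a crash, but it also means that a keep-alive delayed for longer than the TTL would free the claim while the original holder is still alive. That is the trade-off described above.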
Environment:
Talos cluster: 3 control plane nodes, 2 worker nodes
VIP configured for control plane nodes
Issue:
On graceful shutdown of the control plane node holding the VIP, the VIP fails over instantly.
On sudden failure (e.g., power loss) of the VIP-holding control plane node, VIP failover takes ~1 minute.
etcd leader election completes in ~1 second after the failure.
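For anyone who wants to reproduce these timings from outside the cluster, the following is a rough sketch of a probe loop that waits for the VIP to stop answering and reports how long it stays unreachable. The helper names `tcp_probe` and `measure_outage` are hypothetical, and the address and port in the usage note are placeholders. The resolution is limited by the probe interval and the connect timeout, so treat the result as approximate.

```python
import socket
import time
from typing import Callable, Optional


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def measure_outage(
    probe: Callable[[], bool],
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    max_wait: float = 300.0,
) -> Optional[float]:
    """Wait for the first failed probe, then return seconds until the next success.

    Returns None if no complete outage is observed within max_wait seconds.
    """
    start = clock()
    down_since: Optional[float] = None
    while clock() - start < max_wait:
        ok = probe()
        now = clock()
        if not ok and down_since is None:
            down_since = now          # first failed probe marks the start of the outage
        elif ok and down_since is not None:
            return now - down_since   # first success after a failure ends it
        sleep(interval)
    return None
```

A possible invocation, started before powering off the node that holds the VIP, would be `measure_outage(lambda: tcp_probe("192.0.2.10", 6443))`, where `192.0.2.10` stands in for the VIP and 6443 is assumed to be the Kubernetes API port.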
Question:
Is there a way to reduce the VIP failover time in failure scenarios?
Can the timeout for VIP takeover be configured? Since etcd elects a new leader almost instantly, it seems feasible to switch the VIP much faster.