Slow VIP failover (~1 min) after control plane node power loss #11581

michaelkebe · 2025-08-14T18:38:04Z

Environment:

Talos cluster: 3 control plane nodes, 2 worker nodes

VIP configured for control plane nodes

Issue:

On graceful shutdown of the control plane node holding the VIP, the VIP fails over instantly.

On sudden failure (e.g., power loss) of the VIP-holding control plane node, VIP failover takes ~1 minute.

etcd leader election completes in ~1 second after the failure.

Question:
Is there a way to reduce the VIP failover time in failure scenarios?
Can the timeout for VIP takeover be configured? Since etcd elects a new leader almost instantly, it seems feasible to switch the VIP much faster.

smira (Maintainer) · 2025-08-15T09:55:40Z

Talos uses etcd elections (not etcd's leader election, but a separate election mechanism) to figure out which node has the VIP.

There is a trade-off between failover time and the risk of two nodes erroneously assigning the VIP at the same time because of a timeout in etcd communication. Failover can't be "instant" in any way.
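To illustrate that trade-off, here is a minimal sketch of a lease-style election, which is how etcd's client-side elections generally behave: the holder's key is tied to a lease that is released immediately on a clean shutdown, but only expires after its TTL when the holder disappears without warning. The 60-second TTL, the function name, and the timestamps are hypothetical values chosen for illustration; they are not taken from the Talos source.

```python
def vip_release_delay(ttl: float, last_keepalive: float,
                      failure_at: float, graceful: bool) -> float:
    """Seconds between the holder going away and its election key being freed.

    Simplified model (an assumption, not Talos code):
    - graceful shutdown: the holder revokes its lease, so the key is freed at once
    - sudden failure: nobody revokes the lease, so the key stays until
      `last_keepalive + ttl`
    """
    if graceful:
        return 0.0
    expires_at = last_keepalive + ttl
    # The lease cannot expire before the failure was even possible to notice.
    return max(0.0, expires_at - failure_at)


# Hypothetical numbers: TTL of 60 s, last keepalive at t=100 s, power loss at t=105 s.
sudden = vip_release_delay(ttl=60.0, last_keepalive=100.0,
                           failure_at=105.0, graceful=False)
clean = vip_release_delay(ttl=60.0, last_keepalive=100.0,
                          failure_at=105.0, graceful=True)
print(sudden, clean)
```

In this model a shorter TTL shortens failover, but it also makes it more likely that a healthy holder misses a keepalive during a brief etcd hiccup and loses the VIP while still believing it owns it.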

At the moment this is not configurable.

Please keep in mind that when KubePrism and discovery are in use (the default configuration), VIP failover doesn't affect workloads within the cluster; it only affects external access to the cluster.

In any case, clients will need some time to catch up after a VIP failover, because existing TCP connections (for example, long-lived HTTP/2 connections) are broken and have to be re-established.
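For anyone who wants to measure what an external client actually experiences, the following is a rough sketch of a probe loop that opens a fresh TCP connection on every attempt, since existing connections do not survive the failover. The helper names, the VIP address, and port 6443 are assumptions for illustration; substitute your own endpoint.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a new TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def measure_outage(probe: Callable[[], bool], interval: float = 0.5,
                   timeout: float = 300.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> float:
    """Return the seconds elapsed from the first call until `probe` succeeds.

    Raises TimeoutError if the probe never succeeds within `timeout` seconds.
    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout:
            raise TimeoutError("endpoint did not come back in time")
        sleep(interval)


# Usage (hypothetical VIP address; start this right after cutting power):
# seconds = measure_outage(lambda: tcp_probe("192.0.2.10", 6443))
# print(f"VIP reachable again after {seconds:.1f} s")
```

The measured value includes the probe interval and connection timeout, so it is an upper bound on the failover time rather than an exact figure.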
