Infrakit recovery options when less than half of the managers are available #741
Comments
I am glad you're raising this issue. Clearly it's not sufficient to just have Infrakit give up when the backend it depends on loses quorum. For the purpose of discussion, let's limit the scope to recovering the managers... those in groups where logical IDs are specified. The goal here is to have any remaining manager nodes act to recover the quorum, even if leadership isn't known at the moment. To document the issue -- this is the current flow:
For N=3, I think this is fairly straightforward. When 1 of the 3 managers is left, the lone manager, even if not a leader, can take over and restore the quorum. There are some changes needed, though:
As you pointed out, this won't work for the 2/5 scenario. In this case, the remaining managers (2) know they are not leaders... so
In the case where N-k managers are down, each of the k remaining managers can act; in the worst case each is only responsible for starting up exactly 1 other node... This way we will not over-provision. As long as more nodes are coming back online, they are going to try to rejoin the quorum... and when a quorum can be established, a new leader will be elected. The new leader can always be the single actor and restore any missing nodes as necessary. What do you think? I haven't manually played around with swarm to see how this could work. Obviously this also assumes we are reattaching the /var/lib/docker volumes so we are not doing all kinds of swarm demote/join/leave operations that alter the original topology.
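A minimal sketch of that pairing idea, assuming logical IDs are plain strings (e.g. IP addresses) and that every survivor can compute the same sorted view of who is up and who is down; the names here (`assignRepair`, `survivors`, `missing`) are hypothetical, not Infrakit APIs:

```go
package main

import (
	"fmt"
	"sort"
)

// assignRepair returns the logical ID (if any) that the manager `self`
// should re-provision. Sorting both sides makes the pairing deterministic:
// no two survivors pick the same missing node, and each survivor picks at
// most one.
func assignRepair(self string, survivors, missing []string) (string, bool) {
	sort.Strings(survivors)
	sort.Strings(missing)
	for i, s := range survivors {
		if s == self {
			if i < len(missing) {
				return missing[i], true
			}
			return "", false // more survivors than missing nodes
		}
	}
	return "", false // self is not among the survivors
}

func main() {
	// 2/5 example: only 172.0.0.3 and 172.0.0.5 are still up.
	survivors := []string{"172.0.0.3", "172.0.0.5"}
	missing := []string{"172.0.0.1", "172.0.0.2", "172.0.0.4"}
	fmt.Println(assignRepair("172.0.0.3", survivors, missing)) // 172.0.0.1 true
	fmt.Println(assignRepair("172.0.0.5", survivors, missing)) // 172.0.0.2 true
}
```

In the 2/5 case this recreates only 2 of the 3 missing managers; once 4 of 5 are up and a leader is elected, that leader restores the last one, as described above.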
@chungers I think that approach could work when there are only 2/5. What about the case where there are 4/5? Since this is an even number, what is the state of the swarm? Is there a leader at this point? If not, then we need to also handle this case (and we wouldn't want all 4 to provision 1 additional node).
I think for the 4/5 case we are ok as there's still a leader even with 4 active nodes: https://docs.docker.com/engine/swarm/admin_guide/#add-manager-nodes-for-fault-tolerance |
We are testing some HA recovery scenarios and we killed 2 of the 3 managers; this resulted in the following:
At this point there is no leader and Infrakit will never attempt to recover the manager nodes because those operations only run on the leader (and without manual intervention a leader will never be elected).
In theory, if a temporary leader could be determined, then that node could issue `Provision` requests to the instance provider to create the managers. This seems to align with the self-healing goals of Infrakit.
This flow seems to be straightforward when only 1 of 3 managers is left (since the last manager would assume temporary leadership); however, it is not as clean when only 2 of 5 are left.
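As a rough illustration of that flow, here is a sketch in Go where the temporary leader issues one `Provision` request per missing manager; the `Provisioner` interface and `ManagerSpec` type are placeholders standing in for the instance-plugin API, not the real Infrakit SPI:

```go
package main

import "log"

// ManagerSpec is a placeholder for whatever spec describes a manager node,
// including its logical ID and the /var/lib/docker volume to reattach so
// the recreated node keeps its swarm state.
type ManagerSpec struct {
	LogicalID string
	Volume    string
}

// Provisioner is a placeholder for the instance provider plugin.
type Provisioner interface {
	Provision(spec ManagerSpec) (string, error) // returns the new instance ID
}

// recoverManagers is run only by the (temporary) leader: one Provision call
// per missing manager, logging failures and leaving retries to a later pass.
func recoverManagers(p Provisioner, missing []ManagerSpec) {
	for _, spec := range missing {
		id, err := p.Provision(spec)
		if err != nil {
			log.Printf("provision of %s failed: %v", spec.LogicalID, err)
			continue
		}
		log.Printf("provisioned %s as instance %s", spec.LogicalID, id)
	}
}

// fakeProvider simulates an instance provider for demonstration only.
type fakeProvider struct{}

func (fakeProvider) Provision(spec ManagerSpec) (string, error) {
	return "i-" + spec.LogicalID, nil
}

func main() {
	recoverManagers(fakeProvider{}, []ManagerSpec{
		{LogicalID: "172.0.0.1", Volume: "vol-1"},
		{LogicalID: "172.0.0.2", Volume: "vol-2"},
	})
}
```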
A thought on handling the 2/5 scenario would be to rely on the fact that all managers have a unique `LogicalID`. If the managers had awareness of which other managers were still around, then each node could identify whether it is the lowest remaining manager (based on a string sort of the `LogicalID`s); if so, that node could self-elect.
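A minimal sketch of that self-election rule, assuming each manager can already discover which peer `LogicalID`s are still reachable (the function names are hypothetical, not Infrakit APIs):

```go
package main

import (
	"fmt"
	"sort"
)

// shouldSelfElect reports whether this manager has the lowest LogicalID
// (by string sort) among the managers that are still alive, and should
// therefore act as the temporary leader.
func shouldSelfElect(self string, alive []string) bool {
	if len(alive) == 0 {
		return false
	}
	sorted := append([]string(nil), alive...) // copy before sorting
	sort.Strings(sorted)
	return sorted[0] == self
}

func main() {
	// 2/5 example: only 172.0.0.2 and 172.0.0.4 are still reachable.
	alive := []string{"172.0.0.4", "172.0.0.2"}
	fmt.Println(shouldSelfElect("172.0.0.2", alive)) // true: lowest survivor
	fmt.Println(shouldSelfElect("172.0.0.4", alive)) // false
}
```

The node for which this returns true would then act as the temporary leader and issue the `Provision` requests for the missing managers.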