Reduce chance of duelling leaders #86

Cjen1 · 2023-07-20T13:07:31Z

Currently if multiple replicas concurrently try to recover from a failed leader, they can collide and must retry.
This is likely to occur in well behaved networks, since replicas will detect the fault at similar times.
It is also exacerbated as the number of replicas increase.

This PR uses randomness to reduce the chance of collisions by spreading out the candidates over $2^{16}$ terms, and effectively electing the candidate which chooses the highest term.

Since this is equivalent to each candidate being partitioned from the cluster until it hits its chosen term, there should be no interactions with the rest of the consensus protocol.

The downside of this approach is that if some candidate with a higher term becomes a candidate after a lower termed candidate is already elected, it will be prolong the outage.
This case is less likely to occur if PreVote is enabled.

The graph below compares time to recovery when the leader fails, which we define as the maximum interval without any committed requests.

Signed-off-by: cjen1 <[email protected]>

pav-kv · 2023-09-15T07:31:57Z

Nice. Interestingly, this change makes Raft behave more like vanilla Paxos.

Paxos assigns unique terms by construction, all candidates use non-intersecting terms. Correspondingly, if there is contention during phase 1 (equivalent to Raft election), the highest term wins.

Raft can have multiple candidates requesting votes for the same term, and has to retry in case of split votes. This PR reduces the chance of split votes, and makes the candidate with the latest term win election.

Both policies have pros/cons, we can't swap one for another without consideration. I think election should behave like the old code by default, for backwards compatibility. But we can make it possible to opt in for the new policy.

One way to do that: make the number of random bits at the end of Term configurable. By default it's zero, so there are no random bits, and "next term" just returns +1.

serathius · 2023-09-15T08:01:53Z

Aside of the improvement, I would be also pretty interested in the methodology you used to measure the maximum recovery time. Not sure if we already have a tool for this, but would be great if we can document the performance difference and methodology to reproducing the measurement.

erikgrinaker · 2023-09-15T08:24:13Z

How does this interact with CheckQuorum+PreVote? Leader stability can be important for availability under certain kinds of network partitions -- it looks like this will still respect the leader recency condition when granting prevotes and votes, so I think it should be fine.

With this change, would it make sense to drop the random 1x-2x factor of the election timeout, and simply use a fixed election timeout, relying on the term randomness to break ties? This would speed up elections on average. A delayed candidate can also upset an already elected leader by sending out vote requests with higher terms. That scenario would still be possible due to network latency alone, so we could end up electing multiple leaders in rapid succession until it settles on the candidate with the highest term -- I think that should be fine, as far as I can tell, but it's worth thinking through the implications here.

Cjen1 · 2023-09-15T09:03:36Z

@pavelkalinnikov Yep this is exactly where the idea came from (I was doing some comparisons of leader election performance between multipaxos and raft). Also that sounds like a good idea regarding how to merge this without breaking the internet :)

@serathius The measurements were taken using this tool: https://github.com/cjen1/reckon which is reasonably lightweight and very reproducible for these kind of tests (single machine with an emulated network). Here are some slides from a talk I gave on this work: https://coseners.net/wp-content/uploads/2023/08/Coseners-MP-vs-Raft-.pdf
The tool is currently usable, but if you have any suggestions on UX I'd appreciate it!

@erikgrinaker My understanding was that etcd doesn't do the random exponential backoff when leaders duel? If that exists currently it can still be useful for leader stability if timeouts are set too low by a user. Otherwise I haven't looked into testing it with PreVote, or intermittent network conditions (see the 2020 cloudflare outage).

erikgrinaker · 2023-09-15T09:07:15Z

My understanding was that etcd doesn't do the random exponential backoff when leaders duel?

It doesn't do exponential backoff, but each node's election timeout is randomized by a factor of [1-2) to reduce the chance of ties:

raft/raft.go

Lines 413 to 416 in ee0fe9d

    
           // randomizedElectionTimeout is a random number between 
        
           // [electiontimeout, 2 * electiontimeout - 1]. It gets reset 
        
           // when raft changes its state to follower or candidate. 
        
           randomizedElectionTimeout int

raft/raft.go

Lines 1989 to 1991 in ee0fe9d

    
           func (r *raft) resetRandomizedElectionTimeout() { 
        
           	r.randomizedElectionTimeout = r.electionTimeout + globalRand.Intn(r.electionTimeout) 
        
           }

Cjen1 · 2023-09-15T09:31:14Z

It doesn't do exponential backoff, but each node's election timeout is randomized by a factor of [1-2) to reduce the chance of ties:

Ah! Then randomisation of the terms would work instead of the randomisation of the timeout.

ahrtr · 2023-09-18T14:56:02Z

This PR might lead to subtle availability issues. Under the existing election timeout mechanism, each member only votes once in each term, accordingly at most one leader will be elected in each term. But in this PR, multiple leaders may be elected in a short period, because each member may vote multiple times (due to each randomised term). From application perspective, the leader changes multiple times in a short period in such situation.

Another minor concern is this PR also make the term a little human unreadable. Users have to extract the high 48 bits to get the real term.

The existing election timeout mechanism is a tradeoff between understandability and efficiency. I am not fully convinced we should proceed with this approach. The bottom line is that we should add a flag to enable or disable the new behavior. But again it complicates the user API, for we already have too many such flags.

boringhello · 2024-02-23T07:27:05Z

Interestingly, It could be more beneficial if the generation of the next term could vary based on the freshness of the log(log term, log index).

Randomise lower bits of terms

c976ceb

Signed-off-by: cjen1 <[email protected]>

Cjen1 force-pushed the static-ballot branch from 74217aa to c976ceb Compare July 20, 2023 13:09

ahrtr mentioned this pull request Feb 23, 2024

Delay the next campaign if the node lost the vote #169

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce chance of duelling leaders #86

Reduce chance of duelling leaders #86

Cjen1 commented Jul 20, 2023

pav-kv commented Sep 15, 2023

serathius commented Sep 15, 2023

erikgrinaker commented Sep 15, 2023

Cjen1 commented Sep 15, 2023

erikgrinaker commented Sep 15, 2023

Cjen1 commented Sep 15, 2023

ahrtr commented Sep 18, 2023

boringhello commented Feb 23, 2024

Reduce chance of duelling leaders #86

Are you sure you want to change the base?

Reduce chance of duelling leaders #86

Conversation

Cjen1 commented Jul 20, 2023

pav-kv commented Sep 15, 2023

serathius commented Sep 15, 2023

erikgrinaker commented Sep 15, 2023

Cjen1 commented Sep 15, 2023

erikgrinaker commented Sep 15, 2023

Cjen1 commented Sep 15, 2023

ahrtr commented Sep 18, 2023

boringhello commented Feb 23, 2024