recipes: Leader election #60
Conversation
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
I've marked it as ready for review because apparently no one receives notifications about drafts?
Hi @rodio, thank you for the contribution. I had a look, I think the general approach looks good. I see two problems (not sure if Draft related):
```rust
use tokio::sync::watch;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LeaderState {
    /// Volunteered, watching predecessor
    Pending,
    /// At index 0
    Leader,
    /// Session/candidacy ended; terminal
    Resigned,
}

/// Configuration for participating in a leader election.
#[derive(Debug, Clone)]
pub struct LeaderElection {
    election_node: String,
    zk: ZooKeeper,
}

/// An active candidacy. Drop to stop observing (the ephemeral znode
/// is still only removed when the underlying session ends).
#[derive(Debug)]
pub struct Candidate {
    my_path: String,
    state: watch::Receiver<LeaderState>,
    // ...
}

impl Candidate {
    pub fn path(&self) -> &str {
        &self.my_path
    }

    pub fn state(&self) -> LeaderState {
        self.state.borrow().clone()
    }

    /// Resolves once we become leader, or returns `Err` if we lose
    /// the session before that happens.
    pub async fn wait_for_leadership(&mut self) -> Result<(), ElectionError> {
        loop {
            match *self.state.borrow_and_update() {
                LeaderState::Leader => return Ok(()),
                // Should probably even be split into something like
                // `SessionLost` and `Withdrawn`
                LeaderState::Resigned => return Err(ElectionError::SessionLost),
                LeaderState::Pending => {}
            }
            self.state
                .changed()
                .await
                .map_err(|_| ElectionError::SessionLost)?;
        }
    }
}
```

With a volunteer impl like:

```rust
pub async fn volunteer(&self) -> Result<Candidate, ElectionError> { /* ... */ }
```

So the candidate only exists after you volunteered. This will help with a couple of your TODOs:
Some other things:
Malte
Hi, @maltesander! Thank you for the thorough review! I'd like to argue a bit here and defend the oneshot approach. The "Resigned" state that you mentioned in your code snippet is not a valid state at all and should not be exposed to users. You have the reason in your comment there: in fact, if users really want to stop participating, they should drop their connection like I did in the test. The connection and the participation should always be coupled, and it should not be possible to drop one without dropping the other. This way there'll be no "orphan" ephemeral nodes. And the way to do it is the one-shot approach: you either are the leader, wait for the leadership, or drop the whole connection.
Hi @rodio, yeah, I agree that "Resigned" is wrong and we shouldn't expose "Withdrawal". I'd still argue about involuntary withdrawal, though. E.g. session loss (which the candidate did not choose)? Consider: a candidate becomes leader, fires Leader on the oneshot. The application starts doing leader work. A little later, the ZK session expires (network partition, GC pause, ZK server failover, whatever). The ephemeral znode is gone. Some other node is now the leader. The original application is still doing leader work, because nobody told it otherwise. The oneshot already fired and can't fire again. Without withdrawal, if we go with the oneshot we should document explicitly that we have to watch connect() for session expiry. Then it should treat itself as no longer Leader, regardless of the oneshot? I'm fine with both approaches. Malte
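[Editor's note] The fires-once property at the heart of this objection can be demonstrated with a std-only stand-in for `tokio::sync::oneshot` (nothing below is from the PR; it only illustrates why a oneshot cannot report a later session loss):

```rust
use std::sync::mpsc;

fn main() {
    // Stand-in for tokio::sync::oneshot: one value, then the sender is gone.
    let (tx, rx) = mpsc::channel::<&str>();
    tx.send("Leader").unwrap();
    drop(tx); // a oneshot sender is consumed by its single send

    // The application observes leadership and starts doing leader work.
    assert_eq!(rx.recv().unwrap(), "Leader");

    // Later the session expires, but this channel can never signal it:
    // the receiver only ever sees a disconnect, not a state change.
    assert!(rx.recv().is_err());
    println!("ok");
}
```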
I think you're right that we need to monitor session loss after becoming leader, and I think it is better to hide it inside the implementation of the recipe because it probably is something that almost everyone needs. I'll think more about it later and we'll see what I can come up with. Thanks!
@maltesander I've pushed what I've come up with.
maltesander left a comment
Hi @rodio, thank you for the fixes. Some proposals and questions.
```rust
impl LeaderElection {
    /// Create a new leader election struct
    pub fn new(zk: ZooKeeper, election_node: &str, acl: &'static [Acl]) -> Self {
```
This won't allow constructing ACLs at runtime. What about just a `Vec<Acl>`?
```rust
.notify(|err, dur| {
    warn!("retrying {:?} after {:?}", err, dur);
})
.when(|e| e.to_string() == ZNODE_NOT_FOUND_ERROR_MSG)
```
This should be proper error variants and not a string match:

```rust
enum ObserveError { ZnodeNotFound, /* ... */ }
```

and

```rust
.when(|e| matches!(e, ObserveError::ZnodeNotFound))
```
```rust
static CANT_SEND_ERROR_MSG: &str = "Can't send leadership state update";
static RX_CANCELLED_ERROR_MSG: &str = "Watch receiver canceled, server disconnected?";
static ZNODE_NOT_FOUND_ERROR_MSG: &str = "The ephemeral znode of the participant was not found";
```
This should be proper error variants and replace the `whatever!`:

```rust
#[derive(Debug, Snafu)]
pub enum LeaderElectionError {
    #[snafu(display("ephemeral znode of the participant was not found"))]
    ZnodeNotFound,
    #[snafu(display("zookeeper error"))]
    Zk { source: crate::Error },
    // ...
}
```

and

```rust
.when(|e| matches!(e, LeaderElectionError::ZnodeNotFound))
```
```rust
error!(?event, "leader's ephemeral node changed");
whatever!(
    "Unexpected change to the leader's node: {:?}",
    event.event_type
);
```
I would not log "Unexpected" or error! when the leader's session ends. When the leader's ephemeral disappears because the ZK session expired or was reset, that's the expected terminal condition for a leader losing the session. Can the error! be downgraded to info/warn?
Furthermore this collapses into LeadershipState::Error in observe(), which hides the "session-induced loss of leadership"?
Should we add a LeadershipState::Lost variant so callers can react appropriately (fail over, reconnect) versus genuine error conditions?
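[Editor's note] Concretely, the observed state with such a variant might look like the sketch below. Only Leader, Uninitialized, Error and the proposed Lost are mentioned in this thread; the derives and doc comments are assumptions, not code from the PR:

```rust
/// Sketch of the leadership state with the proposed `Lost` variant.
#[derive(Debug, Clone, PartialEq, Eq)]
enum LeadershipState {
    /// Leadership participation procedure has not yet started
    Uninitialized,
    /// We are currently the leader
    Leader,
    /// Leadership was held and then lost (session expiry, znode deleted);
    /// callers can fail over or reconnect rather than treat it as a bug
    Lost,
    /// A genuinely unexpected, terminal error in the election procedure
    Error,
}

fn main() {
    // Callers can now distinguish ordinary failover from real errors:
    let observed = LeadershipState::Lost;
    assert_ne!(observed, LeadershipState::Error);
    assert!(matches!(observed, LeadershipState::Lost));
    println!("ok");
}
```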
I do not understand the difference between session expiration and session reset yet, but just in case: whenever the ZK server is killed, for example, we get Canceled on line 275, so that is the expected terminal condition that I've seen happening. I understand that we should not collapse the Canceled thing; I will add another LeadershipState variant, but I think it would still make sense to log these unexpected changes as errors here.
With the log here on line 277 I wanted to indicate that something unexpected happened to the znode itself, for example NodeDataChanged or DataWatchRemoved. I do not know what to do when we receive such kinds of events in this leader election context, so I just wanted to hard-fail here.
But I'm guessing we could potentially also receive a WatchedEvent with only the KeeperState field set here on line 275 which could be any of these:
tokio-zookeeper/src/types/watch.rs, lines 20 to 40 in 76afe96
SaslAuthenticated event here... Should I just unpack all these possible values and ignore all the SaslAuthenticated, SyncConnected and send something like LeadershipState::Lost in case we get Expired or Disconnected?
There is no difference between session expiration and session reset from a ZooKeeper perspective. The cluster decides a session is dead because the heartbeats stopped.
Yes, I think we should handle the potential WatchedEvent. What do you think about something like this (not complete, and using whatever! for brevity)?
```rust
match (event.event_type, event.keeper_state) {
    // expected leadership loss paths
    (NodeDeleted, _) => {
        info!("leader's ephemeral was deleted, session likely expired");
        leader_sender.send(LeadershipState::Lost).ok();
        Ok(())
    }
    (None, KeeperState::Expired) => {
        info!("session expired while leader");
        leader_sender.send(LeadershipState::Lost).ok();
        Ok(())
    }
    (None, KeeperState::AuthFailed) => {
        // genuinely unrecoverable.
        whatever!("auth failed while leader")
    }
    // transient, don't transition, re-await
    (None, KeeperState::Disconnected) => {
        // Session still alive; we're just temporarily unable to reach the cluster.
        // The whole point of session timeouts is to ride this out without losing leadership.
        debug!("transient disconnect while leader, continuing to wait");
        // Re-run observe, the watch is gone with the connection, we'll re-set it.
        Ok(())
    }
    (None, KeeperState::SyncConnected | KeeperState::SaslAuthenticated | KeeperState::ConnectedReadOnly) => {
        // not sure about this one
        debug!(?event.keeper_state, "benign state event while leader, continuing to wait");
        Ok(())
    }
    // genuinely unexpected, keep hard-fail
    (NodeDataChanged | NodeChildrenChanged | DataWatchRemoved | NodeCreated, _) => {
        // this can be error logged but will appear via snafu as well
        whatever!("unexpected change to leader's node: {:?}", event.event_type)
    }
}
```
I'm having second thoughts on having a separate LeadershipState::Lost, because I'm struggling to understand why callers would need to react differently in different situations. For example: the TCP connection dies, we get rx Canceled on line 275, we send an error to the watcher, and the caller needs to reconnect; if we receive KeeperState::Expired, the client still needs to reconnect, so it might as well be the same error. Do you have a specific example in mind where we might need to differentiate?
Yeah, it might be overengineered. But I think just because the possible reaction of a caller might be the same, we should not hide any error state? But if it is 100% equal we don't have to split as well...
Thinking about it, I don't really have strong examples to not collapse it for now.
I can think of observability (e.g. operators care why leadership was lost (network, quorum lost -> infrastructure)) and that mentioned race condition, like the difference between:
- you lost leadership and know it
- you lost leadership and a successor might already be acting

If I want to react differently on the suspicion of loss vs confirmed loss, we would lose that.
But I'd rather go without that distinction for now than make it more complicated than it has to be.
I've made changes to error handling here that are similar to your suggestion. Not sure if retrying on "benign" errors is going to make a lot of sense, because right on the next iteration of the loop we're doing get_children() and I think it would fail as well; but it is not possible to retry get_children() in a meaningful way for now because it returns a generic crate::Error.
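[Editor's note] The typed retry predicate discussed above can be exercised in isolation. The enum below is a local stand-in for the PR's error type and `should_retry` is a hypothetical helper, not code from this branch:

```rust
// Local stand-in for the recipe's error type (variant names are illustrative).
#[derive(Debug)]
enum LeaderElectionError {
    ZnodeNotFound,  // our ephemeral vanished; retrying the observe may help
    AuthFailed,     // genuinely unrecoverable
    Zk(String),     // generic ZooKeeper error; not retried blindly
}

// The predicate handed to a retry policy's `when(...)`: match on the
// variant instead of comparing error strings.
fn should_retry(e: &LeaderElectionError) -> bool {
    matches!(e, LeaderElectionError::ZnodeNotFound)
}

fn main() {
    assert!(should_retry(&LeaderElectionError::ZnodeNotFound));
    assert!(!should_retry(&LeaderElectionError::AuthFailed));
    assert!(!should_retry(&LeaderElectionError::Zk("timeout".into())));
    println!("ok");
}
```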
```rust
}
Err(e) => {
    warn!(
        "can't get children: {e}, recoverable error, will now try to find my guid again..."
```
This is from the `create()`:

```diff
- "can't get children: {e}, recoverable error, will now try to find my guid again..."
+ "can't create ephemeral node: {e}, recoverable error, will now try to find my guid again..."
```
```rust
/// An error has occured in the leader election procedure
Error,
```
Can we document Error as terminal? Once observe (line 222) sends the Error and returns, the spawned task is done; there is no path back without constructing a fresh LeaderElection and re-volunteering.
```rust
Leader,
/// Leadership participation procedure has not yet started
Uninitialized,
/// An error has occured in the leader election procedure
```
```diff
- /// An error has occured in the leader election procedure
+ /// An error has occurred in the leader election procedure
```
```rust
/// leader. To stop participating, drop the underlying ZooKeeper connection,
/// so that the underlying ephemeral znodes are removed.
/// - A [tokio::runtime::task::abort::AbortHandle]. Call .abort() to stop
/// participating in leader election. If a connection to ZooKeeper is kept
```
```diff
- /// participating in leader election. If a connection to ZooKeeper is kept
+ /// participating in leader election. If a connection to ZooKeeper is kept
```
@maltesander I've applied your suggestions, please have a look whenever you have time. Thanks!
I am implementing the leader election recipe from #10 and would appreciate early feedback to see if it makes sense.
I thought that a nice API for the leader election would be similar to watches: the leader election function would return the receiving end of a oneshot channel, so that users could use it to check if a node is a leader. There would be no way to "withdraw" yourself from the leader election process once you've volunteered to be a leader, because I think that no one needs it in practice.
So now that I have a somewhat working draft, I'd appreciate it if someone could give feedback on this API and function signatures, or really anything. Eventually I would like to resolve all these TODOs and hide this recipe behind a feature flag.
To test it:

```shell
./bin/zkCli.sh -server localhost:2181 create /election ""
cargo test election_works -- --nocapture
```
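[Editor's note] For context, the leadership decision in the standard ZooKeeper leader-election recipe is a pure computation over the sorted sequential znodes: the smallest sequence number leads, everyone else watches only their immediate predecessor. A minimal sketch, assuming znode names ending in a zero-padded sequence number (the naming scheme and helper names are illustrative, not this PR's code):

```rust
/// Which step a candidate should take, given the children of the election node.
#[derive(Debug, PartialEq)]
enum Step {
    Lead,
    WatchPredecessor(String),
    NotRegistered, // our own znode is gone, e.g. the session expired
}

fn decide(me: &str, children: &[&str]) -> Step {
    // Sort by the trailing sequence number so order matches creation order.
    let seq = |name: &str| -> u64 {
        name.rsplit('_')
            .next()
            .and_then(|s| s.parse().ok())
            .unwrap_or(u64::MAX)
    };
    let mut sorted: Vec<&str> = children.to_vec();
    sorted.sort_by_key(|n| seq(n));

    match sorted.iter().position(|n| *n == me) {
        None => Step::NotRegistered,
        Some(0) => Step::Lead, // smallest sequence number leads
        // Watch only the immediate predecessor to avoid a herd effect.
        Some(i) => Step::WatchPredecessor(sorted[i - 1].to_string()),
    }
}

fn main() {
    let children = ["guid-a_0000000002", "guid-b_0000000000", "guid-c_0000000001"];
    assert_eq!(decide("guid-b_0000000000", &children), Step::Lead);
    assert_eq!(
        decide("guid-a_0000000002", &children),
        Step::WatchPredecessor("guid-c_0000000001".to_string())
    );
    assert_eq!(decide("guid-x_0000000009", &children), Step::NotRegistered);
    println!("ok");
}
```

Watching only the predecessor (rather than all children) means a single node's departure wakes exactly one candidate, which is why the recipe scales.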