Try to avoid isolated node split-brain #39

Open · kaiyou opened this issue May 6, 2020 · 8 comments
Labels: enhancement (New feature or request)

kaiyou (Contributor) commented May 6, 2020

There are two kinds of split-brain:

  • a fairly rare half-and-half split-brain, where half the nodes are completely disconnected from the other half for long enough that a split-brain occurs; this might sometimes happen if entire LANs are connected using wesher;
  • a much more common isolated-node split-brain, where a single node loses connection to the rest of the cluster and acts as an isolated node from then on.

There is a general fix for both of these: keep track of some super-nodes (maybe all known nodes?), regularly try to re-join them to the memberlist with some kind of backoff mechanism, and maybe forget them after some (fairly long) time.

This would probably require some complex code, should not run inside the main loop to avoid deadlocking, and quite frankly it sounds scary to me. I would love to get into it later, but I am not familiar enough with the wesher code for now.

However, the second, more common case has a quick and (not so) dirty fix. If the memberlist becomes empty, it is usually safe to assume we are facing a split-brain, and in any case we know for sure we are in a dead end (until some node leaves/joins, that is). So I think it would be safe to simply fatal-exit. It is then the service manager's responsibility to handle restarting if the admin requires it.

I have a patch working for this and have successfully tested it using systemd unit files. I need to isolate the changes and provide a PR.
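For illustration, the core of the check could be as simple as this sketch (checkIsolation and its wiring are hypothetical, not the actual patch):

```go
package main

import (
	"log"

	"github.com/hashicorp/memberlist"
)

// checkIsolation would be called after every membership change: if we are the
// only node left in the memberlist, assume an isolated-node split-brain and exit.
func checkIsolation(ml *memberlist.Memberlist) {
	// NumMembers counts the local node too, so "alone" means <= 1.
	if ml.NumMembers() <= 1 {
		log.Fatal("no peers left in the memberlist; assuming isolated-node split-brain, exiting")
	}
}
```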

costela (Owner) commented May 6, 2020

I like the idea of focusing on this specific "easy" case first!

However, I'm not sure about a couple of points:

  1. using the fact that the memberlist is empty as an indicator seems like it could hide a few annoying corner cases. Ideally we would focus on one more specific case: the transition from "many (>1) peers" to "no peers". But then we still have the issue of timing: membership changes come in for single nodes, so to observe this kind of sudden change we'd have to consider multiple changes within an arbitrary time range as being a single change. This could conceivably cause false positives if you have, for instance, an auto-scaling mesh that reacts to batched workloads: you would spin up several nodes and they could quite possibly finish up inside this arbitrary time range, making it look like a split-brain.

  2. assuming we can keep the false positives to a minimum, we'd still have the issue of what to do with "orphaned" nodes. If one of two nodes gets forcefully scaled down, how should the remaining node get rid of this dangling peer if it thinks it just saw a split? A similar case applies to bigger meshes: let's say a 3-node mesh has a split. Node A is cut off, notices the split, and keeps B and C as peers (important: memberlist already considers them "gone"). During the split, node C is scaled down. Etc.

  3. lastly, I don't want to make assumptions about the service manager (systemd in this case). If people want to use wesher in a different context, I would try to make it work as well as possible.

In general, I suspect simply increasing the memberlist node eviction timeouts should suffice for the majority of cases, without incurring much added complexity.
I'm open to counter-arguments, of course!
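For reference, here is roughly what tuning those timeouts could look like with hashicorp/memberlist (the Config fields below exist in that library; the values are purely illustrative, not a recommendation):

```go
package main

import (
	"time"

	"github.com/hashicorp/memberlist"
)

func newMemberlistConfig() *memberlist.Config {
	c := memberlist.DefaultLANConfig()
	// Raising these makes failure detection slower but keeps nodes around longer
	// before they are declared dead and evicted from the memberlist.
	c.SuspicionMult = 8                     // longer suspicion period before declaring a node dead
	c.ProbeInterval = 5 * time.Second       // probe peers less aggressively
	c.ProbeTimeout = 3 * time.Second        // tolerate slower probe responses
	c.GossipToTheDeadTime = 2 * time.Minute // keep gossiping to dead nodes, giving them a chance to refute
	return c
}
```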

costela added the enhancement (New feature or request) label on May 8, 2020
kaiyou (Contributor, Author) commented May 20, 2020

Edit: please ignore this and the following message. They were useful to the thinking process, but describe ideas I no longer find relevant.

I gave this some time and here are some of my thoughts.

I cannot think of a proper corner case where we reach a state of 0 nodes and are not isolated. Nodes are removed from membership when they leave the cluster or when they time out. Timeouts are the same on all nodes, so the current node will be removed from the other nodes' state at pretty much the same time as it forgets about them. In the end, unless all other nodes are leaving at the same time (global restart), reaching 0 nodes pretty much means we are isolated, and the rest of the cluster (1 or more other nodes) has probably also forgotten about us.

In my opinion, reaching such a state, where we have almost no chance of being passively reintegrated into the cluster, calls for one of two responses:

  • consider it a fatal situation and exit;
  • try to join again, using the configured join node(s), if any.

Trying to join again is complex because we do not know when connectivity will be back, so we would need to retry forever, probably with some backoff. If we retry forever upon isolation, why wouldn't we also retry forever during the first join? My best guess is that we want to detect a misconfiguration on startup, so we should not retry forever there; but if we get isolated later, retries should be infinite and should generate logs so we know something is going wrong.

The other option is to plainly fail, which we must do if no join address is set up (contacting old nodes might not make sense or might not even be safe, and we have no decent heuristic for guessing which old node might still be up). I think we might want to provide the option to fail anyway and exit with an error status: many people will not like the infinite-retry behavior and will prefer a clean exit with externally triggered responses, such as an automatic restart, a monitoring notification, or even a machine reboot.

In the end, I would add a --rejoin option that, if set, will try to rejoin forever upon isolation. We would fatal at startup if --rejoin is provided without --join, and fatal upon isolation if --rejoin is not provided.
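A sketch of the flag validation just described (hypothetical flag names and wiring, not wesher's actual CLI handling; this idea was later dropped in favor of the approach in the following comments):

```go
package main

import (
	"flag"
	"log"
)

func main() {
	// Hypothetical flags mirroring the proposal above.
	join := flag.String("join", "", "comma-separated addresses of nodes to join")
	rejoin := flag.Bool("rejoin", false, "retry joining forever after becoming isolated")
	flag.Parse()

	// Fatal at startup if --rejoin is requested without any --join address.
	if *rejoin && *join == "" {
		log.Fatal("--rejoin requires --join")
	}

	// ...later, when the memberlist becomes empty:
	// if *rejoin is set, retry joining *join forever (with logging);
	// otherwise, fatal-exit.
}
```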

kaiyou (Contributor, Author) commented May 20, 2020

On the more general split-brain problem, we could add a timer and check at regular intervals that every node provided in --join is still a cluster member, trying to add back any that have left. This would also cover the simple case of node isolation.

However, the implementation sounds a bit trickier. We would want to back off in some way and also avoid blocking the entire main loop, so other cluster operations can proceed normally. I have no idea how to design this at the moment.

kaiyou (Contributor, Author) commented May 21, 2020

After giving it some time and some fiddling around, here is my new take on the matter: solving the issue might actually be fairly simple.

We could simply add a tick channel to the main loop, with a decent interval, and check every once in a while that every address provided in --join is still a cluster member, trying to add them back if not. This would extend the semantics of join nodes to: stable nodes that are used to join and rejoin the cluster, in order to avoid split-brain. We could control the behavior with --rejoin <interval>, with a default interval of 0 meaning no automatic rejoin.
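A simplified sketch of what that main loop could look like (joinAddrs, updates and the naive address matching are hypothetical simplifications, not wesher's actual internals):

```go
package main

import (
	"time"

	"github.com/hashicorp/memberlist"
)

// rejoinLoop handles membership events as usual, and additionally re-joins any
// configured join address that has dropped out of the cluster, once per tick.
func rejoinLoop(ml *memberlist.Memberlist, joinAddrs []string, rejoinInterval time.Duration, updates <-chan memberlist.NodeEvent) {
	ticker := time.NewTicker(rejoinInterval)
	defer ticker.Stop()

	for {
		select {
		case ev := <-updates:
			_ = ev // handle node join/leave events as before

		case <-ticker.C:
			// Collect the addresses currently known to the memberlist.
			known := make(map[string]bool)
			for _, m := range ml.Members() {
				known[m.Addr.String()] = true
			}
			// Any configured join address that is no longer a member gets re-joined.
			// (Real code would need to normalize/resolve addresses before comparing.)
			var missing []string
			for _, addr := range joinAddrs {
				if !known[addr] {
					missing = append(missing, addr)
				}
			}
			if len(missing) > 0 {
				// Errors are not fatal here: we simply retry on the next tick.
				_, _ = ml.Join(missing)
			}
		}
	}
}
```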

My assumption is that this simple change will solve most split-brain issues as long as the join nodes are still online, hence the new semantics about their stability.

My other assumption is that backoff is not necessary. As long as the interval is long enough, we do not have to retry quickly: upon detecting that a join node is missing, there is a fair chance it is offline or we are experiencing a split-brain, quite possibly due to network issues. Due to memberlist timeouts, we also know that the join node has been gone for some time, so there is no telling whether it will be back soon, and no reason to retry fast in the first place. So we should retry slowly, at regular, long enough intervals.

My last assumption is that, since we only retry at regular, long enough intervals, the overhead in the main loop is small enough to avoid any other channel filling up. So we can simply handle this in the main loop.

This makes implementation almost trivial. I will try something in the next couple of days.

kaiyou (Contributor, Author) commented May 21, 2020

I have a test implementation running on a couple of live clusters. First tests against manually triggered failures are promising. I will provide feedback about real-life effectiveness in a couple of days.

costela (Owner) commented May 27, 2020

Sounds nice! Can you open a PR so we can start iterating on the code?

I'm a bit apprehensive about overloading the semantics of --join, though. The technical side sounds reasonable, but I suspect the usability might suffer in non-obvious ways.
If, for instance, split-brains are not a particular concern (say, because of enough network redundancy), and you naively provision your nodes by starting wesher --join ALL_MY_CURRENT_NODES, older nodes might end up with a potentially big list of peers they will keep retrying forever, long after the peers are gone.
A way to say more explicitly "these specific nodes can be treated as non-ephemeral" seems a bit cleaner to me.

But again, maybe after seeing the code this fear might turn out to be unfounded.

kaiyou (Contributor, Author) commented May 28, 2020

Maybe a flag to specify the delay between rejoins, with a default value that disables the feature?

Or a different, more explicit --persistent-nodes flag, to distinguish between initial joins and later rejoins.

costela (Owner) commented May 28, 2020

Both sound OK. I'm slightly more inclined towards the latter. I played a bit with this idea and even added it to the README:

[...] Future versions might include the notion of a "static" node to more cleanly avoid this.

The second variant has the advantage of being slightly more flexible: you could have your provisioning set up as --join WHOLE_CLUSTER --join-persistent SINGLE_COORDINATOR and have your scale-up events work even during, say, a maintenance window of the SINGLE_COORDINATOR node. Your current cluster can then be disjoint from your ideal cluster when starting up.

But I'm open to counter-arguments as always!
