Try to avoid isolated node split-brain #39
I like the idea of focusing on this specific "easy" case first! However, I'm not sure about a couple of points:
In general, I suspect simply increasing the memberlist node eviction timeouts should suffice for the majority of cases, without incurring much added complexity.
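For context, the eviction behavior is governed by a handful of failure-detection knobs on the hashicorp/memberlist configuration that wesher builds on. Below is a minimal sketch of what "just increase the timeouts" could look like; the function name and the values are purely illustrative, not actual wesher code.

```go
package cluster

import (
	"log"
	"time"

	"github.com/hashicorp/memberlist"
)

// newTolerantMemberlist shows roughly where the eviction-related knobs live in
// hashicorp/memberlist. The values are illustrative only.
func newTolerantMemberlist() (*memberlist.Memberlist, error) {
	conf := memberlist.DefaultLANConfig()
	conf.ProbeInterval = 5 * time.Second        // probe peers less often
	conf.ProbeTimeout = 3 * time.Second         // wait longer for each ack
	conf.SuspicionMult = 8                      // keep "suspect" nodes around longer before declaring them dead
	conf.GossipToTheDeadTime = 60 * time.Second // keep gossiping to recently dead nodes for longer

	ml, err := memberlist.Create(conf)
	if err != nil {
		return nil, err
	}
	log.Printf("memberlist started with relaxed failure detection")
	return ml, nil
}
```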
Edit: please ignore this and the following message. They were useful to the thinking process, but expose ideas I no longer find relevant.

I gave this some time and here are some of my thoughts. I cannot think of a proper corner case where we reach a state of zero nodes without being isolated. Nodes are removed from membership when they leave the cluster or when they time out. Timeouts are the same on all nodes, so the current node will be removed from the other nodes' state at pretty much the same time as it forgets about them. In the end, unless all other nodes are leaving at the same time (a global restart), reaching zero nodes pretty much means we are isolated, and the rest of the cluster (one or more other nodes) has probably also forgotten about us.

In my opinion, reaching such a state, where we have almost no chance of being passively reintegrated into the cluster, can lead to two different responses:
Trying to join again is complex because we do not know when connectivity will be back, so we would need to retry forever, probably with some kind of backoff. But if we retry forever upon isolation, why wouldn't we retry forever during the first join? My best guess is that we want to detect a misconfiguration on startup, so the first join should not retry forever; then, if we get isolated, retries should be infinite but generate logs so we know something is going wrong.

The other option is to plainly fail, which we must do if no join address is set up (contacting old nodes might not make sense or might even not be safe, and we have no decent heuristic for guessing which old node might still be up). I think we might want to provide the option to fail anyway and exit with an error status: many people will not like the infinite-retry behavior and will prefer a clean exit plus manually triggered responses, like an automatic restart, a monitoring notification, or even a machine reboot. In the end, I would add a
On the more general split-brain problem, we could add a timer and check at regular intervals that every node provided in `--join` is still part of the member list. However, the implementation sounds a bit trickier. We would want to back off in some way, and also not block the entire main loop, so other cluster operations can proceed normally. I have no idea how to design this atm.
After giving it some time and some fiddling around, here is my new take on the matter: solving the issue might actually be fairly simple. We could simply add a tick channel to the main loop, with a decent interval, and check once in a while that every address provided in `--join` is still present in the member list, re-joining the ones that are missing.

My assumption is that this simple change will solve most split-brain issues as long as the join nodes are still online, hence the new semantics about their stability. My other assumption is that backoff is not necessary. As long as the interval is long enough, we do not have to retry quickly: upon detecting the lack of a join node, there is a fair chance it is offline or we are experiencing a split-brain, very possibly due to network issues. Due to memberlist timeouts, we also know that the join node has been gone for some time, so there is no telling whether it will be back soon, and no reason to retry fast in the first place. So we should retry slowly, at regular, long enough intervals.

My last assumption is that, since we only retry at regular, long enough intervals, the overhead in the main loop is small enough to avoid any other channel filling up. So we can simply take care of this in the main loop. This makes the implementation almost trivial. I will try something in the next couple of days.
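For the record, here is a minimal sketch of what that tick-channel check could look like, assuming hashicorp/memberlist; `runLoop`, `events` and `joinAddrs` (the addresses passed to `--join`) are placeholder names, not wesher's actual identifiers.

```go
package cluster

import (
	"log"
	"time"

	"github.com/hashicorp/memberlist"
)

// runLoop is a sketch (not wesher's actual code) of the proposal above: the
// existing main select loop gains a long-interval ticker that re-checks the
// configured join addresses and re-joins any that fell out of the member list.
func runLoop(ml *memberlist.Memberlist, joinAddrs []string, events <-chan memberlist.NodeEvent) {
	rejoinTick := time.NewTicker(10 * time.Minute) // illustrative interval
	defer rejoinTick.Stop()

	for {
		select {
		case ev := <-events:
			_ = ev // normal cluster event handling would happen here

		case <-rejoinTick.C:
			// Collect the addresses memberlist currently knows about.
			// (Matching join addresses against member IPs is simplified here:
			// real code would need to normalize hostnames and ports.)
			known := make(map[string]bool)
			for _, node := range ml.Members() {
				known[node.Addr.String()] = true
			}

			var missing []string
			for _, addr := range joinAddrs {
				if !known[addr] {
					missing = append(missing, addr)
				}
			}
			if len(missing) > 0 {
				log.Printf("join nodes %v missing from member list, attempting to rejoin", missing)
				if _, err := ml.Join(missing); err != nil {
					log.Printf("rejoin failed: %v", err)
				}
			}
		}
	}
}
```

The 10-minute interval is arbitrary; per the reasoning above, anything long enough to keep the extra work in the main loop negligible should do.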
I have a test implementation running on a couple of live clusters. First tests are promising against manually triggered failures. I will provide feedback about real-life efficiency in a couple of days.
Sounds nice! Can you open a PR so we can start iterating on the code? I'm a bit apprehensive about overloading the semantics of `--join`. But again, maybe after seeing the code this fear will turn out to be unfounded.
Maybe a flag to specify the delay between rejoins, with a default value that disables the feature? Or a different, more explicit flag?
Both sound OK. I'm slightly more inclined towards the latter. I played a bit with this idea and even added it to the README:
The second variant has the advantage of being slightly more flexible: you could have your provisioning set up as … But I'm open to counter-arguments, as always!
There are two kinds of split-brain:

- the cluster splits into two (or more) groups that each still see some members;
- a single node gets isolated and ends up with an empty member list.
There would be a general fix for both of these, which involves keeping track of some super-nodes (maybe all known nodes?) and regularly trying to join these nodes to the memberlist, with some kind of backoff mechanism, and maybe forgetting them after some (fairly long) time.
This would probably require some complex code, should not be run from inside the main loop to avoid deadlocking, and, quite frankly, it sounds scary to me. I would love to get into it later, but I am not familiar enough with the wesher code for now.
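Purely to make the idea concrete, here is a rough sketch of what such a supervision loop might look like, run in its own goroutine to stay out of the main loop. Every name and duration in it is hypothetical, and it glosses over synchronization with the rest of the program.

```go
package cluster

import (
	"log"
	"time"

	"github.com/hashicorp/memberlist"
)

// nodeRecord holds the per-node state for the "general fix" idea above.
// All names here are hypothetical; this is not actual wesher code.
type nodeRecord struct {
	lastSeen time.Time     // last time the node appeared in the member list
	backoff  time.Duration // current retry backoff for this node
	nextTry  time.Time     // do not retry before this time
}

// superviseMembers is meant to run in its own goroutine, outside the main
// loop. It remembers every node it has ever seen, periodically tries to
// re-join the ones that disappeared (with per-node backoff), and forgets
// nodes that have been gone for a long time.
func superviseMembers(ml *memberlist.Memberlist) {
	const forgetAfter = 24 * time.Hour
	remembered := map[string]*nodeRecord{}

	for range time.Tick(time.Minute) {
		now := time.Now()

		// Record or refresh every currently known member.
		current := map[string]bool{}
		for _, n := range ml.Members() {
			addr := n.Addr.String()
			current[addr] = true
			if remembered[addr] == nil {
				remembered[addr] = &nodeRecord{}
			}
			remembered[addr].lastSeen = now
			remembered[addr].backoff = 0
		}

		// Try to re-join remembered nodes that dropped out of the list.
		for addr, rec := range remembered {
			switch {
			case current[addr]:
				// still a member: nothing to do
			case now.Sub(rec.lastSeen) > forgetAfter:
				delete(remembered, addr) // forget long-gone nodes
			case now.After(rec.nextTry):
				if _, err := ml.Join([]string{addr}); err != nil {
					log.Printf("re-join of %s failed: %v", addr, err)
					if rec.backoff < 30*time.Minute {
						rec.backoff = rec.backoff*2 + time.Minute
					}
					rec.nextTry = now.Add(rec.backoff)
				}
			}
		}
	}
}
```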
However, the second, more common case has a quick and (not so) dirty fix. If the memberlist becomes empty, it is usually safe to consider that we are facing a split-brain; more generally, we know for sure we are at a dead end (until some node leaves/joins, that is). So I think it would be safe to simply fatal-exit. Then it is the service manager's responsibility to handle restarting, if required by the admin.
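A minimal sketch of that check, assuming hashicorp/memberlist (the helper name is made up):

```go
package cluster

import (
	"log"

	"github.com/hashicorp/memberlist"
)

// exitIfAlone sketches the quick fix: if the local node is the only member
// left, assume we are isolated and terminate, leaving any restart policy to
// the service manager. (Hypothetical name, not the actual patch.)
func exitIfAlone(ml *memberlist.Memberlist) {
	// NumMembers counts the local node too, so 1 means everyone else is gone.
	if ml.NumMembers() <= 1 {
		log.Fatal("member list is empty apart from ourselves: assuming isolation, exiting")
	}
}
```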
I have a patch working for this, and have tested it using systemd unit files with success. I need to isolate the changes and provide a PR.
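For illustration, a systemd unit along these lines would take care of the restart after such a fatal exit; the binary path, join address and timings here are placeholders, not the actual unit files mentioned above.

```ini
[Unit]
Description=wesher overlay mesh
After=network-online.target
Wants=network-online.target

[Service]
# Path and join address are placeholders; adjust to your setup.
ExecStart=/usr/local/bin/wesher --join node1.example.com
# Restart after a fatal exit, e.g. when the node detects it is isolated.
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target
```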