Minimum time between kills #67
I could work on this. It might have some overlap with the blacklist feature. @etiennetremel, would you accept a pull request?
The random component of the kill times is chosen between 0 and 12h. The way I would implement this is to decompose this 12h interval into equal active intervals, one per node, each separated by equally sized inactive intervals.
To solve this, I would put an upper limit on the interval. How much we limit it has a direct impact on the randomness of the kill times, which is the whole point of this tool. The following formula applies:
Where:
Since the node count is only determined at runtime, I would like to put a limit on either value.
(Edge cases are not dealt with; ignore that, I don't have the brainpower right now to find the perfect formula.) Either way, I would impose a condition along the lines sketched below.
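A minimal Go sketch of that kind of startup check, assuming the ~12h randomisation window has to fit all nodes spaced at least the minimum interval apart; the function name, the exact bound, and the error text are illustrative assumptions, not the formula from this comment:

```go
package main

import (
	"fmt"
	"time"
)

// validateMinimumKillInterval is only an illustration of the kind of check
// described above, not the formula from this thread: with nodeCount nodes
// spread over a 12h randomisation window, kills can only be at least
// minInterval apart if the (nodeCount-1) gaps between them fit in the window.
func validateMinimumKillInterval(minInterval time.Duration, nodeCount int) error {
	const window = 12 * time.Hour
	if nodeCount <= 1 || minInterval <= 0 {
		return nil
	}
	if time.Duration(nodeCount-1)*minInterval > window {
		return fmt.Errorf("minimum kill interval %s is too large for %d nodes: it must not exceed %s",
			minInterval, nodeCount, window/time.Duration(nodeCount-1))
	}
	return nil
}

func main() {
	if err := validateMinimumKillInterval(3*time.Hour, 6); err != nil {
		fmt.Println(err) // 5 gaps of 3h = 15h, which does not fit in the 12h window
	}
}
```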
If this condition is not respected, an error will be shown. My questions for anyone interested in this feature:
Nice! A few comments:
This is not the case with autoscaling clusters. Our number of nodes fluctuates throughout the day. Or do I misunderstand?
Why is this necessary? Wouldn't it be easier to allow any minimum interval?
Well, the killer either runs once and exits, or it runs periodically at a configurable interval.
Consider this then: I am not an actual user of this tool, nor do I understand why killing preemptible nodes earlier than their original 24h schedule is important (I think it's for fault detection?), so I have to ask: is this precise timing of the kills actually a problem? If not, then the good thing is that those who do specify the minimum kill interval can be made aware of the constraint through a help entry, right? So it's not enforced on all users, and by default it will be 0. Just say the word and I'll get to work.
Ah okay, I understand. Then indeed it can be viewed as constant. However, the issue with determining the constraints is still the same: if the constraint on the minimum time (a config value) is determined by the number of nodes (a value that can change), then the config can be invalidated just because the cluster has scaled up, no?
It's because Google will often kill them if they reach that age, so if many of your nodes were created at the same time (new application deployed), then there is a good chance they'll all get killed at that time too. This is compounded by the fact that there are also certain times of day when a lot of killing happens, and if many of your nodes are killed at that time, then tomorrow it's even more likely to happen again. Randomly killing them earlier helps keep the killing spread throughout the day.
No, imo the whole goal here is to spread the killing out as opposed to making it truly "random", though I'm not the creator so they may have different opinions. So I'd vote for
Hi, currently the implementation doesn't look at all nodes at once; when a node is seen for the first time by the controller - either via the node watcher or the fallback loop - it sets an expiry time without taking any other node into account. Because it randomises the expiration between 12 and 24 hours, it spreads the terminations out, avoiding a mass termination 24 hours after a large number of nodes have been created simultaneously. Doing it this way keeps the logic extremely simple, but of course it might have some unintended side effects where multiple nodes get killed at the same time. When the controller kills a node, though, it drains the pods from the node, so it shouldn't impact the operation of your applications too much. However, a couple of things can be improved:
Pull requests for these improvements are more than welcome! We rely less on preemptibles these days, since in our region Google started preempting nodes much more frequently, often before this controller terminated them. And when they preempt a node, GKE doesn't get notified of it and thinks pods are still running on that node for a while, leading to issues with kube-dns. For that reason we're not spending much time on improving this controller. These days, however, a combination of this controller with https://github.com/GoogleCloudPlatform/k8s-node-termination-handler would be a good option, because the latter also ensures nodes get cleaned up in a better way in case of real preemptions.
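For readers skimming the thread, the randomised-expiry idea described above boils down to something like this minimal sketch; names are illustrative, not the controller's actual code:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomExpiry mirrors the idea described above in a simplified form: when a
// node is first seen, pick an expiry between 12 and 24 hours after its
// creation time, so nodes created together don't all hit the 24h preemption
// age at the same moment.
func randomExpiry(created time.Time) time.Time {
	jitter := time.Duration(rand.Int63n(int64(12 * time.Hour)))
	return created.Add(12*time.Hour + jitter)
}

func main() {
	created := time.Now()
	fmt.Println("node expires at:", randomExpiry(created))
}
```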
Respecting PDBs would already be a great start indeed. I raised this issue because of a slow-starting application that was disrupted by its nodes all getting killed within its startup time, leading to an outage. We do also run the node termination handler for "real" preemptions, but this is a good way to avoid those also happening all at the same time (since real preemptions obviously don't respect PDBs either).
Done #69. Give it a spin. Here's a stress test run.
Hi @andrei-pavel, @JohnPaton, sorry for being slow to respond to this issue and the PR. Parking something to look at later sometimes leads to it escaping my attention fully :| I think a lot of the weird workarounds we're trying to add to this controller could be avoided by making the application PDB-aware instead. What do you think? If a node comes up for preemption, we would have to check all PDBs related to the applications running on that node and await any of them that don't allow for disruptions (probably for up to 45 minutes or so, just like most other PDB-aware parts of Kubernetes).
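A rough sketch of the waiting part of that idea, assuming the PDBs matching the pods on the node have already been looked up; the `fetch` callback and all names here are hypothetical, and the per-pod selector lookup itself is sketched after the autoscaler comment further down:

```go
package pdbaware

import (
	"context"
	"errors"
	"time"

	policyv1 "k8s.io/api/policy/v1"
)

// disruptionBlocked reports whether any of the given PDBs currently allows no
// further voluntary disruptions.
func disruptionBlocked(pdbs []policyv1.PodDisruptionBudget) bool {
	for _, pdb := range pdbs {
		if pdb.Status.DisruptionsAllowed == 0 {
			return true
		}
	}
	return false
}

// awaitPDBs polls until none of the PDBs relevant to the node block a
// disruption, giving up after roughly 45 minutes like other PDB-aware
// components. fetch is a hypothetical callback returning the PDBs that match
// the pods currently running on the node.
func awaitPDBs(ctx context.Context, fetch func(context.Context) ([]policyv1.PodDisruptionBudget, error)) error {
	deadline := time.Now().Add(45 * time.Minute)
	for {
		pdbs, err := fetch(ctx)
		if err != nil {
			return err
		}
		if !disruptionBlocked(pdbs) {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("PDBs still blocking disruption after 45 minutes")
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(30 * time.Second):
		}
	}
}
```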
That would make sense to me, and it would become "automatically" interoperable with existing configs, so 👍
Selecting the PDBs for each pod is done as follows in the cluster autoscaler: It doesn't seem too efficient, but given that it's not executed very often, this approach would also be suitable for deleting pods in this application.
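For reference, a minimal sketch of that per-pod lookup, under the assumption that matching is done by namespace plus label selector; this is illustrative and not the autoscaler's actual code:

```go
package pdbaware

import (
	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// matchingPDBs returns the PodDisruptionBudgets whose selector matches the
// given pod's labels, similar in spirit to the cluster autoscaler's lookup.
// A linear scan over all PDBs is not the most efficient approach, but the
// check runs rarely enough that it hardly matters here.
func matchingPDBs(pod corev1.Pod, pdbs []policyv1.PodDisruptionBudget) []policyv1.PodDisruptionBudget {
	var matches []policyv1.PodDisruptionBudget
	for _, pdb := range pdbs {
		if pdb.Namespace != pod.Namespace {
			continue
		}
		selector, err := metav1.LabelSelectorAsSelector(pdb.Spec.Selector)
		if err != nil {
			continue // skip PDBs with malformed selectors
		}
		if selector.Matches(labels.Set(pod.Labels)) {
			matches = append(matches, pdb)
		}
	}
	return matches
}
```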
It would be nice to have an option for the minimum time between node expirations. Obviously if you have many many nodes then this isn't feasible, but if you only have a handful then it can be quite disruptive to kill several within a few minutes of each other (avoiding this was actually my initial reason for looking into this project).
Maybe something like MINIMUM_KILL_INTERVAL?
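As one possible shape for such an option, here is a hedged sketch of how a MINIMUM_KILL_INTERVAL-style setting could be layered on top of random per-node expiries; the function and its behaviour are assumptions for illustration, not how the project (or PR #69) actually implements it:

```go
package main

import (
	"fmt"
	"time"
)

// scheduleExpiry sketches one way a MINIMUM_KILL_INTERVAL option could work:
// if the freshly drawn expiry lands within minInterval of an already scheduled
// one, push it out until it no longer does. Purely illustrative.
func scheduleExpiry(candidate time.Time, scheduled []time.Time, minInterval time.Duration) time.Time {
	if minInterval <= 0 {
		return candidate
	}
	for {
		tooClose := false
		for _, t := range scheduled {
			d := candidate.Sub(t)
			if d < 0 {
				d = -d
			}
			if d < minInterval {
				tooClose = true
				break
			}
		}
		if !tooClose {
			return candidate
		}
		candidate = candidate.Add(minInterval)
	}
}

func main() {
	now := time.Now()
	scheduled := []time.Time{now.Add(13 * time.Hour)}
	// A candidate only 5 minutes after an existing expiry gets pushed out by 30 minutes.
	expiry := scheduleExpiry(now.Add(13*time.Hour+5*time.Minute), scheduled, 30*time.Minute)
	fmt.Println("expiry pushed to:", expiry)
}
```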