Node NotReady Disruption Controller #1659
Labels
- `kind/feature`: Categorizes issue or PR as related to a new feature.
- `needs-triage`: Indicates an issue or PR lacks a `triage/foo` label and requires one.
Description
What problem are you trying to solve?
Sometimes nodes just become `NotReady` for a variety of reasons (a bad cloud provider instance, a non-responsive kubelet, etc.). When a node has been in a `Ready` state and then transitions into `NotReady`, I think that Karpenter should have another disruption controller that monitors for these nodes and terminates them. Third-party controllers like the Spot.io Ocean product and the Cluster Autoscaler both handle nodes that become `NotReady` for you automatically; Karpenter should be able to do the same thing.

(Note: we have also raised this with our AWS TAM via a support ticket, and we were recommended to open a feature request here.)
Related: #1573
How important is this feature to you?
This is actually a blocker for us migrating off of our current tools: we launch enough nodes, and see enough failures throughout the day, that we cannot fully migrate unless we have a completely automated self-healing system where these nodes get cycled out once they become `NotReady`.

(Separate but related is the ongoing discussion at bottlerocket-os/bottlerocket#4075 about EKS nodes becoming unready due to heavy memory pressure.)