
Conversation

elmiko
Contributor

@elmiko elmiko commented Sep 10, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR adds a new lister for ready unschedulable nodes and connects that lister to a new parameter in the node info processor's Process function. This change enables the autoscaler to use unschedulable, but otherwise ready, nodes as a last resort when creating node templates for scheduling simulation.
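For context, a minimal sketch of the kind of selection described here, using only the upstream Kubernetes API. The helper names are hypothetical and are not the actual code in this PR:

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
)

// readyUnschedulable selects nodes whose NodeReady condition is True but
// which are marked unschedulable (for example, cordoned nodes). These are
// the nodes the processor could fall back on when building templates.
func readyUnschedulable(nodes []*apiv1.Node) []*apiv1.Node {
	var out []*apiv1.Node
	for _, node := range nodes {
		if node.Spec.Unschedulable && isReady(node) {
			out = append(out, node)
		}
	}
	return out
}

// isReady reports whether the node's NodeReady condition is True.
func isReady(node *apiv1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == apiv1.NodeReady {
			return cond.Status == apiv1.ConditionTrue
		}
	}
	return false
}
```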

Which issue(s) this PR fixes:

Fixes #8380

Special notes for your reviewer:

I'm not sure this is the best way to solve the problem, but I am proposing it for further discussion and design.

Does this PR introduce a user-facing change?

Node groups where all the nodes are ready but unschedulable will be processed as potential candidates for scaling when simulating cluster scheduling.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/needs-area cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elmiko
Once this PR has been reviewed and has the lgtm label, please assign aleksandra-malinowska for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/needs-area labels Sep 10, 2025
@elmiko
Contributor Author

elmiko commented Sep 10, 2025

I'm working on adding more unit tests for this behavior, but I wanted to share this solution so we could start talking about it.

@elmiko elmiko force-pushed the unschedulable-nodes-fix branch from a0ebb28 to 3270172 on October 2, 2025 20:50
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 2, 2025
@elmiko
Contributor Author

elmiko commented Oct 2, 2025

I've rewritten this patch to use all nodes as the secondary value instead of using a new list of ready unschedulable nodes.

@elmiko elmiko changed the title WIP update to include unschedulable nodes update node info processors to include unschedulable nodes Oct 2, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 2, 2025
@elmiko
Contributor Author

elmiko commented Oct 2, 2025

I need to do a little more testing on this locally, but I think this is fine for review.

// Last resort - unready/unschedulable nodes.
-for _, node := range nodes {
+// we want to check not only the ready nodes, but also ready unschedulable nodes.
+for _, node := range append(nodes, allNodes...) {
Contributor Author

I'm not sure it's appropriate to append these; theoretically, allNodes should already contain nodes. I'm going to test this out using just allNodes.

Contributor Author

Due to filtering that happens in obtainNodeLists, we need to combine both lists of nodes here.

@elmiko elmiko force-pushed the unschedulable-nodes-fix branch from 3270172 to cb2649a on October 3, 2025 16:37
@elmiko
Contributor Author

elmiko commented Oct 3, 2025

I updated the argument names in the Process function to make the source of the nodes clearer. I also changed the mixed node info processor to not double-count the nodes in the unschedulable/unready detection clause.

@elmiko
Contributor Author

elmiko commented Oct 3, 2025

It seems like the update to the mixed node processor needs a little more investigation.

@elmiko elmiko force-pushed the unschedulable-nodes-fix branch from cb2649a to fd53c0b on October 3, 2025 16:59
@elmiko
Contributor Author

elmiko commented Oct 3, 2025

It looks like we need both the readyNodes and allNodes lists due to the filtering that happens in the core.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2025
This change updates the `Process` function of the node info processor
interface so that it can accept a second list of nodes. The second list
contains all the nodes that are not in the first list. This will allow
the mixed node info processor to properly detect unready and
unschedulable nodes for use as templates.
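For illustration, a trimmed-down sketch of the amended interface described in this commit message. The type and parameter names are assumptions, and the real Process takes more arguments and a richer result type than shown here:

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
)

// NodeGroupTemplates maps a node group id to the node chosen as its
// template. It stands in for the autoscaler's richer NodeInfo result type.
type NodeGroupTemplates map[string]*apiv1.Node

// TemplateNodeInfoProvider is a hypothetical, trimmed-down version of the
// interface: Process now receives the ready nodes plus a second list
// covering the rest of the cluster, so unready and unschedulable nodes can
// also be considered as last-resort templates.
type TemplateNodeInfoProvider interface {
	Process(readyNodes, allNodes []*apiv1.Node) (NodeGroupTemplates, error)
}
```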
@elmiko elmiko force-pushed the unschedulable-nodes-fix branch from fd53c0b to 906a939 on October 8, 2025 18:44
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 8, 2025
@elmiko
Contributor Author

elmiko commented Oct 8, 2025

Rebased.

@elmiko
Contributor Author

elmiko commented Oct 14, 2025

@jackfrancis @towca any chance at a review here?

// we want to check not only the ready nodes, but also ready unschedulable nodes.
// this needs to combine readyNodes and allNodes due to filtering that occurs at
// a higher level.
for _, node := range append(readyNodes, allNodes...) {
Contributor

Two things:

Isn't readyNodes a subset of allNodes? In which case this will range over the nodes in readyNodes twice.

Also, how do we know the diff of allNodes - readyNodes are nodes of type Ready + Unschedulable? Aren't there going to be other types of nodes not classified as readyNodes in that set (for example, various flavors of NotReady nodes)?

Contributor Author

> Isn't readyNodes a subset of allNodes? In which case this will range over the nodes in readyNodes twice.

readyNodes is not a pure subset of allNodes; there is some filtering that occurs to remove some of the nodes in readyNodes from allNodes.

If you change this line to use only allNodes, you will see some unit tests fail.

> Also, how do we know the diff of allNodes - readyNodes are nodes of type Ready + Unschedulable? Aren't there going to be other types of nodes not classified as readyNodes in that set (for example, various flavors of NotReady nodes)?

I did look at how readyNodes and allNodes are created; the filtering happens in this function: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L993

I have run a series of tests using allNodes, and also a version of this patch that specifically creates a readyUnschedulableNodes list. After running both and putting them through CI, I am more convinced that using allNodes here is the appropriate thing to do. Adding a new lister for "ready unschedulable" nodes did not change the results of my testing, and it makes the code more complicated. This is why I went with the allNodes approach.

Contributor Author

> Aren't there going to be other types of nodes not classified as readyNodes in that set (for example, various flavors of NotReady nodes)?

One more point about this: looking at the function in the mixed node info processor, we can see that conditions other than just "ready unschedulable" are checked for. I think the original intent of this function was to look at all nodes.

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodeinfosprovider/mixed_nodeinfos_processor.go#L181

Contributor

@towca does this seem like the right path forward to you as well?

Contributor

> there is some filtering that occurs to remove some of the nodes in readyNodes from allNodes.

Should we be more precise about how we append allNodes to readyNodes to avoid duplicates? Or are we confident that the effort to de-dup is equivalent or more costly than duplicate processing?

Collaborator

I attempted a bit of archaeology and found the PR that added the filtering by readiness to this logic: #72. This is pretty bizarre, because even way back then it seems that this logic would only see Ready Nodes:

In any case, IMO the most readable change would be to:

  • Start passing allNodes instead of readyNodes to TemplateNodeInfoProvider.Process() without changing the signature. This is what the interface definition suggests anyway.
  • At the beginning of MixedTemplateNodeInfoProvider.Process(), group the passed allNodes into good and bad candidates utilizing isNodeGoodTemplateCandidate(). Then iterate over the good ones in the first loop, and over the bad ones in the last loop.

This should work because:

  • readyNodes should be a subset of allNodes. So the logic should see all the same Nodes as before + additional ones. This is a bit murky because the Node lists are modified by CustomResourcesProcessor after being listed. CustomResourcesProcessor implementations should only remove Nodes from readyNodes and hack their Ready condition in allNodes. This is what the in-tree implementations do; if an out-of-tree implementation breaks the assumption, the lists might not be subsets, but IMO this isn't a supported case and such implementations should be ready for things breaking.
  • The set of conditions checked in isNodeGoodTemplateCandidate() (ready && stable && schedulable && !toBeDeleted) is a superset of conditions by which readyNodes are filtered from allNodes (ready && schedulable). Both places use the same kube_util.GetReadinessState() function for determining the ready part, the schedulable part is just checking the same field on the Node.
  • Based on the two above if a Node is in allNodes, but not readyNodes, isNodeGoodTemplateCandidate() should always return false for it. So all the Nodes from allNodes - readyNodes should be categorized as bad candidates like we want.
  • And if a Node is in readyNodes, it should also be in allNodes and the result of isNodeGoodTemplateCandidate() should be identical for both versions. So the good candidates determined from allNodes via isNodeGoodTemplateCandidate() should be exactly the same as determined from readyNodes like we do now.

Does that make sense? @x13n could you double-check my logic here?

@elmiko IIUC you attempted something like this and got unit test failures? Could you describe what kind? I could definitely see the tests just being too coupled to the current implementation.

Contributor Author

> Should we be more precise about how we append allNodes to readyNodes to avoid duplicates? Or are we confident that the effort to de-dup is equivalent or more costly than duplicate processing?

At this point in the processing, I don't think the duplicates are an issue.
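If de-duplication ever did become necessary, a minimal sketch of one option (illustrative only; dedupByName is a hypothetical helper, not code in this PR) is to track node names while combining the two lists:

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
)

// dedupByName walks readyNodes first and then allNodes, skipping any node
// whose name has already been seen, so no node is processed twice.
func dedupByName(readyNodes, allNodes []*apiv1.Node) []*apiv1.Node {
	seen := make(map[string]bool, len(allNodes))
	out := make([]*apiv1.Node, 0, len(allNodes))
	for _, list := range [][]*apiv1.Node{readyNodes, allNodes} {
		for _, node := range list {
			if seen[node.Name] {
				continue
			}
			seen[node.Name] = true
			out = append(out, node)
		}
	}
	return out
}
```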

Contributor Author

> @elmiko IIUC you attempted something like this and got unit test failures? Could you describe what kind? I could definitely see the tests just being too coupled to the current implementation.

I will need to run the unit tests again, but essentially, if only allNodes is used in the final clause of the mixed node infos processor's Process function, a few tests fail. My impression is that the filtering done by the custom node processor, or by the processor that filters out nodes with startup taints, is causing the issues.

I can certainly take another look at passing only allNodes to Process. I didn't want to break anything else though XD

Contributor Author

This is one of the main failures I see when changing to use only allNodes (generated by running `go test ./...` in the cluster-autoscaler/core directory):

--- FAIL: TestScaleUpToMeetNodeGroupMinSize (0.00s)
    orchestrator_test.go:1684:
                Error Trace:    /home/mike/dev/kubernetes-autoscaler/cluster-autoscaler/core/scaleup/orchestrator/orchestrator_test.go:1684
                Error:          Received unexpected error:
                                could not compute total resources: No node info for: ng1
                Test:           TestScaleUpToMeetNodeGroupMinSize
    orchestrator_test.go:1685:
                Error Trace:    /home/mike/dev/kubernetes-autoscaler/cluster-autoscaler/core/scaleup/orchestrator/orchestrator_test.go:1685
                Error:          Should be true
                Test:           TestScaleUpToMeetNodeGroupMinSize
    orchestrator_test.go:1686:
                Error Trace:    /home/mike/dev/kubernetes-autoscaler/cluster-autoscaler/core/scaleup/orchestrator/orchestrator_test.go:1686
                Error:          Not equal:
                                expected: 1
                                actual  : 0
                Test:           TestScaleUpToMeetNodeGroupMinSize
panic: runtime error: index out of range [0] with length 0 [recovered]
        panic: runtime error: index out of range [0] with length 0

goroutine 81 [running]:
testing.tRunner.func1.2({0x2acda20, 0xc0003fd5a8})
        /home/mike/sdk/go1.24.0/src/testing/testing.go:1734 +0x21c
testing.tRunner.func1()
        /home/mike/sdk/go1.24.0/src/testing/testing.go:1737 +0x35e
panic({0x2acda20?, 0xc0003fd5a8?})
        /home/mike/sdk/go1.24.0/src/runtime/panic.go:787 +0x132
k8s.io/autoscaler/cluster-autoscaler/core/scaleup/orchestrator.TestScaleUpToMeetNodeGroupMinSize(0xc000d16e00)
        /home/mike/dev/kubernetes-autoscaler/cluster-autoscaler/core/scaleup/orchestrator/orchestrator_test.go:1687 +0x11fa
testing.tRunner(0xc000d16e00, 0x2e31b28)
        /home/mike/sdk/go1.24.0/src/testing/testing.go:1792 +0xf4
created by testing.(*T).Run in goroutine 1
        /home/mike/sdk/go1.24.0/src/testing/testing.go:1851 +0x413
FAIL    k8s.io/autoscaler/cluster-autoscaler/core/scaleup/orchestrator  0.044s
--- FAIL: TestDeltaForNode (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x246c737]

goroutine 98 [running]:
testing.tRunner.func1.2({0x27e4140, 0x4c18f50})
        /home/mike/sdk/go1.24.0/src/testing/testing.go:1734 +0x21c
testing.tRunner.func1()
        /home/mike/sdk/go1.24.0/src/testing/testing.go:1737 +0x35e
panic({0x27e4140?, 0x4c18f50?})
        /home/mike/sdk/go1.24.0/src/runtime/panic.go:787 +0x132
k8s.io/autoscaler/cluster-autoscaler/simulator/framework.(*NodeInfo).Node(...)
        /home/mike/dev/kubernetes-autoscaler/cluster-autoscaler/simulator/framework/infos.go:66
k8s.io/autoscaler/cluster-autoscaler/core/scaleup/resource.(*Manager).DeltaForNode(0xc000ab1ce0, 0xc000aa6008, 0x0, {0x30df800, 0xc000703700})
        /home/mike/dev/kubernetes-autoscaler/cluster-autoscaler/core/scaleup/resource/manager.go:64 +0x57
k8s.io/autoscaler/cluster-autoscaler/core/scaleup/resource.TestDeltaForNode(0xc000103500)
        /home/mike/dev/kubernetes-autoscaler/cluster-autoscaler/core/scaleup/resource/manager_test.go:79 +0x5e5
testing.tRunner(0xc000103500, 0x2ddcaa0)
        /home/mike/sdk/go1.24.0/src/testing/testing.go:1792 +0xf4
created by testing.(*T).Run in goroutine 1
        /home/mike/sdk/go1.24.0/src/testing/testing.go:1851 +0x413
FAIL    k8s.io/autoscaler/cluster-autoscaler/core/scaleup/resource      0.029s
?       k8s.io/autoscaler/cluster-autoscaler/core/test  [no test files]
ok      k8s.io/autoscaler/cluster-autoscaler/core/utils 0.025s
FAIL

Contributor Author

Actually, looking at this failure again, I think it's due to my change in the test.

@elmiko
Contributor Author

elmiko commented Oct 16, 2025

> In any case, IMO the most readable change would be to:
>
>   • Start passing allNodes instead of readyNodes to TemplateNodeInfoProvider.Process() without changing the signature. This is what the interface definition suggests anyway.
>   • At the beginning of MixedTemplateNodeInfoProvider.Process(), group the passed allNodes into good and bad candidates utilizing isNodeGoodTemplateCandidate(). Then iterate over the good ones in the first loop, and over the bad ones in the last loop.

I can put together a patch like this and give it some tests.
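For illustration only, a rough sketch of the grouping step suggested above. The predicate signature is approximated from the discussion, not copied from the real processor:

```go
package sketch

import (
	"time"

	apiv1 "k8s.io/api/core/v1"
)

// splitTemplateCandidates partitions allNodes into good and bad template
// candidates using the supplied predicate, so the processor can iterate the
// good ones first and fall back to the bad ones only as a last resort.
func splitTemplateCandidates(allNodes []*apiv1.Node, now time.Time, isGood func(*apiv1.Node, time.Time) bool) (good, bad []*apiv1.Node) {
	for _, node := range allNodes {
		if isGood(node, now) {
			good = append(good, node)
		} else {
			bad = append(bad, node)
		}
	}
	return good, bad
}
```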


Labels

area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CA potential for skipped node template info when a node group contains only non-ready nodes

4 participants