test: use custom gpu node config for processor tests #8607

jackfrancis · 2025-10-03T19:00:27Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR removes a reference to the GCE cloud provider from the GPU processor test library. As an exercise to better understand the current unit tests I did a "verbose refactor", and then added a DRA test that goes through the basic DRA driver filtering outcomes.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2025-10-03T19:00:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/OWNERS~~ [jackfrancis]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jackfrancis · 2025-10-03T19:02:07Z

@elmiko this should be the solution to (finally) unblock #8583

Don't worry about grok'ing all of the test foo. cc @towca @BigDarkClown for that part (🙏)

cc @sbueringer in case you're tracking #8583

elmiko

i don't know this test overly well but this is making sense to me.

towca · 2025-10-07T13:15:46Z

cluster-autoscaler/processors/customresources/gpu_processor_test.go

-	expectedReadiness[nodeDirectXReady.Name] = true
-
-	nodeDirectXUnready := &apiv1.Node{
+	// Here we add a vanilla NotReady node (no GPU or other device labels or status conditions)


This is more to ensure that our filter function doesn't affect no-GPU Nodes at all (same for the "ready vanilla" case).

towca · 2025-10-07T14:03:17Z

cluster-autoscaler/processors/customresources/gpu_processor_test.go

-		},
+// TestFilterOutNodesWithUnreadyResourcesDRA tests that FilterOutNodesWithUnreadyResources
+// does the right thing based on DRA configuration present in the node.
+func TestFilterOutNodesWithUnreadyResourcesDRA(t *testing.T) {


Oof, this grew quite complex 😅. Your intuitions about the test behavior are mostly correct, but if we're refactoring, IMO rewriting this into the table-based approach would be much clearer.

For example, we could have a test case like this:

type testCase struct {
	// Maps keyed by node name
	allNodes                     map[string]*apiv1.Node
	readyNodes                   map[string]*apiv1.Node
	wantNodesWithUnreadyOverride map[string]*apiv1.Node
}

Test code would be something like:

gotAllNodes, gotReadyNodes := processor.FilterOutNodesWithUnreadyResources(ctx, toList(tc.allNodes), toList(tc.readyNodes), nil)
gotAllNodesSet, gotReadyNodesSet := toSet(gotAllNodes), toSet(gotReadyNodes) // Keyed by node name
assert that gotAllNodesSet has the same keys as tc.allNodes
assert that gotReadyNodesSet has the same keys as (tc.readyNodes - tc.wantNodesWithUnreadyOverride)
for each node in gotReadyNodesSet:
    assert that the node is identical to tc.readyNodes[node.Name]
for each node in gotAllNodesSet:
    if the node is not in tc.wantNodesWithUnreadyOverride, assert that it's identical to tc.allNodes[node.Name]
    else assert that the node has the expected not-Ready condition (and that there's only 1 condition for readiness), and is otherwise identical to tc.allNodes[node.Name]
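A rough, runnable sketch of the set helpers that pseudocode leans on. The names `toList` and `toSet` come from the comment above; `node`, `minus`, and `sameKeys` are hypothetical stand-ins introduced here, and `node` replaces `*apiv1.Node` only so the example compiles without Kubernetes imports. The real test would call `FilterOutNodesWithUnreadyResources` where this sketch fakes the result.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical stand-in for *apiv1.Node that keeps only what the
// key comparison needs.
type node struct {
	Name  string
	Ready bool
}

// toList flattens a name-keyed map into a slice, sorted by name for stable order.
func toList(m map[string]*node) []*node {
	out := make([]*node, 0, len(m))
	for _, n := range m {
		out = append(out, n)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// toSet indexes a slice of nodes by name.
func toSet(nodes []*node) map[string]*node {
	out := make(map[string]*node, len(nodes))
	for _, n := range nodes {
		out[n.Name] = n
	}
	return out
}

// minus returns the entries of a whose keys are absent from b.
func minus(a, b map[string]*node) map[string]*node {
	out := make(map[string]*node)
	for k, v := range a {
		if _, found := b[k]; !found {
			out[k] = v
		}
	}
	return out
}

// sameKeys reports whether both maps hold exactly the same node names.
func sameKeys(a, b map[string]*node) bool {
	if len(a) != len(b) {
		return false
	}
	for k := range a {
		if _, found := b[k]; !found {
			return false
		}
	}
	return true
}

func main() {
	ready := map[string]*node{
		"gpu-ok":  {Name: "gpu-ok", Ready: true},
		"gpu-bad": {Name: "gpu-bad", Ready: true},
	}
	wantOverride := map[string]*node{"gpu-bad": ready["gpu-bad"]}

	// Assumption for illustration: the processor dropped gpu-bad from the ready list.
	gotReady := toSet([]*node{ready["gpu-ok"]})

	fmt.Println("ready keys match:", sameKeys(gotReady, minus(ready, wantOverride)))
}
```

Keying everything by node name keeps the assertions independent of the order in which the processor returns nodes.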

And we'd have these test cases based on the current ones:

1. GPU label present, condition ready, nvidia.com/gpu resource 0 -> overwritten to unready
2. GPU label present, condition ready, nvidia.com/gpu resource 1 -> no overwrites
3., 4. -> same as 1. and 2. but for the directX resource
5. GPU label present, condition ready, no nvidia/directX resource -> overwritten to unready
6. No GPU label, condition ready, no nvidia/directX resource -> no overwrites
7. No GPU label, condition unready, no nvidia/directX resource -> no overwrites

And then we can just add an additional case to cover the DRA part:

GPU label present, condition ready, no nvidia/directX resource, GetNodeGpuConfig indicates DRA -> no overwrites
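The cases listed above could be written down as a table along these lines. This is only a sketch of the expectations, not the processor itself: `nodeSpec`, `wantOverride`, and `count` are hypothetical names, and the model collapses the nvidia.com/gpu and directX resources into a single optional count.

```go
package main

import "fmt"

// nodeSpec is a hypothetical, flattened description of a test Node.
type nodeSpec struct {
	name        string
	gpuLabel    bool   // node carries the provider's GPU label
	ready       bool   // Ready condition reported by the node
	gpuResource *int64 // allocatable GPU count; nil means the resource is absent
	dra         bool   // GPU config indicates the device is exposed through DRA
}

// wantOverride models the expectation from the case list only: a Ready node
// with the GPU label and no usable GPU resource is overwritten to unready,
// unless its GPU config indicates DRA.
func wantOverride(n nodeSpec) bool {
	if !n.gpuLabel || !n.ready || n.dra {
		return false
	}
	return n.gpuResource == nil || *n.gpuResource == 0
}

// count returns a pointer to v so a case can distinguish "0" from "absent".
func count(v int64) *int64 { return &v }

var cases = []struct {
	node nodeSpec
	want bool // true means "overwritten to unready"
}{
	{nodeSpec{name: "nvidia-zero", gpuLabel: true, ready: true, gpuResource: count(0)}, true},
	{nodeSpec{name: "nvidia-one", gpuLabel: true, ready: true, gpuResource: count(1)}, false},
	{nodeSpec{name: "directx-zero", gpuLabel: true, ready: true, gpuResource: count(0)}, true},
	{nodeSpec{name: "directx-one", gpuLabel: true, ready: true, gpuResource: count(1)}, false},
	{nodeSpec{name: "label-no-resource", gpuLabel: true, ready: true}, true},
	{nodeSpec{name: "vanilla-ready", ready: true}, false},
	{nodeSpec{name: "vanilla-unready"}, false},
	{nodeSpec{name: "dra-no-resource", gpuLabel: true, ready: true, dra: true}, false},
}

func main() {
	for _, c := range cases {
		got := wantOverride(c.node)
		fmt.Printf("%-18s override=%-5t matches=%t\n", c.node.name, got, got == c.want)
	}
}
```

In the real test each row would build full Node objects and run through the processor; the point here is only that one row per case keeps the DRA case a single added line.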

As for the test provider setup, IMO it'd make the most sense to set up just 1 provider object, and have the custom GetNodeGpuConfig we're registering behave differently based on the Node it gets. This will keep the setup code and the test cases simple. E.g. We could have the method return different responses based on if the passed Node has "dra" in its name.
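A minimal sketch of that name-based stub, assuming stand-in types: `fakeNode`, `fakeGpuConfig`, the `UsesDRA` field, and the `example.com/gpu` label key are all hypothetical and are not the cloud provider's actual types or labels.

```go
package main

import (
	"fmt"
	"strings"
)

// gpuLabel is a hypothetical label key; the real provider defines its own.
const gpuLabel = "example.com/gpu"

// fakeNode is a stand-in for *apiv1.Node with only the fields the stub reads.
type fakeNode struct {
	Name   string
	Labels map[string]string
}

// fakeGpuConfig is a stand-in for the config returned by GetNodeGpuConfig.
type fakeGpuConfig struct {
	Label   string
	UsesDRA bool
}

// getNodeGpuConfig returns nil for nodes without the GPU label, and marks the
// config as DRA-backed when the node name contains "dra".
func getNodeGpuConfig(n *fakeNode) *fakeGpuConfig {
	if _, found := n.Labels[gpuLabel]; !found {
		return nil
	}
	return &fakeGpuConfig{
		Label:   gpuLabel,
		UsesDRA: strings.Contains(n.Name, "dra"),
	}
}

func main() {
	nodes := []*fakeNode{
		{Name: "vanilla-ready"},
		{Name: "gpu-nvidia", Labels: map[string]string{gpuLabel: "true"}},
		{Name: "gpu-dra", Labels: map[string]string{gpuLabel: "true"}},
	}
	for _, n := range nodes {
		cfg := getNodeGpuConfig(n)
		fmt.Printf("%s: gpu=%t dra=%t\n", n.Name, cfg != nil, cfg != nil && cfg.UsesDRA)
	}
}
```

With a single stub like this, one provider object can serve every test case, and a case opts into DRA behavior purely through its node name.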

WDYT?

jackfrancis · 2025-10-07

ACK, I'll rewrite. Thanks for the detailed notes on a quick initial attempt!

jackfrancis · 2025-10-13

@towca see my second commit for an attempt to refactor into a single test case (we can decompose the actual cases after we're confident this is the right structure)

thoughts?

k8s-ci-robot added release-note-none, kind/cleanup, cncf-cla: yes, do-not-merge/needs-area labels Oct 3, 2025

k8s-ci-robot added approved, area/cluster-autoscaler labels Oct 3, 2025

k8s-ci-robot requested review from aleksandra-malinowska and vadasambar October 3, 2025 19:00

k8s-ci-robot added size/L and removed do-not-merge/needs-area labels Oct 3, 2025

jackfrancis mentioned this pull request Oct 3, 2025

CAS: move DRA consts go into core #8595

Closed

elmiko reviewed Oct 3, 2025


towca reviewed Oct 7, 2025


jackfrancis added 2 commits October 10, 2025 16:24

test: use custom gpu node config for processor tests

e62968f

refactor test case

5c9c8cc

jackfrancis force-pushed the gpu-processor-test-dra-refactor branch from ebf2c74 to 5c9c8cc Compare October 13, 2025 19:32

k8s-ci-robot added size/XL and removed size/L labels Oct 13, 2025


4 participants
