Conversation

jackfrancis
Contributor

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR removes a reference to the GCE cloud provider from the gpu processor test library. As an exercise to better understand the current unit tests, I did a "verbose refactor" and then added an additional DRA test that walks through the basic DRA driver filtering outcomes.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the release-note-none, kind/cleanup, cncf-cla: yes, and do-not-merge/needs-area labels on Oct 3, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved and area/cluster-autoscaler labels on Oct 3, 2025
@k8s-ci-robot added the size/L label and removed the do-not-merge/needs-area label on Oct 3, 2025
@jackfrancis
Contributor Author

@elmiko this should be the solution to (finally) unblock #8583

Don't worry about grokking all of the test foo. cc @towca @BigDarkClown for that part (🙏)

cc @sbueringer in case you're tracking #8583

Contributor

@elmiko elmiko left a comment

i don't know this test overly well but this is making sense to me.

expectedReadiness[nodeDirectXReady.Name] = true

nodeDirectXUnready := &apiv1.Node{
// Here we add a vanilla NotReady node (no GPU or other device labels or status conditions)
Collaborator

This is more to ensure that our filter function doesn't affect no-GPU Nodes at all (same for the "ready vanilla" case).

},
// TestFilterOutNodesWithUnreadyResourcesDRA tests that FilterOutNodesWithUnreadyResources
// does the right thing based on DRA configuration present in the node.
func TestFilterOutNodesWithUnreadyResourcesDRA(t *testing.T) {
Collaborator

Oof, this grew quite complex 😅. Your intuitions about the test behavior are mostly correct, but if we're refactoring, IMO rewriting this into the table-based approach would be much clearer.

For example, we could have a test case like this:

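type testCase struct {
	// All maps are keyed by node name.
	allNodes   map[string]*apiv1.Node // every node passed to the filter
	readyNodes map[string]*apiv1.Node // the subset passed in as ready
	// Nodes the filter is expected to overwrite to not-Ready.
	wantNodesWithUnreadyOverride map[string]*apiv1.Node
}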

Test code would be something like:

gotAllNodes, gotReadyNodes := processor.FilterOutNodesWithUnreadyResources(ctx, toList(tc.allNodes), toList(tc.readyNodes), nil)
gotAllNodesSet, gotReadyNodesSet := toSet(gotAllNodes), toSet(gotReadyNodes) // Keyed by node name

assert that gotAllNodesSet has the same keys as tc.allNodes
assert that gotReadyNodesSet has the same keys as (tc.readyNodes - tc.wantNodesWithUnreadyOverride)

for each node in gotReadyNodesSet:
  assert that the node is identical to tc.readyNodes[node.Name]

for each node in gotAllNodesSet:
  if the node is not in tc.wantNodesWithUnreadyOverride, assert that it's identical to tc.allNodes[node.Name]
  else assert that the node has the expected not-Ready condition (and that there's only 1 condition for readiness), and is otherwise identical to tc.allNodes[node.Name]
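
To make the assertion steps above concrete, here is a minimal Go sketch of the `toSet`/`toList` helpers and the comparison logic. It is only an illustration under assumptions: `Node` is a hypothetical stand-in for `apiv1.Node` (readiness is collapsed into a single `Ready` bool instead of a conditions list), and `checkResult` is a made-up helper name, not something from the PR.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// Node is a hypothetical, reduced stand-in for apiv1.Node.
type Node struct {
	Name   string
	Labels map[string]string
	Ready  bool // stand-in for the Ready condition
}

// toSet keys a node list by node name.
func toSet(nodes []*Node) map[string]*Node {
	set := make(map[string]*Node, len(nodes))
	for _, n := range nodes {
		set[n.Name] = n
	}
	return set
}

// toList flattens a set into a list sorted by name, so runs are deterministic.
func toList(set map[string]*Node) []*Node {
	names := make([]string, 0, len(set))
	for name := range set {
		names = append(names, name)
	}
	sort.Strings(names)
	list := make([]*Node, 0, len(names))
	for _, name := range names {
		list = append(list, set[name])
	}
	return list
}

// checkResult returns an error describing the first violated expectation.
func checkResult(allNodes, readyNodes, wantOverride, gotAll, gotReady map[string]*Node) error {
	// gotAll must have exactly the keys of allNodes.
	if len(gotAll) != len(allNodes) {
		return fmt.Errorf("all nodes: got %d, want %d", len(gotAll), len(allNodes))
	}
	for name, got := range gotAll {
		want, ok := allNodes[name]
		if !ok {
			return fmt.Errorf("unexpected node %q in all nodes", name)
		}
		expected := *want
		if _, overridden := wantOverride[name]; overridden {
			// Overridden nodes must be not-Ready and otherwise unchanged.
			expected.Ready = false
		}
		if !reflect.DeepEqual(*got, expected) {
			return fmt.Errorf("node %q in all nodes differs from expectation", name)
		}
	}
	// gotReady must have exactly the keys of readyNodes minus wantOverride.
	wantReadyCount := 0
	for name := range readyNodes {
		if _, overridden := wantOverride[name]; !overridden {
			wantReadyCount++
		}
	}
	if len(gotReady) != wantReadyCount {
		return fmt.Errorf("ready nodes: got %d, want %d", len(gotReady), wantReadyCount)
	}
	for name, got := range gotReady {
		want, ok := readyNodes[name]
		if _, overridden := wantOverride[name]; !ok || overridden {
			return fmt.Errorf("unexpected node %q in ready nodes", name)
		}
		if !reflect.DeepEqual(got, want) {
			return fmt.Errorf("ready node %q was modified", name)
		}
	}
	return nil
}
```

In a real test, the two `got` maps would come from `toSet` applied to the results of `FilterOutNodesWithUnreadyResources`, and a non-nil error would be reported through `t.Errorf` inside a `t.Run` subtest per case.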

And we'd have these test cases based on the current ones:

  1. GPU label present, condition ready, nvidia.com/gpu resource 0 -> overwritten to unready
  2. GPU label present, condition ready, nvidia.com/gpu resource 1 -> no overwrites
  3., 4. -> same as 1. and 2. but for the directX resource
  5. GPU label present, condition ready, no nvidia/directX resource -> overwritten to unready
  6. No GPU label, condition ready, no nvidia/directX resource -> no overwrites
  7. No GPU label, condition unready, no nvidia/directX resource -> no overwrites

And then we can just add an additional case to cover the DRA part:

  1. GPU label present, condition ready, no nvidia/directX resource, GetNodeGpuConfig indicates DRA -> no overwrites
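
As a rough illustration, the cases listed above (including the DRA one) could be written down as a table like the following. Everything here is a sketch: `nodeSpec`, `filterCase`, and `expectOverride` are hypothetical names, the directX resource name is assumed to be `microsoft.com/directx`, and `expectOverride` is only a simplified model of the outcomes listed in this thread (it also assumes that only ready nodes are ever overwritten), not the processor's actual logic.

```go
package main

const (
	resourceNvidiaGPU = "nvidia.com/gpu"
	resourceDirectX   = "microsoft.com/directx" // assumed resource name
)

// nodeSpec is a hypothetical compact description of a test node.
type nodeSpec struct {
	hasGPULabel bool
	ready       bool
	allocatable map[string]int64 // extended resource name -> allocatable count
	usesDRA     bool             // GetNodeGpuConfig would indicate a DRA driver
}

// filterCase pairs a node description with the expected filtering outcome.
type filterCase struct {
	name         string
	node         nodeSpec
	wantOverride bool // true: expected to be overwritten to unready
}

// filterCases returns the cases from the list above, in the same order.
func filterCases() []filterCase {
	return []filterCase{
		{"nvidia resource 0", nodeSpec{hasGPULabel: true, ready: true, allocatable: map[string]int64{resourceNvidiaGPU: 0}}, true},
		{"nvidia resource 1", nodeSpec{hasGPULabel: true, ready: true, allocatable: map[string]int64{resourceNvidiaGPU: 1}}, false},
		{"directX resource 0", nodeSpec{hasGPULabel: true, ready: true, allocatable: map[string]int64{resourceDirectX: 0}}, true},
		{"directX resource 1", nodeSpec{hasGPULabel: true, ready: true, allocatable: map[string]int64{resourceDirectX: 1}}, false},
		{"GPU label, no resource", nodeSpec{hasGPULabel: true, ready: true}, true},
		{"vanilla ready", nodeSpec{ready: true}, false},
		{"vanilla unready", nodeSpec{ready: false}, false},
		{"GPU label, no resource, DRA", nodeSpec{hasGPULabel: true, ready: true, usesDRA: true}, false},
	}
}

// expectOverride is a simplified model derived from the listed cases.
func expectOverride(n nodeSpec) bool {
	if !n.ready || !n.hasGPULabel || n.usesDRA {
		return false
	}
	// Reading a missing key (or a nil map) yields 0, i.e. "no resource".
	for _, r := range []string{resourceNvidiaGPU, resourceDirectX} {
		if n.allocatable[r] > 0 {
			return false
		}
	}
	return true
}
```

Each entry would become one `t.Run` subtest; the model function is only there to sanity-check that the table is internally consistent before wiring it to the real processor.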

As for the test provider setup, IMO it'd make the most sense to set up just 1 provider object, and have the custom GetNodeGpuConfig we're registering behave differently based on the Node it gets. This will keep the setup code and the test cases simple. E.g., we could have the method return different responses based on whether the passed Node has "dra" in its name.
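
A minimal sketch of that single-provider idea follows. The types are hypothetical stand-ins rather than the real `cloudprovider` or test-provider API: `Node`, `GpuConfig`, `testProvider`, the `gpuLabel` value, and the `gpu.example.com` driver name are all assumptions made for illustration.

```go
package main

import "strings"

// gpuLabel is a hypothetical GPU label key used only in this sketch.
const gpuLabel = "example.com/gpu-accelerator"

// Node is a hypothetical, reduced stand-in for apiv1.Node.
type Node struct {
	Name   string
	Labels map[string]string
}

// GpuConfig is a hypothetical stand-in for the provider's GPU config type.
type GpuConfig struct {
	Label                string
	ExtendedResourceName string
	DraDriverName        string
}

// ExposedViaDra reports whether the GPU is exposed through a DRA driver.
func (c *GpuConfig) ExposedViaDra() bool {
	return c.DraDriverName != ""
}

// testProvider holds one pluggable GetNodeGpuConfig implementation.
type testProvider struct {
	getNodeGpuConfig func(*Node) *GpuConfig
}

// GetNodeGpuConfig delegates to the registered function.
func (p *testProvider) GetNodeGpuConfig(n *Node) *GpuConfig {
	return p.getNodeGpuConfig(n)
}

// gpuConfigByName answers differently depending on the node it receives:
// no GPU label -> nil, "dra" in the name -> DRA config, otherwise a
// device-plugin style config with an extended resource name.
func gpuConfigByName(n *Node) *GpuConfig {
	if _, ok := n.Labels[gpuLabel]; !ok {
		return nil
	}
	if strings.Contains(n.Name, "dra") {
		return &GpuConfig{Label: gpuLabel, DraDriverName: "gpu.example.com"}
	}
	return &GpuConfig{Label: gpuLabel, ExtendedResourceName: "nvidia.com/gpu"}
}

// newTestProvider builds the single provider shared by all test cases.
func newTestProvider() *testProvider {
	return &testProvider{getNodeGpuConfig: gpuConfigByName}
}
```

With this shape, a test case opts into the DRA path purely through its node name (for example `gpu-dra-ready`), so no per-case provider setup is needed.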

WDYT?

Contributor Author

ACK, I'll re-write, thanks for the detailed notes on a quick initial attempt!

Contributor Author

@towca see my commit 2 as an attempt to refactor into a single test case (we can decompose the actual cases after we're confident this is the right structure)

thoughts?

@jackfrancis force-pushed the gpu-processor-test-dra-refactor branch from ebf2c74 to 5c9c8cc on October 13, 2025 19:32
@k8s-ci-robot added the size/XL label and removed the size/L label on Oct 13, 2025
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

4 participants