Add capacity buffers scale up events #8621

abdelrahman882 · 2025-10-07T16:08:18Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add events to capacity buffers objects when buffer causes scale up or fails to do so.

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

In pod list processor we inject fake pods with capacity buffer fake pod annotation
in status processor we filter out these pods so we don't create events for fake pods

This changes is mainly caching fake pods to buffer objects in step 1. then in step 2 we use that mapping to get the buffer object from fake pods objects and create scale up events for the buffers

Does this PR introduce a user-facing change?

Add capacity buffers scale up events

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

k8s-ci-robot · 2025-10-07T16:08:31Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: abdelrahman882
Once this PR has been reviewed and has the lgtm label, please assign aleksandra-malinowska for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

cluster-autoscaler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cluster-autoscaler/processors/capacitybuffer/scale_up_status_processor.go

jackfrancis · 2025-10-07T17:23:30Z

cluster-autoscaler/processors/status/eventing_scale_up_processor.go

 // relevant events for pods depending on their post scale-up status.
 func (p *EventingScaleUpStatusProcessor) Process(context *context.AutoscalingContext, status *ScaleUpStatus) {
-	consideredNodeGroupsMap := nodeGroupListToMapById(status.ConsideredNodeGroups)
+	consideredNodeGroupsMap := NodeGroupListToMapById(status.ConsideredNodeGroups)


merge conflict pro tip here:

https://github.com/kubernetes/autoscaler/pull/7664/files#diff-dff3f90613cb7ed44657714ac89219e12699ae87dde45b8d91b9fce4fda5321e

(your change here didn't actually conflict w/ the autoscalingCtx *ca_context.AutoscalingContext change but FYI)

jackfrancis · 2025-10-07T17:25:04Z

cluster-autoscaler/processors/capacitybuffer/scale_up_status_processor_test.go

+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	v1alpha1 "k8s.io/autoscaler/cluster-autoscaler/apis/capacitybuffer/autoscaling.x-k8s.io/v1alpha1"
 	"k8s.io/autoscaler/cluster-autoscaler/context"
+	"k8s.io/autoscaler/cluster-autoscaler/processors/nodegroupset"


merge conflict resolution pro tip:

https://github.com/kubernetes/autoscaler/pull/7664/files#diff-2218808dd5a372319540d4ffc872066fbd11bf4b74ef2c786fc33387b8d46c5cR24

git doesn't know how to resolve the surrounding import additions on top of the change to the "k8s.io/autoscaler/cluster-autoscaler/context" import alias change

x13n · 2025-10-16T09:10:31Z

cluster-autoscaler/main.go

+			bufferPodInjector := cbprocessor.NewCapacityBufferPodListProcessor(capacitybufferClient, []string{common.ActiveProvisioningStrategy}, buffersPodsRegistry)
 			podListProcessor = pods.NewCombinedPodListProcessor([]pods.PodListProcessor{bufferPodInjector, podListProcessor})
-			opts.Processors.ScaleUpStatusProcessor = status.NewCombinedScaleUpStatusProcessor([]status.ScaleUpStatusProcessor{cbprocessor.NewFakePodsScaleUpStatusProcessor(), opts.Processors.ScaleUpStatusProcessor})
+			opts.Processors.ScaleUpStatusProcessor = status.NewCombinedScaleUpStatusProcessor([]status.ScaleUpStatusProcessor{cbprocessor.NewFakePodsScaleUpStatusProcessor(buffersPodsRegistry), opts.Processors.ScaleUpStatusProcessor})


nit: Can you break up this line? It is very long.

x13n · 2025-10-16T09:15:41Z

cluster-autoscaler/processors/status/eventing_scale_up_processor.go


-func nodeGroupListToMapById(nodeGroups []cloudprovider.NodeGroup) map[string]cloudprovider.NodeGroup {
+// NodeGroupListToMapById returns a map of node group ID to nonode group
+func NodeGroupListToMapById(nodeGroups []cloudprovider.NodeGroup) map[string]cloudprovider.NodeGroup {


There is nothing specific to eventing processor about this function. WDYT about moving it next to NodeGroup interface? I.e. here:

autoscaler/cluster-autoscaler/cloudprovider/cloud_provider.go

Lines 262 to 263 in 34cfc53

}

x13n · 2025-10-16T09:28:42Z

cluster-autoscaler/processors/capacitybuffer/pod_list_processor.go

+func (p *CapacityBufferPodListProcessor) updateCapacityBufferRegistry(fakePods []*apiv1.Pod, buffer *v1alpha1.CapacityBuffer) {
+	p.buffersRegistry.fakePodsUIDToBuffer = make(map[string]*v1alpha1.CapacityBuffer, len(fakePods))
+	for _, fakePod := range fakePods {
+		if p.buffersRegistry != nil {


This check every iteration doesn't do anything. If p.buffersRegistry == nil, you'll get a panic 2 lines earlier, trying to assign something to p.buffersRegistry.fakePodsUIDToBuffer.
If it is a valid state of CapacityBufferPodListProcessor to have nil there, it should be checked here once, when entering the function. If it is not a valid state, it should be checked NewCapacityBufferPodListProcessor and an error should be returned there.

x13n · 2025-10-16T09:39:33Z

cluster-autoscaler/processors/capacitybuffer/pod_list_processor.go

 	totalFakePods := []*apiv1.Pod{}
 	for _, buffer := range buffers {
 		fakePods := p.provision(buffer)
+		p.updateCapacityBufferRegistry(fakePods, buffer)


This is not a comment to this PR per se, but adding a note since you're modifying this code: why do we need to recalculate all that every loop? This is O(N) in terms of # of fake pods and requires costly deep copying, which will lead to performance issues at large scale. Ideally, CA would only watch for changes to buffers and templates and recalculate the set of fake pods when there is a need for that.

I'm not saying to refactor it all in this PR, but it should be fixed as a follow up.

x13n · 2025-10-16T09:52:11Z

cluster-autoscaler/processors/capacitybuffer/scale_up_status_processor.go

-	filteredPodsSources := make([]T, 0)
-	removedPods := make([]*apiv1.Pod, 0)
+// filterOutFakeUnschedulablePods filters out NoScaleUpInfo for capacity buffers fake pods
+func filterOutFakeUnschedulablePods(noScaleUpInfo []status.NoScaleUpInfo) ([]status.NoScaleUpInfo, []status.NoScaleUpInfo) {


nit: Why duplicate the existing generics based logic?

cluster-autoscaler/processors/capacitybuffer/scale_up_status_processor.go

x13n · 2025-10-16T10:05:02Z

cluster-autoscaler/processors/capacitybuffer/scale_up_status_processor.go

+		}
+
+		for _, bufferInfo := range buffersInfo {
+			context.Recorder.Event(bufferInfo.buffer, apiv1.EventTypeNormal, "NotTriggerScaleUp",


You can use Eventf instead of Event to avoid calling Sprintf directly.

x13n · 2025-10-16T10:19:27Z

cluster-autoscaler/processors/capacitybuffer/scale_up_status_processor_test.go

+				},
+				PodsAwaitEvaluation: []*apiv1.Pod{},
+			},
+			expectedTriggeredScaleUp:    2,


WDYT about changing this from int to a slice of error strings so that it is possible to validate the message content as well?

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area labels Oct 7, 2025

k8s-ci-robot requested review from feiskyer and vadasambar October 7, 2025 16:08

k8s-ci-robot added the area/cluster-autoscaler label Oct 7, 2025

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/needs-area labels Oct 7, 2025

abdelrahman882 force-pushed the buffers-events branch from 1b2a75d to 7dee2b6 Compare October 7, 2025 16:11

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2025

jackfrancis reviewed Oct 7, 2025

View reviewed changes

cluster-autoscaler/processors/capacitybuffer/scale_up_status_processor.go Show resolved Hide resolved

jackfrancis reviewed Oct 7, 2025

View reviewed changes

Add capacity buffers scale up events

469e2c9

abdelrahman882 force-pushed the buffers-events branch from 7dee2b6 to 469e2c9 Compare October 9, 2025 09:01

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 9, 2025

x13n requested changes Oct 16, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add capacity buffers scale up events #8621

Add capacity buffers scale up events #8621

abdelrahman882 commented Oct 7, 2025

Uh oh!

k8s-ci-robot commented Oct 7, 2025

Uh oh!

Uh oh!

jackfrancis Oct 7, 2025

Uh oh!

jackfrancis Oct 7, 2025

Uh oh!

x13n Oct 16, 2025

Uh oh!

x13n Oct 16, 2025

Uh oh!

x13n Oct 16, 2025

Uh oh!

x13n Oct 16, 2025

Uh oh!

x13n Oct 16, 2025

Uh oh!

Uh oh!

x13n Oct 16, 2025

Uh oh!

x13n Oct 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Add capacity buffers scale up events #8621

Are you sure you want to change the base?

Add capacity buffers scale up events #8621

Conversation

abdelrahman882 commented Oct 7, 2025

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Uh oh!

k8s-ci-robot commented Oct 7, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants