Skip to content

Conversation

abdelrahman882
Copy link
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add events to capacity buffers objects when buffer causes scale up or fails to do so.

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

  1. In pod list processor we inject fake pods with capacity buffer fake pod annotation
  2. in status processor we filter out these pods so we don't create events for fake pods

This changes is mainly caching fake pods to buffer objects in step 1. then in step 2 we use that mapping to get the buffer object from fake pods objects and create scale up events for the buffers

Does this PR introduce a user-facing change?

Add capacity buffers scale up events

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area labels Oct 7, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: abdelrahman882
Once this PR has been reviewed and has the lgtm label, please assign aleksandra-malinowska for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/needs-area labels Oct 7, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2025
// relevant events for pods depending on their post scale-up status.
func (p *EventingScaleUpStatusProcessor) Process(context *context.AutoscalingContext, status *ScaleUpStatus) {
consideredNodeGroupsMap := nodeGroupListToMapById(status.ConsideredNodeGroups)
consideredNodeGroupsMap := NodeGroupListToMapById(status.ConsideredNodeGroups)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge conflict pro tip here:

https://github.com/kubernetes/autoscaler/pull/7664/files#diff-dff3f90613cb7ed44657714ac89219e12699ae87dde45b8d91b9fce4fda5321e

(your change here didn't actually conflict w/ the autoscalingCtx *ca_context.AutoscalingContext change but FYI)

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
v1alpha1 "k8s.io/autoscaler/cluster-autoscaler/apis/capacitybuffer/autoscaling.x-k8s.io/v1alpha1"
"k8s.io/autoscaler/cluster-autoscaler/context"
"k8s.io/autoscaler/cluster-autoscaler/processors/nodegroupset"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge conflict resolution pro tip:

https://github.com/kubernetes/autoscaler/pull/7664/files#diff-2218808dd5a372319540d4ffc872066fbd11bf4b74ef2c786fc33387b8d46c5cR24

git doesn't know how to resolve the surrounding import additions on top of the change to the "k8s.io/autoscaler/cluster-autoscaler/context" import alias change

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 9, 2025
bufferPodInjector := cbprocessor.NewCapacityBufferPodListProcessor(capacitybufferClient, []string{common.ActiveProvisioningStrategy}, buffersPodsRegistry)
podListProcessor = pods.NewCombinedPodListProcessor([]pods.PodListProcessor{bufferPodInjector, podListProcessor})
opts.Processors.ScaleUpStatusProcessor = status.NewCombinedScaleUpStatusProcessor([]status.ScaleUpStatusProcessor{cbprocessor.NewFakePodsScaleUpStatusProcessor(), opts.Processors.ScaleUpStatusProcessor})
opts.Processors.ScaleUpStatusProcessor = status.NewCombinedScaleUpStatusProcessor([]status.ScaleUpStatusProcessor{cbprocessor.NewFakePodsScaleUpStatusProcessor(buffersPodsRegistry), opts.Processors.ScaleUpStatusProcessor})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can you break up this line? It is very long.


func nodeGroupListToMapById(nodeGroups []cloudprovider.NodeGroup) map[string]cloudprovider.NodeGroup {
// NodeGroupListToMapById returns a map of node group ID to nonode group
func NodeGroupListToMapById(nodeGroups []cloudprovider.NodeGroup) map[string]cloudprovider.NodeGroup {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is nothing specific to eventing processor about this function. WDYT about moving it next to NodeGroup interface? I.e. here:

func (p *CapacityBufferPodListProcessor) updateCapacityBufferRegistry(fakePods []*apiv1.Pod, buffer *v1alpha1.CapacityBuffer) {
p.buffersRegistry.fakePodsUIDToBuffer = make(map[string]*v1alpha1.CapacityBuffer, len(fakePods))
for _, fakePod := range fakePods {
if p.buffersRegistry != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check every iteration doesn't do anything. If p.buffersRegistry == nil, you'll get a panic 2 lines earlier, trying to assign something to p.buffersRegistry.fakePodsUIDToBuffer.
If it is a valid state of CapacityBufferPodListProcessor to have nil there, it should be checked here once, when entering the function. If it is not a valid state, it should be checked NewCapacityBufferPodListProcessor and an error should be returned there.

totalFakePods := []*apiv1.Pod{}
for _, buffer := range buffers {
fakePods := p.provision(buffer)
p.updateCapacityBufferRegistry(fakePods, buffer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a comment to this PR per se, but adding a note since you're modifying this code: why do we need to recalculate all that every loop? This is O(N) in terms of # of fake pods and requires costly deep copying, which will lead to performance issues at large scale. Ideally, CA would only watch for changes to buffers and templates and recalculate the set of fake pods when there is a need for that.

I'm not saying to refactor it all in this PR, but it should be fixed as a follow up.

filteredPodsSources := make([]T, 0)
removedPods := make([]*apiv1.Pod, 0)
// filterOutFakeUnschedulablePods filters out NoScaleUpInfo for capacity buffers fake pods
func filterOutFakeUnschedulablePods(noScaleUpInfo []status.NoScaleUpInfo) ([]status.NoScaleUpInfo, []status.NoScaleUpInfo) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why duplicate the existing generics based logic?

}

for _, bufferInfo := range buffersInfo {
context.Recorder.Event(bufferInfo.buffer, apiv1.EventTypeNormal, "NotTriggerScaleUp",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use Eventf instead of Event to avoid calling Sprintf directly.

},
PodsAwaitEvaluation: []*apiv1.Pod{},
},
expectedTriggeredScaleUp: 2,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about changing this from int to a slice of error strings so that it is possible to validate the message content as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants