
Conversation

@Karthik-K-N commented Apr 17, 2023

  • One-line PR description: Node Resource Hot Plug
  • Other comments:

@k8s-ci-robot added the cncf-cla: yes, kind/kep, and sig/node labels Apr 17, 2023
@k8s-ci-robot added the size/XL label Apr 17, 2023
@Karthik-K-N mentioned this pull request Apr 17, 2023
@Karthik-K-N changed the title from "Dynamic node resize" to "KEP-3953: Dynamic node resize" Apr 17, 2023
Contributor

bart0sh commented Apr 17, 2023

/assign @mrunalp @SergeyKanzhelev @klueska

Member

kad commented Apr 28, 2023

/cc

@ffromani
Contributor

/cc

@k8s-ci-robot requested a review from ffromani May 18, 2023 07:29
@k8s-ci-robot added the cncf-cla: no label and removed the cncf-cla: yes label May 22, 2023
@k8s-ci-robot added the cncf-cla: yes label and removed the cncf-cla: no label May 23, 2023
@fmuyassarov
Member

/cc

Contributor

@kannon92 left a comment


PRR shadow:

The PRR looks good for alpha.
Thank you for answering more than you needed.

We still need sig approval but the PRR is looking good.

@Karthik-K-N
Author

PRR shadow:

The PRR looks good for alpha. Thank you for answering more than you needed.

We still need sig approval but the PRR is looking good.

Thank you for the review. I have updated the KEP to address your review comments, and yes, I am looking for a SIG Node review as well.

- https://github.com/kubernetes/kubernetes/issues/125579
- https://github.com/kubernetes/kubernetes/issues/127793

Hence, it is necessary to handle capacity updates gracefully across the cluster, rather than resetting the cluster components to achieve the same outcome.
Contributor


I think it is worth acknowledging the (slowly?) growing bare-metal user base. There is a lively subset of users running Kubernetes on bare metal, often for critical use cases (an easy example: telcos). On bare metal, restarting a node takes a nontrivial amount of time (minutes on big machines). Restarting the kubelet in these cases causes nontrivial disruptions.

Author


Thank you for pointing that out, I have added it to the KEP.


#### Story 2

As a Kubernetes Application Developer, I want the kernel to optimize system performance by making better use of local resources when a node is resized, so that my applications run faster with fewer disruptions. This is achieved when there are
Contributor


In which cases would I, as a Kubernetes Application Developer, want to do that? This feels to me more like a correction of provisioned resources, and more like we are trying to emphasize vertical scalability vs. horizontal scalability. Perhaps we want to explore the interaction with in-place pod resize? (Not sure whether you did that below.)

Author


I agree that this story is only loosely tied to the Application Developer; I have updated the story to relate to an Application Performance Analyst instead.

Comment on lines +169 to +180
As a Cluster administrator, I want to resize a Kubernetes node dynamically, so that I can quickly hot plug resources without waiting for new nodes to join the cluster.

Contributor


I'm still very sympathetic to the proposal and I personally like it, but I still feel this particular angle is not strong enough. Yes, we have bugs. Yes, they are annoying. But echoing the above comment from @thockin, these are bugs we should fix anyway and improvements we should have anyway. A safer and faster kubelet restart is a win even if we also implement resource hotplug.


### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations
Contributor


Excellent point. I think the assumption is that adding resources is purely additive (e.g. CPU IDs don't change; you get more CPU IDs as if appending to a slice, and existing ones keep their meaning). This should be called out explicitly as an assumption, though (if it is indeed an assumption).

2. Identify Nodes Affected by Hotplug:
* By flagging a Node as being impacted by hotplug, the Cluster Autoscaler could revert to a less reliable but more conservative "scale from 0 nodes" logic.

Given that this KEP and the autoscaler are inter-related, the above approaches were discussed in the community with the relevant stakeholders, and it was decided to approach this problem through the former route.
Contributor


I guess "former" is appoach 1? let's call it out unambiguously ("approaching this problem using approach 1")

Author


Yep, I have mentioned it explicitly now to avoid confusion.

is less than the initial capacity of the node. This is only to indicate that the resources on the node have shrunk and may need attention/intervention.

Once the node has transitioned to the NotReady state, it will be reverted to the Ready state once the node's capacity is reconfigured to match or exceed the last valid configuration.
In this case, a valid configuration refers to either the previous hot-plug capacity or, if there is no history of hotplug, the initial capacity.
Contributor


Would a kubelet restart make the node transition to Ready again?

Author


Yes, it will transition to the Ready state. Though we store the node's initial allocatable values in the Node object, once the kubelet restarts the initial values will become the node's current values.
This is an interim solution until we support hot-unplug.

Contributor


And it is possible that at this point some of the pods get "removed" from the node if they don't fit anymore.

Contributor

deads2k commented Oct 7, 2025

Thank you for handling the hot unplug case.

PRR lgtm, rest is up to the sig.

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: deads2k, Karthik-K-N
Once this PR has been reviewed and has the lgtm label, please ask for approval from mrunalp. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Co-authored-by: kishen-v <[email protected]>
Contributor

marquiz commented Oct 10, 2025

I think the assumption is that adding resources is purely additive (e.g. CPU IDs don't change; you get more CPU IDs as if appending to a slice, and existing ones keep their meaning). This should be called out explicitly as an assumption, though (if it is indeed an assumption).

Good remark. Maybe we should check whether any CPU (IDs) have been removed or the memory of a (NUMA) node has decreased, at least if the topology manager and friends have been enabled(?)
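
A minimal sketch of what such a shrink check could look like, using hypothetical `nodeResources`/`hasShrunk` names (an illustration only, not part of the KEP or kubelet code):

```go
package hotplug

// nodeResources is a hypothetical snapshot of resources the kubelet discovered:
// the set of online CPU IDs and the memory available per NUMA node.
type nodeResources struct {
	CPUIDs     map[int]bool  // online CPU IDs
	NUMAMemory map[int]int64 // bytes of memory per NUMA node ID
}

// hasShrunk reports whether the change is not purely additive: a previously
// known CPU ID disappeared, or the memory of some NUMA node decreased.
func hasShrunk(old, cur nodeResources) bool {
	for id := range old.CPUIDs {
		if !cur.CPUIDs[id] {
			return true // a CPU ID the resource managers relied on is gone
		}
	}
	for node, mem := range old.NUMAMemory {
		if cur.NUMAMemory[node] < mem {
			return true // a NUMA node lost memory
		}
	}
	return false
}
```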

Comment on lines +169 to +180
As a Cluster administrator, I want to resize a Kubernetes node dynamically, so that I can quickly hot plug resources without waiting for new nodes to join the cluster.

Contributor


Yeah, I think the biggest gap is the timing and mechanism of the kubelet restart to make it work seamlessly. In a cloud environment a node admin can scale up the node, but would need to know when to restart the kubelet; the cloud SDK could do so, but it doesn't necessarily know Kubernetes is running there. On the bare-metal side, if someone goes to a rack and hot plugs something, there is nothing that would react to that. Having the kubelet be reactive to these changes means less manual work.

I do think in principle the kubelet could just be restarted, though. It just wouldn't be seamless.

This is a good point and should be captured in the KEP.

As hot-unplug events are not completely handled in this KEP, it is imperative in such cases to move the node to the NotReady state when the current capacity of the node
is less than the initial capacity of the node. This is only to indicate that the resources on the node have shrunk and may need attention/intervention.

Once the node has transitioned to the NotReady state, it will be reverted to the Ready state once the node's capacity is reconfigured to match or exceed the last valid configuration.
Contributor


In case of a hot-unplug event, I'd be curious to understand (and I think we should capture here) what would happen to already running workloads if there are no longer enough resources available to accommodate them.

Author


For already running workloads, if there are not enough resources available to accommodate them post hot-unplug, the workload may underperform, transition to the "Pending" state, or get migrated to a suitable node that meets the workload's resource requirements.

I have updated the KEP accordingly as well. Thank you.

Contributor


Pods that don't "fit in" anymore will be removed from the node. AFAIU, currently on a kubelet restart, the kubelet iterates over pods from oldest to newest and kicks out the pods that don't fit. I think with this KEP we have the opportunity to improve the heuristics, taking into account pod priority, pod QoS class, etc.
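
For illustration, a rough sketch of one possible victim-ordering heuristic (purely hypothetical, not the KEP's design): sort candidate pods by priority, then QoS class, so the cheapest victims are considered first.

```go
package hotplug

import "sort"

// QOSClass mirrors the Kubernetes QoS classes for this sketch.
type QOSClass int

const (
	BestEffort QOSClass = iota
	Burstable
	Guaranteed
)

// candidatePod is a hypothetical, simplified view of a running pod.
type candidatePod struct {
	Name     string
	Priority int32    // pod .spec.priority
	QOS      QOSClass // derived from requests/limits
}

// evictionOrder sorts pods so that the "cheapest" victims come first:
// lower priority first, then weaker QoS class, then name for determinism.
func evictionOrder(pods []candidatePod) []candidatePod {
	sorted := append([]candidatePod(nil), pods...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Priority != sorted[j].Priority {
			return sorted[i].Priority < sorted[j].Priority
		}
		if sorted[i].QOS != sorted[j].QOS {
			return sorted[i].QOS < sorted[j].QOS
		}
		return sorted[i].Name < sorted[j].Name
	})
	return sorted
}
```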

Co-authored-by: kishen-v <[email protected]>
@k8s-ci-robot added the size/XXL label and removed the size/XL label Oct 15, 2025
`min(0, 1000 - (1000*containerMemoryRequest)/initialMemoryCapacity)`

- Post up-scale, any failure in resyncing the resource managers may lead to incorrect or rejected allocations, which can result in underperforming or rejected workloads.
- To mitigate these risks, adequate tests should be added to guard against the scenarios where a failure to resync the resource managers can occur.
Contributor


How are we going to add tests for this? It's not clear to me how we'll programmatically change the resources for automated tests right now. We may need to consult with SIG Testing?


@kishen-v Oct 17, 2025


Hey @haircommander, 
The idea we have right now for unit testing is to mock the sysfs structure and modify the online file, and we could structure the tests around that. For example, flipping /sys/bus/cpu/devices/cpuX/online between 0 and 1 disables or enables the CPU, and the kubelet is expected to pick this up during a poll cycle.
Ref:
CPU Hotplug: https://docs.kernel.org/core-api/cpu_hotplug.html#using-cpu-hotplug
Memory Hotplug: https://docs.kernel.org/admin-guide/mm/memory-hotplug.html#onlining-and-offlining-memory-blocks
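
A minimal sketch of how such a test could be structured, assuming a hypothetical `discoverOnlineCPUs` helper that reads a (mocked) sysfs root laid out as in the kernel docs above:

```go
package hotplug

import (
	"os"
	"path/filepath"
	"strings"
	"testing"
)

// discoverOnlineCPUs is a hypothetical helper: it walks a (possibly mocked)
// sysfs root and returns the names of CPUs whose "online" file contains "1".
func discoverOnlineCPUs(sysfsRoot string) ([]string, error) {
	dirs, err := filepath.Glob(filepath.Join(sysfsRoot, "bus/cpu/devices/cpu*"))
	if err != nil {
		return nil, err
	}
	var online []string
	for _, d := range dirs {
		data, err := os.ReadFile(filepath.Join(d, "online"))
		if err != nil {
			continue // cpu0 may not expose an "online" file on real systems
		}
		if strings.TrimSpace(string(data)) == "1" {
			online = append(online, filepath.Base(d))
		}
	}
	return online, nil
}

func TestCPUHotplugDetection(t *testing.T) {
	// Build a fake sysfs tree in a temp dir: cpu0 and cpu1, both online.
	root := t.TempDir()
	for _, cpu := range []string{"cpu0", "cpu1"} {
		dir := filepath.Join(root, "bus/cpu/devices", cpu)
		if err := os.MkdirAll(dir, 0o755); err != nil {
			t.Fatal(err)
		}
		if err := os.WriteFile(filepath.Join(dir, "online"), []byte("1\n"), 0o644); err != nil {
			t.Fatal(err)
		}
	}

	got, err := discoverOnlineCPUs(root)
	if err != nil || len(got) != 2 {
		t.Fatalf("expected 2 online CPUs, got %v (err=%v)", got, err)
	}

	// Simulate hot-unplug of cpu1 by flipping its online file to 0.
	if err := os.WriteFile(filepath.Join(root, "bus/cpu/devices/cpu1/online"), []byte("0\n"), 0o644); err != nil {
		t.Fatal(err)
	}
	got, _ = discoverOnlineCPUs(root)
	if len(got) != 1 || got[0] != "cpu0" {
		t.Fatalf("expected only cpu0 online after unplug, got %v", got)
	}
}
```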


- Handling downsize events
- Though there is no support through this KEP for handling a node-downsize event, it is the onus of the cluster administrator to resize responsibly to avoid disruption, as this lies outside the Kubernetes realm.
- However, in a downsize situation an error mode is returned by the kubelet and the node is marked as `NotReady`.
Contributor


If an admin downsizes and then increases resources afterwards, does the node return to Ready or stay NotReady until it's restarted? I feel it's the latter, but we should probably expand on this.


Contributor


The current PoC implementation returns the node to the Ready state if the resources (or more) become available again. The node is put into the NotReady state, but other parts of the kubelet are not aware of the change (e.g. the resource managers are not re-initialized). This roughly matches the existing behavior of the kubelet not being aware of changes in node capacity.

Example of resizing a resource capacity: 10 (Ready) -> 5 (NotReady) -> 20 (Ready) -> 10 (NotReady) -> 15 (NotReady) -> 20 (Ready).
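
For illustration, a tiny sketch (hypothetical names, not the PoC code) of the bookkeeping that would reproduce the sequence above: the node is Ready only while its current capacity matches or exceeds the last capacity it was Ready with.

```go
package hotplug

// capacityTracker illustrates the Ready/NotReady decision described above.
type capacityTracker struct {
	lastValid int64 // last capacity observed while the node was Ready
}

// observe records a new capacity value and reports whether the node
// should be considered Ready.
func (t *capacityTracker) observe(current int64) bool {
	if current >= t.lastValid {
		t.lastValid = current // hot-plug (or recovery): adopt the new capacity
		return true           // Ready
	}
	return false // capacity below the last valid configuration: NotReady
}
```

Feeding 10, 5, 20, 10, 15, 20 into `observe` yields Ready, NotReady, Ready, NotReady, NotReady, Ready, matching the example.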
