KEP-3953: Node Resource Hot Plug #3955
Conversation
/assign @mrunalp @SergeyKanzhelev @klueska

/cc
PRR shadow:
The PRR looks good for alpha.
Thank you for answering more than you needed.
We still need SIG approval, but the PRR is looking good.

Thank you for the review. I have updated the KEP to address your review comments, and yes, I am looking for a SIG-Node review as well.
> - https://github.com/kubernetes/kubernetes/issues/125579
> - https://github.com/kubernetes/kubernetes/issues/127793
>
> Hence, it is necessary to handle capacity updates gracefully across the cluster, rather than resetting the cluster components to achieve the same outcome.
I think it is worth acknowledging the (slowly?) growing bare-metal user base. There is a lively subset of users running Kubernetes on bare metal, often for critical use cases (an easy example: telcos). On bare metal, restarting a node takes a nontrivial amount of time (minutes on big machines). Restarting the kubelet in these cases causes nontrivial disruptions.
Thank you for pointing this out; I have added it to the KEP.
> #### Story 2
>
> As a Kubernetes Application Developer, I want the kernel to optimize system performance by making better use of local resources when a node is resized, so that my applications run faster with fewer disruptions. This is achieved when there are
In which cases would I, as a Kubernetes Application Developer, do that? This feels to me more like a correction of provisioned resources. It also feels like we are trying to emphasize vertical scalability versus horizontal scalability. Perhaps we want to explore the interaction with in-place pod resize? (I am not sure whether you did below.)
I agree that this story is loosely tied to the Application Developer; I have updated the story to relate to an Application Performance Analyst.
> As a Cluster administrator, I want to resize a Kubernetes node dynamically, so that I can quickly hot plug resources without waiting for new nodes to join the cluster.
I'm still very sympathetic to the proposal and I personally like it, but I still feel this particular angle is not strong enough. Yes, we have bugs. Yes, they are annoying. But echoing the comment above from @thockin, these are bugs we should fix anyway, and these are improvements we should have anyway. A safer and faster kubelet restart is a win even if we implement resource hotplug.
> ### Notes/Constraints/Caveats (Optional)
>
> ### Risks and Mitigations
Excellent point. I think the assumption is that adding resources is purely additive (e.g. CPU IDs don't change; you get more CPU IDs, like appending to a slice, and the existing ones keep their meaning). This should be called out explicitly as an assumption (if it is indeed an assumption).
> 2. Identify Nodes Affected by Hotplug:
>    * By flagging a Node as being impacted by hotplug, the Cluster Autoscaler could revert to a less reliable but more conservative "scale from 0 nodes" logic.
>
> Given that this KEP and autoscaler are inter-related, the above approaches were discussed in the community with relevant stakeholders, and have decided approaching this problem through the former route.
I guess "former" is approach 1? Let's call it out unambiguously ("approaching this problem using approach 1").
Yep, explicitly mentioned it now to avoid confusion.
> is lesser than the initial capacity of the node. This is only to point at the fact that the resources have shrunk on the node and may need attention/intervention.
>
> Once the node has transitioned to the NotReady state, it will be reverted to the ReadyState once when the node's capacity is reconfigured to match or exceed the last valid configuration.
> In this case, valid configuration refers to a state which can either be previous hot-plug capacity or the initial capacity in case there was no history of hotplug.
Would a kubelet restart make the node transition to Ready again?
Yes, it will transition to the Ready state. Although we store the node's initial allocatable values in the Node object, once the kubelet restarts, the initial values become the node's current values.
This is an interim solution until we support hot unplug.
And it is possible that at this point some of the pods get "removed" from the node if they don't fit anymore.
Thank you for handling the hot unplug case. PRR lgtm, rest is up to the SIG. /approve
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: deads2k, Karthik-K-N. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
Co-authored-by: kishen-v <[email protected]>
Good remark. Maybe we should check whether any CPU (IDs) have been removed or the memory of a (NUMA) node has decreased, at least if the Topology Manager and friends have been enabled(?)
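The check suggested above could be sketched roughly as follows. This is a hypothetical illustration, not actual kubelet code: `capacitySnapshot` and `detectShrink` are invented names, and the idea is simply to compare the previous set of online CPU IDs and per-NUMA-node memory against the current one.

```go
package main

import "fmt"

// capacitySnapshot is a hypothetical view of node resources: the set of
// online CPU IDs and memory bytes per NUMA node.
type capacitySnapshot struct {
	CPUs       map[int]bool
	NUMAMemory map[int]uint64
}

// detectShrink reports whether any previously online CPU ID disappeared
// or any NUMA node's memory decreased between two snapshots. A purely
// additive change (new CPU IDs, more memory) is not a shrink.
func detectShrink(old, cur capacitySnapshot) bool {
	for id := range old.CPUs {
		if !cur.CPUs[id] {
			return true // a CPU ID was removed or offlined
		}
	}
	for node, mem := range old.NUMAMemory {
		if cur.NUMAMemory[node] < mem {
			return true // a NUMA node's memory shrank
		}
	}
	return false
}

func main() {
	old := capacitySnapshot{
		CPUs:       map[int]bool{0: true, 1: true},
		NUMAMemory: map[int]uint64{0: 8 << 30},
	}
	cur := capacitySnapshot{
		CPUs:       map[int]bool{0: true, 1: true, 2: true},
		NUMAMemory: map[int]uint64{0: 8 << 30},
	}
	fmt.Println(detectShrink(old, cur)) // false: purely additive change
	cur.CPUs = map[int]bool{0: true, 2: true}
	fmt.Println(detectShrink(old, cur)) // true: CPU 1 disappeared
}
```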
> As a Cluster administrator, I want to resize a Kubernetes node dynamically, so that I can quickly hot plug resources without waiting for new nodes to join the cluster.
Yeah, I think the biggest gap is the timing and mechanism of the kubelet restart to have it work seamlessly. For example, in a cloud environment a node admin can scale up the node, but would need to know when to restart the kubelet; the cloud SDK could do so, but it doesn't necessarily know Kubernetes is running there. On the bare-metal side, if someone goes to a rack and hot plugs something, there is nothing that would react to that. Having the kubelet be reactive to these changes means less manual work.
I do think in principle the kubelet could just be restarted, though. It just wouldn't be seamless.
This is a good point and should be captured in the KEP.
> As the hot-unplug events are not completely handled in this KEP, in such cases, it is imperative to move the node to the NotReady state when the current capacity of the node
> is lesser than the initial capacity of the node. This is only to point at the fact that the resources have shrunk on the node and may need attention/intervention.
>
> Once the node has transitioned to the NotReady state, it will be reverted to the ReadyState once when the node's capacity is reconfigured to match or exceed the last valid configuration.
In case of a hot-unplug event, I'd be curious to understand (and I think we should capture here) what would happen to already running workloads if there are no longer enough resources available to accommodate them.
In the case of already running workloads, if there are not enough resources available to accommodate them post hot-unplug, the workload may underperform, transition to the "Pending" state, or get migrated to a suitable node that meets the workload's resource requirements.
I have updated the same in the KEP as well. Thank you.
Pods that don't "fit" anymore will be removed from the node. AFAIU, currently on a kubelet restart, the kubelet iterates over pods from oldest to newest and kicks out pods that don't fit. I think with this KEP we have the opportunity to improve the heuristics, taking into account pod priority, pod QoS class, etc.
Co-authored-by: kishen-v <[email protected]>
> `min(0, 1000 - (1000*containerMemoryRequest)/initialMemoryCapacity)`
> - Post up-scale any failure in resync of Resource managers may be lead to incorrect or rejected allocation, which can lead to underperformed or rejected workload.
> - To mitigate the risks adequate tests should be added to avoid the scenarios where failure to resync resource managers can occur.
How are we going to add tests for this? It's not clear to me how we'll programmatically change the resources for automated tests right now. We may need to consult with SIG Testing.
Hey @haircommander,
The idea we have right now for unit testing is to mock the sysfs structure and modify the online file. We could structure the tests around the same. For example, flipping /sys/bus/cpu/devices/cpuX/online to 0 or 1 disables or enables the CPU, and the change is expected to be caught by the kubelet during a poll cycle.
Ref:
CPU Hotplug: https://docs.kernel.org/core-api/cpu_hotplug.html#using-cpu-hotplug
Memory Hotplug: https://docs.kernel.org/admin-guide/mm/memory-hotplug.html#onlining-and-offlining-memory-blocks
> - Handling downsize events
>   - Though, there is no support through this KEP to handle an event of node-downsize, it's the onus of the cluster administrator to resize responsibly to avoid disruption as it lies out of the kubernetes realm.
>   - However, in a situation of downsize an error mode is returned by the kubelet and the node is marked as `NotReady`.
If an admin downsizes and then increases resources afterwards, does the node return to Ready, or stay NotReady until it's restarted? I feel it's the latter, but we should probably expand on this.
The current PoC implementation returns the node to the Ready state if the same resources (or more) become available again. The node is put into the NotReady state, but other parts of the kubelet are not aware of the change (e.g. resource managers are not re-initialized). This roughly matches the existing behavior of the kubelet not being aware of changes in node capacity.
Example of resizing a resource's capacity: 10 (Ready) -> 5 (NotReady) -> 20 (Ready) -> 10 (NotReady) -> 15 (NotReady) -> 20 (Ready).