[EKS] [request]: Use Dynamic Resource Allocation for EKS #2314

junhwani · 2024-03-26T02:47:16Z

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
According to https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/, Kubernetes v1.29 includes cluster-level API support for dynamic resource allocation (DRA).
For the benefit of customers using EKS, I would like some way to enable dynamic resource allocation (DRA) on EKS.

Which service(s) is this request for?
EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
It is difficult to add the --feature-gates setting because it has to be set on the kube-apiserver, kube-scheduler, kube-controller-manager and kubelet.

Additional context
I think many customers want to allocate resources appropriately, so I would appreciate it if this could be made available soon.

Attachments
How to enable dynamic resource allocation(DRA)


dims · 2024-11-07T20:22:48Z

Kubernetes code freeze for 1.32 is tomorrow (Friday 8th November 2024). As of #127511, the DRA feature gate is as follows:

	DynamicResourceAllocation: {
		{Version: version.MustParse("1.26"), Default: false, PreRelease: featuregate.Alpha},
		{Version: version.MustParse("1.32"), Default: false, PreRelease: featuregate.Beta},
	}

Note that the default is false for 1.32.
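To make the snippet above concrete, here is a minimal Go sketch of how a versioned gate like this can be read. It assumes the entry with the highest version not exceeding the cluster's version is the one that applies. The types and names (`versionedSpec`, `specAt`, `draSpecs`) are hypothetical stand-ins for illustration, not the real Kubernetes `featuregate` package.

```go
package main

import "fmt"

// stage is a hypothetical stand-in for the PreRelease field in the snippet above.
type stage string

const (
	alpha stage = "ALPHA"
	beta  stage = "BETA"
)

// versionedSpec is a simplified, hypothetical model of one versioned entry.
type versionedSpec struct {
	minor      int   // minor version of 1.x at which this entry takes effect
	defaultOn  bool  // value of Default
	preRelease stage // value of PreRelease
}

// draSpecs copies the two entries quoted above; it must stay sorted by minor.
var draSpecs = []versionedSpec{
	{minor: 26, defaultOn: false, preRelease: alpha},
	{minor: 32, defaultOn: false, preRelease: beta},
}

// specAt returns the last entry whose version is <= minor.
// ok is false when the gate did not exist yet at that version.
func specAt(specs []versionedSpec, minor int) (spec versionedSpec, ok bool) {
	for _, s := range specs {
		if s.minor <= minor {
			spec, ok = s, true
		}
	}
	return spec, ok
}

func main() {
	for _, m := range []int{25, 29, 32} {
		if s, ok := specAt(draSpecs, m); ok {
			fmt.Printf("1.%d: stage=%s default=%t\n", m, s.preRelease, s.defaultOn)
		} else {
			fmt.Printf("1.%d: gate not present\n", m)
		}
	}
}
```

Under this reading, a 1.32 cluster sees the gate as beta but still off unless someone turns it on explicitly.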

xref: #512

toVersus · 2024-11-10T09:01:59Z

Even if both the FeatureGate and API for Dynamic Resource Allocation (DRA) are enabled, support for the beta API from DRA GPU drivers like NVIDIA and Intel is still required. This is because there are some changes from the alpha API, making it impractical for users to use the alpha API at this stage. This support will likely be implemented as soon as Kubernetes 1.32 is released. However, there are several KEPs derived from DRA that have been proposed, so it cannot yet be considered stable.

In particular, when using the current DRA for cases that involve partitioning devices like MIGs, it deviates from the original "Dynamic" meaning of DRA.

> Note: This partitioning is static. Dynamically reconfiguring a card to match demand is not part of this KEP. It's covered by the "partitionable devices" extension.

Furthermore, DRA is not yet integrated with Cluster Autoscaler and Karpenter, which makes it hard to use in production.

aleksy-zalenski · 2024-11-13T12:31:28Z

Do we have any updates regarding that? Is it possible to use the Dynamic Resource Allocation (DRA) on EKS with version 1.30 or 1.31?

pohly · 2024-11-13T13:18:16Z

Let's clarify which part of DRA this feature is about. In my opinion, it should focus on enabling the v1beta1 API in 1.32. Enabling older alpha APIs in past Kubernetes releases makes no sense anymore. Enabling alpha features in 1.32 and beyond might make sense, but is probably too risky because they are alpha.
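For context, on a self-managed 1.32 cluster, enabling the beta API would roughly mean turning on the `DynamicResourceAllocation` feature gate on all four components and the `resource.k8s.io/v1beta1` API group on the kube-apiserver. The Go sketch below only assembles those flag strings per component. The helper name `componentFlags` is made up for illustration, and it assumes flags are passed on the command line (the kubelet is often configured through its config file instead). On EKS users cannot set the control-plane flags themselves, which is what this request is about.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const (
	featureGate     = "DynamicResourceAllocation"
	apiGroupVersion = "resource.k8s.io/v1beta1"
)

// componentFlags is a hypothetical helper that returns, per component,
// the flags assumed necessary to enable beta DRA in 1.32.
func componentFlags() map[string][]string {
	gate := "--feature-gates=" + featureGate + "=true"
	return map[string][]string{
		// Only the API server also needs the API group switched on.
		"kube-apiserver":          {gate, "--runtime-config=" + apiGroupVersion + "=true"},
		"kube-controller-manager": {gate},
		"kube-scheduler":          {gate},
		"kubelet":                 {gate},
	}
}

func main() {
	flags := componentFlags()
	names := make([]string, 0, len(flags))
	for name := range flags {
		names = append(names, name)
	}
	sort.Strings(names) // stable output order
	for _, name := range names {
		fmt.Println(name, strings.Join(flags[name], " "))
	}
}
```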

> Even if both the FeatureGate and API for Dynamic Resource Allocation (DRA) are enabled, support for the beta API from DRA GPU drivers like NVIDIA and Intel is still required.

Those will come shortly after the 1.32 release. We need the new v1beta1 API to be released before the drivers can be updated.

> However, there are several KEPs derived from DRA that have been proposed, so it cannot yet be considered stable.

What got promoted to beta is the core DRA with structured parameters. Promotion to beta means:

- The API is guaranteed to remain available for several releases until it either gets promoted to GA or superseded by another beta. In both cases, the v1beta1 remains available, which makes it safe to rely on it.
- We will fix whatever bugs are found in the implementation and backport those fixes.

I think core DRA is now stable, even if work on additional features continues.

sftim · 2024-11-13T13:39:07Z

I don't recommend enabling DRA by default [in EKS] until it's been beta for at least one minor release.
[EKS] [request]: Use Dynamic Resource Allocation for EKS #2314 (comment)

sftim · 2024-11-13T13:45:25Z

> The API is guaranteed to remain available for several releases until it either gets promoted to GA or superseded by another beta. In both cases, the v1beta1 remains available, which makes it safe to rely on it.

If one day Kubernetes drops the API, clusters that rely on the behavior get a breaking change. Beta APIs are rarely dropped but it can happen.

pohly · 2024-11-13T13:56:28Z

According to https://kubernetes.io/docs/reference/using-api/deprecation-policy/, a beta API remains available for three releases, then gets deprecated and removed after another three releases. Removing the API sooner would break those stability guarantees.

Waiting one release after beta graduation makes no difference regarding API availability. The current state (API group off, feature gate off) won't change. Some additional, future bugs might be fixed, but it's also possible that those bugs won't be found unless it gets enabled.

pohly · 2024-11-13T14:09:58Z

> a beta API remains available for three releases

Or perhaps more precisely, can remain available that long. It may get replaced or removed sooner, but then the "must remain available for three releases" kicks in.

We've been very careful with the API design of DRA. I don't see a reason why it should get replaced by a v1beta2. I also don't think it's likely that it gets removed outright.
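The timeline described in the last two comments can be written as simple arithmetic. This Go sketch is a simplified reading that counts only minor releases and ignores the time-based alternative in the deprecation policy. The function names are hypothetical, and 1.32 as the introduction release comes from the discussion above.

```go
package main

import "fmt"

// releaseWindow is the "three releases" figure quoted above.
// Assumption: the policy's time-based bound is ignored in this sketch.
const releaseWindow = 3

// latestDeprecation returns the minor release by which a beta API
// introduced in introMinor has to be deprecated at the latest.
func latestDeprecation(introMinor int) int {
	return introMinor + releaseWindow
}

// earliestRemoval returns the first minor release in which a beta API
// deprecated in deprecatedMinor may stop being served.
func earliestRemoval(deprecatedMinor int) int {
	return deprecatedMinor + releaseWindow
}

func main() {
	intro := 32 // v1beta1 is introduced in 1.32
	dep := latestDeprecation(intro)
	fmt.Printf("introduced in 1.%d, deprecated by 1.%d at the latest\n", intro, dep)
	fmt.Printf("if deprecated in 1.%d, may stop being served no earlier than 1.%d\n",
		dep, earliestRemoval(dep))

	// If the API were replaced sooner, the three-release window would
	// start counting from that earlier deprecation instead.
	early := intro + 1
	fmt.Printf("if deprecated in 1.%d, may stop being served no earlier than 1.%d\n",
		early, earliestRemoval(early))
}
```

Either way, waiting one extra release before enabling the API does not change how long it stays available once it is deprecated.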

junhwani · 2024-11-14T00:03:48Z

If DRA is not stable to use, is the best way to use HPA and Karpenter now?

If there is another good way, please recommend it.

toVersus · 2024-11-14T00:32:51Z

@pohly, thanks for chiming in on this discussion! I really appreciate everything you guys have done so far. My main concern is the same one you all discussed in kubernetes/kubernetes#127511 (comment): EKS doesn't offer a feature that lets users opt in to the FeatureGate / Beta API.

If it’s going to be enabled by default on all clusters starting with EKS 1.32, I’m against it. I think @sftim is suggesting waiting for one minor version because there’s always a chance of unexpected impacts on existing workloads. It’s fine if people who want to use it have the option, but DRA is a game changer for Kubernetes with a broad scope of changes, which makes it feel a bit risky for users who need a stable production environment.

> If DRA is not stable to use, is the best way to use HPA and Karpenter now?

I think it would be better to explain the challenges in more detail.

pohly · 2024-11-14T03:04:14Z

> there’s always a chance of unexpected impacts on existing workloads.

That's the key question regarding "enabled by default". My two cents: I see the risk as pretty low, because kube-scheduler, kube-controller-manager and kubelet pretty much don't do anything related to DRA when the feature isn't used.

junhwani added the Proposed Community submitted issue label Mar 26, 2024

mikestef9 added the EKS Amazon Elastic Kubernetes Service label Mar 26, 2024

sidewinder12s mentioned this issue May 3, 2024

[EKS] [request]: EKS Alpha Clusters #2348

Open
