
chore: stop adding spot requirement in consolidation #1649

Closed

Conversation


@leoryu leoryu commented Sep 10, 2024

Fixes #1605

Description

Karpenter core should not add the spot capacity type to the replacement nodeclaim during consolidation. Which capacity type the node is replaced with should be controlled by the cloud provider.
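For illustration, the behavior being changed can be sketched in Go. The type and function names below are hypothetical stand-ins, not Karpenter's real scheduling API (the actual requirement types live in `pkg/scheduling` and are more involved):

```go
package main

import "fmt"

// capacityTypeLabel matches the well-known Karpenter label key.
const capacityTypeLabel = "karpenter.sh/capacity-type"

// Requirements is a simplified stand-in: label key -> allowed values.
type Requirements map[string][]string

// withSpotOnly mimics the behavior this PR removes: consolidation
// forcing the replacement nodeclaim to spot, regardless of whether the
// provider actually has spot capacity available.
func withSpotOnly(reqs Requirements) Requirements {
	out := Requirements{}
	for k, v := range reqs {
		out[k] = v
	}
	out[capacityTypeLabel] = []string{"spot"}
	return out
}

// withoutForcedCapacityType leaves the capacity type open, so the cloud
// provider can choose spot or on-demand based on real availability.
func withoutForcedCapacityType(reqs Requirements) Requirements {
	out := Requirements{}
	for k, v := range reqs {
		out[k] = v
	}
	return out
}

func main() {
	base := Requirements{capacityTypeLabel: {"spot", "on-demand"}}
	fmt.Println(withSpotOnly(base)[capacityTypeLabel])              // [spot]
	fmt.Println(withoutForcedCapacityType(base)[capacityTypeLabel]) // [spot on-demand]
}
```

With the forced requirement removed, the nodeclaim keeps both capacity types and the provider decides which one to launch.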

How was this change tested?

I hacked the provider code to mark all spot capacity as unavailable, simulating the case where all spot machines are sold out, and triggered a consolidation by lowering the CPU requirement in the nodepool. The nodeclaim was replaced by a smaller, cheaper on-demand machine, as expected.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


linux-foundation-easycla bot commented Sep 10, 2024

CLA Signed


The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: leoryu
Once this PR has been reviewed and has the lgtm label, please assign ellistarn for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @leoryu!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Sep 10, 2024
@k8s-ci-robot
Contributor

Hi @leoryu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Sep 10, 2024
@coveralls

Pull Request Test Coverage Report for Build 10789870773

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 5 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.02%) to 80.552%

Files with Coverage Reduction          New Missed Lines    %
pkg/cloudprovider/types.go             1                   89.47%
pkg/test/expectations/expectations.go  2                   94.73%
pkg/scheduling/requirements.go         2                   98.01%
Totals Coverage Status
Change from base Build 10729017085: -0.02%
Covered Lines: 8379
Relevant Lines: 10402

💛 - Coveralls

@njtran
Contributor

njtran commented Sep 20, 2024

Why is this a chore to you rather than a feature request? This would be a very very large change to our scheduling algorithm, as it's an invariant that spot prices will always be cheaper than on-demand prices. Obviously this may change between cloud providers. How is it impacting you?

@leoryu
Author

leoryu commented Sep 23, 2024

Why is this a chore to you rather than a feature request? This would be a very very large change to our scheduling algorithm, as it's an invariant that spot prices will always be cheaper than on-demand prices. Obviously this may change between cloud providers. How is it impacting you?

@njtran Suppose the current nodeclaim is an 8C/16G on-demand instance type, and there is a cheaper 2C/4G on-demand instance type (with no spot available) that meets all the pods' requirements. In this case Karpenter triggers the consolidation, but the replacement nodeclaim is forced to spot:

{"level":"INFO","time":"2024-09-23T06:27:22.627Z","logger":"controller","message":"disrupting nodeclaim(s) via replace, terminating 1 nodes (1 pods) {node-name}/{instance-type}/on-demand and replacing with spot node from types {instance-type},{instance-type}, {instance-type}","controller":"disruption","namespace":"","name":"","reconcileID":"4de5441d-b9a7-4089-afe0-7762b5b640cc","command-id":"6f8ef575-d844-410a-a4e0-709af27d36d8","reason":"underutilized"}

The controller will get an error:

{"level":"ERROR","time":"2024-09-23T06:27:00.080Z","logger":"controller","message":"failed launching nodeclaim","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"test-njk7n"},"namespace":"","name":"test-njk7n","reconcileID":"333a7dec-4354-4076-9015-6db7eb5f69bf","error":"insufficient capacity, all requested instance types were unavailable during launch"}

Preferring spot makes sense, but Karpenter should consider whether spot capacity is actually available.
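The availability-aware selection argued for here can be sketched as follows. The `Offering` struct and field names are illustrative only; Karpenter's real cloud provider offering model differs:

```go
package main

import (
	"fmt"
	"sort"
)

// Offering is a simplified stand-in for a provider's instance offering.
type Offering struct {
	InstanceType string
	CapacityType string // "spot" or "on-demand"
	Price        float64
	Available    bool
}

// cheapestAvailable picks the lowest-priced offering the provider reports
// as actually launchable, instead of unconditionally preferring spot.
func cheapestAvailable(offerings []Offering) (Offering, bool) {
	sort.Slice(offerings, func(i, j int) bool {
		return offerings[i].Price < offerings[j].Price
	})
	for _, o := range offerings {
		if o.Available {
			return o, true
		}
	}
	return Offering{}, false
}

func main() {
	offerings := []Offering{
		{"2c4g", "spot", 0.02, false}, // spot is sold out
		{"2c4g", "on-demand", 0.06, true},
		{"8c16g", "on-demand", 0.24, true},
	}
	if o, ok := cheapestAvailable(offerings); ok {
		fmt.Printf("%s/%s\n", o.InstanceType, o.CapacityType) // 2c4g/on-demand
	}
}
```

Here the spot offering is cheapest but unavailable, so the selection falls back to the cheaper on-demand type rather than failing at launch with an insufficient-capacity error.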


github-actions bot commented Oct 7, 2024

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 7, 2024
@leoryu
Author

leoryu commented Oct 8, 2024


Hi, please check this comment at your convenience. @njtran

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 8, 2024

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 22, 2024
@github-actions github-actions bot closed this Nov 6, 2024
Successfully merging this pull request may close these issues.

Consolidation with spot by default is not appropriate