
chore: stop adding spot requirement in consolidation #1649

Closed

Conversation


@leoryu leoryu commented Sep 10, 2024

Fixes #1605

Description

Karpenter core should not add the spot capacity type to the replacement nodeclaim during consolidation. Which capacity type the node is replaced with should be controlled by the cloud provider.
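For illustration, the behavior being changed can be sketched in Go. The type and function names below are hypothetical stand-ins, not Karpenter's real scheduling API (the actual requirement types live in `pkg/scheduling` and are more involved):

```go
package main

import "fmt"

// capacityTypeLabel matches the well-known Karpenter label key.
const capacityTypeLabel = "karpenter.sh/capacity-type"

// Requirements is a simplified stand-in: label key -> allowed values.
type Requirements map[string][]string

// withSpotOnly mimics the behavior this PR removes: consolidation
// forcing the replacement nodeclaim to spot, regardless of whether the
// provider actually has spot capacity available.
func withSpotOnly(reqs Requirements) Requirements {
	out := Requirements{}
	for k, v := range reqs {
		out[k] = v
	}
	out[capacityTypeLabel] = []string{"spot"}
	return out
}

// withoutForcedCapacityType leaves the capacity type open, so the cloud
// provider can choose spot or on-demand based on real availability.
func withoutForcedCapacityType(reqs Requirements) Requirements {
	out := Requirements{}
	for k, v := range reqs {
		out[k] = v
	}
	return out
}

func main() {
	base := Requirements{capacityTypeLabel: {"spot", "on-demand"}}
	fmt.Println(withSpotOnly(base)[capacityTypeLabel])              // [spot]
	fmt.Println(withoutForcedCapacityType(base)[capacityTypeLabel]) // [spot on-demand]
}
```

With the forced requirement removed, the nodeclaim keeps both capacity types and the provider decides which one to launch.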

How was this change tested?

I hacked the provider code to mark all spot capacity as unavailable, simulating the case where all spot machines are sold out, and triggered a consolidation by lowering the CPU requirement in the nodepool. The nodeclaim was replaced by a smaller, cheaper on-demand machine, as expected.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


linux-foundation-easycla bot commented Sep 10, 2024

CLA Signed


The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: leoryu
Once this PR has been reviewed and has the lgtm label, please assign ellistarn for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @leoryu!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Sep 10, 2024
@k8s-ci-robot
Contributor

Hi @leoryu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Sep 10, 2024
@coveralls

Pull Request Test Coverage Report for Build 10789870773

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 5 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.02%) to 80.552%

Files with Coverage Reduction          New Missed Lines    %
pkg/cloudprovider/types.go             1                   89.47%
pkg/test/expectations/expectations.go  2                   94.73%
pkg/scheduling/requirements.go         2                   98.01%
Totals Coverage Status
Change from base Build 10729017085: -0.02%
Covered Lines: 8379
Relevant Lines: 10402

💛 - Coveralls

@njtran
Contributor

njtran commented Sep 20, 2024

Why is this a chore to you rather than a feature request? This would be a very very large change to our scheduling algorithm, as it's an invariant that spot prices will always be cheaper than on-demand prices. Obviously this may change between cloud providers. How is it impacting you?

@leoryu
Author

leoryu commented Sep 23, 2024

Why is this a chore to you rather than a feature request? This would be a very very large change to our scheduling algorithm, as it's an invariant that spot prices will always be cheaper than on-demand prices. Obviously this may change between cloud providers. How is it impacting you?

@njtran Suppose the current nodeclaim is an 8C/16G on-demand instance type, and there is a cheaper 2C/4G on-demand instance type (with no spot available) that meets all the pods' requirements. In this case Karpenter triggers the consolidation, but the replacement nodeclaim is forced to spot:

{"level":"INFO","time":"2024-09-23T06:27:22.627Z","logger":"controller","message":"disrupting nodeclaim(s) via replace, terminating 1 nodes (1 pods) {node-name}/{instance-type}/on-demand and replacing with spot node from types {instance-type},{instance-type}, {instance-type}","controller":"disruption","namespace":"","name":"","reconcileID":"4de5441d-b9a7-4089-afe0-7762b5b640cc","command-id":"6f8ef575-d844-410a-a4e0-709af27d36d8","reason":"underutilized"}

The controller will get an error:

{"level":"ERROR","time":"2024-09-23T06:27:00.080Z","logger":"controller","message":"failed launching nodeclaim","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"test-njk7n"},"namespace":"","name":"test-njk7n","reconcileID":"333a7dec-4354-4076-9015-6db7eb5f69bf","error":"insufficient capacity, all requested instance types were unavailable during launch"}

Preferring spot makes sense, but Karpenter should consider whether spot capacity is actually available.
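The availability-aware selection argued for here can be sketched as follows. The `Offering` struct and field names are illustrative only; Karpenter's real cloud provider offering model differs:

```go
package main

import (
	"fmt"
	"sort"
)

// Offering is a simplified stand-in for a provider's instance offering.
type Offering struct {
	InstanceType string
	CapacityType string // "spot" or "on-demand"
	Price        float64
	Available    bool
}

// cheapestAvailable picks the lowest-priced offering the provider reports
// as actually launchable, instead of unconditionally preferring spot.
func cheapestAvailable(offerings []Offering) (Offering, bool) {
	sort.Slice(offerings, func(i, j int) bool {
		return offerings[i].Price < offerings[j].Price
	})
	for _, o := range offerings {
		if o.Available {
			return o, true
		}
	}
	return Offering{}, false
}

func main() {
	offerings := []Offering{
		{"2c4g", "spot", 0.02, false}, // spot is sold out
		{"2c4g", "on-demand", 0.06, true},
		{"8c16g", "on-demand", 0.24, true},
	}
	if o, ok := cheapestAvailable(offerings); ok {
		fmt.Printf("%s/%s\n", o.InstanceType, o.CapacityType) // 2c4g/on-demand
	}
}
```

Here the spot offering is cheapest but unavailable, so the selection falls back to the cheaper on-demand type rather than failing at launch with an insufficient-capacity error.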


github-actions bot commented Oct 7, 2024

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 7, 2024
@leoryu
Author

leoryu commented Oct 8, 2024


Hi, please check this comment at your convenience. @njtran

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 8, 2024

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 22, 2024
@github-actions github-actions bot closed this Nov 6, 2024
Successfully merging this pull request may close these issues.

Consolidation with spot by default is not appropriate