Add option to restart prowjob when node is terminated (enabled by default) #117

Open

inteon wants to merge 4 commits into main from option_recreate_prowjob_on_termination
Conversation

@inteon (Member) commented Apr 18, 2024

Recreate the prowjob when its node is terminated.
This saves us from having to manually rerun jobs that failed due to a spot-instance shutdown.

The pod statuses that we see on pods terminated by a spot-instance shutdown look like this:

status:
  message: Pod was terminated in response to imminent node shutdown.
  phase: Failed
  reason: Terminated

or

status:
  conditions:
  - message: 'PodGC: node no longer exists'
    reason: DeletionByPodGC
    status: "True"
    type: DisruptionTarget
  phase: Failed

or

status:
  conditions:
  - message: 'GCPControllerManager: node no longer exists'
    reason: DeletionByGCPControllerManager
    status: "True"
    type: DisruptionTarget
  phase: Failed

Most code in this PR is a copy of the existing ErrorOnEviction code, updated for ErrorOnTermination.
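
For illustration, here is a minimal sketch (not the actual PR code) of how a controller could recognize the statuses above before deciding to recreate the pod; the helper name is made up, and the reason strings are taken directly from the examples:

```go
package termination

import corev1 "k8s.io/api/core/v1"

// isTerminatedByNodeShutdown reports whether a failed pod looks like it was
// killed because its (spot) node went away, matching the status shapes above.
// Illustrative helper only; the PR itself extends the existing ErrorOnEviction
// handling in pkg/plank.
func isTerminatedByNodeShutdown(pod *corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodFailed {
		return false
	}
	// Graceful node shutdown: the kubelet sets a top-level reason on the pod.
	if pod.Status.Reason == "Terminated" {
		return true
	}
	// PodGC / GCPControllerManager: a DisruptionTarget condition is recorded.
	for _, cond := range pod.Status.Conditions {
		if cond.Type == "DisruptionTarget" && cond.Status == corev1.ConditionTrue {
			switch cond.Reason {
			case "DeletionByPodGC", "DeletionByGCPControllerManager":
				return true
			}
		}
	}
	return false
}
```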

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 18, 2024
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 18, 2024
netlify bot commented Apr 18, 2024

Deploy Preview for k8s-prow ready!

Latest commit: 9c6dd67
Latest deploy log: https://app.netlify.com/sites/k8s-prow/deploys/66c84689385eeb0008126004
Deploy Preview: https://deploy-preview-117--k8s-prow.netlify.app
@inteon inteon force-pushed the option_recreate_prowjob_on_termination branch from db2fd80 to cb50220 Compare April 18, 2024 13:45
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 18, 2024
@jihoon-seo (Contributor)

Hello @inteon
AFAIK Plank was deprecated.
Since this PR is adding code under 'pkg/plank/',
I wonder whether:

  1. Adding code under 'pkg/plank/' is okay regardless of the deprecation of Plank; or
  2. There is another way to achieve the same goal by adding code to directories other than 'pkg/plank/'.

Please take a look! 😃

@inteon (Member, Author) commented Apr 18, 2024

> Hello @inteon AFAIK Plank was deprecated. […]

The code still seems to be used in prow-controller-manager:

if err := plank.Add(mgr, buildClusterManagers, knownClusters, cfg, opener, o.totURL, o.selector); err != nil {
	logrus.WithError(err).Fatal("Failed to add plank to manager")
}

@inteon inteon force-pushed the option_recreate_prowjob_on_termination branch from cb50220 to da96973 Compare April 20, 2024 11:58
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: inteon
Once this PR has been reviewed and has the lgtm label, please assign petr-muller for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jihoon-seo (Contributor)

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Apr 22, 2024
@inteon (Member, Author) commented May 7, 2024

Info: we use this change on our cert-manager prow cluster https://prow.infra.cert-manager.io/ (see https://github.com/cert-manager/testing/tree/0b5dfa456f691e849bb0b3c40f3e00bd8d607127/images/prow-controller-manager-spot).
It works very well; this change has removed a lot of flaky failures caused by spot-instance shutdowns!

@droslean (Contributor) commented May 7, 2024

> Info: we use this change on our cert-manager prow cluster https://prow.infra.cert-manager.io/ […]

This change seems like a nice feature, but I am not sure you are covering all of the generic pod-termination cases rather than only the GCP-related ones. Also, this issue can be resolved by adding a PodDisruptionBudget to your cluster.

In my opinion, we shouldn't enable this feature by default. Instead, we can configure specific reasons for recreating the prowjobs, which will allow users to be more specific about their infrastructure.

@inteon inteon force-pushed the option_recreate_prowjob_on_termination branch from 4cb9105 to 0bf2380 Compare June 13, 2024 15:10
@droslean (Contributor)

I am not confident about enabling this feature by default. The implementation covers only a GCP case, and a pod's termination is deeply dependent on the infrastructure. This means that, in many cases, a prowjob could end up restarting in a loop forever. Perhaps keep it disabled by default and allow the user to enable it in the global config? Also, to avoid infinite runs, it is probably better to keep track of the number of retries and allow the user to control the threshold.

@petr-muller @stevekuznetsov @cgwalters @BenTheElder Do you guys have any input?

@inteon inteon force-pushed the option_recreate_prowjob_on_termination branch from 0bf2380 to e416d6f Compare June 14, 2024 15:30
@inteon (Member, Author) commented Jun 14, 2024

> I am not confident about enabling this feature by default. […] Do you guys have any input?

I updated the PR and added a RetryCount, which is now incremented every time the pod is re-created (it also counts other retries that were already present in the code). There will be a hard failure once 3 retries have been reached (we might want to make this configurable in the future).
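
For reference, a rough sketch of the retry bookkeeping described here (the field mirrors the RetryCount this PR adds to the ProwJob status; the helper name and the hardcoded limit of 3 are illustrative, not the PR's exact code):

```go
package termination

// maxRetries is the hard limit after which the pod is no longer recreated.
// The PR currently hardcodes 3; making it configurable is discussed below.
const maxRetries = 3

// prowJobStatus mirrors only the field relevant to this sketch; the real
// status type lives in the prowjobs v1 API package.
type prowJobStatus struct {
	RetryCount int
}

// shouldRecreatePod decides whether a pod lost to eviction, node loss, or a
// spot-instance shutdown should be recreated, bumping the counter if so.
func shouldRecreatePod(status *prowJobStatus) bool {
	if status.RetryCount >= maxRetries {
		return false // hard failure: the ProwJob is marked as errored instead
	}
	status.RetryCount++
	return true
}
```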

@inteon inteon force-pushed the option_recreate_prowjob_on_termination branch from e416d6f to 6a85753 Compare June 17, 2024 11:53
@inteon (Member, Author) commented Jul 1, 2024

Blocked by #196 since the CRD is too large for my setup.

@stevekuznetsov (Contributor)

@droslean I don't think PDB would help here since you don't get a choice as to when spot instances get taken away from you.

@droslean (Contributor) commented Jul 1, 2024

> @droslean I don't think PDB would help here since you don't get a choice as to when spot instances get taken away from you.

Yep. My only concern is whether we should allow this feature to be enabled by default. Since Prow doesn't directly know what the job will do, there can be cases where the job costs will triple if we allow this by default. I would prefer to make this configurable and let the user decide based on their infrastructure.

// On GCP, before a new spot instance is started, the old pods are garbage
// collected (if they have not been already by the Kubernetes PodGC):
// https://github.com/kubernetes/cloud-provider-gcp/blob/25e5dcc715781316bc5e39f8b17c0d5b313453f7/cmd/gcp-controller-manager/node_csr_approver.go#L1035-L1058
if condition.Reason == "DeletionByGCPControllerManager" {
Contributor (inline review comment):

Since this is GCP related only, why not make the reasons configurable too, to allow the user to add more cases based on the infrastructure?

Member Author (inline reply):

I made the settings configurable: 648600a

@inteon (Member, Author) commented Jul 1, 2024

> Yep. My only concern is whether we should allow this feature to be enabled by default. […]

I discovered there is already a lot of logic to restart pods (and no limit): eviction, NodeUnreachable, and the PodUnknown phase. So, I added a global limit of 3 and applied it to all restarts, including the existing logic. I haven't (yet) made the restarts on a spot-instance shutdown disabled by default, because I think the retryCount limit is a better solution.

@droslean (Contributor) commented Jul 1, 2024

> I discovered there is already a lot of logic to restart pods (and no limit) […] I think the retryCount limit is a better solution.

I would prefer to let the user decide the list of reasons to restart the prowjob and the retry-count limit. For example, in my infrastructure we run costly jobs, and this feature can potentially increase the cost since it reruns them by default for those specific reasons. Your solution is good, but I would prefer to make it configurable so the user won't be limited to hardcoded termination reasons and retry limits. @stevekuznetsov WDYT?

@stevekuznetsov (Contributor)

Getting the config semantics right might be hard, but I'm for it.
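
As a sketch of what such a config could look like (the type and field names here are hypothetical, not the schema this PR ships): a block listing the termination reasons that trigger recreation plus a retry cap, disabled unless explicitly turned on.

```go
package termination

// RecreateOnTermination is a hypothetical config block illustrating the
// "user-chosen reasons plus retry limit" idea from this thread; it is not
// the actual Prow configuration schema.
type RecreateOnTermination struct {
	// Enabled turns pod recreation on node termination on or off.
	Enabled bool `json:"enabled,omitempty"`
	// Reasons lists pod status/condition reasons that should trigger a
	// recreation, e.g. "Terminated", "DeletionByPodGC",
	// "DeletionByGCPControllerManager".
	Reasons []string `json:"reasons,omitempty"`
	// MaxRetries caps how many times a single ProwJob's pod is recreated.
	MaxRetries int `json:"max_retries,omitempty"`
}
```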

@inteon inteon force-pushed the option_recreate_prowjob_on_termination branch from 648600a to 5cc1fc8 Compare July 1, 2024 16:15
@BenTheElder (Member)

Yeah, I share some concerns about the restart implications.

For example, with 5k-node scale tests, we may prefer to simply take the failure and leave boskos to clean up rather than attempt to start another run, and yet with the many, many CI jobs we have, it would be difficult to properly identify and opt out all of the relevant jobs.

cc @aojea @pohly

Also, even as a GCP employee, I think we should prefer to use portable Kubernetes, but I guess this is at least somewhat more portable now ... do any of the other major vendors with spot instances set a similar status that can be matched or do we need a different mechanism entirely?

What's the use case where the next periodic attempt and/or /retest is not sufficient?
Are you using the automatic /retest commenter? It's clunky but it has done the job ~OK.

@BenTheElder (Member)

I suspect for most jobs this is better, if bounded, but it's still a surprising behavior change and there's limited bandwidth to go comb through jobs and opt them in/out.

For Kubernetes we could probably opt out anything referencing the boskos scalability pools and be in OK shape.

I don't think we want another best-practice field that everyone has to opt into either, though ... (like decorate: true, which should ideally not be needed eventually)

What if we have a global option to enable this, in addition to a per-job opt-out? We could wait to turn this on until we've opted out any jobs with cost concerns.

I'm also not sure that having a single retryCount is the best approach, but I haven't put much thought into it yet. E.g., failure to schedule is pretty distinct from the other modes (though I think we already handle that separately?)

@inteon inteon force-pushed the option_recreate_prowjob_on_termination branch from 5cc1fc8 to 0695225 Compare July 9, 2024 19:36
@inteon inteon force-pushed the option_recreate_prowjob_on_termination branch from 0695225 to 9c6dd67 Compare August 23, 2024 08:21
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
6 participants