Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate CI to prow (test-infra) #10682

Closed
1 of 2 tasks
VannTen opened this issue Dec 1, 2023 · 15 comments
Closed
1 of 2 tasks

Migrate CI to prow (test-infra) #10682

VannTen opened this issue Dec 1, 2023 · 15 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@VannTen
Copy link
Contributor

VannTen commented Dec 1, 2023

What would you like to be added:
Migrate all or our CI jobs (PR and periodics) to the prow instance managed by the Kubernetes community (prow.k8s.io)

This is motivated by looking at old slack discussions which seems to suggest it's a long term plan to do just that.

Why is this needed:

  • PR author can re-trigger flaky tests with /retest, without re-pushing (which looses /lgtm label)
  • /ok-to-test actually does something
  • We don't manage the CI as much (well, at least not prow. The build cluster itself not so sure) and we can drop the whole Gitlab (I assume it's used only for that)
  • We can use tekton pipelines (but it's thoroughly undocumented 😆 ) (not available on prow.k8s.io)

Potential disadvantages / problems:

  • cost of migration
  • some things are impossible / much harder with prow (I don't think so, but maybe provisioning machines/VM ?) than Gitlab CI
  • we might have to use boskos instead of kubevirt to provision VMs (ongoing discussion)

How can we accomplish this:

From what I understand, prowjobs can use skip_report to essentially run without any effect on the PR, just to test how much they work => so we can mirror the test done in gitlab-ci for a time, ensure that it works, then switch them to required and remove gitlab-ci.

We might do it in stages, aka, repeat that process a few times with a subset of our CI jobs each time.

Opinions ?
I'd be particularly interested in the requirements of kubespray CI, to understand if this is possible at all.

@floryut @MrFreezeex @yankay

Pinging some approvers from https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes-sigs/kubespray/OWNERS:
@Miouge1 @ant31 @mirwan @mattymo @riverzhang

Some unknowns:

  • [] can prow handle "dependency" between jobs ? -> not directly I think we have to use tekton
  • do we need to provide our own cluster or can we use some common test-infra clusters ? -> Clusters have to be community owned. Possibly we (kubespray maintainers) could maintain a community (-> does that mean other projects goes there too ?) cluster on equinix
  • permissions for provisionning jobs -> service account needs access to kubevirt api somehow (if I read the CI right).

/assign

@VannTen VannTen added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 1, 2023
@VannTen
Copy link
Contributor Author

VannTen commented Dec 1, 2023

I also think it would ease #10681

@VannTen
Copy link
Contributor Author

VannTen commented Dec 8, 2023

For those interested, discussion is ongoing at the linked test-infra issue.

TL;DR (as of now):

  • tekton prow integration is not enabled
  • there is no "Bring your own cluster", as in, clusters used by test-infra needs to be community owned (I guess there is some kind of SRE team which should have access ?)
  • we would have to switch the "full matrix tests" to periodics (aka, not on every PR, but up to multiples times per day) -> I think that would be roughly the "deploy-part2"

Latest comment hints to the kubespray team maintaining the equinix cluster and prow scheduling jobs on it, but I'm not sure yet how that influence the previous point.

All in all, the tradeoffs are not as favorable to using test-infra/prow as I thought initially.

@yankay
Copy link
Member

yankay commented Dec 12, 2023

Thanks @VannTen

Another piece of information is that the docker-machine runner runs the molecule test. The docker-machine runner has been deprecated. https://gitlab.com/gitlab-org/gitlab/-/issues/341856

So it would be migrated sooner or later :-)

@VannTen
Copy link
Contributor Author

VannTen commented Dec 12, 2023 via email

@ant31
Copy link
Contributor

ant31 commented Dec 12, 2023

first the total number of active (== scheduled) jobs is limited to 500 by Gitlab.com. => since we have roughly 80 jobs in the pipeline
The gitlab-pipeline would return in error, and could be retried (assuming we fix /retry).

cleanup, and this one tried to do it simultaneously, thus failing the delete

It's unrelated to gitlab, should need to be addressed regardless. Should check it doesn't exist = success

The docker-machine runner

I think it can be moved to a kubernetes-runner like other jobs.

@VannTen
Copy link
Contributor Author

VannTen commented Dec 15, 2023

It's unrelated to gitlab, should need to be addressed regardless. Should check it doesn't exist = success

Yeah. Alternatively I was wondering if we could leverage ownerRefs to rely on k8s garbagecollection (like, the job pods would be the owner of the kubevirts VM ? Need to think about it 🤔 )

first the total number of active (== scheduled) jobs is limited to 500 by Gitlab.com. => since we have roughly 80 jobs in the pipeline

The gitlab-pipeline would return in error, and could be retried (assuming we fix /retry).

The abiity does not change much for that particular problem (it's merely more convenient than amending and force pushing) : as long as the CI is busy with other pipelines, new ones are gonna fail, and you have to wait until the current ones are finished (which can be several hours). If gitlab was willing to create the pipelines and let them wait until they'd be picked by the runner, that would be fine, they would eventually run.
Apart from bypassing Gitlab.com, I don't think we can do much on that.

=> would mean either:

  1. test-infra/prow
  2. tekton based pipelines (I've taken a look after you mentionned that on Slack, and it looks like it would not be that hard. Tekton has a lot of building blocks in the form of Task on the tekton hub, and you have a github events watcher that allows you to trigger stuff conditionnaly on "ChangedFiles" for instance)

@ant31
Copy link
Contributor

ant31 commented Dec 18, 2023

If gitlab was willing to create the pipelines and let them wait until they'd be picked by the runner, that would be fine, they would eventually run.

I wouldn't make it more complex than it is, just try to delete, if it's already deleted then it's a success.

If gitlab was willing to create the pipelines and let them wait until they'd be picked by the runner, that would be fine, they would eventually run.

failfast-ci can do the retry, especially if there error code is explicit it will queue again the job without any manual action, it can even report back to github that the CI is too busy and it will start later.

@VannTen
Copy link
Contributor Author

VannTen commented Dec 18, 2023 via email

@ant31
Copy link
Contributor

ant31 commented Dec 18, 2023

probably small code change, need to parse the error code from gitlab (assuming it's explicit) and then "retry" the job instead of discard. can also always retry like N time and it will be just a setting then

@ant31
Copy link
Contributor

ant31 commented Dec 22, 2023

I've started to refactor failfast-ci and I'll try to include missing features pointed above:

  • /retest on comment from 'owners' if made from a approver/reviewer
  • auto-retry pipeline on Gitlab failures (e.g gitlab is down or too many jobs scheduled)
  • test only when label "ok-to-test" is present

Not sure when I'll be finish as I'm refreshing many parts of the code, but someday :)

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 21, 2024
@VannTen
Copy link
Contributor Author

VannTen commented Mar 21, 2024 via email

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 21, 2024
@ant31
Copy link
Contributor

ant31 commented Oct 29, 2024

@VannTen I think most of the wanted features are now implemeted and it runs much smoother.

I suggest we switch to discuss of improvement over the current CI and closing the topic of moving to prow.

@VannTen
Copy link
Contributor Author

VannTen commented Oct 31, 2024 via email

@k8s-ci-robot
Copy link
Contributor

@VannTen: Closing this issue.

In response to this:

Agreed, let's close this.
I'm might open a meta / pinned issue later to list quirks/flakes, in particular those which requires workflow adjustments from contributors.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.
Projects
None yet
Development

No branches or pull requests

5 participants