Migrate CI to prow (test-infra) #10682

VannTen · 2023-12-01T09:20:51Z

What would you like to be added:
Migrate all or our CI jobs (PR and periodics) to the prow instance managed by the Kubernetes community (prow.k8s.io)

This is motivated by looking at old slack discussions which seems to suggest it's a long term plan to do just that.

Why is this needed:

PR author can re-trigger flaky tests with /retest, without re-pushing (which looses /lgtm label)
/ok-to-test actually does something
We don't manage the CI as much (well, at least not prow. The build cluster itself not so sure) and we can drop the whole Gitlab (I assume it's used only for that)
~~We can use tekton pipelines (but it's thoroughly undocumented 😆 )~~ (not available on prow.k8s.io)

Potential disadvantages / problems:

cost of migration
some things are impossible / much harder with prow (I don't think so, but maybe provisioning machines/VM ?) than Gitlab CI
we might have to use boskos instead of kubevirt to provision VMs (ongoing discussion)

How can we accomplish this:

From what I understand, prowjobs can use skip_report to essentially run without any effect on the PR, just to test how much they work => so we can mirror the test done in gitlab-ci for a time, ensure that it works, then switch them to required and remove gitlab-ci.

We might do it in stages, aka, repeat that process a few times with a subset of our CI jobs each time.

Opinions ?
I'd be particularly interested in the requirements of kubespray CI, to understand if this is possible at all.

@floryut @MrFreezeex @yankay

Pinging some approvers from https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes-sigs/kubespray/OWNERS:
@Miouge1 @ant31 @mirwan @mattymo @riverzhang

Some unknowns:

~~[] can prow handle "dependency" between jobs ? -> not directly I think we have to use tekton~~
do we need to provide our own cluster or can we use some common test-infra clusters ? -> Clusters have to be community owned. Possibly we (kubespray maintainers) could maintain a community (-> does that mean other projects goes there too ?) cluster on equinix
permissions for provisionning jobs -> service account needs access to kubevirt api somehow (if I read the CI right).

/assign

The text was updated successfully, but these errors were encountered:

VannTen · 2023-12-01T09:21:31Z

I also think it would ease #10681

VannTen · 2023-12-08T15:29:27Z

For those interested, discussion is ongoing at the linked test-infra issue.

TL;DR (as of now):

tekton prow integration is not enabled
there is no "Bring your own cluster", as in, clusters used by test-infra needs to be community owned (I guess there is some kind of SRE team which should have access ?)
we would have to switch the "full matrix tests" to periodics (aka, not on every PR, but up to multiples times per day) -> I think that would be roughly the "deploy-part2"

Latest comment hints to the kubespray team maintaining the equinix cluster and prow scheduling jobs on it, but I'm not sure yet how that influence the previous point.

All in all, the tradeoffs are not as favorable to using test-infra/prow as I thought initially.

yankay · 2023-12-12T05:53:30Z

Thanks @VannTen

Another piece of information is that the docker-machine runner runs the molecule test. The docker-machine runner has been deprecated. https://gitlab.com/gitlab-org/gitlab/-/issues/341856

So it would be migrated sooner or later :-)

VannTen · 2023-12-12T08:43:56Z

Yeah, I had noticed that as well. Yesterday there was quite a lot of PR ( o/ ) and I noticed several problems our CI has to deal with that: - first the total number of active (== scheduled) jobs is limited to 500 by Gitlab.com. => since we have roughly 80 jobs in the pipeline, that means we can only test 5 PRs at once top. And since the runs are pretty long, that's not super good. - second, it seams we can have some kind of race in resources deletion: see this: https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/jobs/5726642749 => I think it could be the following scenario: one CI runs did the cleanup, and this one tried to do it simultaneously, thus failing the delete So, if we don't switch to prow.k8s.io, we'll have to address this somehow.

ant31 · 2023-12-12T15:14:59Z

first the total number of active (== scheduled) jobs is limited to 500 by Gitlab.com. => since we have roughly 80 jobs in the pipeline
The gitlab-pipeline would return in error, and could be retried (assuming we fix /retry).

cleanup, and this one tried to do it simultaneously, thus failing the delete

It's unrelated to gitlab, should need to be addressed regardless. Should check it doesn't exist = success

The docker-machine runner

I think it can be moved to a kubernetes-runner like other jobs.

VannTen · 2023-12-15T08:57:48Z

It's unrelated to gitlab, should need to be addressed regardless. Should check it doesn't exist = success

Yeah. Alternatively I was wondering if we could leverage ownerRefs to rely on k8s garbagecollection (like, the job pods would be the owner of the kubevirts VM ? Need to think about it 🤔 )

first the total number of active (== scheduled) jobs is limited to 500 by Gitlab.com. => since we have roughly 80 jobs in the pipeline

The gitlab-pipeline would return in error, and could be retried (assuming we fix /retry).

The abiity does not change much for that particular problem (it's merely more convenient than amending and force pushing) : as long as the CI is busy with other pipelines, new ones are gonna fail, and you have to wait until the current ones are finished (which can be several hours). If gitlab was willing to create the pipelines and let them wait until they'd be picked by the runner, that would be fine, they would eventually run.
Apart from bypassing Gitlab.com, I don't think we can do much on that.

=> would mean either:

test-infra/prow
tekton based pipelines (I've taken a look after you mentionned that on Slack, and it looks like it would not be that hard. Tekton has a lot of building blocks in the form of Task on the tekton hub, and you have a github events watcher that allows you to trigger stuff conditionnaly on "ChangedFiles" for instance)

ant31 · 2023-12-18T09:00:16Z

If gitlab was willing to create the pipelines and let them wait until they'd be picked by the runner, that would be fine, they would eventually run.

I wouldn't make it more complex than it is, just try to delete, if it's already deleted then it's a success.

If gitlab was willing to create the pipelines and let them wait until they'd be picked by the runner, that would be fine, they would eventually run.

failfast-ci can do the retry, especially if there error code is explicit it will queue again the job without any manual action, it can even report back to github that the CI is too busy and it will start later.

VannTen · 2023-12-18T09:03:51Z

failfast-ci can do the retry, especially if there error code is explicit it will queue again the job without any action, it can even report back to github that the CI is too busy and it will start later.

That would be great. Is that just something to configure which is not enabled currently ?

ant31 · 2023-12-18T09:07:27Z

probably small code change, need to parse the error code from gitlab (assuming it's explicit) and then "retry" the job instead of discard. can also always retry like N time and it will be just a setting then

ant31 · 2023-12-22T15:57:03Z

I've started to refactor failfast-ci and I'll try to include missing features pointed above:

/retest on comment from 'owners' if made from a approver/reviewer
auto-retry pipeline on Gitlab failures (e.g gitlab is down or too many jobs scheduled)
test only when label "ok-to-test" is present

Not sure when I'll be finish as I'm refreshing many parts of the code, but someday :)

k8s-triage-robot · 2024-03-21T16:22:25Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

VannTen · 2024-03-21T18:31:53Z

/lifecycle frozen /remove-lifecycle stale -- Max Gautier

ant31 · 2024-10-29T09:09:07Z

@VannTen I think most of the wanted features are now implemeted and it runs much smoother.

I suggest we switch to discuss of improvement over the current CI and closing the topic of moving to prow.

VannTen · 2024-10-31T08:30:36Z

Agreed, let's close this. I'm might open a meta / pinned issue later to list quirks/flakes, in particular those which requires workflow adjustments from contributors. /close

k8s-ci-robot · 2024-10-31T08:30:42Z

@VannTen: Closing this issue.

In response to this:

Agreed, let's close this.
I'm might open a meta / pinned issue later to list quirks/flakes, in particular those which requires workflow adjustments from contributors.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

VannTen added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 1, 2023

k8s-ci-robot assigned VannTen Dec 1, 2023

VannTen mentioned this issue Dec 1, 2023

Questions on migrating kubespray CI to test-infra kubernetes/test-infra#31351

Open

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 21, 2024

k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 21, 2024

k8s-ci-robot closed this as completed Oct 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migrate CI to prow (test-infra) #10682

Migrate CI to prow (test-infra) #10682

VannTen commented Dec 1, 2023 •

edited

Loading

VannTen commented Dec 1, 2023

VannTen commented Dec 8, 2023

yankay commented Dec 12, 2023 •

edited

Loading

VannTen commented Dec 12, 2023 via email

ant31 commented Dec 12, 2023 •

edited

Loading

VannTen commented Dec 15, 2023

ant31 commented Dec 18, 2023 •

edited

Loading

VannTen commented Dec 18, 2023 via email

ant31 commented Dec 18, 2023 •

edited

Loading

ant31 commented Dec 22, 2023

k8s-triage-robot commented Mar 21, 2024

VannTen commented Mar 21, 2024 via email

ant31 commented Oct 29, 2024

VannTen commented Oct 31, 2024 via email

k8s-ci-robot commented Oct 31, 2024

Migrate CI to prow (test-infra) #10682

Migrate CI to prow (test-infra) #10682

Comments

VannTen commented Dec 1, 2023 • edited Loading

VannTen commented Dec 1, 2023

VannTen commented Dec 8, 2023

yankay commented Dec 12, 2023 • edited Loading

VannTen commented Dec 12, 2023 via email

ant31 commented Dec 12, 2023 • edited Loading

VannTen commented Dec 15, 2023

ant31 commented Dec 18, 2023 • edited Loading

VannTen commented Dec 18, 2023 via email

ant31 commented Dec 18, 2023 • edited Loading

ant31 commented Dec 22, 2023

k8s-triage-robot commented Mar 21, 2024

VannTen commented Mar 21, 2024 via email

ant31 commented Oct 29, 2024

VannTen commented Oct 31, 2024 via email

k8s-ci-robot commented Oct 31, 2024

VannTen commented Dec 1, 2023 •

edited

Loading

yankay commented Dec 12, 2023 •

edited

Loading

ant31 commented Dec 12, 2023 •

edited

Loading

ant31 commented Dec 18, 2023 •

edited

Loading

ant31 commented Dec 18, 2023 •

edited

Loading