This repository has been archived by the owner on May 14, 2024. It is now read-only.

GitHub Actions CI contention #237

Closed
briantist opened this issue Jun 4, 2023 · 32 comments

Comments

@briantist

Summary

As more Ansible collections within the ansible-collections organization use GitHub Actions as their CI, we're seeing increased contention for CI runners. This has been painful in the past, even for some single collections, but is getting worse, primarily because concurrency limits are at the account, not repository, level.

See:

This means that all repositories in this org are sharing the 20 concurrent job limit. To use my small collection as an example, a single PR to community.hashi_vault generates 27 jobs for CI, plus 4 for docs build. If there are multiple PRs, or even several commits within a PR before the previous runs finish, the wait times grow and grow. This is after recently removing two older versions of ansible-core from the CI matrix.
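To make the fan-out concrete, here's a trimmed-down, hypothetical sketch of the usual matrix-style workflow (the versions and job names are illustrative, not the actual community.hashi_vault matrix):

```yaml
# Hypothetical sketch of how one PR fans out into many jobs: each test type is
# crossed with several ansible-core (and Python) versions, and every cell in the
# matrix is a separate job competing for the org-wide concurrency limit.
name: CI
on:
  pull_request:

jobs:
  sanity:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        ansible: [stable-2.14, stable-2.15, devel]   # 3 jobs
    steps:
      - uses: actions/checkout@v3
      - run: echo "ansible-test sanity (${{ matrix.ansible }})"

  units:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        ansible: [stable-2.14, stable-2.15, devel]
        python: ['3.9', '3.10', '3.11']              # 3 x 3 = 9 more jobs
    steps:
      - uses: actions/checkout@v3
      - run: echo "ansible-test units (${{ matrix.ansible }} / py${{ matrix.python }})"
```

Integration tests typically add another, larger matrix on top of this, which is how a single collection exceeds the 20-job limit on its own.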

Doing releases, which generate many runs (from the release PR, the push from merging the release PR, the push of the tag, etc.), ends up taking around 2 hours when the actual steps take minutes.


New collections continue to default to GHA, and there seems to be a push to move existing CI out of Zuul and Azure into GHA (just two examples):

  • Move some tests to GH Actions (ansible-collections/community.vmware#1747)
  • Add github action for sanity and unit tests (ansible-collections/amazon.aws#1393)

I think this is a good thing, but we will see this problem compound as a result.


Suggestion

As shown in the billing page linked above, it is possible to get more concurrency with a non-free GitHub plan.

[Screenshot: GitHub's table of maximum concurrent jobs per plan]

If we can get the ansible-collections organization (at a minimum) into a non-free plan, we can get increased concurrency, which would be a huge improvement for everyone. The more concurrency, the better.

This probably has to be done by Red Hat since they own the organization. My hope is that they already have a non-free account that the org could be moved into, and that it won't require a new sign-up/agreement or whatever, but I really don't know any details about that.

Potential cost

Disregarding the cost of a non-free plan itself, GitHub runners have per-minute rates:

Public repositories can use GHA for free, and so are not billed per-minute rates, and I think this still applies to public repos in a paid account.

Based on a number of assumptions that need to be confirmed, there's a chance that we can get increased concurrency without any additional spend. The assumptions are:

  • there is already a paid account that the organization can be moved into, and that this will not increase the cost of that plan
  • public repositories in the paid account will still not need to pay for GHA hosted runners
  • the additional concurrency limits granted by the paid plan will also apply to the free runners in public repos

All of these assumptions need to be checked.


I really cannot overstate how helpful this will be for the community.

@felixfontein
Contributor

Some of the AZP-based collections (community.general, community.crypto, community.docker) now also make more use of GHA, since we should no longer use AZP for EOL versions of ansible-core/ansible-base/Ansible 2.9, some of which these collections still support. This increases the general CI load for all collections in gh.com/ansible-collections.

(And yes, CI feels incredibly slow in all the collections I'm currently maintaining, especially when there is more than one PR - in the same collection or across multiple collections - but also with a single PR; the same probably holds for other collections I don't actively watch.)

@mariolenz
Contributor

The integration tests for community.vmware running on Zuul take hours, so I can live with GH actions taking several minutes. But I understand your problem. Especially since it looks like there will be even more tests moved to GH action. Quoting from ansible-collections/community.vmware#1746:

We are also in the process of migrating the other collections the team manages to Github Actions and while we don't have a migration plan for vmware.vmware_rest's CI yet we will need to evaluate this in the long term. The complexity of the Zuul platform is not meeting our needs for other collections and we'd like to reduce the amount that we depend on this system and have to maintain expertise in it.

@briantist
Author

The integration tests for community.vmware running on Zuul take hours, so I can live with GH actions taking several minutes. But I understand your problem. Especially since it looks like there will be even more tests moved to GH action.

I definitely am not advocating for not moving things to GHA; it's certainly an improvement in this case. And I am less concerned with how long a particular test takes to run on GHA once it starts; the issue is how long the run itself sits queued because we hit our concurrency limit.

As another example, some of my individual job runs take 10+ minutes (while the whole of CI can take 30+); this is partially because the overall test time would be longer if I made the tests more parallel, due to queueing times. If we had massive concurrency (like the 500 limit we'd get with an enterprise account) I would restructure my CI runs and they would complete much faster overall, and be simpler in design.

So the issue does not just result in longer, more painful run times; it also redirects our limited engineering time and effort toward working around the problem, effort that could be better spent on the collections themselves, and often results in more complex, harder-to-maintain CI.
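To illustrate the kind of workaround I mean: a collection can deliberately throttle its own matrix with GHA's max-parallel, trading wall-clock time for being a better neighbour. A hypothetical sketch (versions are illustrative):

```yaml
# Hypothetical sketch: cap how many matrix jobs run at once so a single
# collection's CI doesn't swallow the whole org-wide concurrency limit.
jobs:
  integration:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 5        # at most 5 of the 9 matrix jobs run concurrently
      matrix:
        ansible: [stable-2.14, stable-2.15, devel]
        python: ['3.9', '3.10', '3.11']
    steps:
      - run: echo "integration (${{ matrix.ansible }} / py${{ matrix.python }})"
```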

@cybette
Member

cybette commented Jun 5, 2023

Appreciate the write-up and additional details everyone! Just a note to say this is on our radar and we're looking into it.

@mariolenz
Contributor

I definitely am not advocating for not moving things to GHA; it's certainly an improvement in this case.

I didn't say you're advocating against moving things to GHA. After all, you've said:

New collections continue to default to GHA, and there seems to be a push to move existing CI out of Zuul and Azure into GHA (just two examples):

* [Move some tests to GH Actions ansible-collections/community.vmware#1747](https://github.com/ansible-collections/community.vmware/pull/1747)

* [Add github action for sanity and unit tests ansible-collections/amazon.aws#1393](https://github.com/ansible-collections/amazon.aws/pull/1393)

I think this is a good thing, but we will see this problem compound as a result.

What I wanted to say is: It looks like RH plans to move CI jobs from their own Zuul CI to GH actions, which makes the current situation even worse.

@GregSutcliffe
Contributor

Just wanted to put an update here because I figure it might come up again soon. We've been talking about a few options internally, and I want to make sure we get early input before "suggestions" start looking like "decisions" :)

There are a couple of options we could go for:

  • The Community Team could pay for a non-free GitHub account (probably at the Team level, as Enterprise is significantly more expensive) for the Ansible Community
    • Pros:
      • Concurrent runners are just for this org
      • We can have the Steering Committee as org admins
      • Maintainers for repos can be "outside collaborators" which will keep costs down
      • Pretty cheap ($4 / seat)
    • Cons:
      • Only upgrades us to 3000 mins/month (from 2000 for a free org)
      • Another org & account to maintain (minor burden)

This is pretty easy to justify, but is a 50% increase in CI minutes enough? If we feel the 50k minutes of Enterprise is needed, how many seats do we think we need (my estimate is ~20)?

Another option we're thinking about is self-hosted runners:

  • Pros:
    • Can be added to a free org, so no direct monetary cost
    • Allows a way for interested companies / people to "donate" hardware to the project (easier than donating money)
    • Gives us some flexibility in running intensive collections on specific hosts
  • Cons:
    • Need to source the runners
    • Need to administer the runners

@Spredzy @leogallego did I miss anything?

Very interested to hear some thoughts on either of these, or other possible solutions, please let us know!
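For concreteness, the self-hosted option would mostly mean registering the donated machines as organization runners and then targeting them by label in workflows; a minimal, hypothetical sketch (the labels are made up and would be assigned when the machines are registered):

```yaml
# Hypothetical sketch: a job that targets org-registered self-hosted runners by
# label instead of a GitHub-hosted runner.
jobs:
  integration:
    runs-on: [self-hosted, linux, x64]
    steps:
      - uses: actions/checkout@v3
      - run: echo "this job runs on donated hardware, not a GitHub-hosted runner"
```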

@GregSutcliffe GregSutcliffe added the next_meeting Topics that needs to be discussed in the next Community Meeting label Jun 28, 2023
@leogallego
Contributor

Great summary @GregSutcliffe, I would only add that there is the option to "Add Action Runners" for an additional cost, even if we use the Team plan. At the time of writing this comment:

GitHub-managed Standard 2-core machines with default GitHub images:

  • Ubuntu Linux: $0.48/hr
  • Microsoft Windows: $0.96/hr
  • macOS: $4.80/hr

@felixfontein
Contributor

I think the by far cheapest thing is to use self-hosted runners - if you ignore the administration costs.

The main problem with self-hosted runners is (IMO) making them sufficiently secure. Having a system which creates a new VM for every CI job and shuts the VM down at the end of the run is the safest option (and also what GitHub itself does), but such a system doesn't just set itself up (I'm not sure how much of this GitHub's public runner software provides). Also, we definitely want some caching of docker/podman images, since almost every ansible-test run pulls at least one image, and you want to avoid the extreme amounts of traffic this can easily cause.
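As one possible shape for the image caching, here's a hypothetical sketch that saves/restores the ansible-test default container via actions/cache (the tag and cache key are made up; on self-hosted runners a local registry mirror or pull-through cache would serve the same purpose):

```yaml
# Hypothetical sketch: cache the ansible-test default container so it isn't
# pulled fresh from the registry on every run. Image tag and cache key are
# illustrative only.
jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - name: Restore cached test container
        id: image-cache
        uses: actions/cache@v3
        with:
          path: /tmp/default-test-container.tar
          key: default-test-container-1.2.3

      - name: Pull and save the image on a cache miss
        if: steps.image-cache.outputs.cache-hit != 'true'
        run: |
          docker pull quay.io/ansible/default-test-container:1.2.3
          docker save -o /tmp/default-test-container.tar quay.io/ansible/default-test-container:1.2.3

      - name: Load the image from the cache on a hit
        if: steps.image-cache.outputs.cache-hit == 'true'
        run: docker load -i /tmp/default-test-container.tar
```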

Another option might be using a third-party CI service (I don't really have experience with any, so I won't name anything here but AZP which is already used by some collections and by ansible-core itself).

@briantist
Author

briantist commented Jun 28, 2023

@GregSutcliffe thanks very much for looking into all this.

I want to first clarify the "minutes" in GHA, since I had a different understanding of how they work.

My understanding is that per-minute rates apply only to use within private repositories, and there are no per-minute charges for public repositories.

The number of minutes "included" in a plan only counts toward billable minutes (larger runners, or any runners in private repos). Similarly, this is what the multipliers refer to for macOS and Windows runners (they apply to the number of included minutes, not to the billing rate).

Evidence toward that:

GitHub Actions usage is free for standard GitHub-hosted runners in public repositories, and for self-hosted runners. For private repositories, each GitHub account receives a certain amount of free minutes and storage for use with GitHub-hosted runners, depending on the product used with the account. Any usage beyond the included amounts is controlled by spending limits. For more information

With this being the case, it's not the number of minutes that we need to worry about; our issue is purely about the number of concurrent jobs that can be running at a given time, so concurrency is basically the only metric we have to consider, imo.

It's also possible I am not understanding this correctly... but if I am, then upgrading the plan should by far be the most straightforward option.


Re: the idea of using self-hosted runners: I personally think that the work involved will far exceed the benefit unless we've (you've?) engineered a pretty solid implementation with feature parity to GH-hosted runners, keeping in mind that this covers more than just Linux (though if we covered Linux it would take care of the bulk of jobs, so still helpful).

I think the by far cheapest thing is to use self-hosted runners - if you ignore the administration costs.

Emphasis on the last bit.

That being said, I do like the "donation" option, but that can be done now without the need for making any changes to the GitHub plan. It's more of a policy thing so it feels like a conversation we could have, but unrelated to this.

Self-hosted runners do allow for larger/faster runners, and increased scaling (potentially), but I think it hinges entirely on having ephemeral runner pools.

I personally think it's a better cost-benefit and experience to go ahead and pay GitHub for more capacity, but that's easy to say when it's not my money ;)

If we did want to look at automating self-hosted ephemeral runners, I've seen this service around, but I am not affiliated and have never used it myself: https://cirun.io/

@Andersson007
Contributor

In addition, I created a PR against the collection_template GHA matrix template to make future maintainers aware of the limitations: ansible-collections/collection_template#62. FYI

@GregSutcliffe
Contributor

Thanks @briantist - I had missed that nuance. I think I agree that self-hosted runners are more work than we'd like, but I didn't just want to present a single fait-accompli "option" :)

I had a look at the billing data for ansible-collections, and at least according to GitHub, we've used exactly 0 minutes of CI:

[Screenshot: the ansible-collections Actions billing page showing 0 minutes used]

I suspect this is because public repos don't count, as discussed above, so it's not logged - but that makes it hard to know what we've actually used. However, if we're sure that a Team account will help, then I think that's a fairly low-cost option anyway. Thoughts? I'll have a dig to see if I can get more "real" data, but if anyone already knows how, get in touch ;)

@briantist
Author

Indeed showing 0 for usage is accurate for billing reasons, but not helpful data!

I'm not sure where to see actual usage in aggregate, but on a per-run level, you can look at any GHA run and click the Usage link in the lower left below the jobs. That will break down per-minute usage of each job and then give a total for the run, both for actual and billable usage. If that were available at a repository or organization level it would be nice, but I haven't been able to find it on projects I have more access to.

If you have a contact or rep at GitHub, it would be great to confirm our understanding of it all, and maybe they know how to get better data too.

If we're right about only needing to worry about concurrency, then based on the screenshot in my original post, a team plan triples it from 20 to 60 concurrent jobs, which I think would be a big noticeable improvement!

Re: self-hosted, thank you for including that option, you're right to include that in consideration and offer alternatives.

Thanks very much for all of this!

@Andersson007
Contributor

Also added a corresponding note to Collection requirements ansible/ansible-documentation#40, FYI

@mariolenz
Contributor

Sorry, I'm no expert on GHA but does this:

https://github.com/ansible-collections/collection_template/blob/21bae2b5c56c6e758e4f2780646b8277e1ad5ed9/.github/workflows/ansible-test.yml#L286-L287

mean that tests are run against ansible-core 2.15 with Python 2.7? Does 2.15 even support 2.7 still? I mean on the controller node. I might be wrong, but if I'm not this would be a test we can get rid of. And maybe there are some more.
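If a collection really doesn't need such a pairing, dropping it is just a matrix exclude; a hypothetical sketch with illustrative versions (each excluded cell is one fewer queued job):

```yaml
# Hypothetical sketch: trim one combination out of a test matrix.
strategy:
  matrix:
    ansible: [stable-2.14, stable-2.15]
    target-python: ['2.7', '3.6', '3.11']
    exclude:
      - ansible: stable-2.15
        target-python: '2.7'
```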

@briantist
Author

Sorry, I'm no expert on GHA but does this:

https://github.com/ansible-collections/collection_template/blob/21bae2b5c56c6e758e4f2780646b8277e1ad5ed9/.github/workflows/ansible-test.yml#L286-L287

mean that tests are run against ansible-core 2.15 with Python 2.7? Does 2.15 even support 2.7 still? I mean on the controller node. I might be wrong, but if I'm not this would be a test we can get rid of. And maybe there are some more.

Thanks, good question @mariolenz. Yes, on targets Python 2.7 is still supported in 2.15: https://docs.ansible.com/ansible/latest/reference_appendices/release_and_maintenance.html#support-life


I think you won't find all that many collections testing that combination. It's a good thing to raise, and we should still look to reduce where we can, but the larger issue will not be meaningfully solved by trimming a few jobs here and there.

While the issue of start times in particular is getting worse due to more collections using GHA, contention has been a problem for a long time even for single collections like mine just due to having more than 20 jobs per run. Increased concurrency will be a big quality of life improvement and increase in velocity.

@mariolenz
Contributor

Thanks, good question @mariolenz. Yes, on targets Python 2.7 is still supported in 2.15: https://docs.ansible.com/ansible/latest/reference_appendices/release_and_maintenance.html#support-life

A lot of collections don't ssh somewhere and run there, but run on the controller and connect to a remote API. So they only need to run the tests against Python versions ansible-core supports on the controller node. Just wanted to mention it.

While the issue of start times in particular is getting worse due to more collections using GHA, contention has been a problem for a long time even for single collections like mine just due to having more than 20 jobs per run. Increased concurrency will be a big quality of life improvement and increase in velocity.

I understand the problem, and my suggestion wouldn't help much. Still, I thought I should mention that there might be some opportunity for improvements. Slight improvements, though... nothing to really fix the basic problem.

@briantist
Author

A lot of collections don't ssh somewhere and run there, but run on the controller and connect to a remote API. So they only need to run the tests against Python versions ansible-core supports on the controller node. Just wanted to mention it.

Interesting, I didn't know that! I thought the majority of collections tested against containers.

@mariolenz
Contributor

@briantist Take dellemc.openmanage as an example:

Dell OpenManage Ansible Modules allows data center and IT administrators to use RedHat Ansible to automate and orchestrate the configuration, deployment, and update of Dell PowerEdge Servers and modular infrastructure by leveraging the management automation capabilities in-built into the Integrated Dell Remote Access Controller (iDRAC), OpenManage Enterprise (OME) and OpenManage Enterprise Modular (OMEM).

The classic Ansible approach of copying the module to the target and running it there doesn't apply. I'm pretty sure it's technically impossible to run Python code on iDRAC, and quite sure that it's at least not supported with OME (I don't know about OMEM, though). So what those modules do is run on the controller and talk to a remote API. Since there's no code executed on the target, there's no need to test against target Python versions.

I don't know how many collections work like this, but I think there are quite a lot. I should say that all collections dealing with cloud infrastructure don't run the modules directly "on" the cloud target; they talk to an API.

There are quite a few collections that automate things where the "natural" approach is to call an API (a lot of storage arrays, firewalls, network devices...) because you can't run Python code on the target, or it isn't supported, or at least isn't best practice.

Classic Ansible:

  1. Run Playbook
  2. Copy modules to target
  3. Run modules on target

"API collections":

  1. Run Playbook
  2. Run modules on controller node (delegate_to: localhost)
  3. One or more API calls

At least that's the usual workflow for me when using community.vmware. Of course, you can delegate to another host (not localhost / the controller node), but I don't know the Python requirements in that case: do the controller node's or the target's Python version requirements apply?
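A minimal, hypothetical sketch of that "API collection" pattern (the module name and its parameters are invented for illustration):

```yaml
# Hypothetical sketch: the module runs on the controller (delegate_to: localhost)
# and talks to the appliance's API, so no Python ever executes on the target.
- name: Manage appliances through their remote API
  hosts: appliances                          # inventory entries for the managed devices
  gather_facts: false
  tasks:
    - name: Create a resource by calling the appliance API
      community.example.appliance_resource:  # hypothetical module
        api_host: "{{ inventory_hostname }}"
        name: test-resource
        state: present
      delegate_to: localhost                 # module code executes on the controller
```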

@felixfontein
Contributor

A lot of collections don't ssh somewhere and run there, but run on the controller and connect to a remote API. So they only need to run the tests against Python versions ansible-core supports on the controller node. Just wanted to mention it.

Interesting, I didn't know that! I thought the majority of collections tested against containers.

These are (usually) also tested in containers, it's just that no specific target container is needed (i.e. target = controller).

(Also, there are some special cases where such modules are also run on a target != controller, namely when the machine/API you need to talk to isn't reachable from your machine, but only through some jump host. Then you can run ansible-playbook on your machine, while these modules run on the jump host :) I guess that isn't very common though - in fact it's probably very rare.)

@felixfontein
Contributor

mean that tests are run against ansible-core 2.15 with Python 2.7? Does 2.15 even support 2.7 still? I mean on the controller node. I might be wrong, but if I'm not this would be a test we can get rid of. And maybe there are some more.

This is relevant for all collections that have content intended to run on a target (and that don't have a restriction on the target Python that disallows 2.7); for all of these, it could be needed. (You don't have to test every single supported ansible-core release with a Python 2.7 target, but at least some; whether 2.7 belongs on the list is up to the collection maintainers to decide.)

@briantist
Author

A lot of collections don't ssh somewhere and run there, but run on the controller and connect to a remote API. So they only need to run the tests against Python versions ansible-core supports on the controller node. Just wanted to mention it.

Interesting, I didn't know that! I thought the majority of collections tested against containers.

These are (usually) also tested in containers, it's just that no specific target container is needed (i.e. target = controller).

(Also, there are some special cases where such modules are also run on a target != controller, namely when the machine/API you need to talk to isn't reachable from your machine, but only through some jump host. Then you can run ansible-playbook on your machine, while these modules run on the jump host :) I guess that isn't very common though - in fact it's probably very rare.)

I know about these, I just didn't know that the number was "a lot" ;)

@mariolenz
Contributor

@briantist I didn't have a closer look at how many collections don't run on the target, but on the controller and talk to an API. And anyway, "a lot" isn't really defined. But I should say we're talking about 10 to 20% of the collections in the community package. Maybe a bit more, but not less. However, this is just a guess from my side.

@briantist
Author

Thanks @mariolenz , really appreciate the info!

@samccann samccann removed the next_meeting Topics that needs to be discussed in the next Community Meeting label Jul 19, 2023
@briantist
Author

Hi @GregSutcliffe , wondering if there's any news on this?

@GregSutcliffe
Contributor

Oof, I lost track of this with all the other fun we've been having. Apologies!

I've re-read the posts I missed, but I don't see anything that changes the current plan, which I believe is:

  • Ask GH for some help to understand our numbers
  • Look at getting a Team plan in place to help with the concurrency

I'll find out who our GH contact is and get in touch with them.

@briantist
Author

Hey @GregSutcliffe , I know things have been busy with the Ansible forum rollout and such, just want to check in on this again because it's still quite an issue.

@GregSutcliffe
Contributor

GregSutcliffe commented Oct 2, 2023

Apologies, indeed it has been a busy month. I have just emailed a contact at GH, likely they are not the right person to speak to but they should be able to help me speak to whoever is. Will update once I know more, apologies for the delay.

@GregSutcliffe
Contributor

GregSutcliffe commented Oct 5, 2023

So, I have news :)

GitHub have kindly upgraded us to the Team plan for this org, which gives us 50 concurrent jobs instead of 20. Hopefully that will help things feel better right away. Seats should not be an issue; we have enough to cover all the org members, plus a bit of headroom, so I'm not worried about that.

We're also looking into why the usage report doesn't actually report usage (billing and usage are not the same thing). We'll give it a week or so on the new plan to see if data starts to come through, and then I'll check in with GitHub again if not. Once we have usage data, we'll have the tools to check what's going on if we start to hit issues again.

Thanks for your patience folks! Sorry it took so long, that's entirely on me - and obviously, big thanks to GitHub for the upgrade.

@felixfontein
Contributor

@GregSutcliffe that's really awesome news! :)

@briantist
Author

briantist commented Oct 5, 2023

🎉🎊🥳

@GregSutcliffe amazing! thank you so much! I can confirm I was able to run my CI today (30 jobs?) with no queued jobs, so the higher concurrency is definitely in effect.

@felixfontein I'm especially interested in your anecdotal experiences, since you see so many more runs than I do, across many different collections.

@felixfontein
Contributor

I don't have much anecdotal experience yet, but so far GHA feels a lot smoother than before.

@Andersson007
Contributor

Great news!
So I'm closing the issue. If anyone thinks this topic needs more discussion, just reopen it.
Thanks much to everyone!
