GitHub Actions CI contention #237
Comments
Some of the AZP-based collections (community.general, community.crypto, community.docker) now also make more use of GHA, since we should no longer use AZP for EOL versions of ansible-core/ansible-base/Ansible 2.9, some of which these collections still support. This increases the general CI load for all collections in gh.com/ansible-collections. (And yes, CI currently feels incredibly slow in all the collections I'm maintaining, especially if there is more than one PR - in the same collection or across multiple collections - but also if there is only a single PR, presumably because of activity in other collections I don't actively watch.)
The integration tests for community.vmware running on Zuul take hours, so I can live with GH Actions taking several minutes. But I understand your problem, especially since it looks like there will be even more tests moved to GH Actions. Quoting from ansible-collections/community.vmware#1746:
I definitely am not advocating against moving things to GHA; it's certainly an improvement in this case. And I am less concerned with how long a particular test takes to run on GHA once it starts; the issue is how long the run itself sits queued because we hit our concurrency limit. As another example, some of my individual jobs take 10+ minutes (while the whole CI run can take 30+); this is partly because the overall test time would be longer if I made the tests more parallel, due to queueing. If we had massive concurrency (like the 500-job limit we'd get with an enterprise account), I would restructure my CI runs; they would complete much faster overall and be simpler in design. So the issue does not just result in longer, painful wait times; it also redirects our limited engineering time toward working around the problem - effort that could be better spent on the collections themselves - and often results in more complex, harder-to-maintain CI.
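To make the "less parallel" restructuring concrete, here's a rough sketch (the values are made up, not from any real collection) of throttling a matrix with max-parallel - exactly the kind of workaround I mean:

```yaml
# Rough sketch (hypothetical values): throttle a large matrix so a single PR
# does not occupy the whole shared concurrency allowance at once.
name: CI
on:
  pull_request:

jobs:
  integration:
    runs-on: ubuntu-latest
    strategy:
      # Without max-parallel, every matrix job queues immediately and competes
      # for the org-wide concurrent-job limit.
      max-parallel: 5
      fail-fast: false
      matrix:
        ansible: [stable-2.14, stable-2.15, devel]
        python: ['3.9', '3.10', '3.11']
    steps:
      - uses: actions/checkout@v4
      - name: Run tests (placeholder)
        run: echo "ansible=${{ matrix.ansible }} python=${{ matrix.python }}"
```

The trade-off is obvious: the run finishes later, but it leaves headroom for other PRs and collections.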
Appreciate the write-up and additional details everyone! Just a note to say this is on our radar and we're looking into it.
I didn't say you're advocating against moving things to GHA. After all, you've said:
What I wanted to say is: It looks like RH plans to move CI jobs from their own Zuul CI to GH actions, which makes the current situation even worse.
Just wanted to put an update here because I figure it might come up again soon. We've been talking about a few options internally, and I want to make sure we get early input before "suggestions" start looking like "decisions" :) There are a couple of options we could go for:
This is pretty easy to justify, but is a 50% increase in CI minutes enough? If we feel the 50k minutes of Enterprise is needed, how many seats do we think we need (my estimate is ~20)? Another option we're thinking about is self-hosted runners:
@Spredzy @leogallego did I miss anything? Very interested to hear thoughts on either of these, or other possible solutions - please let us know!
Great summary @GregSutcliffe, I would only add that there is the option to "Add Action Runners" for an additional cost, even if we use the Team plan. At the time of writing this comment:
I think by far the cheapest option is to use self-hosted runners - if you ignore the administration costs. The main problem with self-hosted runners is (IMO) making them sufficiently secure. Having a system which creates a new VM for every CI job and shuts the VM down at the end of the run is the safest option (and also what GitHub itself is doing), but such a system doesn't just set itself up (I'm not sure how much of this GH's public software covers). Also, we definitely want some caching of docker/podman images, since almost every ansible-test run pulls at least one image, and you want to avoid the extreme amount of traffic this can easily generate.

Another option might be using a third-party CI service (I don't really have experience with any, so I won't name anything here except AZP, which is already used by some collections and by ansible-core itself).
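Just to give an idea of what the workflow side could look like (a rough sketch only - the runner labels, image and tag are assumptions, and creating/destroying the ephemeral VMs has to happen outside this file, in whatever system registers the runners):

```yaml
# Rough sketch: a job targeting a pool of self-hosted, ephemeral runners.
# The labels must match whatever the runner registration uses; the image
# pull ideally goes through a local registry mirror configured on the
# runner image, to avoid re-downloading on every run.
name: CI (self-hosted sketch)
on:
  pull_request:

jobs:
  integration:
    runs-on: [self-hosted, linux, ephemeral]
    steps:
      - uses: actions/checkout@v4
      - name: Warm the container cache
        run: podman pull quay.io/ansible/default-test-container:latest  # image/tag are assumptions
      - name: Run integration tests (placeholder)
        run: echo "ansible-test integration would run here"
```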
@GregSutcliffe thanks very much for looking into all this. I want to first clarify the "minutes" in GHA, since I had a different understanding of how they work. My understanding is that per-minute rates apply only to usage within private repositories, and there are no per-minute charges for public repositories. The number of minutes "included" in a plan only counts toward billable minutes (larger runners, or any runners in private repos). Similarly, that is what the multipliers for macOS and Windows runners refer to (they apply to the number of included minutes, not to the billing rate). Evidence toward that:
With this being the case, it's not the number of minutes that we need to worry about; our issue is purely the number of concurrent jobs that can be running at a given time, so that is basically the only metric we have to consider, imo. It's also possible I am not understanding this correctly... but if I am, then upgrading the plan should by far be the most straightforward option. Re: the idea of using self-hosted runners: I personally think the work involved would far exceed the benefit unless we've (you've?) engineered a pretty solid implementation with feature parity to GH-hosted runners, keeping in mind that this covers more than just Linux (though if we covered Linux it would take care of the bulk of jobs, so still helpful).
Emphasis on the last bit. That being said, I do like the "donation" option, but that can be done now without any changes to the GitHub plan; it's more of a policy thing, so it feels like a conversation we could have, but one unrelated to this. Self-hosted runners do allow for larger/faster runners and (potentially) increased scaling, but I think it hinges entirely on having ephemeral runner pools. I personally think it's a better cost-benefit and experience to go ahead and pay GitHub for more capacity, but that's easy to say when it's not my money ;) If we did want to look at automating self-hosted ephemeral runners, I've seen this service around, but I am not affiliated and have never used it myself: https://cirun.io/
In addition, I created a PR against the collection_template GHA matrix template to make future maintainers aware of the limitations: ansible-collections/collection_template#62. FYI
Thanks @briantist - I had missed that nuance. I think I agree that self-hosted runners are more work than we'd like, but I didn't just want to present a single fait-accompli "option" :) I had a look at the billing data for the org, and it shows no usage at all. I suspect this is because public repos don't count, as discussed above, so it's not logged - but that makes it hard to know what we've actually used. However, if we're sure that a Team account will help, then I think that's a fairly low-cost option anyway. Thoughts? I'll have a dig to see if I can get more "real" data, but if anyone already knows how, get in touch ;)
Indeed, showing 0 for usage is accurate for billing purposes, but it's not helpful data! I'm not sure where to see actual usage in aggregate, but on a per-run level you can look at any GHA run and click the Usage link in the lower left, below the jobs. That will break down the per-minute usage of each job and then give a total for the run, both for actual and billable usage. It would be nice if that were available at a repository or organization level, but I haven't been able to find it on projects where I have more access. If you have a contact or rep at GitHub, it would be great to confirm our understanding of it all, and maybe they know how to get better data too.

If we're right that concurrency is the only thing we need to worry about, then based on the screenshot in my original post, a Team plan triples it from 20 to 60 concurrent jobs, which I think would be a big, noticeable improvement!

Re: self-hosted, thank you for including that option - you're right to keep it in consideration and offer alternatives. Thanks very much for all of this!
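As an aside, the per-run Usage data also seems to be exposed via the REST API's run timing endpoint - I haven't verified this beyond the docs, so treat the endpoint and its output below as an assumption - which could in principle be scripted into a report:

```yaml
# Rough sketch (endpoint and output shape are assumptions): fetch the
# usage/timing data for a single workflow run via the API instead of the UI.
name: Run usage report
on:
  workflow_dispatch:
    inputs:
      run_id:
        description: ID of the workflow run to inspect
        required: true

jobs:
  usage:
    runs-on: ubuntu-latest
    steps:
      - name: Fetch per-run timing data
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          # Should report billable milliseconds per runner platform for that run.
          gh api "repos/${{ github.repository }}/actions/runs/${{ inputs.run_id }}/timing"
```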
Also added a corresponding note to Collection requirements ansible/ansible-documentation#40, FYI
Sorry, I'm no expert on GHA, but does this: mean that tests are run against ansible-core 2.15 with Python 2.7? Does 2.15 even support 2.7 still? I mean on the controller node. I might be wrong, but if I'm not, this would be a test we can get rid of. And maybe there are some more.
Thanks, good question @mariolenz! Yes, on targets Python 2.7 is still supported in 2.15: https://docs.ansible.com/ansible/latest/reference_appendices/release_and_maintenance.html#support-life I think you won't find all that many collections testing that combination. It's a good thing to raise, and we should still look to reduce where we can, but the larger issue will not be meaningfully solved by trimming a few jobs here and there. While the issue of start times in particular is getting worse due to more collections using GHA, contention has been a problem for a long time, even for single collections like mine, just due to having more than 20 jobs per run. Increased concurrency will be a big quality-of-life improvement and an increase in velocity.
A lot of collections don't ssh somewhere and run there, but run on the controller and connect to a remote API. So they only need to run the tests against Python versions ansible-core supports on the controller node. Just wanted to mention it.
I understand the problem, and my suggestion wouldn't help much. Still, I thought I should mention that there might be some opportunity for improvements. Slight improvements, though... nothing to really fix the basic problem.
Interesting, I didn't know that! I thought the majority of collections tested against containers.
@briantist Take dellemc.openmanage as an example:
The classic Ansible approach of copying the module to the target and running it there doesn't apply. I'm pretty sure it's technically impossible to run Python code on iDRAC, and quite sure that it's at least not supported with OME (I don't know about OMEM, though). So what those modules do is run on the controller and talk to a remote API. Since no code is executed on the target, there's no need to test against target Python versions. I don't know how many collections work like this, but I think there are quite a lot. I would say that all collections dealing with cloud infrastructure don't run the modules directly "on" the cloud target; they talk to an API. There are quite a few collections that automate things where the "natural" approach is to call an API (a lot of storage arrays, firewalls, network devices...) because you can't run Python code on the target, or it isn't supported, or at least isn't best practice. Classic Ansible:
"API collections":
At least that's the usual workflow for me when using community.vmware. Of course, you can delegate to another host (not localhost / the controller node), but I don't know the Python requirements in this case: do the controller node's or the target's Python version requirements apply?
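To illustrate the two models side by side, here's a rough sketch; the hosts, URL and package are made up, and ansible.builtin.uri just stands in for a vendor-specific "API" module:

```yaml
# Rough sketch contrasting the two execution models.
- name: Classic collection - module code is copied to the target and runs there
  hosts: webservers
  become: true
  tasks:
    - name: Install a package on the target host itself
      ansible.builtin.package:
        name: htop
        state: present

- name: API-style collection - module runs on the controller and talks to an API
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Query a management endpoint (URL is made up)
      ansible.builtin.uri:
        url: https://appliance.example.com/api/v1/status
        validate_certs: false
```

In the second play nothing is executed on the appliance itself, which is why only the controller's Python matters.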
These are (usually) also tested in containers; it's just that no specific target container is needed (i.e. target = controller). (Also there are some special cases where such modules are run on a target != controller, namely when the machine/API you need to talk to isn't reachable from your machine, but only through some jump host. Then you can run ansible-playbook on your machine, while these modules run on the jump host :) I guess that isn't very common though - in fact this is probably very rare.)
This is needed for all collections that have content intended to run on a target (and that don't restrict the target Python in a way that disallows 2.7); for all of these, testing with a 2.7 target could be needed. (You don't have to test every single supported ansible-core release with a Python 2.7 target, but at least some; whether 2.7 belongs on the list is up to the collection maintainers to decide.)
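A rough sketch of what "at least some, but not every combination" could look like as an explicit matrix (the version pairs are examples only, not a recommendation):

```yaml
# Rough sketch: exercise the Python 2.7 target against a subset of
# ansible-core versions instead of the full cross product.
name: CI
on:
  pull_request:

jobs:
  integration:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        include:
          - ansible: stable-2.14
            target_python: '2.7'
          - ansible: stable-2.15
            target_python: '3.11'
          - ansible: devel
            target_python: '3.12'
    steps:
      - name: Run integration tests (placeholder)
        run: echo "ansible=${{ matrix.ansible }} target_python=${{ matrix.target_python }}"
```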
I know about these, I just didn't know that the number was "a lot" ;)
@briantist I didn't take a closer look at how many collections don't run on the target but run on the controller and talk to an API. And anyway, "a lot" isn't really defined. But I would say we're talking about 10 to 20% of the collections in the community package - maybe a bit more, but not less. However, this is just a guess on my side.
Thanks @mariolenz, really appreciate the info!
Hi @GregSutcliffe, wondering if there's any news on this?
Oof, I lost track of this with all the other fun we've been having. Apologies! I've re-read the posts I missed, but I don't see anything that changes the current plan, which I believe is:
I'll find out who our GH contact is and get in touch with them.
Hey @GregSutcliffe, I know things have been busy with the Ansible forum rollout and such; just want to check in on this again because it's still quite an issue.
Apologies, indeed it has been a busy month. I have just emailed a contact at GH; they are likely not the right person to speak to, but they should be able to help me find whoever is. Will update once I know more - apologies for the delay.
So, I have news :) GitHub have kindly upgraded us to the Team plan for this org, which gives us 50 concurrent jobs instead of 20. Hopefully, that will help things feel better right away. Seats should not be an issue; we have enough to cover all the org members, plus a bit of headroom, so I'm not worried about that.

We're also looking into why the usage report doesn't actually report usage (billing and usage are not the same thing). We'll give it a week or so on the new plan to see if data starts to come through, and then I'll check in with GitHub again if not. Once we have usage data, we'll have the tools to check what's going on if we start to hit issues again.

Thanks for your patience folks! Sorry it took so long, that's entirely on me - and obviously, big thanks to GitHub for the upgrade.
@GregSutcliffe that's really awesome news! :)
🎉🎊🥳 @GregSutcliffe amazing! Thank you so much! I can confirm I was able to run my CI today (30 jobs?) with no queued jobs, so the higher concurrency is definitely in effect. @felixfontein I'm especially interested in your anecdotal experiences, since you see so many more runs than I do, in many different collections.
I don't have much anecdotal experience yet, but so far GHA feels a lot smoother than before.
Great news!
Summary
As more Ansible collections within the ansible-collections organization use GitHub Actions as their CI, we're seeing increased contention for CI runners. This has been painful in the past, even for some single collections, but is getting worse, primarily because concurrency limits are at the account level, not the repository level. See:
This means that all repositories in this org are sharing the 20 concurrent job limit. To use my small collection as an example, a single PR to
community.hashi_vault
generates 27 jobs for CI, plus 4 for docs build. If there are multiple PRs, or even several commits within a PR before the previous runs finish, the wait times grow and grow. This is after recently removing two older versions of ansible-core from the CI matrix.Doing releases, which end up with many runs (from the release PR, the push from merging the release PR, the push of the tag, etc.) ends up taking like 2 hours when the actual steps take minutes.
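(For completeness: workflow-level concurrency settings can at least cancel superseded runs when several commits land on the same PR in quick succession - rough sketch below - but that only reduces waste within a repository; it does not address the account-level limit, which is the real problem.)

```yaml
# Rough sketch: cancel runs that have been superseded by a newer commit on the
# same ref, so they stop occupying the shared concurrency budget.
name: CI
on:
  pull_request:

concurrency:
  group: ci-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  sanity:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run sanity tests (placeholder)
        run: echo "tests would run here"
```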
New collections continue to default to GHA, and there seems to be a push to move existing CI out of Zuul and Azure into GHA (just two examples):
I think this is a good thing, but we will see this problem compound as a result.
Suggestion
As shown in the billing page linked above, it is possible to get more concurrency with a non-free GitHub plan.
If we can get the ansible-collections organization (at a minimum) onto a non-free plan, we can get increased concurrency, which would be a huge improvement for everyone. The more concurrency, the better.

This probably has to be done by Red Hat, since they own the organization. My hope is that they already have a non-free account that the org could be moved into, and that it won't require a new sign-up/agreement or whatever, but I really don't know any details about that.
Potential cost
Disregarding the cost of a non-free plan itself, GitHub runners have per-minute rates:
Public repositories can use GHA for free and so are not billed at per-minute rates, and I think this still applies to public repos in a paid account.
Based on a number of assumptions that need to be confirmed, there's a chance that we can get increased concurrency without any additional spend. The assumptions are:
All of these assumptions need to be checked.
I really cannot overstate how helpful this will be for the community.