-
Notifications
You must be signed in to change notification settings - Fork 662
feat: Add last chance check before orphan termination #4595
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…nality for incorrectly tagged runners
@stuartp44 in case the PR is in WIP, can you mark the PR as draft? |
The PR is actually ready to be review, the build failures where unrelated to my changes but I have fixed them nevertheless. @npalm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx for the PR. I only was able to a partial review so far. I have checked the Lambda code (excluding th tests). Also not tested a deployment.
This solution is solving the problem only for JIT enabled runners, but for non JIT the problem remains. Assuming the chances is less since typically less runners will be created. Do you have thoughts about alternatives? I think we should find a way / place to document this limitation.
I have made some comments, but need more time to go over the PR. Will ping you once ready, but laready sharing the feedback so far.
lambdas/functions/control-plane/src/scale-runners/scale-down.ts
Outdated
Show resolved
Hide resolved
lambdas/functions/control-plane/src/scale-runners/scale-down.ts
Outdated
Show resolved
Hide resolved
lambdas/functions/control-plane/src/scale-runners/scale-down.ts
Outdated
Show resolved
Hide resolved
lambdas/functions/control-plane/src/scale-runners/scale-down.ts
Outdated
Show resolved
Hide resolved
lambdas/functions/control-plane/src/scale-runners/scale-down.ts
Outdated
Show resolved
Hide resolved
Co-authored-by: Niek Palm <[email protected]>
Co-authored-by: Niek Palm <[email protected]>
Co-authored-by: Niek Palm <[email protected]>
Co-authored-by: Niek Palm <[email protected]>
…mports in scale-down.ts
Can you share a scenario how to test this as well? |
…fringe on other checks
My test was to allow the scale-up lambda create a runner and then tag it with ghr:orphan and then watch the next run of the scale down lambda. It will test the current state of the runner against the GH API and only if its offline but busy will it terminate the runner (this can be done by terminating the GH action runner process in the system under test, which I didn't do!) else it will remove the tag. For everything else, it should follow the previous process. My only concern is if the runner is offline and idle, I am not sure if me putting anymore logic in encroaches on other areas the code should deal with. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances control-plane Lambdas by adding EC2 tag removal, JIT runner support, and corresponding test coverage.
- Extend IAM policy to permit tag deletions
- Implement
untag
and last-chance orphan checks in scale-down - Add tagging of runner IDs and test updates for JIT scenarios
Reviewed Changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
modules/runners/policies/lambda-scale-down.json | Allow ec2:DeleteTags and tag:UntagResources in IAM |
lambdas/functions/control-plane/src/scale-runners/scale-up.ts | Add runner-ID tagging in JIT flow |
lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts | Mock and test new tag import |
lambdas/functions/control-plane/src/scale-runners/scale-down.ts | Introduce untag , last-chance orphan logic |
lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts | Cover JIT untag/terminate scenarios |
lambdas/functions/control-plane/src/aws/runners.ts | Export new untag function and capture runnerId tag |
lambdas/functions/control-plane/src/aws/runners.test.ts | Test listing of JIT instances and untag |
Comments suppressed due to low confidence (1)
lambdas/functions/control-plane/src/aws/runners.test.ts:280
- [nitpick] The test description 'removing extra tag' is ambiguous; rename it to something like 'should remove orphan tag from runner' for clarity.
it('removing extra tag', async () => {
Will non-JIT runners also be targeted as part of this PR? |
@avni-ef, no, as far as I am aware the JIT instances generate a runnerid (because they are pre registered) and add that via the scale-up lambda to the EC2 instance. We then look for this in the scale-down lambda and if that is present in the tags, we do the last chance check. Without that, its the current way of dealing with orphans which is tag and on the next run, terminate. |
Could we not get the runner-id via the start script for non JIT, once the runner is being registered? And store this in a tag? |
Thank you @stuartp44 @npalm for the quick response! |
With Orphan runners in this module we mean running aws instances not registered in GitHub. The scale-up acts based on running ec2 instances and try to remove them (scale-down) and clean orphan runners (could be caused by registration failures). The scale-down lambda is not cleaning registeried runners. In normal flow runners will be first removed from GitHub (de-register) and next terminated. For ephemeral runners the de-registration shoudl happen automacitally. Having offline runners in GitHub should not cause any side effects. |
Thanks for clarifying @npalm, I now understand that I'm looking at a different problem, I'll reply on this new issue |
Just a small nudge to get a review on this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stuartp44 looks good in general. I made some minor comments. I have also create a PR on top of this one #4667. This will also fix the same problem for non JIT runners. Please can you check the change and merge it in the PR. I have tested the PR to check the runner id is added for non JIT (linux /windows).
I also struggeld with testing the setup. I only checked the normal flows remains working. Which all look good.
lambdas/functions/control-plane/src/scale-runners/scale-down.ts
Outdated
Show resolved
Hide resolved
⚠️ This PR is on top of #4539 This pull request is an enhancement on top of #4539. The PR also introduces tagging via the start scripts for non-JIT configured instances. And it adds a state diagram to the ever growing complexity of the state diagram below terminating instances. ### Instance Tagging Enhancements: * Add function to tag runners with runner id for non JIT instances. Functions are designed to ignore errors to avoid causing a crash of the runner registration process. * side effect is that the instance is allowed to create the tag `ghr:github_runner_id` the instance is allowed to create the tag only on itself. ### Added docs * State diagram for scale-down added <img width="2084" height="3840" alt="Untitled diagram _ Mermaid Chart-2025-07-20-140732" src="https://github.com/user-attachments/assets/e88af647-98e5-46cb-8da4-28110c608d8d" /> --------- Co-authored-by: github-aws-runners-pr|bot <github-aws-runners-pr[bot]@users.noreply.github.com>
@stuartp44 thx for all the work! |
🤖 I have created a release *beep* *boop* --- ## [6.7.0](v6.6.2...v6.7.0) (2025-07-22) ### Features * Add last chance check before orphan termination ([#4595](#4595)) ([7cbd9e0](7cbd9e0)) ### Bug Fixes * [#4391](#4391) ([7cbd9e0](7cbd9e0)) * **lambda:** bump form-data from 2.5.1 to 2.5.5 in /lambdas ([#4672](#4672)) ([0ec1c9a](0ec1c9a)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: runners-releaser[bot] <194412594+runners-releaser[bot]@users.noreply.github.com>
This pull request introduces enhancements to the AWS Lambda functions for scaling GitHub Actions runners. Key changes include the addition of functionality to manage runner tags (tagging and untagging).
ghr:github_runner_id
during registrartion in the scale-up lambdaghr:github_runner_id
is added in in the start script.ghr:github_runner_id
.fix: #4391
Functionality Enhancements:
Tag Management:
untag
function to remove tags from EC2 instances. (lambdas/functions/control-plane/src/aws/runners.ts
, lambdas/functions/control-plane/src/aws/runners.tsR117-R122)lambdas/functions/control-plane/src/scale-runners/scale-down.ts
, lambdas/functions/control-plane/src/scale-runners/scale-down.tsL197-R261)JIT Runner Support:
lambdas/functions/control-plane/src/scale-runners/scale-down.ts
, lambdas/functions/control-plane/src/scale-runners/scale-down.tsL197-R261)runnerId
to runner information for better identification and handling of JIT runners. (lambdas/functions/control-plane/src/aws/runners.ts
, lambdas/functions/control-plane/src/aws/runners.tsR95)Testing Improvements:
Expanded Test Coverage:
untag
functionality. (lambdas/functions/control-plane/src/aws/runners.test.ts
, lambdas/functions/control-plane/src/aws/runners.test.tsL232-R295)lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts
, lambdas/functions/control-plane/src/scale-runners/scale-down.test.tsR354-R405)Mock Updates:
untag
function for testing purposes. (lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts
, lambdas/functions/control-plane/src/scale-runners/scale-down.test.tsR36)These changes improve the scalability and robustness of the control-plane functions, ensuring better handling of dynamic runner configurations and edge cases.