Skip to content

feat: Add last chance check before orphan termination #4595

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 46 commits into from
Jul 21, 2025

Conversation

stuartp44
Copy link
Contributor

@stuartp44 stuartp44 commented May 22, 2025

This pull request introduces enhancements to the AWS Lambda functions for scaling GitHub Actions runners. Key changes include the addition of functionality to manage runner tags (tagging and untagging).

  • For JIT-instances the GitHub runner id is is added in the tag ghr:github_runner_id during registrartion in the scale-up lambda
  • For non-JIT-instances the tag ghr:github_runner_id is added in in the start script.
  • The IAM permission for the runner instances now allows to create only for the instance self to create the tag ghr:github_runner_id.
  • The scale down function is using the runner id to efficient check the status of a runner.
  • Orphan tags are removed when a runner seems to be active (again).

fix: #4391

Functionality Enhancements:

Testing Improvements:

These changes improve the scalability and robustness of the control-plane functions, ensuring better handling of dynamic runner configurations and edge cases.

@stuartp44 stuartp44 requested a review from a team as a code owner May 22, 2025 08:25
@stuartp44 stuartp44 requested a review from Copilot May 22, 2025 08:29
Copilot

This comment was marked as outdated.

@stuartp44 stuartp44 changed the title Add last chance check before termination for JIT instances feat: Add last chance check before termination for JIT instances May 22, 2025
@stuartp44 stuartp44 changed the title feat: Add last chance check before termination for JIT instances feat: Add last chance check before orphan termination for JIT instances May 22, 2025
@npalm
Copy link
Member

npalm commented May 22, 2025

@stuartp44 in case the PR is in WIP, can you mark the PR as draft?

@stuartp44
Copy link
Contributor Author

@stuartp44 in case the PR is in WIP, can you mark the PR as draft?

The PR is actually ready to be review, the build failures where unrelated to my changes but I have fixed them nevertheless. @npalm

@npalm npalm self-requested a review May 23, 2025 13:56
Copy link
Member

@npalm npalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the PR. I only was able to a partial review so far. I have checked the Lambda code (excluding th tests). Also not tested a deployment.

This solution is solving the problem only for JIT enabled runners, but for non JIT the problem remains. Assuming the chances is less since typically less runners will be created. Do you have thoughts about alternatives? I think we should find a way / place to document this limitation.

I have made some comments, but need more time to go over the PR. Will ping you once ready, but laready sharing the feedback so far.

@stuartp44 stuartp44 marked this pull request as draft June 3, 2025 15:49
@npalm
Copy link
Member

npalm commented Jul 9, 2025

@npalm this is now ready for re-review. I have tested it against a test deployment and it works fine.

Can you share a scenario how to test this as well?

@stuartp44
Copy link
Contributor Author

stuartp44 commented Jul 9, 2025

My test was to allow the scale-up lambda create a runner and then tag it with ghr:orphan and then watch the next run of the scale down lambda. It will test the current state of the runner against the GH API and only if its offline but busy will it terminate the runner (this can be done by terminating the GH action runner process in the system under test, which I didn't do!) else it will remove the tag. For everything else, it should follow the previous process. My only concern is if the runner is offline and idle, I am not sure if me putting anymore logic in encroaches on other areas the code should deal with.

@npalm npalm requested a review from Copilot July 9, 2025 19:21
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances control-plane Lambdas by adding EC2 tag removal, JIT runner support, and corresponding test coverage.

  • Extend IAM policy to permit tag deletions
  • Implement untag and last-chance orphan checks in scale-down
  • Add tagging of runner IDs and test updates for JIT scenarios

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
modules/runners/policies/lambda-scale-down.json Allow ec2:DeleteTags and tag:UntagResources in IAM
lambdas/functions/control-plane/src/scale-runners/scale-up.ts Add runner-ID tagging in JIT flow
lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts Mock and test new tag import
lambdas/functions/control-plane/src/scale-runners/scale-down.ts Introduce untag, last-chance orphan logic
lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts Cover JIT untag/terminate scenarios
lambdas/functions/control-plane/src/aws/runners.ts Export new untag function and capture runnerId tag
lambdas/functions/control-plane/src/aws/runners.test.ts Test listing of JIT instances and untag
Comments suppressed due to low confidence (1)

lambdas/functions/control-plane/src/aws/runners.test.ts:280

  • [nitpick] The test description 'removing extra tag' is ambiguous; rename it to something like 'should remove orphan tag from runner' for clarity.
  it('removing extra tag', async () => {

@avni-ef
Copy link

avni-ef commented Jul 10, 2025

@stuartp44 thx for the work. did a partial review. Made several comments. I think it would be better to keep the main control flow in a single fucntion, especiaaly the termination calls.

Besides this the PR only solve the problem for JIT enabled runners, not targetting general runners. This also required to ensure the limitation get document.

I still need to check a deployment and walk through the test code. do you know any way to reproduce the problem?

Will non-JIT runners also be targeted as part of this PR?

@stuartp44
Copy link
Contributor Author

@avni-ef, no, as far as I am aware the JIT instances generate a runnerid (because they are pre registered) and add that via the scale-up lambda to the EC2 instance. We then look for this in the scale-down lambda and if that is present in the tags, we do the last chance check. Without that, its the current way of dealing with orphans which is tag and on the next run, terminate.

@npalm
Copy link
Member

npalm commented Jul 10, 2025

@avni-ef, no, as far as I am aware the JIT instances generate a runnerid (because they are pre registered) and add that via the scale-up lambda to the EC2 instance. We then look for this in the scale-down lambda and if that is present in the tags, we do the last chance check. Without that, its the current way of dealing with orphans which is tag and on the next run, terminate.

Could we not get the runner-id via the start script for non JIT, once the runner is being registered? And store this in a tag?

@avni-ef
Copy link

avni-ef commented Jul 10, 2025

Thank you @stuartp44 @npalm for the quick response!
I'd appreciate if you could confirm my assumption:
When the runners (JIT/non-JIT) are removed from AWS (spots terminated), they'll still be considered as orphan runners on Github runners pool? As in, the scale-up Lambda won't try to remove them from the Github runners pool as they no longer exist on AWS side.
(Just trying to identify the root cause of the issue we're experiencing with jobs being canceled)

@npalm
Copy link
Member

npalm commented Jul 14, 2025

Thank you @stuartp44 @npalm for the quick response! I'd appreciate if you could confirm my assumption: When the runners (JIT/non-JIT) are removed from AWS (spots terminated), they'll still be considered as orphan runners on Github runners pool? As in, the scale-up Lambda won't try to remove them from the Github runners pool as they no longer exist on AWS side. (Just trying to identify the root cause of the issue we're experiencing with jobs being canceled)

With Orphan runners in this module we mean running aws instances not registered in GitHub. The scale-up acts based on running ec2 instances and try to remove them (scale-down) and clean orphan runners (could be caused by registration failures). The scale-down lambda is not cleaning registeried runners. In normal flow runners will be first removed from GitHub (de-register) and next terminated. For ephemeral runners the de-registration shoudl happen automacitally. Having offline runners in GitHub should not cause any side effects.

@avni-ef
Copy link

avni-ef commented Jul 14, 2025

Thank you @stuartp44 @npalm for the quick response! I'd appreciate if you could confirm my assumption: When the runners (JIT/non-JIT) are removed from AWS (spots terminated), they'll still be considered as orphan runners on Github runners pool? As in, the scale-up Lambda won't try to remove them from the Github runners pool as they no longer exist on AWS side. (Just trying to identify the root cause of the issue we're experiencing with jobs being canceled)

With Orphan runners in this module we mean running aws instances not registered in GitHub. The scale-up acts based on running ec2 instances and try to remove them (scale-down) and clean orphan runners (could be caused by registration failures). The scale-down lambda is not cleaning registeried runners. In normal flow runners will be first removed from GitHub (de-register) and next terminated. For ephemeral runners the de-registration shoudl happen automacitally. Having offline runners in GitHub should not cause any side effects.

Thanks for clarifying @npalm, I now understand that I'm looking at a different problem, I'll reply on this new issue

@stuartp44
Copy link
Contributor Author

Just a small nudge to get a review on this please

Copy link
Member

@npalm npalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stuartp44 looks good in general. I made some minor comments. I have also create a PR on top of this one #4667. This will also fix the same problem for non JIT runners. Please can you check the change and merge it in the PR. I have tested the PR to check the runner id is added for non JIT (linux /windows).

I also struggeld with testing the setup. I only checked the normal flows remains working. Which all look good.

stuartp44 and others added 5 commits July 21, 2025 17:48
⚠️ This PR is on top of #4539

This pull request is an enhancement on top of #4539. The PR also
introduces tagging via the start scripts for non-JIT configured
instances. And it adds a state diagram to the ever growing complexity of
the state diagram below terminating instances.

### Instance Tagging Enhancements:
* Add function to tag runners with runner id for non JIT instances.
Functions are designed to ignore errors to avoid causing a crash of the
runner registration process.
* side effect is that the instance is allowed to create the tag
`ghr:github_runner_id` the instance is allowed to create the tag only on
itself.

### Added docs
* State diagram for scale-down added

<img width="2084" height="3840" alt="Untitled diagram _ Mermaid
Chart-2025-07-20-140732"
src="https://github.com/user-attachments/assets/e88af647-98e5-46cb-8da4-28110c608d8d"
/>

---------

Co-authored-by: github-aws-runners-pr|bot <github-aws-runners-pr[bot]@users.noreply.github.com>
@npalm npalm requested a review from a team as a code owner July 21, 2025 19:08
@npalm npalm changed the title feat: Add last chance check before orphan termination for JIT instances feat: Add last chance check before orphan termination Jul 21, 2025
@npalm npalm merged commit 7cbd9e0 into main Jul 21, 2025
44 checks passed
@npalm npalm deleted the stu/add_tag_plus_check branch July 21, 2025 19:52
@npalm
Copy link
Member

npalm commented Jul 21, 2025

@stuartp44 thx for all the work!

npalm pushed a commit that referenced this pull request Jul 22, 2025
🤖 I have created a release *beep* *boop*
---


##
[6.7.0](v6.6.2...v6.7.0)
(2025-07-22)


### Features

* Add last chance check before orphan termination
([#4595](#4595))
([7cbd9e0](7cbd9e0))


### Bug Fixes

*
[#4391](#4391)
([7cbd9e0](7cbd9e0))
* **lambda:** bump form-data from 2.5.1 to 2.5.5 in /lambdas
([#4672](#4672))
([0ec1c9a](0ec1c9a))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: runners-releaser[bot] <194412594+runners-releaser[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pagination Data Slippage Issue causing EC2 instance to be scaled down EC2 instance runners marked as orphan and deleted even when the job is running.
4 participants