Description
When `enable_job_queued_check = true` (the default) and `scale_up_reserved_concurrent_executions` is lower than the number of queued messages per runner type, SQS delivers messages in multiple batches separated by the visibility timeout (default 90 seconds). Runners created by the first batch register with GitHub and get assigned jobs from the second batch via label matching. When the second batch is processed, `isJobQueued` returns false for those jobs, and the Lambda silently acknowledges the SQS message — no runner is created, and the originally queued job (which was supposed to be handled by batch 1's runner) is left without a runner.
Root Cause
How the race occurs
Consider a queue with 10 messages and concurrency=5:
- T+0s (Batch 1): The Lambda event source mapping polls SQS and delivers 5 messages to 5 concurrent Lambda invocations. Each checks `isJobQueued` → all 10 jobs are still queued → 5 EC2 instances are created with JIT configs.
- T+60–90s: The 5 EC2 instances boot, install the runner agent, and register with GitHub. GitHub assigns queued jobs to them via label matching — `generateRunnerJitconfigForOrg` doesn't bind a runner to a specific job ID; it creates a runner with labels, and GitHub assigns any matching queued job when the runner connects. GitHub may assign ANY 5 of the 10 queued jobs, not necessarily the ones from batch 1's messages.
- T+90s (Batch 2): The remaining 5 SQS messages become visible (visibility timeout expires). The Lambda processes them, calling `isJobQueued` for each job ID. But some of these jobs were already assigned to runners from batch 1 and are now `in_progress`. `isJobQueued` returns `false` → the Lambda logs "No runner will be created, job is not queued" and acknowledges the SQS message → the job that IS still queued (from batch 1's message) gets no runner.
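The timeline above can be reproduced with a minimal simulation. This is a hypothetical model in TypeScript, not the module's actual code; it only captures the key property that runners are matched to jobs by labels, not by the SQS message that created them. In the worst case, every batch-2 message is dropped while all of batch 1's jobs remain queued:

```typescript
// Hypothetical model of the race, not the module's code. Runners are matched
// to jobs by labels, so a runner created for one SQS message can take the job
// referenced by a different, not-yet-processed message.

type JobStatus = 'queued' | 'in_progress';

function simulateRace(jobCount: number, concurrency: number) {
  const jobs: JobStatus[] = Array(jobCount).fill('queued');
  let runnersCreated = 0;
  let dropped = 0;

  const isJobQueued = (id: number) => jobs[id] === 'queued';

  // Batch 1: the first `concurrency` messages. All jobs are still queued,
  // so each invocation creates a runner.
  for (let id = 0; id < Math.min(concurrency, jobCount); id++) {
    if (isJobQueued(id)) runnersCreated++;
  }

  // Before batch 2 becomes visible, GitHub assigns the new runners to
  // arbitrary queued jobs by label match. Worst case: they take exactly the
  // jobs referenced by batch 2's messages.
  for (let id = jobCount - 1; id >= jobCount - runnersCreated; id--) {
    jobs[id] = 'in_progress';
  }

  // Batch 2: the remaining messages. Their jobs are now in_progress, so
  // isJobQueued returns false, the message is acknowledged, and no runner
  // is created for the still-queued batch-1 jobs.
  for (let id = concurrency; id < jobCount; id++) {
    if (isJobQueued(id)) runnersCreated++;
    else dropped++;
  }

  // Jobs still queued with no message left to trigger a runner for them.
  const orphaned = jobs.filter((s) => s === 'queued').length;
  return { runnersCreated, dropped, orphaned };
}
```

In practice GitHub assigns the batch-1 runners a mix of batch-1 and batch-2 jobs, so the observed drop count (4 of 5 in the logs below) sits between zero and this worst case.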
Why the check is fundamentally racy
The `isJobQueued` check assumes a 1:1 mapping between SQS messages and job assignments, but `generateRunnerJitconfigForOrg` creates runners with labels, and GitHub assigns jobs to runners based on label matching, not based on which SQS message triggered the runner. This means:
- The runner created for message A can be assigned job B
- When message B is later processed, job B is `in_progress` → skipped
- Job A (which was supposed to be handled by message A's runner) may still be queued but has no message left to trigger a runner
CloudWatch evidence
For a single runner type (`a64-g-l-v4-s`) with 10 messages in the queue and concurrency=5:

```
# Batch 1: 5 invocations, 5 instances created
20:22:54 Received events → Created i-0ca32c28beb745876
20:22:54 Received events → Created i-0ee3045b3bd7cf8a7
20:22:57 Received events → Created i-02299ee1b72ee12fe
20:22:58 Received events → Created i-0a834c17ed90833c7
20:22:58 Received events → Created i-003ee7bbe1c07a3d2

# 90-second gap (SQS visibility timeout)

# Batch 2: 5 invocations, only 1 instance created, 4 jobs dropped
20:24:27 Received events → No runner will be created, job is not queued.
20:24:27 Received events → No runner will be created, job is not queued.
20:24:28 Received events → Current runners: 0, launching 1 runner
20:24:28 Received events → No runner will be created, job is not queued.
20:24:29 Received events → No runner will be created, job is not queued.
```
Across all 12 runner types, this pattern produced exactly 29 dropped jobs out of 120 total (matching the observed 91/120 success rate).
SQS queue configuration confirming the gap
```
VisibilityTimeout: 90
DelaySeconds: 5
BatchSize: 1
```
Impact
- Jobs silently dropped: The "not queued" log is at INFO level with no metric or alarm. The job is simply not fulfilled.
- Scales with burst size: The more messages exceed the concurrency limit, the more jobs are lost.
- Not recoverable: The SQS message is acknowledged, so there's no retry. The `job_retry` mechanism doesn't help because it only fires after a successful scale-up.
Environment
- Module version: `~> 7.3`
- GitHub: Enterprise Cloud with Data Residency (`ghe.com`)
- `enable_jit_config = true`, `enable_ephemeral_runners = true`
- `enable_job_queued_check = true` (default)
- `batch_size = 1`
- SQS visibility timeout: 90s
- Multi-runner module with 12+ runner type configurations
Suggested Fixes
Option A: Disable the `isJobQueued` check
Set `enable_job_queued_check = false`. Accept the small risk of creating a runner for an already-handled job; the runner will self-terminate when no job is available. This is the simplest and most reliable fix.
Option B: Move the check after runner creation
Check job status after creating the EC2 instance but before writing JIT config. If the job is no longer queued, terminate the instance immediately. This avoids the race because the instance hasn't registered yet.
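A sketch of this ordering, with hypothetical helper signatures (`createInstance`, `terminateInstance`, `writeJitConfig`, and the dependency shape are placeholders, not the module's actual API):

```typescript
// Option B sketch: create the instance first, then re-check the job status
// before issuing the JIT config. All helper names here are hypothetical.

interface ScaleUpDeps {
  isJobQueued: (jobId: number) => Promise<boolean>;
  createInstance: () => Promise<string>; // returns an EC2 instance id
  terminateInstance: (instanceId: string) => Promise<void>;
  writeJitConfig: (instanceId: string) => Promise<void>;
}

async function scaleUpCheckAfterCreate(jobId: number, deps: ScaleUpDeps): Promise<boolean> {
  // The instance exists but has no JIT config yet, so it cannot have
  // registered with GitHub; the check result below cannot be invalidated
  // by this runner itself picking up a different job.
  const instanceId = await deps.createInstance();

  if (!(await deps.isJobQueued(jobId))) {
    // The job was taken by an earlier runner: clean up instead of
    // leaking an idle instance.
    await deps.terminateInstance(instanceId);
    return false;
  }

  await deps.writeJitConfig(instanceId);
  return true;
}
```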
Option C: Check for ANY queued jobs, not a specific job
Instead of checking whether the specific job from this message is queued, check whether there are ANY queued jobs matching this runner type's labels. If yes, proceed with runner creation. This makes the check resilient to the label-matching race.
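A sketch of the label-based check, assuming a hypothetical `listQueuedJobsForLabels` helper (e.g. backed by a GitHub API query for queued jobs filtered on labels; the name does not exist in this module):

```typescript
// Option C sketch: decide runner creation on "is ANY matching job queued?"
// rather than "is THIS job queued?". `listQueuedJobsForLabels` is a
// hypothetical helper, not part of the module.

async function shouldCreateRunner(
  runnerLabels: string[],
  listQueuedJobsForLabels: (labels: string[]) => Promise<number>,
): Promise<boolean> {
  // A runner created now will serve whichever queued job GitHub assigns it,
  // so any queued job with matching labels justifies creating a runner.
  const queuedCount = await listQueuedJobsForLabels(runnerLabels);
  return queuedCount > 0;
}
```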
Additionally: Increase concurrency
Ensure `scale_up_reserved_concurrent_executions` is high enough that all messages per queue are processed in a single SQS polling batch. This prevents the 90-second gap entirely. However, this alone doesn't fix the fundamental race — it just makes it less likely.
Our Workaround
We have applied:
- `enable_job_queued_check = false`
- `scale_up_reserved_concurrent_executions = 30`
- `isJobQueued` wrapped in try/catch — API errors assume the job is still queued rather than silently dropping it
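The try/catch wrapper is roughly the following (illustrative names, not a patch to the module): on an API error it fails open, preferring a possibly-idle runner over a silently dropped job.

```typescript
// Fail-open wrapper sketch: if the GitHub API check throws, treat the job
// as still queued so the message results in a runner instead of being
// silently acknowledged.

async function isJobQueuedFailOpen(
  check: () => Promise<boolean>,
  log: (msg: string) => void = console.warn,
): Promise<boolean> {
  try {
    return await check();
  } catch (err) {
    log(`isJobQueued failed, assuming job is still queued: ${err}`);
    return true; // fail open
  }
}
```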