isJobQueued check races with SQS visibility timeout and label-based runner assignment, silently dropping jobs #5026

@vegardx

Description


When enable_job_queued_check = true (the default) and scale_up_reserved_concurrent_executions is lower than the number of queued messages per runner type, SQS delivers the messages in multiple batches separated by the visibility timeout (90 seconds in our configuration). Runners created by the first batch register with GitHub and are assigned jobs belonging to the second batch's messages via label matching. When the second batch is processed, isJobQueued returns false for those jobs and the Lambda silently acknowledges the SQS messages: no runner is created, and the still-queued jobs from batch 1's messages are left without runners.

Root Cause

How the race occurs

Consider a queue with 10 messages and concurrency=5:

  1. T+0s (Batch 1): Lambda event source mapping polls SQS, delivers 5 messages to 5 concurrent Lambda invocations. Each checks isJobQueued → all 10 jobs are still queued → creates 5 EC2 instances with JIT configs.

  2. T+60-90s: The 5 EC2 instances boot, install the runner agent, and register with GitHub. GitHub assigns queued jobs to them via label matching: generateRunnerJitconfigForOrg doesn't bind a runner to a specific job ID; it creates a runner with labels, and GitHub assigns any matching queued job when the runner connects. GitHub may assign ANY 5 of the 10 queued jobs, not necessarily the ones from batch 1's messages.

  3. T+90s (Batch 2): The remaining 5 SQS messages become visible (visibility timeout expires). Lambda processes them, calls isJobQueued for each job ID. But some of these jobs were already assigned to runners from batch 1 and are now in_progress. isJobQueued returns false → the Lambda logs "No runner will be created, job is not queued" and acknowledges the SQS message → the job that IS still queued (from batch 1's message) gets no runner.

Why the check is fundamentally racy

The isJobQueued check assumes a 1:1 mapping between SQS messages and job assignments, but generateRunnerJitconfigForOrg creates runners with labels, and GitHub assigns jobs to runners based on label matching, not based on which SQS message triggered the runner. This means:

  • Runner from message A can be assigned job B
  • When message B is later processed, job B is in_progress → skipped
  • Job A (which was supposed to be handled by message A's runner) may still be queued but has no message left to trigger a runner
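The mismatch above can be sketched as a small simulation (hypothetical types, not the module's actual code): runners pick jobs by label match, but each SQS message re-checks only its own specific job ID.

```typescript
type Job = { id: number; status: "queued" | "in_progress" };

// 10 queued jobs, one SQS message per job.
const jobs: Job[] = Array.from({ length: 10 }, (_, i): Job => ({ id: i, status: "queued" }));

// Batch 1: messages 0-4 pass isJobQueued and create 5 runners.
const batch1Runners = 5;

// GitHub assigns ANY queued jobs to the new runners by label match.
// Worst case: the runners pick up jobs 9..5 (the jobs of batch 2's messages).
for (let i = 0; i < batch1Runners; i++) {
  jobs[9 - i].status = "in_progress";
}

// Batch 2 (after the visibility timeout): messages 5-9 each check their own job ID.
const dropped = [5, 6, 7, 8, 9].filter((id) => jobs[id].status !== "queued");
console.log(`dropped messages: ${dropped.length} of 5`);
// Meanwhile jobs 0-4 are still queued, but their messages were already
// acknowledged in batch 1, so nothing is left to create runners for them.
```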

CloudWatch evidence

For a single runner type (a64-g-l-v4-s) with 10 messages in the queue and concurrency=5:

# Batch 1: 5 invocations, 5 instances created
20:22:54 Received events → Created i-0ca32c28beb745876
20:22:54 Received events → Created i-0ee3045b3bd7cf8a7
20:22:57 Received events → Created i-02299ee1b72ee12fe
20:22:58 Received events → Created i-0a834c17ed90833c7
20:22:58 Received events → Created i-003ee7bbe1c07a3d2

# 90-second gap (SQS visibility timeout)

# Batch 2: 5 invocations, only 1 instance created, 4 jobs dropped
20:24:27 Received events → No runner will be created, job is not queued.
20:24:27 Received events → No runner will be created, job is not queued.
20:24:28 Received events → Current runners: 0, launching 1 runner
20:24:28 Received events → No runner will be created, job is not queued.
20:24:29 Received events → No runner will be created, job is not queued.

Across all 12 runner types, this pattern produced exactly 29 dropped jobs out of 120 total (matching the observed 91/120 success rate).

SQS queue configuration confirming the gap

VisibilityTimeout: 90
DelaySeconds: 5
BatchSize: 1

Impact

  • Jobs silently dropped: The "not queued" log is at INFO level with no metric or alarm. The job is simply not fulfilled.
  • Scales with burst size: The more messages exceed the concurrency limit, the more jobs are lost.
  • Not recoverable: The SQS message is acknowledged, so there's no retry. The job_retry mechanism doesn't help because it only fires after a successful scale-up.

Environment

  • Module version: ~> 7.3
  • GitHub: Enterprise Cloud with Data Residency (ghe.com)
  • enable_jit_config = true, enable_ephemeral_runners = true
  • enable_job_queued_check = true (default)
  • batch_size = 1
  • SQS visibility timeout: 90s
  • Multi-runner module with 12+ runner type configurations

Suggested Fixes

Option A: Disable isJobQueued check

Set enable_job_queued_check = false. Accept the small risk of creating a runner for an already-handled job. The runner will self-terminate when no job is available. This is the simplest and most reliable fix.

Option B: Move the check after runner creation

Check job status after creating the EC2 instance but before writing JIT config. If the job is no longer queued, terminate the instance immediately. This avoids the race because the instance hasn't registered yet.
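A minimal sketch of this ordering, with hypothetical stubs (createInstance, terminate, and writeJitConfig are illustrative names, not the module's actual functions):

```typescript
async function scaleUp(
  jobStillQueued: () => Promise<boolean>,
  deps: {
    createInstance: () => Promise<string>;
    terminate: (id: string) => Promise<void>;
    writeJitConfig: (id: string) => Promise<void>;
  }
): Promise<boolean> {
  const instanceId = await deps.createInstance();
  // The instance exists but has no JIT config yet, so it cannot register
  // with GitHub and cannot be assigned a job: the re-check is race-free.
  if (!(await jobStillQueued())) {
    await deps.terminate(instanceId);
    return false;
  }
  await deps.writeJitConfig(instanceId);
  return true;
}
```

The trade-off is paying for a short-lived instance when the job was already handled, in exchange for never skipping a runner that is still needed.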

Option C: Check for ANY queued jobs, not a specific job

Instead of checking whether the specific job from this message is queued, check whether there are ANY queued jobs matching this runner type's labels. If yes, proceed with runner creation. This makes the check resilient to the label-matching race.
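A hedged sketch of what that check could look like (anyQueuedJobMatching is an illustrative name; the real module checks a single job ID):

```typescript
type QueuedJob = { id: number; labels: string[] };

// A runner can pick up a job when the job's labels are a subset of the
// runner's labels, so any such queued job justifies creating a runner.
function anyQueuedJobMatching(runnerLabels: string[], queuedJobs: QueuedJob[]): boolean {
  return queuedJobs.some((job) => job.labels.every((l) => runnerLabels.includes(l)));
}

// Even if the job from THIS message is already in_progress, another
// queued job with matching labels still justifies a runner.
const queued: QueuedJob[] = [{ id: 3, labels: ["self-hosted", "a64-g-l-v4-s"] }];
const shouldCreate = anyQueuedJobMatching(["self-hosted", "a64-g-l-v4-s", "linux"], queued);
```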

Additionally: Increase concurrency

Ensure scale_up_reserved_concurrent_executions is high enough that all messages per queue are processed in a single SQS polling batch. This prevents the 90-second gap entirely. However, this alone doesn't fix the fundamental race — it just makes it less likely.

Our Workaround

We have applied:

  1. enable_job_queued_check = false
  2. scale_up_reserved_concurrent_executions = 30
  3. isJobQueued wrapped in a try/catch, so an API error is treated as the job still being queued rather than silently dropping it
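Workaround 3 amounts to failing open. A sketch of the wrapper (hypothetical name isJobQueuedSafe, not the module's code):

```typescript
async function isJobQueuedSafe(check: () => Promise<boolean>): Promise<boolean> {
  try {
    return await check();
  } catch (e) {
    // Fail open: a transient GitHub API error should create a (possibly
    // redundant) runner rather than silently drop the job.
    console.warn(`isJobQueued failed, assuming job is still queued: ${e}`);
    return true;
  }
}
```

A redundant runner is cheap (it self-terminates when no job is available); a dropped job is not.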
