isJobQueued check races with SQS visibility timeout and label-based runner assignment, silently dropping jobs #5026

@vegardx

Description


When enable_job_queued_check = true (the default) and scale_up_reserved_concurrent_executions is lower than the number of queued messages per runner type, SQS delivers the messages in multiple batches separated by the visibility timeout (90 seconds in our configuration). Runners created by the first batch register with GitHub and are assigned jobs belonging to the second batch's messages via label matching. When the second batch is processed, isJobQueued returns false for those jobs and the Lambda silently acknowledges the SQS messages: no runner is created, and the still-queued jobs from batch 1's messages are left without runners.

Root Cause

How the race occurs

Consider a queue with 10 messages and concurrency=5:

  1. T+0s (Batch 1): Lambda event source mapping polls SQS, delivers 5 messages to 5 concurrent Lambda invocations. Each checks isJobQueued → all 10 jobs are still queued → creates 5 EC2 instances with JIT configs.

  2. T+60-90s: The 5 EC2 instances boot, install the runner agent, and register with GitHub. GitHub assigns queued jobs to them via label matching: generateRunnerJitconfigForOrg doesn't bind a runner to a specific job ID; it creates a runner with labels, and GitHub assigns any matching queued job when the runner connects. GitHub may assign ANY 5 of the 10 queued jobs, not necessarily the ones from batch 1's messages.

  3. T+90s (Batch 2): The remaining 5 SQS messages become visible (visibility timeout expires). Lambda processes them, calls isJobQueued for each job ID. But some of these jobs were already assigned to runners from batch 1 and are now in_progress. isJobQueued returns false → the Lambda logs "No runner will be created, job is not queued" and acknowledges the SQS message → the job that IS still queued (from batch 1's message) gets no runner.

Why the check is fundamentally racy

The isJobQueued check assumes a 1:1 mapping between SQS messages and job assignments, but generateRunnerJitconfigForOrg creates runners with labels, and GitHub assigns jobs to runners based on label matching, not based on which SQS message triggered the runner. This means:

  • Runner from message A can be assigned job B
  • When message B is later processed, job B is in_progress → skipped
  • Job A (which was supposed to be handled by message A's runner) may still be queued but has no message left to trigger a runner
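The mismatch above can be sketched as a small simulation (hypothetical types, not the module's actual code): runners pick jobs by label match, but each SQS message re-checks only its own specific job ID.

```typescript
type Job = { id: number; status: "queued" | "in_progress" };

// 10 queued jobs, one SQS message per job.
const jobs: Job[] = Array.from({ length: 10 }, (_, i): Job => ({ id: i, status: "queued" }));

// Batch 1: messages 0-4 pass isJobQueued and create 5 runners.
const batch1Runners = 5;

// GitHub assigns ANY queued jobs to the new runners by label match.
// Worst case: the runners pick up jobs 9..5 (the jobs of batch 2's messages).
for (let i = 0; i < batch1Runners; i++) {
  jobs[9 - i].status = "in_progress";
}

// Batch 2 (after the visibility timeout): messages 5-9 each check their own job ID.
const dropped = [5, 6, 7, 8, 9].filter((id) => jobs[id].status !== "queued");
console.log(`dropped messages: ${dropped.length} of 5`);
// Meanwhile jobs 0-4 are still queued, but their messages were already
// acknowledged in batch 1, so nothing is left to create runners for them.
```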

CloudWatch evidence

For a single runner type (a64-g-l-v4-s) with 10 messages in the queue and concurrency=5:

# Batch 1: 5 invocations, 5 instances created
20:22:54 Received events → Created i-0ca32c28beb745876
20:22:54 Received events → Created i-0ee3045b3bd7cf8a7
20:22:57 Received events → Created i-02299ee1b72ee12fe
20:22:58 Received events → Created i-0a834c17ed90833c7
20:22:58 Received events → Created i-003ee7bbe1c07a3d2

# 90-second gap (SQS visibility timeout)

# Batch 2: 5 invocations, only 1 instance created, 4 jobs dropped
20:24:27 Received events → No runner will be created, job is not queued.
20:24:27 Received events → No runner will be created, job is not queued.
20:24:28 Received events → Current runners: 0, launching 1 runner
20:24:28 Received events → No runner will be created, job is not queued.
20:24:29 Received events → No runner will be created, job is not queued.

Across all 12 runner types, this pattern produced exactly 29 dropped jobs out of 120 total (matching the observed 91/120 success rate).

SQS queue configuration confirming the gap

VisibilityTimeout: 90
DelaySeconds: 5
BatchSize: 1

Impact

  • Jobs silently dropped: The "not queued" log is at INFO level with no metric or alarm. The job is simply not fulfilled.
  • Scales with burst size: The more messages exceed the concurrency limit, the more jobs are lost.
  • Not recoverable: The SQS message is acknowledged, so there's no retry. The job_retry mechanism doesn't help because it only fires after a successful scale-up.

Environment

  • Module version: ~> 7.3
  • GitHub: Enterprise Cloud with Data Residency (ghe.com)
  • enable_jit_config = true, enable_ephemeral_runners = true
  • enable_job_queued_check = true (default)
  • batch_size = 1
  • SQS visibility timeout: 90s
  • Multi-runner module with 12+ runner type configurations

Suggested Fixes

Option A: Disable isJobQueued check

Set enable_job_queued_check = false. Accept the small risk of creating a runner for an already-handled job. The runner will self-terminate when no job is available. This is the simplest and most reliable fix.

Option B: Move the check after runner creation

Check job status after creating the EC2 instance but before writing JIT config. If the job is no longer queued, terminate the instance immediately. This avoids the race because the instance hasn't registered yet.
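A minimal sketch of this ordering, with hypothetical stubs (createInstance, terminate, and writeJitConfig are illustrative names, not the module's actual functions):

```typescript
async function scaleUp(
  jobStillQueued: () => Promise<boolean>,
  deps: {
    createInstance: () => Promise<string>;
    terminate: (id: string) => Promise<void>;
    writeJitConfig: (id: string) => Promise<void>;
  }
): Promise<boolean> {
  const instanceId = await deps.createInstance();
  // The instance exists but has no JIT config yet, so it cannot register
  // with GitHub and cannot be assigned a job: the re-check is race-free.
  if (!(await jobStillQueued())) {
    await deps.terminate(instanceId);
    return false;
  }
  await deps.writeJitConfig(instanceId);
  return true;
}
```

The trade-off is paying for a short-lived instance when the job was already handled, in exchange for never skipping a runner that is still needed.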

Option C: Check for ANY queued jobs, not a specific job

Instead of checking whether the specific job from this message is queued, check whether there are ANY queued jobs matching this runner type's labels. If yes, proceed with runner creation. This makes the check resilient to the label-matching race.
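A hedged sketch of what that check could look like (anyQueuedJobMatching is an illustrative name; the real module checks a single job ID):

```typescript
type QueuedJob = { id: number; labels: string[] };

// A runner can pick up a job when the job's labels are a subset of the
// runner's labels, so any such queued job justifies creating a runner.
function anyQueuedJobMatching(runnerLabels: string[], queuedJobs: QueuedJob[]): boolean {
  return queuedJobs.some((job) => job.labels.every((l) => runnerLabels.includes(l)));
}

// Even if the job from THIS message is already in_progress, another
// queued job with matching labels still justifies a runner.
const queued: QueuedJob[] = [{ id: 3, labels: ["self-hosted", "a64-g-l-v4-s"] }];
const shouldCreate = anyQueuedJobMatching(["self-hosted", "a64-g-l-v4-s", "linux"], queued);
```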

Additionally: Increase concurrency

Ensure scale_up_reserved_concurrent_executions is high enough that all messages per queue are processed in a single SQS polling batch. This prevents the 90-second gap entirely. However, this alone doesn't fix the fundamental race — it just makes it less likely.

Our Workaround

We have applied:

  1. enable_job_queued_check = false
  2. scale_up_reserved_concurrent_executions = 30
  3. isJobQueued wrapped in a try/catch, so an API error is treated as the job still being queued rather than silently dropping it
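Workaround 3 amounts to failing open. A sketch of the wrapper (hypothetical name isJobQueuedSafe, not the module's code):

```typescript
async function isJobQueuedSafe(check: () => Promise<boolean>): Promise<boolean> {
  try {
    return await check();
  } catch (e) {
    // Fail open: a transient GitHub API error should create a (possibly
    // redundant) runner rather than silently drop the job.
    console.warn(`isJobQueued failed, assuming job is still queued: ${e}`);
    return true;
  }
}
```

A redundant runner is cheap (it self-terminates when no job is available); a dropped job is not.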
