Description
When the scale-up Lambda encounters a non-`ScaleError` exception (such as a transient HTTP 404 from the GitHub App installation token endpoint), it logs the error but returns an empty `batchItemFailures` response. This tells SQS that all messages in the batch were processed successfully, causing them to be permanently deleted from the queue. The corresponding `workflow_job` events are silently lost: no EC2 instances are launched, no JIT configs are created, and no retry messages are published. The affected jobs remain queued in GitHub until they time out (24 hours).
Root Cause
In `lambda.ts`, the `scaleUpHandler` catch block distinguishes between `ScaleError` (EC2 fleet creation failures) and all other exceptions:

```typescript
} catch (e) {
  if (e instanceof ScaleError) {
    // ✅ Returns batch failures → SQS retries these messages
    batchItemFailures.push(...e.toBatchItemFailures(sqsMessages));
    logger.warn(`${e.detailedMessage} A retry will be attempted via SQS.`, { error: e });
  } else {
    // ❌ Logs error but returns EMPTY batchItemFailures → SQS deletes all messages
    logger.error(
      `Error processing batch (size: ${sqsMessages.length}): ${(e as Error).message}, ignoring batch`,
      { error: e },
    );
  }
  return { batchItemFailures };
}
```

When a non-`ScaleError` exception occurs (e.g., `HttpError` from Octokit), `batchItemFailures` remains empty, and SQS treats all messages as successfully processed. The phrase "ignoring batch" in the log is misleading: the messages aren't just ignored temporarily, they're permanently lost.
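The behavior comes down to SQS's partial-batch-response contract. A minimal stand-alone sketch (the `SQSRecord`/`BatchResponse` interfaces and the `ackAll`/`retryAll` helpers here are simplified stand-ins for the AWS types, not the module's code):

```typescript
// Simplified stand-ins for the AWS Lambda SQS event types.
interface SQSRecord { messageId: string }
interface BatchResponse { batchItemFailures: { itemIdentifier: string }[] }

// With ReportBatchItemFailures enabled, SQS deletes every message whose id is
// NOT listed in batchItemFailures — so an empty list acknowledges the whole batch.
function ackAll(): BatchResponse {
  return { batchItemFailures: [] }; // all messages deleted, even if processing threw
}

// Listing every messageId makes SQS redeliver the whole batch.
function retryAll(records: SQSRecord[]): BatchResponse {
  return { batchItemFailures: records.map((r) => ({ itemIdentifier: r.messageId })) };
}
```

The `ScaleError` branch behaves like `retryAll`; the generic branch behaves like `ackAll`, which is the bug.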
There are two failure modes depending on when the error occurs:
- Before EC2 instances are created (e.g., during `createGithubInstallationAuth()`): no orphaned instances, but the webhook events for those jobs are gone and the jobs will never get a runner.
- After some instances are created but before the JIT config is written (e.g., during `createStartRunnerConfig()`): instances boot with no JIT config in SSM. The `start-runner.sh` script polls SSM for some time, then gives up and self-terminates.
Impact
- Jobs permanently lost: affected `workflow_job` events are deleted from SQS with no retry. Jobs remain `queued` in GitHub for up to 24 hours before timing out.
- No observability: the error log says "ignoring batch", but there is no metric, alarm, or dead-letter queue to catch this. The `job_retry` mechanism is never triggered because the retry message is only published after a successful scale-up.
- Wasted EC2 instances: if the failure occurs after instance creation, those instances boot with no SSM config and eventually self-terminate.
How to Trigger
Any non-`ScaleError` exception during `scaleUp()` will trigger this. The most common cause we've observed is a transient HTTP 404 from the GitHub App installation token endpoint under concurrent load (see related issue: JWT collisions).
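The loss can be demonstrated without AWS by modeling the buggy catch block. A hedged sketch — `buggyHandler` and the stand-in `ScaleError` class are hypothetical names for illustration, not the module's code:

```typescript
class ScaleError extends Error {} // stand-in for the module's ScaleError

// Mimics the buggy catch block: only ScaleError populates batchItemFailures.
async function buggyHandler(
  messageIds: string[],
  scaleUp: () => Promise<void>,
): Promise<{ batchItemFailures: { itemIdentifier: string }[] }> {
  const batchItemFailures: { itemIdentifier: string }[] = [];
  try {
    await scaleUp();
  } catch (e) {
    if (e instanceof ScaleError) {
      batchItemFailures.push(...messageIds.map((id) => ({ itemIdentifier: id })));
    }
    // Any other error (e.g. a transient 404 from the token endpoint) falls
    // through with an empty list, so SQS deletes the whole batch.
  }
  return { batchItemFailures };
}
```

Calling it with a `scaleUp` that throws a plain `Error` yields an empty `batchItemFailures`: the messages are acknowledged despite the failure.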
Environment
- Module version: `~> 7.3`
- GitHub: Enterprise Cloud with Data Residency (`ghe.com`)
- `enable_jit_config = true`, `enable_ephemeral_runners = true`
Suggested Fixes
Fix 1: Return all messages as batch failures on unhandled errors
The generic catch block should return all messages as `batchItemFailures` instead of acknowledging them:

```typescript
} catch (e) {
  if (e instanceof ScaleError) {
    batchItemFailures.push(...e.toBatchItemFailures(sqsMessages));
    logger.warn(`${e.detailedMessage} A retry will be attempted via SQS.`, { error: e });
  } else {
    logger.error(
      `Error processing batch (size: ${sqsMessages.length}): ${(e as Error).message}, returning batch for retry`,
      { error: e },
    );
    batchItemFailures.push(...sqsMessages.map(({ messageId }) => ({ itemIdentifier: messageId })));
  }
  return { batchItemFailures };
}
```

The SQS redrive policy (`maxReceiveCount`) will eventually move persistently failing messages to the DLQ, preventing infinite retries.
Fix 2: Enable Octokit rate limit retry
The `onRateLimit` and `onSecondaryRateLimit` handlers in `createAuth()` log the rate limit but don't return `true`, so Octokit never retries:

```typescript
onRateLimit: (retryAfter, options, octokit, retryCount) => {
  logger.warn(`Rate limit hit, retrying after ${retryAfter} seconds`);
  return true; // ← was missing
},
onSecondaryRateLimit: (retryAfter, options, octokit, retryCount) => {
  logger.warn(`Secondary rate limit hit, retrying after ${retryAfter} seconds`);
  return true; // ← was missing
},
```

Our Workaround
We have patched the generic catch to return all messages as `batchItemFailures` and enabled Octokit rate limit retry by returning `true` from the handlers.
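A variant worth considering: instead of returning `true` unconditionally, the throttle handlers can cap retries. A hedged sketch of the handler shape expected by `@octokit/plugin-throttling` (`MAX_RETRIES` is our choice, not part of the module; unused arguments are underscored):

```typescript
const MAX_RETRIES = 3; // assumption: bound retries rather than retrying forever

// Handler shape used by @octokit/plugin-throttling: returning true retries the
// request after retryAfter seconds; returning false gives up.
const throttleHandlers = {
  onRateLimit: (retryAfter: number, _options: unknown, _octokit: unknown, retryCount: number): boolean => {
    console.warn(`Rate limit hit, retrying after ${retryAfter} seconds`);
    return retryCount < MAX_RETRIES;
  },
  onSecondaryRateLimit: (retryAfter: number, _options: unknown, _octokit: unknown, retryCount: number): boolean => {
    console.warn(`Secondary rate limit hit, retrying after ${retryAfter} seconds`);
    return retryCount < MAX_RETRIES;
  },
};
```

This keeps the Lambda from spinning on a hard rate limit while still absorbing transient ones.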