Scale-up Lambda silently drops entire SQS batch on non-ScaleError exceptions #5024

@vegardx

Description

When the scale-up Lambda encounters a non-ScaleError exception (such as a transient HTTP 404 from the GitHub App installation token endpoint), it logs the error but returns an empty batchItemFailures response. This tells SQS that all messages in the batch were processed successfully, causing them to be permanently deleted from the queue. The corresponding workflow_job events are silently lost — no EC2 instances are launched, no JIT configs are created, and no retry messages are published. The affected jobs remain queued in GitHub until they time out (24 hours).

Root Cause

In lambda.ts, the scaleUpHandler catch block distinguishes between ScaleError (EC2 fleet creation failures) and all other exceptions:

```typescript
} catch (e) {
    if (e instanceof ScaleError) {
      // ✅ Returns batch failures → SQS retries these messages
      batchItemFailures.push(...e.toBatchItemFailures(sqsMessages));
      logger.warn(`${e.detailedMessage} A retry will be attempted via SQS.`, { error: e });
    } else {
      // ❌ Logs error but returns EMPTY batchItemFailures → SQS deletes all messages
      logger.error(
        `Error processing batch (size: ${sqsMessages.length}): ${(e as Error).message}, ignoring batch`,
        { error: e },
      );
    }
    return { batchItemFailures };
}
```

When a non-ScaleError exception occurs (e.g., HttpError from Octokit), batchItemFailures remains empty, and SQS treats all messages as successfully processed. The phrase "ignoring batch" in the log is misleading — the messages aren't just ignored temporarily, they're permanently lost.
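The partial-batch-response semantics can be shown with a small self-contained sketch (hypothetical types and helper, not the module's code): with `ReportBatchItemFailures` enabled, SQS deletes every message whose `messageId` is absent from `batchItemFailures`.

```typescript
// Hypothetical helper illustrating SQS partial-batch-response semantics.
interface SqsMessage { messageId: string }
interface BatchResponse { batchItemFailures: { itemIdentifier: string }[] }

// Messages SQS will delete: everything NOT listed as a failure.
function messagesDeletedBySqs(batch: SqsMessage[], response: BatchResponse): string[] {
  const failed = new Set(response.batchItemFailures.map((f) => f.itemIdentifier));
  return batch.filter((m) => !failed.has(m.messageId)).map((m) => m.messageId);
}

const batch: SqsMessage[] = [{ messageId: 'job-1' }, { messageId: 'job-2' }];
// The buggy path: an empty failure list means the whole batch is deleted.
const deleted = messagesDeletedBySqs(batch, { batchItemFailures: [] });
// deleted === ['job-1', 'job-2'] — both workflow_job events are gone
```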

There are two failure modes depending on when the error occurs:

  • Before EC2 instances are created (e.g., during createGithubInstallationAuth()): No orphaned instances, but the webhook events for those jobs are gone and the jobs will never get a runner.
  • After some instances are created but before the JIT config is written (e.g., during createStartRunnerConfig()): Instances boot with no JIT config in SSM. The start-runner.sh script polls SSM for some time, then gives up and the instance self-terminates.

Impact

  • Jobs permanently lost: Affected workflow_job events are deleted from SQS with no retry. Jobs remain queued in GitHub for up to 24 hours before timing out.
  • No observability: The error log says "ignoring batch" but there's no metric, alarm, or dead-letter queue to catch this. The job_retry mechanism is never triggered because the retry message is only published after a successful scale-up.
  • Wasted EC2 instances: If the failure occurs after instance creation, those instances boot with no SSM config and eventually self-terminate.

How to Trigger

Any non-ScaleError exception during scaleUp() will trigger this. The most common cause we've observed is a transient HTTP 404 from the GitHub App installation token endpoint under concurrent load (see related issue: JWT collisions).

Environment

  • Module version: ~> 7.3
  • GitHub: Enterprise Cloud with Data Residency (ghe.com)
  • enable_jit_config = true, enable_ephemeral_runners = true

Suggested Fixes

Fix 1: Return all messages as batch failures on unhandled errors

The generic catch block should return all messages as batchItemFailures instead of acknowledging them:

```typescript
} catch (e) {
    if (e instanceof ScaleError) {
      batchItemFailures.push(...e.toBatchItemFailures(sqsMessages));
      logger.warn(`${e.detailedMessage} A retry will be attempted via SQS.`, { error: e });
    } else {
      logger.error(
        `Error processing batch (size: ${sqsMessages.length}): ${(e as Error).message}, returning batch for retry`,
        { error: e },
      );
      batchItemFailures.push(
        ...sqsMessages.map(({ messageId }) => ({ itemIdentifier: messageId })),
      );
    }
    return { batchItemFailures };
}
```

The SQS redrive policy / maxReceiveCount will eventually move persistently failing messages to the DLQ, preventing infinite retries.
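To sanity-check the "no infinite retries" claim, here is a toy model of the redrive policy (illustrative only, not an AWS API): a message that fails on every receive is moved to the DLQ once its receive count reaches maxReceiveCount.

```typescript
// Toy model of SQS redrive: each failed processing attempt bumps the receive
// count; once it reaches maxReceiveCount, SQS moves the message to the DLQ.
interface MessageState { receiveCount: number; inDlq: boolean }

function recordFailedReceive(s: MessageState, maxReceiveCount: number): MessageState {
  const receiveCount = s.receiveCount + 1;
  return { receiveCount, inDlq: receiveCount >= maxReceiveCount };
}

let state: MessageState = { receiveCount: 0, inDlq: false };
// A persistently failing message: retried on every receive until the DLQ.
while (!state.inDlq) state = recordFailedReceive(state, 5);
// state.receiveCount === 5 — bounded retries, then the DLQ catches it
```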

Fix 2: Enable Octokit rate limit retry

The onRateLimit and onSecondaryRateLimit handlers in createAuth() log the rate limit but don't return true, so Octokit never retries:

```typescript
onRateLimit: (retryAfter, options, octokit, retryCount) => {
  logger.warn(`Rate limit hit, retrying after ${retryAfter} seconds`);
  return true; // ← was missing
},
onSecondaryRateLimit: (retryAfter, options, octokit, retryCount) => {
  logger.warn(`Secondary rate limit hit, retrying after ${retryAfter} seconds`);
  return true; // ← was missing
},
```

Our Workaround

We have patched the generic catch to return all messages as batchItemFailures and enabled Octokit rate limit retry by returning true from the handlers.
