Scale-up Lambda silently drops entire SQS batch on non-ScaleError exceptions #5024

@vegardx

Description

When the scale-up Lambda encounters a non-ScaleError exception (such as a transient HTTP 404 from the GitHub App installation token endpoint), it logs the error but returns an empty batchItemFailures response. This tells SQS that all messages in the batch were processed successfully, causing them to be permanently deleted from the queue. The corresponding workflow_job events are silently lost — no EC2 instances are launched, no JIT configs are created, and no retry messages are published. The affected jobs remain queued in GitHub until they time out (24 hours).

Root Cause

In lambda.ts, the scaleUpHandler catch block distinguishes between ScaleError (EC2 fleet creation failures) and all other exceptions:

```typescript
} catch (e) {
    if (e instanceof ScaleError) {
      // ✅ Returns batch failures → SQS retries these messages
      batchItemFailures.push(...e.toBatchItemFailures(sqsMessages));
      logger.warn(`${e.detailedMessage} A retry will be attempted via SQS.`, { error: e });
    } else {
      // ❌ Logs error but returns EMPTY batchItemFailures → SQS deletes all messages
      logger.error(
        `Error processing batch (size: ${sqsMessages.length}): ${(e as Error).message}, ignoring batch`,
        { error: e },
      );
    }
    return { batchItemFailures };
}
```

When a non-ScaleError exception occurs (e.g., HttpError from Octokit), batchItemFailures remains empty, and SQS treats all messages as successfully processed. The phrase "ignoring batch" in the log is misleading — the messages aren't just ignored temporarily, they're permanently lost.
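The partial-batch-response semantics can be shown with a small self-contained sketch (hypothetical types and helper, not the module's code): with `ReportBatchItemFailures` enabled, SQS deletes every message whose `messageId` is absent from `batchItemFailures`.

```typescript
// Hypothetical helper illustrating SQS partial-batch-response semantics.
interface SqsMessage { messageId: string }
interface BatchResponse { batchItemFailures: { itemIdentifier: string }[] }

// Messages SQS will delete: everything NOT listed as a failure.
function messagesDeletedBySqs(batch: SqsMessage[], response: BatchResponse): string[] {
  const failed = new Set(response.batchItemFailures.map((f) => f.itemIdentifier));
  return batch.filter((m) => !failed.has(m.messageId)).map((m) => m.messageId);
}

const batch: SqsMessage[] = [{ messageId: 'job-1' }, { messageId: 'job-2' }];
// The buggy path: an empty failure list means the whole batch is deleted.
const deleted = messagesDeletedBySqs(batch, { batchItemFailures: [] });
// deleted === ['job-1', 'job-2'] — both workflow_job events are gone
```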

There are two failure modes depending on when the error occurs:

  • Before EC2 instances are created (e.g., during createGithubInstallationAuth()): No orphaned instances, but the webhook events for those jobs are gone and the jobs will never get a runner.
  • After some instances are created but before the JIT config is written (e.g., during createStartRunnerConfig()): Instances boot with no JIT config in SSM. The start-runner.sh script polls SSM for some time, then gives up and the instance self-terminates.

Impact

  • Jobs permanently lost: Affected workflow_job events are deleted from SQS with no retry. Jobs remain queued in GitHub for up to 24 hours before timing out.
  • No observability: The error log says "ignoring batch" but there's no metric, alarm, or dead-letter queue to catch this. The job_retry mechanism is never triggered because the retry message is only published after a successful scale-up.
  • Wasted EC2 instances: If the failure occurs after instance creation, those instances boot with no SSM config and eventually self-terminate.

How to Trigger

Any non-ScaleError exception during scaleUp() will trigger this. The most common cause we've observed is a transient HTTP 404 from the GitHub App installation token endpoint under concurrent load (see related issue: JWT collisions).

Environment

  • Module version: ~> 7.3
  • GitHub: Enterprise Cloud with Data Residency (ghe.com)
  • enable_jit_config = true, enable_ephemeral_runners = true

Suggested Fixes

Fix 1: Return all messages as batch failures on unhandled errors

The generic catch block should return all messages as batchItemFailures instead of acknowledging them:

```typescript
} catch (e) {
    if (e instanceof ScaleError) {
      batchItemFailures.push(...e.toBatchItemFailures(sqsMessages));
      logger.warn(`${e.detailedMessage} A retry will be attempted via SQS.`, { error: e });
    } else {
      logger.error(
        `Error processing batch (size: ${sqsMessages.length}): ${(e as Error).message}, returning batch for retry`,
        { error: e },
      );
      batchItemFailures.push(
        ...sqsMessages.map(({ messageId }) => ({ itemIdentifier: messageId })),
      );
    }
    return { batchItemFailures };
}
```

The SQS redrive policy / maxReceiveCount will eventually move persistently failing messages to the DLQ, preventing infinite retries.
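To sanity-check the "no infinite retries" claim, here is a toy model of the redrive policy (illustrative only, not an AWS API): a message that fails on every receive is moved to the DLQ once its receive count reaches maxReceiveCount.

```typescript
// Toy model of SQS redrive: each failed processing attempt bumps the receive
// count; once it reaches maxReceiveCount, SQS moves the message to the DLQ.
interface MessageState { receiveCount: number; inDlq: boolean }

function recordFailedReceive(s: MessageState, maxReceiveCount: number): MessageState {
  const receiveCount = s.receiveCount + 1;
  return { receiveCount, inDlq: receiveCount >= maxReceiveCount };
}

let state: MessageState = { receiveCount: 0, inDlq: false };
// A persistently failing message: retried on every receive until the DLQ.
while (!state.inDlq) state = recordFailedReceive(state, 5);
// state.receiveCount === 5 — bounded retries, then the DLQ catches it
```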

Fix 2: Enable Octokit rate limit retry

The onRateLimit and onSecondaryRateLimit handlers in createAuth() log the rate limit but don't return true, so Octokit never retries:

```typescript
onRateLimit: (retryAfter, options, octokit, retryCount) => {
  logger.warn(`Rate limit hit, retrying after ${retryAfter} seconds`);
  return true; // ← was missing
},
onSecondaryRateLimit: (retryAfter, options, octokit, retryCount) => {
  logger.warn(`Secondary rate limit hit, retrying after ${retryAfter} seconds`);
  return true; // ← was missing
},
```

Our Workaround

We have patched the generic catch to return all messages as batchItemFailures and enabled Octokit rate limit retry by returning true from the handlers.
