Commit d997211

fix(batch-queue): Batch items that hit the environment queue size limit now fast-fail

1 parent bf736a7 · commit d997211

File tree

8 files changed: +483 −19 lines
Lines changed: 130 additions & 0 deletions (new file)

---
area: webapp
type: fix
---

Batch items that hit the environment queue size limit now fast-fail without
retries and without creating pre-failed TaskRuns.

## Why

When a customer fills their environment's queue (default 2.5M) and keeps
pushing batch triggers, every batch item was hitting `ServiceValidationError`
from `validateQueueLimits`, looping through 6 exponential-backoff retries
(~63s per item), and then creating a pre-failed `TaskRun` for each item on the
final attempt — bringing its attempt, trace events, and a `BatchTaskRunError`
row along for the ride.

At customer-overload scale (one tenant pushing ~1 batch/s × ~10 items each
against a paused/full queue) this:

1. filled `redis_alt` with hundreds of thousands of never-completing
   `engine:batch:*` keys, because items kept bouncing between the FairQueue
   and the in-flight hash,
2. pinned the batch worker on doomed retries instead of processing healthy
   batches, and
3. created enormous volumes of pre-failed `TaskRun` / `BatchTaskRunError`
   rows that serve no customer purpose (the items were never going to
   trigger — the customer just needs to fix their queue).

## What changed

- New `QueueSizeLimitExceededError` subclass of `ServiceValidationError`,
  thrown by `runEngine/services/triggerTask.server.ts` instead of the generic
  validation error, so callers can detect this specific overload condition.
- The `ProcessBatchItemCallback` result gains an optional `skipRetries?: boolean`
  flag. When true, the `BatchQueue` records the failure immediately regardless
  of attempt number, bypassing the FairQueue retry ladder.
- The batch process-item callback in `runEngineHandlers.server.ts` detects
  `QueueSizeLimitExceededError` and returns
  `{ success: false, errorCode: "QUEUE_SIZE_LIMIT_EXCEEDED", skipRetries: true }`
  **without** calling `triggerFailedTaskService` — no pre-failed TaskRun is
  created for these items.
- The batch completion callback collapses per-item `BatchTaskRunError` writes
  into a single aggregate row when every failure shares the same
  `QUEUE_SIZE_LIMIT_EXCEEDED` error code, bounding DB writes to O(batches)
  instead of O(items) during overload events.

Other error types (transient trigger failures, environment not found, etc.)
retain the existing retry + pre-failed-run behavior.

## Test plan

### Unit tests

New `skipRetries on failed items` suite in
`internal-packages/run-engine/src/batch-queue/tests/index.test.ts`:

```bash
cd internal-packages/run-engine
pnpm run test ./src/batch-queue/tests/index.test.ts --run
```

Covers:

- `skipRetries: true` from the callback → item called exactly once, not
  `maxAttempts` times.
- Regression guard: when `skipRetries` is not set, the retry ladder still
  fires (items called `maxAttempts` times).
- Per-item mixing within one batch: even-indexed items fast-fail, odd-indexed
  items exhaust the retry ladder — all correctly tracked.

### Manual e2e (local, against `references/hello-world`)

Done before merge — reproduces the Centralize-style overload against the
local dev stack.

Setup:

1. Add `MAXIMUM_DEPLOYED_QUEUE_SIZE=2` to the webapp's `.env.local` (or
   whatever file your local webapp reads — this caps the deployed queue at
   just 2 items).
2. `pnpm run dev --filter webapp`
3. `cd references/hello-world && pnpm exec trigger dev`
4. Temporarily add a blocking task to `references/hello-world/src/trigger/`:

```ts
import { task } from "@trigger.dev/sdk/v3";

export const sleepyTask = task({
  id: "sleepy-task",
  run: async () => {
    await new Promise((r) => setTimeout(r, 10 * 60 * 1000));
  },
});
```

5. Trigger `sleepy-task` twice individually via the dashboard or MCP so the
   queue is holding 2 items and hits the cap.

Exercise the fix:

6. Trigger a batch with 5 items of `sleepy-task` via
   `mcp__trigger__trigger_task` (or the batch API directly).

Expected observations (all must be true):

- [ ] Dashboard: the new batch transitions to `ABORTED` within a second or
      two — it does **not** sit in `PROCESSING` for a minute+.
- [ ] DB: the `BatchTaskRun` row has
      `status='ABORTED'`, `failedRunCount=5`, `successfulRunCount=0`.
- [ ] DB: **exactly one** `BatchTaskRunError` row for the batch
      (`SELECT COUNT(*) FROM "BatchTaskRunError" WHERE "batchTaskRunId"=…`),
      with the error text mentioning `"5 items in this batch failed with
      the same error"` and `errorCode='QUEUE_SIZE_LIMIT_EXCEEDED'`.
- [ ] DB: **no new `TaskRun` rows** were created for the batch items
      (compare `SELECT COUNT(*) FROM "TaskRun" WHERE "batchId"=…` before
      and after — should stay 0).
- [ ] Webapp logs: one
      `"[BatchQueue] Batch item rejected: queue size limit reached"` line per
      item at `warn` level, **no**
      `"[BatchQueue] Failed to trigger batch item"` error lines, and **no**
      `"TriggerFailedTaskService"` log lines for the batch items.
- [ ] Redis (via `redis-cli` against the local redis instance backing the
      batch queue): `engine:batch:<batchId>:*` keys are gone after the batch
      finalizes, and so are the
      `engine:batch:queue:env:<envId>:batch:<batchId>*` keys.

Clean up:

7. Cancel the two `sleepy-task` runs to unblock the queue.
8. Remove the temporary `sleepy-task` file and the
   `MAXIMUM_DEPLOYED_QUEUE_SIZE` override.
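The call-count behavior the unit tests assert (one call with `skipRetries: true`, `maxAttempts` calls without) can be sketched as a tiny standalone simulation. `runItem` and `ItemResult` are illustrative names only, not the real BatchQueue internals:

```typescript
// Illustrative simulation of the attempt accounting the unit tests check.
// `ItemResult` mirrors the shape of the process-item callback result;
// `runItem` is a hypothetical stand-in for the BatchQueue attempt loop.
type ItemResult = { success: boolean; skipRetries?: boolean };

function runItem(callback: () => ItemResult, maxAttempts: number): number {
  let calls = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    calls++;
    const result = callback();
    // Success ends the loop; skipRetries fast-fails with no further attempts.
    if (result.success || result.skipRetries === true) break;
  }
  return calls;
}

// Fast-fail: the callback runs exactly once.
const fastFailCalls = runItem(() => ({ success: false, skipRetries: true }), 6); // 1
// Default: the full retry ladder fires.
const ladderCalls = runItem(() => ({ success: false }), 6); // 6
```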

apps/webapp/app/runEngine/services/triggerTask.server.ts

Lines changed: 3 additions & 2 deletions

```diff
@@ -41,7 +41,7 @@ import type {
   TriggerTaskRequest,
   TriggerTaskValidator,
 } from "../types";
-import { ServiceValidationError } from "~/v3/services/common.server";
+import { QueueSizeLimitExceededError, ServiceValidationError } from "~/v3/services/common.server";

 class NoopTriggerRacepointSystem implements TriggerRacepointSystem {
   async waitForRacepoint(options: { racepoint: TriggerRacepoints; id: string }): Promise<void> {
@@ -271,8 +271,9 @@ export class RunEngineTriggerTaskService {
     );

     if (!queueSizeGuard.ok) {
-      throw new ServiceValidationError(
+      throw new QueueSizeLimitExceededError(
         `Cannot trigger ${taskId} as the queue size limit for this environment has been reached. The maximum size is ${queueSizeGuard.maximumSize}`,
+        queueSizeGuard.maximumSize ?? 0,
         undefined,
         "warn"
       );
```

Lines changed: 41 additions & 0 deletions (new file)

```ts
import { generateJWT as internal_generateJWT } from "@trigger.dev/core/v3";
import { extractJwtSigningSecretKey } from "./jwtAuth.server";

type Environment = Parameters<typeof extractJwtSigningSecretKey>[0];

export type MintRunTokenOptions = {
  /** Include the input-stream write scope (needed for steering messages from the playground). */
  includeInputStreamWrite?: boolean;
  /** Token expiration. Defaults to "1h". */
  expirationTime?: string;
};

/**
 * Mint a run-scoped public access token (JWT) for browser subscription to a
 * run's realtime streams.
 *
 * Used by:
 * - The playground action to give a freshly triggered chat session a token.
 * - The run details page to let the agent view subscribe to the chat stream
 *   of an existing run (read-only).
 */
export async function mintRunToken(
  environment: Environment,
  runFriendlyId: string,
  options: MintRunTokenOptions = {}
): Promise<string> {
  const scopes = [`read:runs:${runFriendlyId}`];
  if (options.includeInputStreamWrite) {
    scopes.push(`write:inputStreams:${runFriendlyId}`);
  }

  return internal_generateJWT({
    secretKey: extractJwtSigningSecretKey(environment),
    payload: {
      sub: environment.id,
      pub: true,
      scopes,
    },
    expirationTime: options.expirationTime ?? "1h",
  });
}
```
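The scope list this helper builds can be reproduced standalone for intuition. `buildScopes` is a hypothetical name introduced here for illustration, not an export of the file above:

```typescript
// Standalone sketch of the scope construction inside mintRunToken.
// `buildScopes` is an illustrative helper name, not part of the webapp code.
function buildScopes(runFriendlyId: string, includeInputStreamWrite = false): string[] {
  const scopes = [`read:runs:${runFriendlyId}`];
  if (includeInputStreamWrite) {
    // Write scope is only added when the caller needs to steer the run
    // (e.g. playground chat sessions).
    scopes.push(`write:inputStreams:${runFriendlyId}`);
  }
  return scopes;
}

const readOnly = buildScopes("run_abc");
const readWrite = buildScopes("run_abc", true);
```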

apps/webapp/app/v3/runEngineHandlers.server.ts

Lines changed: 85 additions & 13 deletions

```diff
@@ -13,6 +13,7 @@ import { logger } from "~/services/logger.server";
 import { updateMetadataService } from "~/services/metadata/updateMetadataInstance.server";
 import { reportInvocationUsage } from "~/services/platform.v3.server";
 import { MetadataTooLargeError } from "~/utils/packets";
+import { QueueSizeLimitExceededError } from "~/v3/services/common.server";
 import { TriggerTaskService } from "~/v3/services/triggerTask.server";
 import { tracer } from "~/v3/tracer.server";
 import { createExceptionPropertiesFromError } from "./eventRepository/common.server";
@@ -637,6 +638,15 @@ export function registerRunEngineEventBusHandlers() {
   });
 }

+/**
+ * errorCode returned by the batch process-item callback when the trigger was
+ * rejected because the environment's queue is at its maximum size. The
+ * BatchQueue (via `skipRetries`) short-circuits retries for this code, and the
+ * batch completion callback collapses per-item errors into a single aggregate
+ * `BatchTaskRunError` row instead of writing one per item.
+ */
+const QUEUE_SIZE_LIMIT_EXCEEDED_ERROR_CODE = "QUEUE_SIZE_LIMIT_EXCEEDED";
+
 /**
  * Set up the BatchQueue processing callbacks.
  * These handle creating runs from batch items and completing batches.
@@ -808,6 +818,37 @@ export function setupBatchQueueCallbacks() {
       } catch (error) {
         const errorMessage = error instanceof Error ? error.message : String(error);

+        // Queue-size-limit rejections are a customer-overload scenario (the
+        // env's queue is at its configured max). Retrying is pointless — the
+        // same item will fail again — and creating pre-failed TaskRuns for
+        // every item of every retried batch is exactly what chews through
+        // DB capacity when a noisy tenant fills their queue. Signal the
+        // BatchQueue to skip retries and skip pre-failed run creation, and
+        // let the completion callback collapse the per-item errors into a
+        // single summary row.
+        if (error instanceof QueueSizeLimitExceededError) {
+          logger.warn("[BatchQueue] Batch item rejected: queue size limit reached", {
+            batchId,
+            friendlyId,
+            itemIndex,
+            task: item.task,
+            environmentId: meta.environmentId,
+            maximumSize: error.maximumSize,
+          });
+
+          span.setAttribute("batch.result.error", errorMessage);
+          span.setAttribute("batch.result.errorCode", QUEUE_SIZE_LIMIT_EXCEEDED_ERROR_CODE);
+          span.setAttribute("batch.result.skipRetries", true);
+          span.end();
+
+          return {
+            success: false as const,
+            error: errorMessage,
+            errorCode: QUEUE_SIZE_LIMIT_EXCEEDED_ERROR_CODE,
+            skipRetries: true,
+          };
+        }
+
         logger.error("[BatchQueue] Failed to trigger batch item", {
           batchId,
           friendlyId,
@@ -889,20 +930,51 @@ export function setupBatchQueueCallbacks() {
           },
         });

-        // Create error records if there were failures
+        // Create error records if there were failures.
+        //
+        // Fast-path for queue-size-limit overload: when every failure is the
+        // same QUEUE_SIZE_LIMIT_EXCEEDED error, collapse them into a single
+        // aggregate row instead of writing one per item. This keeps the DB
+        // write volume bounded to O(batches) instead of O(items) when a noisy
+        // tenant fills their queue and all of their batches start bouncing.
         if (failures.length > 0) {
-          await tx.batchTaskRunError.createMany({
-            data: failures.map((failure) => ({
-              batchTaskRunId: batchId,
-              index: failure.index,
-              taskIdentifier: failure.taskIdentifier,
-              payload: failure.payload,
-              options: failure.options as Prisma.InputJsonValue | undefined,
-              error: failure.error,
-              errorCode: failure.errorCode,
-            })),
-            skipDuplicates: true,
-          });
+          const allQueueSizeLimit = failures.every(
+            (f) => f.errorCode === QUEUE_SIZE_LIMIT_EXCEEDED_ERROR_CODE
+          );
+
+          if (allQueueSizeLimit) {
+            const sample = failures[0]!;
+            await tx.batchTaskRunError.createMany({
+              data: [
+                {
+                  batchTaskRunId: batchId,
+                  // Use the first item's index as a stable anchor for the
+                  // (batchTaskRunId, index) unique constraint so callback
+                  // retries remain idempotent.
+                  index: sample.index,
+                  taskIdentifier: sample.taskIdentifier,
+                  payload: sample.payload,
+                  options: sample.options as Prisma.InputJsonValue | undefined,
+                  error: `${sample.error} (${failures.length} items in this batch failed with the same error)`,
+                  errorCode: sample.errorCode,
+                },
+              ],
+              skipDuplicates: true,
+            });
+          } else {
+            await tx.batchTaskRunError.createMany({
+              data: failures.map((failure) => ({
+                batchTaskRunId: batchId,
+                index: failure.index,
+                taskIdentifier: failure.taskIdentifier,
+                payload: failure.payload,
+                options: failure.options as Prisma.InputJsonValue | undefined,
+                error: failure.error,
+                errorCode: failure.errorCode,
+              })),
+              skipDuplicates: true,
+            });
+          }
         }
       });
```
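The completion-callback collapse can be expressed as a pure function for intuition. Names and the trimmed field set here are illustrative; the real code writes via Prisma inside a transaction:

```typescript
// Pure-function sketch of the aggregate-row collapse (hypothetical names,
// fields trimmed to what the logic needs).
type Failure = { index: number; error: string; errorCode?: string };

const QUEUE_SIZE_LIMIT_EXCEEDED = "QUEUE_SIZE_LIMIT_EXCEEDED";

function buildErrorRows(failures: Failure[]): Failure[] {
  if (failures.length === 0) return [];
  const allQueueSizeLimit = failures.every((f) => f.errorCode === QUEUE_SIZE_LIMIT_EXCEEDED);
  // Mixed error codes keep the existing one-row-per-item behavior.
  if (!allQueueSizeLimit) return failures;
  const sample = failures[0]!;
  return [
    {
      // The first item's index anchors the (batchTaskRunId, index) unique
      // constraint so callback retries stay idempotent.
      index: sample.index,
      error: `${sample.error} (${failures.length} items in this batch failed with the same error)`,
      errorCode: sample.errorCode,
    },
  ];
}

const overload: Failure[] = Array.from({ length: 5 }, (_, i) => ({
  index: i,
  error: "queue size limit reached",
  errorCode: QUEUE_SIZE_LIMIT_EXCEEDED,
}));
const rows = buildErrorRows(overload); // one aggregate row
const mixedRows = buildErrorRows([
  { index: 0, error: "queue size limit reached", errorCode: QUEUE_SIZE_LIMIT_EXCEEDED },
  { index: 1, error: "boom", errorCode: "OTHER" },
]); // per-item rows preserved
```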

apps/webapp/app/v3/services/common.server.ts

Lines changed: 19 additions & 0 deletions

```diff
@@ -10,3 +10,22 @@ export class ServiceValidationError extends Error {
     this.name = "ServiceValidationError";
   }
 }
+
+/**
+ * Thrown when a trigger is rejected because the environment's queue is at its
+ * maximum size. This is identified separately from other validation errors so
+ * the batch queue worker can short-circuit retries and skip pre-failed run
+ * creation for this specific overload scenario — see the batch process item
+ * callback in `runEngineHandlers.server.ts`.
+ */
+export class QueueSizeLimitExceededError extends ServiceValidationError {
+  constructor(
+    message: string,
+    public maximumSize: number,
+    status?: number,
+    logLevel?: ServiceValidationErrorLevel
+  ) {
+    super(message, status, logLevel);
+    this.name = "QueueSizeLimitExceededError";
+  }
+}
```
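Subclassing (rather than adding a code to the base class) means existing `instanceof ServiceValidationError` handlers keep matching while the batch callback can branch on the specific overload. A minimal standalone demonstration with stand-in classes, not imports from the file above:

```typescript
// Stand-in classes mirroring the shape in common.server.ts, shown
// self-contained so the instanceof behavior is easy to verify.
class ServiceValidationError extends Error {
  constructor(
    message: string,
    public status?: number
  ) {
    super(message);
    this.name = "ServiceValidationError";
  }
}

class QueueSizeLimitExceededError extends ServiceValidationError {
  constructor(
    message: string,
    public maximumSize: number
  ) {
    super(message);
    this.name = "QueueSizeLimitExceededError";
  }
}

const err: unknown = new QueueSizeLimitExceededError("queue full", 2_500_000);

// Generic validation-error handling still matches...
const handledAsValidation = err instanceof ServiceValidationError;
// ...while the batch callback can detect the specific overload condition.
const handledAsOverload = err instanceof QueueSizeLimitExceededError;
```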

internal-packages/run-engine/src/batch-queue/index.ts

Lines changed: 11 additions & 3 deletions

```diff
@@ -865,8 +865,16 @@ export class BatchQueue {
       span?.setAttribute("batch.errorCode", result.errorCode);
     }

-    // If retries are available, use FairQueue retry scheduling
-    if (!isFinalAttempt) {
+    const skipRetries = result.skipRetries === true;
+    if (skipRetries) {
+      span?.setAttribute("batch.skipRetries", true);
+    }
+
+    // If retries are available AND the callback didn't opt out, use
+    // FairQueue retry scheduling. `skipRetries` short-circuits this
+    // regardless of attempt number so the batch can finalize quickly
+    // when the error is known to be non-recoverable on retry.
+    if (!isFinalAttempt && !skipRetries) {
       span?.setAttribute("batch.retry", true);
       span?.setAttribute("batch.attempt", attempt);

@@ -890,7 +898,7 @@ export class BatchQueue {
       return;
     }

-    // Final attempt exhausted - record permanent failure
+    // Final attempt exhausted (or retries skipped) - record permanent failure
     const payloadStr = await this.#startSpan(
       "BatchQueue.serializePayload",
       async (innerSpan) => {
```
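The changed gating condition reads as a small predicate: retry only when attempts remain and the callback did not opt out. `willRetry` is an illustrative name, not a real BatchQueue method:

```typescript
// Sketch of the retry gate from the hunk above as a standalone predicate.
function willRetry(isFinalAttempt: boolean, skipRetries?: boolean): boolean {
  return !isFinalAttempt && skipRetries !== true;
}

const optedOut = willRetry(false, true); // false: finalize immediately
const normal = willRetry(false, undefined); // true: schedule a FairQueue retry
const exhausted = willRetry(true, undefined); // false: record permanent failure
```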
