PromptForge

Distributed LLM batch processing — at any scale, without babysitting.
Upload a file of prompts. Every result comes back. Reliably.

The Problem

Running thousands of prompts against an LLM provider is not just an API call problem. It is a systems problem.

Providers impose rate limits. Pods crash mid-job. Files are too large to hold in memory. You need to know exactly which prompts succeeded, which failed, and why — at a per-prompt level — without having to watch it happen.

Most teams write fragile scripts that break at scale. PromptForge is a purpose-built platform that handles all of it: rate learning, failure recovery, checkpointing, result reconciliation, and observability — without any manual configuration.

How It Works

The system is organized into three services. Nothing polls — every state transition is triggered by an event.

Ingestion — Cloud Run API

A client calls POST /v1/jobs/init with a provider, model, and optional rate limits. The API creates a job record in Firestore and returns a signed GCS upload URL. The client writes the prompt file directly to Google Cloud Storage — no proxy, no bottleneck. The API's job ends here.

Orchestration — Cloud Run Launcher

When the upload lands in GCS, an Eventarc OBJECT_FINALIZE event fires. The Launcher validates every line of the JSONL file (invalid lines go to errors.jsonl, the job continues), updates the job state in Firestore, and creates a GKE Kubernetes Job. The Launcher exits as soon as the pod is scheduled. It has no ongoing responsibility.

Per-client concurrency is enforced here: if a pod is already running for a client, the new job is queued as PENDING. The previous pod promotes it on completion.

Execution — GKE Pod

One pod per job. It runs four concurrent async loops for its entire lifetime:

Dispatch loop — spaces prompt requests by the learned effective RPM
Response handler — processes LLM responses as they arrive, out of order
Result buffer — accumulates responses and flushes to GCS in batches (500 responses, 50 MB, or 30s — whichever comes first)
Checkpoint writer — writes state.json to GCS every 30 seconds

The pod talks to the LLM provider directly over HTTPS via LiteLLM. No intermediate queue, no worker fleet — the async event loop is the scheduler.

Rate Learning

PromptForge never trusts static RPM/TPM values. It treats them as upper bounds and discovers real limits at runtime using a TCP-inspired algorithm.

Slow start — before seeing any 429, the dispatcher multiplies its RPM target by 1.5 every 30 seconds. Starting from 10 RPM: 10 → 15 → 22 → 33 → 49 → 73 → ...

Congestion avoidance — after a 429, it backs off to 75% of the current target, waits 60 seconds, then increments by +1 every 30 seconds until it hits a ceiling again.

TPM constraint — effective RPM is also bounded by token budget: effective_rpm = min(rpm_target, tpm_limit / p95_output_tokens). The P95 token estimate updates after every successful response. As responses grow longer, throughput drops automatically.

This means a job submitted with rpm: 500 won't saturate at 500 if the provider's real limit is lower. It will converge to whatever the provider actually allows.

Reliability

Crash recovery — on pod restart, the execution service reads state.json from GCS and resumes from the last checkpoint offset. At-least-once delivery: prompts near the checkpoint boundary may be re-dispatched, but none are silently dropped.

Reconciliation — job completion includes a bitset pass across all result and error files. Every prompt_id is accounted for. If any ID is missing, it is logged explicitly — there is no silent data loss.

Duplicate Eventarc events — the Launcher checks the current Firestore job status before acting. Duplicate OBJECT_FINALIZE events (common in GCS) are safely idempotent.

Observability

All three services are instrumented with OpenTelemetry, exporting to Axiom over OTLP. Errors are captured to Sentry.

Every prompt dispatch produces a child span carrying token counts, latency, attempt number, and error code. Key rate learning gauges are emitted in real time:

Metric	What it tells you
`rate.rpm_target`	Current learned RPM target
`rate.effective_rpm`	RPM after TPM constraint
`rate.p95_tokens`	Rolling P95 completion token estimate
`rate.429_events`	Rate limit events since job start
`prompts.completed`	Successfully processed
`prompts.failed`	Permanently failed after max retries
`llm.latency_ms`	Per-request provider latency histogram
`queue.retry_depth`	Prompts currently awaiting retry

prompt_id is never used as a metric label — cardinality is kept flat regardless of job size.

API

Authentication uses X-API-Key on every request. Keys are stored in GCP Secret Manager and cached in-process with a 5-minute TTL.

Endpoint	Description
`POST /v1/jobs/init`	Create a job. Returns `job_id` and a signed GCS upload URL.
`GET /v1/jobs/{job_id}`	Poll status and progress counts.
`GET /v1/jobs/{job_id}/results`	Returns signed download URLs for result JSONL files. Available once `COMPLETED`.
`DELETE /v1/jobs/{job_id}`	Cancel. In-flight prompts complete naturally; pending ones are abandoned.

Input format: newline-delimited JSON, one {"prompt_id": int, "prompt": string} per line. prompt_id must be unique within the job — it is used for deduplication, retry tracking, and reconciliation.

Output format: results_final.jsonl (or results_part_NNN.jsonl for large jobs) — one record per prompt carrying response, tokens, latency_ms, and attempt. Permanently failed prompts go to errors.jsonl with error codes and attempt history.

Stack

Layer	Technology
API service	Python / FastAPI on Cloud Run
Job launcher	Python / FastAPI on Cloud Run (ephemeral)
Execution engine	Python / asyncio on GKE (one pod per job)
LLM routing	LiteLLM
Event routing	GCP Eventarc (`OBJECT_FINALIZE`)
Job state	Firestore
Object storage	Google Cloud Storage
Secrets	GCP Secret Manager
Observability	OpenTelemetry → Axiom + Sentry
Infrastructure	Pulumi (TypeScript)
Supported providers	OpenAI, Gemini

Design Principles

Event-driven end to end. No component polls. Every state transition is triggered by an event: HTTP response, GCS object finalize, async task completion.

One pod, one job. Each job runs in a fully isolated GKE pod. A failure or rate event in one job cannot affect another. Accounting is exact.

Memory-constant. Prompt files are never loaded in full. The execution pod reads one line at a time via GCS byte-range streaming. A 10M-prompt job uses the same memory as a 100-prompt job.

At-least-once over at-most-once. On recovery, prompts near the checkpoint boundary may be re-sent. This is deliberate — duplicate work is preferable to silent data loss.

No per-response writes. Results buffer and flush in batches. One million prompts produce roughly 2,000 GCS writes, not one million.

Reconciliation as a completion step. Job completion is not just a status flip. A full bitset pass verifies every prompt_id was accounted for before the job is marked COMPLETED.

Repository Layout

infra/              Pulumi TypeScript — all GCP resources
services/
  api/              Cloud Run API (FastAPI)
  launcher/         Cloud Run Job Launcher (FastAPI + CloudEvents)
  execution/        GKE Execution Pod (asyncio)
    loops/          dispatch, response handler, buffer, checkpoint
    rate/           rate learning controller
shared/             Models, Firestore helpers, GCS helpers, observability bootstrap
tests/
  unit/             Isolated unit tests per service
  integration/      Tests against real GCP resources
  e2e/              Full end-to-end flow against live deployment
scripts/
  debug/            GKE log inspection, stale job repair
  deploy/           Image build and push helpers
docs/               Architecture, API spec, design decisions

Contributing

Open an issue before submitting a PR for anything beyond a small fix — it helps align on direction before work begins.

Fork the repository
Create a feature branch
Commit with clear messages
Open a pull request against main

License

MIT. See LICENSE for details.

Built for failure, scale, and visibility — from day one.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
.claude		.claude
docs		docs
infra		infra
scripts		scripts
services		services
shared		shared
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gcloudignore		.gcloudignore
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
SHIPPING.md		SHIPPING.md
STATUS.md		STATUS.md
cloudbuild-api.yaml		cloudbuild-api.yaml
cloudbuild-launcher.yaml		cloudbuild-launcher.yaml
cloudbuild.yaml		cloudbuild.yaml
docker-compose.yml		docker-compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PromptForge

The Problem

How It Works

Ingestion — Cloud Run API

Orchestration — Cloud Run Launcher

Execution — GKE Pod

Rate Learning

Reliability

Observability

API

Stack

Design Principles

Repository Layout

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PromptForge

The Problem

How It Works

Ingestion — Cloud Run API

Orchestration — Cloud Run Launcher

Execution — GKE Pod

Rate Learning

Reliability

Observability

API

Stack

Design Principles

Repository Layout

Contributing

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages