Skip to content

Conversation

@hiroTamada
Copy link
Contributor

@hiroTamada hiroTamada commented Jan 12, 2026

Summary

  • Remove the build queue - builds now start immediately when created
  • Delete lib/builds/queue.go and lib/builds/queue_test.go
  • Add ErrResourcesExhausted error and 503 response type for resource exhaustion
  • Update OpenAPI spec with 503 response and deprecated queue_position field

Motivation

The build queue added complexity without providing significant benefits:

  • Builder VMs use similar resources to regular VMs
  • Queueing delays builds when resources may be available
  • Simpler architecture without queue state management

Key Changes

Aspect Before After
Build start Queued, starts when slot available Starts immediately
Capacity exceeded Queued with position Build fails with resource error
Cancel queued Removes from queue N/A (no queue)
Recovery on restart Re-enqueue pending Start pending immediately

Files Changed

  • Deleted: lib/builds/queue.go, lib/builds/queue_test.go
  • Modified: manager.go, types.go, errors.go, metrics.go, storage.go
  • API: Added 503 response to openapi.yaml, updated cmd/api/api/builds.go
  • Tests: Updated manager_test.go to remove queue tests
  • Docs: Updated README.md to reflect new architecture

Test plan

  • go build ./... - project compiles
  • go test ./lib/builds/... - all builds package tests pass
  • Manual testing: create a build and verify it starts immediately with building status
  • Manual testing: verify build completion and logs streaming work correctly

Note

Modernizes the build flow to execute immediately and surface capacity errors early.

  • Remove build queue: delete lib/builds/queue.go and tests; initial status is building; recovery restarts pending builds directly
  • Add preflight resource check via instances.CheckResourceAvailability; introduce ErrResourcesExhausted; map to 503 Service Unavailable with Retry-After in API (cmd/api/..., oapi); surface from manager and instance layer
  • OpenAPI/types: drop queued status, deprecate queue_position, change 202 description to “started”, and add 503 response; regenerate client/server code
  • Improve manager: avoid overwriting terminal states during updates/cancel; start builds in goroutine; persist builder instance ID; adjust cancel to terminate VM safely
  • Metrics: keep hypeman_build_duration_seconds and hypeman_builds_total; remove queue/active gauges
  • Docs/tests/tooling: update README.md, revise tests to expect immediate building, remove queue tests; enhance e2e script to use WebSocket exec (websocat); minor dep tidy (go.mod)

Written by Cursor Bugbot for commit 587b767. This will update automatically on new commits. Configure here.

Replace the explicit build queue with immediate build execution.
Builds now start immediately when created instead of being queued.

Key changes:
- Delete queue.go and queue_test.go
- Remove StatusQueued - builds start with StatusBuilding
- Add ErrResourcesExhausted error for resource limit detection
- Add 503 response with Retry-After header to OpenAPI spec
- Update manager to start builds in goroutines directly
- Simplify CancelBuild to only handle running builds
- Update RecoverPendingBuilds to start builds immediately
- Remove queue-related metrics (queueLength, activeBuilds)
- Update tests and documentation

If host resources are exhausted during instance creation, the build
will fail with a clear error message. The API includes a 503 response
type with Retry-After header for future pre-check implementation.
@github-actions
Copy link

github-actions bot commented Jan 12, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

refactor: remove build queue for immediate build execution

Edit this comment to update it. It will appear in the SDK's changelogs.

hypeman-typescript studio · code · diff

Your SDK built successfully.
generate ⚠️build ✅lint ✅test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/4289f51fe58184282fae729f82a3516d9960ea75/dist.tar.gz
hypeman-go studio · code · diff

Your SDK built successfully.
generate ⚠️lint ✅test ✅

go get github.com/stainless-sdks/hypeman-go@862eb3cb88dd3508be7dd81fe891354a764735b8
hypeman-cli studio · conflict

⏳ These are partial results; builds are still running.


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2026-01-13 21:39:34 UTC

@tembo
Copy link

tembo bot commented Jan 12, 2026

A couple things to consider with the queue removal:

  • lib/builds/manager.go: CreateBuild now always returns a build after spawning go m.runBuild(...). With that, ErrResourcesExhausted (detected in executeBuild) can’t propagate back to the caller, so the new 503 handling in cmd/api/api/builds.go likely never triggers. If the intent is “fail fast on create when resources are exhausted”, you may want a synchronous preflight/reservation step before returning the 202.

  • lib/builds/manager.go: the resource exhaustion detection via strings.Contains(err.Error(), "exceeds") && strings.Contains(..., "limit") feels pretty brittle and easy to break with wording changes. If instances.Manager has (or can expose) a typed/sentinel error for capacity/limit failures, that’d make this mapping much more reliable.

  • lib/builds/storage.go: listPendingBuilds no longer includes queued. If there are any existing builds on disk with status queued from previous versions, they’ll never get recovered and could be stuck indefinitely. A short-term backwards-compat path (treat "queued" as pending) or a migration step could help here.

  • lib/builds/manager.go: CancelBuild now deletes the builder instance but ignores the return error; might be worth at least logging failures (and/or using a short context.WithTimeout(context.Background(), ...) so cleanup isn’t skipped if the request context is already canceled).

  • openapi.yaml / generated lib/oapi/oapi.go: queue_position is marked deprecated, but generated code calls out no x-deprecated-reason. If you want nicer client docs, adding an explicit reason/extension in the spec would avoid that awkward generated comment.

Copy link
Contributor

@rgarcia rgarcia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

if err != nil {
// Check if this is a resource exhaustion error
errStr := err.Error()
if strings.Contains(errStr, "exceeds") && strings.Contains(errStr, "limit") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unclear how this ErrResourcesExhausted error makes it back to the API handler - since executeBuild is called from runBuild which runs in a background goroutine, the caller has already received a 202 by the time this runs. the 503 path in the handler seems unreachable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right - that was exactly the issue. I've fixed this by adding a synchronous preflight resource check in CreateBuild that runs before spawning the async goroutine:

// Preflight check: verify resources are available before accepting the build
builderMemory := int64(policy.MemoryMB) * 1024 * 1024
if err := m.instanceManager.CheckResourceAvailability(ctx, policy.CPUs, builderMemory); err != nil {
    if errors.Is(err, instances.ErrResourcesExhausted) {
        return nil, fmt.Errorf("%w: %v", ErrResourcesExhausted, err)
    }
    return nil, fmt.Errorf("check resource availability: %w", err)
}

Added a new CheckResourceAvailability() method to instances.Manager that checks per-instance and aggregate limits without actually creating an instance. Now the 503 path is reachable when resources are truly exhausted at build creation time.

Also added proper sentinel errors (ErrResourcesExhausted) to both packages so we can use errors.Is() instead of brittle string matching.

- High: Add synchronous preflight resource check in CreateBuild to return
  503 when resources are exhausted (instead of failing asynchronously)
- Medium: Handle StatusQueued in recovery for backward compatibility with
  builds created before queue removal
- Medium: Check terminal states in updateStatus to prevent race condition
  where cancelled builds get overwritten by runBuild goroutine

Changes:
- Add ErrResourcesExhausted sentinel error to instances package
- Add CheckResourceAvailability() method to instances.Manager interface
- Add isTerminalStatus() helper and terminal state checks in manager.go
- Update CancelBuild to set cancelled status before deleting instance
- Handle StatusQueued in listPendingBuilds for backward compat
StatusQueued constant was removed, use "queued" string literal instead.
hiroTamada and others added 2 commits January 13, 2026 14:13
The exec endpoint uses WebSocket, not REST API. The previous curl POST
approach returned HTTP 405 and empty responses.

Changes:
- Use websocat WebSocket client for exec (auto-downloads if needed)
- Write JSON command to temp file to avoid shell escaping issues
- Add wait_for_agent parameter to give guest agent time to initialize
- Increase sleep before exec from 2s to 5s for reliability
- Verify exec success by checking for expected output (Alpine Linux)
… result

Fixed a race condition where a build could still execute after being
cancelled. The issue was:

1. runBuild checks if build is in terminal state
2. CancelBuild is called, setting status to "cancelled"
3. runBuild calls updateStatus(StatusBuilding) which silently returns
4. runBuild proceeds to executeBuild, wasting compute resources

The fix:
- updateStatus now returns bool indicating if the update was applied
- runBuild checks this return value and aborts if false
- This prevents creating builder VMs for already-cancelled builds
return nil, fmt.Errorf("%w: %v", ErrResourcesExhausted, err)
}
return nil, fmt.Errorf("check resource availability: %w", err)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preflight resource check has TOCTOU race condition

Medium Severity

The CheckResourceAvailability preflight check runs before acquiring createMu lock. Multiple concurrent CreateBuild requests can all pass this check simultaneously before any build actually consumes resources (which happens asynchronously in executeBuild). This means under concurrent load, users may receive 202 responses for builds that immediately fail due to resource exhaustion, defeating the intended 503 fast-fail behavior. Moving the check inside the lock would at least serialize the checks and make them more accurate relative to pending builds.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants