feat: add circuit breaker for upstream provider overload protection #75
Open
kacpersaw wants to merge 21 commits into main from kacpersaw/aibridge-circuit-breaker
+522
−8
Implement per-provider circuit breakers that detect upstream rate limiting (429/503/529 status codes) and temporarily stop sending requests when providers are overloaded.

Key features:
- Per-provider circuit breakers (Anthropic, OpenAI)
- Configurable failure threshold, time window, and cooldown period
- Half-open state allows gradual recovery testing
- Prometheus metrics for monitoring (state gauge, trips counter, rejects counter)
- Thread-safe implementation with proper state machine transitions
- Disabled by default for backward compatibility

Circuit breaker states:
- Closed: normal operation, tracking failures within sliding window
- Open: all requests rejected with 503, waiting for cooldown
- Half-Open: limited requests allowed to test if upstream recovered

Status codes that trigger circuit breaker:
- 429 Too Many Requests
- 503 Service Unavailable
- 529 Anthropic Overloaded

Relates to: coder/internal#1153
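The three states in this commit message can be sketched as a small state machine. This is only an illustration under stated assumptions, not the code from the PR (which later commits replace with sony/gobreaker): the `Breaker` type, its methods, and the rule that a single successful probe closes the circuit are all hypothetical, and the field names are borrowed from the original config names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// Breaker is a hypothetical minimal breaker; field names follow the PR's
// original config (FailureThreshold, Window, Cooldown, HalfOpenMaxRequests).
type Breaker struct {
	FailureThreshold    int
	Window, Cooldown    time.Duration
	HalfOpenMaxRequests int

	mu       sync.Mutex
	state    State
	failures []time.Time // failure times still inside the sliding window
	openedAt time.Time
	probes   int // requests admitted since entering half-open
}

// Allow reports whether a request may be sent upstream at time now.
func (b *Breaker) Allow(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.state == Open {
		if now.Sub(b.openedAt) < b.Cooldown {
			return false // still cooling down: reject
		}
		b.state, b.probes = HalfOpen, 0
	}
	if b.state == HalfOpen {
		if b.probes >= b.HalfOpenMaxRequests {
			return false // probe budget used up
		}
		b.probes++
	}
	return true
}

// Record feeds back the outcome of a request that Allow admitted.
func (b *Breaker) Record(now time.Time, failed bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch {
	case b.state == HalfOpen && failed:
		b.trip(now)
	case b.state == HalfOpen:
		b.state, b.failures = Closed, nil // assumption: one good probe closes
	case b.state == Closed && failed:
		kept := b.failures[:0]
		for _, t := range b.failures {
			if now.Sub(t) < b.Window { // drop failures older than the window
				kept = append(kept, t)
			}
		}
		b.failures = append(kept, now)
		if len(b.failures) >= b.FailureThreshold {
			b.trip(now)
		}
	}
}

func (b *Breaker) trip(now time.Time) {
	b.state, b.openedAt, b.failures = Open, now, nil
}

// demo drives the breaker with the default values listed in the PR description.
func demo() (afterFailures State, allowedWhileOpen bool, afterCooldown, afterProbe State) {
	b := &Breaker{FailureThreshold: 5, Window: 10 * time.Second,
		Cooldown: 30 * time.Second, HalfOpenMaxRequests: 3}
	start := time.Unix(0, 0)
	for i := 0; i < 5; i++ {
		at := start.Add(time.Duration(i) * time.Second)
		b.Allow(at)
		b.Record(at, true)
	}
	afterFailures = b.state
	allowedWhileOpen = b.Allow(start.Add(5 * time.Second))
	b.Allow(start.Add(40 * time.Second))
	afterCooldown = b.state
	b.Record(start.Add(41*time.Second), false)
	afterProbe = b.state
	return
}

func main() {
	fmt.Println(demo())
}
```

Passing `now` explicitly instead of calling `time.Now()` inside the breaker keeps the sketch deterministic, which makes the cooldown and window behaviour easy to exercise in tests.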
dannykopping
requested changes
Dec 16, 2025
pawbana
reviewed
Dec 16, 2025
…solation

- Replace custom circuit breaker implementation with sony/gobreaker
- Change from per-provider to per-endpoint circuit breakers (e.g., OpenAI chat completions failing won't block responses API)
- Simplify API: CircuitBreakers manages all breakers internally
- Update metrics to include endpoint label
- Simplify tests to focus on key behaviors

Based on PR review feedback suggesting use of established library and per-endpoint granularity for better fault isolation.
Rename fields to match gobreaker naming convention:
- Window -> Interval
- Cooldown -> Timeout
- HalfOpenMaxRequests -> MaxRequests
- FailureThreshold type int64 -> uint32
dannykopping
requested changes
Dec 17, 2025
…onfigs

Address PR review feedback:
1. Middleware pattern - Circuit breaker is now HTTP middleware that wraps handlers, capturing response status codes directly instead of extracting from provider-specific error types.
2. Per-provider configs - NewCircuitBreakers takes map[string]CircuitBreakerConfig keyed by provider name. Providers not in the map have no circuit breaker.
3. Remove provider overfitting - Deleted extractStatusCodeFromError() which hardcoded AnthropicErrorResponse and OpenAIErrorResponse types. Middleware now uses statusCapturingWriter to inspect actual HTTP response codes.
4. Configurable failure detection - IsFailure func in config allows providers to define custom status codes as failures. Defaults to 429/503/529.
5. Fix gauge values - State gauge now uses 0 (closed), 0.5 (half-open), 1 (open)
6. Integration tests - Replaced unit tests with httptest-based integration tests that verify actual behavior: upstream errors trip circuit, requests get blocked, recovery after timeout, per-endpoint isolation.
7. Error message - Changed from 'upstream rate limiting' to 'circuit breaker is open'
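The middleware pattern from item 1 and the statusCapturingWriter from item 3 might look roughly like the sketch below. Only the name `statusCapturingWriter` comes from the commit message; its fields, the `withStatusCapture` wrapper, the `observe` callback, and the two sample handlers are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusCapturingWriter records the status code the wrapped handler writes.
type statusCapturingWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusCapturingWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// withStatusCapture is a hypothetical middleware: it runs next and then hands
// the status code that was actually written to observe.
func withStatusCapture(next http.Handler, observe func(status int)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Default to 200: a handler that only calls Write implies 200 OK.
		sw := &statusCapturingWriter{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(sw, r)
		observe(sw.status)
	})
}

// overloaded and healthy stand in for upstream-facing handlers.
func overloaded(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusTooManyRequests)
}

func healthy(w http.ResponseWriter, _ *http.Request) {
	w.Write([]byte("ok"))
}

// capturedStatus runs h behind the middleware and returns what it observed.
func capturedStatus(h http.HandlerFunc) int {
	var got int
	wrapped := withStatusCapture(h, func(s int) { got = s })
	req := httptest.NewRequest(http.MethodPost, "/v1/messages", nil)
	wrapped.ServeHTTP(httptest.NewRecorder(), req)
	return got
}

func main() {
	fmt.Println(capturedStatus(overloaded), capturedStatus(healthy))
}
```

Because the status is read from the response itself, the middleware needs no knowledge of provider-specific error types, which is the point of item 3.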
- Add CircuitBreaker interface with Allow(), RecordSuccess(), RecordFailure()
- Add NoopCircuitBreaker struct for providers without circuit breaker config
- Add gobreakerCircuitBreaker wrapping sony/gobreaker implementation
- CircuitBreakers.Get() returns NoopCircuitBreaker when provider not configured
- Add http.Flusher support to statusCapturingWriter for SSE streaming
- Add Unwrap() for ResponseWriter interface detection
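The last two bullets matter because a struct that embeds `http.ResponseWriter` only exposes that interface's three methods, so a plain wrapper hides the underlying writer's `Flush`. A minimal sketch of the fix, assuming a writer shaped like the one named in the commit (the `plainWriter` type, `canFlush`, and `demo` are hypothetical helpers for the illustration):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// plainWriter wraps a ResponseWriter without forwarding Flush.
type plainWriter struct{ http.ResponseWriter }

type statusCapturingWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusCapturingWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// Flush forwards to the underlying writer when it supports flushing, so
// server-sent event chunks are not held back by the wrapper.
func (w *statusCapturingWriter) Flush() {
	if f, ok := w.ResponseWriter.(http.Flusher); ok {
		f.Flush()
	}
}

// Unwrap lets http.ResponseController (Go 1.20+) reach the wrapped writer.
func (w *statusCapturingWriter) Unwrap() http.ResponseWriter {
	return w.ResponseWriter
}

func canFlush(w http.ResponseWriter) bool {
	_, ok := w.(http.Flusher)
	return ok
}

// demo reports whether each wrapper exposes Flush and whether a flush
// actually reached the recorder underneath.
func demo() (plainCanFlush, wrappedCanFlush, flushed bool) {
	rec := httptest.NewRecorder()
	plainCanFlush = canFlush(plainWriter{ResponseWriter: rec})

	sw := &statusCapturingWriter{ResponseWriter: rec, status: http.StatusOK}
	wrappedCanFlush = canFlush(sw)
	fmt.Fprint(sw, "data: hello\n\n")
	sw.Flush()
	return plainCanFlush, wrappedCanFlush, rec.Flushed
}

func main() {
	fmt.Println(demo())
}
```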
pawbana
reviewed
Dec 17, 2025
- Changed CircuitBreaker interface to Execute(fn func() int) (statusCode, rejected)
- Use gobreaker.Execute() to properly handle both ErrOpenState and ErrTooManyRequests
- NoopCircuitBreaker.Execute simply runs the function and returns not rejected
- Simplified middleware by removing separate Allow/Record pattern
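The Execute shape described in this commit can be sketched with the standard library alone. The interface signature and `NoopCircuitBreaker` follow the commit message; `countingBreaker` is a hypothetical stand-in for the gobreaker-backed type (it opens after a number of consecutive failures and, to keep the sketch short, never recovers), and `callN` is a helper invented for the demonstration.

```go
package main

import (
	"fmt"
	"sync"
)

// CircuitBreaker runs fn, which returns the upstream status code.
// rejected is true when the breaker refused to run fn at all.
type CircuitBreaker interface {
	Execute(fn func() int) (statusCode int, rejected bool)
}

// NoopCircuitBreaker is used for providers without a breaker config.
type NoopCircuitBreaker struct{}

func (NoopCircuitBreaker) Execute(fn func() int) (int, bool) {
	return fn(), false
}

// countingBreaker is a hypothetical stand-in, not the gobreaker wrapper.
type countingBreaker struct {
	mu          sync.Mutex
	threshold   int
	consecutive int
}

func (b *countingBreaker) Execute(fn func() int) (int, bool) {
	b.mu.Lock()
	open := b.consecutive >= b.threshold
	b.mu.Unlock()
	if open {
		return 0, true // fn is never called while open
	}
	code := fn()
	b.mu.Lock()
	defer b.mu.Unlock()
	if code == 429 || code == 503 || code == 529 {
		b.consecutive++
	} else {
		b.consecutive = 0
	}
	return code, false
}

// callN sends n requests that all return code and counts the outcomes.
func callN(cb CircuitBreaker, n, code int) (ran, rejected int) {
	for i := 0; i < n; i++ {
		if _, rej := cb.Execute(func() int { return code }); rej {
			rejected++
		} else {
			ran++
		}
	}
	return ran, rejected
}

func main() {
	fmt.Println(callN(&countingBreaker{threshold: 2}, 5, 429))
	fmt.Println(callN(NoopCircuitBreaker{}, 5, 429))
}
```

With a single Execute call the caller cannot forget to record an outcome, which is the simplification over the separate Allow/Record pattern that the commit mentions. A middleware built on this would turn `rejected` into the "circuit breaker is open" error response.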
pawbana
reviewed
Dec 17, 2025
…e gobreakerCircuitBreaker along with the interface and noop struct
pawbana
reviewed
Dec 17, 2025
Co-authored-by: Paweł Banaszewski
Summary
Implement per-provider circuit breakers that detect upstream rate limiting (429/503/529 status codes) and temporarily stop sending requests when providers are overloaded.
This completes the overload protection story by adding the aibridge-specific component that couldn't be implemented as generic HTTP middleware in coderd (since it requires understanding upstream provider responses).
Key Features
Circuit Breaker States
Status Codes That Trigger Circuit Breaker
Other error codes (400, 401, 500, 502, etc.) do not trigger the circuit breaker since they indicate different issues that circuit breaking wouldn't help with.
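The distinction above amounts to a small predicate over status codes. A minimal sketch, assuming a function named `defaultIsFailure` (the name is hypothetical; only the set of codes comes from this description):

```go
package main

import (
	"fmt"
	"net/http"
)

// defaultIsFailure reports whether a status code signals upstream overload
// and should count toward tripping the circuit breaker.
func defaultIsFailure(status int) bool {
	switch status {
	case http.StatusTooManyRequests, // 429
		http.StatusServiceUnavailable, // 503
		529: // Anthropic "overloaded"; net/http has no constant for it
		return true
	}
	return false
}

func main() {
	for _, code := range []int{400, 401, 429, 500, 502, 503, 529} {
		fmt.Printf("%d -> %t\n", code, defaultIsFailure(code))
	}
}
```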
Default Configuration
- Enabled: false
- FailureThreshold: 5
- Window: 10s
- Cooldown: 30s
- HalfOpenMaxRequests: 3
New Prometheus Metrics
- aibridge_circuit_breaker_state{provider}: Current state (0=closed, 1=open, 2=half-open)
- aibridge_circuit_breaker_trips_total{provider}: Total times circuit opened
- aibridge_circuit_breaker_rejects_total{provider}: Requests rejected due to open circuit
Files Changed
- circuit_breaker.go: Core circuit breaker implementation
- circuit_breaker_test.go: Comprehensive test suite (13 tests)
- bridge.go: Integration into RequestBridge
- interception.go: Apply circuit breaker to intercepted requests
- metrics.go: Add Prometheus metrics
Testing
All tests pass:
Related
aibridged (coder/internal#1153)