feat: adding circuit breaker feature #8266
| return cached | ||
|
|
||
| try: | ||
| record = self._get_record(name) |
Does `_get_record` really need to throw an exception here? Can we not just check whether the returned value is `None`?
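A minimal sketch of the `None`-returning variant the comment asks about. `Record` and `InMemoryStore` are hypothetical stand-ins (a dict replaces the DynamoDB table), and treating a missing row as `CLOSED` is an assumption rather than something the PR states:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for CircuitStateRecord."""

    name: str
    state: str = "CLOSED"


class InMemoryStore:
    """Hypothetical store; a dict stands in for the DynamoDB table."""

    def __init__(self) -> None:
        self._rows: dict[str, Record] = {}

    def put(self, record: Record) -> None:
        self._rows[record.name] = record

    def get_record(self, name: str) -> Record | None:
        # A missing row is an expected outcome, so it is signalled with None
        # rather than with an exception.
        return self._rows.get(name)


def current_state(store: InMemoryStore, name: str) -> str:
    record = store.get_record(name)
    if record is None:
        # Assumption: no persisted state yet means the circuit is closed.
        return "CLOSED"
    return record.state
```

With this shape the caller needs a single `is None` check instead of a `try`/`except` around the lookup.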
| opened_at=opened_at, | ||
| expiry_timestamp=self._durable_ttl(), | ||
| ) | ||
| self._update_record(record) |
What happens if this fails? Does it leave the DynamoDB row permanently stale?
| raise | ||
|
|
||
| def _update_record(self, record: CircuitStateRecord) -> None: | ||
| update_expression = "SET #state = :state, #failure_count = :failure_count" |
I think this query creation logic could be a separate private method
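One way the extraction could look, as a sketch only. `Record`, `build_update`, and the attribute names `state`, `failure_count`, and `opened_at` are hypothetical; the returned keys are the standard `UpdateItem` parameter names, so the result could be splatted into a boto3 client `update_item` call alongside `TableName` and `Key`:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class Record:
    """Hypothetical stand-in for CircuitStateRecord."""

    name: str
    state: str
    failure_count: int
    opened_at: int | None = None


def build_update(record: Record) -> dict[str, Any]:
    """Build the expression-related keyword arguments for an UpdateItem call."""
    expression = "SET #state = :state, #failure_count = :failure_count"
    names = {"#state": "state", "#failure_count": "failure_count"}
    values: dict[str, Any] = {
        ":state": {"S": record.state},
        ":failure_count": {"N": str(record.failure_count)},
    }
    if record.opened_at is not None:
        # Optional attributes are appended only when they carry a value.
        expression += ", #opened_at = :opened_at"
        names["#opened_at"] = "opened_at"
        values[":opened_at"] = {"N": str(record.opened_at)}
    return {
        "UpdateExpression": expression,
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }
```

Keeping the builder free of I/O also makes it testable without a table.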
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           develop    #8266      +/-   ##
===========================================
+ Coverage    96.58%   96.62%   +0.04%
===========================================
  Files          286      296      +10
  Lines        14294    14679     +385
  Branches      1192     1227      +35
===========================================
+ Hits         13806    14184     +378
- Misses         357      360       +3
- Partials       131      135       +4
|
|
| local_expiry = int(datetime.datetime.now().timestamp()) + self.local_cache_max_age | ||
| self._cache[self._cache_key(record.name)] = (local_expiry, record) | ||
|
|
||
| def _retrieve_from_cache(self, name: str) -> CircuitStateRecord | None: |
Nit: I would put the delete in the conditional rather than in the main path:
def _retrieve_from_cache(self, name: str) -> CircuitStateRecord | None:
    """Return a cached record if present and still within its local freshness window."""
    cached = self._cache.get(self._cache_key(name))
    if cached is None:
        return None
    local_expiry, record = cached
    if int(datetime.datetime.now().timestamp()) >= local_expiry:
        del self._cache[self._cache_key(name)]
        return None
    return record
| self.failure_count_attr: {"N": str(record.failure_count)}, | ||
| } | ||
| if record.opened_at is not None: | ||
| item[self.opened_at_attr] = {"N": str(record.opened_at)} |
Could we add a simple helper function to unpack these values rather than having to dig into the test N field and cast to string multiple times in this module?
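A sketch of what such helpers might look like. `to_number` and `from_number` are hypothetical names; the only thing assumed about the store is DynamoDB's usual `{"N": "<digits>"}` wire format for numbers:

```python
from __future__ import annotations

from typing import Any


def to_number(value: int) -> dict[str, str]:
    """Wrap an int as a DynamoDB number attribute, e.g. 5 -> {"N": "5"}."""
    return {"N": str(value)}


def from_number(item: dict[str, Any], attr: str) -> int | None:
    """Read an int from a DynamoDB number attribute; None if the attribute is absent."""
    wrapped = item.get(attr)
    if wrapped is None:
        return None
    return int(wrapped["N"])
```

The serialization lines would then read `item[self.opened_at_attr] = to_number(record.opened_at)`, and the string cast lives in one place.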
|
|
||
| **Circuit** is a named guard around a single downstream dependency. Each `name` is an independent circuit. | ||
|
|
||
| **State** is the circuit's current mode: `CLOSED` (normal), `OPEN` (downstream considered unhealthy, calls skipped), or `HALF_OPEN` (testing recovery). |
| **State** is the circuit's current mode: `CLOSED` (normal), `OPEN` (downstream considered unhealthy, calls skipped), or `HALF_OPEN` (testing recovery). | |
| **State** is the circuit's current mode: `CLOSED` (healthy), `OPEN` (downstream considered unhealthy, calls skipped), or `HALF_OPEN` (testing recovery). |
| --8<-- "examples/circuit_breaker_alpha/src/working_with_callback.py" | ||
| ``` | ||
|
|
||
| !!! info "Why a callback instead of built-in S3/SQS sinks?" |
Not sure about the tone of this message and also it's a bit verbose.



Issue number: closes #8257
Summary
Changes
This PR adds a Circuit Breaker utility (under the `circuit_breaker_alpha` namespace) so a Lambda function can stop calling an unhealthy downstream and let it recover, instead of piling on retries.
It ships as alpha on purpose: I want about a month of feedback before we lock the public API and promote it to GA.
The failure counter lives in memory per execution environment, so a healthy circuit writes nothing; we only persist state transitions. State is shared via DynamoDB, fails open if the store is unreachable, and uses a conditional write to elect a single probe during recovery (no thundering herd). You handle rejected requests with an `on_circuit_open` callback and observe state changes with an `on_transition` hook.
User experience
Before, you had to build the state machine, shared storage, and recovery logic yourself. Now you wrap the function that makes the downstream call:
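As a rough illustration of the wrapping pattern only, here is a self-contained in-memory breaker. This is not the utility's API: `SketchBreaker`, `SketchOpenError`, and every parameter name are hypothetical, there is no shared store or probe election, and resetting the failure count on a success while closed is an assumption. The thresholds used are 5 failures, 30 seconds, and 3 successes:

```python
from __future__ import annotations

import functools
import time
from typing import Any, Callable


class SketchOpenError(Exception):
    """Raised when the circuit is open and no fallback is supplied (hypothetical)."""


class SketchBreaker:
    """In-memory only: no shared store, no probe election."""

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        success_threshold: int = 3,
        on_open: Callable[[], Any] | None = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.on_open = on_open
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def __call__(self, func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if self.state == "OPEN":
                if self.clock() - self.opened_at < self.recovery_timeout:
                    # Still cooling down: skip the downstream call entirely.
                    if self.on_open is not None:
                        return self.on_open()
                    raise SketchOpenError("circuit is open")
                # Cool-down elapsed: let calls through to test recovery.
                self.state, self.successes = "HALF_OPEN", 0
            try:
                result = func(*args, **kwargs)
            except Exception:
                self._record_failure()
                raise
            self._record_success()
            return result

        return wrapper

    def _record_failure(self) -> None:
        self.failures += 1
        # Any failure while probing, or too many while closed, opens the circuit.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "OPEN", self.clock()

    def _record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state, self.failures = "CLOSED", 0
        else:
            # Assumption: failures are counted consecutively while closed.
            self.failures = 0


@SketchBreaker(on_open=lambda: {"status": "degraded"})
def call_downstream(payload: dict) -> dict:
    # Placeholder for the real downstream call.
    return {"status": "ok", "echo": payload}
```

The `clock` parameter exists only so the cool-down can be exercised in tests without sleeping.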
With no config, sensible defaults apply (open after 5 failures, probe after 30s, close after 3 successes). When open, the call is skipped and the callback's value is returned, or a `CircuitBreakerOpenError` is raised.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Disclaimer: We value your time and bandwidth. As such, any pull requests created on non-triaged issues might not be successful.