Skip to content

fix(helpers): honor x-should-retry header in runner-helper retry classifier#1692

Open
WalkingDreams798 wants to merge 1 commit into
anthropics:mainfrom
WalkingDreams798:fix/runner-retry-honor-x-should-retry
Open

fix(helpers): honor x-should-retry header in runner-helper retry classifier#1692
WalkingDreams798 wants to merge 1 commit into
anthropics:mainfrom
WalkingDreams798:fix/runner-retry-honor-x-should-retry

Conversation

@WalkingDreams798

Copy link
Copy Markdown

What

is_fatal_status_error in lib/_retry.py — used by the environments poller, the session tool runner, and the worker heartbeat to decide whether a failure is worth retrying — classified errors purely by status code and ignored the server's x-should-retry response header. Its docstring nonetheless claims it "Aligns with the core client's _should_retry policy", and the core client (_base_client._should_retry) honors that header first.

Impact

Two divergences from the core client's retry behavior:

  • a 4xx with x-should-retry: true → treated as fatal, so the runner stops, even though the server explicitly asked to retry;
  • a 429 (or 5xx) with x-should-retry: false → retried, even though the server explicitly asked not to.

Repro

import httpx
from anthropic._exceptions import APIStatusError
from anthropic.lib._retry import is_fatal_status_error

r = httpx.Response(400, headers={"x-should-retry": "true"}, request=httpx.Request("POST", "https://x"))
is_fatal_status_error(APIStatusError("e", response=r, body=None))
# -> True   (the poller stops; the core client would RETRY)

Change

Honor x-should-retry first — true is never fatal, false is always fatal, regardless of status code — then fall back to the existing 4xx-code logic. This makes the classifier match the documented intent and the core client.

Tests

Added tests/lib/test_retry.py: plain 4xx fatal, transient 4xx (408/409/429) not fatal, 5xx not fatal, x-should-retry override in both directions, and non-status errors (transport/connection) not fatal.

Verification

  • uv run pytest tests/lib/test_retry.py → 15 passed
  • uv run ruff check → all checks passed
  • uv run pyright (strict) → 0 errors, 0 warnings

…sifier

`is_fatal_status_error` (used by the environments poller, session tool runner,
and worker heartbeat to decide whether a failure is worth retrying) classified
errors purely by status code and ignored the server's `x-should-retry` response
header — even though its docstring claims it "aligns with the core client's
_should_retry policy". The core client (`_base_client._should_retry`) honors the
header first.

This caused two divergences from the core client:
- a 4xx carrying `x-should-retry: true` was treated as fatal (the runner stops)
  when the server explicitly asked to retry;
- a 429 (or 5xx) carrying `x-should-retry: false` was retried when the server
  explicitly asked not to.

Honor the header first (`true` -> not fatal, `false` -> fatal, regardless of
status), then fall back to the existing 4xx code logic. Added unit tests.
@WalkingDreams798 WalkingDreams798 requested a review from a team as a code owner June 19, 2026 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant