feat: add exponential backoff retry for transient SDK errors (rebased #127) #170

Merged
RichardAtCT merged 4 commits into main from pr-127-retry-rebase
Mar 30, 2026
@RichardAtCT
Owner

Summary

  • Rebased PR #127 (feat: add exponential backoff retry for transient SDK errors) by @haripatel07 onto current main, resolving merge conflicts with the interrupt support feature
  • Adds exponential backoff retry (1s → 3s → 9s, capped at 30s) for transient CLIConnectionError in execute_command()
  • Excludes MCP config errors from retry and does not retry asyncio.TimeoutError, so user-configured timeouts are respected
  • Interrupt handling is preserved inside the retry loop: user interrupts break out immediately

Conflict Resolution

  • Merged the retry loop with the interrupt_event/interrupt_watcher pattern from main
  • Each retry attempt gets fresh interrupt watcher setup
  • messages.clear() at the top of the loop keeps partial messages from a failed attempt out of the next attempt
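
For reviewers who want the shape of the merged loop without opening the diff, here is a minimal sketch of the pattern described above. It is not the code in src/claude/sdk_integration.py: `run_with_retry`, `TransientError`, `attempt_fn` and the return shape are hypothetical names chosen for illustration, and treating the attempt limit as a retry count is an assumption.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


class TransientError(Exception):
    """Hypothetical stand-in for a retryable error such as CLIConnectionError."""


async def _watch(interrupt_event: asyncio.Event, work: "asyncio.Task") -> None:
    # Cancel the in-flight attempt as soon as the user interrupts.
    await interrupt_event.wait()
    work.cancel()


async def run_with_retry(
    attempt_fn: Callable[[List[str]], Awaitable[None]],
    interrupt_event: asyncio.Event,
    max_attempts: int = 3,  # assumed to count retries; 0 disables retrying
    base_delay: float = 1.0,
    backoff_factor: float = 3.0,
    max_delay: float = 30.0,
) -> Tuple[List[str], bool]:
    """Return (messages, interrupted). Re-raises once retries are exhausted."""
    messages: List[str] = []
    retries = 0
    while True:
        messages.clear()  # drop partial output left by a failed attempt
        work = asyncio.create_task(attempt_fn(messages))
        watcher = asyncio.create_task(_watch(interrupt_event, work))  # fresh per attempt
        try:
            await work
            return messages, False
        except asyncio.CancelledError:
            if interrupt_event.is_set():
                return messages, True  # user interrupt: stop, do not retry
            raise  # cancelled from outside: propagate
        except TransientError:
            if retries >= max_attempts:
                raise
            delay = min(base_delay * backoff_factor ** retries, max_delay)
            retries += 1
        finally:
            watcher.cancel()
        # Wait out the backoff, but leave early if the user interrupts.
        try:
            await asyncio.wait_for(interrupt_event.wait(), timeout=delay)
            return messages, True
        except asyncio.TimeoutError:
            pass


async def demo() -> Tuple[List[str], bool]:
    calls = 0

    async def flaky(messages: List[str]) -> None:
        nonlocal calls
        calls += 1
        messages.append(f"partial output from attempt {calls}")
        if calls < 3:
            raise TransientError("connection dropped")
        messages.append("done")

    return await run_with_retry(flaky, asyncio.Event(), base_delay=0.01)


print(asyncio.run(demo()))
```

In the demo the first two attempts fail after writing partial output, and only the messages from the third attempt survive. Waiting on the interrupt event during backoff, rather than a plain sleep, is one way to keep interrupts responsive between attempts; whether the merged code does this is not stated here.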

Closes #60
Original PR: #127

Test plan

  • 521 tests passing after rebase
  • Retry tests verify backoff math, MCP exclusion, timeout bypass

Closes #60 - adds configurable retry logic to ClaudeSDKManager.execute_command()
for transient CLIConnectionError failures (non-MCP).

Changes:
- src/utils/constants.py: 4 new retry default constants
- src/config/settings.py: 4 new settings fields (claude_retry_max_attempts,
  claude_retry_base_delay, claude_retry_backoff_factor, claude_retry_max_delay)
- src/claude/sdk_integration.py: _is_retryable_error() helper + retry loop
  wrapping asyncio.wait_for() in execute_command()
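
As a rough illustration of how the four settings fit together, the sketch below models them as a plain dataclass. The field names come from the list above; everything else is an assumption: the `RetrySettings` class, the `from_env` helper, the environment variable names other than CLAUDE_RETRY_MAX_ATTEMPTS, and the default values (a base of 1s, factor of 3 and cap of 30s are inferred from the documented 1s → 3s → 9s schedule, and 3 attempts is a guess). The real fields live in src/config/settings.py and may be declared differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RetrySettings:
    """Hypothetical container for the four retry settings."""

    claude_retry_max_attempts: int = 3        # assumed default; 0 disables retry
    claude_retry_base_delay: float = 1.0      # seconds, inferred from the schedule
    claude_retry_backoff_factor: float = 3.0  # inferred from 1s -> 3s -> 9s
    claude_retry_max_delay: float = 30.0      # seconds, the documented cap

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "RetrySettings":
        # Env var names are the upper-cased field names (an assumption).
        d = cls()
        return cls(
            claude_retry_max_attempts=int(
                env.get("CLAUDE_RETRY_MAX_ATTEMPTS", d.claude_retry_max_attempts)
            ),
            claude_retry_base_delay=float(
                env.get("CLAUDE_RETRY_BASE_DELAY", d.claude_retry_base_delay)
            ),
            claude_retry_backoff_factor=float(
                env.get("CLAUDE_RETRY_BACKOFF_FACTOR", d.claude_retry_backoff_factor)
            ),
            claude_retry_max_delay=float(
                env.get("CLAUDE_RETRY_MAX_DELAY", d.claude_retry_max_delay)
            ),
        )

    @property
    def retries_enabled(self) -> bool:
        return self.claude_retry_max_attempts > 0


print(RetrySettings.from_env({"CLAUDE_RETRY_MAX_ATTEMPTS": "0"}).retries_enabled)
```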

Retry decision:
- CLIConnectionError (non-MCP): retried with exponential backoff
- asyncio.TimeoutError: not retried (user-configured timeout, intentional)
- CLINotFoundError, ProcessError, CLIJSONDecodeError: not retried
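
The decision table above can be sketched as a small predicate. This is not the actual _is_retryable_error() helper: the exception classes are local stand-ins so the example runs without the SDK, the class hierarchy shown is an assumption, and detecting MCP errors by looking for "mcp" in the message is a guess at one possible heuristic.

```python
import asyncio


# Local stand-ins for the SDK exception types (hierarchy is an assumption).
class CLIConnectionError(Exception): ...
class CLINotFoundError(CLIConnectionError): ...
class ProcessError(Exception): ...
class CLIJSONDecodeError(Exception): ...


def is_retryable_error(exc: BaseException) -> bool:
    """Return True only for transient, non-MCP connection errors."""
    if isinstance(exc, asyncio.TimeoutError):
        return False  # user-configured timeout, intentional
    if isinstance(exc, CLINotFoundError):
        return False  # checked first in case it subclasses CLIConnectionError
    if isinstance(exc, CLIConnectionError):
        # Hypothetical MCP check: treat any message mentioning MCP as a config error.
        return "mcp" not in str(exc).lower()
    return False  # ProcessError, CLIJSONDecodeError and anything else


print(is_retryable_error(CLIConnectionError("connection reset by peer")))
```

Defaulting to "not retryable" keeps the retry path narrow: only the one error class known to be transient is ever retried.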

Default backoff: 1s → 3s → 9s, capped at 30s (CLAUDE_RETRY_MAX_ATTEMPTS=0 disables)
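
The schedule above is consistent with a base delay of 1s, a factor of 3 and a 30s cap. A minimal sketch of that arithmetic follows; `backoff_delay` is a hypothetical name, and the defaults are inferred from the schedule rather than read from src/utils/constants.py.

```python
def backoff_delay(
    attempt: int,
    base_delay: float = 1.0,
    backoff_factor: float = 3.0,
    max_delay: float = 30.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based), capped at max_delay."""
    return min(base_delay * backoff_factor ** attempt, max_delay)


print([backoff_delay(i) for i in range(5)])  # [1.0, 3.0, 9.0, 27.0, 30.0]
```

The cap only takes effect from the fifth retry onward (81s uncapped), so with a small attempt limit it mainly guards against larger user-configured factors.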

Tests: 491 passed, 0 failed
@RichardAtCT RichardAtCT merged commit 77c3056 into main Mar 30, 2026
1 check passed
Development

Successfully merging this pull request may close these issues.

Add retry logic for transient network errors in Claude SDK calls

2 participants