
Retry connect() on SFU full by requesting a different SFU #222

Open
aliev wants to merge 19 commits into main from feat/retry-connect-on-sfu-full

Conversation


@aliev aliev commented Mar 17, 2026

Why

When an SFU server is full, connect() immediately fails with SfuConnectionError: WebSocket connection failed: Connection failed: server is full. Instead of failing, we should ask the coordinator for a different SFU — matching the behavior of the JS SDK (stream-video-js#2121).

Changes

  • Preserve SFU error code in SignalingError (previously discarded, only the message string was kept)
  • Add SfuJoinError exception with error code detection for retryable SFU errors (SFU_FULL, SFU_SHUTTING_DOWN, CALL_PARTICIPANT_LIMIT_REACHED)
  • Pass migrating_from and migrating_from_list to coordinator join request so it excludes full SFUs
  • Add retry loop in connect() with exponential backoff (0.5s, 1s, 2s) — on retryable error, request a new SFU from the coordinator
  • Remove dead string-matching retry code (_RETRYABLE_ERROR_PATTERNS, _is_retryable())
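The retry flow described in these changes can be sketched roughly as follows. This is a minimal, self-contained illustration based on the PR description only; SfuJoinError's shape, the connect_internal callable, and the max_join_retries parameter are stand-ins, not the SDK's exact API.

```python
import asyncio

# Hypothetical sketch of the retry loop: on a retryable SFU rejection,
# record the failed SFU and retry with exponential backoff so the
# coordinator can assign a different SFU.

class SfuJoinError(Exception):
    def __init__(self, sfu_id, should_retry):
        super().__init__(f"join rejected by {sfu_id}")
        self.sfu_id = sfu_id
        self.should_retry = should_retry

async def connect(connect_internal, max_join_retries=3, delays=(0.5, 1.0, 2.0)):
    """Join a call; on a retryable SFU error, ask for a different SFU."""
    migrating_from_list = []  # SFUs the coordinator should exclude
    for attempt in range(max_join_retries + 1):
        try:
            return await connect_internal(migrating_from_list)
        except SfuJoinError as exc:
            if not exc.should_retry or attempt == max_join_retries:
                raise  # fatal error, or retries exhausted
            migrating_from_list.append(exc.sfu_id)
            await asyncio.sleep(delays[min(attempt, len(delays) - 1)])
```

Each failed SFU is appended to migrating_from_list before the next attempt, so every retry carries the full exclusion history to the coordinator.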

Summary by CodeRabbit

Release Notes

  • New Features

    • Added automatic retry logic for video server connections with exponential backoff to gracefully handle temporary service outages and server overload
    • Configurable maximum number of retry attempts for connection establishment
    • Improved server selection logic that learns from previous failed connections
  • Improvements

    • Enhanced error detection and reporting for connection failures
    • Better connection state cleanup between retry attempts

aliev added 5 commits March 17, 2026 11:45
When the SFU returns an error (e.g. SFU_FULL), the error code was
discarded and only the message string was kept. This preserves the
full error object so downstream retry logic can inspect the code.
…cket

connect_websocket now distinguishes retryable SFU errors (SFU_FULL,
SFU_SHUTTING_DOWN, CALL_PARTICIPANT_LIMIT_REACHED) from fatal ones,
raising SfuJoinError so the retry loop in connect() can request a
different SFU from the coordinator.
Allows the coordinator to exclude full/failed SFUs when assigning
a new SFU for the participant, by passing migrating_from and
migrating_from_list in the join call body.
When an SFU rejects a join with a retryable error (SFU_FULL,
SFU_SHUTTING_DOWN, CALL_PARTICIPANT_LIMIT_REACHED), connect() now
retries by asking the coordinator for a different SFU via
migrating_from_list, instead of immediately failing.
_RETRYABLE_ERROR_PATTERNS and _is_retryable() used string matching
and were never called. Retry logic now uses SFU error codes via
SfuJoinError and _RETRYABLE_SFU_ERROR_CODES.
@coderabbitai

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

This change introduces retry capabilities for SFU joins when nodes are full. It implements exponential backoff-based retries, tracks failed SFUs to avoid during subsequent attempts, and propagates migration history through connection retry attempts with comprehensive error handling.

Changes

Cohort / File(s) — Summary

  • Retry Logic & State Management — getstream/video/rtc/connection_manager.py
    Added max_join_retries configuration, an exponential backoff retry loop in connect(), migrating_from tracking through _connect_internal, and a _handle_join_failure method for state cleanup and failed-SFU tracking.
  • Error Handling & Join Parameters — getstream/video/rtc/connection_utils.py
    Introduced the SfuJoinError exception with error_code and should_retry attributes; updated join_call_coordinator_request to accept migrating_from parameters; enhanced connect_websocket with retryable-error detection, raising SfuJoinError on a SignalingError with a retryable code; replaced pattern-based retry logic with a set of retryable error codes.
  • Error Enrichment — getstream/video/rtc/signaling.py
    Enhanced SignalingError with an __init__ accepting an optional error parameter; updated WebSocketClient.connect to extract and attach nested SFU error objects to SignalingError for better error-context propagation.
  • Test Coverage — tests/test_connection_manager.py, tests/test_connection_utils.py, tests/test_signaling.py
    Added unit tests for retry behavior on SfuJoinError with migrating_from_list validation, exhaustion of max_join_retries, non-retryable error propagation, WS client cleanup, and error-code preservation in SignalingError.
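The two error types summarized above might look roughly like this. The attribute names (error, error_code, should_retry) follow the walkthrough, but the exact SDK code may differ; the numeric codes are those named in the PR description.

```python
# Hypothetical shapes of the enriched error types described in the walkthrough.

_RETRYABLE_SFU_ERROR_CODES = {
    700,  # ERROR_CODE_SFU_FULL
    600,  # ERROR_CODE_SFU_SHUTTING_DOWN
    301,  # ERROR_CODE_CALL_PARTICIPANT_LIMIT_REACHED
}

class SignalingError(Exception):
    """Raised by the WS layer; now carries the full SFU error object."""
    def __init__(self, message, error=None):
        super().__init__(message)
        self.error = error  # previously discarded; only the message survived

class SfuJoinError(Exception):
    """Raised by connect_websocket when the SFU rejects the join."""
    def __init__(self, message, error_code):
        super().__init__(message)
        self.error_code = error_code
        self.should_retry = error_code in _RETRYABLE_SFU_ERROR_CODES
```

Deriving should_retry from the error code at construction time lets the retry loop in connect() branch on a single boolean instead of re-matching codes.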

Sequence Diagram

sequenceDiagram
    participant Client
    participant ConnectionManager
    participant SFU as SFU (Full)
    participant SFU2 as SFU (Alt)
    participant Backoff as Exponential Backoff

    Client->>ConnectionManager: connect()
    activate ConnectionManager
    
    Note over ConnectionManager: Attempt 1
    ConnectionManager->>SFU: _connect_internal(migrating_from=None)
    SFU-->>ConnectionManager: SfuJoinError(should_retry=True)
    
    Note over ConnectionManager: Check if retryable
    alt should_retry == True
        ConnectionManager->>ConnectionManager: Track failed SFU<br/>migrating_from_list = [SFU]
        ConnectionManager->>Backoff: Calculate wait time<br/>(exponential backoff)
        Backoff-->>ConnectionManager: wait_ms
        
        Note over ConnectionManager: Attempt 2
        ConnectionManager->>SFU2: _connect_internal(migrating_from_list=[SFU])
        SFU2-->>ConnectionManager: Success
        ConnectionManager-->>Client: Connected
    else should_retry == False
        ConnectionManager-->>Client: Raise SfuJoinError
    end
    
    deactivate ConnectionManager

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hops through failures with grace,
Retry logic sets the pace,
Failed SFUs tracked and noted,
Exponential backoff denoted,
Connections now more robust and spaced! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 74.29%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title clearly and specifically summarizes the main change: adding retry logic to connect() when an SFU is full by requesting a different SFU, which matches the core objective of the changeset.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/test_connection_manager.py (2)

12-29: Prefer a fixture-backed ConnectionManager setup here.

This helper patches every collaborator up front, so the retry tests mostly validate a mocked shell instead of the concrete retry plumbing added in this PR. A shared fixture with lightweight fakes would give stronger coverage and better match the repo's test conventions. As per coding guidelines, **/test_*.py: Use fixtures to inject objects in tests and Do not use mocks or mock objects in tests unless directly requested.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_connection_manager.py` around lines 12 - 29, The helper
_make_connection_manager currently patches every collaborator
(PeerConnectionManager, NetworkMonitor, ReconnectionManager, RecordingManager,
SubscriptionManager, ParticipantsState, Tracer) which over-mocks
ConnectionManager and masks the real retry behavior; replace this with a fixture
that constructs a ConnectionManager using lightweight fakes/stubs for only the
external dependencies needed for retry testing (e.g., a FakeReconnectionManager
or FakeNetworkMonitor) and register that fixture for tests that call
_make_connection_manager; update tests to use the new fixture instead of calling
_make_connection_manager so the concrete ConnectionManager.retry/join logic (in
ConnectionManager) is exercised while avoiding heavyweight mocks.

62-62: Avoid the real backoff delays in these unit tests.

connect() now retries via exp_backoff(..., sleep=True), so these assertions pay the real 0.5s, 1s, etc. waits. That makes the suite slower and less deterministic than it needs to be. Swap in an immediate backoff/sleeper for the tests.

Also applies to: 91-91, 109-109, 141-141

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_connection_manager.py` at line 62, Tests are incurring real delays
because connect() calls exp_backoff(..., sleep=True); in the test setup patch
exp_backoff (or the module-level sleep used by it) to a no-op/immediate variant
so retries don't actually await. Concretely, in tests/test_connection_manager.py
before calling await cm.connect() replace the exp_backoff import used by the
ConnectionManager with a stub that calls exp_backoff(..., sleep=False) or
returns immediately (or monkeypatch asyncio.sleep to an async lambda that
returns None) so backoff waits are skipped; do the same for the other
occurrences around lines 91, 109, and 141 by applying the same patch/fixture.
Ensure you patch the symbol the code under test imports (the exp_backoff name
used by connect()) so the replacement takes effect.
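One way to act on this suggestion is to make the sleeper injectable so tests can swap in a no-op, as sketched below. retry_with_backoff is a hypothetical stand-in for connect(); in the real suite the same effect comes from patching the exp_backoff or asyncio.sleep symbol the code under test imports.

```python
import asyncio
import time

# Sketch: accept a sleep coroutine so unit tests skip real backoff waits.
async def retry_with_backoff(attempt_fn, delays=(0.5, 1.0, 2.0), sleep=asyncio.sleep):
    for delay in delays:
        if await attempt_fn():
            return True
        await sleep(delay)  # tests inject an instant sleeper here
    return await attempt_fn()

async def instant_sleep(_delay):
    """No-op replacement for asyncio.sleep, for use in tests."""
    return None
```

With instant_sleep injected, a four-attempt exhaustion test finishes in milliseconds instead of the 3.5 seconds the real delays would cost.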
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@getstream/video/rtc/connection_utils.py`:
- Around line 461-472: The retryable SignalingError branch can leak the
temporary WebSocketClient started by ws_client.connect(); before raising
SfuJoinError you must cleanly close or shutdown that local ws_client instance so
its background thread/socket is not leaked (the cleanup in _handle_join_failure
only touches self._ws_client, not the local variable). Update the except
SignalingError block to call the ws_client's close/disconnect/shutdown method
(the same method used elsewhere to tear down self._ws_client) on the local
ws_client before raising SfuJoinError when e.error.code is in
_RETRYABLE_SFU_ERROR_CODES, ensuring no resource/thread is left running.
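The fix this comment asks for could look roughly like the following: close the locally created client before converting the retryable SignalingError into SfuJoinError. All names here are illustrative reconstructions from the review text, not the file's actual code.

```python
import asyncio

_RETRYABLE_SFU_ERROR_CODES = {700, 600, 301}

class SignalingError(Exception):
    def __init__(self, message, error=None):
        super().__init__(message)
        self.error = error

class SfuJoinError(Exception):
    def __init__(self, message, error_code):
        super().__init__(message)
        self.error_code = error_code

async def connect_websocket(ws_client):
    try:
        return await ws_client.connect()
    except SignalingError as e:
        code = getattr(e.error, "code", None)
        if code in _RETRYABLE_SFU_ERROR_CODES:
            await ws_client.close()  # tear down the partial connection first
            raise SfuJoinError(str(e), error_code=code) from e
        raise
```

Closing before raising matters because the local ws_client is not yet assigned to self._ws_client, so the later _handle_join_failure cleanup never sees it.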


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d50b4fa-9c75-46c8-a3c2-2e991e5b7faa

📥 Commits

Reviewing files that changed from the base of the PR and between 64754d9 and 5af1ac6.

📒 Files selected for processing (8)
  • getstream/video/rtc/connection_manager.py
  • getstream/video/rtc/connection_utils.py
  • getstream/video/rtc/coordinator/backoff.py
  • getstream/video/rtc/signaling.py
  • tests/rtc/coordinator/test_backoff.py
  • tests/test_connection_manager.py
  • tests/test_connection_utils.py
  • tests/test_signaling.py

@aliev aliev changed the title from "feat: retry connect() on SFU full by requesting a different SFU (WIP)" to "feat: retry connect() on SFU full by requesting a different SFU" on Mar 17, 2026
WebSocketClient starts a background thread on creation. If connect()
fails (e.g. SFU full), the client was not being closed, leaking the
thread.
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
getstream/video/rtc/connection_utils.py (1)

92-96: Replace hardcoded SFU error literals with protobuf constants.

The set uses raw error codes (700, 600, 301) instead of the protobuf-generated constants already available from models_pb2. Using named constants like ERROR_CODE_SFU_FULL, ERROR_CODE_SFU_SHUTTING_DOWN, and ERROR_CODE_CALL_PARTICIPANT_LIMIT_REACHED provides safer coupling to the protobuf definitions and improves maintainability.

♻️ Proposed refactor
 from getstream.video.rtc.pb.stream.video.sfu.models.models_pb2 import (
+    ERROR_CODE_CALL_PARTICIPANT_LIMIT_REACHED,
+    ERROR_CODE_SFU_FULL,
+    ERROR_CODE_SFU_SHUTTING_DOWN,
     TRACK_TYPE_AUDIO,
     TRACK_TYPE_VIDEO,
     ClientDetails,
@@
 _RETRYABLE_SFU_ERROR_CODES = {
-    700,  # ERROR_CODE_SFU_FULL
-    600,  # ERROR_CODE_SFU_SHUTTING_DOWN
-    301,  # ERROR_CODE_CALL_PARTICIPANT_LIMIT_REACHED
+    ERROR_CODE_SFU_FULL,
+    ERROR_CODE_SFU_SHUTTING_DOWN,
+    ERROR_CODE_CALL_PARTICIPANT_LIMIT_REACHED,
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@getstream/video/rtc/connection_utils.py` around lines 92 - 96, Replace the
hardcoded numeric SFU error codes in the _RETRYABLE_SFU_ERROR_CODES set with the
protobuf-generated constants from models_pb2: use
models_pb2.ERROR_CODE_SFU_FULL, models_pb2.ERROR_CODE_SFU_SHUTTING_DOWN, and
models_pb2.ERROR_CODE_CALL_PARTICIPANT_LIMIT_REACHED; ensure models_pb2 is
imported in the module and update the _RETRYABLE_SFU_ERROR_CODES set to
reference those symbols instead of 700, 600, 301.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@getstream/video/rtc/connection_utils.py`:
- Around line 92-96: Replace the hardcoded numeric SFU error codes in the
_RETRYABLE_SFU_ERROR_CODES set with the protobuf-generated constants from
models_pb2: use models_pb2.ERROR_CODE_SFU_FULL,
models_pb2.ERROR_CODE_SFU_SHUTTING_DOWN, and
models_pb2.ERROR_CODE_CALL_PARTICIPANT_LIMIT_REACHED; ensure models_pb2 is
imported in the module and update the _RETRYABLE_SFU_ERROR_CODES set to
reference those symbols instead of 700, 600, 301.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f80bb44-f7b9-4bf6-81ec-175d2b8c6c0b

📥 Commits

Reviewing files that changed from the base of the PR and between 5af1ac6 and 9db988d.

📒 Files selected for processing (1)
  • getstream/video/rtc/connection_utils.py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_connection_manager.py`:
- Around line 18-35: The helper _make_connection_manager currently constructs
ConnectionManager using broad patch(...) calls and a MagicMock call object;
replace this with a pytest fixture that injects minimal real or lightweight test
doubles for the required dependencies and a simple call object, and remove the
blanket patch scaffolding: create a fixture (e.g., connection_manager or
cm_factory) that instantiates ConnectionManager(call=SimpleCall(id="test_call",
call_type="default", user_id="user1"), user_id="user1", max_join_retries=...)
and pass in only the specific mocked external funcs or classes via pytest
monkeypatch if absolutely necessary; update tests to use the fixture instead of
_make_connection_manager and remove references to
patch("getstream.video.rtc.connection_manager.PeerConnectionManager") etc.,
keeping mocking targeted to external side-effects only.
- Around line 124-152: Update the test_cleans_up_ws_client_between_retries test
to assert that the partial WebSocket client is properly closed and cleared
before retry: when the first mock_connect_internal simulates a partial WS
connection by assigning cm._ws_client = MagicMock(), ensure you assert
cm._ws_client.close() was awaited/called and that cm._ws_client is set to None
prior to the second attempt; place these assertions either by checking the
MagicMock's close call after await cm.connect() or by injecting verification
into the mock flow so that connect() behavior (using _connect_internal,
_connect_coordinator_ws, and SfuJoinError) confirms the cleanup occurred.
- Around line 49-77: The test currently appends the mutable migrating_from_list
reference to received_migrating_from_list causing later mutations to affect
earlier snapshots; change the append to store an immutable snapshot (e.g.,
append a shallow copy or tuple) so each attempt's state is preserved — update
the line using received_migrating_from_list.append(migrating_from_list) to
append a copy (e.g., list(migrating_from_list) or tuple(migrating_from_list)) so
assertions against received_migrating_from_list[0..2] reflect the per-attempt
values.
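The aliasing bug this comment describes is easy to reproduce in isolation: appending the live list stores a reference, so later mutations silently rewrite earlier "snapshots". The variable names below echo the test but are illustrative.

```python
received = []
migrating_from_list = ["sfu-1"]

received.append(migrating_from_list)        # aliased: shares the same object
received.append(list(migrating_from_list))  # independent per-attempt snapshot

migrating_from_list.append("sfu-2")         # mutation after recording
# received[0] now reflects the mutation; received[1] is frozen at record time
```

This is why the review suggests list(migrating_from_list) or tuple(migrating_from_list) at each append site.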

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dabe67c3-7b85-4df0-a6ea-435febce94d5

📥 Commits

Reviewing files that changed from the base of the PR and between 9db988d and 857d3c7.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • tests/test_connection_manager.py

aliev added 4 commits March 17, 2026 16:14
Replace helper method with a shared fixture that handles dependency
patching, backoff mocking, and coordinator setup in one place.
…t in test_connection_manager

Extract mock_ws_client and coordinator_request fixtures to reduce
duplication. Snapshot migrating_from_list per attempt to avoid
mutable reference aliasing in assertions.
Keeps exp_backoff as a pure delay generator, consistent with how
coordinator/ws.py uses it. The caller (connect()) handles
asyncio.sleep explicitly.
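Keeping exp_backoff as a pure delay generator might look like this; the actual signature in coordinator/backoff.py may differ, so treat this as a sketch of the design choice, not the real code.

```python
# Hypothetical pure delay generator: yields delays, never sleeps itself.
# The caller (e.g. connect()) awaits asyncio.sleep(delay) explicitly,
# which keeps the generator trivially testable without patching time.
def exp_backoff(base=0.5, factor=2.0, max_attempts=3):
    """Yield successive exponentially growing delays in seconds."""
    delay = base
    for _ in range(max_attempts):
        yield delay
        delay *= factor
```

Separating delay computation from sleeping is what lets the unit tests above drive retries instantly while production code sleeps for real.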