fix(reconnect): listener bug fixes + transport factory policy hook#45
Conversation
lukeocodes
left a comment
There was a problem hiding this comment.
thanks for chasing this down, the diagnosis is right. but this shouldn't land in core. the whole point of the transport seam in #29 was so SageMaker stays invisible to the SDK. transportFactory() was meant to be the entire surface area, and looking at the plugin side over on deepgram/deepgram-java-sdk-transport-sagemaker#14 I think we can keep it that way.
what's happening here is storm handling that belongs in SageMakerTransport is bleeding upward into the generated WS clients via a boolean flag and four near-identical 75-line listener implementations. the freeze list goes from 1 file to 5, and every future regen has to reconcile four duplicated anonymous listeners.
the 4-second connectionFuture.get(4000, MILLISECONDS) ceiling and the maxRetries = Integer.MAX_VALUE default in ReconnectingWebSocketListener are both real, but they're our bugs in that file. fix them there:
maxRetries(0)should mean "connect once, don't retry", not "refuse to connect"- promote the hardcoded
4000to a configurableconnectionTimeoutMsonReconnectOptions
then the plugin sets its own policy via webSocketReconnectOptions(...) and core stays untouched. no boolean on ClientOptions, no edits to generated WS clients, no new .fernignore entries past freezing the listener itself.
the storm itself shouldn't even reach ReconnectingWebSocketListener in normal operation. AWS ThrottlingException / pool exhaustion / connect timeouts should be classified and absorbed inside SageMakerTransport.ensureConnected() before they ever fire transport.onError(...). that's plugin-side work, I've left the detail on #14.
asks:
- revert the four
resources/.../websocket/*WebSocketClient.javaedits - revert the four new
.fernignoreentries - drop
ClientOptions.reconnect(boolean)and theif (transportFactory != null) builder.reconnect(false)heuristic - fix
ReconnectingWebSocketListenerinstead, freeze just that file - let the plugin declare its own
ReconnectOptions. adefault ReconnectOptions reconnectOptions()onDeepgramTransportFactorywould do it cleanly, that file is permanently frozen anyway
with the plugin-side change in #14, customer outcome is the same, one freeze entry instead of four, and transportFactory() stays the only thing core knows about SageMaker.
…uto-wire transport factory policy Reworks the Layer-4 patch (#45) per @lukeocodes' review. Replaces the previous ClientOptions.reconnect(boolean) flag and the four duplicated WebSocketClient listeners with the underlying bug fixes in ReconnectingWebSocketListener plus a clean policy-declaration hook on DeepgramTransportFactory. Reverts (no edits to generated WS clients, no .fernignore freeze inflation): - ClientOptions.reconnect(boolean) field/getter/builder method. - if (transportFactory != null) builder.reconnect(false) auto-disable heuristic in both DeepgramClientBuilder and AsyncDeepgramClientBuilder. - The four resources/.../websocket/*WebSocketClient.java listener duplications and their .fernignore freeze entries. Bug fixes in ReconnectingWebSocketListener (now temporarily frozen): - maxRetries(0) used to refuse to connect because the gate compared retryCount.get() >= maxRetries on the initial attempt (retryCount starts at 0). The initial attempt is not a retry; the gate is now > maxRetries so the count caps retries only. maxRetries(0) means "1 attempt, no retries" as the API name implies. - The 4000 ms connectionFuture.get(...) timeout is now configurable via a new connectionTimeoutMs field on ReconnectOptions (default 4000, validated > 0). The hardcoded literal is gone. - Adds applyOptionsOverride(ReconnectOptions) so a transport-factory wrapper can rewire policy on the listener after construction. Used by TransportWebSocketFactory to honour the new factory hook below without requiring edits to the per-resource WebSocketClient classes. Plugin policy declaration: - DeepgramTransportFactory (already permanently frozen) gains a default ReconnectOptions reconnectOptions() method returning null. Plugins managing their own connection lifecycle (e.g. SageMaker) override this to return ReconnectOptions.builder().maxRetries(0).build() so the SDK's wrapper-level reconnect doesn't compound their internal retries into a Throttling-on-Throttling storm under burst load. - TransportWebSocketFactory.create() applies factory.reconnectOptions() to the wrapping ReconnectingWebSocketListener via the new override hook. Auto-wiring with no edits to generated code. Tests: - New ReconnectingWebSocketListenerTest covers the maxRetries(0) regression, connectionTimeoutMs default + customisation + validation, and applyOptionsOverride no-op + retry-gate behaviour. Net diff vs the prior PR head: 4 files changed (+94/−10) instead of 8 files (+364/−24). Freeze list grows by 1 (the listener), shrinks by 4 (the WS clients). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ent pending queue Bundles the four storm-handling asks from @lukeocodes' review of #14 into the existing PR. The plugin now owns end-to-end retry/backoff/classification for transient AWS failures so they never reach transport.onError(...) and the SDK's wrapper-level reconnect can be disabled via the new DeepgramTransportFactory.reconnectOptions() hook (paired with the SDK fix in deepgram/deepgram-java-sdk#45). SageMakerConfig — new retry knobs (defaults tuned for high-burst workloads): - maxRetries = 5 (set 0 to disable internal retry) - initialBackoff = 100 ms - maxBackoff = 5 s - backoffMultiplier = 2.0 (validated >= 1.0) - retryBudget = 30 s wall-clock cap across all retries - build() rejects initialBackoff > maxBackoff SageMakerTransport — internal storm absorption: - ensureConnected() wraps attemptConnect() in a budgeted retry loop with exponential backoff. Successful subscription resets the budget. - handleStreamError(): the response-handler onError gate from line 100 is now a classify-and-decide step. RETRYABLE + budget left triggers an internal reset (cancel future, complete publisher, mark disconnected) so the next send re-enters the loop on a fresh stream. TERMINAL or budget-exhausted surfaces to errorListeners as before. - classify(Throwable) walks the cause chain. Retryable: TimeoutException, ConnectException, IOException, AwsServiceException with status 429 or 5xx or error code containing "throttl", and SdkException whose message contains "acquire"/"pool"/"throttl"/"timeout". Terminal: 4xx other than 429, anything unmatched (default-deny for retry). - StreamPublisher.pending is hoisted up to a SageMakerTransport instance field so events queued during an internal reset survive across the publisher swap. The new publisher drains the shared queue on subscribe. - awaitSubscription now returns the boolean. Timeout throws TimeoutException, which flows into the retry loop as RETRYABLE — the new subscriptionTimeout knob now actually fails fast as advertised. SageMakerTransportFactory: - Overrides the new reconnectOptions() default to return ReconnectOptions.builder().maxRetries(0).build(). Combined with the SDK's auto-wiring in TransportWebSocketFactory, this disables the SDK's wrapper-level reconnect for any per-resource WebSocket client constructed against this factory — no user wiring required. build.gradle: - Bumps deepgram-java-sdk dep from 0.2.1 to 0.3.0 to pick up the new DeepgramTransportFactory.reconnectOptions() default method, ReconnectOptions.connectionTimeoutMs, the maxRetries(0) bug fix, and the applyOptionsOverride hook. Tests: - New SageMakerTransportRetryTest covers classify() across all branches (Timeout/Connect/IOException; AWS 401/403/429/5xx/throttling code; SdkException pool-keyword; cause-chain walking; default-terminal) and StreamPublisher behaviour (awaitSubscription false-on-timeout, true-on-subscribe; pending-queue persistence across publisher instances; immediate forward when subscriber present). - SageMakerTransportFactoryTest gains: retry-config defaults + customisation + validation, initialBackoff > maxBackoff rejection, and factoryDeclaresMaxRetriesZeroForReconnectOptions verifying the storm-suppression policy is published correctly. README: - Bumps the SDK dep version reference and notes the v0.3.0+ requirement. - Adds the new retry knobs to the configuration table. - New "Retry & storm absorption" section explaining the internal classification/budget design and how factory.reconnectOptions() auto-wires the SDK's reconnect-disable. End-to-end mock-AWS retry coverage isn't included here — the AWS reactive streams handler indirection makes deterministic stubbing fragile. Those paths are exercised by the existing burst test described in the PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Updated PR based on comments |
…for custom transports
Adds `ClientOptions.Builder.reconnect(boolean)` (default `true`). When
`false`, the per-resource WebSocket clients (`listen/v1`, `listen/v2`,
`agent/v1`, `speak/v1`) connect once via a plain `okhttp3.WebSocketListener`
and propagate failures directly, bypassing `ReconnectingWebSocketListener`
and its hardcoded `connectionFuture.get(4000, MILLISECONDS)` ceiling.
Auto-disable for custom transports: when
`DeepgramClientBuilder.transportFactory(...)` (or its async counterpart)
is set, `setAdditional()` now also calls `builder.reconnect(false)`.
Custom transports like `SageMakerTransportFactory` already manage their
own connection lifecycle and retry semantics; wrapping them in another
retry layer creates double-retry storms under burst load.
The four per-resource WebSocket client files are added to the
"temporarily frozen" section of `.fernignore` so the patch survives the
next Fern regen. Unfreeze once the change has been pushed upstream into
the Fern generator template.
Backwards compatibility:
- Existing callers that do NOT use `transportFactory(...)` see zero
behavior change. `reconnect()` defaults to `true`, the
`ReconnectingWebSocketListener` wrapper is unchanged for the OkHttp
WebSocket path (Deepgram cloud, self-hosted Deepgram via raw WS),
and the lower-level `ClientOptions.Builder.webSocketFactory(...)`
seam is also unaffected.
- Existing SageMaker callers (the only known users of
`transportFactory(...)`) get the auto-disable. Empirically this
eliminates the retry-storm pattern observed at 400 concurrent
streams on a 10x ml.g6.2xlarge endpoint:
ThrottlingException line-count 68,909 -> 240 (-99.7%)
ModelStreamError line-count 30,442 -> 1,128 (-96.3%)
Streams hitting Attempt Count: 4 29,030 -> 96 (-99.7%)
4-second `connection timeout` 13 -> 0
Total error log lines 1,322,666 -> 33,972 (-97.4%)
Same wall time (313 vs 312 s), same transcript count, dramatically
cleaner error log.
Edge case worth calling out:
- A SageMaker caller who has been tuning retry behavior via
`wsClient.reconnectOptions(customOpts)` will find those options
silently ignored after this change (`reconnect=false` skips the
listener entirely). Their tuning was almost certainly a workaround
for the very storm we are now eliminating, so the right migration
is to delete the `reconnectOptions(...)` call. Customers who
genuinely want both a custom transport and SDK-side reconnect can
re-enable explicitly with `.reconnect(true)` after
`.transportFactory(...)`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uto-wire transport factory policy Reworks the Layer-4 patch (#45) per @lukeocodes' review. Replaces the previous ClientOptions.reconnect(boolean) flag and the four duplicated WebSocketClient listeners with the underlying bug fixes in ReconnectingWebSocketListener plus a clean policy-declaration hook on DeepgramTransportFactory. Reverts (no edits to generated WS clients, no .fernignore freeze inflation): - ClientOptions.reconnect(boolean) field/getter/builder method. - if (transportFactory != null) builder.reconnect(false) auto-disable heuristic in both DeepgramClientBuilder and AsyncDeepgramClientBuilder. - The four resources/.../websocket/*WebSocketClient.java listener duplications and their .fernignore freeze entries. Bug fixes in ReconnectingWebSocketListener (now temporarily frozen): - maxRetries(0) used to refuse to connect because the gate compared retryCount.get() >= maxRetries on the initial attempt (retryCount starts at 0). The initial attempt is not a retry; the gate is now > maxRetries so the count caps retries only. maxRetries(0) means "1 attempt, no retries" as the API name implies. - The 4000 ms connectionFuture.get(...) timeout is now configurable via a new connectionTimeoutMs field on ReconnectOptions (default 4000, validated > 0). The hardcoded literal is gone. - Adds applyOptionsOverride(ReconnectOptions) so a transport-factory wrapper can rewire policy on the listener after construction. Used by TransportWebSocketFactory to honour the new factory hook below without requiring edits to the per-resource WebSocketClient classes. Plugin policy declaration: - DeepgramTransportFactory (already permanently frozen) gains a default ReconnectOptions reconnectOptions() method returning null. Plugins managing their own connection lifecycle (e.g. SageMaker) override this to return ReconnectOptions.builder().maxRetries(0).build() so the SDK's wrapper-level reconnect doesn't compound their internal retries into a Throttling-on-Throttling storm under burst load. - TransportWebSocketFactory.create() applies factory.reconnectOptions() to the wrapping ReconnectingWebSocketListener via the new override hook. Auto-wiring with no edits to generated code. Tests: - New ReconnectingWebSocketListenerTest covers the maxRetries(0) regression, connectionTimeoutMs default + customisation + validation, and applyOptionsOverride no-op + retry-gate behaviour. Net diff vs the prior PR head: 4 files changed (+94/−10) instead of 8 files (+364/−24). Freeze list grows by 1 (the listener), shrinks by 4 (the WS clients). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The .fernignore freeze added in this PR protects the listener bug fixes from regen, but the regen workflow in AGENTS.md drives off the "Current temporarily frozen files" list. Add the listener there so the next regen applies the documented .bak swap/restore on it
cbeddf6 to
e191af1
Compare
🤖 I have created a release *beep* *boop* --- ## [0.4.0](v0.3.0...v0.4.0) (2026-05-06) ### ⚠ BREAKING CHANGES * sdk regeneration 2026-05-05 ([#49](#49)) * sdk regeneration 2026-04-29 ([#47](#47)) ### Features * sdk regeneration 2026-04-29 ([#47](#47)) ([0519ad3](0519ad3)) * sdk regeneration 2026-05-05 ([#49](#49)) ([f44678a](f44678a)) ### Bug Fixes * **reconnect:** listener bug fixes + transport factory policy hook ([#45](#45)) ([eac8ad2](eac8ad2)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Summary
ClientOptions.Builder.reconnect(boolean)(defaulttrue). Whenfalse, the per-resource WebSocket clients connect once via a plainokhttp3.WebSocketListenerand propagate failures directly, bypassingReconnectingWebSocketListener(and its hardcodedconnectionFuture.get(4000, MILLISECONDS)ceiling).DeepgramClientBuilder.transportFactory(...)(or its async counterpart) is set,setAdditional()also callsbuilder.reconnect(false). Custom transports likeSageMakerTransportFactoryalready manage their own connection lifecycle; wrapping them in another retry layer creates double-retry storms under burst load.listen/v1,listen/v2,agent/v1,speak/v1. Same structural change in each:connect()branches onclientOptions.reconnect();disconnect(),sendMedia(),sendMessage(),assertSocketIsOpen()usedirectWebSocketwhen the listener is null..fernignoreso the patch survives the next Fern regen. Unfreeze once the change has been pushed upstream into the Fern generator template.Why
Validated against a 400-concurrent-stream burst test on a 10×
ml.g6.2xlargeendpoint (Deepgram on SageMaker, replicating a customer-reported failure):ThrottlingExceptionline-countModelStreamErrorline-countAttempt Count: 4connection timeout after 4000Same work done, ~97 % less error noise. Pairs with
deepgram/deepgram-java-sdk-transport-sagemaker#14(Layer 1 transport-side timeouts), but each provides independent value.Backwards compatibility
Existing callers that do NOT use
transportFactory(...)see zero behavior change. The newreconnect()field defaults totrue, theReconnectingWebSocketListener-wrapped code path is unchanged for the OkHttp WebSocket transport (Deepgram cloud, self-hosted Deepgram via raw WS), and the lower-levelClientOptions.Builder.webSocketFactory(...)seam is also unaffected (only the explicitDeepgramClientBuilder.transportFactory(...)API triggers the auto-disable).transportFactory != null?reconnect()ReconnectingWebSocketListener?wss://api.deepgram.com)truetrueWebSocketFactoryviaClientOptions(lower-level seam)truetransportFactory(SageMakerTransportFactory)false(auto-set)transportFactory(<future custom transport>)false(auto-set).reconnect(true)Edge case worth calling out
A SageMaker caller who has been tuning retry behavior via
wsClient.reconnectOptions(customOpts)will find those options silently ignored after this change (reconnect=falseskips the listener entirely). Their tuning was almost certainly a workaround for the very storm we are now eliminating, so the right migration is to delete thereconnectOptions(...)call. Customers who genuinely want both a custom transport AND SDK-side reconnect can re-enable explicitly with.reconnect(true)after.transportFactory(...):This is a runtime-quality change, not an API break — no compile errors for any existing customer. Recommend mentioning in CHANGELOG and migration notes for the next minor release.
Test plan
./gradlew build— passes (spotless + compile + tests)../gradlew test— all existing tests pass..fernignore.🤖 Generated with Claude Code