[Rust (affects all language bindings)] Use exponential backoff with jitter for stream recovery retry strategy #192

@moomindani

SDK

Rust (affects all language bindings)

Description

The current retry strategy for stream creation and recovery uses tokio_retry::strategy::FixedInterval (lib.rs:990, arrow_stream.rs:509). This means all retry attempts use the same fixed delay (default: 2 seconds), with no jitter.

Fixed-interval retries are problematic in distributed systems:

  • Thundering herd: When a server failure disconnects many clients simultaneously, all clients retry at the same interval, concentrating load on the recovering server.
  • No backoff under sustained failures: A fixed 2-second interval gives the server no additional recovery time as failures persist.
  • Out of step with common practice: Exponential backoff with jitter is the approach recommended in AWS, GCP, and gRPC guidance.
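To make the thundering-herd point concrete, here is a rough, std-only sketch that counts how many clients land in the same 100 ms window on their first retry, with and without jitter. Everything in it is illustrative: `XorShift64`, `first_retry_delays`, and `peak_retries_per_bucket` are hypothetical names, and the small xorshift generator only stands in for a real RNG so the snippet needs no external crate.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Tiny deterministic xorshift64 generator (illustration only; a real client
/// would use a proper RNG).
struct XorShift64(u64);

impl XorShift64 {
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Map the top 53 bits to a float in [0, 1).
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// First-retry delay for each of `clients` clients disconnected at the same instant.
/// With jitter, each delay is scaled by a random factor in [0, 1).
fn first_retry_delays(clients: usize, base: Duration, with_jitter: bool) -> Vec<Duration> {
    let mut rng = XorShift64(0x9E37_79B9_7F4A_7C15);
    (0..clients)
        .map(|_| if with_jitter { base.mul_f64(rng.next_f64()) } else { base })
        .collect()
}

/// Largest number of retries that fall into the same `bucket`-wide time window.
fn peak_retries_per_bucket(delays: &[Duration], bucket: Duration) -> usize {
    let mut counts: HashMap<u128, usize> = HashMap::new();
    for d in delays {
        *counts.entry(d.as_millis() / bucket.as_millis()).or_insert(0) += 1;
    }
    counts.values().copied().max().unwrap_or(0)
}
```

With a fixed 2-second delay every client hits the server in the same window; with jitter the same clients are spread over the whole 0 to 2 second range, so the peak per window is a small fraction of the total.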

Proposed Solution

  1. Add a RetryStrategy enum to StreamConfiguration:
    pub enum RetryStrategy {
        Fixed,
        ExponentialBackoffWithJitter,
    }
  2. Add a max_recovery_backoff_ms field that caps the exponential growth (default: 30,000 ms).
  3. Change the default strategy from Fixed to ExponentialBackoffWithJitter.
  4. recovery_backoff_ms serves as the initial backoff for exponential, or the interval for fixed.
  5. No new dependencies required — tokio-retry already provides ExponentialBackoff and jitter.
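A minimal sketch of how the configuration could look once the steps above are applied. Only the retry-related fields are shown. The field types, the `retry_strategy` field name, the `recovery_retries` default, the doubling multiplier, and the `delay_caps` helper are assumptions made for illustration, not the SDK's actual definitions. The helper returns each delay before jitter, i.e. the upper bound that jitter would then randomize.

```rust
use std::time::Duration;

/// Retry strategy selector, as proposed in the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetryStrategy {
    Fixed,
    ExponentialBackoffWithJitter,
}

/// Retry-related subset of the configuration (types are assumed).
#[derive(Debug, Clone)]
pub struct StreamConfiguration {
    pub recovery_retries: u32,
    pub recovery_backoff_ms: u64,
    pub max_recovery_backoff_ms: u64,
    pub retry_strategy: RetryStrategy,
}

impl Default for StreamConfiguration {
    fn default() -> Self {
        Self {
            recovery_retries: 4, // placeholder, not the SDK's actual default
            recovery_backoff_ms: 2_000,
            max_recovery_backoff_ms: 30_000,
            retry_strategy: RetryStrategy::ExponentialBackoffWithJitter,
        }
    }
}

impl StreamConfiguration {
    /// Delay for each retry attempt before jitter is applied.
    /// Assumes the exponential strategy doubles the delay on every attempt.
    pub fn delay_caps(&self) -> Vec<Duration> {
        (0..self.recovery_retries)
            .map(|attempt| {
                let ms = match self.retry_strategy {
                    RetryStrategy::Fixed => self.recovery_backoff_ms,
                    RetryStrategy::ExponentialBackoffWithJitter => self
                        .recovery_backoff_ms
                        .saturating_mul(2u64.saturating_pow(attempt))
                        .min(self.max_recovery_backoff_ms),
                };
                Duration::from_millis(ms)
            })
            .collect()
    }
}
```

Under these assumptions, six retries with the proposed defaults would be bounded by 2 s, 4 s, 8 s, 16 s, 30 s, and 30 s, while `RetryStrategy::Fixed` keeps every delay at `recovery_backoff_ms`.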

Changes are needed in 2 locations in the Rust core (lib.rs and arrow_stream.rs), plus configuration structs. All language bindings inherit the behavior through their respective FFI/binding layers.

Additional Context

References:

Current code (lib.rs:990-991):

let strategy = FixedInterval::from_millis(options.recovery_backoff_ms)
    .take(options.recovery_retries as usize);

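Proposed replacement:

use std::time::Duration;
use tokio_retry::strategy::{jitter, ExponentialBackoff};
// tokio-retry yields base^n * factor ms, so using recovery_backoff_ms as the base would
// jump from 2 s straight to the cap; base 2 with this factor doubles from the initial backoff.
let strategy = ExponentialBackoff::from_millis(2)
    .factor((options.recovery_backoff_ms / 2).max(1))
    .max_delay(Duration::from_millis(options.max_recovery_backoff_ms))
    .map(jitter)
    .take(options.recovery_retries as usize);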
