Speculative Retries #3782

NaluTripician · 2023-03-28T18:19:24Z

Speculative Processing

Background

Goal: Read resiliency (not tied to a consistency change)
- Out of scope: Write resiliency (multi-master)
[ASK] Fallback region: Reads should not apply session barriers, if any
[ASK] Dynamic/mutable preferred regions: Otherwise clients must be re-created (used during outages)
[ASK] Always-on cross-region hedging: ~500ms possible hedging threshold (teams will tune)
[ASK] Ability to opt in/out of the request hedging feature
[ASK] Parallel hedging: Don't cancel the current in-flight request; pick whichever finishes first
[ASK] Region choice at request level (ability to share connections)
[ASK] Reflection-based approach (fully supported)
[ASK] Available on both Preview and GA packages

Parallel Hedging APIs + Samples

When building a new CosmosClient, there will be an option to enable parallel hedging on that client.

CosmosClient client = new CosmosClientBuilder("connection string")
    .WithApplicationPreferredRegions(
        new List<string> { "East US", "Central US", "West US" })
    .WithAvailabilityStrategy(
        AvailabilityStrategy.CrossRegionHedgingStrategy(
            threshold: TimeSpan.FromMilliseconds(500),
            thresholdStep: TimeSpan.FromMilliseconds(100)))
    .Build();

or

CosmosClientOptions options = new CosmosClientOptions()
{
    AvailabilityStrategy = AvailabilityStrategy.CrossRegionHedgingStrategy(
        threshold: TimeSpan.FromMilliseconds(500),
        thresholdStep: TimeSpan.FromMilliseconds(100)),
    ApplicationPreferredRegions = new List<string>() { "East US", "West US", "Central US" },
};

CosmosClient client = new CosmosClient(
    accountEndpoint: "account endpoint",
    authKeyOrResourceToken: "auth key or resource token",
    clientOptions: options);

The example above creates a CosmosClient instance with an AvailabilityStrategy enabled with a 500ms threshold. This means that if a request takes longer than 500ms, the SDK will send a new request to the backend, following the order of the preferred regions list. If neither ApplicationRegion nor ApplicationPreferredRegions is set, an AvailabilityStrategy cannot be set. If no response has come back from either the primary request or the first hedge after the step time, another parallel request will be made to the next region. The SDK will then return the first response that comes back from the backend. The threshold parameter is required and can be set to any value greater than 0. The AvailabilityStrategy can also be set at the request level, overriding the client-level AvailabilityStrategy, by setting the AvailabilityStrategy on the RequestOptions object.

Note: ApplicationRegion or ApplicationPreferredRegions MUST be set to use Hedging
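To make the timing concrete, here is a minimal sketch of when each request would go out for a given threshold and threshold step. HedgeSchedule and Compute are hypothetical names used only for illustration and are not part of the SDK. The sketch also assumes that the primary request goes to the first preferred region and that no non-final response arrives early (which would trigger the next hedge sooner).

```csharp
using System;
using System.Collections.Generic;

public static class HedgeSchedule
{
    // Hypothetical helper (not an SDK API): returns, for each preferred region,
    // the offset from the start of the operation at which a request is sent.
    public static IReadOnlyList<(string Region, TimeSpan SendAt)> Compute(
        IReadOnlyList<string> preferredRegions,
        TimeSpan threshold,
        TimeSpan thresholdStep)
    {
        if (threshold <= TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(
                nameof(threshold), "threshold must be greater than 0");
        }

        var schedule = new List<(string Region, TimeSpan SendAt)>();
        for (int i = 0; i < preferredRegions.Count; i++)
        {
            // Index 0 is the primary request (assumed to target the first region).
            // Hedge number i is sent at threshold + (i - 1) * thresholdStep.
            TimeSpan sendAt = i == 0
                ? TimeSpan.Zero
                : threshold + TimeSpan.FromTicks(thresholdStep.Ticks * (i - 1));
            schedule.Add((preferredRegions[i], sendAt));
        }

        return schedule;
    }
}
```

With the regions "East US", "Central US", "West US", a 500ms threshold, and a 100ms step, this yields sends at 0ms, 500ms, and 600ms respectively.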

Override AvailabilityStrategy:

// Send one request out with a more aggressive threshold
ItemRequestOptions requestOptions = new ItemRequestOptions()
{
    AvailabilityStrategy = AvailabilityStrategy.CrossRegionHedgingStrategy(
        threshold: TimeSpan.FromMilliseconds(100),
        thresholdStep: TimeSpan.FromMilliseconds(50))
};

Hedging can be enabled for all read requests: ReadItem, Queries (single and cross partition), ReadMany, and ChangeFeed. It is not enabled for write requests.

Diagnostics

When this feature is used, two new areas of note appear in the diagnostics data: Response Region and Hedge Context. Response Region shows the region that ultimately served the request. Hedge Context shows all the regions that requests were sent to.

Design

The SDK will send the first request to the primary region. If there is no response from the backend before the threshold time, the SDK will begin sending hedged requests to the regions in the order of the ApplicationPreferredRegions list. After the first hedged request is sent, further hedged requests are fired one at a time, each after waiting the time specified by the threshold step. Once a response is received from one of the requests, the availability strategy checks whether the result is considered final. If the result is final, it is returned. If not, the SDK skips the remaining threshold/threshold step time and sends the next hedged request. If all hedged requests have been sent and no final response is received, the SDK returns the last response it received.

The AvailabilityStrategy operates at the RequestInvokerHandler level, meaning that each hedged request goes through its own handler pipeline, including the ClientRetryPolicy. As a result, hedged requests are retried independently of each other. Note that each hedged request is restricted to the region it was sent to, so it will make only local retries, never cross-region retries. The primary request is retried as normal.
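The flow described above can be sketched roughly as follows. This is not the SDK's implementation: HedgingSketch, sendAsync, and isFinal are hypothetical stand-ins for the handler pipeline and the finality check, and cancelling the remaining in-flight requests once a final response arrives is an assumption of this sketch. It also assumes that failures surface as responses rather than exceptions.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class HedgingSketch
{
    // sendAsync and isFinal are hypothetical delegates, not SDK APIs.
    public static async Task<TResponse> ExecuteAsync<TResponse>(
        IReadOnlyList<string> regions,
        Func<string, CancellationToken, Task<TResponse>> sendAsync,
        Func<TResponse, bool> isFinal,
        TimeSpan threshold,
        TimeSpan thresholdStep)
    {
        using var cts = new CancellationTokenSource();
        try
        {
            // The primary request goes to the first region.
            var inFlight = new List<Task<TResponse>> { sendAsync(regions[0], cts.Token) };
            int next = 1;
            Task timer = Task.Delay(threshold, cts.Token);
            TResponse last = default;

            while (inFlight.Count > 0)
            {
                // Wait for any response, or for the timer if a region is still unused.
                var waitFor = new List<Task>(inFlight);
                if (next < regions.Count)
                {
                    waitFor.Add(timer);
                }

                Task completed = await Task.WhenAny(waitFor);
                if (completed != timer)
                {
                    var finished = (Task<TResponse>)completed;
                    inFlight.Remove(finished);
                    last = await finished;
                    if (isFinal(last))
                    {
                        return last; // the first final response wins
                    }
                }

                // Timer fired or a non-final response arrived: hedge to the next region now.
                if (next < regions.Count)
                {
                    inFlight.Add(sendAsync(regions[next], cts.Token));
                    next++;
                    timer = Task.Delay(thresholdStep, cts.Token);
                }
            }

            // Every region was tried and nothing was final: return the last response seen.
            return last;
        }
        finally
        {
            cts.Cancel(); // assumption: stop timers and losing requests
        }
    }
}
```

Calling ExecuteAsync with a slow first region and a fast second region returns the second region's response shortly after the threshold elapses. A non-final response from the first region triggers the second request immediately, without waiting for the threshold.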

Status Codes SDK Will Consider Final

Status Code	Description
1xx	1xx Status Codes are considered Final
2xx	2xx Status Codes are considered Final
3xx	3xx Status Codes are considered Final
400	Bad Request
401	Unauthorized
404/0	Not Found. 404/0 responses are final results because the document was not yet available after enforcing the consistency model
409	Conflict
405	Method Not Allowed
412	Precondition Failed
413	Request Entity Too Large

All other status codes are treated as possible transient errors and will be retried with hedging.
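As a minimal sketch, the table above could be expressed as a predicate like the one below. HedgingStatusCodes and IsFinalResult are hypothetical names, not SDK APIs. The sketch takes the status code and sub-status code as plain integers and treats a 404 with any non-zero sub-status as non-final, since only 404/0 is listed.

```csharp
public static class HedgingStatusCodes
{
    // Hypothetical helper (not an SDK API) mirroring the table above.
    public static bool IsFinalResult(int statusCode, int subStatusCode)
    {
        // 1xx, 2xx, and 3xx are always considered final.
        if (statusCode >= 100 && statusCode < 400)
        {
            return true;
        }

        switch (statusCode)
        {
            case 400: // Bad Request
            case 401: // Unauthorized
            case 405: // Method Not Allowed
            case 409: // Conflict
            case 412: // Precondition Failed
            case 413: // Request Entity Too Large
                return true;
            case 404: // Not Found is final only with sub-status 0
                return subStatusCode == 0;
            default:
                // Everything else is treated as possibly transient and is hedged.
                return false;
        }
    }
}
```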

Example Flow For Cross Region Hedging With 3 Regions

graph TD
    A[RequestMessage] <--> B[RequestInvokerHandler]
    B <--> C[CrossRegionHedgingStrategy]
    C --> E(PrimaryRequest)
    E --> F{time spent < threshold}

    F -- No --> I
    F -- Yes --> G[[Wait for response]]
    G -- Response --> H{Is Response Final}
    H -- Yes --> C
    H -- No --> I(Hedge Request 1)
    
    I --> J{time spent < threshold step}

    J -- No --> K(Hedge Request 2)
    J -- Yes --> M[[Wait for response]]
    M -- Response --> N{Is Response Final}
    N -- Yes --> C
    N -- No --> K

    K --> O[[Wait for response]]
    O -- Response --> P{Is Response Final}
    P -- Yes --> C
    P -- No, But this is the final hedge request --> C

Additional Work

After the initial work is complete, there are two areas where additional work can be done to improve the feature.

Thompson Sampling

Adding a Thompson Sampling mode would be a logical next step for the feature. Thompson Sampling is a probabilistic algorithm that builds a probability model from the observed latency of each region. This method is expected to give a much more accurate estimate of the best region than a mean-based model. It also provides a level of confidence in which region is the best to route to, and the estimate improves over time as more observations are collected. By using a Thompson Sampling based model, we would hope to have even better latency with threshold mode. This would likely come at the cost of RUs.
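To illustrate the idea only, here is a minimal sketch of a Gaussian approximation to this sampling scheme over per-region latency. It is not a proposed design: LatencyThompsonSampler and its members are hypothetical names, and the normal model with a plug-in variance is a simplification chosen for brevity. Each region keeps a running mean and variance of observed latency; to pick a region, one plausible mean latency is drawn per region and the lowest draw wins.

```csharp
using System;
using System.Collections.Generic;

public sealed class LatencyThompsonSampler
{
    private sealed class Stats
    {
        public int Count;
        public double Mean;
        public double M2; // sum of squared deviations (Welford's algorithm)
    }

    private readonly Dictionary<string, Stats> stats = new Dictionary<string, Stats>();
    private readonly Random random;

    public LatencyThompsonSampler(IEnumerable<string> regions, int seed)
    {
        this.random = new Random(seed);
        foreach (string region in regions)
        {
            this.stats[region] = new Stats();
        }
    }

    // Record one observed latency for a region (running mean and variance).
    public void Observe(string region, double latencyMs)
    {
        Stats s = this.stats[region];
        s.Count++;
        double delta = latencyMs - s.Mean;
        s.Mean += delta / s.Count;
        s.M2 += delta * (latencyMs - s.Mean);
    }

    // Draw one plausible mean latency per region and pick the lowest draw.
    public string ChooseRegion()
    {
        string best = null;
        double bestSample = double.MaxValue;
        foreach (KeyValuePair<string, Stats> pair in this.stats)
        {
            Stats s = pair.Value;
            if (s.Count < 2)
            {
                return pair.Key; // too little data: explore this region first
            }

            double standardError = Math.Sqrt(s.M2 / (s.Count - 1) / s.Count);
            double sample = s.Mean + (standardError * this.NextGaussian());
            if (sample < bestSample)
            {
                bestSample = sample;
                best = pair.Key;
            }
        }

        return best;
    }

    // Standard normal draw via the Box-Muller transform.
    private double NextGaussian()
    {
        double u1 = 1.0 - this.random.NextDouble(); // in (0, 1], avoids Log(0)
        double u2 = this.random.NextDouble();
        return Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
    }
}
```

Regions with few observations have a wide spread and are still picked occasionally, while a region that is consistently faster is picked almost every time. That is the confidence-weighted behavior described above.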

Samples and Metrics

Adding a sample library on how to use this feature as well as metrics showing the performance benefits of each mode would be a great addition to the feature. It would be ideal to show potential customers how this feature could be a benefit to their application and could help with onboarding additional customers. Some metrics/figures that could be provided would be:

Latency vs Time (and showing where the latency is injected to a region)
RU cost vs Time
Latency to each region vs Time + What region the SDK is sending requests to
P99/95/75 latency for each mode with constant injection of delay on the local region

This sample library would also take advantage of the FaultInjectionLibrary.

Tasks

Exclude Regions #4178
Diagnostics Combination #4179
Hedging With Writes on MultiMaster #4457
Transportation Chaos Simulation - Fault Injection #3819

philipthomas-MSFT · 2023-10-31T15:10:07Z

Exclude regions PR.

NaluTripician added the Routing label Mar 28, 2023

NaluTripician self-assigned this Mar 28, 2023

NaluTripician added this to Azure Cosmos SDKs Mar 28, 2023

NaluTripician moved this to In Progress in Azure Cosmos SDKs Mar 28, 2023

NaluTripician moved this from In Progress to Approved in Azure Cosmos SDKs Apr 25, 2023

NaluTripician changed the title ~~End to End Operation Latency Policy~~ Speculative Reties May 15, 2023

kundadebdatta mentioned this issue May 16, 2023

Gallium Semester Tasks #3850

Open

27 tasks

NaluTripician moved this from Approved to In Progress in Azure Cosmos SDKs Jul 18, 2023

philipthomas-MSFT changed the title ~~Speculative Reties~~ Speculative Retries Aug 8, 2023

philipthomas-MSFT moved this from In Progress to Blocked in Azure Cosmos SDKs Sep 5, 2023

philipthomas-MSFT moved this from Blocked to In Progress in Azure Cosmos SDKs Oct 3, 2023

NaluTripician mentioned this issue Jan 2, 2024

Routing: Adds Parallel Request Hedging #4198

Merged

kundadebdatta mentioned this issue Jan 9, 2024

Germanium Semester Tasks #4246

Open

16 tasks

philipthomas-MSFT moved this from In Progress to Blocked in Azure Cosmos SDKs Jan 30, 2024

philipthomas-MSFT moved this from Blocked to In Progress in Azure Cosmos SDKs Feb 13, 2024

NaluTripician moved this from In Progress to Approved in Azure Cosmos SDKs Sep 17, 2024
