Retry Logic Overview (WIP)

Jim Borden edited this page Aug 23, 2016 · 5 revisions

This document is intended to be the authoritative reference on retrying requests over a network, which applies in several places during the replication process. It covers what should happen in the event of both transient and permanent errors. A transient error is one that is expected to clear within a relatively short period of time (such as a connection timeout or a 503). A permanent error is the opposite (such as a 401 or 404) and is not likely to recover without intervention. This document does not cover other replication logic such as "going offline."
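The transient/permanent distinction can be sketched as a small classifier. This is only an illustration: `ErrorKind`, `TRANSIENT_STATUSES`, and `classify` are hypothetical names, the status set contains only the example named above, and treating every other status as permanent is an assumption rather than documented behavior.

```python
from enum import Enum


class ErrorKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


# Only the transient status named in the text (503). A real implementation
# would likely list more; which ones is not specified here.
TRANSIENT_STATUSES = {503}


def classify(status=None, timed_out=False):
    """Classify a failed request as transient or permanent.

    `status` is an HTTP status code, or None if no response arrived.
    `timed_out` is True when the connection timed out.
    """
    if timed_out or status in TRANSIENT_STATUSES:
        return ErrorKind.TRANSIENT
    # Assumption: anything not known to be transient (e.g. 401, 404)
    # is treated as permanent.
    return ErrorKind.PERMANENT
```

For example, `classify(status=503)` and `classify(timed_out=True)` both yield `ErrorKind.TRANSIENT`, while `classify(status=401)` yields `ErrorKind.PERMANENT`.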

The replication retry flow is as follows:

  1. Replication attempts to run
  2. A connection error occurs
    • 2a The connection error is transient, go to 3
    • 2b The connection error is permanent, go to 4
  3. Retry according to the applied retry strategy (not customizable on all platforms)
    • 3a The retry strategy is exhausted without success, go to 4
    • 3b A retry succeeds, go to 1
  4. At this point the error is considered a permanent one
    • 4a The replication is continuous. Switch to idle, set last error, enter long delay (~60 sec) and go to 1
    • 4b The replication is non-continuous. Set last error, give up and stop the replication
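The numbered flow above can be sketched as a pair of nested loops. This is a minimal sketch, not the actual replicator code: `replicate`, `TransientError`, `PermanentError`, and every parameter are hypothetical. It assumes `attempt` performs one replication run and either returns or raises, and that the retry strategy is a fixed number of retries with a fixed delay. `max_cycles` exists only so the continuous case can terminate in a demonstration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker: expected to clear soon (e.g. timeout, 503)."""


class PermanentError(Exception):
    """Hypothetical marker: needs intervention (e.g. 401, 404)."""


def replicate(attempt, continuous, max_retries=3, retry_delay=2.0,
              long_delay=60.0, max_cycles=None, sleep=time.sleep):
    """Run `attempt` following steps 1-4; return the last error or None."""
    cycles = 0
    while True:
        retries = 0
        while True:
            try:
                attempt()                    # step 1
                return None                  # run finished without error
            except TransientError as err:    # step 2a
                last_error = err
                if retries >= max_retries:   # step 3a: retries exhausted
                    break
                retries += 1
                sleep(retry_delay)           # step 3, then back to step 1
            except PermanentError as err:    # step 2b
                last_error = err
                break
        # Step 4: the error is now considered permanent.
        if not continuous:                   # step 4b: give up and stop
            return last_error
        cycles += 1
        if max_cycles is not None and cycles >= max_cycles:
            return last_error                # demo-only escape hatch
        sleep(long_delay)                    # step 4a: long delay, go to 1
```

Passing a recording function as `sleep` makes the delays observable without waiting. A non-continuous run whose `attempt` raises `PermanentError` returns that error immediately with no sleeps, while a continuous run sleeps for `long_delay` before starting over.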

Examples:

Start a non-continuous replication
Initial connection reports 401 (Unauthorized)
Error is permanent, so stop the replication and fire callbacks for the error and for the stopped status (two notifications)

Start a non-continuous replication
Halfway through, a 503 error is encountered (Service Unavailable)
Error is transient, so retry
Retry succeeds, replication continues

Start a non-continuous replication
Halfway through, a connection timeout occurs
Error is transient, so retry
Retry fails, replication stops

Start a continuous replication
A 404 error is encountered on the endpoint
Error is permanent, so don't retry the request
Enter the master retry loop (wait 60 sec and restart replication)
