[yugabyte/yugabyte-db#21281] Refactor retry method for GetChanges #329

Open: wants to merge 4 commits into main


vaibhav-yb (Collaborator)

Problem

The connector currently retries failed GetChanges requests using the following model:

while (connectorIsRunning) {
  try {
    for (String tablet : listOfTablets) {
      // Call GetChanges
    }
    
    // Reset retry count after each iteration over tablet list
  } catch (Exception e) {
    // Increment retry count and then wait before retrying
    
    // Retry will cause the loop of iteration over tablets again
  }
}

Suppose there are 5 tablets in the list [tablet_0, tablet_1, tablet_2, tablet_3, tablet_4] and the total retry count is set to 5. While iterating over the tablets, we hit an error on tablet_3, so the retry count is incremented and the iteration restarts from tablet_0. Because the retry count is shared by all tablets and is only reset after a full iteration, the flow can be:

  1. Retry 1: the iteration restarts from tablet_0, and GetChanges on tablet_0 fails.
  2. Retry 2: tablet_0 succeeds this time, but tablet_1 fails.
  3. Retry 3: tablet_0 and tablet_1 succeed, but tablet_2 fails.
  4. Retry 4: tablet_0 to tablet_2 succeed, but tablet_3 fails again.
    a. Retry 5: tablet_3 fails once more. By this time, no GetChanges request for tablet_3 or tablet_4 has succeeded, and if the retention barrier is low, we end up passing it.
    b. We then get an error indicating that the stream ID has expired.
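To make the retry arithmetic above concrete, here is a small Java simulation of the old model. It is a sketch, not connector code: the class name, the failingTabletPerPass schedule, and the assumption that the task gives up once the shared retry count exceeds the limit are all illustrative. It replays the scenario above (an initial error on tablet_3, then five retries) and reports which tablets never had a successful GetChanges call.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the old retry model: one retry counter shared by all tablets,
// reset only after a full successful iteration over the tablet list.
final class OldRetryModelSimulation {

    static final class Result {
        final boolean gaveUp;                     // true if the shared retry budget ran out
        final List<String> tabletsWithoutSuccess; // tablets with no successful GetChanges call

        Result(boolean gaveUp, List<String> tabletsWithoutSuccess) {
            this.gaveUp = gaveUp;
            this.tabletsWithoutSuccess = tabletsWithoutSuccess;
        }
    }

    // failingTabletPerPass[p] is the index of the tablet whose GetChanges call
    // fails during pass p, or -1 if every call in that pass succeeds.
    static Result simulate(List<String> tablets, int[] failingTabletPerPass, int maxRetries) {
        boolean[] succeeded = new boolean[tablets.size()];
        int retryCount = 0;
        boolean gaveUp = false;
        for (int pass = 0; pass < failingTabletPerPass.length && !gaveUp; pass++) {
            try {
                for (int i = 0; i < tablets.size(); i++) {
                    if (i == failingTabletPerPass[pass]) {
                        throw new IllegalStateException("GetChanges failed on " + tablets.get(i));
                    }
                    succeeded[i] = true;
                }
                retryCount = 0; // reset only after iterating over every tablet
            } catch (IllegalStateException e) {
                retryCount++;                     // an error on ANY tablet consumes the shared budget
                gaveUp = retryCount > maxRetries; // a real connector would also wait out the retry delay
            }
        }
        List<String> withoutSuccess = new ArrayList<>();
        for (int i = 0; i < tablets.size(); i++) {
            if (!succeeded[i]) {
                withoutSuccess.add(tablets.get(i));
            }
        }
        return new Result(gaveUp, withoutSuccess);
    }

    public static void main(String[] args) {
        List<String> tablets = List.of("tablet_0", "tablet_1", "tablet_2", "tablet_3", "tablet_4");
        // Initial error on tablet_3, then the five retries from the scenario.
        int[] schedule = {3, 0, 1, 2, 3, 3};
        Result r = simulate(tablets, schedule, 5);
        System.out.println("gave up: " + r.gaveUp + ", no successful GetChanges: " + r.tabletsWithoutSuccess);
    }
}
```

Running main prints "gave up: true, no successful GetChanges: [tablet_3, tablet_4]": the whole retry budget is spent on errors from four different tablets, while tablet_3 and tablet_4 make no progress at all.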

Solution

To avoid these incorrect retries, the retry model is changed so that errors are handled per tablet:

while (connectorIsRunning) {
  for (String tablet : listOfTablets) {
    try {
      // Call GetChanges

      // Reset retry counter to indicate a successful GetChanges call
    } catch (Exception e) {
      // Mark the tablet for retry and retry after the delay on current tablet

      // This continue here ensures that the execution keeps flowing for the other tablets in the task
      continue;
    }
  }
}
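The new loop can be sketched in Java as follows, under stated assumptions: getChanges is a stand-in that fails for a given set of tablets, and the per-tablet retry counter map is hypothetical (the description only says the counter is reset on a successful call). What the sketch illustrates is that a failure on tablet_3 neither restarts the iteration nor stops tablet_4 from being polled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the new model: each GetChanges call has its own try/catch, so an
// error on one tablet does not restart the iteration or block other tablets.
final class PerTabletRetryLoop {
    // Hypothetical per-tablet retry counters (not the connector's real fields).
    private final Map<String, Integer> retryCounts = new HashMap<>();

    // Runs one iteration over the tablet list and returns the tablets whose
    // GetChanges call succeeded in this iteration.
    List<String> runOnePass(List<String> tablets, Set<String> failingTablets) {
        List<String> succeeded = new ArrayList<>();
        for (String tablet : tablets) {
            try {
                getChanges(tablet, failingTablets);
                retryCounts.put(tablet, 0); // success resets only this tablet's counter
                succeeded.add(tablet);
            } catch (IllegalStateException e) {
                // Mark this tablet for retry; the loop then moves on to the next tablet.
                retryCounts.merge(tablet, 1, Integer::sum);
            }
        }
        return succeeded;
    }

    int retryCount(String tablet) {
        return retryCounts.getOrDefault(tablet, 0);
    }

    // Stand-in for the real GetChanges RPC: fails for every tablet in failingTablets.
    private static void getChanges(String tablet, Set<String> failingTablets) {
        if (failingTablets.contains(tablet)) {
            throw new IllegalStateException("GetChanges failed on " + tablet);
        }
    }

    public static void main(String[] args) {
        List<String> tablets = List.of("tablet_0", "tablet_1", "tablet_2", "tablet_3", "tablet_4");
        PerTabletRetryLoop loop = new PerTabletRetryLoop();
        System.out.println(loop.runOnePass(tablets, Set.of("tablet_3"))); // [tablet_0, tablet_1, tablet_2, tablet_4]
        System.out.println("tablet_3 retries: " + loop.retryCount("tablet_3")); // tablet_3 retries: 1
    }
}
```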

In other words, if there is an error while calling GetChanges now:

  1. If the error is TABLET_SPLIT, it is handled (this applies to streaming only).
  2. If it is not a tablet split:
    a. A log is printed indicating that the tablet has hit an error.
    b. The loop continues iterating as usual over the other tablets.
      i. For a tablet awaiting retry, if the time elapsed since its last GetChanges attempt is equal to or greater than the retry delay, GetChanges is called again; otherwise the tablet is skipped in that iteration.
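Rule 2.b.i amounts to a per-tablet time check. Below is a minimal Java sketch, assuming the loop records the time of each failed attempt; TabletRetryGate and its methods are hypothetical names, and timestamps are passed in explicitly for testability where a real loop would read System.currentTimeMillis().

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the retry-delay check: a tablet awaiting retry is polled again only
// once at least retryDelayMs has elapsed since its last GetChanges attempt.
final class TabletRetryGate {
    private final long retryDelayMs;
    // Hypothetical bookkeeping: last attempt time for each tablet awaiting retry.
    private final Map<String, Long> lastAttemptMs = new HashMap<>();

    TabletRetryGate(long retryDelayMs) {
        this.retryDelayMs = retryDelayMs;
    }

    void recordFailure(String tablet, long nowMs) {
        lastAttemptMs.put(tablet, nowMs); // tablet is now awaiting retry
    }

    void recordSuccess(String tablet) {
        lastAttemptMs.remove(tablet); // tablet is no longer awaiting retry
    }

    // True if GetChanges should be called for this tablet in the current iteration.
    boolean shouldCallGetChanges(String tablet, long nowMs) {
        Long last = lastAttemptMs.get(tablet);
        if (last == null) {
            return true; // not awaiting retry: poll as usual
        }
        return nowMs - last >= retryDelayMs; // delay not yet elapsed: skip this tablet for now
    }

    public static void main(String[] args) {
        TabletRetryGate gate = new TabletRetryGate(1_000);
        gate.recordFailure("tablet_3", 10_000);
        System.out.println(gate.shouldCallGetChanges("tablet_3", 10_500)); // false: delay not elapsed
        System.out.println(gate.shouldCallGetChanges("tablet_3", 11_000)); // true: elapsed time equals the delay
        System.out.println(gate.shouldCallGetChanges("tablet_0", 10_500)); // true: not awaiting retry
    }
}
```

In this sketch, checking elapsed time instead of sleeping inside the catch block is what lets the loop keep serving the other tablets while one tablet waits out its delay.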

vaibhav-yb changed the title from "[yugabyte/yugabyte-db#] Refactor retry method for GetChanges" to "[yugabyte/yugabyte-db#21281] Refactor retry method for GetChanges" on Mar 4, 2024