[yugabyte/yugabyte-db#21281] Refactor retry method for GetChanges #329
Problem
The connector's model for retrying `GetChanges` requests works as follows. Suppose there are 5 tablets in the list `[tablet_0, tablet_1, tablet_2, tablet_3, tablet_4]` and our total retry count is set to 5. While iterating over the tablets, we hit an error on `tablet_3`, so we increment the retry count and the flow will be:
1. Iteration starts again from `tablet_0`.
   a. This hits an error.
2. Iteration starts again from `tablet_0`; this time it succeeds, but we hit an error on `tablet_1`.
3. `tablet_0` and `tablet_1` succeed but `tablet_2` hits an error, so we retry again.
4. `tablet_3` hits an error again, so we retry again.
   a. Next time, `tablet_3` causes an error again. By this time we have not been able to send any successful `GetChanges` request and, if our retention barrier is low, we end up passing it.
   b. We will get an error indicating stream ID expiry.
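The flow above can be illustrated with a minimal Java sketch. This is not the connector's code: the class name `SharedRetryCounterSketch`, the method `pollAllTablets`, and the use of a `Predicate` to stand in for a `GetChanges` call are all hypothetical. It also assumes the retry counter is never reset between failures, which the description implies but does not state.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of a single shared retry counter; names are illustrative only. */
class SharedRetryCounterSketch {
    /**
     * Tries to complete one full pass over the tablets. A failure on any tablet
     * increments the one shared counter and restarts the pass from the first tablet.
     * Returns true if a full pass completes, false once the retry budget is spent.
     */
    static boolean pollAllTablets(List<String> tablets,
                                  Predicate<String> getChanges, // stand-in for a GetChanges call
                                  int maxRetries) {
        int retryCount = 0; // shared by all tablets (assumed never reset)
        while (retryCount < maxRetries) {
            boolean failed = false;
            for (String tablet : tablets) {
                if (!getChanges.test(tablet)) {
                    retryCount++;   // any tablet's failure drains the same budget
                    failed = true;
                    break;          // abandon this pass and start over from the first tablet
                }
            }
            if (!failed) {
                return true;
            }
        }
        return false;
    }
}
```

In this sketch, failures spread across different tablets consume the same budget, and tablets later in the list are not polled at all while earlier ones keep failing.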
Solution
To avoid incorrect retries, the retry model is changed as follows:
In other words, if there is any error while calling `GetChanges` now:
1. `TABLET_SPLIT`, in streaming only, will be handled.
2. For any other error:
   a. A log will be printed indicating that the tablet has hit an error.
   b. The loop will continue iterating as usual over the other tablets.
      i. If the time between the last `GetChanges` attempt for a tablet awaiting retry and the current time is equal to or greater than the retry delay, we call `GetChanges` again; otherwise we skip that tablet.
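The per-tablet behaviour described above could look roughly like the following Java sketch. It is an illustration under assumptions, not the code in this PR: `PerTabletRetrySketch`, `pollOnce`, and the `Predicate` standing in for a `GetChanges` call are hypothetical names, the current time is passed in explicitly to keep the example deterministic, and both `TABLET_SPLIT` handling and any per-tablet retry limit are left out because the description does not detail them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of per-tablet retry bookkeeping; names are illustrative only. */
class PerTabletRetrySketch {
    // Tablets awaiting retry, mapped to the time (ms) of their last failed attempt.
    private final Map<String, Long> lastFailedAttemptMs = new HashMap<>();
    private final long retryDelayMs;

    PerTabletRetrySketch(long retryDelayMs) {
        this.retryDelayMs = retryDelayMs;
    }

    /** Runs one pass over all tablets and returns those polled successfully in this pass. */
    List<String> pollOnce(List<String> tablets,
                          Predicate<String> getChanges, // stand-in for a GetChanges call
                          long nowMs) {
        List<String> succeeded = new ArrayList<>();
        for (String tablet : tablets) {
            Long lastAttempt = lastFailedAttemptMs.get(tablet);
            if (lastAttempt != null && nowMs - lastAttempt < retryDelayMs) {
                continue; // retry delay has not elapsed yet: skip this tablet
            }
            if (getChanges.test(tablet)) {
                lastFailedAttemptMs.remove(tablet); // no longer awaiting retry
                succeeded.add(tablet);
            } else {
                // Log the failure and keep iterating over the remaining tablets.
                System.err.println("GetChanges failed for " + tablet + ", will retry later");
                lastFailedAttemptMs.put(tablet, nowMs);
            }
        }
        return succeeded;
    }
}
```

With this structure a failing tablet only delays itself: the other tablets are still polled on every pass, so they keep making progress while the failing tablet waits out its retry delay.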