Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

retry keyword does not check for stale objects before retrying. #34

Open
tmagrino opened this issue Feb 21, 2019 · 8 comments
Open

retry keyword does not check for stale objects before retrying. #34

tmagrino opened this issue Feb 21, 2019 · 8 comments

Comments

@tmagrino
Copy link
Member

The implementation of the retry keyword can cause problems when working off of stale worker cache data.

For example

atomic {
  if (val.condition()) retry;
}

Will infinitely loop if val.condition() returns true only when state is inconsistent (the collection of cached objects read never existed at the same time on their respective stores in a consistent state).

We should update the transaction loop to check for stale objects before running a user retry or user abort.

@andrewcmyers
Copy link
Contributor

In theory we are supposed to have a transaction timeout capability, so that a long-running transaction like this gets gunned down and at that point the state would be checked.

@andrewcmyers
Copy link
Contributor

I guess what I'm saying is that this problem doesn't seem specific to retry.

@tmagrino
Copy link
Member Author

It's not specific to retry, but the code for performing an explicit retry is special-cased in a way that doesn't exhibit the correct behavior we have with other retry scenarios.

@tmagrino
Copy link
Member Author

tmagrino commented Feb 21, 2019

For example, in nearly all other exception scenarios we check for stale objects in the loop, like here:

if (tm.checkForStaleObjects()) continue;

However, the case for RetryException doesn't do this:

} catch (RetryException e) {
success = false;
continue;

It's honestly a really simple fix. However, I think the logic in this loop has gotten a bit hairy and complicated with small changes over time in a way that suggests that there's a cleaner, harder-to-get-wrong rewrite of the transaction loop logic that should be considered.

I'm largely documenting the issue here to come back to later (tomorrow/this weekend) when I'm not debugging something else. 😃

@andrewcmyers
Copy link
Contributor

Checking for stale objects is expensive, of course. But I guess retry happens rarely enough that it's not an issue?

@tmagrino
Copy link
Member Author

I'm actually thinking we could do the check asynchronously here and potentially other places where we're going to retry regardless.

@tmagrino
Copy link
Member Author

So yeah, this is like the timer idea we've tossed around for long-running transactions.

@andrewcmyers
Copy link
Contributor

The timer thing is actually in the original SOSP paper but I guess it never got implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants