`retry` keyword does not check for stale objects before retrying. #34

tmagrino · 2019-02-21T19:47:03Z

The implementation of the retry keyword can cause problems when working off of stale worker cache data.

For example

atomic {
  if (val.condition()) retry;
}

This will loop forever if val.condition() returns true only when the cached state is inconsistent (that is, the set of cached objects read never existed at the same time, in a consistent state, on their respective stores).

We should update the transaction loop to check for stale objects before running a user retry or user abort.

andrewcmyers · 2019-02-21T21:18:32Z

In theory we are supposed to have a transaction timeout capability, so that a long-running transaction like this gets gunned down and at that point the state would be checked.

andrewcmyers · 2019-02-21T21:18:55Z

I guess what I'm saying is that this problem doesn't seem specific to retry.

tmagrino · 2019-02-21T21:34:16Z

It's not specific to retry, but the code for performing an explicit retry is special-cased in a way that doesn't exhibit the correct behavior we have with other retry scenarios.

tmagrino · 2019-02-21T21:40:20Z

For example, in nearly all other exception scenarios we check for stale objects in the loop, like here:

fabric/src/system/fabric/worker/Worker.java

Line 822 in 099e3b0

if (tm.checkForStaleObjects()) continue;

However, the case for RetryException doesn't do this:

fabric/src/system/fabric/worker/Worker.java

Lines 798 to 800 in 099e3b0

    
} catch (RetryException e) {
  success = false;
  continue;

It's honestly a really simple fix. However, I think the logic in this loop has gotten a bit hairy and complicated with small changes over time in a way that suggests that there's a cleaner, harder-to-get-wrong rewrite of the transaction loop logic that should be considered.
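To make the fix concrete, here is a rough, self-contained sketch of the idea, not the actual Worker.java code. Only `RetryException` and `checkForStaleObjects()` are names from the real loop; `RetryLoopSketch`, `Cache`, `runAtomic`, `checkOnRetry`, and `maxAttempts` are hypothetical stand-ins, and the "store" is simulated with a plain field.

```java
// Hypothetical simulation of the transaction loop; not Fabric's real classes.
class RetryLoopSketch {
  /** Stand-in for the exception thrown by an explicit `retry`. */
  static class RetryException extends RuntimeException {}

  /** Simulated worker cache holding one value that is stale at start. */
  static class Cache {
    boolean cachedCondition = true;        // stale value the worker sees
    final boolean storeCondition = false;  // current value on the store
    int staleChecks = 0;                   // how many checks were performed

    /** Returns true if a stale object was found and refreshed. */
    boolean checkForStaleObjects() {
      staleChecks++;
      if (cachedCondition != storeCondition) {
        cachedCondition = storeCondition;  // evict and refresh the stale copy
        return true;
      }
      return false;
    }
  }

  /**
   * Runs the body `if (val.condition()) retry;` in a loop.
   * Returns the number of attempts needed to commit.
   * maxAttempts exists only so the unfixed variant terminates in this demo.
   */
  static int runAtomic(Cache cache, boolean checkOnRetry, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        if (cache.cachedCondition) throw new RetryException();
        return attempt; // body finished without retry: commit
      } catch (RetryException e) {
        // The proposed change: refresh stale objects before retrying.
        if (checkOnRetry) cache.checkForStaleObjects();
        continue;
      }
    }
    throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
  }

  public static void main(String[] args) {
    System.out.println("attempts with check: " + runAtomic(new Cache(), true, 10));
  }
}
```

With `checkOnRetry` set to false the loop never sees the refreshed value and only stops because of the artificial attempt limit, which is the infinite loop described at the top of this issue.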

I'm largely documenting the issue here to come back to later (tomorrow/this weekend) when I'm not debugging something else. 😃

andrewcmyers · 2019-02-21T22:21:17Z

Checking for stale objects is expensive, of course. But I guess retry happens rarely enough that it's not an issue?

tmagrino · 2019-02-21T22:23:05Z

I'm actually thinking we could do the check asynchronously here and potentially other places where we're going to retry regardless.
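One possible shape for the asynchronous version, as a sketch under assumptions rather than a design: the retry path starts the stale-object check in the background (at most one in flight) and retries right away instead of blocking on it. Everything here except `RetryException` and `checkForStaleObjects()` is hypothetical (`AsyncStaleCheckSketch`, `Cache`, `runAtomic`, `timeoutMillis`), the store is simulated, and the 1 ms sleep is just a placeholder backoff.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simulation; not Fabric's real classes.
class AsyncStaleCheckSketch {
  /** Stand-in for the exception thrown by an explicit `retry`. */
  static class RetryException extends RuntimeException {}

  /** Simulated cache: the cached value is stale (true) until a check refreshes it. */
  static final class Cache {
    final AtomicBoolean cachedCondition = new AtomicBoolean(true);
    final AtomicInteger staleChecks = new AtomicInteger();

    /** Returns true if a stale object was found and refreshed. */
    boolean checkForStaleObjects() {
      staleChecks.incrementAndGet();
      return cachedCondition.compareAndSet(true, false);
    }
  }

  /** Retries until the body commits or the timeout passes; returns the attempt count. */
  static int runAtomic(Cache cache, long timeoutMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    CompletableFuture<Boolean> pending = null;
    int attempts = 0;
    while (deadline - System.nanoTime() > 0) {
      attempts++;
      try {
        if (cache.cachedCondition.get()) throw new RetryException();
        return attempts; // body finished without retry: commit
      } catch (RetryException e) {
        // Start a background check unless one is already running.
        if (pending == null || pending.isDone()) {
          pending = CompletableFuture.supplyAsync(cache::checkForStaleObjects);
        }
        try {
          Thread.sleep(1); // placeholder backoff before the next attempt
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while retrying", ie);
        }
      }
    }
    throw new IllegalStateException("timed out waiting for a consistent state");
  }

  public static void main(String[] args) {
    System.out.println("committed after " + runAtomic(new Cache(), 5000) + " attempts");
  }
}
```

The number of attempts depends on thread scheduling, so the only guarantees in this sketch are that the first attempt retries and that a later attempt commits once the background check has refreshed the cache.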

tmagrino · 2019-02-21T22:23:45Z

So yeah, this is like the timer idea we've tossed around for long-running transactions.

andrewcmyers · 2019-02-21T22:33:27Z

The timer thing is actually in the original SOSP paper but I guess it never got implemented.
