Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistency issue #751

Open
patb2 opened this issue Nov 29, 2021 · 5 comments
Open

Consistency issue #751

patb2 opened this issue Nov 29, 2021 · 5 comments
Labels
api: datastore Issues related to the googleapis/python-ndb API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@patb2
Copy link

patb2 commented Nov 29, 2021

  1. API: python-ndb
  2. AppEngine
  3. Python version: Python 3.8
  4. google-cloud-ndb version: 1.9.0

Steps to reproduce

On a _post_put_hook(), we have a deferred classmethod, via Cloud Tasks, which will do a few additional processing based on the updated or created entity. We pass self.key to the deferred method, and it will then start by retrieving the entity thru a simple entity = key.get().

It works most of the time, but it sometimes happens that key.get() will return None, although the entity definitely does exist. We tried adding use_cache=False, use_global_cache=False to the get(), but that did not help.
We could circumvent the issue by passing the entity itself (self) to the deferred method, but that is not good practice as it might have been modified before the deferred method runs. We could also add some code to sleep a while, then try again (and this actually does work). But I thought firebase in datastore mode now provided strong consistency, and I thought that meant I was certain to be able to read a freshly put entity.

The elapsed time between the put() and the deferred_post_put_hook() that will do the key.get() may vary, but in one example it was around 100ms, and yet the get() returned None.

Code example

class Thing(ndb.Model):
    name = ndb.StringProperty()

    def _post_put_hook(self, future):
        Thing.deferred_post_put_hook(self.key)

    @classmethod
    @defer()     # The defer() custom decorator handles calling the method via a Cloud Task.  It is quite reliable, and is likely not guilty here.
    def deferred_post_put_hook(cls, thing_key: ndb.Key):

        thing: Thing = thing_key.get()
        
        # Would process thing from here on, but on occasions thing is None
        
@product-auto-label product-auto-label bot added the api: datastore Issues related to the googleapis/python-ndb API. label Nov 29, 2021
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Nov 30, 2021
@patb2
Copy link
Author

patb2 commented Dec 1, 2021

Just to make it clearer, the effect of this "deferred" thing is that the thing_key.get() happens a few milliseconds after the put() and possibly on another GAE instance. This probably is a factor, because I have never witnessed a key.get() failing right after a put() on the same instance.

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Dec 4, 2021
@meredithslota meredithslota added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed triage me I really want to be triaged. 🚨 This issue needs some love. labels Feb 3, 2022
@yoshi-automation yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels May 4, 2022
@anoteboom
Copy link

I have witnessed this problem as well in our software. We rely on this posthook method to keep an elasticsearch in sync.
the key just isn't set in the posthook. The key seems to be available to the rest of code following the put() statement.

@anoteboom
Copy link

anoteboom commented Feb 15, 2024

We found a cause in the end. When the put would fail, for instance when the entity is larger than 1Mb, the post_hook is still triggered, with a empty key.
so for us it was just a side effect of another error we could prevent easily.

class Thing(ndb.Model):
    name = ndb.StringProperty()
    desc = ndb.TextProperty()

    def _post_put_hook(self, future):
        print(self.key)

thing = Thing(name= 'too big entity', desc='d' * 11000000)
# this put will throw an exception.  but the post_put_hook will run anyway with key=None
thing.put()

@patb2
Copy link
Author

patb2 commented Feb 16, 2024

Hi @anoteboom, and thanks for helping keep this issue alive. The case you encounter, with a failure of the put() is different from ours however, in our case, the put() definitely did occur, and we can later retrieve the entity.

Here is an experiment we did when trying to find a workaround:

thing: Thing = thing_key.get()
if thing is None:
    thing: Thing = thing_key.get(use_cache=False, use_global_cache=False)
    if thing is not None:
        logging.debug("Thing was None with cache, and not using cache resolved it")

    if thing is None:
        time.sleep(8)
        thing: Thing = thing_key.get(use_cache=False, use_global_cache=False)
        if thing is not None:
            logging.debug("Thing was None with cache, and not using cache *and waiting* resolved it")

    if thing is None:
        logging.error(f"Error on get thing_key : {thing_key}")
        return

    # Things are ok here, continue ... 

What happens is we find all 3 traces in our logs, meaning that all 3 cases happen (together with a great majority of successes): (a) reading the entity with a simple get() fails but reading with no_cache succeeds, (b) reading with no_cache fails as well but trying again after 8 seconds succeeds, (c) reading after wait fails as well. We ended up leaving the wait in place since it recovers a number of errors. But it is not always sufficient.

@patb2
Copy link
Author

patb2 commented Jul 31, 2024

Hi, Trying to keep alive this unresolved issue (classified as bug so far).
python-ndb APIs allow a read_consistency argument, that may be ndb.EVENTUAL or ndb.STRONG. And it is quite explicit that ndb.EVENTUAL would lead to transient inconsistencies. However the default value is ndb.STRONG, so we would expect consistency to be the default.

Here are two possible ways to mitigate this (not yet tested in a production environment):

  1. Read with retry:
def get_with_retry(thing_key):
    for attempt in range(RETRY_COUNT):
        thing = thing_key.get(read_consistency=ndb.STRONG)
        if thing is not None:
            return thing
        time.sleep(RETRY_DELAY)
    raise Exception(f"Failed to get entity with key {thing_key} after {RETRY_COUNT} attempts")
  1. Read with transaction
def get_thing_in_transaction(thing_key):
    @ndb.transactional()
    def txn():
        return thing_key.get(read_consistency=ndb.STRONG)
    return txn()

thing = get_thing_in_transaction(thing_key)

None of these is quite satisfactory however.

Also, please refer to this "Google Database Products" article, that says this:

In Datastore, there are only two APIs that provide a strongly consistent view for reading entity values and indexes: (1) the lookup by key method and (2) the ancestor query. If application logic requires strong consistency, then the developer should use one of these methods to read entities from Datastore.

The article is undated unfortunately. It would seem that a "lookup by key" (which I understand is a key.get()), offers strong consistency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: datastore Issues related to the googleapis/python-ndb API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

4 participants