Revert "Fix possible endless wait in stop() after AUTH_FAILED error (… #744

jeblair · 2024-02-22T23:49:59Z

Revert "Fix possible endless wait in stop() after AUTH_FAILED error (#688)"

This reverts commit 5225b3e.

The commit being reverted here caused kazoo not to empty the send
queue before disconnecting. This means that if a client submitted
asynchronous requests and then called client.stop(), the connection
would be closed immediately, usually after only one (but possibly
more) of the submitted requests were sent. Prior to this, Kazoo
would empty the queue of submitted requests all the way up to and
including the Close request when client.stop() was called.

Another area where this caused problems is in a busy multi-threaded
system. One thread might decide to gracefully close the connection,
but if there is any traffic generated by another thread, then the
connection would end up terminating without ever sending the Close
request.

Failure to gracefully shut down a ZooKeeper connection can mean that
other system components need to wait for ephemeral node timeouts to
detect that a component has shut down.

Given that this behavior is easily reproducible and can have serious
consequences in production, 5225b3e is reverted.
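
To illustrate the difference described above, here is a minimal, self-contained sketch in plain Python. It is not Kazoo's connection code: the queue, the CLOSE marker, and all function names are hypothetical stand-ins, and the "stop after the first send" loop is only a simplified model of the regressed behavior.

```python
import queue

CLOSE = "close"  # hypothetical stand-in for the Close request


def submit_then_stop(count):
    """Queue `count` requests, then the Close request, as stop() would."""
    send_queue = queue.Queue()
    for i in range(count):
        send_queue.put(f"create-{i}")
    send_queue.put(CLOSE)
    return send_queue


def drain_until_close(send_queue):
    """Model of the expected behavior: send everything up to and
    including the Close request."""
    sent = []
    while True:
        request = send_queue.get_nowait()
        sent.append(request)
        if request == CLOSE:
            return sent


def stop_after_first_send(send_queue):
    """Simplified model of the regression: the loop notices the stop
    flag after one send and exits, abandoning the rest of the queue."""
    sent = []
    stopped = True  # assumption: stop() has already been called
    while not send_queue.empty():
        sent.append(send_queue.get_nowait())
        if stopped:
            break
    return sent


drained = drain_until_close(submit_then_stop(100))
cut_short = stop_after_first_send(submit_then_stop(100))

print(len(drained), drained[-1])           # 101 close
print(len(cut_short), CLOSE in cut_short)  # 1 False
```

In the second case the Close request is never sent, which corresponds to the situation where other components have to wait for the ephemeral node timeout.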

jeblair · 2024-02-22T23:50:12Z

Here is a test script that will show the behavior in both async and
threaded environments:

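import kazoo.client
import threading
import logging

# Level 5 is below logging.DEBUG (10), so Kazoo's most verbose
# connection-level messages are included in the output.
logging.basicConfig(level=5)
l = logging.getLogger('kazoo')
l.setLevel(5)

client = kazoo.client.KazooClient('localhost:2181')
client.start()

def send():
    # Blocking create, issued from a second thread.
    client.create('/test', b'',
                  makepath=True, ephemeral=True, sequence=True)

def test_thread():
    # Race a request from another thread against stop().
    t = threading.Thread(target=send)
    t.start()
    client.stop()

def test_async():
    # Queue 100 requests, then stop. All of them, plus the Close
    # request, should be sent before the connection is closed.
    for x in range(100):
        client.create_async('/test', b'',
                            makepath=True, ephemeral=True, sequence=True)
    client.stop()

# Run one test per invocation; both share the same client.
# test_thread()
test_async()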

jeblair · 2024-02-22T23:51:09Z

In addition to the reasoning in the commit message, I have also
attempted to reproduce the problem described in #688 in an attempt to
avoid regressing on that bug fix. I have been unable to reproduce the
error using the script provided in that PR. Since the PR description
mentioned threading, and the provided script did not appear to use it,
I also ran a test script I devised myself along with an added sleep
call in Kazoo to try to reproduce the behavior and was still unable to
do so. In every case, it appears that the remote ZooKeeper server
closed the connection after the AUTH_FAILED response which caused the
Kazoo loop to terminate. The script and changes I made can be seen at:
jeblair@7e22fea

I tested with ZooKeeper 3.8.3; perhaps that is a behavior change from
when the bug was originally reported.

StephenSorriaux · 2024-02-23T14:26:20Z

Thank you for the PR, I will look into it.

(FWIW, the test failures seem unrelated to the change)

ceache

Looking at the original PR, it seems the change was not addressing the issue correctly. And quite honestly, I don't understand how the test case provided in the gist triggered the issue that was described.
The proper and expected behavior is that everything ahead of the close in the send queue be flushed before closing. So the reversal looks proper to me.

In any case, LGTM. Thanks!

codecov · 2024-02-27T04:16:31Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.81%. Comparing base (4c6bad8) to head (2fb93a8).
Report is 4 commits behind head on master.

@@            Coverage Diff             @@
##           master     #744      +/-   ##
==========================================
- Coverage   96.84%   96.81%   -0.03%     
==========================================
  Files          27       27              
  Lines        3549     3549              
==========================================
- Hits         3437     3436       -1     
- Misses        112      113       +1

StephenSorriaux

Thank you again for the PR.

You mentioned this might be an issue specific to some ZK versions; I was not able to reproduce it using ZK 3.6.4 or ZK 3.5.10.
Also, stop() and close() are called after every test method (during tearDown()), so I think we would have detected it before if the issue were there. Unfortunately the job logs have expired, so I can't check that.

FWIW, the initial PR mentioned issue #582, which seems more likely to be fixed by adding a timeout to the thread join().

Anyway, I agree with both of you and also think this should be reverted. Would it be possible to make sure the commit title matches our guidelines?
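
As a rough illustration of the join() timeout idea mentioned above, here is a generic sketch using only the standard library. It is not taken from Kazoo or from #582; `stop_worker` and the event names are hypothetical, and the point is only that a bounded join() returns control to the caller even when the worker never exits.

```python
import threading


def stop_worker(thread, stop_event, timeout=1.0):
    """Signal the worker to stop and wait at most `timeout` seconds.

    Returns True if the thread exited, False if it is still running.
    """
    stop_event.set()
    thread.join(timeout)
    return not thread.is_alive()


# A cooperative worker: it exits as soon as the stop event is set.
stop = threading.Event()
cooperative = threading.Thread(target=stop.wait)
cooperative.start()
cooperative_exited = stop_worker(cooperative, stop)

# A stuck worker: it waits on an event that stop_worker() never sets.
# It is a daemon thread so it cannot keep the interpreter alive.
release = threading.Event()
stuck = threading.Thread(target=release.wait, daemon=True)
stuck.start()
stuck_exited = stop_worker(stuck, threading.Event(), timeout=0.1)

print(cooperative_exited, stuck_exited)  # True False

# Clean up the stuck worker now that the point has been made.
release.set()
stuck.join()
```

Without the timeout, the second call would block for as long as the worker stays stuck, which is the kind of endless wait a bounded join() avoids.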

ceache · 2024-03-06T13:19:22Z

Thanks!

westphahl approved these changes Feb 23, 2024

StephenSorriaux requested review from jeffwidman, ceache, a-ungurianu and StephenSorriaux February 23, 2024 14:25

ceache approved these changes Feb 26, 2024

ceache force-pushed the revert-close branch from 95e42c9 to f337a4a February 27, 2024 04:09

StephenSorriaux approved these changes Mar 1, 2024

jeblair force-pushed the revert-close branch from f337a4a to 2fb93a8 March 1, 2024 19:49

ceache merged commit 8269235 into python-zk:master Mar 6, 2024
29 of 30 checks passed
