Revert "Fix possible endless wait in stop() after AUTH_FAILED error (… #744

Merged 1 commit on Mar 6, 2024

Contributor

@jeblair jeblair commented Feb 22, 2024

Revert "Fix possible endless wait in stop() after AUTH_FAILED error (#688)"

This reverts commit 5225b3e.

The commit being reverted here caused Kazoo not to empty the send
queue before disconnecting. This means that if a client submitted
asynchronous requests and then called client.stop(), the connection
would be closed immediately, usually after only one (but possibly
more) of the submitted requests were sent. Prior to this, Kazoo
would empty the queue of submitted requests all the way up to and
including the Close request when client.stop() was called.

Another area where this caused problems is in a busy multi-threaded
system. One thread might decide to gracefully close the connection,
but if there is any traffic generated by another thread, then the
connection would end up terminating without ever sending the Close
request.

Failure to gracefully shut down a ZooKeeper connection can mean that
other system components need to wait for ephemeral node timeouts to
detect that a component has shut down.

Given that this behavior is easily reproducible and can have serious
consequences in production, 5225b3e is reverted.
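
To make the difference concrete, here is a toy model of the two behaviors described above. It is only a sketch and is not Kazoo's connection code: `CLOSE`, `writer_drain`, `writer_stop_early`, and `simulate` are hypothetical names, and the "wire" is just a list recording what would have been sent.

```python
import queue

CLOSE = object()  # hypothetical sentinel standing in for the Close request


def writer_drain(send_queue, wire):
    """Send every queued request, returning only after Close has been sent."""
    while True:
        request = send_queue.get()
        wire.append(request)
        if request is CLOSE:
            return


def writer_stop_early(send_queue, wire, stop_requested):
    """Send a request, then exit as soon as a stop has been requested,
    even if the queue still holds requests (including Close)."""
    while True:
        wire.append(send_queue.get())
        if stop_requested:
            return


def simulate(drain, n_requests=100):
    """Queue n_requests, then 'call stop()', then run the chosen writer."""
    send_queue = queue.Queue()
    for i in range(n_requests):
        send_queue.put(f"create-{i}")
    send_queue.put(CLOSE)      # stop() enqueues the Close request
    stop_requested = True      # ... and marks the client as stopping
    wire = []
    if drain:
        writer_drain(send_queue, wire)
    else:
        writer_stop_early(send_queue, wire, stop_requested)
    return wire
```

In this model, `simulate(drain=True)` puts all 100 requests plus Close on the wire, while `simulate(drain=False)` sends a single request and never sends Close, which mirrors the symptom reported in this PR.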

Contributor Author

jeblair commented Feb 22, 2024

Here is a test script that will show the behavior in both async and
threaded environments:

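import kazoo.client
import time
import threading
import logging

# Log at level 5 (below DEBUG) so that low-level connection traffic is visible.
logging.basicConfig(level=5)
l = logging.getLogger('kazoo')
l.setLevel(5)

client = kazoo.client.KazooClient('localhost:2181')
client.start()

def send():
    # Blocking create, issued from a second thread in test_thread().
    client.create('/test', b'',
                  makepath=True, ephemeral=True, sequence=True)

def test_thread():
    # Threaded case: stop() races with a request sent by another thread.
    t = threading.Thread(target=send)
    t.start()
    client.stop()

def test_async():
    # Async case: queue 100 requests, then call stop() immediately.
    for x in range(100):
        client.create_async('/test', b'',
                            makepath=True, ephemeral=True, sequence=True)
    client.stop()

# Run one scenario at a time.
# test_thread()
test_async()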
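
As a possible complement to reading the logs, the outcome of each queued request could be tallied after `client.stop()` returns. The helper below is a sketch, not part of the script above: `summarize` is a hypothetical name, and it assumes each result object offers `wait(timeout)`, `ready()`, and `successful()`, which I believe the objects returned by `create_async` do.

```python
def summarize(results, timeout=5.0):
    """Count how many async results succeeded, failed, or never resolved."""
    counts = {"ok": 0, "failed": 0, "pending": 0}
    for result in results:
        result.wait(timeout)        # block until resolved or until the timeout
        if not result.ready():
            counts["pending"] += 1  # never resolved within the timeout
        elif result.successful():
            counts["ok"] += 1       # request completed without an exception
        else:
            counts["failed"] += 1   # request resolved with an exception
    return counts

# Hypothetical use inside test_async():
#     results = [client.create_async('/test', b'', makepath=True,
#                                    ephemeral=True, sequence=True)
#                for x in range(100)]
#     client.stop()
#     print(summarize(results))
```

If the send queue is drained before disconnecting, one would expect the `ok` count to equal the number of submitted requests; a low `ok` count would point to requests being dropped at stop time.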

Contributor Author

jeblair commented Feb 22, 2024

In addition to the reasoning in the commit message, I have also
tried to reproduce the problem described in #688, so as to
avoid regressing on that bug fix. I have been unable to reproduce the
error using the script provided in that PR. Since the PR description
mentioned threading, and the provided script did not appear to use it,
I also ran a test script I devised myself along with an added sleep
call in Kazoo to try to reproduce the behavior and was still unable to
do so. In every case, it appears that the remote ZooKeeper server
closed the connection after the AUTH_FAILED response which caused the
Kazoo loop to terminate. The script and changes I made can be seen at:
jeblair@7e22fea

I tested with ZooKeeper 3.8.3; perhaps that is a behavior change from
when the bug was originally reported.

@StephenSorriaux
Member

Thank you for the PR, I will look into it.

(FWIW, test failures seem unrelated to the change)

Contributor

@ceache ceache left a comment


Looking at the original PR, it seems the change was not addressing the issue correctly. And quite honestly, I don't understand how the test case provided in the gist triggered the issue that was described.
The proper and expected behavior is that everything ahead of the close in the send queue be flushed before closing. So the reversal looks proper to me.

In any case, LGTM. Thanks!


codecov bot commented Feb 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.81%. Comparing base (4c6bad8) to head (2fb93a8).
Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #744      +/-   ##
==========================================
- Coverage   96.84%   96.81%   -0.03%     
==========================================
  Files          27       27              
  Lines        3549     3549              
==========================================
- Hits         3437     3436       -1     
- Misses        112      113       +1     


Member

@StephenSorriaux StephenSorriaux left a comment


Thank you again for the PR.

You mentioned this might be an issue specific to some ZK version; I was not able to reproduce it using ZK 3.6.4 or ZK 3.5.10.
Also, stop() and close() are called after every test method (during tearDown()), so I think we would have detected the issue earlier if it were there. Unfortunately, the job logs have expired, so I can't check that.

FWIW, the initial PR mentioned issue #582, which seems more likely to be fixed by adding a timeout to the thread join().

Anyway, I agree with both of you and also think this should be reverted. Would it be possible to make sure the commit title matches our guidelines?
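
For reference, the general idea of a bounded thread join can be sketched with the standard library alone. This is an illustration of the technique only: which thread would be joined in Kazoo, and what timeout would be appropriate, is not stated in this thread, and `join_or_give_up` is a hypothetical helper name.

```python
import threading


def join_or_give_up(thread, timeout):
    """Wait up to `timeout` seconds; return True if the thread finished."""
    thread.join(timeout)
    # join() returns None whether or not it timed out, so check is_alive().
    return not thread.is_alive()


# A worker that blocks until released, standing in for a stuck thread.
release = threading.Event()
worker = threading.Thread(target=release.wait, daemon=True)
worker.start()

stuck = not join_or_give_up(worker, 0.1)   # times out: worker still blocked
release.set()                              # let the worker finish
finished = join_or_give_up(worker, 5.0)    # now the join completes
```

With a timeout, the caller regains control even when the worker never exits, at the cost of having to decide what to do with a thread that is still alive.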

@ceache ceache merged commit 8269235 into python-zk:master Mar 6, 2024
29 of 30 checks passed
Contributor

ceache commented Mar 6, 2024

Thanks!

4 participants