Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(core): Adjust connection timeout and retry logic #685

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ceache
Copy link
Contributor

@ceache ceache commented Nov 24, 2022

Why is this needed?

Do not use the session timeout as connection timeout. This value is too large and results in a bad non-responsive server holding up the client long enough for the session to timeout.

Proposed Changes

  • Use the KazooRetry object to use an increasing backoff timeout and cycle over all servers quickly, working around bad servers with minimal impact.

Does this PR introduce any breaking change?

This change should be transparent at a high level.
It makes the client connection attempts more aggressive, on purpose.

kazoo/retry.py Outdated Show resolved Hide resolved
kazoo/tests/test_connection.py Outdated Show resolved Hide resolved
kazoo/tests/test_connection.py Outdated Show resolved Hide resolved
@ceache ceache force-pushed the feat/connection_retry branch 2 times, most recently from 4a89840 to 0170cdf Compare November 24, 2022 04:13
@ceache ceache self-assigned this Nov 24, 2022
@ceache
Copy link
Contributor Author

ceache commented Nov 24, 2022

@python-zk/maintainers This is an early draft of fix to the connection logic. The original issue that sent me on this fix is as follow:

zkc = KazooClient(hosts=["nonresponsive:2181", "goodhost:2181"], timeout=60)
zkc.start()

On a new connection, it results in a painful delay of 60 seconds wasted on the bad server before trying the good one.
On a reconnect, it is worse. It results in the connection being stuck on an attempt on the bad server with a TCP timeout set to 60 seconds, i.e. equal to the session timeout. By the time it gives up, and moves on to trying the good host, the session is already lost.

@codecov-commenter
Copy link

codecov-commenter commented Nov 24, 2022

Codecov Report

Patch coverage: 100.00% and project coverage change: -2.26 ⚠️

Comparison is base (2c36d69) 96.65% compared to head (fd57818) 94.40%.

❗ Current head fd57818 differs from pull request most recent head e34e407. Consider uploading reports for the commit e34e407 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #685      +/-   ##
==========================================
- Coverage   96.65%   94.40%   -2.26%     
==========================================
  Files          27       58      +31     
  Lines        3554     8413    +4859     
==========================================
+ Hits         3435     7942    +4507     
- Misses        119      471     +352     
Impacted Files Coverage Δ
kazoo/protocol/connection.py 96.28% <100.00%> (-1.44%) ⬇️
kazoo/retry.py 100.00% <100.00%> (ø)
kazoo/tests/test_connection.py 100.00% <100.00%> (ø)

... and 41 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ceache
Copy link
Contributor Author

ceache commented Nov 24, 2022

I am pondering if the below should be changed too.

I really do not think session_timeout_in_sec / number_of_hosts is appropriate here.
I would like it to use the client's connection_retry or even command_retry.

        connect_result, zxid = self._invoke(
            client._session_timeout / 1000.0 / len(client.hosts), connect
        )

Copy link
Contributor

@ztzg ztzg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, in general—except possibly for the default initial cur_delay. I have left a few comments.

connect_result, zxid = self._invoke(
client._session_timeout / 1000.0 / len(client.hosts), connect
)
connect_result, zxid = self._invoke(timeout, connect)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that _session_timeout, being the requested session timeout and thus potentially arbitrarily larger than the actual session timeout, is an unfortunate choice.

Thinking about it some more, I am not sure that using the retry's cur_delay as-is is a good choice either. It may have been sufficient for the TCP (and potentially SSL) handshakes, but this request/response cycle involves a round trip through the JVM, disk logging, quorum packets (learner with global sessions), etc.

I can imagine many "false starts" (and thus gratuitous network & server activity) if the default cur_delay is kept too low.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, _authenticate_with_sasl and AUTH_XID below reuse connect_timeout—which, at least, is related to the negotiated timeout—for "numerous" write and read operations which have to complete timely as they do not "touch" the session.

The good news is that the session is "fresh" at that point. But in theory, the negotiated timeout should be used to determine a deadline for the whole sequence, so that it aborts before causing a session expiry if e.g. SASL authentication proceeds too slowly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Note that this is not handled in the other clients either.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the "session timeout" is an indication what you want (as you point out, before negotiation with the server's config) for how long your session will remain "alive" in case the client disconnects. In other words, how much time the client has to reconnect.

I have a more "radical" view than yours, I think, which is that it has nothing to do with how long it takes to establish TCP connection between the client and the servers.

The increasing backoff method is a good way to adapt to Zookeeper servers that can be deployed on the local network (~10s ms), across data centers (~100s ms) or across continents. The default value would be more fitting for the 2 first use cases and, yes, possibly less so for wider deployments, but anyway for such setups, it is possible to easily tweak that value.

That is my opinion for the TCP connection timeout.

For the requests timeout, which is once the TCP connection is established, this is something else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For request/reply messages on an established TCP connection, including Connect requests.

The original code used client._session_timeout / 1000.0 / len(client.hosts) for the Connect (pre-negotiation).

After that, all other request/reply use these two, connect_timeout for writes, read_timeout for reads:

connect_timeout = negotiated_session_timeout / len(client.hosts)
read_timeout = negotiated_session_timeout * 2.0 / 3.0

My rational for changing the timeout on the 1st Connect is that it is possible to have a healthy host (which succeeds to TCP connect) but an unhealthy JVM/Zookeeper that does not handle the Connect request.

In such case, the client would wait a long time before timing out and moving to the next server, much of its total session timeout.
It feels that waiting arbitrary long when we know what was the timeout of a successful TCP is not ideal, especially when we have not yet selected a server and there could be others.

But I could be convinced that this should be part of a different change.

@@ -614,7 +614,9 @@ def _connect_attempt(self, host, hostip, port, retry):

try:
self._xid = 0
read_timeout, connect_timeout = self._connect(host, hostip, port)
read_timeout, connect_timeout = self._connect(
host, hostip, port, timeout=retry.cur_delay
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic looks good, but the default initial _cur_delay is 100ms. Do we really want to keep it that low?

@ceache ceache force-pushed the feat/connection_retry branch from b738347 to fd57818 Compare November 26, 2022 00:03
@ceache ceache force-pushed the feat/connection_retry branch from fd57818 to d4a28f8 Compare January 30, 2023 05:58
@ceache ceache marked this pull request as ready for review January 30, 2023 06:03
jeffwidman
jeffwidman previously approved these changes Jan 30, 2023
Copy link
Member

@jeffwidman jeffwidman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high-level this makes complete sense and the implementation looks solid. Reality of course is that timeouts are always tricky, so we may need to further tweak once this is live in prod and folks start to see how it behaves.

@ceache ceache force-pushed the feat/connection_retry branch 2 times, most recently from c36c2a1 to 2295cbc Compare February 12, 2023 14:34
@ceache ceache force-pushed the feat/connection_retry branch from 2295cbc to c634e52 Compare April 24, 2023 21:14
Do not use the session timeout as connection timeout.
This value is too large and results in a bad non-responsive server
holding up the client long enough for the session to timeout.

Use the KazooRetry object to use an increasing backoff timeout and cycle
over all servers quickly, working around bad servers with minimal
impact.
@ceache ceache force-pushed the feat/connection_retry branch from c634e52 to e34e407 Compare April 24, 2023 21:19
@StephenSorriaux
Copy link
Member

Just like the others, I think the changes (+ explanations) make a lot of sense. Although, just like @ztzg commented, the new default value might seem a little low from what the previous was (100ms vs 10s). I like the changes making the client less lazy but I am not sure if that kind of default value would suit to every current possible uses of the client and, if possible, I would prefer we try not to change it.
I noticed that in ConnectionHandler, the retry_sleeper is only used for the zk_loop, do you think some logic around the initial connection_retry value passed to the client would be doable/make sense? My (possibly naive) idea would be to use your new behavior if the connection_retry is given (ie. the user was kind of expecting this behavior), but keep the previous value if connection_retry is None (ie. just keep things like before).

@ztzg
Copy link
Contributor

ztzg commented Jun 13, 2024

Hi @StephenSorriaux, all,

It would be great to have this fixed in master, as teams using unpatched versions of Kazoo regularly experience "outages."

Just like the others, I think the changes (+ explanations) make a lot of sense. Although, just like @ztzg commented, the new default value might seem a little low from what the previous was (100ms vs 10s).

Note that 10s is (supposed to be) the session lifetime, and that by using it as a connection timeout, the session is dead by the time any unresponsive TCP/IP operation aborts. Which is… suboptimal!

I would not have dared 100ms, but agree with @ceache that the backoff should cause clients to quickly adapt to the actual network.

In any case, I vote for it over repeated puzzled users :)

I like the changes making the client less lazy but I am not sure if that kind of default value would suit to every current possible uses of the client and, if possible, I would prefer we try not to change it.

As Jeff wrote, "timeouts are always tricky"—but I don't understand why we would want to keep using this value, as it basically guarantees that the session is… dead… by the deadline. Am I missing something?

It seems that the default value should ideally be below (lifetime / n_hosts), to give the client a chance to try each host once—and even lower if we want to allow multiple "rounds."

For the record, the Java client initially sets the TCP/IP connect timeout to (requested_lifetime / n_hosts), and ajusts it to (negotiated_lifetime / n_hosts) once the negotiated lifetime is known—but it does not apply backoff.

I noticed that in ConnectionHandler, the retry_sleeper is only used for the zk_loop, do you think some logic around the initial connection_retry value passed to the client would be doable/make sense? My (possibly naive) idea would be to use your new behavior if the connection_retry is given (ie. the user was kind of expecting this behavior), but keep the previous value if connection_retry is None (ie. just keep things like before).

This would alleviate the need for a patched Kazoo, but would IMHO still be a suboptimal default.

Cheers, -D

@andanicolae
Copy link

Hi all. Gentle reminder :)
Can someone pls take a look and do the necessary changes so that this patch gets merged?
I would need a new kazoo package version that includes this patch.
Thank you :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants