The TreeCache recipe seems to generate a large amount of traffic while reconnecting to the ZK node. This usually happens during a leader election in the cluster, which disconnects every client and puts it into a reconnection loop that lasts until the election finishes.
After reconnecting to the ZK node, the TreeCache tries to reload the entire tree. To test this, we ran a 3-node ZK cluster with 5k znodes and around 200 TreeCache clients (the clients use TreeCache to maintain an in-memory view of the ZK data). During the leader election, we saw a 40 to 50x increase in read traffic (from 600 rps to 30k rps). This ends up causing a thundering herd problem for the cluster.
After some discussion, we came up with a few ideas that could help avoid this situation:
1. Instead of reloading the entire tree after a reconnect event, reload only the znodes that were updated between the client's disconnect and reconnect. ZK guarantees that watches for the znodes will be triggered once clients get reconnected, so we can use this to do a selective update of the znodes.
2. Introduce client-side rate limiting to smooth the traffic burst after a reconnection. Clients should be able to use this to avoid overwhelming the cluster.
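To make the first idea concrete, here is a minimal sketch of a selective refresh. It is a toy model rather than kazoo's TreeCache: the `SelectiveCache` class, its method names, and the `fetch` callable are all hypothetical, and it assumes that a watch event reaches the client for every znode that changed while it was disconnected.

```python
from typing import Callable, Dict, Set


class SelectiveCache:
    """Toy in-memory cache that refetches only znodes flagged by a watch.

    Hypothetical illustration; this is not kazoo's TreeCache API.
    """

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch  # assumed: reads the data of a single znode
        self._data: Dict[str, bytes] = {}
        self._dirty: Set[str] = set()

    def load(self, path: str) -> None:
        """Initial load of a single znode."""
        self._data[path] = self._fetch(path)

    def on_watch_event(self, path: str) -> None:
        """Record that `path` changed; the read is deferred."""
        self._dirty.add(path)

    def on_reconnect(self) -> int:
        """Refetch only the dirty znodes and return how many reads were issued."""
        dirty, self._dirty = self._dirty, set()
        for path in dirty:
            self._data[path] = self._fetch(path)
        return len(dirty)

    def get(self, path: str) -> bytes:
        """Return the cached data for `path`."""
        return self._data[path]
```

With this shape, the number of reads after a reconnect scales with the number of changed znodes instead of the size of the tree, which is the property the first idea is after.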
Let me know what you think about the issue.
Add a "semaphore_impl" attribute on the various handlers.
Allow a new, optional, `concurrent_request_limit` argument to the client
constructor.
Change the client to bound the number of outstanding async requests with
a semaphore limited to `concurrent_request_limit`.
Fixes python-zk#664
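The bounding technique described in this commit message can be sketched roughly as follows. This is not the actual kazoo change: `BoundedSubmitter` is a hypothetical name, and a standard-library `ThreadPoolExecutor` and `threading.BoundedSemaphore` stand in for the client's handler and its `semaphore_impl`. Only the `concurrent_request_limit` name comes from the commit message.

```python
import threading
from concurrent.futures import Executor, Future
from typing import Any, Callable


class BoundedSubmitter:
    """Caps the number of outstanding async requests with a semaphore.

    Hypothetical sketch; kazoo's real client and handlers differ.
    """

    def __init__(self, executor: Executor, concurrent_request_limit: int) -> None:
        self._executor = executor
        self._sem = threading.BoundedSemaphore(concurrent_request_limit)

    def submit(self, fn: Callable[..., Any], *args: Any) -> Future:
        # Block the caller while the limit of in-flight requests is reached.
        self._sem.acquire()
        try:
            future = self._executor.submit(fn, *args)
        except BaseException:
            # The request was never issued, so give the slot back.
            self._sem.release()
            raise
        # Free the slot once the request finishes, fails, or is cancelled.
        future.add_done_callback(lambda _f: self._sem.release())
        return future
```

Because `submit` blocks once the limit is reached, a burst of reads after a reconnect is spread out over time instead of hitting the cluster all at once. Strictly speaking this limits concurrency rather than requests per second.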
Code referenced in the issue: kazoo/kazoo/recipe/cache.py, line 210 in 9bb8499