
kgo.LiveProduceConnection option for low-latency settings #807

Open
helgeholm opened this issue Aug 15, 2024 · 6 comments · May be fixed by #839

Comments

@helgeholm

We are trying to use franz-go in a low-latency setting.

With a normal, steady flow of produced messages, we're seeing latencies of 5 ms through the system, which is nice. However, there are two cases where messages see 40-120 ms of latency because they must wait for a TCP connect, an SSL handshake, and auth:

  1. The first message sent triggers broker.loadConnection() and creates the broker.cxnProduce connection on the broker before sending.
  2. There is a lull in messages longer than cfg.connIdleTimeout. This causes broker.cxnProduce to be reaped, and scenario 1 repeats on the next message.

A config option like kgo.LiveProduceConnection that creates broker.cxnProduce on init and shields it from the connection reaper would solve our problem.
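
For illustration, the option we have in mind would be passed at client construction, roughly like this (kgo.LiveProduceConnection does not exist today, and the broker addresses are placeholders):

```go
package main

import "github.com/twmb/franz-go/pkg/kgo"

func main() {
	// kgo.LiveProduceConnection is the option proposed in this issue; it does
	// not exist yet. Everything else is standard client construction.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("broker-1:9092", "broker-2:9092", "broker-3:9092"),
		kgo.LiveProduceConnection(), // proposed: open produce connections at init and exempt them from the reaper
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```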

@twmb
Owner

twmb commented Aug 21, 2024

Would it be preferable to shield it from the reaper, or would a function that ensures a producer connection to a specific broker is open (or to all brokers with node ID -1, perhaps) be preferable? The function could return only once auth is complete, and could also return an error listing any brokers for which connections could not be opened.

@helgeholm
Author

Thanks for following up!

Ensuring connections to all brokers would be necessary for our use case, yes. Ideally we would send an "apiRequest" or other heartbeat-type message to each broker on startup and then at a regular interval.

We've done a lot more testing on our end and found that the brokers also have connection reapers, defaulting to 10 minutes. While these can be disabled, doing so relies on undocumented behavior, and some Kafka providers (most importantly ours) don't allow it.

Our topics have 12 partitions on a cluster of 3 brokers.

Currently we have it working well by producing dummy messages every 9 minutes, one to an arbitrary partition on each of the brokers. This keeps the connections alive indefinitely and we get the low-latency behavior. It would, however, be cleaner if we could maintain the connections without piling unnecessary messages onto the topics.
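
Roughly, the workaround looks like the sketch below. The topic name, partition numbers, and exact interval are illustrative, and the client is assumed to use kgo.RecordPartitioner(kgo.ManualPartitioner()) so the record's Partition field is honored:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// keepProduceConnectionsWarm produces a tiny record every 9 minutes to one
// partition led by each broker, so no produce connection sits idle long
// enough for the broker-side reaper (default 10 minutes) to close it.
func keepProduceConnectionsWarm(ctx context.Context, cl *kgo.Client) {
	ticker := time.NewTicker(9 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// With 12 partitions over 3 brokers, partitions 0-2 are assumed
			// to have distinct leaders; adjust to the real leader layout.
			for _, p := range []int32{0, 1, 2} {
				r := &kgo.Record{Topic: "heartbeat", Partition: p, Value: []byte("ping")}
				if err := cl.ProduceSync(ctx, r).FirstErr(); err != nil {
					log.Printf("heartbeat produce to partition %d failed: %v", p, err)
				}
			}
		}
	}
}
```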

@twmb
Owner

twmb commented Aug 26, 2024

Have you tried adjusting ConnIdleTimeout? Is that an option?
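
For reference, it is set at client construction; a minimal sketch (the broker address is a placeholder):

```go
package main

import (
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Raise the client-side idle timeout so the produce connection is not
	// reaped between infrequent produces. As discussed below, this only helps
	// up to the option's accepted maximum and does nothing about broker-side
	// reaping.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("broker-1:9092"),
		kgo.ConnIdleTimeout(15*time.Minute),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```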

@twmb twmb added the waiting label Aug 26, 2024
@helgeholm
Author

Some of our producers' traffic patterns have hours or even days between messages, so even the maximum value of 15 minutes is insufficient.

If kgo's configuration allowed higher values, or even 0 to disable the reaper, we could use it the way we currently use librdkafka: heartbeat messages every 30 minutes to stay within the server-side connection reaper's maximum tolerance of 60 minutes.

But our hope is to be able to maintain live connections even without generating messages on a heartbeat topic. :)

Allowing us to send ApiRequest messages to individual brokers in the cluster would be an alternative that lets us solve the problem ourselves.
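
Concretely, something along the lines of the sketch below, using broker-level requests with a cheap ApiVersions request. The caveat, as far as we can tell, is that this keeps a general request connection warm rather than the dedicated produce connection, which is why a produce-specific mechanism is still attractive:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
	"github.com/twmb/franz-go/pkg/kmsg"
)

// heartbeatBrokers periodically sends a cheap ApiVersions request to every
// discovered broker so no connection sits idle long enough to be reaped.
func heartbeatBrokers(ctx context.Context, cl *kgo.Client, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, b := range cl.DiscoveredBrokers() {
				if _, err := b.Request(ctx, kmsg.NewPtrApiVersionsRequest()); err != nil {
					log.Printf("broker heartbeat failed: %v", err)
				}
			}
		}
	}
}
```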

twmb added a commit that referenced this issue Oct 15, 2024
This can help reduce latency if you produce infrequently, but know
you'll be producing shortly.

Closes #807.
@twmb
Owner

twmb commented Oct 15, 2024

PTAL at the function introduced in this PR: #839

I don't think I'm going to squeeze this into the next release (tonight), so this may wait a month unfortunately, but lmk if that would solve what you're looking for.

@twmb twmb linked a pull request Oct 15, 2024 that will close this issue
@twmb twmb added the has pr label Oct 15, 2024
@helgeholm
Author

Thanks! The documented behavior makes sense. Specified or discovered brokers will definitely cover it.

If I understand the code path, it will (A) initiate a connection to each broker if no connection exists, and (B) if a connection does exist, reset any client-side timeout on it but otherwise perform no communication with the broker. Is this correct? Or will it perform some sort of network "ping" in case B?

If that understanding is correct, this will remove most cases of latency spikes as long as we keep calling EnsureProduceConnectionIsOpen often enough to immediately recover from server side or network middleware connection reaping, which I think we can keep down to around once every 30 minutes per broker. Definitely an improvement for us.

If I'm incorrect and case B also performs a "ping" (e.g. querying metadata, as with rd_kafka_metadata) or any other network communication, we would also avoid server-side connection reaping and would only need to call EnsureProduceConnectionIsOpen every few minutes. If that is happening, or can be added, it would mitigate all reconnect latency spikes that can be addressed client-side.
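
For what it's worth, we would drive it from a timer roughly like the sketch below. The method name here is just the one I've been using in this comment; the real exported name and signature are whatever #839 defines:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// keepWarm periodically re-ensures produce connections are open. The method
// called here is hypothetical (named as in this comment); see PR #839 for
// the actual API.
func keepWarm(ctx context.Context, cl *kgo.Client) {
	ticker := time.NewTicker(30 * time.Minute) // stays under the assumed 60-minute server-side reaper
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := cl.EnsureProduceConnectionIsOpen(ctx); err != nil { // hypothetical name
				log.Printf("reopening produce connections: %v", err)
			}
		}
	}
}
```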
