
kgo.LiveProduceConnection option for low-latency settings #807

Open
helgeholm opened this issue Aug 15, 2024 · 6 comments · May be fixed by #839

Comments

@helgeholm

We are trying to use franz-go in a low-latency setting.

With a normal, steady flow of produced messages, we're seeing latencies of 5 ms through the system, which is nice. However, there are two cases where messages see 40-120 ms of latency because they must wait for a TCP connect, an SSL handshake, and auth:

  1. The first message sent triggers broker.loadConnection() and creates the broker.cxnProduce connection on the broker before sending.
  2. There is a lull in messages longer than cfg.connIdleTimeout. This causes broker.cxnProduce to be reaped, and scenario 1 repeats on the next message.

A config option like kgo.LiveProduceConnection that creates broker.cxnProduce on init and shields it from the connection reaper would solve our problem.
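
For illustration, the option we have in mind would be passed at client construction, roughly like this (kgo.LiveProduceConnection does not exist today, and the broker addresses are placeholders):

```go
package main

import "github.com/twmb/franz-go/pkg/kgo"

func main() {
	// kgo.LiveProduceConnection is the option proposed in this issue; it does
	// not exist yet. Everything else is standard client construction.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("broker-1:9092", "broker-2:9092", "broker-3:9092"),
		kgo.LiveProduceConnection(), // proposed: open produce connections at init and exempt them from the reaper
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```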

@twmb
Owner

twmb commented Aug 21, 2024

Would it be preferable to shield it from the reaper, or would a function that ensures a producer connection to a specific broker is open (or to all brokers with node ID -1, perhaps) be preferable? The function could return only once auth is complete, and could also return an error listing any brokers for which connections could not be opened.

@helgeholm
Author

Thanks for following up!

Ensuring connections to all brokers would be necessary for our use case, yes. Ideally we would send an "apiRequest" or other heartbeat-type message to each broker on startup and then at a regular interval.

We've done a lot more testing on our end and found that the brokers also have connection reapers, defaulting to 10 minutes. While these can be disabled, doing so relies on undocumented behavior, and some Kafka providers (most importantly ours) don't allow it.

Our topics have 12 partitions on a cluster of 3 brokers.

Currently we have it working well by producing dummy messages every 9 minutes, one to an arbitrary partition on each of the brokers. This keeps the connections alive indefinitely and we get the low-latency behavior. It would, however, be cleaner if we could maintain the connections without piling unnecessary messages onto the topics.
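
Roughly, the workaround looks like the sketch below. The topic name, partition numbers, and exact interval are illustrative, and the client is assumed to use kgo.RecordPartitioner(kgo.ManualPartitioner()) so the record's Partition field is honored:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// keepProduceConnectionsWarm produces a tiny record every 9 minutes to one
// partition led by each broker, so no produce connection sits idle long
// enough for the broker-side reaper (default 10 minutes) to close it.
func keepProduceConnectionsWarm(ctx context.Context, cl *kgo.Client) {
	ticker := time.NewTicker(9 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// With 12 partitions over 3 brokers, partitions 0-2 are assumed
			// to have distinct leaders; adjust to the real leader layout.
			for _, p := range []int32{0, 1, 2} {
				r := &kgo.Record{Topic: "heartbeat", Partition: p, Value: []byte("ping")}
				if err := cl.ProduceSync(ctx, r).FirstErr(); err != nil {
					log.Printf("heartbeat produce to partition %d failed: %v", p, err)
				}
			}
		}
	}
}
```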

@twmb
Owner

twmb commented Aug 26, 2024

Have you tried adjusting ConnIdleTimeout? Is that an option?
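
For reference, it is set at client construction; a minimal sketch (the broker address is a placeholder):

```go
package main

import (
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Raise the client-side idle timeout so the produce connection is not
	// reaped between infrequent produces. As discussed below, this only helps
	// up to the option's accepted maximum and does nothing about broker-side
	// reaping.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("broker-1:9092"),
		kgo.ConnIdleTimeout(15*time.Minute),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```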

@twmb twmb added the waiting label Aug 26, 2024
@helgeholm
Author

Some of our producers' traffic patterns have hours or even days between messages, so even the maximum value of 15 minutes is insufficient.

If kgo's configuration allowed higher values, or even 0 to disable the reaper, we could use it the way we currently use librdkafka: heartbeat messages every 30 minutes to stay within the server-side connection reaper's maximum tolerance of 60 minutes.

But our hope is to be able to maintain live connections even without generating messages on a heartbeat topic. :)

Allowing us to send ApiRequest messages to individual brokers in the cluster would be an alternative that lets us solve the problem ourselves.
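
Concretely, something along the lines of the sketch below, using broker-level requests with a cheap ApiVersions request. The caveat, as far as we can tell, is that this keeps a general request connection warm rather than the dedicated produce connection, which is why a produce-specific mechanism is still attractive:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
	"github.com/twmb/franz-go/pkg/kmsg"
)

// heartbeatBrokers periodically sends a cheap ApiVersions request to every
// discovered broker so no connection sits idle long enough to be reaped.
func heartbeatBrokers(ctx context.Context, cl *kgo.Client, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, b := range cl.DiscoveredBrokers() {
				if _, err := b.Request(ctx, kmsg.NewPtrApiVersionsRequest()); err != nil {
					log.Printf("broker heartbeat failed: %v", err)
				}
			}
		}
	}
}
```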

twmb added a commit that referenced this issue Oct 15, 2024
This can help reduce latency if you produce infrequently, but know
you'll be producing shortly.

Closes #807.
@twmb
Owner

twmb commented Oct 15, 2024

PTAL at the function introduced in this PR: #839

I don't think I'm going to squeeze this into the next release (tonight), so this may wait a month unfortunately, but lmk if that would solve what you're looking for.

@twmb twmb linked a pull request Oct 15, 2024 that will close this issue
@twmb twmb added the has pr label Oct 15, 2024
@helgeholm
Author

Thanks! The documented behavior makes sense. Specified or discovered brokers will definitely cover it.

If I understand the code path, it will (A) initiate a connection to each broker if no connection exists, and (B) if a connection does exist, reset any client-side timeout on it but otherwise perform no communication with the broker. Is this correct? Or will it perform some sort of network "ping" in case B?

If that understanding is correct, this will remove most cases of latency spikes as long as we keep calling EnsureProduceConnectionIsOpen often enough to immediately recover from server side or network middleware connection reaping, which I think we can keep down to around once every 30 minutes per broker. Definitely an improvement for us.

If I'm incorrect and case B also performs a "ping" (e.g. querying metadata, as with rd_kafka_metadata) or any other network communication, we would also avoid server-side connection reaping and would only need to call EnsureProduceConnectionIsOpen every few minutes. If that is happening, or can be added, it would mitigate all reconnect latency spikes that can be addressed client-side.
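
For what it's worth, we would drive it from a timer roughly like the sketch below. The method name here is just the one I've been using in this comment; the real exported name and signature are whatever #839 defines:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// keepWarm periodically re-ensures produce connections are open. The method
// called here is hypothetical (named as in this comment); see PR #839 for
// the actual API.
func keepWarm(ctx context.Context, cl *kgo.Client) {
	ticker := time.NewTicker(30 * time.Minute) // stays under the assumed 60-minute server-side reaper
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := cl.EnsureProduceConnectionIsOpen(ctx); err != nil { // hypothetical name
				log.Printf("reopening produce connections: %v", err)
			}
		}
	}
}
```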
