Question: should an error from EndTransaction re-initialize the producer ID/epoch? #855

rodaine · 2024-11-05T18:20:21Z

Hello! I have a question regarding the expected behavior around Client.EndTransaction and GroupTransactSession.End. In the event of a network partition between the kgo client and the broker during this call, the end transaction eventually times out and returns an error, but the client is still left in a usable state and is able to start a new transaction. Some of our logs (including WARN logs from within kgo):

time=2024-11-04T19:13:40.712Z level=INFO msg="begin transaction" worker_id=1 txn_id=467
[ some fetches/produces within that transaction ]
time=2024-11-04T19:13:48.384Z level=INFO msg="end transaction started" worker_id=1 txn_id=467 intended_result=abort
time=2024-11-04T19:14:02.975Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp: lookup kafka3 on 10.0.0.1:53: server misbehaving"
time=2024-11-04T19:14:03.450Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp: lookup kafka3 on 10.0.0.1:53: server misbehaving"
time=2024-11-04T19:14:04.380Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp: lookup kafka3 on 10.0.0.1:53: server misbehaving"
time=2024-11-04T19:14:06.704Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp: lookup kafka3 on 10.0.0.1:53: server misbehaving"
time=2024-11-04T19:14:08.623Z level=WARN msg="read from broker errored, killing connection after 0 successful responses (is SASL missing?)" worker_id=1 req=Heartbeat addr=kafka1:9092 broker=1 err="read tcp 10.0.0.43:37712->10.0.0.40:9092: i/o timeout"
time=2024-11-04T19:14:09.206Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp: lookup kafka3 on 10.0.0.1:53: server misbehaving"
time=2024-11-04T19:14:11.707Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp: lookup kafka3 on 10.0.0.1:53: server misbehaving"
time=2024-11-04T19:14:12.876Z level=WARN msg="read from broker errored, killing connection after 0 successful responses (is SASL missing?)" worker_id=1 req=Metadata addr=kafka2:9092 broker=2 err="read tcp 10.0.0.43:45962->10.0.0.41:9092: i/o timeout"
time=2024-11-04T19:14:14.491Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp 10.0.0.42:9092: connect: connection refused"
time=2024-11-04T19:14:17.056Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp 10.0.0.42:9092: connect: connection refused"
time=2024-11-04T19:14:20.376Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp 10.0.0.42:9092: connect: connection refused"
time=2024-11-04T19:14:20.888Z level=WARN msg="read from broker errored, killing connection after 0 successful responses (is SASL missing?)" worker_id=1 req=EndTxn addr=kafka1:9092 broker=1 err="read tcp 10.0.0.43:52692->10.0.0.40:9092: i/o timeout"
time=2024-11-04T19:14:20.888Z level=WARN msg="end transaction finished" worker_id=1 txn_id=467 intended_result=abort result="unexpected error, txn state unknown" error="read tcp 10.0.0.43:52692->10.0.0.40:9092: i/o timeout"
[ call to session.Begin returns no error ]
time=2024-11-04T19:14:20.889Z level=INFO msg="begin transaction" worker_id=1 txn_id=473
[ essentially continuing with the same producer ID and epoch ]

From what we've seen, this can result in records produced in a transaction that was intended to be aborted (but whose end timed out in this way) being committed in the next transaction on the same client, since the producer ID/epoch does not appear to be re-initialized in this case (per KIP-360). From the docs on GroupTransactSession.End:

No returned error is retryable. Either the transactional ID has entered a failed state, or the client retried so much that the retry limit was hit, and odds are you should not continue.

So my questions:

1. When it says "should not continue", does that mean it's up to the user to close the client and create a new one?
2. Should the client not permit beginning a new transaction if this is the case?
3. If it should allow starting a new one, should the client reinitialize first to fence off the previous failure?


twmb · 2024-11-11T19:04:12Z

1. Yes.
2. Yes -- though this should already be the case? GroupTransactSession.End calls EndTransaction, which, on line 953, fails the producer ID with the error. The producer ID then internally always loads with an error, and all records fail with that error (line 379 of sink.go). What are the info level logs of the client?
3. Due to (2), the client should enter a fatal state, forcing you to close the client and reinitialize.

I'm curious to see the info logs?

rodaine · 2024-11-18T16:22:14Z

I believe I had the logger configured to only emit WARN+ logs, will adjust our Antithesis workload to emit INFO as well.

twmb added the waiting label Nov 11, 2024
