Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: should an error from EndTransaction re-initialize the producer ID/epoch? #855

Open
rodaine opened this issue Nov 5, 2024 · 2 comments
Labels

Comments

@rodaine
Copy link

rodaine commented Nov 5, 2024

Hello! I have a question regarding the expected behavior around Client.EndTransaction and GroupTransactSession.End. In the event a partition between the kgo client and the broker for this call, eventually the end transaction times out, returning an error, but the client is still left in a usable state and able to start a new transaction. Some of our logs (including WARN logs from within kgo):

time=2024-11-04T19:13:40.712Z level=INFO msg="begin transaction" worker_id=1 txn_id=467
[ some fetches/produces within that transaction ]
time=2024-11-04T19:13:48.384Z level=INFO msg="end transaction started" worker_id=1 txn_id=467 intended_result=abort
time=2024-11-04T19:14:02.975Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp: lookup kafka3 on 10.0.0.1:53: server misbehaving"
time=2024-11-04T19:14:03.450Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp: lookup kafka3 on 10.0.0.1:53: server misbehaving"
time=2024-11-04T19:14:04.380Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp: lookup kafka3 on 10.0.0.1:53: server misbehaving"
time=2024-11-04T19:14:06.704Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp: lookup kafka3 on 10.0.0.1:53: server misbehaving"
time=2024-11-04T19:14:08.623Z level=WARN msg="read from broker errored, killing connection after 0 successful responses (is SASL missing?)" worker_id=1 req=Heartbeat addr=kafka1:9092 broker=1 err="read tcp 10.0.0.43:37712->10.0.0.40:9092: i/o timeout"
time=2024-11-04T19:14:09.206Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp: lookup kafka3 on 10.0.0.1:53: server misbehaving"
time=2024-11-04T19:14:11.707Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp: lookup kafka3 on 10.0.0.1:53: server misbehaving"
time=2024-11-04T19:14:12.876Z level=WARN msg="read from broker errored, killing connection after 0 successful responses (is SASL missing?)" worker_id=1 req=Metadata addr=kafka2:9092 broker=2 err="read tcp 10.0.0.43:45962->10.0.0.41:9092: i/o timeout"
time=2024-11-04T19:14:14.491Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp 10.0.0.42:9092: connect: connection refused"
time=2024-11-04T19:14:17.056Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp 10.0.0.42:9092: connect: connection refused"
time=2024-11-04T19:14:20.376Z level=WARN msg="unable to open connection to broker" worker_id=1 addr=kafka3:9092 broker=3 err="dial tcp 10.0.0.42:9092: connect: connection refused"
time=2024-11-04T19:14:20.888Z level=WARN msg="read from broker errored, killing connection after 0 successful responses (is SASL missing?)" worker_id=1 req=EndTxn addr=kafka1:9092 broker=1 err="read tcp 10.0.0.43:52692->10.0.0.40:9092: i/o timeout"
time=2024-11-04T19:14:20.888Z level=WARN msg="end transaction finished" worker_id=1 txn_id=467 intended_result=abort result="unexpected error, txn state unknown" error="read tcp 10.0.0.43:52692->10.0.0.40:9092: i/o timeout"
[ call to session.Begin returns no error ]
time=2024-11-04T19:14:20.889Z level=INFO msg="begin transaction" worker_id=1 txn_id=473
[ essentially continuing with the same producer ID and epoch ]

This can result (from what we've seen) produced records added to a transaction intended to be aborted but timing out in this way being committed in the next transaction on the same client since the producer ID/epoch does not appear to be re-initailized in this case (per KIP-360). From the docs on GroupTransactSession.End:

No returned error is retryable. Either the transactional ID has entered a failed state, or the client retried so much that the retry limit was hit, and odds are you should not continue.

So my questions:

  1. When it says "should not continue" does that mean it's up to the user to close the client and create a new one?
  2. Should the client not permit beginning a new transaction if this is the case?
  3. If it should allow starting a new one, should the client reinitialize first to fence off the previous failure?
@twmb
Copy link
Owner

twmb commented Nov 11, 2024

  1. Yes
  2. Yes -- though this should already be the case? GroupTransactSession.End calls EndTransaction, which on line 953, fails the producer ID with the error. The producer ID then internally always loads with an error and all records fail with that error (line 379 of sink.go). What are the info level logs of the client?
  3. Due to (2) the client should enter a fatal state, forcing you to close the client and reinitialize

I'm curious to see the info logs?

@twmb twmb added the waiting label Nov 11, 2024
@rodaine
Copy link
Author

rodaine commented Nov 18, 2024

I believe I had the logger configured to only emit WARN+ logs, will adjust our Antithesis workload to emit INFO as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants