Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v0.2.11 - Non-graceful recovery after a transaction timeout #26

Open
truthordata opened this issue Nov 21, 2022 · 3 comments
Open

v0.2.11 - Non-graceful recovery after a transaction timeout #26

truthordata opened this issue Nov 21, 2022 · 3 comments

Comments

@truthordata
Copy link
Contributor

In Fluvii version 0.2.11, it appears that the producer needs an "epoch refresh" after a timeout exception.

PID 1 - 2022-11-19 17:17:43,092 - eloqua_bulk_batcher_controller.helper_funcs.status_methods - INFO: PeopleStream_PeopleStreamFromEloqua_ContactRetriever__2022-11-19T17:13:39Z__2022-11-19T17:17:14Z - Now waiting for sync creation, population, and record count...
PID 1 - 2022-11-19 17:17:43,092 - fluvii.producer.producer - INFO: Producing message with guid 13639a82-682e-11ed-90a6-0a58ac147930
PID 1 - 2022-11-19 17:17:43,092 - fluvii.consumer.consumer - INFO: Consumption attempts for this batch are finished.
%6|1668878263.093|FAIL|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker-1-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443: Disconnected (after 559819ms in state UP, 1 identical error(s) suppressed)
%6|1668878263.161|FAIL|rdkafka#producer-1| [thrd:sasl_ssl://broker-4-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-4-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/4: Disconnected (after 559889ms in state UP, 1 identical error(s) suppressed)
%6|1668878263.594|FAIL|rdkafka#producer-1| [thrd:sasl_ssl://mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:4]: sasl_ssl://mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/bootstrap: Disconnected (after 560321ms in state UP, 1 identical error(s) suppressed)
%6|1668878264.095|FAIL|rdkafka#producer-1| [thrd:sasl_ssl://broker-2-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-2-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/2: Disconnected (after 560820ms in state UP, 1 identical error(s) suppressed)
%6|1668878292.077|FAIL|rdkafka#consumer-2| [thrd:sasl_ssl://broker-3-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-3-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/3: Disconnected (after 214671ms in state UP)
%6|1668878292.081|FAIL|rdkafka#consumer-2| [thrd:sasl_ssl://broker-4-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-4-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/4: Disconnected (after 244698ms in state UP, 1 identical error(s) suppressed)
%6|1668878292.084|FAIL|rdkafka#consumer-2| [thrd:sasl_ssl://broker-5-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-5-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/5: Disconnected (after 244710ms in state UP, 1 identical error(s) suppressed)
%6|1668878292.087|FAIL|rdkafka#consumer-2| [thrd:sasl_ssl://broker-2-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-2-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/2: Disconnected (after 244701ms in state UP, 1 identical error(s) suppressed)
%6|1668878292.089|FAIL|rdkafka#consumer-2| [thrd:sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/0: Disconnected (after 244710ms in state UP, 1 identical error(s) suppressed)
%6|1668878292.092|FAIL|rdkafka#consumer-2| [thrd:sasl_ssl://broker-1-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-1-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/1: Disconnected (after 244704ms in state UP, 1 identical error(s) suppressed)
%6|1668878292.094|FAIL|rdkafka#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator: broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443: Disconnected (after 194267ms in state UP)
%4|1668878292.095|FAIL|rdkafka#consumer-2| [thrd:sasl_ssl://mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:4]: sasl_ssl://mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/bootstrap: Disconnected (after 1142751ms in state UP, 1 identical error(s) suppressed)
PID 1 - 2022-11-19 17:18:43,092 - fluvii.transaction.transaction - ERROR: KafkaError{code=_TIMED_OUT,val=-185,str="Transactional operation timed out"}
PID 1 - 2022-11-19 17:18:43,093 - fluvii.transaction.transaction - ERROR: <built-in method code of cimpl.KafkaError object at 0x7f88af1d0c10>
PID 1 - 2022-11-19 17:18:43,093 - fluvii.transaction.transaction - INFO: KafkaException: is retriable? - True, should abort? - False
PID 1 - 2022-11-19 17:18:43,093 - fluvii.fluvii_app.fluvii_app - INFO: Graceful transaction failure; retrying commit...
PID 1 - 2022-11-19 17:18:43,093 - fluvii.fluvii_app.fluvii_app - INFO: Consuming 1 messages over 10 seconds...
%5|1668878323.176|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/0: Timed out MetadataRequest in flight (after 60083ms, timeout #0)
%5|1668878323.176|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/0: Timed out ProduceRequest in flight (after 60016ms, timeout #1)
%5|1668878323.176|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/0: Timed out MetadataRequest in flight (after 60015ms, timeout #2)
%5|1668878323.176|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/0: Timed out FindCoordinatorRequest in flight (after 59979ms, timeout #3)
%5|1668878323.176|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/0: Timed out TxnOffsetCommitRequest in flight (after 59946ms, timeout #4)
%4|1668878323.177|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/0: Timed out 5 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
%3|1668878323.177|FAIL|rdkafka#producer-1| [thrd:sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/0: 5 request(s) timed out: disconnect (after 319861ms in state UP)
%3|1668878323.180|TXNERR|rdkafka#producer-1| [thrd:sasl_ssl://broker-0-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: Current transaction failed in state InTransaction: 1 message(s) timed out on NubiumIntegrations_Eloqua_EbbWorkerTasks [10] (_TIMED_OUT, requires epoch bump)
%6|1668878323.184|FAIL|rdkafka#producer-1| [thrd:sasl_ssl://broker-5-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcl]: sasl_ssl://broker-5-mode-ps-cbi-g--tbjmfoubd-hvg.bf2.kafka.rhcloud.com:443/5: Disconnected (after 319824ms in state UP, 1 identical error(s) suppressed)
%4|1668878323.295|GETPID|rdkafka#producer-1| [thrd:main]: Failed to acquire transactional PID from broker TxnCoordinator/1: Broker: Producer attempted to update a transaction while another concurrent operation on the same transaction was ongoing: retrying
%1|1668878323.798|TXNERR|rdkafka#producer-1| [thrd:main]: Fatal transaction error: Producer fenced by newer instance: Failed to acquire transactional PID from broker TxnCoordinator/1: Broker: There is a newer producer with the same transactionalId which fences the current one (_FENCED)
@truthordata
Copy link
Contributor Author

It seems that despite the library saying the transaction is "retriable", the easiest thing to do for now is likely just abort the transaction if a timeout exception is raised.

Maybe we even just kill the application to make sure everything gets refreshed cleanly, as there are likely other issues going on if a transaction timeout is happening, like we see in this case based on the fact the client could not acquire metadata either. In the future we might be able to come up with something more graceful.

@truthordata
Copy link
Contributor Author

Note that to handle this more cleanly, would need to investigate how librdkafka is handling producer epochs.

@truthordata truthordata changed the title Non-graceful recovery after a transaction timeout v0.2.11 - Non-graceful recovery after a transaction timeout Nov 21, 2022
@truthordata
Copy link
Contributor Author

Will monitor to see if #27 fixes this, which is now live (0.2.12)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant