Describe the bug
We were debugging a weird issue where multiple HelixTaskExecutor-message_handle_thread threads were all contending for the lock on _controllerOpt.
Code-wise, createNewStateModel() creates a new DistClusterControllerStateModel instance for each (resourceName, partitionKey) pair.
Initially we suspected that all of these threads were handling (possibly stale) state transitions for the same (resourceName, partitionKey) pair, but that was ruled out once we confirmed they were processing state transitions for different clusters, i.e. different resources of the SuperCluster/CONTROLLER_CLUSTER.
Example log lines from threads 0, 35 and 37:
2025/06/25 09:37:13.340 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_0] [helix] [] handling message: 9591d4c7-b08a-466e-bf21-2cfa95d94896 transit gobblin-ddm-kfketl2-test-ltx1-holdem-test-mho.gobblin-ddm-kfketl2-test-ltx1-holdem-test-mho|[] from:OFFLINE to:DROPPED, relayedFrom: null
2025/06/25 09:37:13.342 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_0] [helix] [] Instance ltx1-app0932.xyz.com_12923, partition gobblin-ddm-kfketl2-test-ltx1-holdem-test-mho received state transition from OFFLINE to DROPPED on session 1100800c49e63bb4, message id: 9591d4c7-b08a-466e-bf21-2cfa95d94896
2025/06/25 09:36:51.119 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_35] [helix] [] handling message: 0fae1c0d-129a-47c2-a8fd-2c2d00f0d7ca transit gobblin-kafka-streaming-service-ltx1-holdem-test-ppc.gobblin-kafka-streaming-service-ltx1-holdem-test-ppc|[] from:STANDBY to:LEADER, relayedFrom: null
2025/06/25 09:36:51.120 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_35] [helix] [] Instance ltx1-app0932.xyz.com_12923, partition gobblin-kafka-streaming-service-ltx1-holdem-test-ppc received state transition from STANDBY to LEADER on session 1100800c49e63bb4, message id: 0fae1c0d-129a-47c2-a8fd-2c2d00f0d7ca
2025/06/25 09:36:51.120 INFO [DistClusterControllerStateModel] [HelixTaskExecutor-message_handle_thread_35] [helix] [] ltx1-app0932.xyz.com_12923 becoming leader from standby for gobblin-kafka-streaming-service-ltx1-holdem-test-ppc
2025/06/25 09:36:51.408 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_37] [helix] [] handling message: 03c3a0e9-40e9-4169-8ec5-4c25d3c0f290 transit gobblin-kafka-streaming-tracking-ltx1-holdem-medvol-localConsumption.gobblin-kafka-streaming-tracking-ltx1-holdem-medvol-localConsumption|[] from:STANDBY to:LEADER, relayedFrom: null
2025/06/25 09:36:51.409 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_37] [helix] [] Instance ltx1-app0932.xyz.com_12923, partition gobblin-kafka-streaming-tracking-ltx1-holdem-medvol-localConsumption received state transition from STANDBY to LEADER on session 1100800c49e63bb4, message id: 03c3a0e9-40e9-4169-8ec5-4c25d3c0f290
2025/06/25 09:36:51.409 INFO [DistClusterControllerStateModel] [HelixTaskExecutor-message_handle_thread_37] [helix] [] ltx1-app0932.xyz.com_12923 becoming leader from standby for gobblin-kafka-streaming-tracking-ltx1-holdem-medvol-localConsumption
It turns out there is a bug in DistClusterControllerStateModel: it initializes _controllerOpt with Optional.empty().
Even though each DistClusterControllerStateModel is a separate object, Optional.empty() returns a cached singleton, so the _controllerOpt reference is the same across all of these objects (see the Javadoc for Optional.empty()).
We simulated this through a test and verified the behavior.
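A minimal, standalone sketch of that check (using a hypothetical stub class in place of the real state model, since _controllerOpt is a private field; the field name and type here are only illustrative):

```java
import java.util.Optional;

public class OptionalEmptySingletonDemo {
    // Stand-in for DistClusterControllerStateModel; the real field holds the
    // controller's manager, this stub only mirrors the initialization pattern.
    static class StateModelStub {
        final Optional<Object> _controllerOpt = Optional.empty();
    }

    public static void main(String[] args) {
        StateModelStub a = new StateModelStub();
        StateModelStub b = new StateModelStub();

        // Optional.empty() returns a cached singleton, so two "independent"
        // state model instances end up holding the exact same object.
        System.out.println(a._controllerOpt == b._controllerOpt); // prints: true

        // Consequently, synchronized (_controllerOpt) in one instance blocks
        // synchronized (_controllerOpt) in every other instance.
    }
}
```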
To Reproduce
Create two DistClusterControllerStateModel objects and compare their _controllerOpt fields: the reference comparison returns true.
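To see the knock-on effect on the message handling threads, here is a hedged illustration (again with a stub class, not the actual Helix code) of two supposedly independent state models serializing on the shared empty Optional:

```java
import java.util.Optional;
import java.util.concurrent.TimeUnit;

public class SharedMonitorContentionDemo {
    static class StateModelStub {
        final Optional<Object> _controllerOpt = Optional.empty();

        void becomeLeaderFromStandby(String threadName) throws InterruptedException {
            synchronized (_controllerOpt) {
                System.out.println(threadName + " acquired the lock");
                TimeUnit.SECONDS.sleep(2); // simulate a slow controller start
                System.out.println(threadName + " released the lock");
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Two different instances, as created per (resourceName, partitionKey) pair.
        StateModelStub clusterA = new StateModelStub();
        StateModelStub clusterB = new StateModelStub();

        Thread t1 = new Thread(() -> {
            try { clusterA.becomeLeaderFromStandby("thread-A"); } catch (InterruptedException ignored) { }
        });
        Thread t2 = new Thread(() -> {
            try { clusterB.becomeLeaderFromStandby("thread-B"); } catch (InterruptedException ignored) { }
        });

        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // thread-B only enters the synchronized block after thread-A releases it,
        // even though the two state models are for different clusters.
    }
}
```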
Expected behavior
The lock should be scoped per (resource, partition) pair, not shared across all of them.
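One possible direction, sketched below under the assumption that the synchronization is on the _controllerOpt field itself: keep the Optional for the controller reference, but guard it with a dedicated per-instance lock object (a sketch only, not the actual Helix patch):

```java
import java.util.Optional;

public class PerInstanceLockSketch {
    static class StateModelStub {
        // Dedicated monitor, unique to this instance.
        private final Object _controllerLock = new Object();
        private Optional<Object> _controllerOpt = Optional.empty();

        void becomeLeaderFromStandby(Object controller) {
            synchronized (_controllerLock) {
                _controllerOpt = Optional.of(controller);
            }
        }
    }

    public static void main(String[] args) {
        StateModelStub a = new StateModelStub();
        StateModelStub b = new StateModelStub();

        // Each instance now owns its own monitor, so state transitions for
        // different (resource, partition) pairs no longer contend on one lock.
        System.out.println(a._controllerLock == b._controllerLock); // prints: false
    }
}
```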
Additional context
Thread dump: