_controllerOpt getting shared across resources #3050

@LZD-PratyushBhatt

Describe the bug

We were debugging an issue where multiple HelixTaskExecutor-message_handle_thread threads were contending for the lock on _controllerOpt.

In the code, createNewStateModel() creates a new DistClusterControllerStateModel instance for each (resourceName, partitionKey) pair.

At first we thought all of those threads were handling the same (resourceName, partitionKey) pair but different state transitions (possibly stale ones). We ruled this out when we saw that the threads were processing state transitions for different clusters (i.e. different resources of the super cluster / CONTROLLER_CLUSTER).
Example log lines for threads 0, 35 and 37:

2025/06/25 09:37:13.340 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_0] [helix] [] handling message: 9591d4c7-b08a-466e-bf21-2cfa95d94896 transit gobblin-ddm-kfketl2-test-ltx1-holdem-test-mho.gobblin-ddm-kfketl2-test-ltx1-holdem-test-mho|[] from:OFFLINE to:DROPPED, relayedFrom: null
2025/06/25 09:37:13.342 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_0] [helix] [] Instance ltx1-app0932.xyz.com_12923, partition gobblin-ddm-kfketl2-test-ltx1-holdem-test-mho received state transition from OFFLINE to DROPPED on session 1100800c49e63bb4, message id: 9591d4c7-b08a-466e-bf21-2cfa95d94896
2025/06/25 09:36:51.119 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_35] [helix] [] handling message: 0fae1c0d-129a-47c2-a8fd-2c2d00f0d7ca transit gobblin-kafka-streaming-service-ltx1-holdem-test-ppc.gobblin-kafka-streaming-service-ltx1-holdem-test-ppc|[] from:STANDBY to:LEADER, relayedFrom: null
2025/06/25 09:36:51.120 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_35] [helix] [] Instance ltx1-app0932.xyz.com_12923, partition gobblin-kafka-streaming-service-ltx1-holdem-test-ppc received state transition from STANDBY to LEADER on session 1100800c49e63bb4, message id: 0fae1c0d-129a-47c2-a8fd-2c2d00f0d7ca
2025/06/25 09:36:51.120 INFO [DistClusterControllerStateModel] [HelixTaskExecutor-message_handle_thread_35] [helix] [] ltx1-app0932.xyz.com_12923 becoming leader from standby for gobblin-kafka-streaming-service-ltx1-holdem-test-ppc
2025/06/25 09:36:51.408 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_37] [helix] [] handling message: 03c3a0e9-40e9-4169-8ec5-4c25d3c0f290 transit gobblin-kafka-streaming-tracking-ltx1-holdem-medvol-localConsumption.gobblin-kafka-streaming-tracking-ltx1-holdem-medvol-localConsumption|[] from:STANDBY to:LEADER, relayedFrom: null
2025/06/25 09:36:51.409 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_37] [helix] [] Instance ltx1-app0932.xyz.com_12923, partition gobblin-kafka-streaming-tracking-ltx1-holdem-medvol-localConsumption received state transition from STANDBY to LEADER on session 1100800c49e63bb4, message id: 03c3a0e9-40e9-4169-8ec5-4c25d3c0f290
2025/06/25 09:36:51.409 INFO [DistClusterControllerStateModel] [HelixTaskExecutor-message_handle_thread_37] [helix] [] ltx1-app0932.xyz.com_12923 becoming leader from standby for gobblin-kafka-streaming-tracking-ltx1-holdem-medvol-localConsumption

It turns out there is a bug in DistClusterControllerStateModel: it initializes _controllerOpt to Optional.empty().
Even though we create a separate DistClusterControllerStateModel object per pair, Optional.empty() hands back the same shared instance, so _controllerOpt is the same reference across all of those objects (see the Javadoc for Optional.empty()). Synchronizing on it therefore takes one lock for every state model.
We simulated this in a test and verified the behavior.
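
To make the lock sharing concrete, here is a minimal sketch that does not use Helix at all. The `Model` class and its `controllerOpt` field are hypothetical stand-ins for the state model described above, not the real Helix code. It relies on the fact that OpenJDK's `Optional.empty()` returns a shared static instance; the Javadoc explicitly does not guarantee this, so treat it as observed behavior rather than a contract.

```java
import java.util.Optional;

// Hypothetical stand-in; NOT the real DistClusterControllerStateModel.
class SharedEmptyLockDemo {
    static class Model {
        // Initialized the way the issue describes.
        Optional<Object> controllerOpt = Optional.empty();
    }

    // Returns true if holding the monitor of a's field also means
    // holding the monitor of b's field, i.e. both are the same object.
    @SuppressWarnings("synchronization")
    static boolean locksAreShared(Model a, Model b) {
        synchronized (a.controllerOpt) {
            return Thread.holdsLock(b.controllerOpt);
        }
    }

    public static void main(String[] args) {
        // Two distinct model objects, as created per (resource, partition).
        System.out.println(locksAreShared(new Model(), new Model()));
    }
}
```

On an OpenJDK runtime this reports that the two distinct models share a monitor, which would explain why threads handling unrelated resources block each other.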

To Reproduce

Create two DistClusterControllerStateModel objects and compare their _controllerOpt fields by reference (==); the comparison returns true.
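
The comparison can be sketched without a Helix dependency as follows. `StateModelStub` is a hypothetical class that only mimics the field initialization reported in this issue; against the real code the same check would be done on two DistClusterControllerStateModel instances.

```java
import java.util.Optional;

class ControllerOptIdentityCheck {
    // Hypothetical stub mimicking the field initialization from the issue.
    static class StateModelStub {
        final Optional<String> controllerOpt = Optional.empty();
    }

    // Reference comparison, not equals(): we care about lock identity.
    static boolean sameReference(StateModelStub a, StateModelStub b) {
        return a.controllerOpt == b.controllerOpt;
    }

    public static void main(String[] args) {
        StateModelStub first = new StateModelStub();
        StateModelStub second = new StateModelStub();
        System.out.println("distinct models: " + (first != second));
        System.out.println("same controllerOpt reference: " + sameReference(first, second));
    }
}
```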

Expected behavior

The lock should be at per (resource, partition) pair level and not across them.
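
One possible way to get per-instance locking, shown only as a sketch: synchronize on a dedicated lock object owned by each state model instead of on the Optional field. The class, field, and method names below are hypothetical and this is not a proposed patch for Helix itself.

```java
import java.util.Optional;

class PerInstanceLockSketch {
    // Hypothetical model; one instance per (resource, partition) pair.
    static class Model {
        // A fresh object per instance, so the monitor is never shared.
        private final Object lock = new Object();
        private Optional<String> controllerOpt = Optional.empty();

        // Assumed transition handler: sets the controller once under the lock.
        void becomeLeader(String controllerName) {
            synchronized (lock) {
                if (!controllerOpt.isPresent()) {
                    controllerOpt = Optional.of(controllerName);
                }
            }
        }

        Optional<String> controller() {
            synchronized (lock) {
                return controllerOpt;
            }
        }
    }

    // Returns true if holding a's lock does NOT imply holding b's lock.
    static boolean locksAreIndependent(Model a, Model b) {
        synchronized (a.lock) {
            return !Thread.holdsLock(b.lock);
        }
    }

    public static void main(String[] args) {
        Model a = new Model();
        Model b = new Model();
        a.becomeLeader("controller-a");
        System.out.println("independent locks: " + locksAreIndependent(a, b));
        System.out.println("b unaffected: " + !b.controller().isPresent());
    }
}
```

A separate lock object also avoids synchronizing on `Optional`, which is a value-based class and which newer JDKs warn against using as a monitor.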

Additional context

Thread dump:

threadDumpST.txt


Labels: bug