Description
Describe the bug
Helix participants can enter a "zombie" state where they appear healthy in ZooKeeper but are functionally disconnected from cluster operations, unable to process state transition messages. This occurs due to a critical flaw in the legacy compatibility wrapper IZkStateListenerI0ItecImpl that completely ignores session ID parameters when processing ZooKeeper session events.
When the single-threaded ZkEventThread becomes blocked (e.g., during long-running FINALIZE operations), multiple SyncConnected events accumulate in the event queue. Once processing resumes, these events are handled in FIFO order, but the legacy wrapper discards the original session ID from each queued event and falls back to using the current active session ID via getSessionId(). This causes session events originally queued for older sessions to be processed using the context of the most recent session, violating the intended event processing timeline.
The result is catastrophic: the first misprocessed event successfully creates a LiveInstance for the current session, but all subsequent events in the backlog attempt to create LiveInstances for the same current session and fail with "already has a live-instance" exceptions. Each failed attempt partially resets the participant's message handlers (setting _ready = false) but never completes the re-initialization process, leaving the participant in a broken state where it cannot process any state transition messages despite appearing active to the Helix controller.
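The failure sequence described above can be illustrated with a small, self-contained simulation. This is only a sketch under simplifying assumptions, not Helix code: `SessionBacklogSketch`, `Participant`, `drain`, and the `useQueuedId` flag are hypothetical names invented for illustration, and the `ready` field merely stands in for the `_ready` flag mentioned above. It compares draining a backlog of queued session events using the current session ID (the legacy wrapper behavior) with draining it using each event's original session ID.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class SessionBacklogSketch {

    /** Hypothetical stand-in for a participant; a simplified model, not Helix code. */
    static class Participant {
        final Set<String> liveInstances = new HashSet<>();
        final List<String> log = new ArrayList<>();
        boolean ready = true;

        void handleNewSession(String sessionId) {
            ready = false; // message handlers are reset first
            if (!liveInstances.add(sessionId)) {
                // Models the "already has a live-instance" failure:
                // re-initialization is abandoned and ready stays false.
                log.add("FAIL " + sessionId);
                return;
            }
            ready = true; // re-initialization completed
            log.add("OK " + sessionId);
        }
    }

    /**
     * Drains queued session events in FIFO order.
     * useQueuedId = false models the legacy wrapper (assumption: it always
     * substitutes the current session ID); true passes the original ID through.
     */
    static Participant drain(List<String> queuedSessionIds,
                             String currentSessionId,
                             boolean useQueuedId) {
        Participant participant = new Participant();
        Queue<String> queue = new ArrayDeque<>(queuedSessionIds);
        while (!queue.isEmpty()) {
            String queuedId = queue.poll();
            participant.handleNewSession(useQueuedId ? queuedId : currentSessionId);
        }
        return participant;
    }

    public static void main(String[] args) {
        List<String> backlog = Arrays.asList("s1", "s2", "s3");

        Participant legacy = drain(backlog, "s3", false);
        System.out.println(legacy.log + " ready=" + legacy.ready);

        Participant sessionAware = drain(backlog, "s3", true);
        System.out.println(sessionAware.log + " ready=" + sessionAware.ready);
    }
}
```

In this model, the legacy path succeeds once and then fails for every remaining queued event, leaving `ready` set to false, while the session-aware path never collides on the same session ID. Real session handling involves more steps than this model captures, so treat it as an illustration of the ordering problem only.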
To Reproduce
Create a DedicatedZkClient and pass any fake session ID to handleNewSession. The wrapper discards that session ID, uses the session ID of the currently active ZooKeeper connection instead, and does all of its processing based on that. Then trigger another handleNewSession event with a fake session ID: this time it fails to create the LiveInstance, and the handlers are left in the reset state.
Expected behavior
Session events should be processed with their original session context regardless of timing delays. The handleNewSession() method should receive and use the correct session ID that was active when the event was originally queued, ensuring proper FIFO event processing and maintaining participant message handling capabilities.