
[server] fix stop replica deletion stuck when TabletServer is offline #3391

Open
gyang94 wants to merge 4 commits into apache:main from gyang94:per-sender-retry

Conversation

@gyang94 (Contributor) commented May 27, 2026

Purpose

Linked issue: close #3357

Brief change log

Summary

When a stopReplica RPC fails due to transient network issues or a TabletServer crash, the Coordinator has no reliable retry mechanism. This causes replicas to get stuck and table deletion to never complete, resulting in the tableCount metric never decreasing.

This PR introduces a per-TabletServer sender thread model (aligned with Kafka's ControllerChannelManager / RequestSendThread) and a new ReplicaDeletionIneligible state. These changes provide robust retry and pause/resume semantics for replica deletion.

Changes

Core: Per-TS Sender Thread (ControlRequestSendThread)

  • Dedicated Sender Thread: Each TabletServer gets a dedicated sender thread with a FIFO queue.
  • New Replica State: ReplicaDeletionIneligible marks replicas whose deletion cannot proceed (e.g., the TS is offline or returned a business error).
  • Resume Logic: TableManager.resumeDeletions() runs a three-step pass:
    1. Complete the table deletion if all of its replicas were deleted successfully.
    2. Retry previously-ineligible replicas whose TS is alive again.
    3. Re-fire deletion for eligible tables.
  • Auto-Resume on Reconnect: processNewTabletServer() clears ineligible marks and triggers resumeDeletions(), so paused deletions automatically resume when a TS reconnects.
  • Handle Dead TS: processDeadTabletServer() transitions in-flight deletion replicas to ineligible.
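
The three-step resume pass listed above might look roughly like the following for a single table. This is a minimal sketch, not the PR's code: the `State`, `Outcome`, and `Replica` types and the `resume` signature are hypothetical, and treating step 3 as "re-fire once no replica remains ineligible" is an assumption about what "eligible" means here.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a three-step resume pass for one table. */
final class ResumeDeletionsSketch {

    /** Simplified deletion states; INELIGIBLE stands in for ReplicaDeletionIneligible. */
    enum State { STARTED, SUCCESSFUL, INELIGIBLE }

    /** What the pass decided for the table (illustrative names). */
    enum Outcome { COMPLETED, REFIRED, PAUSED }

    static final class Replica {
        final int serverId; // TabletServer hosting this replica
        State state;

        Replica(int serverId, State state) {
            this.serverId = serverId;
            this.state = state;
        }
    }

    static Outcome resume(List<Replica> replicas, Set<Integer> aliveServers) {
        // Step 1: the table deletion is complete once every replica succeeded.
        boolean allSucceeded = true;
        for (Replica r : replicas) {
            if (r.state != State.SUCCESSFUL) {
                allSucceeded = false;
            }
        }
        if (allSucceeded) {
            return Outcome.COMPLETED;
        }

        // Step 2: ineligible replicas on alive servers go back to STARTED for a retry.
        boolean stillIneligible = false;
        for (Replica r : replicas) {
            if (r.state == State.INELIGIBLE) {
                if (aliveServers.contains(r.serverId)) {
                    r.state = State.STARTED;
                } else {
                    stillIneligible = true;
                }
            }
        }

        // Step 3 (assumed meaning): re-fire only if nothing is still blocked,
        // otherwise leave the table paused until the missing server returns.
        return stillIneligible ? Outcome.PAUSED : Outcome.REFIRED;
    }
}
```

In this sketch a table with one replica on an offline TS stays `PAUSED`, flips to `REFIRED` on the first pass after that TS is alive again, and reaches `COMPLETED` once the retried replica reports success.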

Config

  • coordinator.request.retry.backoff: Backoff between retries (default: 100ms).
  • coordinator.request.timeout: RPC timeout per attempt (default: 30s).
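
To illustrate how a per-TS sender thread could use the retry backoff, here is a minimal sketch in the style of Kafka's RequestSendThread. It is not the PR's ControlRequestSendThread: the class name, the `Predicate<String>` stand-in for the RPC, and the "retry the head of the queue until it succeeds or the thread shuts down" policy are all assumptions. The per-attempt timeout (`coordinator.request.timeout`) is assumed to be enforced inside the RPC stand-in and is not modeled.

```java
import java.time.Duration;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Predicate;

/** Hypothetical per-TabletServer sender: one thread, one FIFO queue. */
final class SenderThreadSketch extends Thread {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Predicate<String> rpc;   // stand-in for the real RPC; true means success
    private final Duration retryBackoff;   // e.g. coordinator.request.retry.backoff
    private volatile boolean running = true;

    SenderThreadSketch(int serverId, Predicate<String> rpc, Duration retryBackoff) {
        super("control-request-sender-" + serverId);
        this.rpc = rpc;
        this.retryBackoff = retryBackoff;
        setDaemon(true);
    }

    /** Callers enqueue instead of issuing the RPC directly. */
    void enqueue(String request) {
        queue.add(request);
    }

    void shutdown() {
        running = false;
        interrupt();
    }

    @Override
    public void run() {
        while (running) {
            String request;
            try {
                request = queue.take(); // blocks until a request is available
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            sendWithRetry(request);
        }
    }

    /**
     * Retries one request until it succeeds, sleeping for the backoff between
     * attempts. Returns the number of attempts, or -1 if stopped first.
     * Later requests wait behind it, which preserves FIFO order per server.
     */
    int sendWithRetry(String request) {
        int attempts = 0;
        while (running) {
            attempts++;
            if (rpc.test(request)) {
                return attempts;
            }
            try {
                Thread.sleep(retryBackoff.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }
}
```

Because each TS has its own queue and thread in this model, a slow or unreachable TS only delays its own requests and does not block requests bound for other servers.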

What was removed

  • retryDeleteAndSuccessDeleteReplicas(): The old "retry-N-then-force-success" mechanism.
  • failDeleteNumbers tracking and DELETE_TRY_TIMES constant.
  • Direct RPC calls from CoordinatorRequestBatch (replaced by queue-based dispatch).

Tests

API and Format

Documentation

@swuferhong (Contributor) left a comment

Hi @gyang94, thanks for your contribution. It's an important feature. I left some comments:

Comment thread fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java Outdated
Comment thread fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java Outdated
Comment thread fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java Outdated
@gyang94 force-pushed the per-sender-retry branch from caaaebb to a490b4c on June 2, 2026 10:35
@gyang94 force-pushed the per-sender-retry branch from a490b4c to 4c9372f on June 3, 2026 02:51

Development

Successfully merging this pull request may close these issues.

[server] Table deletion stuck permanently when StopReplica request fails

2 participants