[server] Add Cluster Health API implementation #3400

Open
swuferhong wants to merge 2 commits into apache:main from swuferhong:fluss-server-recovery

@swuferhong
Contributor

Purpose

Linked issue: close #3399

  • Add GetClusterHealth RPC to Coordinator that computes cluster health from in-memory state
  • Track inactive leaders in CoordinatorContext (marked inactive on NotifyLeaderAndIsr send,
    marked active on successful response when responding server is still the leader)
  • Handle send failures in CoordinatorRequestBatch by synthesizing error responses to clear
    pending inactive state
  • Add client API Admin.getClusterHealth() with ClusterHealth / ClusterHealthStatus types
  • Add ClusterHealthReadinessCheck CLI tool in fluss-dist (exit 0=GREEN, 1=not ready, 2=API unsupported)
  • Add readiness-check.sh two-step readiness probe script (TCP + Cluster Health API)
    with first-boot detection and grace period for API-unsupported (mixed-version rolling upgrade)
  • Wire tablet-server readiness probe to readiness-check.sh in Helm chart
  • Add documentation for Helm deployment and upgrade guide
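The exit-code contract listed above for the readiness CLI (0=GREEN, 1=not ready, 2=API unsupported) can be sketched roughly as follows. This is an illustration only, not the PR's actual `ClusterHealthReadinessCheck`: the class `ReadinessExitCodes`, the method `exitCodeFor`, and the use of an empty `Optional` to model "API unsupported" are all assumptions. Only the GREEN and RED status names appear in this conversation, so the real `ClusterHealthStatus` may define more values.

```java
import java.util.Optional;

/** Hypothetical sketch of the exit-code mapping; not the PR's real class. */
final class ReadinessExitCodes {
    /** Assumed subset of statuses: only GREEN and RED are named in this PR thread. */
    enum Status { GREEN, RED }

    static final int READY = 0;            // cluster reported GREEN
    static final int NOT_READY = 1;        // any other reported status
    static final int API_UNSUPPORTED = 2;  // coordinator does not know the RPC

    /**
     * Maps a health lookup result to a process exit code.
     * An empty Optional is this sketch's stand-in for "API unsupported".
     */
    static int exitCodeFor(Optional<Status> status) {
        if (!status.isPresent()) {
            return API_UNSUPPORTED;
        }
        return status.get() == Status.GREEN ? READY : NOT_READY;
    }
}
```

Keeping "API unsupported" distinct from "not ready" is what would let a probe script apply a grace period during a mixed-version rolling upgrade instead of failing outright.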

Brief change log

Tests

API and Format

Documentation

@swuferhong force-pushed the fluss-server-recovery branch from fe0c5c1 to 042fc7a on May 29, 2026 03:01
Contributor

@loserwang1024 left a comment

I have left some comments.

Comment thread fluss-server/src/main/java/org/apache/fluss/server/RpcServiceBase.java Outdated
// re-election via onReplicaBecomeOffline.
List<NotifyLeaderAndIsrResultForBucket> failedResults =
new ArrayList<>();
ApiError sendError =
Contributor

After this modification, even a network timeout exception will cause CoordinatorEventProcessor#processNotifyLeaderAndIsrResponseReceivedEvent to mark the server as offlineReplicas. Won't that be a problem?

Contributor Author

Good catch. I have reverted it: the original "just ignore" approach was intentional (the coordinator detects actual server death via heartbeat timeout). The only thing I kept is clearing pendingLeaderActivationBuckets on send failure, so the health API doesn't report a stale RED for buckets that are simply stuck in the pending set. Actual failover is still handled by the existing heartbeat/offline mechanism.
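A minimal sketch of the bookkeeping described in this reply and in the PR description: a bucket becomes pending when NotifyLeaderAndIsr is sent, is activated on a successful response only if the responder is still the expected leader, and is cleared on send failure without triggering failover. The class `PendingLeaderTracker`, its method names, and the string bucket keys are hypothetical. The rule "RED whenever anything is pending" is an assumption made for illustration; the PR's real health computation may weigh other state.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of pending-leader tracking; not the PR's CoordinatorContext. */
final class PendingLeaderTracker {
    enum Status { GREEN, RED }

    // bucket id -> server id that was asked to become leader
    private final Map<String, Integer> pending = new HashMap<>();

    /** Called when a NotifyLeaderAndIsr request is sent for the bucket. */
    void onNotifySent(String bucket, int leaderServer) {
        pending.put(bucket, leaderServer);
    }

    /** Successful response: activate only if the responder is still the expected leader. */
    void onNotifySuccess(String bucket, int respondingServer) {
        // Map.remove(key, value) removes the entry only when the value matches.
        pending.remove(bucket, respondingServer);
    }

    /** Send failure: drop the pending entry; failover stays with the heartbeat path. */
    void onSendFailure(String bucket) {
        pending.remove(bucket);
    }

    /** Assumed rule for this sketch: any pending bucket makes the cluster RED. */
    Status health() {
        return pending.isEmpty() ? Status.GREEN : Status.RED;
    }
}
```

In this sketch a response from a server that is no longer the expected leader leaves the entry pending, while a send failure removes it, so a bucket cannot stay RED merely because a request never went out.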

echo "advertised.listeners: ${ADVERTISED_LISTENERS}" >> $FLUSS_HOME/conf/server.yaml && \

- bin/coordinator-server.sh start-foreground
+ exec bin/coordinator-server.sh start-foreground
Contributor

Why does this need to change?

Contributor Author

Without exec, the Fluss process is not PID 1 in the pod, so graceful shutdown signals won't reach it. I fixed this as well.

@swuferhong force-pushed the fluss-server-recovery branch from 64f2544 to c23ee95 on June 3, 2026 04:15
@swuferhong
Contributor Author

@loserwang1024 comments addressed. PTAL.

Contributor

@loserwang1024 left a comment

LGTM

Development

Successfully merging this pull request may close these issues.

[server] Support Cluster Health API for safe rolling upgrades

2 participants