[server] Add Cluster Health API implementation by swuferhong · Pull Request #3400 · apache/fluss

swuferhong · 2026-05-29T02:03:01Z

Purpose

Linked issue: close #3399

- Add GetClusterHealth RPC to Coordinator that computes cluster health from in-memory state
- Track inactive leaders in CoordinatorContext (marked inactive on NotifyLeaderAndIsr send, marked active on successful response when responding server is still the leader)
- Handle send failures in CoordinatorRequestBatch by synthesizing error responses to clear pending inactive state
- Add client API Admin.getClusterHealth() with ClusterHealth / ClusterHealthStatus types
- Add ClusterHealthReadinessCheck CLI tool in fluss-dist (exit 0=GREEN, 1=not ready, 2=API unsupported)
- Add readiness-check.sh two-step readiness probe script (TCP + Cluster Health API) with first-boot detection and grace period for API-unsupported (mixed-version rolling upgrade)
- Wire tablet-server readiness probe to readiness-check.sh in Helm chart
- Add documentation for Helm deployment and upgrade guide
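
The exit-code contract listed for ClusterHealthReadinessCheck (0=GREEN, 1=not ready, 2=API unsupported) can be illustrated with a minimal sketch. This is not the PR's code: `HealthStatus`, `ReadinessExitCodes`, and `exitCodeFor` are hypothetical names, the enum carries only the two statuses named in this thread, and treating a failed call as "not ready" is an assumption.

```java
import java.util.Optional;

/** Hypothetical stand-in for ClusterHealthStatus; only GREEN and RED are named in this thread. */
enum HealthStatus { GREEN, RED }

/** Hypothetical helper that mirrors the exit codes listed in the PR description. */
final class ReadinessExitCodes {
    static final int READY = 0;            // cluster reported GREEN
    static final int NOT_READY = 1;        // any other outcome of a supported call
    static final int API_UNSUPPORTED = 2;  // server does not know the health RPC

    private ReadinessExitCodes() {}

    /**
     * @param apiSupported false when the server rejects the RPC as unknown
     * @param status       the reported status; empty if the call failed (assumed to mean "not ready")
     */
    static int exitCodeFor(boolean apiSupported, Optional<HealthStatus> status) {
        if (!apiSupported) {
            return API_UNSUPPORTED;
        }
        boolean green = status.filter(s -> s == HealthStatus.GREEN).isPresent();
        return green ? READY : NOT_READY;
    }
}
```

Keeping "API unsupported" as its own code lets a caller such as readiness-check.sh tell an older server apart from an unhealthy one, which is what the mixed-version grace period relies on.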

Brief change log

Tests

API and Format

Documentation

loserwang1024

I have left some comments.

loserwang1024 · 2026-06-02T06:04:14Z

+                            // re-election via onReplicaBecomeOffline.
+                            List<NotifyLeaderAndIsrResultForBucket> failedResults =
+                                    new ArrayList<>();
+                            ApiError sendError =


With this change, even a network timeout exception will cause CoordinatorEventProcessor#processNotifyLeaderAndIsrResponseReceivedEvent to mark the server's replicas as offlineReplicas.

Won't that be a problem?

Good catch. I have reverted it — the original "just ignore" approach was intentional (the coordinator detects actual server death via heartbeat timeout). The only thing I kept is clearing the pendingLeaderActivationBuckets on send failure, so the health API doesn't report stale RED for buckets that are simply stuck in the pending set. Actual failover is still handled by the existing heartbeat/offline mechanism.
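
The behavior described in this reply can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: `PendingLeaderActivationTracker` and its method names are hypothetical, buckets are identified by plain strings, and servers by integer ids. The sketch only shows that a send failure clears the pending entry and takes no failover action.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical tracker for buckets whose new leader has not yet confirmed NotifyLeaderAndIsr. */
final class PendingLeaderActivationTracker {
    private final Set<String> pendingBuckets = new HashSet<>();

    /** A NotifyLeaderAndIsr request was sent: the bucket's leader counts as inactive until confirmed. */
    void onNotifySent(String bucket) {
        pendingBuckets.add(bucket);
    }

    /** A successful response only activates the bucket if the responder is still its leader. */
    void onNotifySucceeded(String bucket, int respondingServerId, int currentLeaderServerId) {
        if (respondingServerId == currentLeaderServerId) {
            pendingBuckets.remove(bucket);
        }
    }

    /**
     * The send itself failed (for example, a network timeout). Only the pending entry is
     * cleared; no replica is marked offline here, since failover is left to heartbeat timeout.
     */
    void onSendFailure(String bucket) {
        pendingBuckets.remove(bucket);
    }

    /** True while at least one bucket is still waiting for its leader to confirm. */
    boolean hasInactiveLeaders() {
        return !pendingBuckets.isEmpty();
    }
}
```

With this shape, a bucket whose request never left the coordinator does not keep the health result stuck, while a response from a server that is no longer the leader leaves the bucket pending.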

loserwang1024 · 2026-06-02T06:07:44Z

              echo "advertised.listeners: ${ADVERTISED_LISTENERS}" >> $FLUSS_HOME/conf/server.yaml && \

-              bin/coordinator-server.sh start-foreground
+              exec bin/coordinator-server.sh start-foreground


Why does this need to change?

Without exec, the Fluss process is not PID 1 in the pod, so graceful shutdown signals won't reach it. I fixed this as well.

swuferhong · 2026-06-03T07:05:00Z

@loserwang1024 Comments addressed. PTAL.

loserwang1024

LGTM

swuferhong added the priority=blocker label May 29, 2026

[server] Add Cluster Health API implementation

042fc7a

swuferhong force-pushed the fluss-server-recovery branch from fe0c5c1 to 042fc7a Compare May 29, 2026 03:01

loserwang1024 reviewed Jun 2, 2026

address hongshun's comments

c23ee95

swuferhong force-pushed the fluss-server-recovery branch from 64f2544 to c23ee95 Compare June 3, 2026 04:15

loserwang1024 approved these changes Jun 3, 2026

swuferhong wants to merge 2 commits into apache:main from swuferhong:fluss-server-recovery

2 participants
