feat: optimize fuse client hang issue during multi-master failover #379

jlon · 2025-11-06T02:25:21Z

Fix FUSE client hang during master failover by optimizing RPC timeout and retry logic.

Changes:

Reduce RPC timeout: 120s → 10s
Reduce retry duration: 300s → 40s
Add concurrent RPC on NotLeaderMaster error
Implement immediate node switching (no retry delay within round)
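
As a rough illustration of "immediate node switching", the sketch below tries every master in a round with no delay between nodes and sleeps only between rounds, all within a total retry budget. It is not the code from this PR: `RpcError`, `RetryPolicy`, `call_with_failover`, and the `round_sleep` field are hypothetical names, and the concurrent RPC on NotLeaderMaster is left out to keep the example sequential.

```rust
use std::time::{Duration, Instant};

/// Hypothetical error type for a single RPC attempt against one master.
#[derive(Debug, Clone, PartialEq)]
pub enum RpcError {
    NotLeader,
    Timeout,
}

/// Hypothetical retry settings (simplified; not the real client config).
pub struct RetryPolicy {
    /// Total time budget across all rounds.
    pub max_duration: Duration,
    /// Pause between rounds; there is no pause between nodes in a round.
    pub round_sleep: Duration,
}

/// Tries each master in order. On failure it moves to the next node
/// immediately and sleeps only after a full round has failed.
/// Returns the value together with the number of attempts made.
pub fn call_with_failover<T, F>(
    masters: &[&str],
    policy: &RetryPolicy,
    mut rpc: F,
) -> Result<(T, usize), RpcError>
where
    F: FnMut(&str) -> Result<T, RpcError>,
{
    let start = Instant::now();
    let mut attempts = 0;
    let mut last_err = RpcError::Timeout;
    loop {
        for &addr in masters {
            attempts += 1;
            match rpc(addr) {
                Ok(value) => return Ok((value, attempts)),
                // Remember the error and switch to the next node right away.
                Err(e) => last_err = e,
            }
        }
        // Stop if another round would not fit in the remaining budget.
        if start.elapsed() + policy.round_sleep > policy.max_duration {
            return Err(last_err);
        }
        std::thread::sleep(policy.round_sleep);
    }
}
```

With three masters where only the third is the leader, the call succeeds on the third attempt of the first round without sleeping.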

Result: Recovery time reduced from 250-360s to 12-40s (12-23x faster).

bigbigxu · 2025-11-10T12:23:15Z

curvine-common/src/conf/client_conf.rs

-            rpc_retry_max_duration_ms: 5 * 60 * 1000,
-            rpc_retry_min_sleep_ms: 300,
-            rpc_retry_max_sleep_ms: 30 * 1000,
+            rpc_retry_max_duration_ms: 40 * 1000, // 40s: optimized for fast master failover


The new configuration values are risky.
Under network fluctuations, RPC timeouts this short can lead to the following:

RPC requests fail, and workers quickly lose their jobs.

Normal business requests fail.

Frequent job selection/re-selection issues occur.
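
To make the trade-off concrete, here is a rough worst-case model of how many RPC attempts fit in the retry budget when every attempt times out. It assumes exponential backoff that doubles from `rpc_retry_min_sleep_ms` up to `rpc_retry_max_sleep_ms`. Both the backoff shape and the function name `worst_case_attempts` are assumptions for illustration, and the sleep values are the ones removed in the diff, since the new ones are not shown.

```rust
/// Counts attempts that start within `budget_ms` when every attempt runs
/// for the full `timeout_ms` and is followed by a backoff sleep that
/// doubles from `min_sleep_ms` up to `max_sleep_ms` (assumed behavior).
pub fn worst_case_attempts(
    timeout_ms: u64,
    budget_ms: u64,
    min_sleep_ms: u64,
    max_sleep_ms: u64,
) -> u32 {
    let mut elapsed = 0u64;
    let mut sleep = min_sleep_ms;
    let mut attempts = 0u32;
    while elapsed < budget_ms {
        attempts += 1;
        // The attempt runs until it times out.
        elapsed += timeout_ms;
        if elapsed >= budget_ms {
            break;
        }
        // Back off before the next attempt, doubling up to the cap.
        elapsed += sleep;
        sleep = (sleep * 2).min(max_sleep_ms);
    }
    attempts
}
```

Under this model the old settings (120s timeout, 300s budget) allow 3 attempts and the new ones (10s timeout, 40s budget) allow 4. The attempt count is similar, but the outage the client can ride out shrinks from roughly 300s to roughly 40s, which is the window the concern above is about.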

jlon force-pushed the optimize/master-failover-fast-recovery branch 2 times, most recently from 97d5ac9 to b8ff8f1 Compare November 7, 2025 00:49

jlon changed the title ~~feat: optimize fuse client hang issue during multi-master failover by optimizing rpc timeout and retry logic.~~ feat: optimize fuse client hang issue during multi-master failover Nov 7, 2025

bigbigxu reviewed Nov 10, 2025

jlon closed this Nov 15, 2025

jlon reopened this Nov 27, 2025

jlon force-pushed the optimize/master-failover-fast-recovery branch 3 times, most recently from 7b518f5 to 859aa28 Compare November 28, 2025 03:06

feat: optimize master failover recovery time

19a069f

jlon force-pushed the optimize/master-failover-fast-recovery branch from 859aa28 to 19a069f Compare November 28, 2025 03:35
