@jlon jlon commented Nov 6, 2025

Fix FUSE client hang during master failover by optimizing RPC timeout and retry logic.

Changes:

  • Reduce RPC timeout: 120s → 10s
  • Reduce retry duration: 300s → 40s
  • Add concurrent RPC on NotLeaderMaster error
  • Implement immediate node switching (no retry delay within round)
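
The retry behavior described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `RpcError`, `RetryPolicy`, and `call_with_failover` are hypothetical names, and the policy fields only mirror the `rpc_retry_*` settings. Within a round every master is tried back to back with no delay; the backoff sleep happens only between rounds, and the whole loop is bounded by the retry budget. The concurrent RPC on NotLeaderMaster is not shown.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical error type; `NotLeader` stands in for the NotLeaderMaster error.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RpcError {
    NotLeader,
    Timeout,
}

/// Hypothetical policy whose fields mirror the rpc_retry_* settings.
pub struct RetryPolicy {
    pub max_duration: Duration, // total retry budget
    pub min_sleep: Duration,    // first sleep between rounds
    pub max_sleep: Duration,    // cap for the between-round sleep
}

/// Tries every node in order without sleeping inside a round, then backs off
/// between rounds until the retry budget would be exceeded.
pub fn call_with_failover<T>(
    nodes: &[&str],
    policy: &RetryPolicy,
    mut call: impl FnMut(&str) -> Result<T, RpcError>,
) -> Result<T, RpcError> {
    let start = Instant::now();
    let mut sleep = policy.min_sleep;
    let mut last_err = RpcError::Timeout;

    loop {
        // One round: switch to the next node immediately on any error.
        for &node in nodes {
            match call(node) {
                Ok(value) => return Ok(value),
                Err(err) => last_err = err,
            }
        }

        // Give up if the next sleep would push us past the budget.
        if start.elapsed() + sleep > policy.max_duration {
            return Err(last_err);
        }

        thread::sleep(sleep);
        // Exponential backoff between rounds, capped at max_sleep.
        sleep = (sleep * 2).min(policy.max_sleep);
    }
}
```

With three masters where only the third is the leader, a call under this sketch reaches the leader on the third attempt of the first round without sleeping. The worst-case wait is roughly the retry budget plus one per-call timeout, since the budget is only checked between rounds.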

Result: Recovery time reduced from 250-360s to 12-40s (12-23x faster).

@jlon jlon force-pushed the optimize/master-failover-fast-recovery branch 2 times, most recently from 97d5ac9 to b8ff8f1 on November 7, 2025 00:49
@jlon jlon changed the title from "feat: optimize fuse client hang issue during multi-master failover by optimizing rpc timeout and retry logic." to "feat: optimize fuse client hang issue during multi-master failover" on Nov 7, 2025
rpc_retry_max_duration_ms: 5 * 60 * 1000,
rpc_retry_min_sleep_ms: 300,
rpc_retry_max_sleep_ms: 30 * 1000,
rpc_retry_max_duration_ms: 40 * 1000, // 40s: optimized for fast master failover

The modified configuration values are very unsafe.
Under network fluctuations, such short RPC timeouts can lead to the following:

  1. RPC requests fail, and workers quickly lose their jobs.
  2. Normal business requests fail.
  3. Frequent job selection issues occur.

@jlon jlon closed this Nov 15, 2025
@jlon jlon reopened this Nov 27, 2025
@jlon jlon force-pushed the optimize/master-failover-fast-recovery branch 3 times, most recently from 7b518f5 to 859aa28 on November 28, 2025 03:06
@jlon jlon force-pushed the optimize/master-failover-fast-recovery branch from 859aa28 to 19a069f on November 28, 2025 03:35