Force client disconnects when node is unhealthy #13

Open · wants to merge 13 commits into base: master-2.2

Conversation

pmantica11

Clients are prevented from connecting to an unhealthy gRPC instance, but they are not disconnected from a lagging one. In this PR we disconnect clients so that they are forced to reconnect to a healthy gRPC instance.

solana-sdk = "~2.1.2"
solana-transaction-status = "~2.1.2"
solana-client = "~2.1.2"
solana-rpc-client-api = "~2.1.2"
pmantica11 (Author)

Using the master git versions was giving me package conflict errors, so I will use the latest release version instead. I tested the code with the Solana test validator and it worked, so we should be good.


This is risky; we should keep using the master versions, since Geyser usually has breaking changes between minor versions.

@@ -5,13 +5,13 @@
},
"grpc": {
"address": "0.0.0.0:10000",
"tls_config": {
pmantica11 (Author)

Remove this TLS config from the sample config because it's not a valid one.

use tokio::time::interval;

pub const HEALTH_CHECK_SLOT_DISTANCE: u64 = 100;
pub static IS_NODE_UNHEALTHY: Lazy<Arc<AtomicBool>> = Lazy::new(|| Arc::new(AtomicBool::new(false)));
pmantica11 (Author)

I considered disconnecting clients only when the node had been unhealthy for some amount of time, but I think that was unnecessarily complicated. If a node is behind by 100 slots, it must have been unhealthy for roughly 40 seconds (at the ~400 ms target slot time), which is enough.
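
A minimal sketch of the monitoring task this approach implies, assuming the statics from this diff and the fetch helper shown further down; the 5-second polling period is an assumption:

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Duration;

use once_cell::sync::Lazy;
use solana_client::nonblocking::rpc_client::RpcClient;
use tokio::time::interval;

pub const HEALTH_CHECK_SLOT_DISTANCE: u64 = 100;
pub static IS_NODE_UNHEALTHY: Lazy<Arc<AtomicBool>> = Lazy::new(|| Arc::new(AtomicBool::new(false)));
pub static NUM_SLOTS_BEHIND: Lazy<Arc<AtomicU64>> = Lazy::new(|| Arc::new(AtomicU64::new(0)));

// Poll the local RPC node and publish its lag to the shared atomics that
// the gRPC service reads. The polling period is an assumption; the fetch
// helper is the one added elsewhere in this diff.
pub async fn monitor_node_health(client: RpcClient) {
    let mut ticker = interval(Duration::from_secs(5));
    loop {
        ticker.tick().await;
        let num_slots_behind = fetch_node_blocks_behind_with_infinite_retry(&client).await;
        NUM_SLOTS_BEHIND.store(num_slots_behind, Ordering::SeqCst);
        IS_NODE_UNHEALTHY.store(num_slots_behind > HEALTH_CHECK_SLOT_DISTANCE, Ordering::SeqCst);
    }
}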

use solana_rpc_client_api::{client_error, request};
use tokio::time::interval;

pub const HEALTH_CHECK_SLOT_DISTANCE: u64 = 100;
pmantica11 (Author)

I did not make this an env variable because of speed. I want to get this deployed ASAP to fix the customer impact. Also, I am an RPC-land rookie and don't even know how to configure env variables.


You can make it part of the gRPC config and load it that way.
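
A sketch of that suggestion, modeled on the plugin's existing serde-based config; the health_check_slot_distance field name and its default are assumptions:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct ConfigGrpc {
    pub address: String,
    // Hypothetical field: how many slots behind before clients are dropped.
    #[serde(default = "ConfigGrpc::default_health_check_slot_distance")]
    pub health_check_slot_distance: u64,
}

impl ConfigGrpc {
    const fn default_health_check_slot_distance() -> u64 {
        100
    }
}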

@@ -75,6 +77,10 @@ impl GeyserPlugin for Plugin {
.build()
.map_err(|error| GeyserPluginError::Custom(Box::new(error)))?;

// Monitor node health
let rpc_client = RpcClient::new("http://localhost:8899".to_string());
pmantica11 (Author)

I did not make this an env variable because of speed. I want to get this deployed ASAP to fix the customer impact. Also, I am an RPC-land rookie and don't even know how to configure env variables.


You can make it part of the gRPC config and load it that way.
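
And likewise for the hardcoded endpoint above, a small sketch with a hypothetical rpc_endpoint option:

use solana_client::nonblocking::rpc_client::RpcClient;

// Hypothetical: take the endpoint from the gRPC config when present,
// falling back to the local validator's default RPC port.
fn rpc_client_from_config(rpc_endpoint: Option<String>) -> RpcClient {
    RpcClient::new(rpc_endpoint.unwrap_or_else(|| "http://localhost:8899".to_string()))
}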

@@ -838,6 +839,13 @@ impl GrpcService {
}
}
message = messages_rx.recv() => {
let num_slots_behind = NUM_SLOTS_BEHIND.load(Ordering::SeqCst);
if num_slots_behind > HEALTH_CHECK_SLOT_DISTANCE {


Not required in the first release, but I would prefer this configured as two checks (see the sketch below):

  1. Auto-disconnect if >100 slots behind
  2. Disconnect if >20 slots behind for the last 5 checks
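
A hedged sketch of that two-tier policy: the thresholds and the five-check window are the reviewer's numbers, while the struct and method names are hypothetical.

use std::collections::VecDeque;

const HARD_LIMIT_SLOTS: u64 = 100;
const SOFT_LIMIT_SLOTS: u64 = 20;
const SOFT_LIMIT_WINDOW: usize = 5;

pub struct HealthWindow {
    recent: VecDeque<u64>, // slots-behind readings, most recent last
}

impl HealthWindow {
    pub fn new() -> Self {
        Self { recent: VecDeque::with_capacity(SOFT_LIMIT_WINDOW) }
    }

    // Record a reading and report whether clients should be disconnected.
    pub fn record(&mut self, num_slots_behind: u64) -> bool {
        if self.recent.len() == SOFT_LIMIT_WINDOW {
            self.recent.pop_front();
        }
        self.recent.push_back(num_slots_behind);

        // 1. Hard limit: disconnect immediately if far too many slots behind.
        let hard = num_slots_behind > HARD_LIMIT_SLOTS;
        // 2. Soft limit: disconnect if every one of the last 5 checks lagged.
        let soft = self.recent.len() == SOFT_LIMIT_WINDOW
            && self.recent.iter().all(|&n| n > SOFT_LIMIT_SLOTS);
        hard || soft
    }
}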

pub const HEALTH_CHECK_SLOT_DISTANCE: u64 = 100;
pub static NUM_SLOTS_BEHIND: Lazy<Arc<AtomicU64>> = Lazy::new(|| Arc::new(AtomicU64::new(0)));

pub async fn fetch_node_blocks_behind_with_infinite_retry(client: &RpcClient) -> u64 {

This should be slots behind, not blocks behind.
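
A hedged sketch of the renamed helper: getHealth returns Ok for a healthy node, while an unhealthy node reports its lag in the NodeUnhealthy error payload that solana-rpc-client-api exposes (the imports in this diff suggest exactly that). The 1-second retry delay is an assumption:

use std::time::Duration;

use solana_client::nonblocking::rpc_client::RpcClient;
use solana_rpc_client_api::{client_error, request};
use tokio::time::sleep;

pub async fn fetch_node_slots_behind_with_infinite_retry(client: &RpcClient) -> u64 {
    loop {
        match client.get_health().await {
            // A healthy node is within the health-check distance; report zero lag.
            Ok(()) => return 0,
            Err(err) => {
                // An unhealthy node reports its lag in the RPC error payload.
                if let client_error::ErrorKind::RpcError(request::RpcError::RpcResponseError {
                    data:
                        request::RpcResponseErrorData::NodeUnhealthy {
                            num_slots_behind: Some(num_slots_behind),
                        },
                    ..
                }) = err.kind()
                {
                    return *num_slots_behind;
                }
                // Transport error or unknown lag: wait and retry forever.
                sleep(Duration::from_secs(1)).await;
            }
        }
    }
}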

vovkman commented Nov 27, 2024

Is it the case that we will still accept connections and then disconnect when a message comes through and it's behind? We should probably prevent connecting in general if the node is behind.

pmantica11 (Author)

> Is it the case that we will still accept connections and then disconnect when a message comes through and it's behind? We should probably prevent connecting in general if the node is behind.

Good point. I'll also update the code to handle that.
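
For reference, a minimal sketch of that follow-up, with a hypothetical guard called at the top of the subscribe handler:

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

use once_cell::sync::Lazy;
use tonic::Status;

pub static IS_NODE_UNHEALTHY: Lazy<Arc<AtomicBool>> = Lazy::new(|| Arc::new(AtomicBool::new(false)));

// Hypothetical guard: refuse new subscriptions while the node is lagging,
// so clients never attach to an instance that would immediately drop them.
pub fn reject_if_unhealthy() -> Result<(), Status> {
    if IS_NODE_UNHEALTHY.load(Ordering::SeqCst) {
        Err(Status::unavailable("node is unhealthy; connect to a healthy gRPC instance"))
    } else {
        Ok(())
    }
}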
