Force client disconnects when node is unhealthy #13
base: master-2.2
Conversation
solana-sdk = "~2.1.2"
solana-transaction-status = "~2.1.2"
solana-client = "~2.1.2"
solana-rpc-client-api = "~2.1.2"
Using the master git versions was giving me package conflict errors, so I'm using the latest released version instead. I tested the code with the Solana test validator and it worked, so we should be good.
This is risky; we should keep using the master versions, since Geyser usually has breaking changes between minor versions.
@@ -5,13 +5,13 @@
    },
    "grpc": {
        "address": "0.0.0.0:10000",
        "tls_config": {
Remove this tls_config from the sample config, because it's not a valid TLS config.
use tokio::time::interval;

pub const HEALTH_CHECK_SLOT_DISTANCE: u64 = 100;
pub const IS_NODE_UNHEALTHY: Lazy<Arc<AtomicBool>> = Lazy::new(|| Arc::new(AtomicBool::new(false)));
I considered disconnecting clients only when the node had been unhealthy for some amount of time, but I think that was unnecessarily complicated. If a node is behind by 100 slots, it must have been unhealthy for about 30 seconds, which is enough.
use solana_rpc_client_api::{client_error, request};
use tokio::time::interval;

pub const HEALTH_CHECK_SLOT_DISTANCE: u64 = 100;
I did not make this an env variable because of speed. I want to get this deployed ASAP to fix the customer impact. I'm an RPC-land rookie and don't even know how to configure env variables.
You can make it part of the grpc config and load it that way
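A minimal sketch of that, assuming the gRPC section of the config is deserialized with serde (the struct and field names here are illustrative, not the plugin's real ones):

```rust
use std::net::SocketAddr;

use serde::Deserialize;

// Illustrative stand-in for the plugin's gRPC config section.
#[derive(Debug, Deserialize)]
pub struct ConfigGrpc {
    pub address: SocketAddr,
    // ... existing fields ...
    /// Drop clients once the node reports being this many slots behind.
    #[serde(default = "default_health_check_slot_distance")]
    pub health_check_slot_distance: u64,
}

const fn default_health_check_slot_distance() -> u64 {
    100
}
```

With the serde default, existing config files keep working and the current constant simply becomes the fallback value.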
@@ -75,6 +77,10 @@ impl GeyserPlugin for Plugin {
        .build()
        .map_err(|error| GeyserPluginError::Custom(Box::new(error)))?;

    // Monitor node health
    let rpc_client = RpcClient::new("http://localhost:8899".to_string());
I did not make this an env variable because of speed. I want to get this deployed ASAP to fix the customer impact. I'm an RPC-land rookie and don't even know how to configure env variables.
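For reference, a rough sketch of pulling the endpoint from config instead of hardcoding it, keeping the current URL as the fallback (the helper and the config field it would read from are made up for illustration):

```rust
use solana_client::nonblocking::rpc_client::RpcClient;

/// Illustrative helper: build the health-check client from an optional config
/// value (e.g. a `health_check_rpc_endpoint` field on the gRPC config section),
/// falling back to the URL currently hardcoded in this PR.
fn health_check_rpc_client(configured_endpoint: Option<String>) -> RpcClient {
    let endpoint =
        configured_endpoint.unwrap_or_else(|| "http://localhost:8899".to_string());
    RpcClient::new(endpoint)
}
```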
@@ -838,6 +839,13 @@ impl GrpcService {
            }
        }
        message = messages_rx.recv() => {
            let num_slots_behind = NUM_SLOTS_BEHIND.load(Ordering::SeqCst);
            if num_slots_behind > HEALTH_CHECK_SLOT_DISTANCE {
Not required in the first release, but I would prefer if this is configured as two checks (a rough sketch follows after the list):
- Auto-disconnect if >100 slots behind
- Disconnect if >20 slots behind for each of the last 5 checks
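One way that two-tier check could look, with made-up names and the suggested thresholds hardcoded for illustration (in practice they would come from config):

```rust
use std::collections::VecDeque;

// Thresholds from the suggestion above.
const HARD_SLOT_DISTANCE: u64 = 100;
const SOFT_SLOT_DISTANCE: u64 = 20;
const SOFT_CHECK_WINDOW: usize = 5;

/// Keeps the most recent health readings and decides when to drop clients:
/// immediately if the node is very far behind, or if it has been moderately
/// behind for several consecutive checks.
pub struct HealthWindow {
    recent: VecDeque<u64>,
}

impl HealthWindow {
    pub fn new() -> Self {
        Self { recent: VecDeque::with_capacity(SOFT_CHECK_WINDOW) }
    }

    /// Record the latest `num_slots_behind` reading, keeping only the window.
    pub fn record(&mut self, num_slots_behind: u64) {
        if self.recent.len() == SOFT_CHECK_WINDOW {
            self.recent.pop_front();
        }
        self.recent.push_back(num_slots_behind);
    }

    pub fn should_disconnect(&self) -> bool {
        let far_behind_now = self
            .recent
            .back()
            .map_or(false, |behind| *behind > HARD_SLOT_DISTANCE);
        let lagging_for_a_while = self.recent.len() == SOFT_CHECK_WINDOW
            && self.recent.iter().all(|behind| *behind > SOFT_SLOT_DISTANCE);
        far_behind_now || lagging_for_a_while
    }
}
```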
pub const HEALTH_CHECK_SLOT_DISTANCE: u64 = 100;
pub static NUM_SLOTS_BEHIND: Lazy<Arc<AtomicU64>> = Lazy::new(|| Arc::new(AtomicU64::new(0)));

pub async fn fetch_node_blocks_behind_with_infinite_retry(client: &RpcClient) -> u64 {
This should be slots behind, not blocks behind.
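A sketch of what a slots-based helper might look like, assuming the nonblocking RpcClient and that the imported client_error / request types expose getHealth's NodeUnhealthy error data (the retry delay is an assumption):

```rust
use std::time::Duration;

use solana_client::nonblocking::rpc_client::RpcClient;
use solana_rpc_client_api::{
    client_error::ErrorKind,
    request::{RpcError, RpcResponseErrorData},
};

/// Illustrative slots-based variant of the helper: ask the node for its own
/// health and report how many slots behind it says it is (0 when healthy).
/// Transport errors are retried forever, matching the existing helper's intent.
pub async fn fetch_node_slots_behind_with_infinite_retry(client: &RpcClient) -> u64 {
    loop {
        match client.get_health().await {
            // getHealth succeeds when the node is within the validator's own
            // health-check slot distance of the cluster.
            Ok(()) => return 0,
            Err(err) => match err.kind() {
                // An unhealthy node reports how far behind it is, when known.
                ErrorKind::RpcError(RpcError::RpcResponseError {
                    data: RpcResponseErrorData::NodeUnhealthy { num_slots_behind },
                    ..
                }) => return num_slots_behind.unwrap_or(u64::MAX),
                // Anything else (connection refused, timeout, ...): retry.
                _ => tokio::time::sleep(Duration::from_secs(1)).await,
            },
        }
    }
}
```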
Is it the case that we will still accept connections and then disconnect when a message comes through and it's behind? We should probably prevent connecting in general if the node is behind.
Good point. I'll also update the code to handle that.
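A sketch of what that connect-time gate might look like, assuming the subscribe handler is a tonic service and reusing the PR's shared counter (the guard function itself is hypothetical):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

use tonic::Status;

// Stand-ins for the PR's shared state so the sketch is self-contained.
static NUM_SLOTS_BEHIND: AtomicU64 = AtomicU64::new(0);
const HEALTH_CHECK_SLOT_DISTANCE: u64 = 100;

/// Hypothetical guard to call at the top of the subscribe handler: reject new
/// clients while the node is lagging instead of accepting them and dropping
/// them on the next message.
fn reject_if_node_unhealthy() -> Result<(), Status> {
    let num_slots_behind = NUM_SLOTS_BEHIND.load(Ordering::SeqCst);
    if num_slots_behind > HEALTH_CHECK_SLOT_DISTANCE {
        return Err(Status::unavailable(format!(
            "node is {num_slots_behind} slots behind, please reconnect to a healthy endpoint"
        )));
    }
    Ok(())
}
```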
Clients are prevented from connecting to an unhealthy gRPC instance, but they are not disconnected from a lagging one. In this PR we disconnect clients so that they are forced to reconnect to a healthy gRPC instance.