-
Notifications
You must be signed in to change notification settings - Fork 710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make approval-distribution aggression a bit more robust and less spammy #6696
base: alexggh/fix_possible_bug
Are you sure you want to change the base?
Make approval-distribution aggression a bit more robust and less spammy #6696
Conversation
Signed-off-by: Alexandru Gheorghe <[email protected]>
All GitHub workflows were cancelled due to failure one of the required jobs. |
Is this only approval votes? Or also approval assigment annoucements? It's maybe both buit you do not notice the annoucements much since we merged tranche zero? I've worried that our topology sucks for ages now: We've two disjoint topologies which do not benefit from each other, a 2d gride and a random graph. I've no idea how one justifies two disjoint topologies with no synergy between them. At 1000 node the grid needs 31.6-1=30 message per layer for two layers. And the random graph gets pretty dense too. Instead I'd propose roughly some unified slimmer topology, like.. A 3d grid, so 10-1=9 message per layer for three layers, but which unifies with log2(1000) = 10 extra random hops, or maybe its ln(1000)=7, and which enforce some unified ttl slightly larger than 3. If Alice -> Bob -> Carol is a grid path, then Carol still tries sending to all 18 of her neighbors who are not neighbors of Bob, as well as 10 random ones. Alice and Bob also sent to 10 randoms. If Dave is one of Carol's neighbors, and the ttl>4, then Dave tries sending to all 18 of his neighbors that're not neighbors of Carol, and another 10 randoms. If Rob is someone's random, then Rob sends to all 27 of his neighbors, and another 10 randoms. We still have an aggression problem: Are we sending the message directly or asking the other side if they'd like the message? We'd ideally keep all messages small and send them unrequested. You recieve a message whenever someone in your 27 3d grid neighbors recieves it, so this does not get worse vs the 2d grid. In expectation, you also recieve a message 10 times from random guys too, so that's 37 incoming messages per message on the network, likely much less than currently. We might dampen the grid at exactly layer 3, meaning Carol sends to like 20 randoms or something, which assumes Bob was honest. We still have a finishing problem caused by our ttl: What happens if I never get a message? We should think about how this impacts everything. Anyways I've discussed this some with @chenda-w3f but it turns out the distributed systems theory says very little about the really optimal scheme is here. It's maybe worth discussing at the retreat.. |
@@ -514,6 +514,8 @@ struct BlockEntry { | |||
vrf_story: RelayVRFStory, | |||
/// The block slot. | |||
slot: Slot, | |||
/// Backing of from re-sending messages to peers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Backing of from re-sending messages to peers. | |
/// Backing off from re-sending messages to peers. |
// multiples times. This means it is safe to accept duplicate messages without punishing the | ||
// peer and reduce the reputation and can end up banning the Peer, which in turn will create | ||
// more no-shows. | ||
fn accept_duplicates_from_validators( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would safer to accept only a bounded amount of duplicates over time.
@@ -2255,18 +2307,42 @@ impl State { | |||
&self.topologies, | |||
|block_entry| { | |||
let block_age = max_age - block_entry.number; | |||
// We want to resend only for blocks between min_age and min_age + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it should be sufficient that we trigger aggression only for oldest unfinalized block. From what I've seen in production and testing is that usuallt are a few unapproved candidates holding finality on a particular block.
Resending for all unfinalized blocks doesn't really make sense, it's better to make small incremental progress than send large bursts of messages. So I would just remove the resend_unfinalized_period
parameter and code.
Am I missing anything ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I missing anything ?
I don't think you are, I was think about the same thing, but I did not know the initial reason for using resend_unfinalized_period
so I was trying to avoid it entirely, so yeah I'll look into resending it just for the last unfinalized.
modify_reputation( | ||
&mut self.reputation, | ||
network_sender, | ||
if !Self::accept_duplicates_from_validators( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
above log could show if duplicate is accepted.
After finality started lagging on kusama around
2025-11-25 15:55:40
nodes started being overloaded with messages and some restarted withI think this happened because our aggression in the current form is way too spammy and create problems in situation where we already constructed blocks with a load of candidates to check which what happened around
#25933682
before and after. However aggression, does help in the nightmare scenario where the network is segmented and sparsely connected, so I tend to think we shouldn't completely remove it.The current configuration is:
The way aggression works right now :
Proposed improvements: