p2p: dead/solo subnets remedy (enhanced) #1949

Open · iurii-ssv wants to merge 31 commits into stage from testing/deploy/deadsub-remedy

Conversation

iurii-ssv (Contributor):
Replaces #1917, additionally improving a bunch of the p2p-handling code, with the main contributing factor being "selective discovery".

iurii-ssv force-pushed the testing/deploy/deadsub-remedy branch 2 times, most recently from 5dac0bc to a364c2c on December 18, 2024 13:31
Comment on lines 22 to 38
var (
TrimmedRecently = ttlcache.New(ttlcache.WithTTL[peer.ID, struct{}](30 * time.Minute))
)

func init() {
go TrimmedRecently.Start() // start cleanup go-routine
}
iurii-ssv (Contributor, Author):

I know global vars aren't the best way to do this, but I think we should leave it global for now and refactor it later (once this code has been running in production for a while and we've proven this solution really does help), since we might want/have to swap it out for something else.
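
For illustration, a minimal sketch of what such a later refactor could look like, holding the cache on a small struct instead of a package-level variable (the connTracker type and its methods are hypothetical, not part of this PR):

package p2p

import (
	"time"

	"github.com/jellydator/ttlcache/v3"
	"github.com/libp2p/go-libp2p/core/peer"
)

// connTracker (hypothetical) owns the "trimmed recently" cache instead of a global var.
type connTracker struct {
	trimmedRecently *ttlcache.Cache[peer.ID, struct{}]
}

func newConnTracker(ttl time.Duration) *connTracker {
	c := &connTracker{
		trimmedRecently: ttlcache.New(ttlcache.WithTTL[peer.ID, struct{}](ttl)),
	}
	go c.trimmedRecently.Start() // expired-entry cleanup goroutine, same as the init() above
	return c
}

func (c *connTracker) MarkTrimmed(p peer.ID) {
	c.trimmedRecently.Set(p, struct{}{}, ttlcache.DefaultTTL)
}

func (c *connTracker) WasTrimmedRecently(p peer.ID) bool {
	return c.trimmedRecently.Has(p)
}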

Comment on lines 231 to 233
if len(topicPeers) >= 2 {
continue // this topic has enough peers
}
iurii-ssv (Contributor, Author), Dec 18, 2024:

This means that in discovery we only look for peers who help with either dead or solo subnets (without actually prioritizing the dead ones). This kind of filtering may also be too aggressive overall: in my testing we don't stay at the MaxPeers limit for long, instead falling 10-20% below it and staying at that level, which probably means we can bump inboundLimitRatio up from 0.5 to let incoming connections fill more spots - but all of that varies from run to run.

Obviously this ^ isn't the optimal way to do it, but in my testing on prod it works reasonably well (with MaxPeers of 60 and 90) - so we probably want to improve it, just maybe not in the upcoming release.
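
For readers following along, a rough standalone sketch of the filtering idea behind this snippet (the function and parameter names are made up for illustration, they're not the PR's actual code):

package p2p

// wantsCandidate (hypothetical) accepts a discovered peer only if it covers at
// least one of our subnets that currently has fewer than 2 connected peers
// (0 connected = dead subnet, 1 connected = solo subnet).
func wantsCandidate(ownSubnets []int, connectedPerSubnet map[int]int, candidateSubnets map[int]bool) bool {
	for _, subnet := range ownSubnets {
		if connectedPerSubnet[subnet] >= 2 {
			continue // this subnet already has enough peers
		}
		if candidateSubnets[subnet] {
			return true // candidate helps with a dead or solo subnet
		}
	}
	return false
}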

Comment on lines 50 to 52
// inboundLimitRatio is the ratio of inbound connections to the total connections
// we allow (both inbound and outbound).
inboundLimitRatio = float64(0.75)
iurii-ssv (Contributor, Author):

I'm testing now whether 0.75 is a good fit, but so far:

  • I found the value of 0.5 to work well
  • the value of 0.8 seems noticeably worse (e.g. taking 30 minutes instead of 15 to get rid of dead subnets)

iurii-ssv (Contributor, Author), Dec 18, 2024:

I guess 0.75 is still too high; even with MaxPeers = 90 it can sometimes take up to 90 minutes to get rid of dead subnets (or even longer, if we are unlucky enough),

so it's probably better to set inboundLimitRatio = 0.6 or even lower (the lower, the better).

iurii-ssv (Contributor, Author):

I'd go with 0.5 if there aren't any reasons not to
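
For context, a back-of-the-envelope sketch of how the ratio translates into an inbound budget, assuming the limit is simply MaxPeers scaled by inboundLimitRatio (my reading of the comments above, not necessarily the exact implementation):

package p2p

// inboundLimit (illustrative): with MaxPeers = 60 and a ratio of 0.5, at most
// 30 slots can be held by inbound connections; the remaining slots are left
// for outbound connections, which are the ones selective discovery can steer.
func inboundLimit(maxPeers int, inboundLimitRatio float64) int {
	return int(float64(maxPeers) * inboundLimitRatio)
}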

iurii-ssv (Contributor, Author):

Another thing I've realized is happening: because we only trim once we've reached the MaxPeers limit, at some point we might stop trimming completely. For example, we've reached the maximum number of incoming connections we allow via the 0.5 ratio (say, 30) and we've discovered some outgoing peers, but not enough to reach the MaxPeers limit.

In other words, we've discovered everything we could (and we keep discovering, just at a very slow rate - due to the quite aggressive filter there aren't many connections we are interested in), but that also means we've stopped re-cycling (trimming) our incoming connections periodically, because we don't trim incoming and outgoing connections separately.

I've added a commit (d230c4a) to address that; it's a pretty hacky/ad-hoc way to do the job. Ideally (to simplify this) I think we probably want separate limits per incoming/outgoing connections and somewhat separate trimming logic, WDYT?
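
A minimal sketch of what "somewhat separate trimming logic" could mean, where inbound connections keep rotating each trim cycle even while the total peer count sits below MaxPeers (the names below are hypothetical, not code from commit d230c4a):

package p2p

import "sort"

type scoredConn struct {
	id    string  // peer ID (string for brevity)
	score float64 // e.g. how much the peer helps with dead/solo subnets
}

// pickInboundToTrim re-cycles up to rotatePerCycle of the worst-scoring inbound
// connections every trim cycle, regardless of whether MaxPeers was reached,
// so inbound slots keep being freed up for more useful peers.
func pickInboundToTrim(inbound []scoredConn, rotatePerCycle int) []string {
	sort.Slice(inbound, func(i, j int) bool { return inbound[i].score < inbound[j].score })
	if rotatePerCycle > len(inbound) {
		rotatePerCycle = len(inbound)
	}
	trimmed := make([]string, 0, rotatePerCycle)
	for _, c := range inbound[:rotatePerCycle] {
		trimmed = append(trimmed, c.id)
	}
	return trimmed
}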

Comment on lines 60 to 63
// withExtraFiltering specifies whether we want to additionally filter discovered peers,
// since operator node and boot node are using the same discovery code (`DiscV5Service`) we
// want to do this extra filtering for operator node only
withExtraFiltering bool
Contributor:

The bootnode doesn't pull peers out of discovery to filter them, so this isn't needed.

iurii-ssv (Contributor, Author):

Can you clarify why it needs newDiscV5Service then?

Contributor:

@iurii-ssv
Bootnode is still a regular node in terms of discovery: it has a routing table which collects the other peers it has had discovery interactions with, and it does "pull" peers from this table to supply to other peers that query it for the network peers it knows. But those don't go through our filtering; our filtering is only used when pulling peers out of the routing table to be used for protocol connections (TCP).

iurii-ssv (Contributor, Author):

Okay, removed withExtraFiltering then, so now we are using this func

func (n *p2pNetwork) peerScore

for both (see the sketch below):

  • ranking "proposed" peers
  • ranking peers when trimming (peer "protection")
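
To illustrate that dual use with a self-contained sketch (peerScore's real signature isn't shown here; the scorer type and the helpers below are assumptions):

package p2p

import "sort"

// scorer stands in for (n *p2pNetwork).peerScore in this sketch.
type scorer func(peerID string) float64

// rankCandidates orders "proposed" (discovered) peers best-first, so the
// highest-scoring candidates get dialed first.
func rankCandidates(score scorer, candidates []string) []string {
	sort.Slice(candidates, func(i, j int) bool { return score(candidates[i]) > score(candidates[j]) })
	return candidates
}

// protectBest marks the top-keep scoring connected peers as protected from
// trimming; everything else is fair game for the next trim cycle.
func protectBest(score scorer, connected []string, keep int) map[string]bool {
	ranked := rankCandidates(score, append([]string(nil), connected...))
	protected := make(map[string]bool, keep)
	for i := 0; i < keep && i < len(ranked); i++ {
		protected[ranked[i]] = true
	}
	return protected
}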

iurii-ssv force-pushed the testing/deploy/deadsub-remedy branch from 63b2c45 to 88a752c on January 9, 2025 13:14
iurii-ssv force-pushed the testing/deploy/deadsub-remedy branch from 88a752c to 44665d0 on January 9, 2025 13:58
codecov bot commented Jan 9, 2025

iurii-ssv force-pushed the testing/deploy/deadsub-remedy branch from 75c0d37 to 25950ac on January 13, 2025 12:50
iurii-ssv force-pushed the testing/deploy/deadsub-remedy branch from e828500 to 416fd27 on January 14, 2025 19:33