Possible relay discovery defect #2676
Comments
I found out that the random walk part of the discovery is essentially an infinite async iterable (until the relevant signal is aborted), so it's not exactly the case that I described in my previous comment. By the way, shouldn't we add |
Yes, so it should continue to discover peers until it eventually finds enough relays to bring the number with reservations up to It sounds like you're seeing something different?
I think that would make it easier for users to discover misconfigurations, yes.
Could you expand on this a little?
What I'm saying here is that, in my opinion, it's wrong to suppose random walk always succeeds. If any error is thrown inside of the random walk, peer routing, or any related code, we get a
If it's as simple as adding it to the list of transport
Based on what I understood from the relay discovery code, it has two parts: it first tries to find some relays in the peer store, and then starts the random walk. Putting my answer to the first question aside and supposing random walk always succeeds, the peer store search is still only run once. It is neither an infinite iterable like random walk, nor is it re-run through some other mechanism. We may not be able to connect to the peer store relays at the moment of disconnection for some reason (say, a simple network partition), but we may be able to connect to them a moment later if we try again. In other words, there may be no need to run random walk for a long time at all: we simply need to give the peer store another chance.
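To make the "give the peer store another chance" idea concrete, here is a minimal sketch in TypeScript. It is not js-libp2p code: `retryKnownRelays`, `CanConnect`, and the relay names are hypothetical, the connection attempt is simulated by an injected callback, and the delay between rounds that a real implementation would need is left out.

```typescript
// Sketch only: every name here is hypothetical, none of it is a js-libp2p API.
// `canConnect` stands in for a real dial attempt to a relay from the peer store.
type CanConnect = (relay: string, round: number) => boolean

function retryKnownRelays (
  known: string[],
  target: number,
  maxRounds: number,
  canConnect: CanConnect
): { connected: string[], rounds: number } {
  const connected = new Set<string>()
  let rounds = 0

  // Re-scan the already-known relays until enough are connected
  // or the round budget is used up.
  while (rounds < maxRounds && connected.size < target) {
    for (const relay of known) {
      if (connected.size >= target) break
      if (!connected.has(relay) && canConnect(relay, rounds)) {
        connected.add(relay)
      }
    }
    rounds++
    // A real implementation would wait here before starting the next round.
  }

  return { connected: Array.from(connected), rounds }
}

// Simulated network partition: nothing is reachable in round 0,
// everything is reachable from round 1 onwards.
const healed: CanConnect = (_relay, round) => round > 0

const retried = retryKnownRelays(['relay-a', 'relay-b', 'relay-c'], 3, 5, healed)
const singlePass = retryKnownRelays(['relay-a', 'relay-b', 'relay-c'], 3, 1, healed)
```

With a single pass (`maxRounds` of 1) the simulated partition leaves no relays connected, while allowing further rounds reaches all three known relays without any random walk.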
Because no peer router is configured for RoseNet, we should restart the relay discovery service manually so that it is not halted forever. More details can be found here: libp2p/js-libp2p#2676
My previous comment contains the answers to @achingbrain's questions.
Severity: High
Description:
I have a p2p network of some relay and non-relay nodes. The non-relay nodes are configured to discover 3 relays, configured both with and without reservation concurrency. For some reason (maybe connection issues, or anything else - it's irrelevant for the main issue), the connection between nodes and relays breaks, and then after a time, the node is disconnected from all of the relays.
After checking the logs and digging into the implementation of the circuit relay transport, I found that there is an already-implemented relay discovery mechanism, and it works this way in simple terms:
discoverRelays
running
flag inRelayDiscovery
instancediscoverRelays
I think there is a critical issue here. If, for any reason, the discovery doesn't find enough relays, or we cannot connect to the discovered ones, the relay count stays below discoverRelays while the discovery is locked and cannot be run again. Over time, relays disconnect one by one, and we don't reconnect to them.
As an example: discoverRelays of 3
I wonder why other possible solutions are not implemented, for example, changing the condition under which the lock is released, or running the discovery periodically.
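The locking behaviour described above can be illustrated with a toy model. This is a sketch of my reading of the report, not the circuit relay transport source: `ToyDiscovery`, `findRelays`, `releaseAlways`, and `makeFlaky` are hypothetical names, and discovery is modelled as a synchronous call for brevity.

```typescript
// Toy model of the reported lock; NOT the js-libp2p implementation.
// All names below are hypothetical and exist only for illustration.
type FindRelays = () => string[]

class ToyDiscovery {
  running = false
  relays = new Set<string>()
  target: number
  findRelays: FindRelays
  releaseAlways: boolean

  constructor (target: number, findRelays: FindRelays, releaseAlways: boolean) {
    this.target = target
    this.findRelays = findRelays
    this.releaseAlways = releaseAlways
  }

  // Returns false when the call was skipped because the lock is still held.
  discover (): boolean {
    if (this.running) {
      return false
    }
    this.running = true

    for (const relay of this.findRelays()) {
      if (this.relays.size >= this.target) break
      this.relays.add(relay)
    }

    // Reported behaviour: the lock is released only once the target is reached.
    // Suggested alternative: release it unconditionally so a later call can retry.
    if (this.releaseAlways || this.relays.size >= this.target) {
      this.running = false
    }
    return true
  }
}

// A source that yields one relay on the first round and two more afterwards.
function makeFlaky (): FindRelays {
  let round = 0
  return () => (round++ === 0 ? ['relay-a'] : ['relay-b', 'relay-c'])
}

const stuck = new ToyDiscovery(3, makeFlaky(), false)
const stuckFirst = stuck.discover()
const stuckSecond = stuck.discover()

const retrying = new ToyDiscovery(3, makeFlaky(), true)
retrying.discover()
retrying.discover()
```

In this model the `stuck` instance never gets past one relay because its second call is rejected by the lock, whereas the `retrying` instance reaches the target of 3 on its second round. Running `discover()` on a timer would be the periodic variant of the same idea.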
Steps to reproduce the error:
The steps are covered in the description above. Some other configuration options may (or may not) affect the scenario, though. For example, auto dial should be disabled.