Description
- Version: 1.9.2
- Platform: Darwin / Linux
- Subsystem: Circuit relay transport
Severity: High
Description:
I have a p2p network made up of relay and non-relay nodes. The non-relay nodes are configured to discover 3 relays, configured both with and without reservation concurrency. For some reason (connection issues or anything else; the cause is irrelevant to the main issue), the connections between nodes and relays break, and after some time the node is disconnected from all of the relays.
After checking the logs and digging into the implementation of the circuit relay transport, I found that a relay discovery mechanism is already implemented. In simple terms, it works like this:
- When a relay disconnects, check the number of connected relays against `discoverRelays`
- If not enough relays are connected, start the discovery process and grab the discovery lock by setting a `running` flag in the `RelayDiscovery` instance
- After discovering relays, try to connect to them until we reach `discoverRelays`
- If enough relays are connected, release the lock, allowing the discovery to run again the next time a relay disconnects
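The steps above can be sketched as a small model. This is only an illustration of the behaviour as I understand it, not the actual js-libp2p source: the class name `RelayDiscoveryModel` and the `findCandidates` / `tryConnect` callbacks are hypothetical, and everything is synchronous to keep it short.

```typescript
// Hypothetical, simplified model of the described mechanism (not the library code).
class RelayDiscoveryModel {
  running = false // the discovery "lock"
  readonly connected = new Set<string>()
  private readonly discoverRelays: number
  private readonly findCandidates: () => string[]
  private readonly tryConnect: (relay: string) => boolean

  constructor (
    discoverRelays: number,
    findCandidates: () => string[],
    tryConnect: (relay: string) => boolean
  ) {
    this.discoverRelays = discoverRelays
    this.findCandidates = findCandidates
    this.tryConnect = tryConnect
  }

  onRelayDisconnected (relay: string): void {
    this.connected.delete(relay)
    // Step 1: compare the connected count against discoverRelays
    if (this.connected.size >= this.discoverRelays) return
    // Discovery is locked: nothing happens
    if (this.running) return
    // Step 2: grab the lock and start discovery
    this.running = true
    // Step 3: connect to discovered relays until the threshold is reached
    for (const candidate of this.findCandidates()) {
      if (this.connected.size >= this.discoverRelays) break
      if (this.tryConnect(candidate)) this.connected.add(candidate)
    }
    // Step 4: the lock is released only if enough relays are connected
    if (this.connected.size >= this.discoverRelays) this.running = false
  }
}
```

In this model, a discovery round that ends below the threshold leaves `running` set to `true`, so later disconnections return early without starting a new round.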
I think there is a critical issue here. If, for any reason, the discovery doesn't find enough relays, or we cannot connect to the discovered ones, the relay count stays below `discoverRelays` while the discovery is locked and cannot run again. Over time, relays disconnect one by one, and we never reconnect to them.
As an example:
- Suppose a `discoverRelays` of 3
- We connect to 3 relays
- We disconnect from 2 for some reason
- The discovery starts, and finds 4 new relays
- For some reason, we only connect to 1 of the 4, resulting in 1 new + 1 old connected relays, which is below the threshold of 3, so the lock is not released and the discovery won't run again
- The process continues until no relays are connected
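The same example can be traced with a counts-only sketch. The `RelayState` type and the `disconnect` / `finishDiscovery` helpers are hypothetical names used for illustration, assuming the discovery round completes after both disconnections have happened.

```typescript
// Hypothetical state: number of connected relays plus the discovery lock.
interface RelayState {
  connected: number
  running: boolean
}

// A relay disconnects; discovery starts only if it is not already locked.
function disconnect (s: RelayState, discoverRelays: number): RelayState {
  const connected = Math.max(0, s.connected - 1)
  return { connected, running: s.running || connected < discoverRelays }
}

// The in-flight discovery ends after connecting to `newlyConnected` relays.
// The lock is released only if the threshold has been reached.
function finishDiscovery (s: RelayState, discoverRelays: number, newlyConnected: number): RelayState {
  if (!s.running) return s
  const connected = s.connected + newlyConnected
  return { connected, running: connected < discoverRelays }
}

const threshold = 3
let state: RelayState = { connected: 3, running: false }
state = disconnect(state, threshold) // 2 connected, discovery starts
state = disconnect(state, threshold) // 1 connected, discovery still in flight
state = finishDiscovery(state, threshold, 1) // 1 old + 1 new = 2, lock still held
state = disconnect(state, threshold) // 1 connected, no new discovery
state = disconnect(state, threshold) // 0 connected, no new discovery
console.log(state) // { connected: 0, running: true }
```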
I wonder why other possible solutions are not implemented, for example, changing the condition under which the lock is released, or running the discovery periodically.
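A rough sketch of both ideas, under the same simplifying assumptions as the description: the class `RetryingRelayDiscovery`, its callbacks, and `startPeriodicCheck` are hypothetical names, not existing js-libp2p APIs. The lock is released whenever a round ends, and an optional timer re-runs the check.

```typescript
// Hypothetical sketch of the two suggested alternatives (not library code).
class RetryingRelayDiscovery {
  running = false
  readonly connected = new Set<string>()
  private readonly discoverRelays: number
  private readonly findCandidates: () => string[]
  private readonly tryConnect: (relay: string) => boolean

  constructor (
    discoverRelays: number,
    findCandidates: () => string[],
    tryConnect: (relay: string) => boolean
  ) {
    this.discoverRelays = discoverRelays
    this.findCandidates = findCandidates
    this.tryConnect = tryConnect
  }

  discover (): void {
    if (this.running || this.connected.size >= this.discoverRelays) return
    this.running = true
    try {
      for (const candidate of this.findCandidates()) {
        if (this.connected.size >= this.discoverRelays) break
        if (this.tryConnect(candidate)) this.connected.add(candidate)
      }
    } finally {
      // Alternative 1: always release the lock when the round ends,
      // even if the threshold was not reached or an error was thrown
      this.running = false
    }
  }

  // Alternative 2: re-check periodically; returns a function that stops the timer
  startPeriodicCheck (intervalMs: number): () => void {
    const timer = setInterval(() => { this.discover() }, intervalMs)
    return () => { clearInterval(timer) }
  }
}
```

With either change, a round that ends below `discoverRelays` no longer blocks later rounds, so a subsequent disconnection or timer tick can try again.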
Steps to reproduce the error:
The scenario is described above. Some other configuration options may (or may not) affect it, though. For example, auto dial should be disabled.