Possible relay discovery defect #2676

Open
@mkermani144

  • Version: 1.9.2
  • Platform: Darwin / Linux
  • Subsystem: Circuit relay transport

Severity: High

Description:

I have a p2p network consisting of relay and non-relay nodes. The non-relay nodes are configured to discover 3 relays; I have tried this both with and without reservation concurrency configured. For some reason (connection issues or anything else; the cause is irrelevant to the main issue), connections between nodes and relays break, and after some time a node ends up disconnected from all of its relays.

After checking the logs and digging into the implementation of the circuit relay transport, I found that a relay discovery mechanism is already implemented. In simple terms, it works like this:

  1. When a relay disconnects, check the number of connected relays against discoverRelays
  2. If not enough relays are connected, start the discovery process and grab the discovery lock by setting a running flag on the RelayDiscovery instance
  3. After discovering relays, try to connect to them until the number of connected relays reaches discoverRelays
  4. If enough relays are connected, release the lock, so that discovery can run again when a relay disconnects

I think there is a critical issue here. If, for any reason, discovery doesn't find enough relays, or we cannot connect to the ones it finds, the relay count stays below discoverRelays while the discovery lock is still held, so discovery cannot run again. Over time, the remaining relays disconnect one by one, and we never reconnect to them.

As an example:

  1. Suppose a discoverRelays of 3
  2. We connect to 3 relays
  3. We disconnect from 2 for some reason
  4. The discovery starts, and finds 4 new relays
  5. For some reason, we only connect to 1 of the 4, resulting in 1 new + 1 old connected relays. This is below the threshold of 3, so the lock is never released and discovery won't run again
  6. The remaining relays disconnect over time, until no relays are connected
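
To make the example concrete, here is a minimal, synchronous TypeScript model of the mechanism as I described it above. It is not the libp2p source: the class RelayTracker, its fields, and the discover / tryConnect callbacks are hypothetical names made up for this illustration, and real discovery is asynchronous. It only reproduces the lock logic from the four steps.

```typescript
// Simplified model of the behaviour described in this issue, NOT libp2p code.
// All names here are hypothetical.
class RelayTracker {
  connected = new Set<string>()
  running = false // the "discovery lock"
  discoveryRuns = 0
  discoverRelays: number
  discover: () => string[] // candidates found by one discovery pass
  tryConnect: (id: string) => boolean // true if connecting succeeds

  constructor (
    discoverRelays: number,
    discover: () => string[],
    tryConnect: (id: string) => boolean
  ) {
    this.discoverRelays = discoverRelays
    this.discover = discover
    this.tryConnect = tryConnect
  }

  // Several relays may drop before the check runs, hence the rest parameter.
  onRelayDisconnect (...ids: string[]): void {
    for (const id of ids) this.connected.delete(id)
    if (this.connected.size >= this.discoverRelays) return
    if (this.running) return // lock held: discovery is not started again

    this.running = true
    this.discoveryRuns++
    for (const candidate of this.discover()) {
      if (this.connected.size >= this.discoverRelays) break
      if (this.tryConnect(candidate)) this.connected.add(candidate)
    }
    // The lock is released only when the target count is reached.
    if (this.connected.size >= this.discoverRelays) this.running = false
  }
}

function simulate (): { connected: number, discoveryRuns: number, lockHeld: boolean } {
  // discoverRelays = 3; discovery finds 4 relays, only r4 is connectable.
  const tracker = new RelayTracker(
    3,
    () => ['r4', 'r5', 'r6', 'r7'],
    (id) => id === 'r4'
  )
  tracker.connected = new Set(['r1', 'r2', 'r3'])

  tracker.onRelayDisconnect('r1', 'r2') // discovery runs, ends with r3 + r4
  tracker.onRelayDisconnect('r3') // lock still held, nothing happens
  tracker.onRelayDisconnect('r4') // lock still held, nothing happens

  return {
    connected: tracker.connected.size,
    discoveryRuns: tracker.discoveryRuns,
    lockHeld: tracker.running
  }
}
```

In this model, simulate() ends with no connected relays after a single discovery run, and the running flag is still set, which matches what I observe on my nodes.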

I wonder why other possible solutions are not implemented, for example changing the condition under which the lock is released, or running the discovery periodically.
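
As a rough sketch of what I mean by these two alternatives, the TypeScript below releases the lock whenever a discovery pass finishes (whether or not the target was reached) and exposes the check as a method that a timer could also call. Again, this is a hypothetical model and not a patch against libp2p: RetryingRelayTracker, checkRelays and the callbacks are names I invented, and the real code is asynchronous.

```typescript
// Hypothetical sketch of the suggested alternatives, NOT libp2p code.
class RetryingRelayTracker {
  connected = new Set<string>()
  running = false // still guards against overlapping discovery passes
  discoveryRuns = 0
  discoverRelays: number
  discover: () => string[]
  tryConnect: (id: string) => boolean

  constructor (
    discoverRelays: number,
    discover: () => string[],
    tryConnect: (id: string) => boolean
  ) {
    this.discoverRelays = discoverRelays
    this.discover = discover
    this.tryConnect = tryConnect
  }

  // Can be called from a disconnect handler or from a periodic timer.
  checkRelays (): void {
    if (this.running || this.connected.size >= this.discoverRelays) return
    this.running = true
    try {
      this.discoveryRuns++
      for (const candidate of this.discover()) {
        if (this.connected.size >= this.discoverRelays) break
        if (this.tryConnect(candidate)) this.connected.add(candidate)
      }
    } finally {
      // Release the lock when the pass ends, even if the target was missed.
      this.running = false
    }
  }

  onRelayDisconnect (...ids: string[]): void {
    for (const id of ids) this.connected.delete(id)
    this.checkRelays()
  }
}

function demoRetry (): { connected: number, discoveryRuns: number, lockHeld: boolean } {
  const reachable = new Set(['r4']) // only r4 accepts connections at first
  const tracker = new RetryingRelayTracker(
    3,
    () => ['r4', 'r5', 'r6', 'r7'],
    (id) => reachable.has(id)
  )
  tracker.connected = new Set(['r1', 'r2', 'r3'])

  tracker.onRelayDisconnect('r1', 'r2') // first pass ends with r3 + r4
  reachable.add('r5') // r5 becomes reachable later
  tracker.checkRelays() // e.g. triggered by a timer; reaches 3 relays

  return {
    connected: tracker.connected.size,
    discoveryRuns: tracker.discoveryRuns,
    lockHeld: tracker.running
  }
}
```

A periodic trigger could be as simple as setInterval(() => tracker.checkRelays(), 30000), where the 30 second interval is an arbitrary value I picked for illustration. With either trigger, a failed pass no longer blocks later passes.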

Steps to reproduce the error:

The scenario is described in the description above. Some other configuration options may (or may not) affect it, though. For example, auto dial should be disabled.
