Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

peer: Synchronize net.Conn with a mutex #3478

Merged
merged 1 commit into from
Mar 4, 2025
Merged

Conversation

jrick
Copy link
Member

@jrick jrick commented Feb 27, 2025

The access guarded by an atomic int32 was incorrect. For example, access to the p.conn could be performed as long as p.connected was non-zero, but p.connected would be incremented before p.conn was ever assigned by AssociateConnection.

The access guarded by an atomic int32 was incorrect.  For example, access to
the p.conn could be performed as long as p.connected was non-zero, but
p.connected would be incremented before p.conn was ever assigned by
AssociateConnection.

While here, also add missing mutex locking protecting timeConnected and na.
Copy link
Member

@davecgh davecgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any issue with this being made harder to misuse, particularly because it's exported code, but I don't see any issue with how it is used currently in dcrd.

It is only ever initialized via inboundPeerConnected and outboundPeerConnected.

For inboundPeerConnected, it's given a net.Conn from the listener and it then calls peer.NewInboundPeer which only creates the struct, followed by the call to sp.AssociateConnection which sets the connected atomic and then assigns the conn.

Similarly, for outboundPeerConnected, it's given a *connmgr.ConnReq and net.Conn from the connection manager and it then calls peer.NewOutboundPeer which, similar to peer.NewInboundPeer, only creates the struct and does some other misc setup not involving the connection, followed by the call to sp.AssociateConnection which, the same as before, sets the connected atomic and then assigns the conn.

You are correct that there could theoretically be a race between setting the atomic and then assigning the conn if there were multiple goroutines racing on them, but no other goroutines aside from the one launched inside AssociateConnection itself have access to the peer until after AssocateConnection returns in either case, so there would be no race.

It is only after the initial fields are set in AssociateConnection that it launches the internal peer.start() in a goroutine and returns at which point the newly created and initialized peer is then further used in another goroutine running sp.Run() and finally returned to other code for use.

It's for the same reason the stats mutex is not really needed for the timeConnected and na fields. No other goroutines are racing on them at the point they are initialized.

@davecgh davecgh added this to the 2.1.0 milestone Mar 4, 2025
@jrick
Copy link
Member Author

jrick commented Mar 4, 2025

That's right, it's because these are exported methods. I didn't see a race with how dcrd is using them.

Speaking of, I found it odd that since both inboundPeerConnected and outboundPeerConnected already have the connection available, why wouldn't peer.New{Inbound,Outbound}Peer require passing this to the constructor? It's only test code (banmanager) that ever creates (outbound) peers without an underlying connection.

@jrick
Copy link
Member Author

jrick commented Mar 4, 2025

And my initial concern that got me looking into this code is that peer.Disconnect, which is allowed to be called multiple times, would only actually close the underlying connection if it had already been associated. But if this order was called:

  1. peer.Disconnect()
  2. peer.AssociateConnection(conn)
  3. peer.Disconnect()

then neither the AssociateConnection call, nor the final disconnect would close the underlying connection.

@davecgh
Copy link
Member

davecgh commented Mar 4, 2025

Speaking of, I found it odd that since both inboundPeerConnected and outboundPeerConnected already have the connection available, why wouldn't peer.New{Inbound,Outbound}Peer require passing this to the constructor? It's only test code (banmanager) that ever creates (outbound) peers without an underlying connection.

Aside from requiring a new major module version since changing it would be a breaking API change, it's mostly just an artifact of much older code.

Originally, all of this code lived in the main server code as opposed to a separate package (along with all of the connection management and listening code too) and so it wasn't really written with the goal of being a separate package. The person who split it out pretty closely followed how things were already implemented as opposed to a more clean room package design, so there are certainly various aspects that could be improved.

Back at that time, the initial outbound connection was made and assigned by NewOutboundPeer itself and the version negotiation was handled in the main read and write code paths as opposed to the current cleaner separated method that was introduced by commit f3d759d (which itself could be further improved).

On the other hand, NewInboundPeer required a connection to be associated since the peer package doesn't handle listening and, due to a combination of factors that no longer apply (for example, the aforementioned version negotiation happening in the main read and write code paths), there were other things at the time that had to be done by the caller in between the creation of the peer instance and associating the connection with it. Hence, AssociateConnection was born.

Rather obviously, having NewOutboundPeer attempt to establish connections didn't play nicely with overall connection management, so that logic was pulled out and since AssociateConnection already existed, it was used instead of changing the API.

@davecgh
Copy link
Member

davecgh commented Mar 4, 2025

And my initial concern that got me looking into this code is that peer.Disconnect, which is allowed to be called multiple times, would only actually close the underlying connection if it had already been associated. But if this order was called:
...
then neither the AssociateConnection call, nor the final disconnect would close the underlying connection.

Right. That doesn't actually happen with dcrd's usage, but since it's exported code, that sequence of events is indeed possible if a caller were to try to disconnect in between calling the constructor and associating the connection for some reason.

This PR accomplishes making it more robust against misuse without having to change the API and require a new major module version.

@davecgh davecgh merged commit 77c9cee into decred:master Mar 4, 2025
2 checks passed
@jrick jrick deleted the conn_mutex branch March 4, 2025 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants