Contributor

@lexnv commented Nov 28, 2025

There's a possible race condition between peer connectivity and collation advertisement:

  • The advertisement was generated
  • The peer disconnected before receiving the advertisement

As a result, when the peer reconnects, the previous collation (C0) is not sent.
This happens when the collator has produced another collation (C1).
From the logs, it looks like collation C1 is advertised, but C0 is skipped.

  • T0: peer disconnects without receiving C0
  • T1: peer reconnects
  • T2: collator advertises C1, but not C0

This PR aims to resubmit collations on PeerConnected events to mitigate these cases.
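To make the timeline concrete, here is a minimal toy model of the race, not the actual collator-protocol code. All names (`ToyCollator`, `advertise`, `produce`, `on_peer_connected`, `replay_race`) are hypothetical, and the model assumes the "advertised" marker is set per peer as soon as the message is queued, regardless of whether it was delivered.

```rust
use std::collections::HashSet;

// Hypothetical stand-ins; the real subsystem uses richer types.
type PeerId = u32;
type Collation = &'static str;

#[derive(Default)]
struct ToyCollator {
    /// Collations still eligible for advertisement, oldest first.
    collations: Vec<Collation>,
    /// (peer, collation) pairs the collator believes it has advertised.
    advertised: HashSet<(PeerId, Collation)>,
    /// Peers whose connection is actually alive (ground truth of the toy network).
    online: HashSet<PeerId>,
    /// Advertisements that really reached a peer; used only to observe the outcome.
    delivered: Vec<(PeerId, Collation)>,
}

impl ToyCollator {
    /// Advertise `c` to `peer` at most once. The marker is set when the message
    /// is queued, so a message lost to a concurrent disconnect still counts.
    fn advertise(&mut self, peer: PeerId, c: Collation) {
        if !self.advertised.insert((peer, c)) {
            return; // already marked as advertised, skip
        }
        if self.online.contains(&peer) {
            self.delivered.push((peer, c));
        }
    }

    /// A new collation is produced and advertised to `peer`.
    fn produce(&mut self, peer: PeerId, c: Collation) {
        self.collations.push(c);
        self.advertise(peer, c);
    }

    /// Peer (re)connects: try to advertise everything we still hold.
    fn on_peer_connected(&mut self, peer: PeerId) {
        self.online.insert(peer);
        for c in self.collations.clone() {
            self.advertise(peer, c);
        }
    }
}

/// Replays T0..T2 from the description and returns what the peer received.
fn replay_race() -> Vec<(PeerId, Collation)> {
    let mut col = ToyCollator::default();
    let peer = 1;
    // T0: the connection is already dead when C0 is queued, so C0 is lost.
    col.produce(peer, "C0");
    // T1: the peer reconnects; C0 is skipped because its marker is already set.
    col.on_peer_connected(peer);
    // T2: C1 is produced and reaches the peer.
    col.produce(peer, "C1");
    col.delivered
}
```

In this sketch `replay_race()` ends with only C1 delivered, which matches the behaviour seen in the logs. It also shows that resubmitting on connect only helps if the per-peer marker for the lost advertisement is cleared or bypassed.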

Closes #10463

@lexnv self-assigned this Nov 28, 2025
Member

@eskimor left a comment

Looks correct. Can we have some test demonstrating the fixed race condition?

//
// The `advertise_collation` ensures we are not readvertising the same collation
// multiple times.
if let Some(para_id) = state.collating_on {
Member

This should be handled by

?

That is, when the peer announces its view, it should be informed about the collations.

Contributor

@sandreim commented Dec 1, 2025

I don't think this will work: we already set the collation as advertised in here, so it will not be advertised again when the peer reconnects.

Maybe we'd want to reset this bit to 0 when the validator disconnects if the status is not Requested. But, because it is a race, it might be that the validator has already seen the advertisement. We just don't know from the collator side. In that case, we'd have to check if the collator is punished in any way (for advertising twice).
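A rough sketch of that idea, under stated assumptions: the names below (`AdStatus`, `AdTracker`, `should_advertise`, `on_request`, `on_peer_disconnected`) are hypothetical and do not mirror the real subsystem, which may track this state differently. The tracker forgets an advertisement on disconnect unless the peer has already requested the collation.

```rust
use std::collections::HashMap;

// Hypothetical peer identifier for the sketch.
type PeerId = u32;

/// Hypothetical per-peer advertisement status for a single collation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AdStatus {
    /// Advertisement queued for the peer; delivery is unknown.
    Advertised,
    /// The peer requested the collation, so it certainly saw the advertisement.
    Requested,
}

#[derive(Default)]
struct AdTracker {
    status: HashMap<PeerId, AdStatus>,
}

impl AdTracker {
    /// Returns true if an advertisement should be sent to `peer` now,
    /// and records it as advertised in that case.
    fn should_advertise(&mut self, peer: PeerId) -> bool {
        if self.status.contains_key(&peer) {
            return false;
        }
        self.status.insert(peer, AdStatus::Advertised);
        true
    }

    /// The peer requested the collation.
    fn on_request(&mut self, peer: PeerId) {
        self.status.insert(peer, AdStatus::Requested);
    }

    /// On disconnect, reset the marker unless the peer has already requested.
    fn on_peer_disconnected(&mut self, peer: PeerId) {
        if self.status.get(&peer) != Some(&AdStatus::Requested) {
            self.status.remove(&peer);
        }
    }
}
```

With this shape, a peer that dropped before requesting is advertised to again after it reconnects, while a peer in the Requested state is left alone. The sketch does not resolve the open question above: a validator that did see the first advertisement would receive it twice.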

Contributor

Also, I think this should be an exceptional case, likely a side effect of some other underlying networking issue. Why was the validator disconnecting in the first place?

Contributor Author

Indeed, this is a race condition I've only encountered once.

It might happen due to network congestion or the following:

This debugging rabbit hole might improve the stability of litep2p even more 🙏 If the previous issue turns out to be correct, we are terminating the connections on fragmented socket reads due to a tiny offset mismatch in the poll_next implementation.

Development

Successfully merging this pull request may close these issues.

collator-protocol: Readvertise collations after peer disconnects

5 participants