This repository has been archived by the owner on Mar 24, 2023. It is now read-only.

Add fork choice rule with DAS analysis #179

Merged 8 commits into master from adlerjohn/fork_choice_das on Jul 7, 2021

Conversation

@adlerjohn (Member) commented Jun 14, 2021

Add a fork choice rule explanation (the basic rule is trivial since Tendermint is fork-free). Also add a rationale doc explaining how the fork choice rule interacts with DAS, which is not monotonic.

Ref celestiaorg/celestia-app#735, celestiaorg/celestia-core#321

@adlerjohn self-assigned this Jun 14, 2021
@liamsi requested a review from musalbas on Jun 14, 2021 13:49

Light nodes follow consensus (i.e. validator set changes and commits) and perform DAS. If a block is seen as unavailable but has a commit, DAS is performed on the block continuously until either DAS passes, or the weak subjectivity window is exceeded at which point the node halts.

Full nodes fully download and execute blocks. If a block is seen as unavailable but has a commit, full downloading is re-attempted continuously until either it succeeds, or the weak subjectivity window is exceeded at which point the node halts.
Member Author:

@liamsi this is one point that might warrant a closer look. I suggest here just doing re-downloading/DAS continuously because it's simpler. But you could also do it by triggering a downloading/DAS attempt when you see a new block with a commit. I figured the latter would be more complicated for full nodes since you'd have to fall back to light client mode, but maybe I'm wrong.
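For concreteness, here is a rough Go sketch of the "retry continuously" option discussed above. This is not celestia-core code; `sampleBlock`, `awaitAvailability`, and the retry interval are made-up placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errUnavailable = errors.New("block data unavailable")

// sampleBlock stands in for one DAS round (light node) or one full-download
// attempt (full node). It always fails here so the halting path is exercised.
func sampleBlock(height int64) error {
	return errUnavailable
}

// awaitAvailability keeps re-attempting DAS / full download until the data is
// available or the weak subjectivity window has elapsed, at which point the
// node halts (returns an error).
func awaitAvailability(height int64, commitTime time.Time, window, retryEvery time.Duration) error {
	for {
		if err := sampleBlock(height); err == nil {
			return nil // data became available; accept the block as the new head
		}
		if time.Since(commitTime) > window {
			return fmt.Errorf("halting: block %d still unavailable after the weak subjectivity window", height)
		}
		time.Sleep(retryEvery)
	}
}

func main() {
	// Tiny durations so the sketch terminates quickly; a real node would use
	// the unbonding window (weeks) and a much longer retry interval.
	err := awaitAvailability(42, time.Now(), 200*time.Millisecond, 50*time.Millisecond)
	fmt.Println(err)
}
```

Whether the loop is driven by a timer (as here) or triggered by seeing a new block with a commit is exactly the design choice being discussed in this thread.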

Member:

If a block is seen as unavailable but has a commit

Isn't that basically contrary to the assumption that at least one validator (e.g. the block proposer) has the data and has gossiped it to the network of validators (and full nodes)? How else would there be a commit?

Member:

If that is a purely theoretical case, then this does not matter much.

@adlerjohn (Member Author) Jun 15, 2021:

No, this case isn't theoretical; it could happen if 2/3 of stake is malicious. Even in that case, we want full nodes and light nodes to agree on the same head of the chain. Note that there won't be a fork, since this scenario involves no equivocation, but the two nodes could still be on different heads without continuous re-downloading/DAS attempts: different heights on the same fork.

@liamsi (Member) Jun 22, 2021:

For this scenario to occur you'd need:

  • 2/3 of the stake to be malicious
  • 2/3 of the stake to have signed and finalized a block (with the data during that consensus round broadcast to all peers running the consensus reactor/protocol, even to those that are just passively following along and not signing)
  • validators and all other nodes that saw the data during or after consensus to withhold that data (not only the 2/3 of the stake, but everyone else too)

It seems to me that this is an extreme edge case (and is basically equivalent to the data not being available).

IMO, in practice, if a node does not get the data within a few minutes (or at most hours), it should not simply keep attempting to download the block. Most likely this would require human intervention instead.

@adlerjohn requested a review from liamsi on Jun 14, 2021 14:00

Light nodes follow consensus (i.e. validator set changes and commits) and perform DAS. If a block is seen as unavailable but has a commit, DAS is performed on the block continuously until either DAS passes, or the weak subjectivity window is exceeded at which point the node halts.

Full nodes fully download and execute blocks. If a block is seen as unavailable but has a commit, full downloading is re-attempted continuously until either it succeeds, or the weak subjectivity window is exceeded at which point the node halts.
@liamsi (Member) Jun 14, 2021:

Full nodes fully download and execute blocks.

Does the time of this matter at all? Do they download and execute blocks as they are produced or could that also be at a much later point in time?

Member Author:

Hmm. This paragraph only really covers what happens in the consensus reactor, or more specifically what happens to a node that keeps up with consensus within the weak subjectivity window. IBD isn't covered, but that's because there's not much to cover: just get a trusted checkpoint and that's your fork choice rule. Uninteresting. If the node wants to do IBD by first downloading all blocks, then executing them, that's an implementation detail.

Member:

What does IBD stand for?

just get a trusted checkpoint and that's your fork choice rule. Uninteresting.

IIRC, trusted checkpoints are only used in combination with state sync (state sync requires a checkpoint); full nodes that sync from genesis and replay all blocks do not require a checkpoint.

Member Author:

What does IBD stand for?

Initial Block Download, i.e. what happens in the blockchain reactor.

full nodes that sync from genesis and replay all blocks do not require a checkpoint.

This is actually a problem. The weak subjectivity trusted checkpoint informs the fork choice rule when doing any sync (full, state, headers-only) after being offline for longer than the unbonding window. It does not affect block validity. We absolutely need to have a trusted checkpoint even when doing a full sync.
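To illustrate that point, here is a minimal sketch using hypothetical types (not Tendermint's): the trusted checkpoint only constrains which chain a syncing node follows; it does not make headers invalid.

```go
package main

import (
	"bytes"
	"fmt"
)

// Hypothetical placeholder types, not Tendermint's.
type Header struct {
	Height int64
	Hash   []byte
}

type Checkpoint struct {
	Height int64
	Hash   []byte
}

// checkAgainstCheckpoint rejects a synced chain that disagrees with the
// operator-provided trusted checkpoint. The headers may be perfectly valid on
// their own; the checkpoint only picks which long-range fork to follow.
func checkAgainstCheckpoint(headers []Header, cp Checkpoint) error {
	for _, h := range headers {
		if h.Height == cp.Height {
			if bytes.Equal(h.Hash, cp.Hash) {
				return nil
			}
			return fmt.Errorf("header at height %d conflicts with the trusted checkpoint", cp.Height)
		}
	}
	return fmt.Errorf("synced chain does not reach checkpoint height %d", cp.Height)
}

func main() {
	headers := []Header{{Height: 1, Hash: []byte{0x01}}, {Height: 2, Hash: []byte{0x02}}}
	cp := Checkpoint{Height: 2, Hash: []byte{0x02}}
	fmt.Println(checkAgainstCheckpoint(headers, cp)) // <nil>: the chain matches the checkpoint
}
```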

Member:

This is actually a problem.

I think in the case of IBD, the node will switch to the consensus reactor once it has caught up, and any full node would at least quickly detect that it was fed a fork. In the case that 2/3 of validators are Byzantine, the network has to resolve this outside of the protocol anyway.

Member Author:

that it was fed a fork

Over a long enough time scale, the probability of there being a long-range fork in the distant past approaches 100%. But as was previously discussed in a call, it doesn't actually matter if you feed in a trusted checkpoint before starting in expectation of at least one fork or after halting on fork detection. It's an implementation detail.

@liamsi (Member) left a comment:

The text makes sense as it stands. Could you revisit this yourself from the point of view of implementing a system that matches it, just to double-check that it actually answers all outstanding questions in this context? I'll think about this myself tomorrow.

I'd also like to hear whether this fully clarifies @musalbas' questions here: celestiaorg/celestia-core#321


A requirement is that full nodes and light nodes agree on the same head of the chain automatically in this case, i.e. without human intervention.

Light nodes follow consensus (i.e. validator set changes and commits) and perform DAS. If a block is seen as unavailable but has a commit, DAS is performed on the block continuously until either DAS passes, or the weak subjectivity window is exceeded at which point the node halts.

This comment was marked as resolved.

Member:

I assume performing DAS continuously on a block is ok, because if it's the case that there was a fork, the chain should halt anyway.

Is this "halting" implementing in Tendermint in anyway?

Member:

What happens in the Tendermint implementation currently in the case of 2/3 of the consensus equivocating?

Member:

N.B. I don't think Tendermint has a weak subjectivity window.

Member:

What happens in the Tendermint implementation currently in the case of 2/3 of the consensus equivocating?

I'd need to double-check, but it is certainly something that would require social coordination to resolve. IIRC, this would lead to a) full nodes gossiping double-sign evidence to notify each other (the malicious 2/3 can obviously still decline to include this evidence in a block), and b) a consensus failure in case a Tendermint (full) node sees both forks (it panics and stops).

N.B. I don't think Tendermint has a weak subjectivity window.

N.B. == nota bene? Tendermint full nodes can either sync from genesis until they have caught up with consensus (after that, the above holds in case of forks) or use state sync. In the latter case they are indeed initialized subjectively, like any Tendermint light client (with a hash within that weak subjectivity window).

@liamsi (Member) left a comment:

I've re-read this and it is not so clear to me what the implications for the implementation are.

The first scenario is how tendermint works right now.

For the second scenario, I'm not sure retrying DAS or retrying to download the block for a few weeks is what we'd actually implement. Theoretically that seems fine and maybe it could be the default, though I think node operators should still be able to set a timeout instead.
IMO, if I were a node operator and my node was stuck for longer than a few hours, I would act and manually fix that (if possible). Is there a realistic scenario where we want the node to time out but continue retrying for the unbonding window? And even without human interaction, retrying for a few weeks in the worst case seems extreme.

What happens if a more recent block has consensus (and is available) and the chain moves on with finalized blocks on top of that one, but a past block is still not available (either not downloadable or not DAS-able)? As far as I understand it, this means that the client node does not make any progress, right? Because from its point of view, it will be stuck at the unavailable height.

@musalbas (Member):

Is there a realistic scenario where we want the node to timeout but continue retrying for the unbonding window?

IMO yes, because data being unavailable might not be the fault of the block producer, but the node's internet connection.

@adlerjohn (Member Author):

Theoretically that seems fine and maybe that could be the default. Though I think node operators should still be able to set a timeout instead.

Right, users can always set a shorter timeout before their node halts, but then they'd potentially have to intervene manually more often, even if a majority of the network is honest and they merely have a poor connection:

Note that a node cannot distinguish a dishonest majority in this scenario from an honest majority combined with a transient network failure on its own end.

And even without human interaction, retrying for a few weeks in the worst case seems extreme.

It guarantees (formal proof pending) that all honest nodes that follow this protocol will agree on a head of the chain without human intervention, assuming a synchrony assumption equal to the unbonding window. As above, it's certainly not a bad idea to allow users to be notified after a much shorter time. But that's an implementation detail.
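A hedged sketch of what that implementation detail could look like, using made-up names (`awaitWithOperatorTimeout`, `tryFetch`): the node keeps retrying up to the unbonding window for the protocol guarantee, but warns the operator after a much shorter, configurable timeout.

```go
package main

import (
	"fmt"
	"time"
)

// tryFetch stands in for one DAS round or full-download attempt; it always
// fails here so both the notification and the halting path are exercised.
func tryFetch(height int64) bool {
	return false
}

// awaitWithOperatorTimeout keeps retrying up to the unbonding window (the
// safety guarantee), but notifies the operator after a much shorter,
// configurable timeout so they can check their own connectivity.
func awaitWithOperatorTimeout(height int64, start time.Time,
	operatorTimeout, unbondingWindow, retryEvery time.Duration) error {

	notified := false
	for {
		if tryFetch(height) {
			return nil
		}
		elapsed := time.Since(start)
		if !notified && elapsed > operatorTimeout {
			fmt.Printf("block %d unavailable for %s, still retrying\n", height, elapsed.Round(time.Second))
			notified = true
		}
		if elapsed > unbondingWindow {
			return fmt.Errorf("halting: block %d unavailable past the unbonding window", height)
		}
		time.Sleep(retryEvery)
	}
}

func main() {
	// Tiny durations so the sketch terminates; real values would be hours for
	// the operator timeout and weeks for the unbonding window.
	err := awaitWithOperatorTimeout(42, time.Now(),
		100*time.Millisecond, 300*time.Millisecond, 50*time.Millisecond)
	fmt.Println(err)
}
```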

@liamsi (Member) commented Jun 25, 2021:

But that's an implementation detail.

Can we still mention that implementation detail in the specs? Otherwise, node operators might consult the specification to fully understand which timeout to set (as per the implementation), only to see that they should instead be retrying for the unbonding period. Node operators might instead prefer to (manually) check their node's internet connection to rule out the following:

IMO yes, because data being unavailable might not be the fault of the block producer, but the node's internet connection.

@adlerjohn (Member Author):

Can we still mention that implementation detail in the specs?

Absolutely, both the intuition and implementation detail for timeouts. Will add.

@adlerjohn requested a review from liamsi on Jun 27, 2021 15:28
@liamsi (Member) left a comment:

LGTM!

Note: It's not fully clear to me whether we need to change anything in the face of long-range attacks for full nodes or not. It seems to me that as long as the fork would be detected, we are good.

Should we move to two approving reviews for actual changes like the ones included here? (cc @musalbas)

@adlerjohn (Member Author):

It's not fully clear to me if we need to change anything in the face of long-range attacks for full nodes or not. It seems to me like as long as the fork would be detected we are good.

This PR shouldn't necessitate any changes to how Tendermint currently handles long-range forks. It only prescribes actions to be taken within the unbonding window. Improving how Tendermint handles long-range forks is orthogonal and can (and should) be revisited in a separate issue.

@musalbas (Member) commented Jul 1, 2021

Just to clarify though: the spec mentions a weak subjectivity window, but no such thing is implemented in Tendermint, i.e. the weak subjectivity window is indefinite.

What are the implications of keeping it like this? If there is equivocation, the chain simply halts?

If we keep the weak subjectivity window indefinite, this needs to be documented, as we would technically be diverging from the spec.

@adlerjohn (Member Author):

Filed #186 to track investigating how Tendermint's weak subjectivity handling might differ from how we might want it to be. It's orthogonal to the core of this PR.

@adlerjohn merged commit ec98170 into master on Jul 7, 2021
@adlerjohn deleted the adlerjohn/fork_choice_das branch on Jul 7, 2021 14:45