This repository has been archived by the owner on Mar 24, 2023. It is now read-only.

Add fork choice rule with DAS analysis #179

Merged 8 commits into master from adlerjohn/fork_choice_das on Jul 7, 2021

Conversation

@adlerjohn (Member) commented Jun 14, 2021

Add a fork choice rule explanation (the basic rule is trivial since Tendermint is fork-free). Also add a rationale doc explaining how the fork choice rule interacts with DAS, which is not monotonic.

Ref celestiaorg/celestia-app#735, celestiaorg/celestia-core#321

@adlerjohn self-assigned this Jun 14, 2021
@liamsi requested a review from musalbas on Jun 14, 2021 13:49

Light nodes follow consensus (i.e. validator set changes and commits) and perform DAS. If a block is seen as unavailable but has a commit, DAS is performed on the block continuously until either DAS passes, or the weak subjectivity window is exceeded at which point the node halts.

Full nodes fully download and execute blocks. If a block is seen as unavailable but has a commit, full downloading is re-attempted continuously until either it succeeds, or the weak subjectivity window is exceeded at which point the node halts.
Member Author:

@liamsi this is one point that might warrant a closer look. I suggest here just doing re-downloading/DAS continuously because it's simpler. But you could also do it by triggering a downloading/DAS attempt when you see a new block with a commit. I figured the latter would be more complicated for full nodes since you'd have to fall back to light client mode, but maybe I'm wrong.
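For concreteness, here is a rough Go sketch of the "retry continuously" option discussed above. This is not celestia-core code; `sampleBlock`, `awaitAvailability`, and the retry interval are made-up placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errUnavailable = errors.New("block data unavailable")

// sampleBlock stands in for one DAS round (light node) or one full-download
// attempt (full node). It always fails here so the halting path is exercised.
func sampleBlock(height int64) error {
	return errUnavailable
}

// awaitAvailability keeps re-attempting DAS / full download until the data is
// available or the weak subjectivity window has elapsed, at which point the
// node halts (returns an error).
func awaitAvailability(height int64, commitTime time.Time, window, retryEvery time.Duration) error {
	for {
		if err := sampleBlock(height); err == nil {
			return nil // data became available; accept the block as the new head
		}
		if time.Since(commitTime) > window {
			return fmt.Errorf("halting: block %d still unavailable after the weak subjectivity window", height)
		}
		time.Sleep(retryEvery)
	}
}

func main() {
	// Tiny durations so the sketch terminates quickly; a real node would use
	// the unbonding window (weeks) and a much longer retry interval.
	err := awaitAvailability(42, time.Now(), 200*time.Millisecond, 50*time.Millisecond)
	fmt.Println(err)
}
```

Whether the loop is driven by a timer (as here) or triggered by seeing a new block with a commit is exactly the design choice being discussed in this thread.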

Member:

If a block is seen as unavailable but has a commit

Isn't that basically contrary to the assumption that at least one validator (e.g. the block proposer) has the data and has gossiped it to the network of validators (and full nodes)? How else would there be a commit?

Member:

If that is a purely theoretical case, then this does not matter much.

@adlerjohn (Member Author) Jun 15, 2021:

No, this case isn't theoretical; it could happen if 2/3 of stake is malicious. Even in that case, we want full nodes and light nodes to agree on the same head of the chain. Note that there won't be a fork, since this scenario involves no equivocation, but the two nodes could still be on different heads without continuous re-downloading/DAS attempts: different heights on the same fork.

@liamsi (Member) Jun 22, 2021:

For this scenario to occur you'd need:

  • 2/3 of the stake to be malicious
  • 2/3 of the stake to have signed and finalized a block (with the data during that consensus round broadcast to all peers running the consensus reactor/protocol, even to those that are just passively following along and not signing)
  • validators and all other nodes that saw the data during or after consensus to withhold that data (not only the 2/3 of the stake, but everyone else too)

It seems to me that this is an extreme edge case (and is basically equivalent to the data not being available).

IMO, in practice, if a node does not get the data within a few minutes (or at most hours), it should not simply keep attempting to download the block. Most likely this would require human intervention instead.

@adlerjohn requested a review from liamsi on Jun 14, 2021 14:00

Light nodes follow consensus (i.e. validator set changes and commits) and perform DAS. If a block is seen as unavailable but has a commit, DAS is performed on the block continuously until either DAS passes, or the weak subjectivity window is exceeded at which point the node halts.

Full nodes fully download and execute blocks. If a block is seen as unavailable but has a commit, full downloading is re-attempted continuously until either it succeeds, or the weak subjectivity window is exceeded at which point the node halts.
@liamsi (Member) Jun 14, 2021:

Full nodes fully download and execute blocks.

Does the time of this matter at all? Do they download and execute blocks as they are produced or could that also be at a much later point in time?

Member Author:

Hmm. This paragraph only really covers what happens in the consensus reactor, or more specifically what happens to a node that keeps up with consensus within the weak subjectivity window. IBD isn't covered, but that's because there's not much to cover: just get a trusted checkpoint and that's your fork choice rule. Uninteresting. If the node wants to do IBD by first downloading all blocks, then executing them, that's an implementation detail.

Member:

What does IBD stand for?

just get a trusted checkpoint and that's your fork choice rule. Uninteresting.

IIRC, trusted checkpoints are only used in combination with state sync (state sync requires a checkpoint); full nodes that sync from genesis and replay all blocks do not require a checkpoint.

Member Author:

What does IBD stand for?

Initial Block Download, i.e. what happens in the blockchain reactor.

full nodes that sync from genesis and replay all blocks do not require a checkpoint.

This is actually a problem. The weak subjectivity trusted checkpoint informs the fork choice rule when doing any sync (full, state, headers-only) after being offline for longer than the unbonding window. It does not affect block validity. We absolutely need to have a trusted checkpoint even when doing a full sync.
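To illustrate that point, here is a minimal sketch using hypothetical types (not Tendermint's): the trusted checkpoint only constrains which chain a syncing node follows; it does not make headers invalid.

```go
package main

import (
	"bytes"
	"fmt"
)

// Hypothetical placeholder types, not Tendermint's.
type Header struct {
	Height int64
	Hash   []byte
}

type Checkpoint struct {
	Height int64
	Hash   []byte
}

// checkAgainstCheckpoint rejects a synced chain that disagrees with the
// operator-provided trusted checkpoint. The headers may be perfectly valid on
// their own; the checkpoint only picks which long-range fork to follow.
func checkAgainstCheckpoint(headers []Header, cp Checkpoint) error {
	for _, h := range headers {
		if h.Height == cp.Height {
			if bytes.Equal(h.Hash, cp.Hash) {
				return nil
			}
			return fmt.Errorf("header at height %d conflicts with the trusted checkpoint", cp.Height)
		}
	}
	return fmt.Errorf("synced chain does not reach checkpoint height %d", cp.Height)
}

func main() {
	headers := []Header{{Height: 1, Hash: []byte{0x01}}, {Height: 2, Hash: []byte{0x02}}}
	cp := Checkpoint{Height: 2, Hash: []byte{0x02}}
	fmt.Println(checkAgainstCheckpoint(headers, cp)) // <nil>: the chain matches the checkpoint
}
```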

Member:

This is actually a problem.

I think in the case of IBD, the node will switch to the consensus reactor once it has caught up, and any full node would at least quickly detect that it was fed a fork. In the case that 2/3 of validators are Byzantine, the network has to resolve this outside of the protocol anyway.

Member Author:

that it was fed a fork

Over a long enough time scale, the probability of there being a long-range fork in the distant past approaches 100%. But as was previously discussed in a call, it doesn't actually matter if you feed in a trusted checkpoint before starting in expectation of at least one fork or after halting on fork detection. It's an implementation detail.

@liamsi (Member) left a comment:

The text makes sense as it stands. Could you revisit this yourself from the point of view of implementing a system that matches it, just to double-check that it actually answers all outstanding questions in this context? I'll think about this myself tomorrow.

I'd also like to hear whether this fully clarifies @musalbas' questions here: celestiaorg/celestia-core#321


A requirement is that full nodes and light nodes agree on the same head of the chain automatically in this case, i.e. without human intervention.

Light nodes follow consensus (i.e. validator set changes and commits) and perform DAS. If a block is seen as unavailable but has a commit, DAS is performed on the block continuously until either DAS passes, or the weak subjectivity window is exceeded at which point the node halts.

This comment was marked as resolved.

Member:

I assume performing DAS continuously on a block is ok, because if it's the case that there was a fork, the chain should halt anyway.

Is this "halting" implementing in Tendermint in anyway?

Member:

What happens in the Tendermint implementation currently in the case of 2/3 of the consensus equivocating?

Member:

N.B. I don't think Tendermint has a weak subjectivity window.

Member:

What happens in the Tendermint implementation currently in the case of 2/3 of the consensus equivocating?

I'd need to double-check, but it is certainly something that would require social coordination to resolve. IIRC, this would lead to a) full nodes gossiping double-sign evidence to notify each other (the malicious 2/3 can obviously still decline to include this evidence in a block), and b) a consensus failure in case a Tendermint (full) node sees both forks (it panics and stops).

N.B. I don't think Tendermint has a weak subjectivity window.

N.B. == nota bene? Tendermint full nodes can either sync from genesis until they have caught up with consensus (after that, the above holds in case of forks) or use state sync. In the latter case they are indeed initialized subjectively, like any Tendermint light client (with a hash within that weak subjectivity window).

@liamsi (Member) left a comment:

I've re-read this and it is not so clear to me what the implications for the implementation are.

The first scenario is how tendermint works right now.

For the second scenario, I'm not sure retrying DAS or retrying to download the block for a few weeks is what we'd actually implement. Theoretically that seems fine and maybe it could be the default, though I think node operators should still be able to set a timeout instead.
IMO, if I were a node operator and my node was stuck for longer than a few hours, I would act and manually fix that (if possible). Is there a realistic scenario where we want the node to time out but continue retrying for the unbonding window? And even without human interaction, retrying for a few weeks in the worst case seems extreme.

What happens if a more recent block has consensus (and is available) and the chain moves on with finalized blocks on top of that one, but a past block is still not available (either not downloadable or not DAS-able)? As far as I understand it, this means that the client node does not make any progress, right? Because from its point of view, it will be stuck at the unavailable height.

@musalbas (Member):

Is there a realistic scenario where we want the node to timeout but continue retrying for the unbonding window?

IMO yes, because data being unavailable might not be the fault of the block producer, but the node's internet connection.

@adlerjohn (Member Author):

Theoretically that seems fine and maybe that could be the default. Though I think node operators should still be able to set a timeout instead.

Right, users can always set a shorter timeout before their node halts, but then they'd potentially have to intervene manually more often, even if a majority of the network is honest and they merely have a poor connection:

Note that a node cannot distinguish a dishonest majority in this scenario from an honest majority combined with a transient network failure on its own end.

And even without human interaction, retrying for a few weeks in the worst case seems extreme.

It guarantees (formal proof pending) that all honest nodes that follow this protocol will agree on a head of the chain without human intervention, assuming a synchrony assumption equal to the unbonding window. As above, it's certainly not a bad idea to allow users to be notified after a much shorter time. But that's an implementation detail.
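A hedged sketch of what that implementation detail could look like, using made-up names (`awaitWithOperatorTimeout`, `tryFetch`): the node keeps retrying up to the unbonding window for the protocol guarantee, but warns the operator after a much shorter, configurable timeout.

```go
package main

import (
	"fmt"
	"time"
)

// tryFetch stands in for one DAS round or full-download attempt; it always
// fails here so both the notification and the halting path are exercised.
func tryFetch(height int64) bool {
	return false
}

// awaitWithOperatorTimeout keeps retrying up to the unbonding window (the
// safety guarantee), but notifies the operator after a much shorter,
// configurable timeout so they can check their own connectivity.
func awaitWithOperatorTimeout(height int64, start time.Time,
	operatorTimeout, unbondingWindow, retryEvery time.Duration) error {

	notified := false
	for {
		if tryFetch(height) {
			return nil
		}
		elapsed := time.Since(start)
		if !notified && elapsed > operatorTimeout {
			fmt.Printf("block %d unavailable for %s, still retrying\n", height, elapsed.Round(time.Second))
			notified = true
		}
		if elapsed > unbondingWindow {
			return fmt.Errorf("halting: block %d unavailable past the unbonding window", height)
		}
		time.Sleep(retryEvery)
	}
}

func main() {
	// Tiny durations so the sketch terminates; real values would be hours for
	// the operator timeout and weeks for the unbonding window.
	err := awaitWithOperatorTimeout(42, time.Now(),
		100*time.Millisecond, 300*time.Millisecond, 50*time.Millisecond)
	fmt.Println(err)
}
```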

@liamsi (Member) commented Jun 25, 2021:

But that's an implementation detail.

Can we still mention that implementation detail in the specs? Otherwise, node operators might consult the specification to fully understand which timeout to set (as per the implementation), only to see that they should instead be retrying for the unbonding period. Node operators might instead prefer to (manually) check their node's internet connection to rule out the following:

IMO yes, because data being unavailable might not be the fault of the block producer, but the node's internet connection.

@adlerjohn (Member Author):

Can we still mention that implementation detail in the specs?

Absolutely, both the intuition and implementation detail for timeouts. Will add.

@adlerjohn requested a review from liamsi on Jun 27, 2021 15:28
@liamsi (Member) left a comment:

LGTM!

Note: It's not fully clear to me whether we need to change anything in the face of long-range attacks for full nodes or not. It seems to me that as long as the fork would be detected, we are good.

Should we move to two approving reviews for actual changes like the ones included here? (cc @musalbas)

@adlerjohn (Member Author):

It's not fully clear to me if we need to change anything in the face of long-range attacks for full nodes or not. It seems to me like as long as the fork would be detected we are good.

This PR shouldn't necessitate any changes to how Tendermint currently handles long-range forks. It only prescribes actions to be taken within the unbonding window. Improving how Tendermint handles long-range forks is orthogonal and can (and should) be revisited in a separate issue.

@musalbas (Member) commented Jul 1, 2021

Just to clarify though: the spec mentions a weak subjectivity window, but no such thing is implemented in Tendermint, i.e. the weak subjectivity window is indefinite.

What are the implications of keeping it like this? If there is equivocation, the chain simply halts?

If we keep the weak subjectivity window indefinite, this needs to be documented, as we would technically be diverging from the spec.

@adlerjohn (Member Author):

Filed #186 to track investigating how Tendermint's weak subjectivity handling might differ from how we might want it to be. It's orthogonal to the core of this PR.

@adlerjohn merged commit ec98170 into master on Jul 7, 2021
@adlerjohn deleted the adlerjohn/fork_choice_das branch on Jul 7, 2021 14:45