
[BFT] Node ejected before epoch recovery #6632

Open · wants to merge 14 commits into base: feature/efm-recovery
Conversation

durkmurder (Member):

#6331

Context

This PR improves code clarity based on the parent branch and implements a new test that ejects a node before submitting the recovery transaction, leading to a scenario where the DKG committee is a superset of the consensus committee when recovery takes place. To support this behavior, changes were made to the tooling and to the vote-processing logic to account for the inequality of the random beacon and consensus committees.

```go
@@ -55,9 +56,6 @@ func GenerateRecoverEpochTxArgs(log zerolog.Logger,

	internalNodesMap := make(map[flow.Identifier]struct{})
	for _, node := range internalNodes {
		if !currentEpochIdentities.Exists(node.Identity()) {
```
Member:
This was originally put here to communicate an unusual state to operators when running the efm-recover-args command. I agree it should not cause an error, but maybe we can replace this with a WARN log. Then when it is run as part of the CLI tool we will still surface this info to operators.


```go
publicKeyShares := make([]crypto.PublicKey, 0, dkg.Size())
for i := uint(0); i < dkg.Size(); i++ {
	nodeID, err := dkg.NodeID(i)
```
Member:
In both instances where we currently are using DKG.NodeID, we are using it in order to construct a list of participant public key shares. Since the underlying DKG implementation (backed by EpochCommit) already contains this data in the DKGParticipantKeys field, why don't we replace the DKG.NodeID method with DKG.AllKeyShares(), that returns EpochCommit.DKGParticipantKeys?

Then, we can replace lines 63-74 with:

```go
publicKeyShares := dkg.AllKeyShares()
```

Member Author:

Yeah, sounds good, I will add it.


```go
// prepare the staking public keys of participants
stakingKeys := make([]crypto.PublicKey, 0, len(allParticipants))
stakingBeaconKeys := make([]crypto.PublicKey, 0, len(allParticipants))
```
Member:

Suggested change:

```diff
-stakingBeaconKeys := make([]crypto.PublicKey, 0, len(allParticipants))
+beaconKeys := make([]crypto.PublicKey, 0, len(allParticipants))
```

In other parts of the code, we generally use the term "beacon key". Since what we are mainly differentiating it from is the "staking key", I think omitting "staking" from the name makes sense.

Member Author:

I think this is a special case: these are basically beacon keys that take part in staking. There is a separate beaconKeys further down that is used only for the random beacon protocol.

```go
}

stakingSigAggtor, err := signature.NewWeightedSignatureAggregator(allParticipants, stakingKeys, msg, msig.ConsensusVoteTag)
if err != nil {
	return nil, fmt.Errorf("could not create aggregator for staking signatures: %w", err)
}

dkg, err := f.committee.DKG(block.View)
rbSigAggtor, err := signature.NewWeightedSignatureAggregator(allParticipants, stakingBeaconKeys, msg, msig.RandomBeaconTag)
```
Member:

Suggested change:

```diff
-rbSigAggtor, err := signature.NewWeightedSignatureAggregator(allParticipants, stakingBeaconKeys, msg, msig.RandomBeaconTag)
+beaconAggregator, err := signature.NewWeightedSignatureAggregator(allParticipants, stakingBeaconKeys, msg, msig.RandomBeaconTag)
```

Nit: Reuse the "beacon" terminology from above, expand Aggtor so it's pronounceable. (Feel free to ignore if you disagree)

```go
require.Equal(s.T(), events[0].Events[0].Type, eventType)

// 5. Ensure the network transitions into the recovery epoch and finalizes the first view of the recovery epoch.
startViewOfNextEpoch := uint64(txArgs[1].(cadence.UInt64))
```
Member:

Have you tried adding a conditional to wait for endViewOfNextEpoch+1 and validate s.AssertInEpoch(s.Ctx, 2)? (Successfully complete the DKG and epoch transition in the recovery epoch.) It would be nice to include this in one of the test cases if the runtime is manageable.

Member Author:

I haven't tried it but I can add it.

Comment on lines +218 to +220
```go
// wait until the epoch setup phase to force network into EFM
s.AwaitEpochPhase(s.Ctx, 0, flow.EpochPhaseSetup, 10*time.Second, 500*time.Millisecond)
```
Member:

Suggested change (remove these lines):

```diff
-// wait until the epoch setup phase to force network into EFM
-s.AwaitEpochPhase(s.Ctx, 0, flow.EpochPhaseSetup, 10*time.Second, 500*time.Millisecond)
```
In principle, stopping the Collection Node before entering the EpochSetup phase would be more reliable. The Consensus Nodes send off the first DKG message as soon as the EpochSetup phase starts, and can technically send their result submission after the DKG final view, so this could be race-prone.


```go
// assert transition to second epoch did not happen
// if counter is still 0, epoch emergency fallback was triggered as expected
s.AssertInEpoch(s.Ctx, 0)
```
Member:

I think this mechanism for checking EFM dates to before we added EpochPhaseFallback. We could retrieve the snapshot (currently line 248) a bit earlier, and assert that snapshot.EpochPhase() == EpochPhaseFallback, to make the test logic a little more explicit.

@durkmurder (Member Author) commented Nov 15, 2024:

While addressing comments, I found that the DKG is very inconsistent and flaky under this setup; more details here: #6729.

@AlexHentschel @jordanschalm

Base automatically changed from khalil/6164-efm-integration-test-part2 to feature/efm-recovery November 19, 2024 14:37