fix(state-sync): set the sync hash only when it's final #12720

Open · wants to merge 5 commits into base: master
Conversation

marcelo-gonzalez (Contributor):

There's currently a bug that can happen if the sync hash we set doesn't end up on the canonical chain. Here we add a couple more test cases for that, and fix it by considering only finalized blocks when setting the state sync hash, and by changing the snapshot logic to check whether the current head is the prev block of what will be the sync hash block. This means the sync hash is found a little later than before, but on a chain with long epoch lengths it shouldn't be a problem.

…to find the sync hash, and with an epoch length of 6, temporary_account_during_resharding() fails in check_txs() because account0 has not synced the parent shard by the resharding block yet.
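For illustration, a minimal self-contained sketch of the approach described above (hypothetical types and names, not the actual nearcore code): the sync hash candidate is only ever taken from a block's last_final_block, so the recorded hash can't later be orphaned by a fork.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real chain types; this is a sketch of the idea only.
type BlockHash = u64;

struct HeaderInfo {
    hash: BlockHash,
    last_final_block: BlockHash,
    // Per-shard count of new chunks since the start of the epoch.
    new_chunks: Vec<u32>,
}

#[derive(Default)]
struct SyncHashTracker {
    headers: HashMap<BlockHash, HeaderInfo>,
    sync_hash: Option<BlockHash>,
}

impl SyncHashTracker {
    // A block qualifies once every shard has at least 2 new chunks in the epoch.
    fn has_enough_new_chunks(info: &HeaderInfo) -> bool {
        info.new_chunks.iter().all(|n| *n >= 2)
    }

    // Called on every head update. Only the header's last final block is considered,
    // so the sync hash we record stays on the canonical chain.
    fn on_new_head(&mut self, header: &HeaderInfo) {
        if self.sync_hash.is_some() {
            return; // already found for this epoch
        }
        if let Some(final_block) = self.headers.get(&header.last_final_block) {
            if Self::has_enough_new_chunks(final_block) {
                self.sync_hash = Some(final_block.hash);
            }
        }
    }
}
```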
marcelo-gonzalez requested a review from a team as a code owner on January 11, 2025 at 04:25
marcelo-gonzalez (Contributor, Author):

There are a couple more test failures that happen because the epoch lengths are too short: clients haven't caught up by the time a transaction comes in, so they drop it. Will fix.

wacban (Contributor) left a comment:

This approach may be flawed: I think you still partially process the non-final blocks. Please see the last in-line comment for where I think it happens.

Would it work if you keep the logic as is, but instead of calling it for the current block, only call it for the current block's last_final_block? Maybe with some added deduplication, since the same final block may be the last_final_block of multiple future blocks (even without forks).
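A minimal sketch of the deduplication idea (hypothetical names, self-contained, not the PR's code): several consecutive blocks can report the same last_final_block, so each final block should be processed at most once.

```rust
use std::collections::HashSet;

// Hypothetical: track which final blocks were already processed, since many future
// blocks may report the same hash as their last_final_block.
#[derive(Default)]
struct FinalBlockDedup {
    processed: HashSet<[u8; 32]>,
}

impl FinalBlockDedup {
    // Returns true only the first time a given final block hash is seen.
    fn should_process(&mut self, last_final_block: [u8; 32]) -> bool {
        self.processed.insert(last_final_block)
    }
}

fn main() {
    let mut dedup = FinalBlockDedup::default();
    let final_hash = [7u8; 32];
    assert!(dedup.should_process(final_hash)); // first block pointing at this final block
    assert!(!dedup.should_process(final_hash)); // later blocks with the same last_final_block are skipped
}
```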

fn remove_old_blocks<T: ChainStoreAccess>(
/// Helper to turn DBNotFoundErr() or the genesis header into None. We might get DBNotFoundErr() in the case
/// of epoch sync where we save individual headers without having all headers that belong to the epoch.
fn get_block_header<T: ChainStoreAccess>(
Contributor:

There is a method with the same name in Chain - can you rename to something unique to avoid confusion?

Contributor:

Why do you need to treat genesis differently? Is this method the best place to deal with that?

Contributor Author:

hmm get_block_header -> maybe_get_block_header?

And yeah, actually good point: moved the genesis height check outside this function and now only call it in one place. I think it's safest to keep the check because the first epoch also has a zero epoch ID.

Comment on lines +93 to +99
Ok(h) => {
    if h.height() != chain_store.get_genesis_height() {
        Ok(Some(h))
    } else {
        Ok(None)
    }
}
Contributor:

nit: h -> block_header
mini nit: maybe move the genesis handling outside of this match - if it makes things cleaner

Contributor Author:

done

chain_store: &T,
store_update: &mut StoreUpdate,
header: &BlockHeader,
) -> Result<(), Error> {
if header.last_final_block() == &CryptoHash::default() {
let done = save_epoch_new_chunks(chain_store, store_update, header)?;
Contributor:

Is it OK to save new chunks for a non-final block header? It may very well be discarded and not included on the canonical chain.

Contributor Author:

Hmm what's the bug though?

We store the new chunks for every new header stored, and if a header that gets passed to save_epoch_new_chunks() doesn't end up on the canonical chain, the consequence is that we never look at or touch that row in the DB again until we delete_all(DBCol::StateSyncNewChunks).

We need to store the number of new chunks at some point, and if we wait until a block is finalized to do it, the logic gets more complicated. In this version, where we just call save_epoch_new_chunks() on every new non-final head update, we can assume that the info for the prev_hash of the header currently being processed has already been stored. But if we only do save_epoch_new_chunks() on each new last_final_block, we'd possibly need to iterate backwards until we find one that has new-chunks info stored, and this backwards iteration might need to go further back than what we already have here. Actually, I'm not really sure off the top of my head what the upper bound is on how far back we'd need to go (basically, the upper bound on the number of blocks a final head update could jump over).
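To illustrate the invariant being described, here is a hedged, self-contained sketch (hypothetical names, standing in for save_epoch_new_chunks() and DBCol::StateSyncNewChunks): because counts are saved for every header as it is stored, the prev block's counts can be assumed present, and each header's counts are derived from them directly, with no backwards iteration.

```rust
use std::collections::HashMap;

type Hash = [u8; 32];

// Hypothetical in-memory stand-in for the per-block-hash column of per-shard
// "new chunks since epoch start" counts.
#[derive(Default)]
struct NewChunkCounts {
    by_block: HashMap<Hash, Vec<u32>>,
}

impl NewChunkCounts {
    // Called for every stored header, final or not. Relies on the invariant that the
    // prev block's counts were already saved when that header was processed; only the
    // first block of an epoch starts from zero (a simplification in this sketch).
    fn save_for_header(&mut self, hash: Hash, prev_hash: Hash, new_chunk_mask: &[bool]) {
        let prev_counts = self
            .by_block
            .get(&prev_hash)
            .cloned()
            .unwrap_or_else(|| vec![0; new_chunk_mask.len()]);
        let counts: Vec<u32> = prev_counts
            .iter()
            .zip(new_chunk_mask)
            .map(|(count, has_new_chunk)| count + u32::from(*has_new_chunk))
            .collect();
        self.by_block.insert(hash, counts);
    }
}
```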

Contributor:

Ah yeah, sorry, I thought the new chunks were stored per height, but they are stored per block hash; it should be fine as is, like you said.

Good catch about the last final block skipping some heights! I completely missed that property.

Contributor Author:

> Good catch about the last final block skipping some heights! I completely missed that property.

Hah actually to be honest I didn't know that either until writing this PR :D

marcelo-gonzalez (Contributor, Author):

Agh, I also realize I misread it and there are more test failures than I thought... Looking quickly at them, they're probably not a big deal; will fix them tomorrow.

wacban (Contributor) left a comment:

LGTM, sorry for the confusion!

Nice test coverage!

A few nits still stand but otherwise looks good.


Comment on lines +107 to +114
fn has_enough_new_chunks(store: &Store, block_hash: &CryptoHash) -> Result<Option<bool>, Error> {
    let Some(num_new_chunks) = get_state_sync_new_chunks(store, block_hash)? else {
        // This might happen in the case of epoch sync where we save individual headers without having all
        // headers that belong to the epoch.
        return Ok(None);
    };
    Ok(Some(num_new_chunks.iter().all(|num_chunks| *num_chunks >= 2)))
}
Contributor:

Why do you need the Option in the return type? Can you just return bool?
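For context, a hedged sketch of how a caller could use the three-way result (hypothetical caller name, assuming the types from the snippet above are in scope): None means the counts were never stored for this block, e.g. after epoch sync, which is different from "stored, but not enough new chunks yet".

```rust
// Hypothetical caller, reusing has_enough_new_chunks() from the snippet above.
fn maybe_record_sync_point(store: &Store, block_hash: &CryptoHash) -> Result<(), Error> {
    match has_enough_new_chunks(store, block_hash)? {
        // Counts missing (e.g. headers saved by epoch sync): nothing can be decided here.
        None => Ok(()),
        // Counts present, but some shard still has fewer than 2 new chunks.
        Some(false) => Ok(()),
        // Every shard has at least 2 new chunks; this block can anchor the sync hash.
        Some(true) => {
            // ... record the sync point ...
            Ok(())
        }
    }
}
```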

let old_header = chain_store.get_block_header(&block_hash)?;
if old_header.height() < last_final_header.height() {
store_update.delete(DBCol::StateSyncNewChunks, block_hash.as_ref());
loop {
Contributor:

Does this loop always iterate from the current final block all the way back to the beginning of the epoch? You may want to add an early return if StateSyncHashes is already set for the epoch_id.

Contributor:

Ah I found this check elsewhere. Can you add a comment that this method should only be called if the sync hash wasn't found yet? Or maybe even some debug assertions.
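A minimal sketch of the suggested debug assertion (self-contained, with hypothetical names standing in for the real method and the StateSyncHashes column):

```rust
use std::collections::HashMap;

type EpochId = [u8; 32];
type CryptoHash = [u8; 32];

// Hypothetical stand-in for DBCol::StateSyncHashes: the recorded sync hash per epoch.
#[derive(Default)]
struct SyncHashes {
    by_epoch: HashMap<EpochId, CryptoHash>,
}

// Sketch of the contract being asked for: this search should only run when the sync hash
// has not been recorded for the epoch yet, so the backwards walk isn't repeated needlessly.
fn find_sync_hash(hashes: &SyncHashes, epoch_id: &EpochId) {
    debug_assert!(
        !hashes.by_epoch.contains_key(epoch_id),
        "find_sync_hash called but the sync hash is already set for this epoch"
    );
    // ... iterate backwards from the final head and record the sync hash when found ...
}
```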

Comment on lines +208 to +209
/// `block_hash` is the prev_hash of the future "sync_hash" block iff the number of new chunks in the epoch in
/// each shard is at least 2, and at least one shard of the block `prev_hash` doesn't yet have 2 new chunks in it.
Contributor:

nit: Maybe say "is the first block that has the X property" instead of "is a block that has the X property but the previous block does not".
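To make the property concrete, a self-contained sketch (hypothetical helper; the real code reads per-shard counts from DBCol::StateSyncNewChunks): a block is the prev of the future sync block exactly when it is the first block in the epoch where every shard has at least 2 new chunks, i.e. it has the property but its prev block does not.

```rust
// Per-shard new-chunk counts since the start of the epoch, for a block and for its prev block.
fn is_prev_of_sync_block(block_counts: &[u32], prev_counts: &[u32]) -> bool {
    let enough = |counts: &[u32]| counts.iter().all(|n| *n >= 2);
    enough(block_counts) && !enough(prev_counts)
}

fn main() {
    // Both shards reach 2 new chunks at this block, but not at its prev: first such block.
    assert!(is_prev_of_sync_block(&[2, 3], &[1, 3]));
    // The prev block already had the property, so this is not the first such block.
    assert!(!is_prev_of_sync_block(&[3, 3], &[2, 3]));
}
```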

Comment on lines +243 to +245
// Check that no block with height `skip_block_height` made it on the canonical chain, so we're testing
// what we think we should be.
-fn assert_fork_happened(env: &TestLoopEnv, skip_sync_block_height: BlockHeight) {
+fn assert_fork_happened(env: &TestLoopEnv, skip_block_height: BlockHeight) {
Contributor:

Is this sufficient to check that a fork happened? Just skipping a block isn't the same as a fork.
