fix(state-sync): set the sync hash only when it's final #12720
base: master
Conversation
… to find the sync hash, and with an epoch length of 6, `temporary_account_during_resharding()` fails in `check_txs()` because `account0` has not synced the parent shard by the resharding block yet.
There are a couple more test failures that happen because the epoch lengths are too short: clients aren't caught up by the time a transaction comes in, and so they drop it. Will fix.
This approach may be flawed - I think you still partially process the non-final blocks. Please see the last in-line comment for where I think it happens.
Would it work if you keep the logic as is, but for the current block only call it for the `last_final_block` of the current block? Maybe with some added deduplication, since the same final block may be the `last_final_block` of multiple future blocks (even without forks).
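A rough sketch of the shape I mean (hypothetical names: `on_new_header` and `already_processed` are made up here; only `save_epoch_new_chunks` is from the PR):

fn on_new_header<T: ChainStoreAccess>(
    chain_store: &T,
    store_update: &mut StoreUpdate,
    header: &BlockHeader,
) -> Result<(), Error> {
    let final_hash = header.last_final_block();
    if final_hash == &CryptoHash::default() {
        // No finalized ancestor yet; nothing to record.
        return Ok(());
    }
    // Deduplicate: several consecutive blocks can share one last_final_block,
    // so skip final blocks that were already handled.
    if already_processed(chain_store, final_hash)? {
        return Ok(());
    }
    let final_header = chain_store.get_block_header(final_hash)?;
    save_epoch_new_chunks(chain_store, store_update, &final_header)
}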
fn remove_old_blocks<T: ChainStoreAccess>(

/// Helper to turn DBNotFoundErr() or the genesis header into None. We might get DBNotFoundErr() in the case
/// of epoch sync where we save individual headers without having all headers that belong to the epoch.
fn get_block_header<T: ChainStoreAccess>(
There is a method with the same name in Chain - can you rename to something unique to avoid confusion?
Why do you need to treat genesis differently? Is this method the best place to deal with that?
hmm `get_block_header` -> `maybe_get_block_header`?
And yea actually good point, moved the genesis height check outside this function and only call it in one place now. I think it's safest to perform the check because the first epoch also has a zero epoch ID
Ok(h) => {
    if h.height() != chain_store.get_genesis_height() {
        Ok(Some(h))
    } else {
        Ok(None)
    }
}
nit: `h` -> `block_header`
mini nit: maybe move the genesis handling outside of this match - if it makes things cleaner
done
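Roughly, the call site now looks something like this (a sketch, not the exact diff; `maybe_get_block_header` is the name suggested above):

let Some(block_header) = maybe_get_block_header(chain_store, &block_hash)? else {
    return Ok(None);
};
// The genesis check now lives at the one call site rather than in the helper,
// since the first epoch also has a zero epoch ID.
if block_header.height() == chain_store.get_genesis_height() {
    return Ok(None);
}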
chain_store: &T,
store_update: &mut StoreUpdate,
header: &BlockHeader,
) -> Result<(), Error> {
    if header.last_final_block() == &CryptoHash::default() {
    let done = save_epoch_new_chunks(chain_store, store_update, header)?;
Is it ok to save new chunks for a non-final block header? It may very well be discarded and not included on the canonical chain.
Hmm what's the bug though?
We store the new chunks for every new header stored, and if a header that gets passed to `save_epoch_new_chunks()` doesn't end up on the canonical chain, the consequence is that we never look at or touch that row in the DB again until we `delete_all(DBCol::StateSyncNewChunks)`.
We need to store the number of new chunks at some point, and if we wait until a block is finalized to do it, the logic gets more complicated. In this version where we just call `save_epoch_new_chunks()` on every new non-final head update, we can assume that the info for the `prev_hash` of the header currently being processed has already been stored. But if we only do `save_epoch_new_chunks()` on each new `last_final_block`, we possibly need to iterate backwards until we find one that has new chunks info stored. And this backwards iteration will possibly need to go further back than what we already have here. Actually I'm not really sure off the top of my head what the upper bound is on how far back we'd need to go (basically, the upper bound on the number of blocks a final head update could jump over).
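To illustrate, the final-block-only variant would need a backwards walk roughly like this (a sketch; `store`, `chain_store`, and `final_header` assumed in scope):

let mut hash = *final_header.prev_hash();
loop {
    // Walk back until we hit an ancestor whose new-chunk counts were saved.
    if get_state_sync_new_chunks(store, &hash)?.is_some() {
        break;
    }
    // The final head can jump over several heights at once, so there's no
    // obvious small bound on how far back this walk has to go.
    hash = *chain_store.get_block_header(&hash)?.prev_hash();
}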
Ah yeah, sorry, I thought the new chunks are stored per height but they are stored per block hash - it should be fine as is like you said.
Good catch about the last final block skipping some heights! I completely missed that property.
> Good catch about the last final block skipping some heights! I completely missed that property.
Hah actually to be honest I didn't know that either until writing this PR :D
Agh, also realized I misread it and there are more test failures than I thought... Looking quickly at them it looks like prob not a big deal, will fix them tomorrow.
LGTM, sorry for the confusion!
Nice test coverage!
A few nits still stand but otherwise looks good.
fn has_enough_new_chunks(store: &Store, block_hash: &CryptoHash) -> Result<Option<bool>, Error> {
    let Some(num_new_chunks) = get_state_sync_new_chunks(store, block_hash)? else {
        // This might happen in the case of epoch sync where we save individual headers without having all
        // headers that belong to the epoch.
        return Ok(None);
    };
    Ok(Some(num_new_chunks.iter().all(|num_chunks| *num_chunks >= 2)))
}
Why do you need the `Option` in the return type? Can you just return `bool`?
let old_header = chain_store.get_block_header(&block_hash)?;
if old_header.height() < last_final_header.height() {
    store_update.delete(DBCol::StateSyncNewChunks, block_hash.as_ref());

loop {
Does this loop always iterate from the current final block all the way back to the beginning of the epoch? You may want to add an early return if `StateSyncHashes` is already set for the `epoch_id`.
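Something along these lines (a sketch; the exact column/key shape is assumed, not taken from the diff):

// Skip the whole scan if a sync hash was already recorded for this epoch.
if store.get_ser::<CryptoHash>(DBCol::StateSyncHashes, epoch_id.as_ref())?.is_some() {
    return Ok(());
}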
Ah I found this check elsewhere. Can you add a comment that this method should only be called if the sync hash wasn't found yet? Or maybe even some debug assertions.
/// `block_hash` is the prev_hash of the future "sync_hash" block iff the number of new chunks in the epoch in
/// each shard is at least 2, and at least one shard of the block `prev_hash` doesn't yet have 2 new chunks in it.
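Restated with the helper from this thread, the property amounts to something like (a sketch; `header` assumed to be the block header of `block_hash`):

// `block_hash` is the prev of the future sync block exactly when it's the first
// block where every shard has at least 2 new chunks this epoch.
let current_done = has_enough_new_chunks(store, &block_hash)?;
let prev_done = has_enough_new_chunks(store, header.prev_hash())?;
let is_sync_prev = current_done == Some(true) && prev_done != Some(true);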
nit: Maybe say "is the first block that has the X property" instead of "is a block that has the X property but the previous block does not".
// Check that no block with height `skip_block_height` made it on the canonical chain, so we're testing
// what we think we should be.
-fn assert_fork_happened(env: &TestLoopEnv, skip_sync_block_height: BlockHeight) {
+fn assert_fork_happened(env: &TestLoopEnv, skip_block_height: BlockHeight) {
Is this sufficient to check that fork happened? Just skipping a block isn't the same as a fork.
There's currently a bug that can happen if the sync hash we set doesn't end up on the canonical chain. Here we add a couple more test cases for that, and fix it by considering only finalized blocks when setting the state sync hash, and by changing the snapshot logic to check whether the current head is the prev block of what will be the sync hash block. This means the sync hash is found a little later than it was before, but on a chain with long epoch lengths it shouldn't be a problem.