fix(state-sync): set the sync hash only when it's final #12720

Open · wants to merge 5 commits into base: master
Conversation

marcelo-gonzalez (Contributor):

There's currently a bug that can happen if the sync hash we set doesn't end up on the canonical chain. Here we add a couple more test cases for that, and fix it by considering only finalized blocks when setting the state sync hash, and by changing the snapshot logic to check whether the current head is the prev block of what will be the sync hash block. This means the sync hash is found a little later than before, but on a chain with long epoch lengths it shouldn't be a problem.

…to find the sync hash, and with an epoch length of 6, temporary_account_during_resharding() fails in check_txs() because account0 has not synced the parent shard by the resharding block yet.
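For illustration, a minimal self-contained sketch of the approach described above (hypothetical types and names, not the actual nearcore code): the sync hash candidate is only ever taken from a block's last_final_block, so the recorded hash can't later be orphaned by a fork.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real chain types; this is a sketch of the idea only.
type BlockHash = u64;

struct HeaderInfo {
    hash: BlockHash,
    last_final_block: BlockHash,
    // Per-shard count of new chunks since the start of the epoch.
    new_chunks: Vec<u32>,
}

#[derive(Default)]
struct SyncHashTracker {
    headers: HashMap<BlockHash, HeaderInfo>,
    sync_hash: Option<BlockHash>,
}

impl SyncHashTracker {
    // A block qualifies once every shard has at least 2 new chunks in the epoch.
    fn has_enough_new_chunks(info: &HeaderInfo) -> bool {
        info.new_chunks.iter().all(|n| *n >= 2)
    }

    // Called on every head update. Only the header's last final block is considered,
    // so the sync hash we record stays on the canonical chain.
    fn on_new_head(&mut self, header: &HeaderInfo) {
        if self.sync_hash.is_some() {
            return; // already found for this epoch
        }
        if let Some(final_block) = self.headers.get(&header.last_final_block) {
            if Self::has_enough_new_chunks(final_block) {
                self.sync_hash = Some(final_block.hash);
            }
        }
    }
}
```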
marcelo-gonzalez requested a review from a team as a code owner on January 11, 2025 at 04:25
marcelo-gonzalez (Contributor, Author):

There are a couple more test failures that happen because the epoch lengths are too short: clients haven't caught up by the time a transaction comes in, so they drop it. Will fix.

wacban (Contributor) left a comment:

This approach may be flawed: I think you still partially process the non-final blocks. Please see the last in-line comment for where I think it happens.

Would it work if you keep the logic as is, but instead of calling it for the current block, only call it for the current block's last_final_block? Maybe with some added deduplication, since the same final block may be the last_final_block of multiple future blocks (even without forks).
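A minimal sketch of the deduplication idea (hypothetical names, self-contained, not the PR's code): several consecutive blocks can report the same last_final_block, so each final block should be processed at most once.

```rust
use std::collections::HashSet;

// Hypothetical: track which final blocks were already processed, since many future
// blocks may report the same hash as their last_final_block.
#[derive(Default)]
struct FinalBlockDedup {
    processed: HashSet<[u8; 32]>,
}

impl FinalBlockDedup {
    // Returns true only the first time a given final block hash is seen.
    fn should_process(&mut self, last_final_block: [u8; 32]) -> bool {
        self.processed.insert(last_final_block)
    }
}

fn main() {
    let mut dedup = FinalBlockDedup::default();
    let final_hash = [7u8; 32];
    assert!(dedup.should_process(final_hash)); // first block pointing at this final block
    assert!(!dedup.should_process(final_hash)); // later blocks with the same last_final_block are skipped
}
```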

fn remove_old_blocks<T: ChainStoreAccess>(
/// Helper to turn DBNotFoundErr() or the genesis header into None. We might get DBNotFoundErr() in the case
/// of epoch sync where we save individual headers without having all headers that belong to the epoch.
fn get_block_header<T: ChainStoreAccess>(
Contributor:

There is a method with the same name in Chain - can you rename to something unique to avoid confusion?

Contributor:

Why do you need to treat genesis differently? Is this method the best place to deal with that?

Contributor Author:

hmm get_block_header -> maybe_get_block_header?

And yeah, actually good point: moved the genesis height check outside this function and now only call it in one place. I think it's safest to keep the check because the first epoch also has a zero epoch ID.

Comment on lines +93 to +99
Ok(h) => {
    if h.height() != chain_store.get_genesis_height() {
        Ok(Some(h))
    } else {
        Ok(None)
    }
}
Contributor:

nit: h -> block_header
mini nit: maybe move the genesis handling outside of this match - if it makes things cleaner

Contributor Author:

done

chain_store: &T,
store_update: &mut StoreUpdate,
header: &BlockHeader,
) -> Result<(), Error> {
if header.last_final_block() == &CryptoHash::default() {
let done = save_epoch_new_chunks(chain_store, store_update, header)?;
Contributor:

Is it OK to save new chunks for a non-final block header? It may very well be discarded and not included on the canonical chain.

Contributor Author:

Hmm what's the bug though?

We store the new chunks for every new header stored, and if a header that gets passed to save_epoch_new_chunks() doesn't end up on the canonical chain, the consequence is that we never look at or touch that row in the DB again until we delete_all(DBCol::StateSyncNewChunks).

We need to store the number of new chunks at some point, and if we wait until a block is finalized to do it, the logic gets more complicated. In this version, where we just call save_epoch_new_chunks() on every new non-final head update, we can assume that the info for the prev_hash of the header currently being processed has already been stored. But if we only do save_epoch_new_chunks() on each new last_final_block, we'd possibly need to iterate backwards until we find one that has new-chunks info stored, and this backwards iteration might need to go further back than what we already have here. Actually, I'm not really sure off the top of my head what the upper bound is on how far back we'd need to go (basically, the upper bound on the number of blocks a final head update could jump over).
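To illustrate the invariant being described, here is a hedged, self-contained sketch (hypothetical names, standing in for save_epoch_new_chunks() and DBCol::StateSyncNewChunks): because counts are saved for every header as it is stored, the prev block's counts can be assumed present, and each header's counts are derived from them directly, with no backwards iteration.

```rust
use std::collections::HashMap;

type Hash = [u8; 32];

// Hypothetical in-memory stand-in for the per-block-hash column of per-shard
// "new chunks since epoch start" counts.
#[derive(Default)]
struct NewChunkCounts {
    by_block: HashMap<Hash, Vec<u32>>,
}

impl NewChunkCounts {
    // Called for every stored header, final or not. Relies on the invariant that the
    // prev block's counts were already saved when that header was processed; only the
    // first block of an epoch starts from zero (a simplification in this sketch).
    fn save_for_header(&mut self, hash: Hash, prev_hash: Hash, new_chunk_mask: &[bool]) {
        let prev_counts = self
            .by_block
            .get(&prev_hash)
            .cloned()
            .unwrap_or_else(|| vec![0; new_chunk_mask.len()]);
        let counts: Vec<u32> = prev_counts
            .iter()
            .zip(new_chunk_mask)
            .map(|(count, has_new_chunk)| count + u32::from(*has_new_chunk))
            .collect();
        self.by_block.insert(hash, counts);
    }
}
```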

Contributor:

Ah yeah, sorry, I thought the new chunks were stored per height, but they are stored per block hash; it should be fine as is, like you said.

Good catch about the last final block skipping some heights! I completely missed that property.

Contributor Author:

> Good catch about the last final block skipping some heights! I completely missed that property.

Hah actually to be honest I didn't know that either until writing this PR :D

marcelo-gonzalez (Contributor, Author):

Agh, I also realize I misread it and there are more test failures than I thought... Looking quickly at them, they're probably not a big deal; will fix them tomorrow.

wacban (Contributor) left a comment:

LGTM, sorry for the confusion!

Nice test coverage!

A few nits still stand but otherwise looks good.


Comment on lines +107 to +114
fn has_enough_new_chunks(store: &Store, block_hash: &CryptoHash) -> Result<Option<bool>, Error> {
    let Some(num_new_chunks) = get_state_sync_new_chunks(store, block_hash)? else {
        // This might happen in the case of epoch sync where we save individual headers without having all
        // headers that belong to the epoch.
        return Ok(None);
    };
    Ok(Some(num_new_chunks.iter().all(|num_chunks| *num_chunks >= 2)))
}
Contributor:

Why do you need the Option in the return type? Can you just return bool?
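For context, a hedged sketch of how a caller could use the three-way result (hypothetical caller name, assuming the types from the snippet above are in scope): None means the counts were never stored for this block, e.g. after epoch sync, which is different from "stored, but not enough new chunks yet".

```rust
// Hypothetical caller, reusing has_enough_new_chunks() from the snippet above.
fn maybe_record_sync_point(store: &Store, block_hash: &CryptoHash) -> Result<(), Error> {
    match has_enough_new_chunks(store, block_hash)? {
        // Counts missing (e.g. headers saved by epoch sync): nothing can be decided here.
        None => Ok(()),
        // Counts present, but some shard still has fewer than 2 new chunks.
        Some(false) => Ok(()),
        // Every shard has at least 2 new chunks; this block can anchor the sync hash.
        Some(true) => {
            // ... record the sync point ...
            Ok(())
        }
    }
}
```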

let old_header = chain_store.get_block_header(&block_hash)?;
if old_header.height() < last_final_header.height() {
store_update.delete(DBCol::StateSyncNewChunks, block_hash.as_ref());
loop {
Contributor:

Does this loop always iterate from the current final block all the way back to the beginning of the epoch? You may want to add an early return if StateSyncHashes is already set for the epoch_id.

Contributor:

Ah I found this check elsewhere. Can you add a comment that this method should only be called if the sync hash wasn't found yet? Or maybe even some debug assertions.
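A minimal sketch of the suggested debug assertion (self-contained, with hypothetical names standing in for the real method and the StateSyncHashes column):

```rust
use std::collections::HashMap;

type EpochId = [u8; 32];
type CryptoHash = [u8; 32];

// Hypothetical stand-in for DBCol::StateSyncHashes: the recorded sync hash per epoch.
#[derive(Default)]
struct SyncHashes {
    by_epoch: HashMap<EpochId, CryptoHash>,
}

// Sketch of the contract being asked for: this search should only run when the sync hash
// has not been recorded for the epoch yet, so the backwards walk isn't repeated needlessly.
fn find_sync_hash(hashes: &SyncHashes, epoch_id: &EpochId) {
    debug_assert!(
        !hashes.by_epoch.contains_key(epoch_id),
        "find_sync_hash called but the sync hash is already set for this epoch"
    );
    // ... iterate backwards from the final head and record the sync hash when found ...
}
```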

Comment on lines +208 to +209
/// `block_hash` is the prev_hash of the future "sync_hash" block iff the number of new chunks in the epoch in
/// each shard is at least 2, and at least one shard of the block `prev_hash` doesn't yet have 2 new chunks in it.
Contributor:

nit: Maybe say "is the first block that has the X property" instead of "is a block that has the X property but the previous block does not".
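To make the property concrete, a self-contained sketch (hypothetical helper; the real code reads per-shard counts from DBCol::StateSyncNewChunks): a block is the prev of the future sync block exactly when it is the first block in the epoch where every shard has at least 2 new chunks, i.e. it has the property but its prev block does not.

```rust
// Per-shard new-chunk counts since the start of the epoch, for a block and for its prev block.
fn is_prev_of_sync_block(block_counts: &[u32], prev_counts: &[u32]) -> bool {
    let enough = |counts: &[u32]| counts.iter().all(|n| *n >= 2);
    enough(block_counts) && !enough(prev_counts)
}

fn main() {
    // Both shards reach 2 new chunks at this block, but not at its prev: first such block.
    assert!(is_prev_of_sync_block(&[2, 3], &[1, 3]));
    // The prev block already had the property, so this is not the first such block.
    assert!(!is_prev_of_sync_block(&[3, 3], &[2, 3]));
}
```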

Comment on lines +243 to +245
// Check that no block with height `skip_block_height` made it on the canonical chain, so we're testing
// what we think we should be.
-fn assert_fork_happened(env: &TestLoopEnv, skip_sync_block_height: BlockHeight) {
+fn assert_fork_happened(env: &TestLoopEnv, skip_block_height: BlockHeight) {
Contributor:

Is this sufficient to check that a fork happened? Just skipping a block isn't the same as a fork.
