
feat: single shard tracking State cleanup #12734

Merged
10 commits merged on Jan 17, 2025

Conversation

@staffik (Contributor) commented Jan 14, 2025

Clean up the parent shard State if neither child has been tracked for GC window epochs.
This also implements shard garbage collection in general; see #11883.

TODO: add more tests

Summary

  • We clean up unused shard State with a delay of GC window epochs. One reason is that GC modifies the State, and removing the State earlier would result in at least negative refcounts, if not more serious problems.
  • For that, we need to know whether a shard has gone untracked for GC window epochs. One caveat is that the validator operator could have changed the validator key in this period, so we should not rely on the current validator key (or even the tracking config) to tell which shards were tracked in the past.
  • We use the TrieChanges column to determine which shards were tracked in a given epoch. We rely on TrieChanges being saved for the last block of an epoch, for all shards that were tracked in that epoch. TODO: add a test that focuses on that.
  • Shard cleanup is only triggered when we have gc-ed the last block of an epoch, which is always a final block on the canonical chain.
  • For each shard we clean up, we also remove its State mapping, since deleting the shard means we no longer have the State for any descendant shard either.
  • Of course, we do not remove the State of shards that are currently tracked, nor the State of shards that we care about in the next epoch.
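The cleanup decision described above can be sketched as follows. This is a simplified model with hypothetical names and plain integer shard ids, not nearcore's actual types or API: a shard tracked in the gc-ed epoch is a cleanup candidate only if it was not tracked in any later epoch and is not cared about in the current or next epoch.

```rust
use std::collections::HashSet;

// Illustrative sketch only; `shards_to_cleanup` and the u64 shard ids
// are hypothetical, not nearcore's real API.
fn shards_to_cleanup(
    tracked_in_gced_epoch: &[u64],
    tracked_in_later_epochs: &HashSet<u64>,
    cared_about_this_or_next_epoch: &HashSet<u64>,
) -> Vec<u64> {
    tracked_in_gced_epoch
        .iter()
        .copied()
        .filter(|shard| {
            // Keep State for shards tracked in any later epoch...
            !tracked_in_later_epochs.contains(shard)
                // ...and for shards needed in the current or next epoch.
                && !cared_about_this_or_next_epoch.contains(shard)
        })
        .collect()
}

fn main() {
    let later: HashSet<u64> = [1].into_iter().collect();
    let cared: HashSet<u64> = [2].into_iter().collect();
    // Shards 0, 1, 2 were tracked in the gc-ed epoch; only 0 is unused now.
    assert_eq!(shards_to_cleanup(&[0, 1, 2], &later, &cared), vec![0]);
}
```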

Testing

GC num epochs to keep is set to 3.

Notation

P - parent shard
C - child shard
U - unrelated shard
Schedule: (epoch before resharding) | (epoch after resharding) | ... next epochs

Tested scenarios

  • P | C | U ... test_resharding_v3_state_cleanup
  • P | U ... test_resharding_v3_do_not_track_children_after_resharding
  • P | C | U | U | U | U | U | C ... test_resharding_v3_stop_track_child_for_5_epochs (in the end we do not map to parent)
  • P | C1 | U | U | C2 | U | U | C1 ... test_resharding_v3_stop_track_child_for_5_epochs_with_sibling_in_between (in the end we map to parent)
  • P | U | C ... test_resharding_v3_shard_shuffling_untrack_then_track
  • U | U | C ... test_resharding_v3_sync_child
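The child-tracking gaps these schedules exercise can be modeled with a small helper. This is illustrative only (`parent_state_survives` is not a real test helper, and nearcore's exact epoch accounting may differ): under a GC window of 3, the parent State survives only if some child is tracked in every run of 3 consecutive post-resharding epochs.

```rust
// Illustrative model of the schedules above (not a real test helper).
// With "GC num epochs to keep" = 3, the parent State is cleaned up once
// neither child has been tracked for 3 consecutive epochs.
fn parent_state_survives(child_tracked_per_epoch: &[bool], gc_window: usize) -> bool {
    child_tracked_per_epoch
        .windows(gc_window)
        .all(|window| window.iter().any(|&tracked| tracked))
}

fn main() {
    // P | C | U | U | U | U | U | C: children untracked for 5 epochs,
    // so the parent State is gone and the returning child no longer maps to it.
    let stop_5 = [true, false, false, false, false, false, true];
    assert!(!parent_state_survives(&stop_5, 3));

    // P | C1 | U | U | C2 | U | U | C1: the sibling keeps a child tracked
    // often enough, so the parent State survives and the mapping is kept.
    let with_sibling = [true, false, false, true, false, false, true];
    assert!(parent_state_survives(&with_sibling, 3));
}
```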

codecov bot commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 92.99065% with 15 lines in your changes missing coverage. Please review.

Project coverage is 70.71%. Comparing base (668b0d6) to head (14ff014).
Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
chain/chain/src/garbage_collection.rs 89.70% 0 Missing and 14 partials ⚠️
core/store/src/adapter/flat_store.rs 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12734      +/-   ##
==========================================
+ Coverage   70.66%   70.71%   +0.05%     
==========================================
  Files         849      849              
  Lines      174430   174612     +182     
  Branches   174430   174612     +182     
==========================================
+ Hits       123256   123473     +217     
+ Misses      46030    45978      -52     
- Partials     5144     5161      +17     
Flag Coverage Δ
backward-compatibility 0.16% <0.00%> (-0.01%) ⬇️
db-migration 0.16% <0.00%> (-0.01%) ⬇️
genesis-check 1.35% <0.00%> (-0.01%) ⬇️
linux 69.17% <79.43%> (+<0.01%) ⬆️
linux-nightly 70.31% <92.99%> (+0.04%) ⬆️
pytests 1.65% <0.00%> (-0.01%) ⬇️
sanity-checks 1.46% <0.00%> (-0.01%) ⬇️
unittests 70.54% <92.99%> (+0.05%) ⬆️
upgradability 0.20% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

@staffik staffik marked this pull request as ready for review January 15, 2025 13:23
@staffik staffik requested a review from a team as a code owner January 15, 2025 13:23
@wacban (Contributor) left a comment

I didn't finish the review but so far looks good. I'll let others approve or come back to it tomorrow.

Comment on lines 241 to 252
let tracked_shards_in_gced_epoch_to_check_for_cleanup = if !shard_tracker
.tracks_all_shards()
&& epoch_manager.is_last_block_in_finished_epoch(block_hash)?
{
Some(get_tracked_shards_in_past_epoch(
&chain_store_update,
&epoch_manager,
block_hash,
)?)
} else {
None
};

nit: Maybe move to a helper method?

Comment on lines 261 to 262
if let Some(potential_shards_for_cleanup) =
tracked_shards_in_gced_epoch_to_check_for_cleanup

nit: Keep consistent naming e.g. call both potential_shards_for_cleanup.

Comment on lines 1092 to 1096
fn get_tracked_shards_in_past_epoch(
chain_store_update: &ChainStoreUpdate,
epoch_manager: &Arc<dyn EpochManagerAdapter>,
past_epoch_block_hash: &CryptoHash,
) -> Result<Vec<ShardUId>, Error> {

I don't think this method relies on the fact that the block is in a past epoch or that it is the last block of that epoch. I would suggest removing past_epoch from the method and argument names.

Comment on lines 1114 to 1125
/// State cleanup for single shard tracking. Removes State of shards that are no longer in use.
///
/// It has to be run after we clear block data for the `last_block_hash_in_gced_epoch`.
/// `tracked_shards_in_gced_epoch` are shards that were tracked in the gc-ed epoch,
/// and these are shards that we potentially no longer use and that can be cleaned up.
/// We do not clean up a shard if it has been tracked in any epoch later,
/// or we care about it in the current or the next epoch (relative to Head).
///
/// With ReshardingV3, we use State mapping (see DBCol::StateShardUIdMapping),
/// where each `ShardUId` is potentially mapped to its ancestor to get the database key prefix.
/// We only remove a shard State if all its descendants are ready to be cleaned up,
/// in which case, we also remove the mapping from `StateShardUIdMapping`.

❤️
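As an aside, the descendant-gating rule this doc comment describes (remove an ancestor's State and its StateShardUIdMapping entries only when every descendant mapped to it is ready for cleanup) can be sketched roughly as follows, using hypothetical names and plain integer shard ids rather than the nearcore implementation:

```rust
use std::collections::HashMap;

// Illustrative sketch, not nearcore code.
// `mapping`: descendant shard id -> ancestor shard id used as the DB key prefix.
// Returns ancestors whose State (and mapping entries) can be removed, i.e.
// those for which every mapped descendant is ready for cleanup.
fn ancestors_ready_for_cleanup(
    mapping: &HashMap<u32, u32>,
    ready: impl Fn(u32) -> bool,
) -> Vec<u32> {
    // Group descendants by their ancestor.
    let mut by_ancestor: HashMap<u32, Vec<u32>> = HashMap::new();
    for (&descendant, &ancestor) in mapping {
        by_ancestor.entry(ancestor).or_default().push(descendant);
    }
    // Keep only ancestors where all descendants are cleanable.
    let mut out: Vec<u32> = by_ancestor
        .into_iter()
        .filter(|(_, descendants)| descendants.iter().all(|&d| ready(d)))
        .map(|(ancestor, _)| ancestor)
        .collect();
    out.sort();
    out
}

fn main() {
    // Ancestor 1 has descendants 2 and 3; only 2 is ready, so 1 survives.
    // Ancestor 4 has the single ready descendant 5, so 4 can be removed.
    let mapping: HashMap<u32, u32> = [(2, 1), (3, 1), (5, 4)].into_iter().collect();
    let ready = |shard: u32| shard == 2 || shard == 5;
    assert_eq!(ancestors_ready_for_cleanup(&mapping, ready), vec![4]);
}
```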

@wacban (Contributor) left a comment

LGTM, great stuff!

@Trisfald (Contributor) left a comment

🚀

return Err(Error::GCError(
"block on canonical chain shouldn't have refcount 0".into(),
));
}
debug_assert_eq!(blocks_current_height.len(), 1);

// Do not clean up immediatelly, as we still need the State to run gc for this block.

Suggested change
// Do not clean up immediatelly, as we still need the State to run gc for this block.
// Do not clean up immediately, as we still need the State to run gc for this block.

@@ -13,6 +15,18 @@ pub fn shuffle_receipt_proofs<ReceiptProofType>(
receipt_proofs.shuffle(&mut rng);
}

pub fn cares_about_shard_this_or_next_epoch(

nit: As I read this function, it might as well be a member of shard tracker.
Feel free to ignore

/// Maximum number of epochs under which the test should finish.
const TESTLOOP_NUM_EPOCHS_TO_WAIT: u64 = 8;
/// Default number of epochs for resharding testloop to run.
// TODO(resharding) Fix nearcore and set it to 10.

Do you have any insight on why it fails with 10? If you do it might be good to capture that in a comment or issue.

If you don't have any relevant clue, feel free to ignore this comment


@staffik staffik enabled auto-merge January 16, 2025 16:51
@staffik staffik disabled auto-merge January 16, 2025 16:57
@staffik staffik enabled auto-merge January 16, 2025 17:05
@staffik staffik added this pull request to the merge queue Jan 16, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 16, 2025
@staffik (Contributor, Author) commented Jan 16, 2025

Had to add this 14ff014 to pass nayduck tests. @Trisfald @wacban if you are against please let me know. If not, will merge tomorrow.

@wacban (Contributor) commented Jan 16, 2025

Had to add this 14ff014 to pass nayduck tests. @Trisfald @wacban if you are against please let me know. If not, will merge tomorrow.

Do you know what made this change needed?

  • Some sort of error -> let's fix the error
  • Some sort of assumption in the test no longer holds when the state is cleaned -> that's fine

@staffik (Contributor, Author) commented Jan 17, 2025

  • Some sort of assumption in the test no longer holds when the state is cleaned -> that's fine

The debug assert I added failed at the beginning of a test: in sanity/simple.py and sanity/split_storage.py, one archival node was created without setting tracked_shards=[0].

@staffik staffik added this pull request to the merge queue Jan 17, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 17, 2025
@staffik staffik added this pull request to the merge queue Jan 17, 2025
Merged via the queue into master with commit 3d8336a Jan 17, 2025
28 checks passed
@staffik staffik deleted the stafik/resharding/state-cleanup-impl branch January 17, 2025 13:48