
feat: single shard tracking State cleanup #12734

Merged
10 commits merged on Jan 17, 2025

Conversation

@staffik (Contributor) commented Jan 14, 2025

Clean up the parent shard State if neither child has been tracked for GC window epochs.
This also implements shard garbage collection in general; see #11883.

TODO: add more tests

Summary

  • We clean up unused shard State with a delay of GC window epochs. One reason is that GC modifies the State, and removing the State earlier would result in at least negative refcounts, if not more serious problems.
  • For that, we need to know whether a shard has gone untracked for GC window epochs. One caveat is that the validator operator could have changed the validator key in this period, so we should not rely on the current validator key (or even the tracking config) to tell which shards were tracked in the past.
  • We use the TrieChanges column to determine which shards were tracked in a given epoch. We rely on TrieChanges being saved for the last block of an epoch, for all shards that were tracked in that epoch. TODO: add a test that focuses on that.
  • Shard cleanup is only triggered when we have gc-ed the last block of an epoch, which is always a final block on the canonical chain.
  • For each shard we clean up, we also remove its State mapping, since deleting the shard means we no longer have the State for any descendant shard either.
  • Of course, we do not remove the State of shards that are currently tracked, nor the State of shards that we care about in the next epoch.
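The cleanup decision described above can be sketched as follows. This is a simplified model with hypothetical names and plain integer shard ids, not nearcore's actual types or API: a shard tracked in the gc-ed epoch is a cleanup candidate only if it was not tracked in any later epoch and is not cared about in the current or next epoch.

```rust
use std::collections::HashSet;

// Illustrative sketch only; `shards_to_cleanup` and the u64 shard ids
// are hypothetical, not nearcore's real API.
fn shards_to_cleanup(
    tracked_in_gced_epoch: &[u64],
    tracked_in_later_epochs: &HashSet<u64>,
    cared_about_this_or_next_epoch: &HashSet<u64>,
) -> Vec<u64> {
    tracked_in_gced_epoch
        .iter()
        .copied()
        .filter(|shard| {
            // Keep State for shards tracked in any later epoch...
            !tracked_in_later_epochs.contains(shard)
                // ...and for shards needed in the current or next epoch.
                && !cared_about_this_or_next_epoch.contains(shard)
        })
        .collect()
}

fn main() {
    let later: HashSet<u64> = [1].into_iter().collect();
    let cared: HashSet<u64> = [2].into_iter().collect();
    // Shards 0, 1, 2 were tracked in the gc-ed epoch; only 0 is unused now.
    assert_eq!(shards_to_cleanup(&[0, 1, 2], &later, &cared), vec![0]);
}
```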

Testing

GC num epochs to keep is set to 3.

Notation

P - parent shard
C - child shard
U - unrelated shard
Schedule: (epoch before resharding) | (epoch after resharding) | ... next epochs

Tested scenarios

  • P | C | U ... test_resharding_v3_state_cleanup
  • P | U ... test_resharding_v3_do_not_track_children_after_resharding
  • P | C | U | U | U | U | U | C ... test_resharding_v3_stop_track_child_for_5_epochs (in the end we do not map to parent)
  • P | C1 | U | U | C2 | U | U | C1 ... test_resharding_v3_stop_track_child_for_5_epochs_with_sibling_in_between (in the end we map to parent)
  • P | U | C ... test_resharding_v3_shard_shuffling_untrack_then_track
  • U | U | C ... test_resharding_v3_sync_child
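The child-tracking gaps these schedules exercise can be modeled with a small helper. This is illustrative only (`parent_state_survives` is not a real test helper, and nearcore's exact epoch accounting may differ): under a GC window of 3, the parent State survives only if some child is tracked in every run of 3 consecutive post-resharding epochs.

```rust
// Illustrative model of the schedules above (not a real test helper).
// With "GC num epochs to keep" = 3, the parent State is cleaned up once
// neither child has been tracked for 3 consecutive epochs.
fn parent_state_survives(child_tracked_per_epoch: &[bool], gc_window: usize) -> bool {
    child_tracked_per_epoch
        .windows(gc_window)
        .all(|window| window.iter().any(|&tracked| tracked))
}

fn main() {
    // P | C | U | U | U | U | U | C: children untracked for 5 epochs,
    // so the parent State is gone and the returning child no longer maps to it.
    let stop_5 = [true, false, false, false, false, false, true];
    assert!(!parent_state_survives(&stop_5, 3));

    // P | C1 | U | U | C2 | U | U | C1: the sibling keeps a child tracked
    // often enough, so the parent State survives and the mapping is kept.
    let with_sibling = [true, false, false, true, false, false, true];
    assert!(parent_state_survives(&with_sibling, 3));
}
```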

codecov bot commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 92.99065% with 15 lines in your changes missing coverage. Please review.

Project coverage is 70.71%. Comparing base (668b0d6) to head (14ff014).
Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
chain/chain/src/garbage_collection.rs 89.70% 0 Missing and 14 partials ⚠️
core/store/src/adapter/flat_store.rs 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12734      +/-   ##
==========================================
+ Coverage   70.66%   70.71%   +0.05%     
==========================================
  Files         849      849              
  Lines      174430   174612     +182     
  Branches   174430   174612     +182     
==========================================
+ Hits       123256   123473     +217     
+ Misses      46030    45978      -52     
- Partials     5144     5161      +17     
Flag Coverage Δ
backward-compatibility 0.16% <0.00%> (-0.01%) ⬇️
db-migration 0.16% <0.00%> (-0.01%) ⬇️
genesis-check 1.35% <0.00%> (-0.01%) ⬇️
linux 69.17% <79.43%> (+<0.01%) ⬆️
linux-nightly 70.31% <92.99%> (+0.04%) ⬆️
pytests 1.65% <0.00%> (-0.01%) ⬇️
sanity-checks 1.46% <0.00%> (-0.01%) ⬇️
unittests 70.54% <92.99%> (+0.05%) ⬆️
upgradability 0.20% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

@staffik staffik marked this pull request as ready for review January 15, 2025 13:23
@staffik staffik requested a review from a team as a code owner January 15, 2025 13:23
@wacban (Contributor) left a comment

I didn't finish the review but so far looks good. I'll let others approve or come back to it tomorrow.

Comment on lines 241 to 252
let tracked_shards_in_gced_epoch_to_check_for_cleanup = if !shard_tracker
.tracks_all_shards()
&& epoch_manager.is_last_block_in_finished_epoch(block_hash)?
{
Some(get_tracked_shards_in_past_epoch(
&chain_store_update,
&epoch_manager,
block_hash,
)?)
} else {
None
};

nit: Maybe move to a helper method?

Comment on lines 261 to 262
if let Some(potential_shards_for_cleanup) =
tracked_shards_in_gced_epoch_to_check_for_cleanup

nit: Keep consistent naming e.g. call both potential_shards_for_cleanup.

Comment on lines 1092 to 1096
fn get_tracked_shards_in_past_epoch(
chain_store_update: &ChainStoreUpdate,
epoch_manager: &Arc<dyn EpochManagerAdapter>,
past_epoch_block_hash: &CryptoHash,
) -> Result<Vec<ShardUId>, Error> {

I don't think this method relies on the fact that the block is in a past epoch or that it is the last block of that epoch. I would suggest removing past_epoch from the method and argument names.

Comment on lines 1114 to 1125
/// State cleanup for single shard tracking. Removes State of shards that are no longer in use.
///
/// It has to be run after we clear block data for the `last_block_hash_in_gced_epoch`.
/// `tracked_shards_in_gced_epoch` are shards that were tracked in the gc-ed epoch,
/// and these are shards that we potentially no longer use and that can be cleaned up.
/// We do not clean up a shard if it has been tracked in any epoch later,
/// or we care about it in the current or the next epoch (relative to Head).
///
/// With ReshardingV3, we use State mapping (see DBCol::StateShardUIdMapping),
/// where each `ShardUId` is potentially mapped to its ancestor to get the database key prefix.
/// We only remove a shard State if all its descendants are ready to be cleaned up,
/// in which case, we also remove the mapping from `StateShardUIdMapping`.

❤️
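As an aside, the descendant-gating rule this doc comment describes (remove an ancestor's State and its StateShardUIdMapping entries only when every descendant mapped to it is ready for cleanup) can be sketched roughly as follows, using hypothetical names and plain integer shard ids rather than the nearcore implementation:

```rust
use std::collections::HashMap;

// Illustrative sketch, not nearcore code.
// `mapping`: descendant shard id -> ancestor shard id used as the DB key prefix.
// Returns ancestors whose State (and mapping entries) can be removed, i.e.
// those for which every mapped descendant is ready for cleanup.
fn ancestors_ready_for_cleanup(
    mapping: &HashMap<u32, u32>,
    ready: impl Fn(u32) -> bool,
) -> Vec<u32> {
    // Group descendants by their ancestor.
    let mut by_ancestor: HashMap<u32, Vec<u32>> = HashMap::new();
    for (&descendant, &ancestor) in mapping {
        by_ancestor.entry(ancestor).or_default().push(descendant);
    }
    // Keep only ancestors where all descendants are cleanable.
    let mut out: Vec<u32> = by_ancestor
        .into_iter()
        .filter(|(_, descendants)| descendants.iter().all(|&d| ready(d)))
        .map(|(ancestor, _)| ancestor)
        .collect();
    out.sort();
    out
}

fn main() {
    // Ancestor 1 has descendants 2 and 3; only 2 is ready, so 1 survives.
    // Ancestor 4 has the single ready descendant 5, so 4 can be removed.
    let mapping: HashMap<u32, u32> = [(2, 1), (3, 1), (5, 4)].into_iter().collect();
    let ready = |shard: u32| shard == 2 || shard == 5;
    assert_eq!(ancestors_ready_for_cleanup(&mapping, ready), vec![4]);
}
```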

@wacban (Contributor) left a comment

LGTM, great stuff!

@Trisfald (Contributor) left a comment

🚀

return Err(Error::GCError(
"block on canonical chain shouldn't have refcount 0".into(),
));
}
debug_assert_eq!(blocks_current_height.len(), 1);

// Do not clean up immediatelly, as we still need the State to run gc for this block.

Suggested change
// Do not clean up immediatelly, as we still need the State to run gc for this block.
// Do not clean up immediately, as we still need the State to run gc for this block.

@@ -13,6 +15,18 @@ pub fn shuffle_receipt_proofs<ReceiptProofType>(
receipt_proofs.shuffle(&mut rng);
}

pub fn cares_about_shard_this_or_next_epoch(

nit: As I read this function, it might as well be a member of shard tracker.
Feel free to ignore

/// Maximum number of epochs under which the test should finish.
const TESTLOOP_NUM_EPOCHS_TO_WAIT: u64 = 8;
/// Default number of epochs for resharding testloop to run.
// TODO(resharding) Fix nearcore and set it to 10.

Do you have any insight on why it fails with 10? If you do it might be good to capture that in a comment or issue.

If you don't have any relevant clue, feel free to ignore this comment


@staffik staffik enabled auto-merge January 16, 2025 16:51
@staffik staffik disabled auto-merge January 16, 2025 16:57
@staffik staffik enabled auto-merge January 16, 2025 17:05
@staffik staffik added this pull request to the merge queue Jan 16, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 16, 2025
@staffik (Contributor, Author) commented Jan 16, 2025

Had to add this 14ff014 to pass nayduck tests. @Trisfald @wacban if you are against please let me know. If not, will merge tomorrow.

@wacban (Contributor) commented Jan 16, 2025

Had to add this 14ff014 to pass nayduck tests. @Trisfald @wacban if you are against please let me know. If not, will merge tomorrow.

Do you know what made this change needed?

  • Some sort of error -> let's fix the error
  • Some sort of assumption in the test no longer holds when the state is cleaned -> that's fine

@staffik (Contributor, Author) commented Jan 17, 2025

  • Some sort of assumption in the test no longer holds when the state is cleaned -> that's fine

The debug assert I added failed at the beginning of a test: in sanity/simple.py and sanity/split_storage.py, one archival node was created without setting tracked_shards=[0].

@staffik staffik added this pull request to the merge queue Jan 17, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 17, 2025
@staffik staffik added this pull request to the merge queue Jan 17, 2025
Merged via the queue into master with commit 3d8336a Jan 17, 2025
28 checks passed
@staffik staffik deleted the stafik/resharding/state-cleanup-impl branch January 17, 2025 13:48