
feat(resharding): Full forks support in flat storage #12727

Merged
merged 9 commits into near:master on Jan 14, 2025

Conversation

Trisfald
Contributor

This PR fixes the remaining issue in FlatStorageResharder that prevented the chain from progressing at the resharding boundary in some corner cases.

Two possible solutions have been considered:

  • Postponing the split task until the last block of the epoch becomes final
  • Postponing the split task until one candidate resharding block becomes final

I've decided to proceed with the second option, as it builds on existing code and the changes are limited in scope to the inner implementation of FlatStorageResharder. This way, the resharder remains entirely decoupled from the epoch manager.

Also in this PR:

  • Minor refactoring: task scheduling became a bit more involved in order to handle multiple resharding requests that can get intertwined.
  • New unit test to validate the implementation

The existing tests for forks in resharding_v3 have finally been un-ignored 💪
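The chosen strategy can be sketched roughly as follows. This is a minimal, self-contained model, not nearcore's actual API: the names `CandidateStatus`, `TaskStatus`, and `compute_task_status` are illustrative. The idea is that the split task starts as soon as any candidate resharding block becomes final, stays postponed while at least one candidate is still on a viable chain, and fails only if every candidate ended up on a dead fork.

```rust
// Hypothetical model of "postpone the split task until one candidate
// resharding block becomes final".

#[derive(Debug, PartialEq)]
enum CandidateStatus {
    Final,     // the candidate block is now final
    Pending,   // not yet final, still on a viable chain
    Discarded, // the candidate ended up on a dead fork
}

#[derive(Debug, PartialEq)]
enum TaskStatus {
    CanStart,
    Postponed,
    Failed,
}

fn compute_task_status(candidates: &[CandidateStatus]) -> TaskStatus {
    let mut any_pending = false;
    for c in candidates {
        match c {
            // Any final candidate is enough to start the split.
            CandidateStatus::Final => return TaskStatus::CanStart,
            CandidateStatus::Pending => any_pending = true,
            // A discarded candidate alone is not fatal: forks are expected.
            CandidateStatus::Discarded => {}
        }
    }
    if any_pending { TaskStatus::Postponed } else { TaskStatus::Failed }
}

fn main() {
    use CandidateStatus::*;
    assert_eq!(compute_task_status(&[Discarded, Final]), TaskStatus::CanStart);
    assert_eq!(compute_task_status(&[Discarded, Pending]), TaskStatus::Postponed);
    assert_eq!(compute_task_status(&[Discarded, Discarded]), TaskStatus::Failed);
    println!("all status checks passed");
}
```

Note how this keeps the resharder decoupled from the epoch manager: it only needs block-level finality information, not epoch boundaries.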


codecov bot commented Jan 13, 2025

Codecov Report

Attention: Patch coverage is 90.86294% with 18 lines in your changes missing coverage. Please review.

Project coverage is 70.73%. Comparing base (11a64fd) to head (9bf9658).
Report is 10 commits behind head on master.

Files with missing lines Patch % Lines
chain/chain/src/flat_storage_resharder.rs 90.86% 7 Missing and 11 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12727      +/-   ##
==========================================
+ Coverage   70.64%   70.73%   +0.09%     
==========================================
  Files         848      848              
  Lines      173878   174207     +329     
  Branches   173878   174207     +329     
==========================================
+ Hits       122836   123229     +393     
+ Misses      45903    45835      -68     
- Partials     5139     5143       +4     
Flag Coverage Δ
backward-compatibility 0.16% <0.00%> (-0.01%) ⬇️
db-migration 0.16% <0.00%> (-0.01%) ⬇️
genesis-check 1.35% <0.00%> (-0.01%) ⬇️
linux 69.25% <90.00%> (+0.05%) ⬆️
linux-nightly 70.31% <90.86%> (+0.06%) ⬆️
pytests 1.65% <0.00%> (-0.01%) ⬇️
sanity-checks 1.46% <0.00%> (-0.01%) ⬇️
unittests 70.56% <90.86%> (+0.09%) ⬆️
upgradability 0.20% <0.00%> (-0.01%) ⬇️


@Trisfald Trisfald marked this pull request as ready for review January 13, 2025 17:57
@Trisfald Trisfald requested a review from a team as a code owner January 13, 2025 17:57

@wacban wacban left a comment


LGTM

@@ -181,6 +181,22 @@ impl FlatStorageResharder {
info!(target: "resharding", ?split_params, "initiating flat storage shard split");
wacban:

Can you add a comment about the fork handling to the method? GitHub doesn't let me comment there, but I mean split_shard, a few lines above here.

Trisfald:

Sure

Comment on lines +191 to +194
if parent_shard == parent
&& params.left_child_shard == left_child_shard
&& params.right_child_shard == right_child_shard
&& params.shard_layout == *shard_layout
wacban:

Why do you need those checks?

Trisfald:

These are not strictly necessary. The checks are a safeguard: they discard the old candidate resharding block if the shard layout of the new resharding is different. That can't really happen today, since the shard layout is fixed.
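The safeguard boils down to a parameter-equality check. A rough sketch, with hypothetical names (`SplitParams`, `should_discard_old_candidate` are illustrative, not the PR's actual types): the stored candidate is only discarded when the new split request carries different parameters, which today would require the shard layout itself to have changed.

```rust
// Illustrative sketch of the safeguard discussed above; the field set is a
// simplification of the real split parameters.
#[derive(PartialEq, Debug, Clone)]
struct SplitParams {
    parent_shard: u64,
    left_child_shard: u64,
    right_child_shard: u64,
    shard_layout_version: u64,
}

// Discard the old candidate only if the new request differs in any field.
fn should_discard_old_candidate(old: &SplitParams, new: &SplitParams) -> bool {
    old != new
}

fn main() {
    let old = SplitParams {
        parent_shard: 1,
        left_child_shard: 2,
        right_child_shard: 3,
        shard_layout_version: 7,
    };
    let same = old.clone();
    let mut changed = old.clone();
    changed.shard_layout_version = 8; // a different shard layout
    assert!(!should_discard_old_candidate(&old, &same));
    assert!(should_discard_old_candidate(&old, &changed));
    println!("ok");
}
```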

Comment on lines 362 to 363
let (resharding_blocks, parent_shard, split_params, execution_status) = if let Some(event) =
event_lock_guard.as_mut()
wacban:

nit: Would the following work?

let Some(event) = event_lock_guard.as_mut() else {
    info!(target: "resharding", "flat storage shard split task cancelled: no resharding event");
    return SplitShardSchedulingStatus::Cancelled;
};
...

Trisfald:

Yep, thanks!

Comment on lines 365 to 368
if event.has_started() {
info!(target: "resharding", "flat storage shard split task cancelled: resharding already in progress");
return SplitShardSchedulingStatus::Cancelled;
}
wacban:

Is this expected to happen under normal circumstances, under fork or under error?

Trisfald:

Only if there are forks involved.
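The guard above can be modeled in miniature. This is a hypothetical sketch (`ReshardingEvent`, `try_schedule` are illustrative names, not the PR's API): when fork branches each trigger the same split, only the first request actually starts the task; later duplicates are cancelled.

```rust
// Minimal model of the "resharding already in progress" guard.
struct ReshardingEvent {
    started: bool,
}

impl ReshardingEvent {
    fn has_started(&self) -> bool {
        self.started
    }
}

fn try_schedule(event: &mut ReshardingEvent) -> &'static str {
    if event.has_started() {
        // Duplicate request, e.g. from a second fork branch.
        return "Cancelled";
    }
    event.started = true;
    "Scheduled"
}

fn main() {
    let mut event = ReshardingEvent { started: false };
    assert_eq!(try_schedule(&mut event), "Scheduled");
    // A second branch of the fork requests the same split: cancelled.
    assert_eq!(try_schedule(&mut event), "Cancelled");
    println!("ok");
}
```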

for block_info in resharding_blocks {
match self.compute_scheduled_task_status_at_block(&block_info, chain_store) {
CanStart(block_info) => return CanStart(block_info),
Failed => {} // Nothing to do on failure.
wacban:

Can you mention that this may happen during forks and isn't actually "the end of the world"?

Comment on lines 727 to 730
// TODO(resharding): duplicate this test so that in one case resharding is performed on block
// B(height=13) and in another case resharding is performed on block B'(height=13).
// In the current scenario the real resharding happens on block B'. Low priority TODO
// since it's a very rare corner case.
wacban:

Is this todo still to do?

Trisfald:

Yes, this still needs to be done. split_shard_in_forks_when_older_branch_is_finalized covers that scenario, but only as a flat storage unit test; it would be nice to have a test-loop version with all the components.

@wacban

wacban commented Jan 14, 2025

Two possible solutions have been considered:
Postponing the split task until the last block of the epoch becomes final
Postponing the split task until one candidate resharding block becomes final

Is the first option worth considering as a better engineering or good first issue project for later?

@Trisfald

Two possible solutions have been considered:
Postponing the split task until the last block of the epoch becomes final
Postponing the split task until one candidate resharding block becomes final

Is the first option worth considering as a better engineering or good first issue project for later?

It's an alternative with different tradeoffs. It makes handling the state easier, since instead of a set of block hashes we would just need one epoch_id. On the other hand, it makes the resharder less decoupled because of the dependency on epochs.
Complexity-wise, I expect both to be on a similar level, since dealing with blocks is somewhat easier than dealing with epochs.

TL;DR: not a straightforward, simple improvement, IMO.

@Trisfald Trisfald added this pull request to the merge queue Jan 14, 2025
@@ -709,8 +709,6 @@ fn test_resharding_v3_drop_chunks_all() {
}

#[test]
// TODO(resharding): fix nearcore and un-ignore this test
#[ignore]
wacban:

🎉

Merged via the queue into near:master with commit cf45cc5 Jan 14, 2025
28 checks passed
@Trisfald Trisfald deleted the resharding-flat-store-all-forks branch January 14, 2025 14:27