Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[garbage_colection] Fix crash in clear_resharding_data while fetching prev_epoch_id #10195

Merged
merged 3 commits into from
Nov 17, 2023

Conversation

shreyan-gupta
Copy link
Contributor

@shreyan-gupta shreyan-gupta commented Nov 17, 2023

We had an issue in the clear_resharding_data while fetching the prev_epoch_id. The implementation of prev_epoch_id relied on fetching the block_info of the last block of prev_epoch to get the epoch_id. This unfortunately failed for the case of GC as the block_info was already garbage collected.

The new implementation here relies on using the block_header to get the epoch_id instead of block_info.

This was unfortunately only caught in mocknet and not integration tests as having a small enough epoch_length lead to the block_info being cached in the epoch_manager (even though it was GC'd)

Zulip post: https://near.zulipchat.com/#narrow/stream/295558-pagoda.2Fcore/topic/Master.20binary.20Can't.20clear.20old.20data/near/402240517

Pending: Figure out a way to include test from this PR: #10193
We would need to enable no_cache feature for the test.

@shreyan-gupta shreyan-gupta requested a review from a team as a code owner November 17, 2023 10:26
@shreyan-gupta shreyan-gupta requested a review from wacban November 17, 2023 10:28
Copy link
Contributor

@wacban wacban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 2347 to 2349
// We are using the block header to get the epoch_id and shard_layout here as BlockHeader
// isn't garbage collected.
let block_info = epoch_manager.get_block_info(&block_hash)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit confusing that the comment says we're using the headers and immediately in the first line after you get the block info :) Maybe clarify for which blocks we expect to have the info and for which only the headers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated comments with more detailed explanation.

Copy link
Member

@posvyatokum posvyatokum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work from gc point of view even after we eventually implement partial BlockHeader garbage collection.
But I wonder, do you really want to wait at least 3 (default 5) epochs to gc resharding data? Currently clear_data is only called for blocks that fall out of gc_num_epochs_to_keep boundary. I thought you were thinking about deleting this right after resharding epoch is finished.
If so, it will require bigger refactoring of garbage collection (can be done outside of this PR).

@shreyan-gupta
Copy link
Contributor Author

This should work from gc point of view even after we eventually implement partial BlockHeader garbage collection. But I wonder, do you really want to wait at least 3 (default 5) epochs to gc resharding data? Currently clear_data is only called for blocks that fall out of gc_num_epochs_to_keep boundary. I thought you were thinking about deleting this right after resharding epoch is finished. If so, it will require bigger refactoring of garbage collection (can be done outside of this PR).

I'd made this change from the perspective of it being the least disruptive and most aligned with the current GC behavior at the cost of storage for 3 to 5 epochs. If we'd like to delete the old shard data as soon as resharding is done, we can do that as well. I however see a thread on Zulip to reduce the GC window to basically when we finalize block (for BP)

@shreyan-gupta shreyan-gupta added this pull request to the merge queue Nov 17, 2023
Merged via the queue into near:master with commit 3eff971 Nov 17, 2023
@shreyan-gupta shreyan-gupta deleted the shreyan/resharding/fix_gc branch November 17, 2023 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants