[garbage_collection] Fix crash in clear_resharding_data while fetching prev_epoch_id #10195
Conversation
LGTM
chain/chain/src/store.rs
Outdated
// We are using the block header to get the epoch_id and shard_layout here as BlockHeader
// isn't garbage collected.
let block_info = epoch_manager.get_block_info(&block_hash)?;
It's a bit confusing that the comment says we're using the headers, and then in the very first line after it you get the block info :) Maybe clarify for which blocks we expect to have the block info and for which only the headers?
Updated comments with more detailed explanation.
This should work from the gc point of view even after we eventually implement partial BlockHeader garbage collection.
But I wonder, do you really want to wait at least 3 (default 5) epochs to gc resharding data? Currently clear_data is only called for blocks that fall out of the gc_num_epochs_to_keep boundary. I thought you were thinking about deleting this right after the resharding epoch is finished. If so, it will require a bigger refactoring of garbage collection (which can be done outside of this PR).
I'd made this change from the perspective of it being the least disruptive and most aligned with the current GC behavior, at the cost of storage for 3 to 5 epochs. If we'd like to delete the old shard data as soon as resharding is done, we can do that as well. I do, however, see a thread on Zulip about reducing the GC window to basically the point where we finalize the block (for BP).
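As a toy illustration of the timing being discussed (a hypothetical helper working on epoch heights, not nearcore's actual clear_data logic): resharding data only becomes eligible for clearing once its block's epoch has fallen out of the gc_num_epochs_to_keep window, i.e. several epochs after the resharding epoch finishes.

// Hypothetical sketch, not the real GC code: resharding data for a block becomes
// eligible for clearing only once its epoch is at least `gc_num_epochs_to_keep`
// epochs behind the current one (default 5, minimum 3 per the discussion above).
fn resharding_data_eligible_for_gc(
    block_epoch_height: u64,
    current_epoch_height: u64,
    gc_num_epochs_to_keep: u64,
) -> bool {
    // e.g. resharding_data_eligible_for_gc(10, 15, 5) == true
    current_epoch_height.saturating_sub(block_epoch_height) >= gc_num_epochs_to_keep
}

Deleting the old shard data right after the resharding epoch finishes would need a trigger outside this window, which is the bigger refactoring mentioned above.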
We had an issue in clear_resharding_data while fetching the prev_epoch_id. The implementation of prev_epoch_id relied on fetching the block_info of the last block of the previous epoch to get the epoch_id. This unfortunately failed in the GC case, as the block_info was already garbage collected. The new implementation here relies on using the block_header to get the epoch_id instead of the block_info.
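For readers outside the PR thread, below is a minimal, self-contained sketch of the idea; the types and store names are simplified stand-ins invented for illustration, not nearcore's actual ChainStore / EpochManager API. The point is only that headers outlive the rest of the block data under GC, so an epoch_id lookup via the header keeps working for old blocks where the block_info lookup fails.

use std::collections::HashMap;

type BlockHash = [u8; 32];
type EpochId = [u8; 32];

// Simplified stand-ins for the two stores: headers survive garbage collection,
// block infos do not.
struct HeaderStore(HashMap<BlockHash, EpochId>);
struct BlockInfoStore(HashMap<BlockHash, EpochId>);

// Old approach (sketch): read the epoch id from the block info.
// Returns None once the block info has been garbage collected.
fn epoch_id_from_block_info(infos: &BlockInfoStore, hash: &BlockHash) -> Option<EpochId> {
    infos.0.get(hash).copied()
}

// New approach (sketch): read the epoch id from the block header, which is not GC'd.
fn epoch_id_from_header(headers: &HeaderStore, hash: &BlockHash) -> Option<EpochId> {
    headers.0.get(hash).copied()
}

fn main() {
    let hash: BlockHash = [1; 32];
    let epoch: EpochId = [7; 32];
    let headers = HeaderStore(HashMap::from([(hash, epoch)]));
    let infos = BlockInfoStore(HashMap::new()); // block info already garbage collected

    assert_eq!(epoch_id_from_block_info(&infos, &hash), None); // the old crash path
    assert_eq!(epoch_id_from_header(&headers, &hash), Some(epoch)); // the fix
}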
This was unfortunately only caught in mocknet and not in integration tests, as having a small enough epoch_length led to the block_info being cached in the epoch_manager (even though it had been GC'd).
Zulip post: https://near.zulipchat.com/#narrow/stream/295558-pagoda.2Fcore/topic/Master.20binary.20Can't.20clear.20old.20data/near/402240517
Pending: figure out a way to include the test from this PR: #10193
We would need to enable the no_cache feature for the test.