[garbage_collection] Fix crash in clear_resharding_data while fetching prev_epoch_id #10195
Conversation
LGTM
chain/chain/src/store.rs
Outdated
// We are using the block header to get the epoch_id and shard_layout here as BlockHeader
// isn't garbage collected.
let block_info = epoch_manager.get_block_info(&block_hash)?;
It's a bit confusing that the comment says we're using the headers, and then in the very first line after it you get the block info :) Maybe clarify for which blocks we expect to have the block info and for which only the headers?
Updated comments with more detailed explanation.
This should work from the gc point of view even after we eventually implement partial BlockHeader garbage collection.
But I wonder, do you really want to wait at least 3 (default 5) epochs to gc resharding data? Currently clear_data is only called for blocks that fall out of the gc_num_epochs_to_keep boundary. I thought you were thinking about deleting this right after the resharding epoch is finished. If so, it will require a bigger refactoring of garbage collection (which can be done outside of this PR).
I'd made this change from the perspective of it being the least disruptive and most aligned with the current GC behavior, at the cost of storage for 3 to 5 epochs. If we'd like to delete the old shard data as soon as resharding is done, we can do that as well. I do, however, see a thread on Zulip about reducing the GC window to basically the point where we finalize the block (for BP).
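As a toy illustration of the timing being discussed (a hypothetical helper working on epoch heights, not nearcore's actual clear_data logic): resharding data only becomes eligible for clearing once its block's epoch has fallen out of the gc_num_epochs_to_keep window, i.e. several epochs after the resharding epoch finishes.

// Hypothetical sketch, not the real GC code: resharding data for a block becomes
// eligible for clearing only once its epoch is at least `gc_num_epochs_to_keep`
// epochs behind the current one (default 5, minimum 3 per the discussion above).
fn resharding_data_eligible_for_gc(
    block_epoch_height: u64,
    current_epoch_height: u64,
    gc_num_epochs_to_keep: u64,
) -> bool {
    // e.g. resharding_data_eligible_for_gc(10, 15, 5) == true
    current_epoch_height.saturating_sub(block_epoch_height) >= gc_num_epochs_to_keep
}

Deleting the old shard data right after the resharding epoch finishes would need a trigger outside this window, which is the bigger refactoring mentioned above.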
We had an issue in clear_resharding_data while fetching the prev_epoch_id. The implementation of prev_epoch_id relied on fetching the block_info of the last block of the previous epoch to get the epoch_id. This unfortunately failed in the GC case, as the block_info was already garbage collected. The new implementation here relies on using the block_header to get the epoch_id instead of the block_info.
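For readers outside the PR thread, below is a minimal, self-contained sketch of the idea; the types and store names are simplified stand-ins invented for illustration, not nearcore's actual ChainStore / EpochManager API. The point is only that headers outlive the rest of the block data under GC, so an epoch_id lookup via the header keeps working for old blocks where the block_info lookup fails.

use std::collections::HashMap;

type BlockHash = [u8; 32];
type EpochId = [u8; 32];

// Simplified stand-ins for the two stores: headers survive garbage collection,
// block infos do not.
struct HeaderStore(HashMap<BlockHash, EpochId>);
struct BlockInfoStore(HashMap<BlockHash, EpochId>);

// Old approach (sketch): read the epoch id from the block info.
// Returns None once the block info has been garbage collected.
fn epoch_id_from_block_info(infos: &BlockInfoStore, hash: &BlockHash) -> Option<EpochId> {
    infos.0.get(hash).copied()
}

// New approach (sketch): read the epoch id from the block header, which is not GC'd.
fn epoch_id_from_header(headers: &HeaderStore, hash: &BlockHash) -> Option<EpochId> {
    headers.0.get(hash).copied()
}

fn main() {
    let hash: BlockHash = [1; 32];
    let epoch: EpochId = [7; 32];
    let headers = HeaderStore(HashMap::from([(hash, epoch)]));
    let infos = BlockInfoStore(HashMap::new()); // block info already garbage collected

    assert_eq!(epoch_id_from_block_info(&infos, &hash), None); // the old crash path
    assert_eq!(epoch_id_from_header(&headers, &hash), Some(epoch)); // the fix
}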
This was unfortunately only caught in mocknet and not in integration tests, as having a small enough epoch_length led to the block_info being cached in the epoch_manager (even though it had been GC'd).
Zulip post: https://near.zulipchat.com/#narrow/stream/295558-pagoda.2Fcore/topic/Master.20binary.20Can't.20clear.20old.20data/near/402240517
Pending: figure out a way to include the test from this PR: #10193
We would need to enable the no_cache feature for the test.