storage scrubber: GC ancestor shard layers #8196
base: main
Conversation
2958 tests run: 2841 passed, 0 failed, 117 skipped (full report)

Code coverage* (full report)
* collected from Rust tests only

The comment gets automatically updated with the latest test results: 88dfeaa at 2024-07-01T11:29:26.502Z :recycle:
Force-pushed from 824449d to 88dfeaa
We should clean up the variable names when speaking about tenants (`tenant_shard_id` vs `tenant_id`).

I would prefer we call it a parent shard, so at least there is a minimal distinction from the word ancestor, which we already use for `Timeline`.
What if the remote storage lists some but not all shards, e.g. because we're in the middle of `Tenant::split_prepare`?

Edit: ah, nice, you handled that. Kind of an important thing to regression-test, though. Shouldn't be too hard to test this with a pausable failpoint, should it?

Edit 2: but did you also consider the same condition for the `stream_tenant_timelines` call made by `gc_ancestor`? I think it can observe a tenant shard in S3 while some of its timelines' `IndexPart`s haven't been uploaded by `Tenant::split_prepare` yet.

Let's make sure we understand which races can happen and why they are fine / how they are handled. And: where will we document this? A module-level comment doesn't seem like the worst choice, but it's hard to discover.
About `AncestorRefs`: why is it not a more physical refcount map, e.g. keyed on the `absolute_key`? Sure, a bit less memory-efficient, but much harder to screw up. Similarly, I don't think we should go through the trouble of only tracking the cross-ancestor references.

I'm feeling kind of strongly about this; I think the code could be much simpler and lower-risk.
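To make the suggestion concrete: a minimal sketch of what such a "physical" refcount map might look like, keyed directly on the absolute layer key. All names here (`LayerRefCounts`, the key format) are hypothetical, not the PR's actual types.

```rust
use std::collections::HashMap;

// Hypothetical sketch: one flat refcount per absolute layer key (e.g. the
// full S3 key string), instead of a nested map with special-cased
// cross-ancestor tracking. Less memory-efficient, but harder to get wrong.
#[derive(Default)]
pub struct LayerRefCounts {
    counts: HashMap<String, usize>,
}

impl LayerRefCounts {
    pub fn add_ref(&mut self, absolute_key: &str) {
        *self.counts.entry(absolute_key.to_string()).or_insert(0) += 1;
    }

    pub fn refcount(&self, absolute_key: &str) -> usize {
        self.counts.get(absolute_key).copied().unwrap_or(0)
    }

    /// A layer is a GC candidate only if no index references it at all.
    pub fn is_unreferenced(&self, absolute_key: &str) -> bool {
        self.refcount(absolute_key) == 0
    }
}
```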
```python
# We will use a 1s threshold for deletion, let it pass
time.sleep(2)
```
Is that 1s referring to the `min_age_secs=1` below?

Suggested change:

```python
# We will use a min_age_secs=1 second threshold for deletion, let it pass
time.sleep(2)
```
```diff
@@ -111,6 +112,14 @@ def test_scrubber_tenant_snapshot(neon_env_builder: NeonEnvBuilder, shard_count:
     workload.validate()


+def drop_local_state(env, tenant_id):
+    env.storage_controller.tenant_policy_update(tenant_id, {"placement": "Detached"})
```
Type hinting is useful for "find callers" down the road. (This suggestion requires `NeonEnv` to be added to the imports.)

Suggested change:

```python
def drop_local_state(env: NeonEnv, tenant_id):
```
```rust
while let Some(i) = timelines.next().await {
    let tl_summary = i?;

    summary.indices_deleted += tl_summary.indices_deleted;
    summary.remote_storage_errors += tl_summary.remote_storage_errors;
```
This is not summing up `ancestor_layers_deleted`. Let's assert it to be zero at this point.

Also, I would recommend destructuring to get compile-time errors when we extend `GcSummary` in the future:

```rust
while let Some(i) = timelines.next().await {
    let GcSummary { indices_deleted, remote_storage_errors, ancestor_layers_deleted } = i?;
    summary.indices_deleted += indices_deleted;
    summary.remote_storage_errors += remote_storage_errors;
    assert_eq!(ancestor_layers_deleted, 0, "we do ancestor layer deletions below in gc_ancestor()");
```
```rust
// Check that we have a complete view of the latest shard count: this should always be the case unless we happened
// to scan the S3 bucket halfway through a shard split.
if shard_indices
    .iter()
    .filter(|i| i.shard_count == latest_count)
    .count()
    != latest_count.count() as usize
{
    // This should be extremely rare, so we warn on it.
    tracing::warn!(%tenant_id, "Missed some shards at count {:?}", latest_count);
    continue;
}

// Check if we have any non-latest-count shards
if shard_indices.len() == latest_count.count() as usize {
    tracing::debug!(%tenant_id, "No ancestor shards to clean up");
    continue;
}

// GC ancestor shards
for ancestor_shard in shard_indices
    .into_iter()
    .filter(|i| i.shard_count != latest_count)
```
Readability: please use some `itertools::partition` here, so the `.filter()` condition doesn't have to be repeated and we operate on proper names instead of `shard_indices`.
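For illustration, a sketch of partitioning once up front so both halves get names. This uses a stand-in `ShardEntry` type rather than the PR's real shard index type, and `std`'s `Iterator::partition` (itertools offers variants of the same idea):

```rust
// Illustrative only: stand-in for the real shard index type.
#[derive(Debug, Clone)]
struct ShardEntry {
    shard_count: u8,
}

// Split entries into (latest-count shards, ancestor shards) in one pass,
// so the predicate is written exactly once.
fn split_latest_vs_ancestor(
    entries: Vec<ShardEntry>,
    latest_count: u8,
) -> (Vec<ShardEntry>, Vec<ShardEntry>) {
    entries
        .into_iter()
        .partition(|e| e.shard_count == latest_count)
}
```

Downstream code can then loop over `ancestors` directly instead of re-filtering `shard_indices`.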
```rust
// Map of cross-shard layer references, giving a refcount for each layer in each shard that is referenced by some other
// shard in the same tenant. This is sparse! The vast majority of timelines will have no cross-shard refs, and those that
// do have cross shard refs should eventually drop most of them via compaction.
type AncestorRefs = BTreeMap<TenantTimelineId, HashMap<(ShardIndex, LayerName), usize>>;
```
I would prefer this to be a newtype with a well-defined interface instead of the broad `BTreeMap` API, ideally in its own `mod { ... }` block so visibility rules are enforced. Mostly because it makes the callsites more expressive.
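A rough sketch of what that newtype could look like. The key types here are simplified stand-ins (strings and a `u8`) for `TenantTimelineId` / `ShardIndex` / `LayerName`; the point is only the narrow, intention-revealing interface and the privacy of the inner map:

```rust
mod ancestor_refs {
    use std::collections::{BTreeMap, HashMap};

    // Newtype over the nested map; `inner` stays private to this module,
    // so callers can only go through the named methods below.
    pub struct AncestorRefs {
        // ttid -> (shard index, layer name) -> refcount
        inner: BTreeMap<String, HashMap<(u8, String), usize>>,
    }

    impl AncestorRefs {
        pub fn new() -> Self {
            Self { inner: BTreeMap::new() }
        }

        pub fn add_ref(&mut self, ttid: &str, shard: u8, layer: &str) {
            *self
                .inner
                .entry(ttid.to_string())
                .or_default()
                .entry((shard, layer.to_string()))
                .or_insert(0) += 1;
        }

        pub fn is_referenced(&self, ttid: &str, shard: u8, layer: &str) -> bool {
            self.inner
                .get(ttid)
                .map_or(false, |m| m.contains_key(&(shard, layer.to_string())))
        }
    }
}
```

A callsite then reads as `refs.is_referenced(...)` rather than a chain of map lookups.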
```diff
@@ -206,6 +444,8 @@ pub async fn pageserver_physical_gc(
 }
```
If we bail here, the accumulator doesn't contain this `ttid`'s references. Thus we don't have the global view and must not do `gc_ancestor`. But currently we do.
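One way to address this, sketched with hypothetical names: instead of bailing out of the accumulation loop, record that the view is incomplete and gate the ancestor GC pass on that flag.

```rust
// Illustrative stand-in for the real accumulator; refcounts keyed by
// (ttid, layer name) just to keep the sketch self-contained.
#[derive(Default)]
struct RefAccumulator {
    refs: std::collections::HashMap<(String, String), usize>,
    incomplete: bool,
}

impl RefAccumulator {
    fn add(&mut self, ttid: &str, layer: &str) {
        *self
            .refs
            .entry((ttid.to_string(), layer.to_string()))
            .or_insert(0) += 1;
    }

    // Called where the current code bails: the global view is now lost.
    fn mark_incomplete(&mut self) {
        self.incomplete = true;
    }

    /// gc_ancestor deletes anything unreferenced, so it is only safe when
    /// every ttid's references actually made it into the accumulator.
    fn may_run_gc_ancestor(&self) -> bool {
        !self.incomplete
    }
}
```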
```rust
// match ttid_refs {
//     None => {
//         tracing::debug!(%ttid, "no refs for this ttid, {} ttids have refs", refs.len());
//     },
//     Some(ttid_refs) => {
//         tracing::debug!("references for this ttid:");
//         for (layer_name, shard_index) in ttid_refs {
//             tracing::debug!("{}: {}", layer_name, shard_index);
//         }
//     }
// }
```
Commented-out code tends to rot. I'm OK with `if false { ... }`.
```diff
@@ -92,7 +163,7 @@ async fn maybe_delete_index(
         None => {
             tracing::warn!("Missing last_modified");
             summary.remote_storage_errors += 1;
-            return;
+            return false;
         }
         Some(last_modified) => {
             let last_modified =
```
The S3 SDK implements an `impl TryFrom<DateTime> for SystemTime`.
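To illustrate the pattern the reviewer is pointing at, here is a sketch with a stand-in `SdkDateTime` type (seconds plus subsecond nanos, like the SDK's `DateTime`); the real conversion ships with the AWS SDK types crate and would be used instead of hand-rolling it:

```rust
use std::convert::TryFrom;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Hypothetical stand-in for the SDK's DateTime, for illustration only.
struct SdkDateTime {
    seconds: i64,
    subsec_nanos: u32,
}

impl TryFrom<SdkDateTime> for SystemTime {
    type Error = &'static str;

    fn try_from(dt: SdkDateTime) -> Result<Self, Self::Error> {
        if dt.seconds < 0 {
            // Pre-epoch timestamps would need subtraction; keep the sketch simple.
            return Err("pre-epoch timestamp");
        }
        Ok(UNIX_EPOCH + Duration::new(dt.seconds as u64, dt.subsec_nanos))
    }
}
```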
```rust
tracing::info!(
    "Skipping young object {} < {}",
    age.as_secs_f64(),
    min_age.as_secs_f64()
```
I think sprinkling in some `humantime::Duration` would be appreciated by whatever poor soul will have to read these logs?
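For a sense of the difference: a minimal hand-rolled sketch of the kind of output human-readable duration formatting gives (e.g. `1h 2m 5s` instead of `3725.000`). In the real code one would simply wrap the `Duration` with the `humantime` crate's formatting rather than writing this helper:

```rust
use std::time::Duration;

// Toy formatter, for illustration only; humantime covers days, millis, etc.
fn human_duration(d: Duration) -> String {
    let secs = d.as_secs();
    let (h, m, s) = (secs / 3600, (secs % 3600) / 60, secs % 60);
    let mut parts = Vec::new();
    if h > 0 {
        parts.push(format!("{h}h"));
    }
    if m > 0 {
        parts.push(format!("{m}m"));
    }
    if s > 0 || parts.is_empty() {
        parts.push(format!("{s}s"));
    }
    parts.join(" ")
}
```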
Problem
After a shard split, the pageserver leaves the ancestor shard's content in place. It may be referenced by child shards, but eventually child shards will de-reference most ancestor layers as they write their own data and do GC. We would like to eventually clean up those ancestor layers to reclaim space.
Summary of changes
- `--mode=full`, which includes cleaning up unreferenced ancestor shard layers
- `test_scrubber_physical_gc_ancestors`
Checklist before requesting a review
Checklist before merging