storage scrubber: GC ancestor shard layers #8196

Open
jcsp wants to merge 2 commits into main from jcsp/issue-7043-ancestral-gc
Conversation

@jcsp (Contributor) commented on Jun 28, 2024

Problem

After a shard split, the pageserver leaves the ancestor shard's content in place. It may be referenced by child shards, but eventually child shards will de-reference most ancestor layers as they write their own data and do GC. We would like to eventually clean up those ancestor layers to reclaim space.

Summary of changes

  • Extend the physical GC command with --mode=full, which includes cleaning up unreferenced ancestor shard layers
  • Add test test_scrubber_physical_gc_ancestors
  • Remove colored log output: in testing it is irritating ANSI-code spam in the logs, and in interactive use it doesn't add much.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so it does not include the above checklist.

@jcsp added the t/feature (Issue type: feature, for new features or requests), c/storage/pageserver (Component: storage: pageserver), and c/storage/scrubber (Component: s3_scrubber) labels on Jun 28, 2024
@jcsp changed the title from Jcsp/issue 7043 ancestral gc to storage scrubber: GC ancestor shard layers on Jun 28, 2024

github-actions bot commented Jun 28, 2024

2958 tests run: 2841 passed, 0 failed, 117 skipped (full report)


Flaky tests (1)

Postgres 15

  • test_pg_regress[None]: debug

Code coverage* (full report)

  • functions: 32.7% (6913 of 21142 functions)
  • lines: 50.0% (54203 of 108388 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
88dfeaa at 2024-07-01T11:29:26.502Z :recycle:

@jcsp force-pushed the jcsp/issue-7043-ancestral-gc branch from 824449d to 88dfeaa on July 1, 2024 10:40
@jcsp requested a review from problame on July 1, 2024 18:35
@jcsp marked this pull request as ready for review on July 1, 2024 18:35
@problame (Contributor) left a comment

We should clean up the variable names when speaking about tenants (tenant_shard_id vs tenant_id).


I would prefer we call it parent shard, so at least there is a minimal distinction from the word ancestor which we already use for Timeline.


What if the remote storage lists some but not all shards, e.g. because we're in the middle of Tenant::split_prepare?

Edit: ah, nice, you handled that.

Kind of an important thing to regress-test. Shouldn't be too hard to test this with a pausable failpoint, should it?

Edit 2: but, did you also consider the same condition for the stream_tenant_timelines call made by gc_ancestor? I think it can observe a tenant shard in S3 but some of its timelines' IndexParts haven't been uploaded by Tenant::split_prepare yet.

Let's make sure we understand which races can happen and why they are fine / how they are handled. And: where will we document this? module-level comment doesn't seem like the worst choice, but, hard to discover.


About AncestorRefs: why is it not a more physical refcount map, like one keyed on the absolute_key? Sure, a bit less memory efficient but much harder to screw up. Similarly, I don't think we should go through the trouble of only tracking the cross-ancestor references.

I'm feeling kind of strongly about this, I think the code could be much simpler and lower risk.

Comment on lines +237 to +238
# We will use a 1s threshold for deletion, let it pass
time.sleep(2)

Is that 1s referring to the min_age_secs=1 below?

Suggested change
# We will use a 1s threshold for deletion, let it pass
time.sleep(2)
# We will use a min_age_secs=1 second threshold for deletion, let it pass
time.sleep(2)

@@ -111,6 +112,14 @@ def test_scrubber_tenant_snapshot(neon_env_builder: NeonEnvBuilder, shard_count:
workload.validate()


def drop_local_state(env, tenant_id):
env.storage_controller.tenant_policy_update(tenant_id, {"placement": "Detached"})

Type hinting is useful for "find callers" down the road.
(This suggestion requires NeonEnv to be added to the imports)

Suggested change
def drop_local_state(env, tenant_id):
def drop_local_state(env: NeonEnv, tenant_id):

Comment on lines +483 to +487
while let Some(i) = timelines.next().await {
let tl_summary = i?;

summary.indices_deleted += tl_summary.indices_deleted;
summary.remote_storage_errors += tl_summary.remote_storage_errors;

This is not summing up ancestor_layers_deleted.
Let's assert it to be zero at this point.

Also, I would recommend destructuring to get compile-time errors when we extend GcSummary in the future

Suggested change
while let Some(i) = timelines.next().await {
    let tl_summary = i?;

    summary.indices_deleted += tl_summary.indices_deleted;
    summary.remote_storage_errors += tl_summary.remote_storage_errors;
while let Some(i) = timelines.next().await {
    let GcSummary { indices_deleted, remote_storage_errors, ancestor_layers_deleted } = i?;

    summary.indices_deleted += indices_deleted;
    summary.remote_storage_errors += remote_storage_errors;
    assert_eq!(ancestor_layers_deleted, 0, "we do ancestor layer deletions below in gc_ancestor()");

Comment on lines +109 to +131
// Check that we have a complete view of the latest shard count: this should always be the case unless we happened
// to scan the S3 bucket halfway through a shard split.
if shard_indices
.iter()
.filter(|i| i.shard_count == latest_count)
.count()
!= latest_count.count() as usize
{
// This should be extremely rare, so we warn on it.
tracing::warn!(%tenant_id, "Missed some shards at count {:?}", latest_count);
continue;
}

// Check if we have any non-latest-count shards
if shard_indices.len() == latest_count.count() as usize {
tracing::debug!(%tenant_id, "No ancestor shards to clean up");
continue;
}

// GC ancestor shards
for ancestor_shard in shard_indices
.into_iter()
.filter(|i| i.shard_count != latest_count)

Readability: please use something like itertools::partition here so the .filter() condition doesn't have to be repeated and we operate on proper names instead of shard_indices.
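
For illustration, a minimal sketch of that refactor using std's Iterator::partition (itertools::partition would work similarly, in place); any name not present in the snippet above is an assumption:

// Split the indices once, so each group gets a proper name and the predicate
// is written exactly once.
let (latest_shards, ancestor_shards): (Vec<_>, Vec<_>) = shard_indices
    .into_iter()
    .partition(|i| i.shard_count == latest_count);

if latest_shards.len() != latest_count.count() as usize {
    // Incomplete view of the latest shard count: probably mid-split, skip this tenant.
    tracing::warn!(%tenant_id, "Missed some shards at count {:?}", latest_count);
    continue;
}

if ancestor_shards.is_empty() {
    tracing::debug!(%tenant_id, "No ancestor shards to clean up");
    continue;
}

for ancestor_shard in ancestor_shards {
    // ... GC this ancestor shard as in the original loop body ...
}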

Comment on lines +49 to +52
// Map of cross-shard layer references, giving a refcount for each layer in each shard that is referenced by some other
// shard in the same tenant. This is sparse! The vast majority of timelines will have no cross-shard refs, and those that
// do have cross shard refs should eventually drop most of them via compaction.
type AncestorRefs = BTreeMap<TenantTimelineId, HashMap<(ShardIndex, LayerName), usize>>;

I would prefer this to be a newtype with a well-defined interface instead of the broad BTreeMap API. Ideally in its own mod { ... } block so visibility rules are enforced.
Mostly because it makes the callsites more expressive.
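
For illustration, a rough sketch of such a newtype; the method names are mine, LayerName is assumed to be Clone + Hash + Eq, and TenantTimelineId/ShardIndex/LayerName are the existing crate types:

mod ancestor_refs {
    use std::collections::{BTreeMap, HashMap};

    use super::{LayerName, ShardIndex, TenantTimelineId};

    /// Sparse map of cross-shard layer references: only timelines that actually
    /// have cross-shard refs get an entry.
    #[derive(Default)]
    pub struct AncestorRefs(BTreeMap<TenantTimelineId, HashMap<(ShardIndex, LayerName), usize>>);

    impl AncestorRefs {
        /// Record one reference from a child shard to `layer` in ancestor shard `shard`.
        pub fn add_ref(&mut self, ttid: TenantTimelineId, shard: ShardIndex, layer: LayerName) {
            *self
                .0
                .entry(ttid)
                .or_default()
                .entry((shard, layer))
                .or_insert(0) += 1;
        }

        /// True if some child shard still references `layer` in ancestor shard `shard`.
        pub fn is_referenced(&self, ttid: &TenantTimelineId, shard: ShardIndex, layer: &LayerName) -> bool {
            self.0
                .get(ttid)
                .is_some_and(|refs| refs.contains_key(&(shard, layer.clone())))
        }
    }
}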

@@ -206,6 +444,8 @@ pub async fn pageserver_physical_gc(
}

If we bail here, the accumulator doesn't contain this ttid's references. Thus we don't have the global view and must not do gc_ancestor. But currently we do.

Comment on lines +323 to +335
// match ttid_refs {
// None => {
// tracing::debug!(%ttid, "no refs for this ttid, {} ttids have refs", refs.len());
// },
// Some(ttid_refs) => {

// tracing::debug!("references for this ttid:");
// for (layer_name, shard_index) in ttid_refs {
// tracing::debug!("{}: {}", layer_name, shard_index);
// }
// }

// }

Commented-out code tends to rot. I'm ok with if false { ... }.

@@ -92,7 +163,7 @@ async fn maybe_delete_index(
None => {
tracing::warn!("Missing last_modified");
summary.remote_storage_errors += 1;
return;
return false;
}
Some(last_modified) => {
let last_modified =

The S3 SDK implements an

impl TryFrom<DateTime> for SystemTime
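
For illustration, a sketch of how the Some arm above could use that conversion (assuming last_modified is an owned aws_smithy_types::DateTime; this is not the PR's code):

Some(last_modified) => {
    // DateTime -> SystemTime can fail for timestamps SystemTime cannot represent.
    let last_modified = match std::time::SystemTime::try_from(last_modified) {
        Ok(t) => t,
        Err(e) => {
            tracing::warn!("Invalid last_modified: {e}");
            summary.remote_storage_errors += 1;
            return false;
        }
    };
    // ... compare last_modified against min_age as before ...
}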

Comment on lines 189 to 192
tracing::info!(
"Skipping young object {} < {}",
age.as_secs_f64(),
min_age.as_secs_f64()

I think sprinkling in some humantime::Duration would be appreciated by whatever poor soul will have to read these logs?
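
For example (a sketch, assuming age and min_age are std::time::Duration values as in the call above):

// humantime renders Durations as e.g. "2m 3s" instead of raw fractional seconds.
tracing::info!(
    "Skipping young object {} < {}",
    humantime::format_duration(age),
    humantime::format_duration(min_age)
);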
