storage scrubber: GC ancestor shard layers #8196

Open
jcsp wants to merge 2 commits into main from jcsp/issue-7043-ancestral-gc
Conversation

@jcsp (Contributor) commented on Jun 28, 2024

Problem

After a shard split, the pageserver leaves the ancestor shard's content in place. It may be referenced by child shards, but eventually child shards will de-reference most ancestor layers as they write their own data and do GC. We would like to eventually clean up those ancestor layers to reclaim space.

Summary of changes

  • Extend the physical GC command with --mode=full, which includes cleaning up unreferenced ancestor shard layers
  • Add test test_scrubber_physical_gc_ancestors
  • Remove colored log output: in testing it is irritating ANSI-code spam in the logs, and in interactive use it doesn't add much.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so it does not include the above checklist.

@jcsp added the t/feature (Issue type: feature, for new features or requests), c/storage/pageserver (Component: storage: pageserver), and c/storage/scrubber (Component: s3_scrubber) labels on Jun 28, 2024
@jcsp changed the title from Jcsp/issue 7043 ancestral gc to storage scrubber: GC ancestor shard layers on Jun 28, 2024

github-actions bot commented Jun 28, 2024

2958 tests run: 2841 passed, 0 failed, 117 skipped (full report)


Flaky tests (1)

Postgres 15

  • test_pg_regress[None]: debug

Code coverage* (full report)

  • functions: 32.7% (6913 of 21142 functions)
  • lines: 50.0% (54203 of 108388 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
88dfeaa at 2024-07-01T11:29:26.502Z :recycle:

@jcsp force-pushed the jcsp/issue-7043-ancestral-gc branch from 824449d to 88dfeaa on July 1, 2024 10:40
@jcsp requested a review from problame on July 1, 2024 18:35
@jcsp marked this pull request as ready for review on July 1, 2024 18:35
@problame (Contributor) left a comment

We should clean up the variable names when speaking about tenants (tenant_shard_id vs tenant_id).


I would prefer we call it parent shard, so at least there is a minimal distinction from the word ancestor which we already use for Timeline.


What if the remote storage lists some but not all shards, e.g. because we're in the middle of Tenant::split_prepare?

Edit: ah, nice, you handled that.

Kind of an important thing to regress-test. Shouldn't be too hard to test this with a pausable failpoint, should it?

Edit 2: but, did you also consider the same condition for the stream_tenant_timelines call made by gc_ancestor? I think it can observe a tenant shard in S3 but some of its timelines' IndexParts haven't been uploaded by Tenant::split_prepare yet.

Let's make sure we understand which races can happen and why they are fine / how they are handled. And: where will we document this? module-level comment doesn't seem like the worst choice, but, hard to discover.


About AncestorRefs: why is it not a more physical refcount map, like one keyed on the absolute_key? Sure, a bit less memory efficient but much harder to screw up. Similarly, I don't think we should go through the trouble of only tracking the cross-ancestor references.

I'm feeling kind of strongly about this, I think the code could be much simpler and lower risk.

Comment on lines +237 to +238
# We will use a 1s threshold for deletion, let it pass
time.sleep(2)

Is that 1s referring to the min_age_secs=1 below?

Suggested change
# We will use a 1s threshold for deletion, let it pass
time.sleep(2)
# We will use a min_age_secs=1 second threshold for deletion, let it pass
time.sleep(2)

@@ -111,6 +112,14 @@ def test_scrubber_tenant_snapshot(neon_env_builder: NeonEnvBuilder, shard_count:
workload.validate()


def drop_local_state(env, tenant_id):
env.storage_controller.tenant_policy_update(tenant_id, {"placement": "Detached"})

Type hinting is useful for "find callers" down the road.
(This suggestion requires NeonEnv to be added to the imports)

Suggested change
def drop_local_state(env, tenant_id):
def drop_local_state(env: NeonEnv, tenant_id):

Comment on lines +483 to +487
while let Some(i) = timelines.next().await {
let tl_summary = i?;

summary.indices_deleted += tl_summary.indices_deleted;
summary.remote_storage_errors += tl_summary.remote_storage_errors;

This is not summing up ancestor_layers_deleted.
Let's assert it to be zero at this point.

Also, I would recommend destructuring to get compile-time errors when we extend GcSummary in the future

Suggested change
while let Some(i) = timelines.next().await {
    let tl_summary = i?;

    summary.indices_deleted += tl_summary.indices_deleted;
    summary.remote_storage_errors += tl_summary.remote_storage_errors;
while let Some(i) = timelines.next().await {
    let GcSummary { indices_deleted, remote_storage_errors, ancestor_layers_deleted } = i?;

    summary.indices_deleted += indices_deleted;
    summary.remote_storage_errors += remote_storage_errors;
    assert_eq!(ancestor_layers_deleted, 0, "we do ancestor layer deletions below in gc_ancestor()");

Comment on lines +109 to +131
// Check that we have a complete view of the latest shard count: this should always be the case unless we happened
// to scan the S3 bucket halfway through a shard split.
if shard_indices
.iter()
.filter(|i| i.shard_count == latest_count)
.count()
!= latest_count.count() as usize
{
// This should be extremely rare, so we warn on it.
tracing::warn!(%tenant_id, "Missed some shards at count {:?}", latest_count);
continue;
}

// Check if we have any non-latest-count shards
if shard_indices.len() == latest_count.count() as usize {
tracing::debug!(%tenant_id, "No ancestor shards to clean up");
continue;
}

// GC ancestor shards
for ancestor_shard in shard_indices
.into_iter()
.filter(|i| i.shard_count != latest_count)

Readability: please use something like itertools::partition here so the .filter() condition doesn't have to be repeated and we operate on proper names instead of shard_indices.
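
For illustration, a minimal sketch of that refactor using std's Iterator::partition (itertools::partition would work similarly, in place); any name not present in the snippet above is an assumption:

// Split the indices once, so each group gets a proper name and the predicate
// is written exactly once.
let (latest_shards, ancestor_shards): (Vec<_>, Vec<_>) = shard_indices
    .into_iter()
    .partition(|i| i.shard_count == latest_count);

if latest_shards.len() != latest_count.count() as usize {
    // Incomplete view of the latest shard count: probably mid-split, skip this tenant.
    tracing::warn!(%tenant_id, "Missed some shards at count {:?}", latest_count);
    continue;
}

if ancestor_shards.is_empty() {
    tracing::debug!(%tenant_id, "No ancestor shards to clean up");
    continue;
}

for ancestor_shard in ancestor_shards {
    // ... GC this ancestor shard as in the original loop body ...
}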

Comment on lines +49 to +52
// Map of cross-shard layer references, giving a refcount for each layer in each shard that is referenced by some other
// shard in the same tenant. This is sparse! The vast majority of timelines will have no cross-shard refs, and those that
// do have cross shard refs should eventually drop most of them via compaction.
type AncestorRefs = BTreeMap<TenantTimelineId, HashMap<(ShardIndex, LayerName), usize>>;

I would prefer this to be a newtype with a well-defined interface instead of the broad BTreeMap API. Ideally in its own mod { ... } block so visibility rules are enforced.
Mostly because it makes the callsites more expressive.
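
For illustration, a rough sketch of such a newtype; the method names are mine, LayerName is assumed to be Clone + Hash + Eq, and TenantTimelineId/ShardIndex/LayerName are the existing crate types:

mod ancestor_refs {
    use std::collections::{BTreeMap, HashMap};

    use super::{LayerName, ShardIndex, TenantTimelineId};

    /// Sparse map of cross-shard layer references: only timelines that actually
    /// have cross-shard refs get an entry.
    #[derive(Default)]
    pub struct AncestorRefs(BTreeMap<TenantTimelineId, HashMap<(ShardIndex, LayerName), usize>>);

    impl AncestorRefs {
        /// Record one reference from a child shard to `layer` in ancestor shard `shard`.
        pub fn add_ref(&mut self, ttid: TenantTimelineId, shard: ShardIndex, layer: LayerName) {
            *self
                .0
                .entry(ttid)
                .or_default()
                .entry((shard, layer))
                .or_insert(0) += 1;
        }

        /// True if some child shard still references `layer` in ancestor shard `shard`.
        pub fn is_referenced(&self, ttid: &TenantTimelineId, shard: ShardIndex, layer: &LayerName) -> bool {
            self.0
                .get(ttid)
                .is_some_and(|refs| refs.contains_key(&(shard, layer.clone())))
        }
    }
}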

@@ -206,6 +444,8 @@ pub async fn pageserver_physical_gc(
}

If we bail here, the accumulator doesn't contain this ttid's references. Thus we don't have the global view and must not do gc_ancestor. But currently we do.

Comment on lines +323 to +335
// match ttid_refs {
// None => {
// tracing::debug!(%ttid, "no refs for this ttid, {} ttids have refs", refs.len());
// },
// Some(ttid_refs) => {

// tracing::debug!("references for this ttid:");
// for (layer_name, shard_index) in ttid_refs {
// tracing::debug!("{}: {}", layer_name, shard_index);
// }
// }

// }

Commented-out code tends to rot. I'm ok with if false { ... }.

@@ -92,7 +163,7 @@ async fn maybe_delete_index(
None => {
tracing::warn!("Missing last_modified");
summary.remote_storage_errors += 1;
return;
return false;
}
Some(last_modified) => {
let last_modified =

The S3 SDK implements an

impl TryFrom<DateTime> for SystemTime
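
For illustration, a sketch of how the Some arm above could use that conversion (assuming last_modified is an owned aws_smithy_types::DateTime; this is not the PR's code):

Some(last_modified) => {
    // DateTime -> SystemTime can fail for timestamps SystemTime cannot represent.
    let last_modified = match std::time::SystemTime::try_from(last_modified) {
        Ok(t) => t,
        Err(e) => {
            tracing::warn!("Invalid last_modified: {e}");
            summary.remote_storage_errors += 1;
            return false;
        }
    };
    // ... compare last_modified against min_age as before ...
}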

Comment on lines 189 to 192
tracing::info!(
"Skipping young object {} < {}",
age.as_secs_f64(),
min_age.as_secs_f64()

I think sprinkling in some humantime::Duration would be appreciated by whatever poor soul will have to read these logs?
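
For example (a sketch, assuming age and min_age are std::time::Duration values as in the call above):

// humantime renders Durations as e.g. "2m 3s" instead of raw fractional seconds.
tracing::info!(
    "Skipping young object {} < {}",
    humantime::format_duration(age),
    humantime::format_duration(min_age)
);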
