compaction_level0_phase1: bypass PS PageCache for data blocks #8543

Merged: 14 commits merged into main on Jul 31, 2024

Conversation


@problame problame commented Jul 29, 2024

part of #8184

Problem

We want to bypass the PS PageCache for all data block reads, but compact_level0_phase1 currently uses ValueRef::load to load the WAL records from delta layers.
Internally, that maps to FileBlockReader::read_blk, which hits the PageCache here: https://github.com/neondatabase/neon/blob/e78341e1c220625d9bfa3f08632bd5cfb8e6a876/pageserver/src/tenant/block_io.rs#L229-L236

Solution

This PR adds a mode for compact_level0_phase1 that uses the MergeIterator for reading the Values from the delta layer files.

MergeIterator is a streaming k-merge that uses vectored blob_io under the hood, which bypasses the PS PageCache for data blocks.
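For illustration only, a streaming k-merge over already-sorted per-layer iterators can be sketched as below. This is not the actual MergeIterator implementation; the (key, lsn, value) tuple shape, the function name, and the types are assumptions made for brevity.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge several iterators that each yield (key, lsn, value) in ascending
/// (key, lsn) order into one globally sorted stream. Only one buffered item
/// per participating iterator is held at any time.
fn k_merge<I>(inputs: Vec<I>) -> impl Iterator<Item = (u64, u64, Vec<u8>)>
where
    I: Iterator<Item = (u64, u64, Vec<u8>)>,
{
    let mut iters = inputs;
    // Min-heap keyed on (key, lsn); the input index breaks ties deterministically.
    let mut heap = BinaryHeap::new();
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some((key, lsn, value)) = it.next() {
            heap.push(Reverse((key, lsn, idx, value)));
        }
    }
    std::iter::from_fn(move || {
        let Reverse((key, lsn, idx, value)) = heap.pop()?;
        // Refill from the iterator we just drew from, keeping the heap at
        // one pending item per input.
        if let Some((k, l, v)) = iters[idx].next() {
            heap.push(Reverse((k, l, idx, v)));
        }
        Some((key, lsn, value))
    })
}
```

The relevant property is that memory stays at one buffered item per participating layer instead of going through a shared evicting cache; the real MergeIterator additionally issues vectored blob_io reads under the hood.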

Other notable changes:

  • change DiskBtreeReader::into_stream to buffer the node instead of holding a PageCache PageReadGuard (see the sketch after this list).
    • Without this, we run out of page cache slots in test_pageserver_compaction_smoke.
    • Generally, PageReadGuards aren't supposed to be held across await points, so this is a general bugfix.
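A hedged sketch of the pattern, with stand-in types rather than the real DiskBtreeReader or page cache API: copy the node into an owned buffer and drop the guard before any await point, so the cache slot is released immediately.

```rust
// Stand-in types for illustration only; the real pageserver types differ.
struct PageCache;
struct PageReadGuard<'a> {
    data: &'a [u8; 8192],
}

impl PageCache {
    fn read_page(&self, _blkno: u32) -> PageReadGuard<'_> {
        unimplemented!("placeholder for the real page cache lookup")
    }
}

async fn process_node(_buf: [u8; 8192]) {
    // ... decode the btree node, possibly awaiting further reads ...
}

async fn visit_node(cache: &PageCache, blkno: u32) {
    // Anti-pattern: holding the guard across an await pins a cache slot for
    // the whole duration; with many concurrent iterators and few slots, the
    // cache can run out of free slots.
    //
    //     let guard = cache.read_page(blkno);
    //     process_node(*guard.data).await; // guard still alive across the await
    //
    // Fix pattern: buffer the node, drop the guard, then await.
    let buf: [u8; 8192] = {
        let guard = cache.read_page(blkno);
        *guard.data
    }; // guard dropped here, slot released
    process_node(buf).await;
}
```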

Testing / Validation / Performance

MergeIterator has not yet been used in production; it's being developed as part of

  • #8002
Therefore, this PR adds a validation mode that compares the existing approach's value iterator with the new approach's stream output, item by item.
If they're not identical, we log a warning / fail the unit/regression test.
To avoid flooding the logs, we apply a global rate limit of once per 10 seconds.
In any case, we use the existing approach's value.
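A minimal sketch of that validation loop follows; it is not the actual pageserver code, and the Item type, the 10-second constant, and the logging call are simplified stand-ins.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Debug, PartialEq)]
struct Item {
    key: u64,
    lsn: u64,
    value: Vec<u8>,
}

const WARN_RATE_LIMIT: Duration = Duration::from_secs(10);

/// Walk the legacy iterator and the new merge stream in lockstep, warn
/// (rate limited) on any mismatch, and always return the legacy items.
fn validate(
    legacy: impl Iterator<Item = Item>,
    mut new_stream: impl Iterator<Item = Item>,
) -> Vec<Item> {
    let mut last_warned: Option<Instant> = None;
    let mut out = Vec::new();
    for legacy_item in legacy {
        let new_item = new_stream.next();
        if new_item.as_ref() != Some(&legacy_item) {
            let now = Instant::now();
            // Global rate limit: at most one warning per 10 seconds.
            if last_warned.map_or(true, |t| now.duration_since(t) >= WARN_RATE_LIMIT) {
                eprintln!("validation mismatch: legacy={legacy_item:?} new={new_item:?}");
                last_warned = Some(now);
            }
        }
        // The existing approach's value is the one that gets used either way.
        out.push(legacy_item);
    }
    out
}
```

In the actual PR, a mismatch also fails the unit/regression tests; outside of tests it is only logged.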

Expected performance impact that will be monitored in staging / nightly benchmarks / eventually pre-prod:

  • with validation:
    • increased CPU usage
    • ~doubled VirtualFile read bytes/second metric
    • no change in disk IO usage because the kernel page cache will likely have the pages buffered on the second read
  • without validation:
    • slightly higher DRAM usage because each iterator participating in the k-merge has a dedicated buffer (as opposed to before, where compactions would rely on the PS PageCache as a shared evicting buffer)
    • less disk IO if previously there were repeat PageCache misses (the likely case on a busy production Pageserver)
    • lower CPU usage: with the PageCache out of the picture, fewer syscalls are made (vectored blob io batches reads)

Rollout

The new code is used with validation mode enabled by default.
This gets us validation everywhere by default, specifically in

  • Rust unit tests
  • Python tests
  • Nightly pagebench (shouldn't really matter)
  • Staging

Before the next release, I'll merge the following aws.git PR that configures prod to continue using the existing behavior:

  • neondatabase/infra#1663
Interactions With Other Features

This work & rollout should complete before Direct IO is enabled because Direct IO would double the IOPS & latency for each compaction read (#8240).

Future Work

The streaming k-merge's memory usage is proportional to the number of participating layers, since each layer's iterator holds one dedicated buffer.

But compact_level0_phase1 still loads all keys into memory for all_keys_iter.
Thus, it continues to have active memory usage proportional to the number of keys involved in the compaction.

Future work should replace all_keys_iter with a streaming keys iterator.
This PR has a draft in its first commit, which I later reverted because it's not necessary to achieve the goal of this PR / issue #8184.


github-actions bot commented Jul 29, 2024

3150 tests run: 3029 passed, 0 failed, 121 skipped (full report)


Code coverage* (full report)

  • functions: 32.5% (7029 of 21632 functions)
  • lines: 49.9% (55933 of 112088 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results (296a694 at 2024-07-31T09:54:42.125Z).


problame commented Jul 29, 2024

So, interesting test failures:

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8543/10147986168/index.html

test_pageserver_compaction_smoke[release-pg16]

fixtures.pageserver.http.PageserverApiException: IoError: Failed to read immutable buf: timeout: there were page guards alive for all page cache slots

Looking at the test code:

# Effectively disable the page cache to rely only on image layers
# to shorten reads.
neon_env_builder.pageserver_config_override = """
page_cache_size=10
"""

I think we run out of PageCache slots because each delta layer's iterator holds one BlockLease::PageReadGuard open here:

let node_buf = block_cursor
    .read_blk(self.start_blk + node_blknum, ctx)
    .await?;

This is a trade-off, but it's probably the right trade-off at this time.
See comments added for details.
@problame problame changed the title from "[DO NOT REVIEW] bypass PS PageCache for compaction_level0_phase1" to "compaction_level0_phase1: bypass PS PageCache for data blocks" Jul 29, 2024
@problame problame requested a review from skyzh July 29, 2024 18:29

@skyzh skyzh left a comment

Using the k-merge iterator, the same-key handling looks correct under the original logic. Also, as a runtime sanity check, we could assert that the key-LSNs from all_values_iter come in exactly the same order as what we had before (all_keys).
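A hedged sketch of that runtime sanity check, assuming simplified (key, lsn) tuples rather than the real pageserver key/LSN types:

```rust
/// Assert that the k-merge stream yields (key, lsn) pairs in exactly the
/// same order as the previously collected `all_keys`.
fn assert_same_order(
    merged: impl Iterator<Item = (u64, u64)>, // (key, lsn) from all_values_iter
    all_keys: &[(u64, u64)],                  // (key, lsn) collected the old way
) {
    let mut expected = all_keys.iter();
    for got in merged {
        match expected.next() {
            Some(&want) => assert_eq!(
                got, want,
                "k-merge produced a different (key, lsn) order than all_keys"
            ),
            None => panic!("k-merge produced more items than all_keys"),
        }
    }
    assert!(
        expected.next().is_none(),
        "k-merge produced fewer items than all_keys"
    );
}
```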

(review thread on pageserver/src/tenant/timeline/compaction.rs, resolved)
@problame problame marked this pull request as ready for review July 30, 2024 17:53
@problame problame requested a review from a team as a code owner July 30, 2024 17:53
@problame problame requested review from VladLazar and skyzh July 30, 2024 17:53

@skyzh skyzh left a comment

Testing plan LGTM.

(review threads on pageserver/src/tenant/timeline/compaction.rs, resolved)
@problame problame enabled auto-merge (squash) July 31, 2024 09:34
@problame problame disabled auto-merge July 31, 2024 09:55

@problame problame merged commit 4825b0f into main Jul 31, 2024
65 checks passed
@problame problame deleted the problame/byapss-pagecache-for-compactlevel0phase1 branch July 31, 2024 12:18
@problame

Post-merge testing / validation update:

  1. CI pipeline duration: from ~30min to
  2. Validation overhead in staging

CI pipeline duration

No impact beyond noise on regress tests duration:

Before: https://github.com/neondatabase/neon/actions/runs/10168623495/job/28124265193

After: https://github.com/neondatabase/neon/actions/runs/10180033285/job/28157886382

Staging basic testing

Before

Setup: pump a bunch of data into a fresh staging project with 2 vCPUs. Pumping a lot of data creates compaction work for the pageserver. Watch the Pageserver's Page Cache Dashboard before and after deploying this PR.

To create a bunch of compaction work, I ran

admin@neon-us-east-2-pgbench:~$ pgbench -i -s 3424 -I dtG 'postgresql://neondb_owner:<MASKED>@ep-flat-bonus-w2d18yi4.us-east-2.aws.neon.build/neondb?sslmode=require'

It ran from 11:58 UTC for 850 s, finishing at 12:12 UTC.

However, the above pgbench ingests data faster than compaction can keep up with, and we don't have compaction backpressure, so we run into

 skipping image layer generation due to L0 compaction did not include all layers.

until 2024-07-31 13:42:32.909

The Grafana logs from that compaction run are here.
The last iteration that then actually created image layers took 1379 seconds 🤯.
There are more image layer creations after that.
Anyway, that's not a problem caused by this PR.

PageCache dashboard during this period:

[screenshot: PageCache dashboard]

After

The PR merged as tag 5753 at 12:17 and was deployed to staging via this pipeline, which finished deploying at 13:57.

I deleted the above project to stop the image layer creations.

Then I created a new staging project and repeated the above pgbench against it.

postgresql://neondb_owner:<MASKED>@ep-lucky-hill-w2y61c8c.us-east-2.aws.neon.build/neondb?sslmode=require

It started at ~14:02 and ended after 825 s, at 14:15.

Staging has validation mode enabled, so we'd expect the PageCache footprint to still show up; we laid out the expectations in the PR description.

Same pageserver (pageserver-23)

The compactions do take a bit longer: about 3.5 min before vs. about 4.5 min after.

Can't identify the expected changes in

  • higher CPU usage
  • VirtualFile read bytes/second

That could be due to the fact that there was concurrently running load on pageserver-23 during the "before" run, though:

[screenshot: pageserver-23 dashboard]

arpad-m pushed a commit that referenced this pull request Aug 5, 2024
skyzh added a commit that referenced this pull request Aug 7, 2024