RFC: direct IO for Pageserver #8240

problame · 2024-07-02T17:47:04Z

Rendered

refs #8130

github-actions · 2024-07-02T18:32:37Z

3042 tests run: 2925 passed, 2 failed, 115 skipped (full report)

Failures on Postgres 15

test_pageserver_metrics_removed_after_detach: debug

Failures on Postgres 14

test_pg_regress[4]: debug

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_pg_regress[debug-pg14-4] or test_pageserver_metrics_removed_after_detach[debug-pg15]"

Flaky tests (2)

Postgres 14

test_pg_regress[4]: debug
test_tenant_creation_fails: debug

Test coverage report is not available

The comment gets automatically updated with the latest test results: 02ce39f at 2024-07-09T09:45:25.659Z

jcsp · 2024-07-08T12:00:29Z

docs/rfcs/034-direct-io-for-pageserver.md

+2. all indirect blocks (=disk btree blocks) are cached in the PS `PageCache`.
+The norm will be very low baseline replacement rates in PS `PageCache`.
+High baseline replacement rates will be treated as a signal of resource exhaustion (page cache insufficient to host working set of the PS).
+It will be remediated by the storage controller, migrating tenants away to relieve pressure.


Probably worth caveating this with a note that this RFC/project doesn't cover such migration.

jcsp · 2024-07-08T12:04:28Z

docs/rfcs/034-direct-io-for-pageserver.md

+The bulk of the design & coding work is to ensure adherence to the alignment requirements.
+
+Our automated benchmarks are insufficient to rule out performance regressions.
+Manual benchmarking / new automated benchmarks will be required for the last two items (new PS PageCache size, avoiding regressions).


Let's specify what these benchmarks should be, to avoid open-ended benchmarking work

02ce39f

I'll leave this open because I can imagine this won't be sufficient for your taste, but ENOTIME at this point.

yliang412 · 2024-07-09T13:45:03Z

docs/rfcs/034-direct-io-for-pageserver.md

+
+The **buffer pool** mentioned above will be a load-bearing component.
+Its basic function is to provide callers with a memory buffer of adequate alignment and size (statx `stx_dio_mem_align` / `stx_dio_offset_align`).
+Callers `get()` a buffer from the pool. Size is specified at `get` time and is fixed (not growable).


My understanding is that the buffers are non-growable, but each buffer can have a different size? I was thinking the buffer pool could hand out buffers that all have the same size. Would that make the alignment problem easier to solve?

I might be missing some details, but are the buffer pool memory buffers allocated using the regular memory allocator or does it require some special mechanism?

yliang412 · 2024-07-09T14:01:50Z

docs/rfcs/034-direct-io-for-pageserver.md

+For example, a 10 byte sized read or write to offset 5000 in a file will load the file contents
+at offset `[4096,8192)` into a free page in the kernel page cache. If necessary, it will evict
+other pages to make room (cf eviction). Then, the kernel performs a memory-to-memory copy of 10 bytes
+from/to the offset `4` (`5000 = 4096 + 4`) within the cached page. If it's a write, the kernel keeps


Suggested change

-from/to the offset `4` (`5000 = 4096 + 4`) within the cached page. If it's a write, the kernel keeps
+from/to the offset `904` (`5000 = 4096 + 904`) within the cached page. If it's a write, the kernel keeps
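
The arithmetic behind this suggestion can be checked with a small sketch. It assumes a 4096-byte page, as the quoted RFC text does; the function name `page_and_offset` is made up for illustration and does not come from the RFC or the Pageserver code.

```rust
/// Page size assumed by the quoted RFC example (4 KiB).
const PAGE_SIZE: u64 = 4096;

/// Returns (start of the page containing `file_offset`, offset within that page).
/// Hypothetical helper, for illustration only.
pub fn page_and_offset(file_offset: u64) -> (u64, u64) {
    // Round down to the nearest multiple of PAGE_SIZE.
    let page_start = file_offset / PAGE_SIZE * PAGE_SIZE;
    (page_start, file_offset - page_start)
}
```

For file offset 5000 this yields page start 4096 (so the cached range is `[4096,8192)`) and an in-page offset of 904, which matches the corrected line.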

koivunej · 2024-07-10T15:49:13Z

docs/rfcs/034-direct-io-for-pageserver.md

+each buffer with all thread-local executors. However, the above API requirements for the buffer pool implicitly require the buffer
+handle that's returned by `get()` to be a custom smart pointer type. We will be able to extend it in the future to include the
+io_uring registered buffer index without having to touch the entire code base.
+


During the call I was thinking that even with current-day APIs (`alloc::alloc::alloc` or `GlobalAlloc::alloc`) we can construct an `alloc::alloc::Layout` at runtime for properly aligned and sized buffers. With the likely move towards jemalloc, we might be able to get away with buffer pool == `tokio::sync::Semaphore` + jemalloc (initially?).

For this idea, jemalloc is probably not a requirement. Pretty sure all allocators have thread local pools.

Together with the work-stealing executor, there might be a risk of one thread ending up doing all of the allocations and growing its thread-local pool, but then again, it could be that on average it actually works out.
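
A minimal sketch of the idea in this comment, assuming only the standard library allocator API: a fixed-size buffer allocated through a runtime-constructed `Layout`, wrapped in a small smart pointer type. The name `AlignedBuf` is hypothetical, and the `tokio::sync::Semaphore` part of the suggestion (limiting how many buffers exist at once) is deliberately left out to keep the sketch dependency-free.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::ops::{Deref, DerefMut};
use std::ptr::NonNull;

/// Hypothetical fixed-size (non-growable) buffer whose start address honors `align`.
pub struct AlignedBuf {
    ptr: NonNull<u8>,
    layout: Layout,
}

impl AlignedBuf {
    /// Returns `None` if `size` is zero, `align` is not a power of two,
    /// or the allocator fails.
    pub fn new(size: usize, align: usize) -> Option<Self> {
        if size == 0 {
            return None;
        }
        let layout = Layout::from_size_align(size, align).ok()?;
        // SAFETY: `layout` has a non-zero size (checked above).
        let raw = unsafe { alloc_zeroed(layout) };
        NonNull::new(raw).map(|ptr| AlignedBuf { ptr, layout })
    }
}

impl Deref for AlignedBuf {
    type Target = [u8];
    fn deref(&self) -> &[u8] {
        // SAFETY: `ptr` points to `layout.size()` zero-initialized bytes that we own.
        unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.layout.size()) }
    }
}

impl DerefMut for AlignedBuf {
    fn deref_mut(&mut self) -> &mut [u8] {
        // SAFETY: same as `deref`, and `&mut self` guarantees exclusive access.
        unsafe { std::slice::from_raw_parts_mut(self.ptr.as_ptr(), self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: `ptr` was allocated with exactly this layout.
        unsafe { dealloc(self.ptr.as_ptr(), self.layout) }
    }
}
```

With this shape, `AlignedBuf::new(8192, 4096)` would hand back an 8 KiB buffer whose address is a multiple of 4096. Whether the global allocator serves such requests efficiently from thread-local pools is exactly the open question raised above, and this sketch does not answer it.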

problame

Clarifications / discussions:

- `VirtualFile` will not be used for buffered IO anymore
  - => tenant conf and similar should just use `tokio::fs`
  - Arpad: concerns about not using the `VirtualFile` fd cache
- RFC is unclear about whether buffer pool buffers are all the same size or not
  - Christian thinks we can get away with one single buffer / IO size (8K)
  - No Plan B for if we cannot (except using jemalloc, see "option C" in "Discussion" section below)

Discussion:

- Vlad: prefetch: do we currently rely on the kernel prefetch for perf?
  - compaction, esp. the upcoming k-merge compaction, most likely does
  - Will it still be there with direct IO?
- fixed buffer pool size? What if we run out of memory?
  - option A: "wait for free" mechanism has synchronization overhead and can deadlock
  - option B: "fail with error, ask client to retry with back-off"
    - this would probably be the correct approach for responding to overload
    - page_service protocol doesn't provide this option, though
  - option C: use jemalloc as the buffer pool impl, do not limit buffer pool size
    - use jemalloc for both regular allocations and buffer pool
    - PS DRAM consumption would then be: fixed PageCache + jemalloc
    - => this seems attractive because it punts on drawbacks of (A) and (B) and de-risks OOMs
- Vlad: plans for new metrics such as queue depth?
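
To make the fixed-size trade-off from the discussion concrete, here is a rough sketch of option B ("fail with error") for a pool of same-sized 8K buffers. All names (`BufferPool`, `PooledBuf`, `PoolExhausted`, `try_get`) are hypothetical and not taken from the RFC. The buffers are plain heap allocations, so the alignment requirement is not addressed here.

```rust
use std::sync::{Arc, Mutex};

/// Single buffer size, following the "one single buffer / IO size (8K)" idea.
pub const BUF_SIZE: usize = 8192;

/// Hypothetical fixed-capacity pool; all buffers are allocated up front.
pub struct BufferPool {
    free: Mutex<Vec<Box<[u8]>>>,
}

/// Error returned when no buffer is free (option B: caller retries with back-off).
#[derive(Debug, PartialEq)]
pub struct PoolExhausted;

/// Handle that gives its buffer back to the pool when dropped.
pub struct PooledBuf {
    buf: Option<Box<[u8]>>,
    pool: Arc<BufferPool>,
}

impl BufferPool {
    pub fn new(capacity: usize) -> Arc<Self> {
        let free: Vec<Box<[u8]>> = (0..capacity)
            .map(|_| vec![0u8; BUF_SIZE].into_boxed_slice())
            .collect();
        Arc::new(BufferPool { free: Mutex::new(free) })
    }

    /// Never blocks: reports exhaustion instead of waiting for a free buffer.
    pub fn try_get(self: &Arc<Self>) -> Result<PooledBuf, PoolExhausted> {
        let buf = self.free.lock().unwrap().pop().ok_or(PoolExhausted)?;
        Ok(PooledBuf { buf: Some(buf), pool: Arc::clone(self) })
    }

    /// Number of buffers currently free.
    pub fn available(&self) -> usize {
        self.free.lock().unwrap().len()
    }
}

impl PooledBuf {
    pub fn as_mut_slice(&mut self) -> &mut [u8] {
        self.buf.as_deref_mut().expect("buffer is present until drop")
    }
}

impl Drop for PooledBuf {
    fn drop(&mut self) {
        // Return the buffer so later `try_get` calls can reuse it.
        if let Some(buf) = self.buf.take() {
            self.pool.free.lock().unwrap().push(buf);
        }
    }
}
```

Option A would replace the immediate `PoolExhausted` error with waiting until a handle is dropped, which is where the synchronization overhead and deadlock concerns come from. Option C would drop the fixed capacity entirely and allocate on demand.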

problame · 2024-07-31T08:52:47Z

cc #8543

Interactions With Other Features

This work & rollout should complete before Direct IO is enabled because Direct IO would double the IOPS & latency for each compaction read (#8240).

part of #8184

# Problem

We want to bypass PS PageCache for all data block reads, but `compact_level0_phase1` currently uses `ValueRef::load` to load the WAL records from delta layers. Internally, that maps to `FileBlockReader::read_blk` which hits the PageCache [here](https://github.com/neondatabase/neon/blob/e78341e1c220625d9bfa3f08632bd5cfb8e6a876/pageserver/src/tenant/block_io.rs#L229-L236).

# Solution

This PR adds a mode for `compact_level0_phase1` that uses the `MergeIterator` for reading the `Value`s from the delta layer files.

`MergeIterator` is a streaming k-merge that uses vectored blob_io under the hood, which bypasses the PS PageCache for data blocks.

Other notable changes:

* change the `DiskBtreeReader::into_stream` to buffer the node, instead of holding a `PageCache` `PageReadGuard`.
  * Without this, we run out of page cache slots in `test_pageserver_compaction_smoke`.
  * Generally, `PageReadGuard`s aren't supposed to be held across await points, so, this is a general bugfix.

# Testing / Validation / Performance

`MergeIterator` has not yet been used in production; it's being developed as part of

* #8002

Therefore, this PR adds a validation mode that compares the existing approach's value iterator with the new approach's stream output, item by item. If they're not identical, we log a warning / fail the unit/regression test. To avoid flooding the logs, we apply a global rate limit of once per 10 seconds. In any case, we use the existing approach's value.

Expected performance impact that will be monitored in staging / nightly benchmarks / eventually pre-prod:

* with validation:
  * increased CPU usage
  * ~doubled VirtualFile read bytes/second metric
  * no change in disk IO usage because the kernel page cache will likely have the pages buffered on the second read
* without validation:
  * slightly higher DRAM usage because each iterator participating in the k-merge has a dedicated buffer (as opposed to before, where compactions would rely on the PS PageCache as a shared evicting buffer)
  * less disk IO if previously there were repeat PageCache misses (likely case on a busy production Pageserver)
  * lower CPU usage: PageCache out of the picture, fewer syscalls are made (vectored blob io batches reads)

# Rollout

The new code is used with validation mode enabled-by-default. This gets us validation everywhere by default, specifically in

- Rust unit tests
- Python tests
- Nightly pagebench (shouldn't really matter)
- Staging

Before the next release, I'll merge the following aws.git PR that configures prod to continue using the existing behavior:

* neondatabase/infra#1663

# Interactions With Other Features

This work & rollout should complete before Direct IO is enabled because Direct IO would double the IOPS & latency for each compaction read (#8240).

# Future Work

The streaming k-merge's memory usage is proportional to the amount of memory per participating layer.

But `compact_level0_phase1` still loads all keys into memory for `all_keys_iter`. Thus, it continues to have active memory usage proportional to the number of keys involved in the compaction.

Future work should replace `all_keys_iter` with a streaming keys iterator. This PR has a draft in its first commit, which I later reverted because it's not necessary to achieve the goal of this PR / issue #8184.
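
As a rough illustration of the two ideas in this description, the sketch below shows a generic streaming k-way merge over sorted inputs and an item-by-item comparison against a "load everything, then sort" baseline. It is not the actual `MergeIterator`: keys are plain `u64`s, there is no vectored IO, and the once-per-10-seconds log rate limit is omitted. The names `KMerge`, `load_all_sorted`, and `count_mismatches` are made up for this example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Streaming k-way merge over inputs that are each already sorted ascending.
/// Only one pending item per input is held in memory at a time.
pub struct KMerge<I: Iterator<Item = u64>> {
    sources: Vec<I>,
    // Min-heap of (next key, index of the source it came from).
    heap: BinaryHeap<Reverse<(u64, usize)>>,
}

impl<I: Iterator<Item = u64>> KMerge<I> {
    pub fn new(mut sources: Vec<I>) -> Self {
        let mut heap = BinaryHeap::new();
        for (idx, src) in sources.iter_mut().enumerate() {
            if let Some(key) = src.next() {
                heap.push(Reverse((key, idx)));
            }
        }
        KMerge { sources, heap }
    }
}

impl<I: Iterator<Item = u64>> Iterator for KMerge<I> {
    type Item = u64;
    fn next(&mut self) -> Option<u64> {
        let Reverse((key, idx)) = self.heap.pop()?;
        // Refill from the source that just yielded the smallest key.
        if let Some(next_key) = self.sources[idx].next() {
            self.heap.push(Reverse((next_key, idx)));
        }
        Some(key)
    }
}

/// Stand-in for the existing approach: materialize all keys, then sort.
pub fn load_all_sorted(layers: &[Vec<u64>]) -> Vec<u64> {
    let mut all: Vec<u64> = layers.iter().flatten().copied().collect();
    all.sort();
    all
}

/// Validation mode: walk both outputs in lockstep and count positions that differ
/// (a length difference counts as one mismatch per extra item).
pub fn count_mismatches(
    mut existing: impl Iterator<Item = u64>,
    mut new: impl Iterator<Item = u64>,
) -> usize {
    let mut mismatches = 0;
    loop {
        match (existing.next(), new.next()) {
            (None, None) => return mismatches,
            (a, b) if a == b => {}
            _ => mismatches += 1,
        }
    }
}
```

In a validation-style setup, a caller would keep using the baseline's values and only log or fail a test when `count_mismatches` is non-zero, mirroring the "in any case, we use the existing approach's value" behavior described above.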

problame added 6 commits July 8, 2024 11:51

placeholder

5ce932a

summary, terminology, history

82c30ac

motivation

b3c95a5

glossary: mention alignment requirements

e831072

wrap up

f92138d

rename file

b7b9be1

problame force-pushed the problame/direct-io-rfc branch from c05178d to b7b9be1 Compare July 8, 2024 11:52

jcsp reviewed Jul 8, 2024

problame added 4 commits July 9, 2024 08:10

implement efficient buffer in phase 1

730a66a

io_uring registered buffers

36f9be1

caveat

255f822

benchmarking: refer to Definition of Done for metrics + bit more detail

02ce39f

problame changed the title ~~RFC: direct IO for reads~~ RFC: direct IO for Pageserver Jul 9, 2024

yliang412 reviewed Jul 9, 2024

koivunej reviewed Jul 10, 2024

problame commented Jul 10, 2024

problame mentioned this pull request Jul 30, 2024

compaction_level0_phase1: bypass PS PageCache for data blocks #8543

Merged
