Undo changes in Postgres core for building GIST/GIN indexes #415

knizhnik · 2024-04-20T06:30:43Z

Some Postgres indexes (GIN, GiST, SP-GiST, ...) use a two-phase build: in the first phase the relation pages are constructed, and in the second phase the whole relation is WAL-logged. This doesn't work with Neon, because if a dirty page is evicted from shared buffers before it has been WAL-logged, its content is lost.

We have added support for unlogged builds to the SMGR API, but it requires changes in Postgres core. Even worse, some extensions (e.g. pgvector) use the same approach and have to be patched as well.

This PR tries to avoid changes in Postgres core and handles the problem at the Neon extension level instead.

Make smgr API pluggable.

Add smgr_hook that can be used to define custom smgrs. Remove smgrsw[] array and smgr_sw selector. Instead, smgropen() loads f_smgr implementation using smgr_hook. Also add smgr_init_hook and smgr_shutdown_hook. And a lot of mechanical changes in smgr.c functions.

This patch is proposed to community: https://commitfest.postgresql.org/33/3216/

Author: anastasia <[email protected]>
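The hook-based selection described in this commit message can be sketched roughly as follows. This is a reduced, standalone model and not the patch itself: the `f_smgr` struct here has a single callback, and `model_smgropen`, `smgr_standard`, `remote_smgr_hook` and the `unsigned relnode` argument are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for PostgreSQL's f_smgr table. */
typedef struct f_smgr
{
    const char *name;
    int         (*smgr_nblocks) (unsigned relnode);
} f_smgr;

/* Hook type (assumed shape): given a relation, return the implementation. */
typedef const f_smgr *(*smgr_hook_type) (unsigned relnode);

/* NULL means no extension has installed a hook. */
static smgr_hook_type smgr_hook = NULL;

/* Built-in implementation, standing in for md.c. */
static int md_nblocks(unsigned relnode) { (void) relnode; return 1; }
static const f_smgr smgr_md = {"md", md_nblocks};

/* Default selector used when no hook is set. */
static const f_smgr *smgr_standard(unsigned relnode)
{
    (void) relnode;
    return &smgr_md;
}

/* Model of smgropen(): consult the hook instead of indexing a fixed array. */
static const f_smgr *model_smgropen(unsigned relnode)
{
    const f_smgr *impl;

    impl = (smgr_hook != NULL) ? smgr_hook(relnode) : smgr_standard(relnode);
    assert(impl != NULL);
    return impl;
}

/* What an extension might install from its initialization function. */
static int remote_nblocks(unsigned relnode) { (void) relnode; return 42; }
static const f_smgr smgr_remote = {"remote", remote_nblocks};

static const f_smgr *remote_smgr_hook(unsigned relnode)
{
    (void) relnode;
    return &smgr_remote;
}
```

With no hook set, `model_smgropen()` returns the built-in table. After `smgr_hook = remote_smgr_hook;` every open goes through the extension's table, which is the property that lets a storage manager live outside core.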

Add contrib/zenith that handles interaction with remote pagestore. To use it, add 'shared_preload_libraries = zenith' to postgresql.conf.

It adds a protocol for network communications (see libpagestore.c) and implements the smgr API.

It also adds several custom GUC variables:
- zenith.page_server_connstring
- zenith.callmemaybe_connstring
- zenith.zenith_timeline
- zenith.wal_redo

Authors: Stas Kelvich <[email protected]> Konstantin Knizhnik <[email protected]> Heikki Linnakangas <[email protected]>

Add WAL redo helper for zenith: an alternative postgres operation mode that replays WAL on pageserver request. To start postgres in wal-redo mode, run postgres with the --wal-redo option. It requires the zenith shared library and zenith.wal_redo.

Author: Heikki Linnakangas <[email protected]>

In the test_createdb test, we created a new database, and created a new branch after that. I was seeing the test fail with:

    PANIC: could not open critical system index 2662

The WAL contained records like this:

    rmgr: XLOG len (rec/tot): 49/ 8241, tx: 0, lsn: 0/0163E8F0, prev 0/0163C8A0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 1 FPW
    rmgr: XLOG len (rec/tot): 49/ 8241, tx: 0, lsn: 0/01640940, prev 0/0163E8F0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 2 FPW
    rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/01642990, prev 0/01640940, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
    rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/016429C8, prev 0/01642990, desc: CHECKPOINT_ONLINE redo 0/163C8A0; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
    rmgr: Database len (rec/tot): 42/ 42, tx: 540, lsn: 0/01642A40, prev 0/016429C8, desc: CREATE copy dir 1663/1 to 1663/16390
    rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/01642A70, prev 0/01642A40, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
    rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/01642AA8, prev 0/01642A70, desc: CHECKPOINT_ONLINE redo 0/1642A70; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
    rmgr: Transaction len (rec/tot): 66/ 66, tx: 540, lsn: 0/01642B20, prev 0/01642AA8, desc: COMMIT 2021-05-21 15:55:46.363728 EEST; inval msgs: catcache 21; sync
    rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/01642B68, prev 0/01642B20, desc: CHECKPOINT_SHUTDOWN redo 0/1642B68; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown

The compute node had correctly replayed all the WAL up to the last record, and opened up. But when you tried to connect to the new database, the very first requests for the critical relations, like pg_class, were made with request LSN 0/01642990. That's the last record that's applicable to a particular block. Because the database CREATE record didn't bump up the "last written LSN", the getpage requests were made with too old an LSN.

I fixed this by adding a SetLastWrittenLSN() call to the redo of database CREATE record. It probably wouldn't hurt to also throw in a call at the end of WAL replay, but let's see if we bump into more cases like this first.

This doesn't seem to be happening with page server as of 'main'; I was testing with a version where I had temporarily reverted all the recent changes to reconstruct control file, checkpoints, relmapper files etc. from the WAL records in the page server, so that the compute node was redoing all the WAL. I'm pretty sure we need this fix even with 'main', even though this test case wasn't failing there right now.
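The failure mode above can be illustrated with a small standalone model. It is only a sketch under assumptions: the real value lives in shared memory and the real SetLastWrittenLSN() signature is not shown in this commit message, so `model_set_last_written_lsn`, `model_request_lsn` and `model_redo_dbase_create` are hypothetical names, and whether the record's start or end LSN is stored is not modeled. The LSN constants are taken from the WAL dump above.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* LSNs are 64-bit positions in the WAL */

/* Hypothetical model of the shared "last written LSN" value. */
static XLogRecPtr last_written_lsn = 0;

/* The value only moves forward; return the value now in effect. */
static XLogRecPtr model_set_last_written_lsn(XLogRecPtr lsn)
{
    if (lsn > last_written_lsn)
        last_written_lsn = lsn;
    return last_written_lsn;
}

/* LSN that a getpage request to the page server would carry. */
static XLogRecPtr model_request_lsn(void)
{
    return last_written_lsn;
}

/*
 * Redo of a database CREATE record. It copies a whole directory rather
 * than modifying individual buffers, so without an explicit call nothing
 * advances the last written LSN (the bug). apply_fix models the added call.
 */
static void model_redo_dbase_create(XLogRecPtr lsn, int apply_fix)
{
    if (apply_fix)
        model_set_last_written_lsn(lsn);
}
```

Starting from 0/01642990 and replaying the CREATE record at 0/01642A40 without the fix leaves the request LSN at 0/01642990, so the page server is asked for a version of pg_class that predates the new database. With the fix the request LSN becomes 0/01642A40.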

Some operations in PostgreSQL are not WAL-logged at all (e.g. hint bits) or delay WAL-logging until the end of the operation (e.g. index build). So if such a page is evicted, we will lose the update. To fix this, we introduce a PD_WAL_LOGGED bit to track whether the page was WAL-logged. If the page is evicted before it has been WAL-logged, the zenith smgr creates an FPI for it.

Authors: Konstantin Knizhnik <[email protected]> anastasia <[email protected]>
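A minimal standalone sketch of the eviction rule described here, under stated assumptions: the bit value, the `ModelPage` layout and all `model_*` functions are hypothetical, and the model assumes that an unlogged modification clears the bit, which the commit message does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PD_WAL_LOGGED 0x0001  /* hypothetical bit value */

/* Reduced page: a flags word plus some content. */
typedef struct ModelPage
{
    uint16_t    flags;
    int         payload;
} ModelPage;

/* Counter standing in for full-page images written to the WAL stream. */
static int model_fpi_count = 0;

/* Emit a full-page image and remember that the page is now covered by WAL. */
static void model_log_full_page_image(ModelPage *page)
{
    model_fpi_count++;
    page->flags |= MODEL_PD_WAL_LOGGED;
}

/*
 * A change that is not WAL-logged at the time it is made, such as a hint
 * bit or a page written in the first phase of an index build.
 * Assumption: such a change clears the bit.
 */
static void model_unlogged_update(ModelPage *page, int payload)
{
    page->payload = payload;
    page->flags &= (uint16_t) ~MODEL_PD_WAL_LOGGED;
}

/*
 * Eviction path in the storage manager: if the page content has never
 * reached the WAL, log it now so the remote store can reconstruct it.
 * Returns true when an FPI had to be emitted.
 */
static bool model_evict(ModelPage *page)
{
    if (page->flags & MODEL_PD_WAL_LOGGED)
        return false;
    model_log_full_page_image(page);
    return true;
}
```

Evicting a page whose bit is set costs nothing extra. Evicting one that was modified without logging produces exactly one FPI, and a second eviction of the same unchanged page produces none.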

…t inside. WAL proposer (as bgw without BGWORKER_BACKEND_DATABASE_CONNECTION) previously ignored SetLatch, so once caught up it got stuck inside WalProposerPoll forever. Further, WaitEventSetWait didn't have a timeout, so we also didn't try to reconnect if all connections were dead. Fix that. Also move the break on latch set to the end of the loop, to attempt ReconnectWalKeepers even if the latch is constantly set. Per test_race_conditions (Python version now).

@knizhnik

This patch aims to make our bespoke WAL redo machinery more robust in the presence of untrusted (in other words, possibly malicious) inputs.

Pageserver delegates complex WAL decoding duties to postgres, which means that the latter might fall victim to carefully designed malicious WAL records and start doing harmful things to the system. To prevent this, it has been decided to limit possible interactions with the outside world using the Secure Computing BPF mode.

We use this mode to disable all syscalls not in the allowlist. Please refer to src/backend/postmaster/seccomp.c to learn more about the pros & cons of the current approach.

+ Fix some bugs in seccomp bpf wrapper
  * Use SCMP_ACT_TRAP instead of SCMP_ACT_KILL_PROCESS to receive signals.
  * Add a missing variant of select() syscall (thx to @knizhnik).
  * Write error messages to an fd stderr's currently pointing to.

…recovery. Rust's postgres_backend is currently too simplistic to handle it properly: reading happens in a separate thread which just ignores CopyDone. Instead, the writer thread must become aware of termination and send CommandComplete. Also, the reading socket must be transferred back to postgres_backend (or the connection terminated completely after COPY). Let's do that after more basic safekeeper refactoring, and right now cover this up to make tests pass. ref #388

* On demand downloading of SLRU segments
* Fix smgr_read_slru_segment
* Fix bug in SimpleLruDownloadSegment
* Determine SLRU kind in extension
* Use ctl->PagePrecedes for SLRU page comparison in SimpleLruDownloadSegment to address wraparound

---------

Co-authored-by: Konstantin Knizhnik <[email protected]>

…mary is not alive (#365)

* Set wasShutdown=true during hot-standby replica startup only when primary is not alive
* Report fatal error if hot standby replica is started with oldestActiveXid=0

Postgres part of neondatabase/neon#6705

---------

Co-authored-by: Konstantin Knizhnik <[email protected]> Co-authored-by: Heikki Linnakangas <[email protected]>

* Revert "Add comment explaining why it is safe to use FirstNormalTransactionXid for oldestActiveXid while replica startup (#389)"

  This reverts commit 1eeab2d.

* Revert "Set wasShutdown=true during hot-standby replica startup only when primary is not alive (#365)"

  This reverts commit b9336bc.

knizhnik · 2024-06-03T18:28:47Z

Replaced by #433

lubennikovaav and others added 30 commits February 6, 2024 13:05

lastWrittenPageLSN.patch

0b90dab

Save lastWrittenPageLSN in XLogCtlData to know what pages to request from remote pageserver. Authors: Konstantin Knizhnik <[email protected]> Heikki Linnakangas <[email protected]>

[walproposer] wal_proposer.patch

8eba458

Add WalProposer background worker to broadcast WAL stream to Zenith WAL acceptors Author: Konstantin Knizhnik <[email protected]>

persist_unlogged_tables.patch

94b84ef

Ignore unlogged table qualifier. Add respective changes to regression test outputs. Author: Konstantin Knizhnik <[email protected]>

fix_pg_table_size.patch

f29ac2c

Request relation size via smgr function, not just stat(filepath).

[walredo] fix_gin_redo.patch

0df3684

Author: Konstantin Knizhnik <[email protected]>

[walredo] fix_brin_redo.patch

de384f2

Author: Konstantin Knizhnik <[email protected]>

speculative_records_workaround.patch

860f839

wallog_t_ctid.patch

55dccb7

vacuumlazy_debug_stub.patch

94c7f11

[test] zenith_test_evict.patch

a5db8dc

fix_sequence_wallogging.patch

d782c1b

Bring back change that got lost in refactoring. silence ReadBuffer_co…

5718c7c

…mmon error. TODO: add a comment, why this is fine for zenith.

[contrib/zenith] [refer #225] if insert WAL position points at the en…

2fd3875

…d of WAL page header, then return it back to the page origin

[walproposer] Create replication slot for walproposer to avoid loose …

c58bf01

…of WAL at compute node + Check for presence of replication slot

[walproposer] Skip absent WAL segment removed by pg_resetwal

e7ef05c

[walproposer] Make it possible to start postgres without reading chec…

423f73b

…kpoint from WAL + Check for presence of zenith.signal file to allow skip reading checkpoint record from WAL + Pass prev_record_ptr through zenith.signal file to postgres

[walproposer] Simplify WL_LATCH_SET testing in the walproposer

f459143

[smgr_api] [contrib/zenith] 1. Do not call mdinit from smgrinit() bec…

34b85e8

…ause it causes a memory leak in wal-redo-postgres 2. Add check for local relations to make it possible to use DEBUG_COMPARE_LOCAL mode in SMGR + Call smgr_init_standard from smgr_init_zenith

[walproposer] [contrib/zenith] support zenith_tenant

7e2b417

This patch adds support for the zenith_tenant variable. It has a format similar to zenith_timeline. It is used in the callmemaybe query to pass the tenant to the pageserver, and in the ServerInfo structure passed to the WAL acceptor.

[walproposer] [contrib/zenith] [refer #395] Do not align start replicat…

616f45a

…ion position in wal_proposer to segment boundary

[test] Add contrib/zenith_test_utils with helpers for testing and deb…

65ce50b

…ugging. Now it contains only one function test_consume_xids() for xid wraparound testing.

[walproposer] Change condition for triggering recovery

ce222a3

knizhnik and others added 26 commits February 6, 2024 13:05

Make it possible to grant self created roles (#297)

56e27c3

Co-authored-by: Konstantin Knizhnik <[email protected]>

Define NEON_SMGR in smgr.h to make it possible for extensions to use …

394f32e

…extended Neon SMGR API (#299) Co-authored-by: Konstantin Knizhnik <[email protected]>

Request extension files and libraries from compute_ctl

4bbdda2

Neon logical replication support for PG14 (#309)

cadbbcc

* Neon logical replication support for PG14
* Log heap rewrite file after creation.

---------

Co-authored-by: Konstantin Knizhnik <[email protected]> Co-authored-by: Arseny Sher <[email protected]>

Fix elog format error in wallog_mapping_file (#315)

1ce5469

Co-authored-by: Konstantin Knizhnik <[email protected]>

Remove excessive walsender reply logging.

ea71170

Update WAL buffers when restoring WAL at compute needed for LR (#325)

f56cd58

* Update WAL buffers when restoring WAL at compute needed for LR
* Fix copying data in WAL buffers

---------

Co-authored-by: Konstantin Knizhnik <[email protected]>

Prevent output callbacks from hearing about neon-file messages (#330)

b2fc5c6

* Prevent output callbacks from hearing about neon-file messages

strncmp vs strcmp

546a6d1

Allow creating publications FOR ALL TABLES

ab9db70

Switch GetCurrentRoleId to GetUserId

0309285

Support creating subscriptions as neon_superuser

f378dcd

Fix 'mmap' DSM implementation with allocations larger than 4 GB

9dd9956

Fixes bug #18341. Backpatch to all supported versions. Discussion: https://www.postgresql.org/message-id/[email protected]

Flush logical messages with snapshots and replication origin (#381)

1710119

Co-authored-by: Konstantin Knizhnik <[email protected]>

Add comment explaining why it is safe to use FirstNormalTransactionXi…

1eeab2d

…d for oldestActiveXid while replica startup (#389) Co-authored-by: Konstantin Knizhnik <[email protected]>

Show information about local file cache in EXPLAIN ANALYZE (#386)

f49a962

Co-authored-by: Konstantin Knizhnik <[email protected]>

Treat walproposer like walsenders in postmaster.

b980d6f

This keeps the walproposer processes alive at shutdown, until after the shutdown checkpoint has been written. That gives the walproposers a chance to stream it to the safekeepers.

Fix bug introduced in 6969d90

3b09894

Remove Get/SetZenithCurrentClusterSize from Postgres core (#397)

c5d920a

Co-authored-by: Konstantin Knizhnik <[email protected]>

fix: XLogFlush replication slot drop (#396)

a7b4c66

Signed-off-by: Alex Chi Z <[email protected]>

Remember last written LSN when it is first requested (#412)

d9149dc

* Remember last written LSN when it is first requested
* Use rnode instead of rlocator
* Return updated LSN in SetLastWrittenLSN
* Remove wrong new line

---------

Co-authored-by: Konstantin Knizhnik <[email protected]>

Undo changes in Postgres core for building GIST/GIN indexes

417dce7

Add log_newpage_range_callback

5ed0718

tristan957 force-pushed the REL_14_STABLE_neon branch 2 times, most recently from b8e5379 to 21ec61d on May 20, 2024 14:48

knizhnik closed this Jun 3, 2024
