fix unnecessary header overwrites #253

Make smgr API pluggable.
Add smgr_hook that can be used to define custom smgrs. Remove smgrsw[] array and smgr_sw selector. Instead, smgropen() loads f_smgr implementation using smgr_hook.
Also add smgr_init_hook and smgr_shutdown_hook. And a lot of mechanical changes in smgr.c functions.
This patch is proposed to community: https://commitfest.postgresql.org/33/3216/
Author: anastasia <[email protected]>
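The hook-based selection described above can be sketched roughly as follows. This is a minimal illustration, not the patch itself: the reduced `f_smgr` struct, the integer `relnode` argument, `smgr_select()`, and the "remote" implementation are all hypothetical stand-ins, and the real hook signature in the patch may differ.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for the f_smgr function table. */
typedef struct f_smgr
{
	const char *name;
	int			(*smgr_nblocks) (int relnode);
} f_smgr;

/* Hook type: given a relation, return the implementation to use (assumed shape). */
typedef const f_smgr *(*smgr_hook_type) (int relnode);

/* Extensions install their selector here, e.g. from _PG_init(). */
static smgr_hook_type smgr_hook = NULL;

/* Built-in implementation, standing in for md.c. */
static int
md_nblocks(int relnode)
{
	(void) relnode;
	return 0;
}

static const f_smgr smgr_md = {"md", md_nblocks};

/* What smgropen() might do instead of indexing into smgrsw[]. */
static const f_smgr *
smgr_select(int relnode)
{
	if (smgr_hook != NULL)
		return smgr_hook(relnode);
	return &smgr_md;
}

/* A custom implementation that an extension could provide (hypothetical). */
static int
remote_nblocks(int relnode)
{
	(void) relnode;
	return 42;
}

static const f_smgr smgr_remote = {"remote", remote_nblocks};

static const f_smgr *
remote_smgr_hook(int relnode)
{
	(void) relnode;
	return &smgr_remote;
}
```

With no hook installed, `smgr_select()` falls back to the built-in table; assigning `smgr_hook = remote_smgr_hook` redirects every subsequent lookup to the custom one.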

Add contrib/zenith that handles interaction with remote pagestore. To use it, add 'shared_preload_libraries = zenith' to postgresql.conf.
It adds a protocol for network communications (see libpagestore.c) and implements the smgr API.
It also adds several custom GUC variables:
- zenith.page_server_connstring
- zenith.callmemaybe_connstring
- zenith.zenith_timeline
- zenith.wal_redo
Authors: Stas Kelvich <[email protected]>
Konstantin Knizhnik <[email protected]>
Heikki Linnakangas <[email protected]>

Add WAL redo helper for zenith: an alternative postgres operation mode to replay WAL at the pageserver's request.
To start postgres in wal-redo mode, run postgres with the --wal-redo option. It requires the zenith shared library and zenith.wal_redo.
Author: Heikki Linnakangas <[email protected]>

Save lastWrittenPageLSN in XLogCtlData to know what pages to request from remote pageserver. Authors: Konstantin Knizhnik <[email protected]> Heikki Linnakangas <[email protected]>

In the test_createdb test, we created a new database, and created a new branch after that. I was seeing the test fail with:
    PANIC: could not open critical system index 2662
The WAL contained records like this:
    rmgr: XLOG len (rec/tot): 49/ 8241, tx: 0, lsn: 0/0163E8F0, prev 0/0163C8A0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 1 FPW
    rmgr: XLOG len (rec/tot): 49/ 8241, tx: 0, lsn: 0/01640940, prev 0/0163E8F0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 2 FPW
    rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/01642990, prev 0/01640940, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
    rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/016429C8, prev 0/01642990, desc: CHECKPOINT_ONLINE redo 0/163C8A0; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
    rmgr: Database len (rec/tot): 42/ 42, tx: 540, lsn: 0/01642A40, prev 0/016429C8, desc: CREATE copy dir 1663/1 to 1663/16390
    rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/01642A70, prev 0/01642A40, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
    rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/01642AA8, prev 0/01642A70, desc: CHECKPOINT_ONLINE redo 0/1642A70; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
    rmgr: Transaction len (rec/tot): 66/ 66, tx: 540, lsn: 0/01642B20, prev 0/01642AA8, desc: COMMIT 2021-05-21 15:55:46.363728 EEST; inval msgs: catcache 21; sync
    rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/01642B68, prev 0/01642B20, desc: CHECKPOINT_SHUTDOWN redo 0/1642B68; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
The compute node had correctly replayed all the WAL up to the last record, and opened up. But when you tried to connect to the new database, the very first requests for the critical relations, like pg_class, were made with request LSN 0/01642990. That's the last record that's applicable to a particular block. Because the database CREATE record didn't bump up the "last written LSN", the getpage requests were made with too old LSN.
I fixed this by adding a SetLastWrittenLSN() call to the redo of database CREATE record. It probably wouldn't hurt to also throw in a call at the end of WAL replay, but let's see if we bump into more cases like this first.
This doesn't seem to be happening with page server as of 'main'; I was testing with a version where I had temporarily reverted all the recent changes to reconstruct control file, checkpoints, relmapper files etc. from the WAL records in the page server, so that the compute node was redoing all the WAL. I'm pretty sure we need this fix even with 'main', even though this test case wasn't failing there right now.

Some operations in PostgreSQL are not WAL-logged at all (e.g. hint bits) or delay WAL-logging until the end of the operation (e.g. index build). So if such a page is evicted, we will lose the update.
To fix it, we introduce the PD_WAL_LOGGED bit to track whether the page was WAL-logged. If the page is evicted before it has been WAL-logged, then zenith smgr creates an FPI for it.
Authors: Konstantin Knizhnik <[email protected]>
anastasia <[email protected]>

Add WalProposer background worker to broadcast WAL stream to Zenith WAL acceptors Author: Konstantin Knizhnik <[email protected]>

Ignore unlogged table qualifier. Add respective changes to regression test outputs. Author: Konstantin Knizhnik <[email protected]>

Request relation size via smgr function, not just stat(filepath).

Author: Konstantin Knizhnik <[email protected]>

…mmon error. TODO: add a comment, why this is fine for zenith.

…d of WAL page header, then return it back to the page origin

…of WAL at compute node + Check for presence of replication slot

…t inside. WAL proposer (as bgw without BGWORKER_BACKEND_DATABASE_CONNECTION) previously ignored SetLatch, so once caught up it got stuck inside WalProposerPoll indefinitely.
Further, WaitEventSetWait didn't have a timeout, so we didn't try to reconnect if all connections were dead either. Fix that.
Also move the break on latch set to the end of the loop to attempt ReconnectWalKeepers even if the latch is constantly set.
Per test_race_conditions (Python version now).

…kpoint from WAL
+ Check for presence of zenith.signal file to allow skipping reading of the checkpoint record from WAL
+ Pass prev_record_ptr through zenith.signal file to postgres

@knizhnik

This patch aims to make our bespoke WAL redo machinery more robust in the presence of untrusted (in other words, possibly malicious) inputs.
Pageserver delegates complex WAL decoding duties to postgres, which means that the latter might fall victim to carefully designed malicious WAL records and start doing harmful things to the system. To prevent this, it has been decided to limit possible interactions with the outside world using the Secure Computing BPF mode.
We use this mode to disable all syscalls not in the allowlist. Please refer to src/backend/postmaster/seccomp.c to learn more about the pros & cons of the current approach.
+ Fix some bugs in seccomp bpf wrapper
* Use SCMP_ACT_TRAP instead of SCMP_ACT_KILL_PROCESS to receive signals.
* Add a missing variant of select() syscall (thx to @knizhnik).
* Write error messages to the fd that stderr is currently pointing to.

…ause it causes memory leak in wal-redo-postgres
2. Add check for local relations to make it possible to use DEBUG_COMPARE_LOCAL mode in SMGR
+ Call smgr_init_standard from smgr_init_zenith

This patch adds support for the zenith_tenant variable. It has a format similar to zenith_timeline. It is used in the callmemaybe query to pass the tenant to pageserver, and in the ServerInfo structure passed to the wal acceptor.

…recovery. Rust's postgres_backend is currently too simplistic to handle it properly: reading happens in a separate thread which just ignores CopyDone. Instead, the writer thread must become aware of termination and send CommandComplete. Also the reading socket must be transferred back to postgres_backend (or the connection terminated completely after COPY). Let's do that after more basic safekeeper refactoring and right now cover this up to make tests pass.
ref #388

…ion position in wal_proposer to segment boundary

…ugging. Now it contains only one function test_consume_xids() for xid wraparound testing.

…ble in connections to pageserver. Token is passed as cleartext password.

…o CRC errors.
zenith_wallog_page() would call log_newpage() on a buffer, while holding merely a shared lock on the page. That's not cool, because another backend could modify the page concurrently. We allow changing hint bits while holding only a shared lock, and changes on FSM pages, at least. See comments in XLogSaveBufferForHint() for discussion of this problem.
One instance of the race condition that I was able to capture on my laptop happened like this:
1. Backend A: needs to evict an FSM page from the buffer cache to make room for a new page, and calls zenith_wallog_page() on it. That is done while holding a share lock on the page.
2. Backend A: XLogInsertRecord() computes the CRC of the FPI WAL record including the FSM page
3. Backend B: Updates the same FSM page while holding only a share lock
4. Backend A: Allocates space in the WAL buffers, and copies the WAL record header and the page to the buffers. At this point, the CRC that backend A computed earlier doesn't match the contents that were written out to the WAL buffers.
The update of the FSM page in backend B happened from there (fsmpage.c):
    /*
     * Update the next-target pointer. Note that we do this even if we're only
     * holding a shared lock, on the grounds that it's better to use a shared
     * lock and get a garbled next pointer every now and then, than take the
     * concurrency hit of an exclusive lock.
     *
     * Wrap-around is handled at the beginning of this function.
     */
    fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);
To fix, make a temporary copy of the page in zenith_wallog_page(), and WAL-log that. Just like XLogSaveBufferForHint() does.
Fixes neondatabase/neon#413
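The idea behind the fix can be illustrated with a small sketch: snapshot the shared page into private memory first, then compute the checksum and emit the record from that snapshot only. Everything here is hypothetical (`LoggedPage`, `toy_checksum()`, `log_page_from_copy()`, the 8192-byte page size), and the toy checksum merely stands in for the real WAL CRC.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 8192		/* assumed page size for this sketch */

/* Hypothetical stand-in for an FPI record: a checksum plus the page image. */
typedef struct LoggedPage
{
	uint32_t	checksum;
	unsigned char image[TOY_PAGE_SIZE];
} LoggedPage;

/* Toy checksum; stands in for the CRC computed over the WAL record. */
static uint32_t
toy_checksum(const unsigned char *data, size_t len)
{
	uint32_t	sum = 0;
	size_t		i;

	for (i = 0; i < len; i++)
		sum = sum * 31 + data[i];
	return sum;
}

/*
 * Copy the shared page into the record first, then checksum the copy.
 * Later changes to shared_page by other backends cannot invalidate the
 * checksum, because both steps read the private copy only.
 */
static void
log_page_from_copy(const unsigned char *shared_page, LoggedPage *out)
{
	memcpy(out->image, shared_page, TOY_PAGE_SIZE);
	out->checksum = toy_checksum(out->image, TOY_PAGE_SIZE);
}
```

The copy itself may still capture a page that is being modified under a shared lock, but the logged bytes and their checksum always agree, which is what the race above violated.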

The majority of work here is going to be heavily cleaned up soon, but it's worth giving a brief overview of the changes either way.
* Adds libpqwalproposer, serving a similar function to the existing libpqwalreceiver -- to provide access to libpq functions without causing problems from directly linking them.
* Adds two new state components, giving (a) the type of libpq-specific polling required to move on to the next protocol state and (b) the kind of socket events it's waiting on. (These are expected to be removed or heavily reworked soon.)
* Changes `WalProposerPoll` to make use of a slightly more specialized `AdvancePollState`, which has been completely reworked.

Add alternative output for tablespace test, because tablespaces are not supported in zenith yet

On the walproposer side,
- Change the voting flow so that the acceptor tells its epoch along with giving the vote, not before it; otherwise it might get immediately stale. #294
- Adjust to using separate structs for disk and network. ref #315

epochStartLsn is the LSN since which the new proposer writes its WAL in its epoch; let's be more explicit here. In several places it also actually meant something we call *commit_lsn* -- the latest LSN known to be reliably committed (it constantly moves within one wal proposer).
truncate_lsn is the LSN still needed by the most lagging safekeeper. restart_lsn is terminology from pg_replication_slots, but here we don't really have 'restart'; hopefully the word 'truncate' makes it clearer.

…nce. Cache relfilenode size returned by zenith_nblocks() and also update it when relation is extended. Don't update it from zenith_write() or zenith_wallog_page(), since there is no guarantee that these functions wouldn't be called for some page that is not the last one.
It can be configured with the zenith.relsize_hash_size GUC parameter. Set it to 0 to disable caching.

Closes #66. Mostly corresponds to cleaning up the states we store. Goes back to single states for each WalKeeper, and we perform blocking writes for everything but sending the WAL itself. A few things have been factored out into libpqwalproposer for simplicity - like handling the nonblocking status of the connection (even though it's only changed once).

@knizhnik

Now pageserver tracks only last_record_lsn and ignores last_valids_lsn. We can cause deadlock at start or extreme slowness during the normal work if we call get_page with LSN of incomplete record. Patch by @knizhnik

…irst WAL record of started compute node

…epers (#439).
It is intended to solve the following problems:
a) Chicken-or-the-egg one: compute postgres needs a data directory with non-rel files that are downloaded from pageserver by calling basebackup@LSN. This LSN is not arbitrary: it must include all previously committed transactions and is defined through consensus voting, which happens... in walproposer, a part of compute node.
b) Just guaranteeing such an LSN is not enough, we must also actually commit it and make sure there is a safekeeper who knows this LSN is committed so WAL before it can be streamed to pageserver -- otherwise basebackup will hang waiting for WAL. Advancing commit_lsn without playing the consensus game is impossible, so speculative 'let's just poll safekeepers, learn start LSN of future epoch and run basebackup' won't work.
Currently --sync-safekeepers is considered completed when 1) at least a majority of safekeepers and 2) *all* safekeepers with a live connection to walproposer switch to the new epoch and advance commit_lsn, allowing basebackup to proceed. 2) limits availability, but that's because currently we don't have a mechanism defining which safekeeper should stream WAL into pageserver.

And take initial value from freshly created slot position. Thus proposer always starts streaming from the record beginning; it simplifies WAL decoding on safekeeper.

Send *all* entries (from the beginning, i.e. truncateLsn) to everyone but donor who doesn't need recovery at all and will receive only new entries. This can be optimized to avoid sending data which is already persisted (and correct), but previous such optimization was incorrect.

I forgot to do that in 42316a8. Fixes segfault related to attempt to send the (garbage collected) message second time and queue advancement when donor doesn't restart.

Safekeepers who are in the same epoch as donor definitely have correct WAL, so we can send to them since their flushLsn. This required some additional fuss due to convention of always starting streaming at the record boundary.

    contrib/zenith/libpagestore.c: In function ‘zenith_connect’:
    contrib/zenith/libpagestore.c:125:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
      125 | const char **keywords = malloc((noptions + 1) * sizeof(*keywords));
          | ^~~~~
    src/backend/tcop/zenith_wal_redo.c:294:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
      294 | bool enable_seccomp = true;
          | ^~~~
In the passing, also move the 'n_synced' local variable closer to where it's used.

These could be used to fetch SLRUs and other non-relation things from the page server. But we don't do that, and have no plans in the near future.

- Remove unused 'system_id' field from ZenithRequest.
- Remove unused 'loaded' variable.
- Remove unused code to pack pageserver->client messages, and to unpack client->pageserver messages.
- Fix printing the response in debug message (was printing the request twice).
- Avoid the overhead of converting request/response to string, unless the debug message is really going to be printed.
- Formatting fixes.

- Use different message formats for different kinds of response messages.
- Add an Error response message, for passing errors from page server to Postgres. An Error response now results in an ereport(ERROR).
- Add a flag to requests, to indicate that we actually want the latest page version on the timeline, and the LSN is just a hint that we know that there haven't been any modifications since that LSN. It is currently always set to 'true', but once we start supporting read-only replicas, they would set it to false.
This changes the network postgres<->page server protocol, so this needs corresponding changes in the page server side.
Also refactor and fix the zm_to_string() function. The ZenithMessageStr array was broken, because the array indices didn't match the ZenithMessageTag enum values.

…oposer.c

PQgetCopyData can sometimes indicate that the copy is done if the backend returns an error response. So while we still expect that the walkeeper never sends CopyDone, we can't expect it to never produce errors.

Whatever the bug mentioned in the FIXME comment was with buffered I/O, it has been fixed now. This greatly reduces the amount of CPU time spent in WAL redo.

The fread() call required allowing the 'fstat' syscall in the seccomp configuration, and apparently on some platforms also 'newfstatat', as Max reported this error:
    Sep 28 15:56:55.522 ERRO wal-redo-postgres: ---------------------------------------
    Sep 28 15:56:55.522 ERRO wal-redo-postgres: seccomp: bad syscall 262
    Sep 28 15:56:55.522 ERRO wal-redo-postgres: ---------------------------------------
I'm afraid of allowing 'newfstatat'; that seems like it's opening too much attack surface, since it allows access to files by filename. Maybe it's OK, but I'm not sure. There isn't any fundamental reason why we'd need to call it, and I'm not sure why glibc's fread() wants to call it. So let's avoid the trouble by writing our own simple buffer over plain read().
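A simple buffer over plain read() could look roughly like the sketch below. It is an illustration under assumptions, not the code from this commit: `BufReader`, `buf_init()`, `buf_read()`, and the 8 kB buffer size are hypothetical names and choices. The only syscall it issues is read().

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical minimal replacement for stdio buffering. */
typedef struct BufReader
{
	int			fd;
	size_t		pos;			/* next unread byte in buf */
	size_t		len;			/* number of valid bytes in buf */
	char		buf[8192];
} BufReader;

static void
buf_init(BufReader *r, int fd)
{
	r->fd = fd;
	r->pos = 0;
	r->len = 0;
}

/*
 * Copy up to 'count' bytes into dst. Returns the number of bytes copied
 * (less than 'count' only at end of file), or -1 on a read() error.
 */
static ssize_t
buf_read(BufReader *r, void *dst, size_t count)
{
	size_t		done = 0;

	while (done < count)
	{
		size_t		avail;
		size_t		chunk;

		if (r->pos == r->len)
		{
			ssize_t		n;

			/* Refill the buffer, retrying if interrupted by a signal. */
			do
			{
				n = read(r->fd, r->buf, sizeof(r->buf));
			} while (n < 0 && errno == EINTR);

			if (n < 0)
				return -1;
			if (n == 0)
				break;			/* end of file */
			r->pos = 0;
			r->len = (size_t) n;
		}

		avail = r->len - r->pos;
		chunk = (count - done < avail) ? count - done : avail;
		memcpy((char *) dst + done, r->buf + r->pos, chunk);
		r->pos += chunk;
		done += chunk;
	}
	return (ssize_t) done;
}
```

Small requests are served from the in-memory buffer, so most calls never reach the kernel, which is the benefit buffered I/O provided in the first place.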

The smgr implementation needs to distinguish between unlogged/temp and regular 'permanent' relations, but the smgr API doesn't currently include that information. Add a 'relpersistence' field to SmgrRelationData, and as an argument to smgropen(). However, not all callers of smgropen() have a relcache entry at hand, so we allow some operations to pass 0, meaning 'unknown'.
Now that we can store unlogged tables locally, use the same machinery to handle the buffered GiST and SP-GiST index builds. They populate the index by inserting all the tuples, and use the shared buffer cache while they do that. They don't WAL-log the pages while they do that, they log the whole relation as a separate bulk operation after the build has finished. That poses a problem for Zenith, where smgrwrite() is a no-op and we rely on WAL-logging to reconstruct the pages. Solve that problem by storing the pages locally in the compute node, like an unlogged relation, until the index build finishes and all the pages have been WAL-logged. To do that, the smgr needs to know when the caller is an unlogged build operation like that, so add functions to the Smgr API for that.
With this commit, we no longer generate an FPI record whenever a rel is extended with an all-zeros page. See github issue #482. That greatly reduces the amount of WAL generated during bulk loading.

Queue was moved further than truncateLsn, when quorumLsn matched end of wal record in the middle of queue message. Fix cleanup of unreceived messages. Co-authored-by: Arseny Sher <[email protected]>

This changes the format of the 'zenith.signal' file. It is now a human-readable text file, with one line like "PREV LSN: 0/1234568", or "PREV LSN: none" if the prev LSN is not known, or "PREV LSN: invalid" if starting up in read-write is not allowed. Also, if 'zenith.signal' is present, don't try to read the checkpoint record from the WAL. Trust the copy in pg_control, instead.
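As an illustration of the text format described above, a reader for the single line might look like the sketch below. The enum, the function name `parse_prev_lsn()`, and the leniency about a trailing newline are assumptions made for this example; the actual parsing code in postgres may differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
typedef enum PrevLsnKind
{
	PREV_LSN_KNOWN,				/* "PREV LSN: 0/1234568" */
	PREV_LSN_NONE,				/* "PREV LSN: none" */
	PREV_LSN_INVALID,			/* "PREV LSN: invalid" */
	PREV_LSN_PARSE_ERROR
} PrevLsnKind;

/* Parse one line of the signal file; *lsn is set only for PREV_LSN_KNOWN. */
static PrevLsnKind
parse_prev_lsn(const char *line, uint64_t *lsn)
{
	static const char prefix[] = "PREV LSN: ";
	const size_t plen = sizeof(prefix) - 1;
	const char *value;
	size_t		vlen;
	unsigned int hi;
	unsigned int lo;
	int			consumed = 0;

	if (strncmp(line, prefix, plen) != 0)
		return PREV_LSN_PARSE_ERROR;

	value = line + plen;
	vlen = strcspn(value, "\r\n");	/* ignore a trailing newline */

	if (vlen == 4 && strncmp(value, "none", 4) == 0)
		return PREV_LSN_NONE;
	if (vlen == 7 && strncmp(value, "invalid", 7) == 0)
		return PREV_LSN_INVALID;

	/* LSNs are printed as two hex halves: high 32 bits / low 32 bits. */
	if (sscanf(value, "%X/%X%n", &hi, &lo, &consumed) == 2 &&
		(size_t) consumed == vlen)
	{
		*lsn = ((uint64_t) hi << 32) | lo;
		return PREV_LSN_KNOWN;
	}
	return PREV_LSN_PARSE_ERROR;
}
```

Keeping "none" and "invalid" as distinct outcomes matters because, per the description, only the latter forbids starting up in read-write mode.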

warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement] 364 | WalMessage *msgQueueAck = msgQueueHead;

To prevent loading them from pageserver. Author: Konstantin Knizhnik with my extension to VM as well.

At least currently, the risk of a busy loop (e.g. due to bugs) is much higher than the benefit of additional availability if we immediately reconnect; add an interval between the reconnection attempts.

Since c310932 safekeeper sometimes sends it. ref #843

truncateLsn is now advanced to `Min(walkeeper[i].feedback.flushLsn)`, taking epochs into account.

Also see 1632ea4 for details.

See corresponding zenith commit.

Move backpressure throttling from XLogInsert() to ProcessInterrupts(), to restrict writing operations outside of critical section.

This is needed for implementation of tenant rebalancing. With this change safekeeper becomes aware of which pageserver is supposed to be used for replication from this compute. This also changes the logic of substituting the auth token inside the connection string: it is now substituted during config variable parsing and is available for both the smgr pageserver connection and the walproposer safekeeper connection.

Now docker images are being built in zenith repo as that way we have sequential version number that allows us to compare compute/storage versions.

Implement async wp <-> sk protocol, send WAL messages ahead of feedback replies. New SS_ACTIVE state is introduced instead of former SS_SEND_WAL / SS_SEND_WAL_FLUSH / SS_RECV_FEEDBACK.

…ero flushLsn. Clean up backpressure defaults.

Now functions in walproposer.c go in chronological order

* Clean up walproposer states
* Migrate AsyncReadFixed to AsyncReadMessage
* Handle flushWrite a bit better
* Update SS_ACTIVE event set in single place
  Now event set is updated only in the end of HandleActiveState, after all handlers code was executed.
* Add comment on SS_ACTIVE write event
* Add TODO for SS_ACTIVE DesiredEvents

* Rename walkeeper to safekeeper
* Rename message variables as request/response

In the passing, switch a few places to ereport() instead of elog(), to avoid the overhead of constructing the string when it's not logged. Fixes neondatabase/neon#1066

Add extensible ZenithFeedback part to AppendResponse messages.
Pass values sizes together with keys in ZenithFeedback message.
Add standby_status_update fields into ZenithFeedback.
Get rid of diskConsistentLsn field in AppendResponse, because now it is sent via ZenithFeedback.
Fix calculation of diskConsistentLsn and instanceSize: take values from the latest reply from pageserver.

refer #1077

Use GUC zenith.max_cluster_size to set the limit.
If the limit is reached, extend requests will throw an out-of-space error. When the current size is too close to the limit, throw a warning.
Do not apply the size quota to the autovacuum process.
Add pg_cluster_size() function in zenith extension.

This reverts commit 45dd891. It introduced a stable test_isolation failure. There was an idea that adding strict backpressure settings would help, as the absence of this commit could behave as natural backpressure, but that didn't help. No better fix is immediately available, so let's revert until sorting this out.
ref neondatabase/neon#1238
ref neondatabase/neon#1239

The GUC is a 32-bit integer, so if the base unit is bytes, the max limit you can set is only 2 GB. Furthermore, the web console assumed that the unit is in MB, and set it to 10000 meaning 10 GB, but in reality it was set to just 10 kB.
Remove the WARNINGs related to cluster size limit. That was probably supposed to be DEBUG5 or something, because it's extremely noisy currently. You get the WARNING for *every block* when a relation is extended. Some kind of a WARNING when you approach the limit would make sense, but it's difficult to do in a sensible way with WARNINGs from the server. Firstly, most applications will ignore WARNINGs, in which case they don't accomplish anything. If an application forwards them to the user, that's not great either unless the application user happens to be the DBA. If you're lucky, the WARNINGs end up in an application log and the DBA is alerted, but printing the message for every relation extension is too noisy for that too. An email alert would probably be best, outside Postgres.
Also don't enforce the limit when extending a temporary or unlogged relation. They don't count towards the cluster size limit, so it seems weird to error out on them.
And reword the error message a bit.
Fixes neondatabase/neon#1233
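The unit arithmetic behind the first paragraph can be spelled out with a short sketch. The helper names are hypothetical, and the MB interpretation is an assumption based on what the web console expected; the values are assumed to be non-negative.

```c
#include <limits.h>
#include <stdint.h>

/* Effective limit in bytes if the GUC's base unit is bytes. */
static uint64_t
limit_bytes_unit_bytes(int guc_value)
{
	return (uint64_t) guc_value;
}

/* Effective limit in bytes if the GUC's base unit is MB (assumed new unit). */
static uint64_t
limit_bytes_unit_mb(int guc_value)
{
	return (uint64_t) guc_value * 1024 * 1024;
}
```

With bytes as the unit, the largest settable value is INT_MAX, which is just under 2 GB, and the console's 10000 becomes 10000 bytes (about 10 kB). With MB as the unit, the same 10000 becomes 10,485,760,000 bytes, roughly the 10 GB that was intended.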

If anything goes wrong while establishing a connection, don't leak the socket. Also, if you get an error while sending the GetPage request, kill the connection. It's not clear what state it's in, so better to reconnect.

Fixes neondatabase/neon#1224

Fixes neondatabase/neon#822

refer #1244

The constructed StringInfoData 'z' variable wasn't used for anything, we passed the original 's' StringInfo directly to ParseZenithFeedbackMessage. That's fine, but let's remove the dead code.

* Expose reading a relation page at a specific LSN
* Addressing comments

Use function pointer to perform a cross-extension calls.

refer #1262

refer #1077

Postgres can perform an smgrnblocks() call on the relation right after creating it, and we don't update the last-written LSN on smgrcreate(). Perhaps we should update last-written LSN, instead. This isn't bulletproof.

It might jump back (on compute) this way, which is not fatal but violates sanity checks.

* Enable dumping corrupt WAL segments
Add ability to dump WAL segments with corrupt page headers and records:
- skips over missing/broken page headers
- skips over misformatted log records
- allows dumping log records from a particular file starting from an optional offset (without a need of carefully crafted input)

WAL is no longer in memory to prevent OOM in the compute. Removed in-memory queue because it's not needed anymore. When streaming, WAL is now read directly from disk. Every safekeeper has a separate XLogReader. walproposer will now read as much WAL as it can for a single AppendRequest message, it can help with recovering lagging safekeepers. Because Recovery needs to save WAL for streaming, now walproposer can write WAL to disk and `--sync-safekeepers` mode will create pg_wal directory if needed. Replication slot `restart_lsn` is now synced with `truncate_lsn` to prevent truncation of disk WAL until needed.

Enforces reconnection soon when packets are dropped, e.g. after turning ec2 instance off. ref neondatabase/neon#1491

* Avoid redundant memory allocation and synchronization in walredo
* Address review comments
* Reduce number of temp buffers and size of inmem file storage for wal redo postgres
* Misc cleanup
  Add comments on 'inmem_smgr.c', remove superfluous copy-pasted comments, pgindent.
Co-authored-by: Heikki Linnakangas <[email protected]>

* Fix missed include for InRecovery
* Fix missed include for InRecovery (used only in debug version with --enable-cassert)

ExceptionalCondition calls getpid(), which is currently forbidden by seccomp. You only get there if something else went wrong, but the "bad syscall" error hides the underlying cause of the error, which makes debugging hard.

This error is happening in the 'pg_regress' test in the CI, but not on my laptop. Turn it into an ERROR, so that we get the error context and backtrace of it.

In the WAL redo process, even "permanent" buffers are stored in the local buffer cache. Need to pass RELPERSISTENCE_PERMANENT to smgropen() in that case.

That's a valid case, as edited comment says. neondatabase/neon#1303

* Perform inmem_smgr cleanup after processing each record
* Prevent eviction of wal redo target page
* Prevent eviction of wal redo target page from temp buffers

It's a waste of time, and otherwise you can run into the MAX_PAGES limit. Fixes neondatabase/neon#1615

…sages. To support remembering it on safekeeper. Currently compute doesn't know initial LSN on non-first boot (though it could get it from pageserver in theory), so we rely on safekeepers to fetch it back. While changing the protocol, also add node_id to AcceptorProposerGreeting.

If not, such basebackup (clog etc) is inconsistent and must be retaken. Basebackup LSN is taken by exposing xlog.c RedoStartLSN in shmem. ref neondatabase/neon#594

- extend zenith pageserver API to handle new request type; - add dbsize_hook to intercept db_dir_size() call.

To force making basebackup again.

This function is to simplify complex WAL generation in neondatabase/neon#1574.
`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we don't rely on the background WAL writer.
I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.

I'm seeing a lot of these warnings from B-tree SPLIT records:
    WARNING: inmem_write() called for 1663/12990/16397.0 blk 2630: used_pages 0
    CONTEXT: WAL redo at 1/235A1B50 for Btree/SPLIT_R: level 0, firstrightoff 368, newitemoff 408, postingoff 0
That seems OK, replaying a split record legitimately accesses many buffers: the left half, the right half, left sibling, right sibling, and child. We could bump up 'temp_buffers' (currently 4), but I didn't do that because it's also good to get some test coverage for the inmem_smgr.c.

At neondatabase/neon#1783 (comment), Kirill saw case where the WAL redo process failed to open /dev/null. That's pretty weird, and I have no idea what might be causing it, but with this patch we'll at least get a little more details if it happens again. This will print the OS error (with %m) if it happens, and also distinguishes between the two error cases that previously both emitted the 'failed to open a test file' error.

- zenith.page_server_connstring -> neon.pageserver_connstring
- zenith.zenith_tenant -> neon.tenant_id
- zenith.zenith_timeline -> neon.timeline_id
- zenith.max_cluster_size -> neon.max_cluster_size

as basebackup LSN always skips over page header

Part of neondatabase/neon#1838

* Do not allocate shared memory for wal_redo process
* Add comment

* Add check for NULL for malloc in InternalIpcMemoryCreate
* apply pgindent

- Fix typos
- Change Zenith -> Neon in the ZENITH_SMGR tag that's printed in error messages that is user-visible, and in various function names and comments that are not user-visible.
- pgindent
- Remove comment about zm_to_string() leaking memory. It doesn't.
- Re-word some error messages to match PostgreSQL error message style guide
- Cleanup logging style
- Don't print JWT token to log

Maintain cache of last written LSN for each relation segment (8 MB).
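To give a sense of the granularity, the sketch below maps a block number to its 8 MB chunk. It assumes the default 8 kB PostgreSQL block size, so one chunk covers 1024 blocks; the macro and function names are hypothetical and are not taken from the patch.

```c
#include <stdint.h>

#define SKETCH_BLCKSZ			8192u	/* assumed default block size */
#define SKETCH_CHUNK_BYTES		(8u * 1024u * 1024u)	/* 8 MB per cache entry */
#define SKETCH_BLOCKS_PER_CHUNK (SKETCH_CHUNK_BYTES / SKETCH_BLCKSZ)

/* Index of the 8 MB chunk that contains the given block of a relation. */
static uint32_t
chunk_for_block(uint32_t blkno)
{
	return blkno / SKETCH_BLOCKS_PER_CHUNK;
}
```

Under these assumptions blocks 0 through 1023 share one cache entry and block 1024 starts the next, so one last-written LSN covers 1024 consecutive pages.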

* Add uuid-ossp to the supported extensions
Also update compile flags to `-O2` to trade compile time for PostgreSQL performance, and remove --enable-cassert.

* Update last written LSN for gin/gist index metadata
* Replace SetLastWrittenLSN with family of SetLastWrittenLSNFor* functions

…183)
This reverts commit 7517d1c.
Revert "Large last written lsn cache (#177)"
This reverts commit 595ac69.

…192)
* Eliminate UnknownXLogRecPtr and always use InvalidXLogRecPtr instead
* Remove GetMinReplicaLsn function

* Initialize wal_redo_buffer after applying record with FPI
  refer #1915
* Update comment
* Update src/backend/tcop/zenith_wal_redo.c
* Update src/backend/tcop/zenith_wal_redo.c
Co-authored-by: Heikki Linnakangas <[email protected]>


Without this patch, on bootstrap XLP_FIRST_IS_CONTRECORD was always set in the header of the page where WAL writing continues. This confuses WAL decoding on safekeepers, making it think decoding starts in the middle of a record, leading to
    2022-08-12T17:48:13.816665Z ERROR {tid=37}: query handler for 'START_WAL_PUSH postgresql://no_user:@localhost:15050' failed: failed to run ReceiveWalConn
    Caused by:
        0: failed to process ProposerAcceptorMessage
        1: invalid xlog page header: unexpected XLP_FIRST_IS_CONTRECORD at 0/2CF8000

* Pull 99% of walproposer code into extension.
* Annotate nbytes to show it's used for asserts only, fixing one more warning.
* Fix makefiles:
  - Include neon extensions into contrib Makefile
  - Configure libpqwalproposer more like other extensions
* Add comment about lack of PG timelines, and make StartReplication static again.
* Fix some compiler warnings in vendor/postgres, and pull libpqwalproposer into vendor/neon
* Fix issue with makefile that didn't get caught in the normal test envs.

* Use ECR for image
* Keep arg consistent across dockerfiles
Co-authored-by: Rory de Zoete <[email protected]>

) It is not used anymore since neondatabase/neon#1872 Fixes neondatabase/cloud#2032

* Move backpressure throttling implementation to neon extension and add a function for monitoring throttling time
* Update src/include/miscadmin.h

Co-authored-by: Heikki Linnakangas <[email protected]>

…R to main

…alue of enable_seqscan_prefetch

…ion because spec_token is not wal logged (#221)
* Pin pages with speculative insert tuples to prevent their reconstruction because spec_token is not wal logged refer #2587
* Undo Neon trick in heap_xlog_insert which is not needed any more after pinning page for speculative insert
* Update src/backend/access/heap/heapam.c
* Move ReleaseBuffer to the end of heap_finish_speculative function

Co-authored-by: Heikki Linnakangas <[email protected]>

* Fix shared memory initialization for last written LSN cache
* Replace (from,till) with (from,n_blocks) for SetLastWrittenLSNForBlockRange function
* Fast exit from SetLastWrittenLSNForBlockRange for n_blocks == 0
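The `(from, n_blocks)` signature and the fast exit can be sketched as follows. This is only an illustration under stated assumptions: the flat array, its size, and every `sketch_` name are hypothetical stand-ins, and the real last written LSN cache is organized differently.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* PostgreSQL also uses a 64-bit LSN */
typedef uint32_t BlockNumber;

#define SKETCH_CACHE_BLOCKS 1024   /* hypothetical cache size */

/* Hypothetical per-block cache; zero means "no LSN recorded". */
XLogRecPtr sketch_lsn_cache[SKETCH_CACHE_BLOCKS];

/*
 * Record lsn for n_blocks blocks starting at from. Returns the number of
 * entries that were advanced. An empty range exits before touching the cache.
 */
uint32_t
sketch_set_last_written_lsn_for_block_range(XLogRecPtr lsn,
                                            BlockNumber from,
                                            BlockNumber n_blocks)
{
    uint32_t updated = 0;

    if (n_blocks == 0)
        return 0;               /* fast exit: nothing to record */

    for (BlockNumber i = 0; i < n_blocks; i++)
    {
        BlockNumber blk = from + i;

        if (blk >= SKETCH_CACHE_BLOCKS)
            break;              /* outside the illustrative cache */
        if (sketch_lsn_cache[blk] < lsn)
        {
            sketch_lsn_cache[blk] = lsn;   /* LSNs only move forward */
            updated++;
        }
    }
    assert(updated <= n_blocks);
    return updated;
}
```

Passing a count instead of an end bound makes the empty range unambiguous: `n_blocks == 0` means no blocks, whereas a `(from, till)` pair leaves open whether `till` is inclusive.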

…ForBlockRange (#230)

- Refactor the way the WalProposerMain function is called when started with --sync-safekeepers. The postgres binary now explicitly loads the 'neon.so' library and calls the WalProposerMain in it. This is simpler than the global function callback "hook" we previously used.
- Move the WAL redo process code to a new library, neon_walredo.so, and use the same mechanism as for --sync-safekeepers to call the WalRedoMain function, when launched with --walredo argument.
- Also move the seccomp code to neon_walredo.so library. I kept the configure check in the postgres side for now, though.

Fix indentation, remove unused definitions, resolve some FIXMEs.

Previously, we called PrefetchBuffer [NBlkScanned * seqscan_prefetch_buffers] times in each of those situations, but now only NBlkScanned. In addition, the prefetch mechanism for the vacuum scans is now based on blocks instead of tuples - improving the efficiency.

Parallel seqscans didn't take their parallelism into account when determining which block to prefetch, and vacuum's cleanup scan didn't correctly determine which blocks would need to be prefetched, and could get into an infinite loop.

* Use prefetch in pg_prewarm extension * Change prefetch order as suggested in review

* Update prefetch mechanisms:
  - **Enable enable_seqscan_prefetch by default**
  - Store prefetch distance in the relevant scan structs
  - Slow start sequential scan, to accommodate LIMIT clauses.
  - Replace seqscan_prefetch_buffer with the relations' tablespaces' *_io_concurrency; and drop seqscan_prefetch_buffer as a result.
  - Clarify enable_seqscan_prefetch GUC description
  - Fix prefetch in pg_prewarm
  - Add prefetching to autoprewarm worker
  - Fix an issue where we'd incorrectly not prefetch data when hitting a table wraparound. The same issue also resulted in assertion failures in debug builds.
  - Fix parallel scan prefetching - we didn't take into account that parallel scans have scan synchronization, too.
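The "slow start" idea can be sketched as a prefetch distance that ramps up toward a maximum, so that a scan cut short by a LIMIT clause issues few prefetches. The doubling ramp below is an assumption chosen for illustration, not necessarily the ramp used in this patch, and all names are hypothetical; `maximum` stands in for the tablespace's *_io_concurrency setting.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: given the current prefetch distance and the allowed
 * maximum, return the distance to use for the next block. Starts at 1 and
 * doubles until it reaches the maximum (assumed ramp, see lead-in).
 */
uint32_t
sketch_next_prefetch_distance(uint32_t current, uint32_t maximum)
{
    uint32_t next;

    if (maximum == 0)
        return 0;                       /* prefetching disabled */

    next = (current == 0) ? 1 : current * 2;
    if (next > maximum)
        next = maximum;                 /* never exceed the configured cap */

    assert(next >= 1 && next <= maximum);
    return next;
}
```

With a maximum of 8, successive calls starting from 0 give distances 1, 2, 4, 8 and then stay at 8, so a query that stops after the first few blocks never pays for the full prefetch window.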

#244)
* Maintain last written LSN for each page to enable prefetch on vacuum, delete and other massive update operations
* Move PageSetLSN in heap_xlog_visible before MarkBufferDirty

- Prefetch the pages in index vacuum's sequential scans. Implemented in NBTREE, GIST and SP-GIST. BRIN does not have a 2nd phase of vacuum, and both GIN and HASH clean up their indexes in a non-seqscan fashion: GIN scans the btree from left to right, and HASH only scans the initial buckets sequentially.

The compiler warning was correct and would have the potential to disable prefetching.

…251) refer #2807

Use $(INSTALL_DATA) to copy header files, similar to the more recent v15 branch. This helps avoid unnecessary rebuilds of postgres_ffi in neon. Cc: neondatabase/neon#1873


Commits on Nov 21, 2022

Commits on Nov 23, 2022

Commits on Nov 24, 2022

Commits on Dec 5, 2022

Commits on Dec 7, 2022

Commits on Dec 8, 2022

Commits on Dec 22, 2022
