ZIL: "crash" the ZIL if the pool suspends during fallback #17398
base: master
Conversation
I haven't read the big last commit deeply enough to properly understand the whole concept, will do another time, but I have a subtle feeling that you are overdoing something with the zl_lwb_crash_list. ZIL already has a mechanism to chain errors via lwb_error, and an error on one LWB automatically means the same error on all the following, without even issuing them, unless they are already issued. And proper ZIL operation is restored only when some transaction commit finally succeeds in restarting the ZIL chain, which means the pool is obviously back alive. I got your idea of removing ITXes and calling callbacks for potentially stuck LWB ZIOs; it is much better than attempting to terminate stuck ZIOs. But do you need some other LWB list and special cleanup for that?
I wouldn't be surprised. All the way through I wondered if I was overdoing it, so I've tried to make it as safe as possible. My concern was that with disks gone, what if some IO never returned and/or some returned out of order? Could IO not return even after the pool had resumed? I don't even know if any of this is possible, but I was trying to avoid assuming anything at all about the lost IO, and just make sure there was always some stub lying around in case it ever did come back. I'll have another read through next week; maybe with fresh eyes it'll be obvious where it's overdone (it's actually been months since I've thought very hard about this). You definitely understand the ZIL better than I do though, so if anything jumps out at you as obviously unnecessary, please do let me know!
I've spent most of the day studying how page writeback errors wire up to this (wrt #17398 (comment) above) and it seems it all needs work. As best I can tell, we're handling async/sync page writeback correctly (ie blocking when we need to block), but I'm pretty sure we're not really holding the kernel tools correctly to ensure errors are all propagated properly. Not really surprising; we never had to do it before! It doesn't affect the bulk of this work.
Force-pushed from 06449c0 to a5bbbf9.
Alright, last push has a new commit, sorting out the error reporting and page dirtying dance. Before the end of the week I will try to write or extend a test to exercise combinations of mapped and not-mapped writes and syncs when pool suspension is involved.
I've got some more changes coming on this too, though I am fast running out of week for it:
Fairly coarse, but if it returns while the pool suspends, it must be with an error. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
The superblock pointer will always be set, as will z_log, so remove code supporting cases that can't occur (on Linux at least). Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
If the pool is suspended, we'll just block in zil_commit(). If the system is shutting down, blocking wouldn't help anyone. So, we should keep this test for now, but at least return an error for anyone who is actually interested. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
If the kernel will honour our error returns, use them. If not, fool it by setting a writeback error on the superblock, if available. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework it to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. We change from a single zfs_log_write() to cover the whole range, to multiple calls, each for a single page. This is because if zfs_log_write() receives a write too large to fit into a single ZIL block, it will split it into multiple writes, but each will receive the same callback and argument. This would lead to the callback being called multiple times, but we wouldn't know which pages/range each call is for, and so couldn't clean them up. This way is not ideal, but is simple and correct, and so is a good start. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
These are all the same shape: set up the pool to suspend on first write, then perform some write+sync operation. The pool should suspend, and the sync operation should respond according to the failmode= property. We test fsync(), msync() and two forms of write() (open with O_SYNC, and async with sync=always), which all take slightly different paths to zil_commit() and back. A helper function is included to do the write+sync sequence with mmap() and msync(), since I didn't find a convenient tool to do that. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
This changes zil_commit() to have an int return, and updates all callers to check it. There are no corresponding internal changes yet; it will always return 0. Since zil_commit() is an indication that the caller _really_ wants the associated data to be durably stored, I've annotated it with the __warn_unused_result__ compiler attribute (via __must_check), to emit a warning if it's ever used without doing something with the return code. I hope this will mean we never misuse it in the future. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
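As a standalone illustration of that attribute (not the actual zil_commit() declaration), a caller that discards the result of a __warn_unused_result__ function gets a compiler warning, while a checked call compiles quietly:

```c
/* Illustration only; build with: gcc -Wall -c example.c */
#include <stdio.h>

__attribute__((__warn_unused_result__))
static int
commit_data(int fd)
{
    /* Pretend to make something durable; non-zero means failure. */
    return (fd < 0 ? -1 : 0);
}

int
main(void)
{
    commit_data(3);                 /* warning: ignoring return value */

    if (commit_data(3) != 0)        /* fine: the result is checked */
        fprintf(stderr, "commit failed\n");

    return (0);
}
```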
ITX callbacks are used to signal that something can be cleaned up after an itx is committed. Presently that's only used when syncing out mapped pages (msync()) to mark dirty pages clean. This extends the callback interface so it can be passed an error, and take a different cleanup action if necessary. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
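A rough sketch of what an error-carrying callback interface could look like; the existing zil_callback_t type is the one in zil.h, but the extended type name, its signature, and the consumer below are illustrative assumptions, not the literal patch:

```c
/* Existing callback type (include/sys/zil.h). */
typedef void (*zil_callback_t)(void *data);

/*
 * Extended form sketched from the description above: the callback also
 * receives the commit error, so it can take a different cleanup action.
 */
typedef void (*zil_callback_err_t)(void *data, int err);

/* Hypothetical consumer: decide the page's fate based on the error. */
static void
putpage_commit_cb(void *data, int err)
{
    (void) data;

    if (err == 0) {
        /* Commit completed: safe to mark the page clean. */
    } else {
        /* Commit failed: leave the page dirty so it is retried. */
    }
}
```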
If the ZIL runs into trouble, it calls txg_wait_synced(), which blocks on suspend. We want it to not block on suspend, instead returning an error.
On the surface, this is simple: change all calls to txg_wait_synced_flags(TXG_WAIT_SUSPEND), and then thread the error return back to the zil_commit() caller.
Handling suspension means returning an error to all commit waiters. This is relatively straightforward, as zil_commit_waiter_t already has zcw_zio_error to hold the write IO error, which signals a fallback to txg_wait_synced_flags(TXG_WAIT_SUSPEND), which will fail, and so the waiter can now return an error from zil_commit().
However, commit waiters are normally signalled when their associated write (LWB) completes. If the pool has suspended, those IOs may not return for some time, or maybe not at all. We still want to signal those waiters so they can return from zil_commit(). We have a list of those in-flight LWBs on zl_lwb_list, so we can run through those, detach them and signal them. The LWB itself is still in-flight, but no longer has attached waiters, so when it returns there will be nothing to do. (As an aside, ITXs can also supply completion callbacks, which are called when they are destroyed. These are directly connected to LWBs though, so they are passed the error code and destroyed there too.)
At this point, all ZIL waiters have been ejected, so we only have to consider the internal state. We potentially still have ITXs that have not been committed, LWBs still open, and LWBs in-flight. The on-disk ZIL is in an unknown state; some writes may have been written but not returned to us. We really can't rely on any of it; the best thing to do is abandon it entirely and start over when the pool returns to service. But, since we may have IO out that won't return until the pool resumes, we need something for it to return to.
The simplest solution I could find, implemented here, is to "crash" the ZIL: accept no new ITXs, make no further updates, and let it empty out on its normal schedule, that is, as txgs complete and zil_sync() and zil_clean() are called. We set a "restart txg" to three txgs in the future (syncing + TXG_CONCURRENT_STATES), at which point all the internal state will have been cleared out, and the ZIL can resume operation (handled at the top of zil_clean()).
This commit adds zil_crash(), which handles all of the above:
- sets the restart txg
- captures and signals all waiters
- zeroes the header
zil_crash() is called when txg_wait_synced_flags(TXG_WAIT_SUSPEND) returns because the pool suspended (ESHUTDOWN).
The rest of the commit is just threading the errors through, and related housekeeping.
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Page writeback is considered completed when the associated itx callback completes. A syncing writeback will receive the error in its callback directly, but an in-flight async writeback that was promoted to sync by the ZIL may also receive an error. Writeback errors, even syncing writeback errors, are not especially serious on their own, because the error will ultimately be returned to the zil_commit() caller, either zfs_fsync() for an explicit sync op (eg msync()) or zfs_putpage() itself for a syncing (WB_SYNC_ALL) writeback (kernel housekeeping or sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER)). The only thing we need to do when a page writeback fails is to re-mark the page dirty, since we don't know if it made it to disk yet. This will ensure that it gets written out again in the future, either by some scheduled async writeback or another explicit syncing call. On the other side, we need to make sure that if a syncing op arrives, any changes on dirty pages are written back to the DMU and/or the ZIL first. We do this by starting an _async_ (WB_SYNC_NONE) writeback on the file mapping at the start of the sync op (fsync(), msync(), etc). An async op will get an async itx created and logged, ready for the followup zfs_fsync()->zil_commit() to find, while avoiding a zil_commit() call for every page in the range. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Page writeback is considered completed when the associated itx callback completes. A syncing writeback will receive the error in its callback directly, but an in-flight async writeback that was promoted to sync by the ZIL may also receive an error. Writeback errors, even syncing writeback errors, are not especially serious on their own, because the error will ultimately be returned to the zil_commit() caller, either zfs_fsync() for an explicit sync op (eg msync()) or to zfs_putpage() itself for a syncing (VM_PAGER_PUT_SYNC) writeback. The only thing we need to do when a page writeback fails is to skip marking the page clean ("undirty"), since we don't know if it made it to disk yet. This will ensure that it gets written out again in the future, either some scheduled async writeback or another explicit syncing call. On the other side, we need to make sure that if a syncing op arrives, any changes on dirty pages are written back to the DMU and/or the ZIL first. We do this by starting an async writeback on the vnode cache first, so any dirty data has been recorded in the ZIL, ready for the followup zfs_sync()->zil_commit() to find. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Last push rebased on #17420 and #17445, and includes matching changes to both, most importantly that the FreeBSD putpage callbacks now receive and handle errors too. I'm still chewing on variations of this. I feel like being able to start writing ahead of time is more a reflection of some existing structural problem. A true async version could be achieved by returning a commit waiter, or having a callback on the commit itx, or stashing the commit waiter for the next call and having some kind of cookie to reference it. But the actual question is "why?". I think that needs a bit more thought, but outside of this PR. It's not necessary for this, it'll just be nicer code at the end, and I'd rather not blow out this diff any further.
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
Currently, if something goes wrong inside the ZIL, zil_commit() will fall back to a full txg sync, which provides the same semantics, just slower. If the pool suspends, txg_wait_synced_flags() will block until the pool resumes. The end result is that syncing ops like fsync() block regardless of how failmode= is set.
Since #17355, txg_wait_synced_flags() can bail out if the pool suspends while waiting. We can use this feature to not block zil_commit() if the pool suspends, but instead unwind and return an error. Once wired back through, this allows fsync() and friends to return EIO rather than blocking forever.
This is an obvious good, as it allows applications to take alternate action when the pool suspends, rather than just waiting forever.
(Further information in my BSDCan 2024 talk “Why fsync() on OpenZFS can’t fail, and what happens when it does”, if you like that sort of thing).
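To make the payoff concrete, here is a small illustrative sketch (not from this PR) of the application-side handling this enables: with failmode=continue, a suspended pool can now surface as an fsync() error instead of an indefinite hang, and the caller gets to run its own recovery path.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
make_durable(int fd)
{
    if (fsync(fd) == -1) {
        /* Previously this could block forever on a suspended pool. */
        fprintf(stderr, "fsync: %s\n", strerror(errno));
        return (-1);    /* fail over, retry elsewhere, alert, ... */
    }
    return (0);
}
```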
Description
Going commit-by-commit.
ZTS: test response of various sync methods under different failmodes
Test suite additions that confirm behaviour of various syncing ops (fsync(), msync(), write() after opening with O_SYNC, and write() with sync=always), with failmode=wait and failmode=continue. The short of it is that when the pool suspends, syncing ops should block in failmode=wait, and return an error with failmode=continue.
I couldn't find a convenient program that would do a dd-like write+sync using mmap()+msync(), so I wrote one just for this test (roughly sketched below).
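For reference, a minimal sketch of what that helper boils down to, assuming plain POSIX mmap()/msync(); the real helper in the test changes differs in details (argument parsing, sizes, error reporting):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static int
mmap_write_sync(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd == -1)
        return (-1);
    if (ftruncate(fd, (off_t)len) == -1) {
        close(fd);
        return (-1);
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return (-1);
    }

    memcpy(p, buf, len);    /* the "write" */

    /* The "sync": expected to fail with EIO if the pool suspends under failmode=continue. */
    int rc = msync(p, len, MS_SYNC);

    munmap(p, len);
    close(fd);
    return (rc);
}
```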
ZIL: allow zil_commit() to fail with error
This changes zil_commit() to return int, and then updates all call sites to have appropriate error handling. Most of these are fairly mechanical, being the handling for sync=always. Any tricky ones have comments.
Of interest, I’ve made zil_commit() be __attribute__((__warn_unused_result__)), causing the compiler to check that the result is being used and throw a warning (== error) if not. This was initially to help smoke out all the call sites, but I decided to leave it in, on the idea that if you are calling zil_commit(), you are indicating that it is important to you that the data gets onto disk, and you need to take care of it if it’s not. I personally think it’s a slam-dunk for programmer confidence, but I’m happy to discuss it if you don’t feel great about it.
ZIL: pass commit errors back to ITX callbacks
ZIL transactions (itx_t) can optionally have a callback, which will be called when the itx is destroyed (committed or not). This is used in one place: on Linux, a syncing writepage (that is, msync()) will set a callback so that it can mark the page clean once the op has completed.
Since that’s a syncing op that we want to fail if the ZIL crashes, we pass any error back to the callbacks as well. For the existing callbacks (zfs_putpage_sync_commit_cb(), zfs_putpage_async_commit_cb()), we handle the error by keeping the page dirty, and also setting its error flag. This prevents the page being evicted until it is written back cleanly, and the error is propagated back up to msync() later when the kernel calls vfs_sync_range() -> zpl_fsync() -> filemap_write_and_wait_range(). (A rough sketch of the callback side follows.)
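For illustration, a sketch of how the Linux-side callback could react to an error, using the classic page-based kernel helpers (newer kernels use the folio equivalents). The function name follows the description above, but the body and the assumed (void *, int) signature are not the literal patch:

```c
#include <linux/mm.h>
#include <linux/pagemap.h>

static void
zfs_putpage_sync_commit_cb(void *arg, int err)
{
    struct page *pp = arg;

    if (err != 0) {
        /* Keep the page dirty so it is written back again later. */
        SetPageError(pp);
        /* mapping_set_error() wants a negative errno. */
        mapping_set_error(page_mapping(pp), -err);
        __set_page_dirty_nobuffers(pp);
    } else {
        ClearPageError(pp);
    }
    end_page_writeback(pp);
}
```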
With #17445, FreeBSD gets equivalent changes. We don't currently have the sync/async tracking there, so we simply force a writeback in zfs_freebsd_fsync(), and if the callbacks receive errors, we unlock and notify, but leave the page dirty.
ZIL: "crash" the ZIL if the pool suspends during fallback
The main event. Now that we have the ability to pass errors back to callers, we need to generate some when things go wrong!
The basic idea is straightforward: instead of falling back to txg_wait_synced(), which blocks on suspend, we change all calls to txg_wait_synced_flags(TXG_WAIT_SUSPEND), and then thread the error return back to the zil_commit() caller.
Handling suspension means returning an error to all commit waiters. This is relatively straightforward, as zil_commit_waiter_t already has zcw_zio_error to hold the write IO error, which signals a fallback to txg_wait_synced_flags(TXG_WAIT_SUSPEND), which will fail, and so the waiter can now return an error from zil_commit().
However, commit waiters are normally signalled when their associated write (LWB) completes. If the pool has suspended, those IOs may not return for some time, or maybe not at all. We still want to signal those waiters so they can return from zil_commit(). We have a list of those in-flight LWBs on zl_lwb_list, so we can run through those, detach them and signal them. The LWB itself is still in-flight, but no longer has attached waiters, so when it returns there will be nothing to do. ITXs are directly connected to LWBs, so we can destroy them there too, passing the error on to their completion callbacks.
At this point, all ZIL waiters have been ejected, so we only have to consider the internal state. We potentially still have ITXs that have not been committed, LWBs still open, and LWBs in-flight. The on-disk ZIL is in an unknown state; some writes may have been written but not returned to us. We really can't rely on any of it; the best thing to do is abandon it entirely and start over when the pool returns to service. But, since we may have IO out that won't return until the pool resumes, we need something for it to return to.
The simplest solution, implemented here, is to "crash" the ZIL: accept no new ITXs, make no further updates, and let it empty out on its normal schedule, that is, as txgs complete and zil_sync() and zil_clean() are called. We set a "restart txg" to four txgs in the future (syncing + TXG_SIZE), at which point all the internal state will have been cleared out, and the ZIL can resume operation (handled at the top of zil_clean()).
This commit adds zil_crash(), which handles all of the above:
- sets the restart txg
- captures and signals all waiters
- zeroes the header
zil_crash() is called when txg_wait_synced_flags(TXG_WAIT_SUSPEND) returns because the pool suspended (ESHUTDOWN).
The rest of the commit is just threading the errors through, and related housekeeping.
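For orientation, a rough caller-side sketch of that flow, assuming the usual ZFS internal headers; zil_crash() and txg_wait_synced_flags() are from this PR, while the wrapper name, exact signatures, and locking are simplified assumptions:

```c
static int
zil_commit_fallback(zilog_t *zilog, uint64_t txg)
{
    int err = txg_wait_synced_flags(zilog->zl_dmu_pool, txg,
        TXG_WAIT_SUSPEND);

    if (err == ESHUTDOWN) {
        /*
         * The pool suspended while we were waiting. Eject all commit
         * waiters, stop accepting new ITXs, zero the header, and set
         * the restart txg so zil_clean() can bring the ZIL back once
         * the pool resumes.
         */
        zil_crash(zilog);
        err = SET_ERROR(EIO);
    }

    return (err);
}
```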
Discussion
The failmode= check
I’ve put the failmode= check at the bottom of zil_commit(), falling back to a “forever” txg_wait_synced() if the pool suspends and failmode=wait. There are two possible related issues, one structural, one design.
First, I think this isn’t the best place for this check. I feel like zil_commit() should always return EIO if the pool suspends, because it’s sort of an “off to the side” component. However, it’s very convenient to do the check here because of all the ZPL callers that would have to do those checks themselves if it wasn’t there. It might be better to do it in a wrapper, or with a flag.
The design question is this. I think there is a case to be made that syncing ops should always return EIO if they cannot be serviced now, regardless of the setting of failmode=. I admit it’s hard to draw a line, but I sort of see it that a basic async write() is always kind of a best effort, while something like fsync() is explicitly an application asking for something to be on disk. Any well-designed fsync() caller has to have a solid error strategy; why should it be delayed from enacting it? It’s a situation where it almost by definition needs an EIO handler, while ops like write() are often not checked as rigorously.
On balance, I think gating it behind failmode=continue for now is a good idea, since it represents such a fundamental change to how OpenZFS works that we should tread carefully, but I’m curious to hear if people have any thoughts.
No error checks in tests
The tests only check whether or not the syncing op blocks, not whether it returns an error. This is deliberate. Until we resolve the problem that flush errors don’t propagate to write errors (both in the ZIL and in the main spa_sync() line), syncing ops will always be able to erroneously return success instead of failure. This PR is mostly only about unblocking the syncing op, though we return errors when we can.
What I think will be the final change to handle flush errors as write errors should be in a PR in a week or two. It’s no secret that I’ve attempted this multiple times before, and I think I’ve got something workable this time. It’s extremely invasive though, so I don’t want to mix it up with this PR because it’ll just make it even harder to review.
FreeBSD page writeback errors
Resolved when #17445 lands. This PR is based on top of that, and makes the necessary changes to handle writeback errors on FreeBSD too.
Previous discussion below, for interest:
Our page writeback handler for FreeBSD doesn’t appear to have any ability to handle writeback errors in its current form, and I’m not sure what to do about it. From zfs_putpages():
If I’m understanding that correctly, it’s adding the ITX to the ZIL, then immediately marking the pages clean and error-free. If that’s true, then it might mean msync() will always return success and the page be made clean and evictable, even if it’s not on disk. That almost certainly needs to be fixed, though perhaps we can get away with it a little longer, since this whole PR only changes behaviour with failmode=continue, which isn’t widely used, and the fallback case is not totally reliable on errors, as above.
In any case, I will be trying to rope in a FreeBSD person to help with this bit. Let me know if that’s you!
How Has This Been Tested?
Full ZTS run, including new tests, completed successfully against Linux 6.1.137 and FreeBSD 14.2-p1 (though I'll concede, a few recent PRs of mine have come with full test runs locally and still tripped up in CI, so I should at least work out how to rephrase this boilerplate!)