ZIL: "crash" the ZIL if the pool suspends during fallback #17398

Open · robn wants to merge 11 commits into master from zil-suspend-break

Conversation


@robn robn commented May 29, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

Currently, if something goes wrong inside the ZIL, zil_commit() will fall back to a full txg sync, which provides the same semantics, just slower. If the pool suspends, txg_wait_synced() will block until the pool resumes. The end result is that syncing ops like fsync() block regardless of how failmode= is set.

Since #17355, txg_wait_synced_flags() can bail out if the pool suspends while waiting. We can use this feature to not block zil_commit() if the pool suspends, but instead unwind and return an error. Once wired back through, this allows fsync() and friends to return EIO rather than blocking forever.
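
The pattern this makes available inside the ZIL fallback path looks roughly like this (a sketch based on the description here; the exact signature and call sites are per #17355 and the commits below):

    /*
     * Instead of the unconditionally blocking txg_wait_synced(), the
     * fallback waits with TXG_WAIT_SUSPEND and unwinds if the pool
     * suspends while we wait.
     */
    int error = txg_wait_synced_flags(zilog->zl_dmu_pool, txg, TXG_WAIT_SUSPEND);
    if (error != 0) {
        /* Pool suspended (ESHUTDOWN); unwind so zil_commit() can
         * return the error to its caller. */
        return (error);
    }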

This is an obvious good, as it allows applications to take alternate action when the pool suspends, rather than just waiting forever.

(Further information in my BSDCan 2024 talk “Why fsync() on OpenZFS can’t fail, and what happens when it does”, if you like that sort of thing).

Description

Going commit-by-commit.

ZTS: test response of various sync methods under different failmodes

Test suite additions that confirm the behaviour of various syncing ops (fsync(), msync(), write() on a file opened with O_SYNC, and write() with sync=always), with failmode=wait and failmode=continue. The short of it is that when the pool suspends, syncing ops should block with failmode=wait, and return an error with failmode=continue.

I couldn’t find a convenient program that would do a dd-like write+sync using mmap()+msync(), so I wrote one just for this test.
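
For illustration, a minimal standalone sketch of what such a helper looks like (not the actual ZTS program; the path and size are placeholders):

    /* Minimal mmap()+msync() writer: map a file, dirty the mapping, then
     * force a synchronous writeback. Error handling is trimmed for brevity. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        const size_t len = 1 << 20;     /* 1 MiB; placeholder size */
        int fd = open("/testpool/fs/mmap_file", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, len) != 0)
            return (1);

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return (1);

        memset(p, 0xa5, len);           /* dirty the pages */

        /* The syncing op under test: should block or fail per failmode=. */
        if (msync(p, len, MS_SYNC) != 0) {
            perror("msync");
            return (2);
        }

        munmap(p, len);
        return (close(fd) == 0 ? 0 : 2);
    }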

ZIL: allow zil_commit() to fail with error

This changes zil_commit() to return int, and then updates all call sites with appropriate error handling. Most of these are fairly mechanical, being the handling for sync=always. Any tricky ones have comments.

Of interest, I’ve made zil_commit() be __attribute__((__warn_unused_result__)), causing the compiler to check that the result is being used and throw a warning (== error) if not. This was initially to help smoke out all the call sites, but I decided to leave it in, on the idea that if you are calling zil_commit(), you are indicating that it is important to you that the data gets onto disk, and you need to take care of it if it’s not. I personally think it’s a slam-dunk for programmer confidence, but I’m happy to discuss it if you don’t feel great about it.
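
For example, a sketch of the annotated prototype and a typical sync=always call site (simplified and illustrative; the real declaration lives in the ZIL headers, possibly via a __must_check-style macro):

    /* The return value must now be consumed, or the compiler warns (and
     * with -Werror, the build fails). */
    extern int zil_commit(zilog_t *zilog, uint64_t oid)
        __attribute__((__warn_unused_result__));

    /* A typical sync=always call site then becomes something like: */
    if (error == 0 && zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
        error = zil_commit(zfsvfs->z_log, zp->z_id);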

ZIL: pass commit errors back to ITX callbacks

ZIL transactions (itx_t) can optionally have a callback, which will be called when it is destroyed (committed or not). This is used in one place: on Linux, a syncing writepage (that is, msync()) will set a callback so that it can mark the page clean once the op has completed.

Since that’s a syncing op that we want to fail if the ZIL crashes, we pass any error back to the callbacks as well. For the existing callbacks (zfs_putpage_sync_commit_cb(), zfs_putpage_async_commit_cb()), we handle the error by keeping the page dirty and also setting its error flag. This prevents the page being evicted until it is written back cleanly, and the error is propagated back up to msync() later when the kernel calls vfs_fsync_range() -> zpl_fsync() -> filemap_write_and_wait_range().
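
For illustration, the rough shape of the error handling described here (a sketch only: the function name and exact kernel calls are simplified, not the literal OpenZFS implementation):

    /*
     * Sketch of the extended ITX callback: it now receives the commit
     * error. On failure, keep the page dirty and flag the error so a
     * later msync()/fsync() (via filemap_write_and_wait_range()) sees it.
     */
    static void
    putpage_commit_cb(void *arg, int err)
    {
        struct page *pp = arg;

        if (err != 0) {
            SetPageError(pp);               /* error flag for later writeback waiters */
            __set_page_dirty_nobuffers(pp); /* keep the page for another attempt */
        }
        end_page_writeback(pp);             /* this writeback attempt is over either way */
    }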

With #17445, FreeBSD gets equivalent changes. We don't currently have the sync/async tracking there, so we simply force a writeback in zfs_freebsd_fsync(), and if the callbacks receive errors, we unlock and notify, but leave the page dirty.
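
A hedged sketch of the per-page callback approach (per the #17445 commit message further down; names, offset stepping, and the unlock/notify details are simplified and partly hypothetical):

    /*
     * One zfs_log_write() per page, each with its own callback, and
     * zfs_vm_pagerret_pend (VM_PAGER_PEND) returned so the VM knows
     * completion is asynchronous.
     */
    static void
    putpages_commit_cb(void *arg, int err)
    {
        vm_page_t pp = arg;

        if (err == 0)
            vm_page_undirty(pp);    /* safely in the ZIL or on disk */
        /* On error the page stays dirty so it will be written back again.
         * Either way, unlock/notify the page here (details elided). */
    }

    /* In zfs_putpages(), per page (offset stepping simplified): */
    for (i = 0; i < ncount; i++) {
        zfs_log_write(zfsvfs->z_log, tx, TX_WRITE, zp, off + i * PAGE_SIZE,
            PAGE_SIZE, commit, B_FALSE, putpages_commit_cb, ma[i]);
        rtvals[i] = zfs_vm_pagerret_pend;
    }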

ZIL: "crash" the ZIL if the pool suspends during fallback

The main event. Now that we have the ability to pass errors back to callers, we need to generate some when things go wrong!

The basic idea is straightforward: instead of falling back to txg_wait_synced(), which blocks on suspend, we change all calls to txg_wait_synced_flags(TXG_WAIT_SUSPEND), and then thread the error return back to the zil_commit() caller.

Handling suspension means returning an error to all commit waiters. This is relatively straightforward, as zil_commit_waiter_t already has zcw_zio_error to hold the write IO error, which signals a fallback to txg_wait_synced_flags(TXG_WAIT_SUSPEND), which will fail, and so the waiter can now return an error from zil_commit().

However, commit waiters are normally signalled when their associated write (LWB) completes. If the pool has suspended, those IOs may not return for some time, or maybe not at all. We still want to signal those waiters so they can return from zil_commit(). We have a list of those in-flight LWBs on zl_lwb_list, so we can run through those, detach them and signal them. The LWB itself is still in flight, but no longer has attached waiters, so when it returns there will be nothing to do. ITXs are directly connected to LWBs, so we can destroy them there too, passing the error on to their completion callbacks.

At this point, all ZIL waiters have been ejected, so we only have to consider the internal state. We potentially still have ITXs that have not been committed, LWBs still open, and LWBs in-flight. The on-disk ZIL is in an unknown state; some writes may have been written but not returned to us. We really can't rely on any of it; the best thing to do is abandon it entirely and start over when the pool returns to service. But, since we may have IO out that won't return until the pool resumes, we need something for it to return to.

The simplest solution, implemented here, is to "crash" the ZIL: accept no new ITXs, make no further updates, and let it empty out on its normal schedule, that is, as txgs complete and zil_sync() and zil_clean() are called. We set a "restart txg" to four txgs in the future (syncing + TXG_SIZE), at which point all the internal state will have been cleared out, and the ZIL can resume operation (handled at the top of zil_clean()).

This commit adds zil_crash(), which handles all of the above:

  • sets the restart txg
  • captures and signals all waiters
  • zeroes the header

zil_crash() is called when txg_wait_synced_flags(TXG_WAIT_SUSPEND) returns because the pool suspended (ESHUTDOWN).
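
A simplified sketch of that flow (locking and list management elided; zl_restart_txg and the itx-cleanup helper are illustrative names, not necessarily the real ones):

    /*
     * Sketch of zil_crash(): stop accepting work, fail all commit waiters
     * so their zil_commit() callers can return EIO, and schedule a restart
     * a few txgs out, once zil_sync()/zil_clean() have drained the
     * remaining internal state.
     */
    static void
    zil_crash(zilog_t *zilog)
    {
        /* No new ITXs or updates until this txg has come and gone. */
        zilog->zl_restart_txg = spa_syncing_txg(zilog->zl_spa) + TXG_SIZE;

        /* Detach and signal waiters hanging off in-flight LWBs. The LWB
         * IO may still return later, but will find nothing left to do. */
        for (lwb_t *lwb = list_head(&zilog->zl_lwb_list); lwb != NULL;
            lwb = list_next(&zilog->zl_lwb_list, lwb)) {
            zil_commit_waiter_t *zcw;
            while ((zcw = list_remove_head(&lwb->lwb_waiters)) != NULL) {
                zcw->zcw_zio_error = EIO;
                zcw->zcw_done = B_TRUE;
                cv_broadcast(&zcw->zcw_cv);
            }
            /* ITXs attached to the LWB are destroyed here too, with the
             * error passed on to their completion callbacks; the helper
             * name is hypothetical. */
            zil_lwb_fail_itxs(lwb, EIO);
        }

        /*
         * The on-disk ZIL can no longer be trusted: the real code also
         * zeroes the ZIL header so nothing will be replayed, and lets the
         * rest of the state drain out as txgs complete.
         */
    }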

The rest of the commit is just threading the errors through, and related housekeeping.

Discussion

The failmode= check

I’ve put the failmode= check at the bottom of zil_commit(), falling back to a “forever” txg_wait_synced() if the pool suspends and failmode=wait. There are two possible related issues, one structural, one design.

First, I think this isn’t the best place for this check. I feel like zil_commit() should always return EIO if the pool suspends, because it’s sort of an “off to the side” component. However, it’s very convenient to do the check here because of all the ZPL callers that would have to do those checks themselves if it wasn’t there. It might be better to do it in a wrapper, or with a flag.
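
Roughly, the gate at the bottom of zil_commit() amounts to something like this (a sketch; exact placement and variable names may differ):

    /*
     * Only failmode=continue gets the new error return. With
     * failmode=wait we preserve the historical behaviour and block until
     * the pool resumes.
     */
    if (error == ESHUTDOWN &&
        spa_get_failmode(zilog->zl_spa) == ZIO_FAILURE_MODE_WAIT) {
        txg_wait_synced(zilog->zl_dmu_pool, 0);  /* "forever" wait across suspend */
        error = 0;
    }
    return (error);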

The design question is this. I think there is a case to be made that syncing ops should always return EIO if they cannot be serviced now, regardless of the setting of failmode=. I admit it’s hard to draw a line, but I sort of see it that a basic async write() is always kind of a best effort, while something like fsync() is explicitly an application asking for something to be on disk. Any well-designed fsync() caller has to have a solid error strategy; why should it be delayed from enacting it? It’s a situation where, almost by definition, it needs an EIO handler, while ops like write() are often not checked as rigorously.

On balance, I think gating it behind failmode=continue for now is a good idea since it represents such a fundamental change to how OpenZFS works that we should tread carefully, but I’m curious to hear if people have any thoughts.

No error checks in tests

The tests only check whether the syncing op blocks, not whether it returns an error. This is deliberate. Until we resolve the problem that flush errors don’t propagate to write errors (both in the ZIL and in the main spa_sync() line), syncing ops will always be able to erroneously return success instead of failure. This PR is mostly about unblocking the syncing op, though we return errors when we can.

What I think will be the final change to handle flush errors as write errors should be in a PR in a week or two. It’s no secret that I’ve attempted this multiple times before, and I think I’ve got something workable this time. It’s extremely invasive though, so I don’t want to mix it up with this PR because it’ll just make it even harder to review.

FreeBSD page writeback errors

Resolved when #17445 lands. This PR is based on top of that, and makes the necessary changes to handle writeback errors on FreeBSD too.

Previous discussion below, for interest:

Our page writeback handler for FreeBSD doesn’t appear to have any ability to handle writeback errors in its current form, and I’m not sure what to do about it. From zfs_putpages():

		/*
		 * XXX we should be passing a callback to undirty
		 * but that would make the locking messier
		 */
		zfs_log_write(zfsvfs->z_log, tx, TX_WRITE, zp, off,
		    len, commit, B_FALSE, NULL, NULL);

		zfs_vmobject_wlock(object);
		for (i = 0; i < ncount; i++) {
			rtvals[i] = zfs_vm_pagerret_ok;
			vm_page_undirty(ma[i]);
		}
		zfs_vmobject_wunlock(object);
		VM_CNT_INC(v_vnodeout);
		VM_CNT_ADD(v_vnodepgsout, ncount);

If I’m understanding that correctly, it’s adding the ITX to the ZIL, then immediately marking the pages clean and error-free. If that’s true, then it might mean msync() will always return success and the page will be made clean and evictable, even if it’s not on disk. That almost certainly needs to be fixed, though perhaps we can get away with it a little longer, since this whole PR only changes behaviour with failmode=continue, which isn’t widely used, and the fallback case is not totally reliable on errors, as above.

In any case, I will be trying to rope in a FreeBSD person to help with this bit. Let me know if that’s you!

How Has This Been Tested?

Full ZTS run, including new tests, completed successfully against Linux 6.1.137 and FreeBSD 14.2-p1 (though I'll concede, a few recent PRs of mine have come with full test runs locally and still tripped up in CI, so I should at least work out how to rephrase this boilerplate!)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@amotin amotin left a comment

I haven't read the big last commit deeply enough to properly understand the whole concept, will do another time, but I have a subtle feeling that you are overdoing something with the zl_lwb_crash_list. The ZIL already has a mechanism to chain errors via lwb_error, and an error on one LWB automatically means the same error on all the following ones, without even issuing them, unless they are already issued. And proper ZIL operation is restored only when some transaction commit finally succeeds in restarting the ZIL chain, which means the pool is obviously back alive. I got your idea of removing ITXes and calling callbacks for potentially stuck LWB ZIOs, it is much better than attempting to terminate stuck ZIOs. But do you need some other LWB list and special cleanup for that?

@amotin amotin added the Status: Code Review Needed label May 29, 2025

robn commented May 30, 2025

I have a subtle feeling that you are overdoing something with the zl_lwb_crash_list. The ZIL already has a mechanism to chain errors via lwb_error, and an error on one LWB automatically means the same error on all the following ones, without even issuing them, unless they are already issued. And proper ZIL operation is restored only when some transaction commit finally succeeds in restarting the ZIL chain, which means the pool is obviously back alive. I got your idea of removing ITXes and calling callbacks for potentially stuck LWB ZIOs, it is much better than attempting to terminate stuck ZIOs. But do you need some other LWB list and special cleanup for that?

I wouldn't be surprised. All the way through I wondered if I was overdoing it, so I've tried to make it as safe as possible. My concern was that with disks gone, what if some IO never returned and/or some returned out of order? Could IO not return even after the pool had resumed? I don't even know if any of this is possible but I was trying to avoid assuming anything at all about the lost IO, and just make sure there was always some stub lying around in case it ever did come back.

I'll have another read through next week; maybe with fresh eyes it'll be obvious where it's overdone (it's actually been months since I've thought very hard about this).

You definitely understand the ZIL better than I do though, so if anything jumps out at you as obviously unnecessary, please do let me know!


robn commented Jun 2, 2025

I've spent most of the day studying how page writeback errors wire up to this (wrt #17398 (comment) above) and it seems it all needs work.

As best I can tell, we're handling async/sync page writeback correctly (ie blocking when we need to block), but I'm pretty sure we're not really holding the kernel tools correctly to ensure errors are all propagated properly. Not really surprising; we never had to do it before!

It doesn't affect the bulk of this work (in zil.c); it's all in the dance between zfs_putpage(), zpl_sync(), and the itx callbacks. I think I know what to do with it all; just need to work through it, and write a test for it. That'll be tomorrow I think.

@robn robn force-pushed the zil-suspend-break branch 2 times, most recently from 06449c0 to a5bbbf9 on June 4, 2025 10:07

robn commented Jun 4, 2025

Alright, last push has a new commit, sorting out the error reporting and page dirtying dance between zpl_fsync(), zfs_putpage() and the itx callbacks. I believe it's (close to) right, but this is deep and subtle.

Before the end of the week I will try to write or extend a test to exercise combinations of mapped and not-mapped writes and syncs when pool suspension is involved.


robn commented Jun 4, 2025

I've got some more changes coming on this too, though I am fast running out of week for it:

  • incorporate (rebase atop) #17420 (syncfs: don't erroneously return success when the pool is suspended), and make sure the zil_commit() call in zfs_sync() gets back to syncfs() correctly
  • some kind of "nowait" or "callback" zil_commit(). There are now two places where it's only called to get things writing, and it would be very useful for the "nowait" mode of zfs_sync() as well. At least, I want a mode which doesn't consider failmode= and just returns, but I think in both styles the error is important, so I need to think a little about that.
  • write the page writeback callbacks for FreeBSD. I suspect I know enough now to identify the shapes I need in the FreeBSD kernel and hook it up

robn added 11 commits June 10, 2025 11:10
Fairly coarse, but if it returns while the pool suspends, it must be
with an error.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
The superblock pointer will always be set, as will z_log, so remove code
supporting cases that can't occur (on Linux at least).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
If the pool is suspended, we'll just block in zil_commit(). If the
system is shutting down, blocking wouldn't help anyone. So, we should
keep this test for now, but at least return an error for anyone who is
actually interested.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
If the kernel will honour our error returns, use them. If not, fool it
by setting a writeback error on the superblock, if available.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
zfs_putpages() would put the entire range of pages onto the ZIL, then
return VM_PAGER_OK for each page to the kernel. However, an associated
zil_commit() or txg sync had not happened at this point, so the write
may not actually be on disk.

So, we rework it to use a ZIL commit callback, and do the post-write
work of undirtying the page and signaling completion there. We return
VM_PAGER_PEND to the kernel instead so it knows that we will take care
of it.

We change from a single zfs_log_write() to cover the whole range, to
multiple calls, each for a single page. This is because if
zfs_log_write() receives a write too large to fit into a single ZIL
block, it will split it into multiple writes, but each will receive the
same callback and argument. This will lead to the callback being called
multiple times, but we don't know which pages/range the call is for, and so
can't clean them up. This way is not ideal, but is simple and correct,
and so is a good start.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
These are all the same shape: set up the pool to suspend on first write,
then perform some write+sync operation. The pool should suspend, and the
sync operation should respond according to the failmode= property.

We test fsync(), msync() and two forms of write() (open with O_SYNC, and
async with sync=always), which all take slightly different paths to
zil_commit() and back.

A helper function is included to do the write+sync sequence with mmap()
and msync(), since I didn't find a convenient tool to do that.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
This changes zil_commit() to have an int return, and updates all callers
to check it. There are no corresponding internal changes yet; it will
always return 0.

Since zil_commit() is an indication that the caller _really_ wants the
associated data to be durably stored, I've annotated it with the
__warn_unused_result__ compiler attribute (via __must_check), to emit a
warning if it's ever used without doing something with the return code.
I hope this will mean we never misuse it in the future.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
ITX callbacks are used to signal that something can be cleaned up after
an itx is committed. Presently that's only used when syncing out mapped
pages (msync()) to mark dirty pages clean.

This extends the callback interface so it can be passed an error, and
take a different cleanup action if necessary.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
If the ZIL runs into trouble, it calls txg_wait_synced(), which blocks
on suspend. We want it to not block on suspend, instead returning an
error. On the surface, this is simple: change all calls to
txg_wait_synced_flags(TXG_WAIT_SUSPEND), and then thread the error
return back to the zil_commit() caller.

Handling suspension means returning an error to all commit waiters. This
is relatively straightforward, as zil_commit_waiter_t already has
zcw_zio_error to hold the write IO error, which signals a fallback to
txg_wait_synced_flags(TXG_WAIT_SUSPEND), which will fail, and so the
waiter can now return an error from zil_commit().

However, commit waiters are normally signalled when their associated
write (LWB) completes. If the pool has suspended, those IOs may not
return for some time, or maybe not at all. We still want to signal those
waiters so they can return from zil_commit(). We have a list of those
in-flight LWBs on zl_lwb_list, so we can run through those, detach them
and signal them. The LWB itself is still in-flight, but no longer has
attached waiters, so when it returns there will be nothing to do.

(As an aside, ITXs can also supply completion callbacks, which are
called when they are destroyed. These are directly connected to LWBs
though, so are passed the error code and destroyed there too).

At this point, all ZIL waiters have been ejected, so we only have to
consider the internal state. We potentially still have ITXs that have
not been committed, LWBs still open, and LWBs in-flight. The on-disk ZIL
is in an unknown state; some writes may have been written but not
returned to us. We really can't rely on any of it; the best thing to do
is abandon it entirely and start over when the pool returns to service.
But, since we may have IO out that won't return until the pool resumes,
we need something for it to return to.

The simplest solution I could find, implemented here, is to "crash" the
ZIL: accept no new ITXs, make no further updates, and let it empty out
on its normal schedule, that is, as txgs complete and zil_sync() and
zil_clean() are called. We set a "restart txg" to three txgs in the
future (syncing + TXG_CONCURRENT_STATES), at which point all the
internal state will have been cleared out, and the ZIL can resume
operation (handled at the top of zil_clean()).

This commit adds zil_crash(), which handles all of the above:
 - sets the restart txg
 - captures and signals all waiters
 - zeroes the header

zil_crash() is called when txg_wait_synced_flags(TXG_WAIT_SUSPEND)
returns because the pool suspended (ESHUTDOWN).

The rest of the commit is just threading the errors through, and related
housekeeping.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Page writeback is considered completed when the associated itx callback
completes. A syncing writeback will receive the error in its callback
directly, but an in-flight async writeback that was promoted to sync by
the ZIL may also receive an error.

Writeback errors, even syncing writeback errors, are not especially
serious on their own, because the error will ultimately be returned to
the zil_commit() caller, either zfs_fsync() for an explicit sync op (eg
msync()) or to zfs_putpage() itself for a syncing (WB_SYNC_ALL) writeback
(kernel housekeeping or sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER)).

The only thing we need to do when a page writeback fails is to re-mark
the page dirty, since we don't know if it made it to disk yet. This will
ensure that it gets written out again in the future, either some
scheduled async writeback or another explicit syncing call.

On the other side, we need to make sure that if a syncing op arrives,
any changes on dirty pages are written back to the DMU and/or the ZIL
first. We do this by starting an _async_ (WB_SYNC_NONE) writeback on the
file mapping at the start of the sync op (fsync(), msync(), etc). An
async op will get an async itx created and logged, ready for the
followup zfs_fsync()->zil_commit() to find, while avoiding a zil_commit()
call for every page in the range.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Page writeback is considered completed when the associated itx callback
completes. A syncing writeback will receive the error in its callback
directly, but an in-flight async writeback that was promoted to sync by
the ZIL may also receive an error.

Writeback errors, even syncing writeback errors, are not especially
serious on their own, because the error will ultimately be returned to
the zil_commit() caller, either zfs_fsync() for an explicit sync op (eg
msync()) or to zfs_putpage() itself for a syncing (VM_PAGER_PUT_SYNC)
writeback.

The only thing we need to do when a page writeback fails is to skip
marking the page clean ("undirty"), since we don't know if it made it to
disk yet. This will ensure that it gets written out again in the future,
either some scheduled async writeback or another explicit syncing call.

On the other side, we need to make sure that if a syncing op arrives,
any changes on dirty pages are written back to the DMU and/or the ZIL
first. We do this by starting an async writeback on the vnode cache
first, so any dirty data has been recorded in the ZIL, ready for the
followup zfs_sync()->zil_commit() to find.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@robn robn force-pushed the zil-suspend-break branch from 8b10502 to 40a85c9 on June 10, 2025 03:09

robn commented Jun 10, 2025

Last push rebased on #17420 and #17445, and includes matching changes to both, most importantly that the FreeBSD putpage callbacks now receive and handle errors too.

I'm still chewing on variations of zil_commit(). I do think I want to be able to separate out the failmode= check, and will look at it soon.

I feel like being able to start writing ahead of time is more a reflection of some existing structural problem.

A true async version could be achieved by returning a commit waiter, or having a callback on the commit itx, or stashing the commit waiter for the next call and having some kind of cookie to reference it. But the real question is "why?": a call to zil_commit() is saying you want everything it's got on disk now, and it's only the async/sync counts and the writeback error states complicating matters. Maybe that should all be moved inside the ZIL, with other ways to signal what it is you want (using the ZIL to help with things that aren't strictly sync, I guess?).

I think that needs a bit more thought, but outside of this PR. It's not necessary for this, it'll just be nicer code at the end, and I'd rather not blow out this diff any further.
