
linux: zvols: correctly detect flush requests #17131

Merged

Conversation

Fabian-Gruenbichler
Contributor

Motivation and Context

a user reported that since ZFS 2.2.7, sync writes from a VM backed by a zvol are no longer honored as sync writes by ZFS. a bisect identified the commit introducing the regression (see below).

Description

since 4.10, bio->bi_opf needs to be checked to determine all kinds of flush requests. this was the case prior to the commit referenced below, but the order of ifdefs was not the usual one (newest up top), which might have caused this to slip through.

this fixes a regression when using zvols as Qemu block devices, but the regression might have affected other use cases as well. the symptoms are that all sync writes from within a VM configured to use such a virtual block device are ignored and treated as async writes by the host ZFS layer.

this can be verified using fio in sync mode inside the VM, for example with

fio --filename=/dev/sda --ioengine=libaio --loops=1 --size=10G --time_based --runtime=60 --group_reporting --stonewall --name=cc1 --description="CC1" --rw=write --bs=4k --direct=1 --iodepth=1 --numjobs=1 --sync=1

which shows an IOPS number way above what the physical device underneath supports, with "zpool iostat -r 1" on the hypervisor side showing no sync IO occurring during the benchmark.

with the regression fixed, both fio inside the VM and the IO stats on the host show the expected numbers.

Fixes: 846b598

How Has This Been Tested?

see above - configured a test VM with a disk backed by a zvol, bisected using the fio command line above, and verified the fix in the same fashion after identifying the culprit.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@Fabian-Gruenbichler
Contributor Author

it would be great if this could be backported to 2.3.x (and 2.2.x, in case new releases are cut there as well) once reviewed and merged into master!

tagging @robn @tonyhutter

Member

@amotin amotin left a comment


As far as I can see, modern kernels define both REQ_OP_FLUSH and REQ_PREFLUSH, so it is not obvious to me which one is right these days. Maybe checking both, the same as before, in the HAVE_REQ_OP_FLUSH case would be safer? If nothing else, the comment above the function is inconsistent with the proposed content.
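
For illustration, a minimal sketch of the "check both" variant being suggested here (an assumption based on the description of the pre-846b598 behaviour, not the exact blkdev_compat.h code):

 static inline boolean_t
 bio_is_flush(struct bio *bio)
 {
 	/* a bio is a flush if the opcode is a flush or the PREFLUSH flag is set */
 	return (bio_op(bio) == REQ_OP_FLUSH ||
 	    (bio->bi_opf & REQ_PREFLUSH) != 0);
 }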

@Fabian-Gruenbichler
Contributor Author

yes, the comment already got broken by an earlier kernel compat bump.

I tested both the variant for 4.10+ (just checking PREFLUSH, as it ended up in the final commit) and the one before (checking either the OP or the flag); both seem to work. the kernel's op_is_flush (https://github.com/torvalds/linux/blob/master/include/linux/blk_types.h#L445) checks the flag, but with an additional bit being required. I am not too well-versed in block-layer internals, so either the pre-4.10 variant or the 4.10+ variant seems fine to me, since the latter is a subset of the former, and this would effectively revert to the ZFS <= 2.2.6 behaviour instead of the definitely broken 2.2.7/2.3.0/master one ;)
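
for reference, the kernel helper linked above essentially boils down to this (the parameter type has changed across kernel versions, but the logic has stayed the same):

 static inline bool op_is_flush(blk_opf_t op)
 {
 	/* true if REQ_PREFLUSH is set, and also for the REQ_FUA bit */
 	return op & (REQ_FUA | REQ_PREFLUSH);
 }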

@tonyhutter
Contributor

op_is_flush() is present on 4.18-6.13 kernels (ZFS's supported versions), so that might be the way to go. This at least builds for me on the 6.13 kernel (no idea if it fixes the underlying issue):

diff --git a/include/os/linux/kernel/linux/blkdev_compat.h b/include/os/linux/kernel/linux/blkdev_compat.h
index d96708c60..8b040f4d6 100644
--- a/include/os/linux/kernel/linux/blkdev_compat.h
+++ b/include/os/linux/kernel/linux/blkdev_compat.h
@@ -383,7 +383,7 @@ bio_set_flush(struct bio *bio)
 static inline boolean_t
 bio_is_flush(struct bio *bio)
 {
-       return (bio_op(bio) == REQ_OP_FLUSH);
+       return (op_is_flush(bio->bi_opf));
 }
 
 /*

I can't seem to find the equivalent op_is_flush() macro for the blk-mq path, so I'm assuming the existing req_op(rq) == REQ_OP_FLUSH check is ok there.

@Fabian-Gruenbichler Fabian-Gruenbichler force-pushed the zvol-flush-behaviour-regression branch from 3f2242e to c9ae14e on March 11, 2025 07:21
@Fabian-Gruenbichler
Contributor Author

can confirm that op_is_flush also behaves as expected, so switched to that and added the previous REQ_OP_FLUSH check back in.
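
a sketch of what the resulting helper looks like with both checks in place (based on the description above, not copied from the merged diff):

 static inline boolean_t
 bio_is_flush(struct bio *bio)
 {
 	/* opcode-based check kept alongside the flag-based op_is_flush() check */
 	return (bio_op(bio) == REQ_OP_FLUSH || op_is_flush(bio->bi_opf));
 }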

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Mar 11, 2025
Contributor

@tonyhutter tonyhutter left a comment


@Fabian-Gruenbichler this looks good. If you squash the commits it should be ready to merge.

since 4.10, bio->bi_opf needs to be checked to determine all kinds of
flush requests. this was the case prior to the commit referenced below,
but the order of ifdefs was not the usual one (newest up top), which
might have caused this to slip through.

this fixes a regression when using zvols as Qemu block devices, but
the regression might have affected other use cases as well. the
symptoms are that all sync writes from within a VM configured to use
such a virtual block device are ignored and treated as async writes
by the host ZFS layer.

this can be verified using fio in sync mode inside the VM, for example
with

 fio \
 --filename=/dev/sda --ioengine=libaio --loops=1 --size=10G \
 --time_based --runtime=60 --group_reporting --stonewall --name=cc1 \
 --description="CC1" --rw=write --bs=4k --direct=1 --iodepth=1 \
 --numjobs=1 --sync=1

which shows an IOPS number way above what the physical device underneath
supports, with "zpool iostat -r 1" on the hypervisor side showing no
sync IO occurring during the benchmark.

with the regression fixed, both fio inside the VM and the IO stats on
the host show the expected numbers.

Fixes: 846b598
"config: remove HAVE_REQ_OP_* and HAVE_REQ_*"

Co-authored-by: Alexander Motin <[email protected]>
Signed-off-by: Fabian-Gruenbichler <[email protected]>
@Fabian-Gruenbichler Fabian-Gruenbichler force-pushed the zvol-flush-behaviour-regression branch from ee17324 to 644a7a7 on March 12, 2025 07:13
@Fabian-Gruenbichler
Contributor Author

done! I left the co-authored-by that GitHub added; the change is trivial either way :) thanks for the speedy review

@amotin amotin added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 12, 2025
@tonyhutter tonyhutter merged commit 41823a0 into openzfs:master Mar 12, 2025
36 of 41 checks passed
ixhamza pushed a commit to truenas/zfs that referenced this pull request Mar 19, 2025
ixhamza pushed a commit to truenas/zfs that referenced this pull request Mar 19, 2025
ixhamza pushed a commit to truenas/zfs that referenced this pull request Mar 19, 2025
ixhamza added a commit to truenas/zfs that referenced this pull request Mar 19, 2025
NAS-134891 / 25.10 / linux: zvols: correctly detect flush requests (openzfs#17131)
amotin added a commit to truenas/zfs that referenced this pull request Mar 19, 2025
NAS-134891 / 25.04.0 / linux: zvols: correctly detect flush requests (openzfs#17131)