
Fix 2 bugs in non-raw send with encryption #17340


Merged: 1 commit, May 19, 2025

Conversation

gamanakis
Contributor

@gamanakis gamanakis commented May 16, 2025

Motivation and Context

Closes #12014

Description

Bisecting identified the redacted send/receive code as the source of the bug
in issue #12014. Specifically, the call to
dsl_dataset_hold_obj(&fromds) was replaced by
dsl_dataset_hold_obj_flags(), which passes a DECRYPT flag and creates
a key mapping. The matching dsl_dataset_rele_flags(&fromds) is missing,
so the key mapping is never cleared. It may then be inadvertently
reused, which results in arc_untransform failing with ECKSUM in:
arc_untransform+0x96/0xb0 [zfs]
dbuf_read_verify_dnode_crypt+0x196/0x350 [zfs]
dbuf_read+0x56/0x770 [zfs]
dmu_buf_hold_by_dnode+0x4a/0x80 [zfs]
zap_lockdir+0x87/0xf0 [zfs]
zap_lookup_norm+0x5c/0xd0 [zfs]
zap_lookup+0x16/0x20 [zfs]
zfs_get_zplprop+0x8d/0x1d0 [zfs]
setup_featureflags+0x267/0x2e0 [zfs]
dmu_send_impl+0xe7/0xcb0 [zfs]
dmu_send_obj+0x265/0x360 [zfs]
zfs_ioc_send+0x10c/0x280 [zfs]

Fix this by restoring the call to dsl_dataset_hold_obj().

The same applies to to_ds: here, replace dsl_dataset_rele(&to_ds) with
dsl_dataset_rele_flags().

Both leaked key mappings will cause a panic when exporting the
sending pool or unloading the zfs module after a non-raw send from
an encrypted filesystem.

How Has This Been Tested?

Manually running the scripts at https://github.com/HankB/provoke_ZFS_corruption

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@github-actions github-actions bot added the Status: Work in Progress Not yet ready for general review label May 16, 2025
@gamanakis
Contributor Author

gamanakis commented May 16, 2025

How can I acknowledge @HankB and @pcd1193182 for the significant effort they put into this?

Member

@amotin amotin left a comment


Makes sense to me. Though after searching around this area for a couple of hours, I still have no idea why it helps with the corruptions. I think it is only a trigger, but for what?

@tonyhutter
Contributor

How can I acknowledge @HankB and @pcd1193182 for the significant effort they put into this?

Contributions-by: would be my recommendation. I see some of those in the git logs.

Contributor

@behlendorf behlendorf left a comment


The fixes here make sense. But I agree, it's not at all clear to me how this would lead to a corruption. Was the final reproducer for this something which could be reasonably adapted and included in the test suite?

@ryao
Contributor

ryao commented May 17, 2025

The fixes here make sense. But I agree, it's not at all clear to me how this would lead to a corruption. Was the final reproducer for this something which could be reasonably adapted and included in the test suite?

I have a WIP CodeQL check for instances of this problem in my git repository:

https://github.com/ryao/zfs/tree/issue-12014

I am waiting to see if it works. If it does, we should think about adding checks for other functions that could be accidentally mismatched.

@ryao
Contributor

ryao commented May 17, 2025

ryao/zfs@7c921e3 works and detected the two issues fixed in this patch. The current iteration is limited to cases where the mismatched hold and release functions are called from the same function. I might try to handle more cases before I open a PR with the check.

@ryao
Contributor

ryao commented May 17, 2025

The current iteration is limited to cases where the mismatched hold and release functions are called from the same function. I might try to handle more cases before I open a PR with the check.

I just did a manual search to see if this is even necessary. It turns out that it is: dmu_objset_hold_flags() calls dsl_dataset_hold_flags(), which calls dsl_dataset_hold_obj_flags(). On error from dmu_objset_from_ds() in dmu_objset_hold_flags(), the returned dsl_dataset_t pointer is passed to dsl_dataset_rele(), which is wrong. If I expand the check to catch mismatched hold and release functions across function calls, it should be able to identify this and possibly other bugs.

@gamanakis
Contributor Author

@ryao thank you for taking this to the next level.
@behlendorf I think expanding the ZTS would be great, but the scripts take about 2-3h to reproduce this.

@gamanakis
Contributor Author

gamanakis commented May 17, 2025

I believe the crucial issue in #12014 is the failure to clear key mappings, which is not currently handled in the code base. My suspicion is that the apparent corruption in ARC stems from a scenario where an object (snapshot) is freed, but its key mapping persists. In the current code base this happens for both fromds and to_ds. This suggests the stale key mappings are inadvertently reused.

In this context, it would likely be safe to use dsl_dataset_hold_obj_flags(&fromds) as long as we call dsl_dataset_rele_flags(&fromds) afterwards.

@HankB

HankB commented May 18, 2025

Was the final reproducer for this something which could be reasonably adapted and included in the test suite?

I'll comment because I wrote the reproducer. (Unfortunately I remain blissfully ignorant of the test suite and how it is run.)

I produced a series of scripts that:

  • Populate a pool with compressible and random files in multiple nested datasets.
  • Modify files, create and delete snapshots and send the encrypted pool to an unencrypted pool. (IOW, thrash on the pool.)

The setup of the tests is manual (copying and pasting commands from previous test notes). This could all be scripted; I just didn't do that.

On my H/W (a ten-year-old server motherboard with two 500 GB SATA SSDs) it typically takes about 2 hours to reproduce corruption. One problem with automating it is that the results are indeterminate: it's "whale on it and see if it breaks." If it typically breaks at 2 hours, how long a run ensures with reasonable confidence that the problem does not exist? Some earlier testing took on the order of a day to provoke corruption, but I was able to bring that down by increasing the activity on the pool. It's conceivable that other changes to the stack could reduce the chance of corruption, delaying onset past the typical two hours.

Perhaps the next thing would be to fully automate the setup and bring the existing scripts up to "production quality." Others have used the scripts and been kind enough not to comment on the code quality (beyond submitting PRs for bugs that I had overlooked). Whether the scripts could be included in a test suite is left to those more familiar with it.

@gamanakis
Contributor Author

62f125b: updated the commit message

@AndrewJDR

I'm just an observer here, but I'm curious: is the context and timing of the uncleared key mapping reuse understood well enough to design a synthetic test for the crash?

@owlshrimp

I'm just an observer here, but I'm curious: is the context and timing of the uncleared key mapping reuse understood well enough to design a synthetic test for the crash?

Or alternatively, can the theory be tested with some printf's or the like?

Commit message (identical to the PR description above), with trailers:

Contributions-by: Hank Barta <[email protected]>
Contributions-by: Paul Dagnelie <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
@gamanakis
Contributor Author

40567d0: updated the commit message, it's not really a panic but arc_untransform failing with ECKSUM.

The theory that the stale key mapping is reused is the only possible explanation I see. I have been trying to confirm it, but that is tricky with cmn_err(). In the current codebase all such key mappings are leaked.

@amotin amotin marked this pull request as ready for review May 19, 2025 13:59
@amotin amotin added the Status: Accepted Ready to integrate (reviewed, tested) label May 19, 2025
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review Status: Accepted Ready to integrate (reviewed, tested) labels May 19, 2025
@amotin amotin added the Status: Accepted Ready to integrate (reviewed, tested) label May 19, 2025
@amotin amotin removed the Status: Code Review Needed Ready for review and testing label May 19, 2025
@behlendorf
Contributor

@HankB thanks for the clarification on the reproducer. I figured it wouldn't be easy to add to the CI or George would have done it in this PR already, but I still wanted to ask!

@ryao the CodeQL check would be a nice addition to help us catch future issues, even if it is limited to the scope of a single calling function (for now). Plus, having a working example should make it pretty easy to extend to other similar interfaces.

@behlendorf behlendorf merged commit ea74cde into openzfs:master May 19, 2025
28 checks passed
ryao added a commit to ryao/zfs that referenced this pull request May 19, 2025
This was caught when doing a manual check to see if openzfs#17352 needed to be
improved to catch mismatches across stack frames of the kind that were
first found in openzfs#17340.

Signed-off-by: Richard Yao <[email protected]>
Successfully merging this pull request may close these issues.

ZFS corruption related to snapshots post-2.0.x upgrade
9 participants