Proposal: Consider adding warnings against using zfs native encryption along with send/recv in production #494
Comments
It says here:
It seems better to have Known Issues than general guidance to not use encryption, unless there are totally unknown causes to verified and unsolved problems. But, yes, docs for any feature should advise people to consult Known Issues when they exist. |
If there is functionality around encryption that is known to cause corruption, there really ought to be an unavoidable warning in the software to act as a catch. Not everyone reads all relevant documentation before executing each command, and it seems like issues with encryption have been around long enough for software warnings to be implemented (this isn't a freshly-discovered bug). I know of plenty of people personally who presumably didn't see any warnings, just commands that went through cleanly, when turning on encryption and assumed that if it's shipping in ZFS it must be safe. |
@Matthew-Bradley that's a great idea. Sounds to me like both the documentation and the software should warn against enabling ZFS-native encryption. This is how the ZFS community can show respect and courtesy to users, especially considering that ZFS-native encryption has been causing real headaches and time-loss for real people for years now. |
This is news to me. I assumed that zfs native encryption is not as fast as native encryption, but that it is stable. Could someone more knowledgeable please clarify whether encryption is considered unsafe in general, or whether it is unsafe in combination with other features or usage patterns, and if so with which? |
(Disclaimer: I'm not knowledgeable about zfs internals. But I am an experienced user.) We've been running encryption on Ubuntu systems since it became available in the distro. We have never had any corrupt data. Until recently, the only problem we had was with Ubuntu/Jammy and a missing patch, which caused snapshots to be unmountable. The patch (or a send/recv loop) fixed that: https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1987190 But recently we did observe the send/recv issues that people are talking about. This made recently created snapshots unavailable/useless, although the data does appear to be there (according to zdb). This is tracked in #15474 but it very much looks like #12014.
openzfs/zfs#15474 (comment) (wdoekes, Nov 2023)
openzfs/zfs#12014 (comment) (aerusso, May 2021)
openzfs/zfs#12014 (comment) (jgoerzen, May 2021)
openzfs/zfs#12014 (comment) (J0riz, Nov 2023)
Usage pattern that appears to trigger this issue:
So. Yes, I would prefer that the bug gets fixed. But if you're willing to put up with maintenance of the occasional snapshot failures - which might never happen - then I think you should be fine. |
I'm going to try to steer things back on topic to the idea of adding some warnings about this feature.
|
I've added a draft of a warning message to the OP. It can of course be adjusted, but it is just there to get the ball rolling. |
@rincebrain: In your reddit post you say that you are able to reproduce "one" encryption issue 50% of the time on your test system. Which issue is that? And why can't it be debugged further if it is reproducible? |
I've personally given up trying to fix native encryption issues after the project's continued refusal to acknowledge they fucked up by merging this. I do not have the energy to argue any more that introducing a 1% chance of lighting your shit on fire in a project that's supposed to be about "reliability" where one did not exist before is a catastrophic failure, or that "it's not technically data loss because someone could write tooling to recover it" doesn't really matter if you don't have the tooling in hand, that's still data loss from everyone else's perspective, or "snapshots can cause you to get IO errors if you're doing send/recv at the same time" is a sign of how badly this is broken. Of course, as always, leadership will probably be along shortly to claim there's no issue, and that it would be bad PR to admit there's an issue, and that's why they won't warn people, like the last 2 or 3 times I've asked them to do this. The reason the reproducer system I have is difficult to debug is that it's a little sparc box, and the race in question in openzfs/zfs#11679 is very finicky, so A) it being a sparc box means most of Linux's kernel debugging tools just laugh at you and don't run, and B) if you add too many debug prints, the timing gets less reliable, so you can't just get all the information you want out reliably. |
I do think it is on-topic to try to document as best as possible when these issues occur. It would strengthen the case for putting up a warning and help readers of such a warning to make an informed decision. |
Agreed. There should be some clarity about exactly *what* is being warned against. Whether a specific warning/lockout is warranted when using certain features in combination with native encryption depends on whether the problems can *truly* be narrowed to specific combinations. Right now it looks like the answer is No, with kernel panics and random data corruption in the mix. Even if that is disregarded, corruption with native encryption seems from this thread to impact multiple features across snapshots, send/receive, and scrubbing. In these latter two cases (panics and widespread impact) top-level warnings against enabling native encryption in both the documentation and tools must be part of the solution. Some tooling friction against enabling it may also be warranted. A corresponding issue should probably be set up in https://github.com/openzfs/zfs/issues or similar to track changes to the tooling. The internet at large has already picked up on this (I first became aware of it from a phoronix article). The best thing the project can do is put strong safeguards in place to stop the flow of people being bitten. People should be absolutely certain that they can't do something dangerous without at least running into a warning or error. |
Absolutely agreed that having as much clarity as we can achieve is a good thing. If anyone has suggestions on tweaks for the warning message based on what you've learned, please chime in -- the draft is in the OP. I've tried to clarify it as much as I can, based on what I've been able to learn from the publicly available information. If someone wants to work in the testbed results from @rincebrain, that seems fine as well. I thought about doing it, but struggled with how to phrase it. Personally, I think it's important not to make the message "too scary", because that can provide folks an opening to muddy the waters with comments like "Well, I've been using it fine for years!", which while not untrue, doesn't really help the many people that try it and run into the issues. This is why the current draft of the warning mentions that many have been able to use it without issue. |
I think the only way forward is to fix the issues: since encryption has been merged, the leaders agreed to support it, just as they support fixing bugs in code that doesn't involve encryption. Otherwise there should be a warning that encryption is EXPERIMENTAL and that no one supports it: use it only if you are prepared to lose your data. |
FreeBSD Discord drew attention to https://old.reddit.com/r/zfs/comments/1aowvuj/-/kq4f13l/ (2024-02-12) in PSA: ZFS has a data corruption bug when using native encryption and send/recv.
Reference: OpenZFS Native Encryption Use Raises Data Corruption Concerns - Phoronix – I assume that this article triggered the same-day post in /r/zfs. |
Among experienced zfs users and developers, it seems to be conventional wisdom that zfs native encryption is not suitable for production usage, particularly when combined with snapshotting and zfs send/recv. There is a long-standing data corruption issue with many firsthand user reports:
openzfs/zfs#12014
openzfs/zfs#11688
(Also see the issues linked from those)
Additionally, if you join #zfs or #zfsonlinux on libera.chat and mention that you're having an issue with zfs native encryption, you'll be met with advice from developers that zfs native encryption is simply not reliable.
Should warnings be added to the sections of the documentation and/or the zfs command itself that mention native encryption that this combination of features (native encryption + send/recv) is known to be unsuitable for production usage? As-is, there don't appear to be any warnings, and it just seems inappropriate to guide new zfs users down a path toward potential data corruption, or even -- at best -- unscheduled reboots and scrubs. I have attempted writing a warning message below. This can of course be adjusted and is just here to get the ball rolling:
Begin message
End Message
Update:
I received some feedback that this was not well-substantiated enough. So for some additional context, here is a reddit comment from a zfs developer / contributor:
In addition, there is the constant stream of user reports in the issues referenced above.
I think there's already an understanding that this issue may be very difficult to fix, but in the meantime I'm just suggesting that it would be good if layman users such as myself had some documentation and zfs command level warning against using these features in production until this is resolved.
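To make the "zfs command level warning" idea a bit more concrete, here is a rough sketch of a user-side wrapper check, not anything that exists in OpenZFS itself. It reads a dataset's `encryption` property with `zfs get -H -o value encryption <dataset>` and returns a warning string if the dataset looks encrypted. The function names, the warning wording, and the choice to treat `off`, `-`, and empty output as "not encrypted" are all assumptions made for illustration.

```python
import subprocess
from typing import Optional


def encryption_enabled(zfs_get_output: str) -> bool:
    """Interpret the output of `zfs get -H -o value encryption <dataset>`.

    Assumption: "off", "-" or empty output means the dataset is not encrypted;
    any other value (e.g. "on", "aes-256-gcm") means it is.
    """
    value = zfs_get_output.strip()
    return value not in ("", "off", "-")


def warning_for(dataset: str, zfs_get_output: str) -> Optional[str]:
    """Return a warning for encrypted datasets, or None otherwise.

    The wording below is a hypothetical placeholder, not project-approved text.
    """
    if not encryption_enabled(zfs_get_output):
        return None
    return (
        f"WARNING: {dataset} uses native encryption. Combining native "
        "encryption with snapshots and send/recv has open issue reports "
        "(see openzfs/zfs#12014)."
    )


def check_dataset(dataset: str) -> Optional[str]:
    """Query the real `zfs` CLI (must be installed) and build the warning."""
    result = subprocess.run(
        ["zfs", "get", "-H", "-o", "value", "encryption", dataset],
        capture_output=True,
        text=True,
        check=True,
    )
    return warning_for(dataset, result.stdout)
```

A script that drives `zfs send` could call `check_dataset("tank/data")` first (the dataset name is only an example) and print the result to stderr if it is not `None`. A real in-tool warning would of course live in the zfs command itself and would need wording agreed on by the maintainers.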