Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock on I/O errors during device removal #17145

Merged
merged 1 commit into from
Mar 19, 2025

Conversation

amotin
Copy link
Member

@amotin amotin commented Mar 14, 2025

spa_vdev_remove_thread() should not hold svr_lock while loading metaslabs. It may block ZIO threads, required to handle metaslab loading, at least in case of read errors causing recovery writes.

How Has This Been Tested?

I hope it to fix deadlocks of removal_with_errors ZTS test that for some reason I observed plenty in the CI lately.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

spa_vdev_remove_thread() should not hold svr_lock while loading a
metaslab.  It may block ZIO threads, required to handle metaslab
loading, at least in case of read errors causing recovery writes.

Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
@amotin amotin added the Status: Code Review Needed Ready for review and testing label Mar 14, 2025
@amotin amotin mentioned this pull request Mar 17, 2025
13 tasks
@tonyhutter
Copy link
Contributor

Can we get one more reviewer on this one?

Copy link
Member

@robn robn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach seems right to me.

The only possible thing I can thing might go wrong is if two threads raced through here such that one loaded a range tree and swapped it into svr_allocd_segs, and then the other did. I suspect that won't happen or won't be a problem for other reasons though, based on normal ZFS patterns and my own intuition:

  • there's only one thread?
  • the locking dance can't fall out this way
  • its only ever going to swap the exact same contents in, so it's an effective no-op

Because I don't know this code at all and have not thought it all the way through, and because I am quite sure you know what you're doing here, take this as "well, I did at least think about it a little bit and its almost certainly fine" :)

Thanks!

@amotin
Copy link
Member Author

amotin commented Mar 19, 2025

@robn Thanks. Yes, AFAIK there can be only one thread. Plus the code asserts svr_allocd_segs is empty, first explicitly, and then (to my surprise) as part of zfs_range_tree_swap(). So it don't think there is a chance of double loading.

@amotin amotin added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 19, 2025
@amotin amotin merged commit 676b7ef into openzfs:master Mar 19, 2025
76 of 88 checks passed
@amotin amotin deleted the removal_deadlock branch March 19, 2025 18:48
amotin added a commit to amotin/zfs that referenced this pull request Mar 21, 2025
FreeBSD kernel's WITNESS code detected lock ordering violation in
spa_vdev_remove_cancel_sync().  It took svr_lock while holding
ms_lock, which is opposite to other places.  I was thinking to
resolve it similar to openzfs#17145, but looking closer I don't think
we even need svr_lock at that point, since we already asserted
svr_allocd_segs is empty, and we don't need to add there segments
we are going to call free_mapped_segment_cb for.

Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants