Skip to content

ZFS hangs on kernel error: VERIFY3(0 == dmu_bonus_hold_by_dnode #12001

@vicsn

Description

@vicsn

System information

Type Version/Name
Distribution Name Debian
Distribution Version Buster
Linux Kernel 5.10.0-0.bpo.4-amd64 / 4.19.0-16
Architecture amd64
ZFS Version 2.0.3-1~bpo10+1
SPL Version 2.0.3-1~bpo10+1

Zpool properties:

              -o ashift=12
              -O checksum=sha256
              -O compression=lz4
              -O atime=off
              -O normalization=formD
              -O xattr=sa
              -O relatime=on

Example test machine properties:

  • CPU model name : AMD Ryzen Threadripper 2990WX 32-Core Processor
  • 6 HDDs of model: Gigabyte Technology Co., Ltd. X399 AORUS PRO/X399 AORUS PRO-CF, BIOS F2g 05/08/2019

Describe the problem you're observing

During an incremental send using zfs send -I | receive -F, ZFS hangs. Any subsequent call to ZFS or e.g. opening any file on the ZFS filesystem, will hang indefinitely.

Note that we also experienced: #12014 - perhaps it is related.

Describe how to reproduce the problem

This is happening frequently for us, on multiple machines, all using HDDs. The machines act as backup targets so most activity is receiving snapshots all day. Other zfs related processes on the machines are: snapshots are taken and pruned multiple times per day and a scrub happens once a week. We use encrypted filesystems and a single zpool. Historically the error has triggered for us between 1 and 48 hours.

Total data usage is 2.12T, with zpool iostat output showing:

              capacity     operations     bandwidth 
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        4.37T  10.2T     17    172   967K  5.46M

Current mitigation

Downgrading to zfs version 0.8.4-1~bpo10+1 seems to have resolved the issue - at the time of writing we had about a week month without ZFS hanging.

Include any warning/errors/backtraces from the system logs

The first error

|May  4 17:11:31 <host> kernel: [96617.381743] VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)                               
│May  4 17:11:31 <host> kernel: [96617.381882] PANIC at dmu_recv.c:1811:receive_object()                                                                
│May  4 17:11:31 <host> kernel: [96617.381990] Showing stack for process 97766                                                                          
│May  4 17:11:31 <host> kernel: [96617.382002] CPU: 55 PID: 97766 Comm: receive_writer Tainted: P           OE     4.19.0-16-amd64 #1 Debian 4.19.181-1 
│May  4 17:11:31 <host> kernel: [96617.382006] Hardware name: Gigabyte Technology Co., Ltd. X399 AORUS PRO/X399 AORUS PRO-CF, BIOS F2g 05/08/2019       
│May  4 17:11:31 <host> kernel: [96617.382008] Call Trace:                                                                                              
│May  4 17:11:31 <host> kernel: [96617.382058]  dump_stack+0x66/0x81                                                                                    
│May  4 17:11:31 <host> kernel: [96617.382098]  spl_panic+0xd3/0xfb [spl]                                                                               
│May  4 17:11:31 <host> kernel: [96617.382454]  ? dbuf_rele_and_unlock+0x30f/0x660 [zfs]                                                                
│May  4 17:11:31 <host> kernel: [96617.382562]  ? dbuf_read+0x1ca/0x520 [zfs]                                                                           
│May  4 17:11:31 <host> kernel: [96617.382642]  ? dnode_hold_impl+0x350/0xc20 [zfs]                                                                     
│May  4 17:11:31 <host> kernel: [96617.382652]  ? spl_kmem_cache_free+0xec/0x1c0 [spl]                                                                  
│May  4 17:11:31 <host> kernel: [96617.382728]  receive_object+0x923/0xc70 [zfs]                                                                        
│May  4 17:11:31 <host> kernel: [96617.382803]  receive_writer_thread+0x279/0xa00 [zfs]                                                                 
│May  4 17:11:31 <host> kernel: [96617.382817]  ? set_curr_task_fair+0x26/0x50                                                                          
│May  4 17:11:31 <host> kernel: [96617.382894]  ? receive_process_write_record+0x190/0x190 [zfs]                                                        
│May  4 17:11:31 <host> kernel: [96617.382901]  ? __thread_exit+0x20/0x20 [spl]                                                                         
│May  4 17:11:31 <host> kernel: [96617.382905]  ? thread_generic_wrapper+0x6f/0x80 [spl]                                                                
│May  4 17:11:31 <host> kernel: [96617.382910]  thread_generic_wrapper+0x6f/0x80 [spl]                                                                  
│May  4 17:11:31 <host> kernel: [96617.382920]  kthread+0x112/0x130                                                                                     
│May  4 17:11:31 <host> kernel: [96617.382926]  ? kthread_bind+0x30/0x30                                                                                
│May  4 17:11:31 <host> kernel: [96617.382935]  ret_from_fork+0x22/0x40                

Subsequent errors

│May 4 17:16:05 <host> kernel: [96890.895937] hrtimer: interrupt took 74060 ns
│May 4 17:16:20 <host> kernel: [96906.140617] INFO: task txg_quiesce:1694 blocked for more than 120 seconds.
│May 4 17:16:20 <host> kernel: [96906.140714] Tainted: P OE 4.19.0-16-amd64 #1 Debian 4.19.181-1
│May 4 17:16:20 <host> kernel: [96906.140799] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
│May 4 17:16:20 <host> kernel: [96906.140884] txg_quiesce D 0 1694 2 0x80000000
│May 4 17:16:20 <host> kernel: [96906.140889] Call Trace:
│May 4 17:16:20 <host> kernel: [96906.140903] __schedule+0x29f/0x840
│May 4 17:16:20 <host> kernel: [96906.140908] schedule+0x28/0x80
│May 4 17:16:20 <host> kernel: [96906.140924] cv_wait_common+0xfb/0x130 [spl]
│May 4 17:16:20 <host> kernel: [96906.140935] ? finish_wait+0x80/0x80
│May 4 17:16:20 <host> kernel: [96906.141077] txg_quiesce_thread+0x2a9/0x3a0 [zfs]
│May 4 17:16:20 <host> kernel: [96906.141169] ? txg_sync_thread+0x480/0x480 [zfs]
│May 4 17:16:20 <host> kernel: [96906.141176] ? __thread_exit+0x20/0x20 [spl]
│May 4 17:16:20 <host> kernel: [96906.141182] thread_generic_wrapper+0x6f/0x80 [spl]
│May 4 17:16:20 <host> kernel: [96906.141187] kthread+0x112/0x130
│May 4 17:16:20 <host> kernel: [96906.141190] ? kthread_bind+0x30/0x30
│May 4 17:16:20 <host> kernel: [96906.141193] ret_from_fork+0x22/0x40
│May 4 17:16:20 <host> kernel: [96906.141783] INFO: task receive_writer:97766 blocked for more than 120 seconds. ┤
│May 4 17:16:20 <host> kernel: [96906.141865] Tainted: P OE 4.19.0-16-amd64 #1 Debian 4.19.181-1 ┤
│May 4 17:16:20 <host> kernel: [96906.141963] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ┤
│May 4 17:16:20 <host> kernel: [96906.142065] receive_writer D 0 97766 2 0x80000000 ┤
│May 4 17:16:20 <host> kernel: [96906.142068] Call Trace: ┤
│May 4 17:16:20 <host> kernel: [96906.142074] __schedule+0x29f/0x840 ┤
│May 4 17:16:20 <host> kernel: [96906.142078] ? ret_from_fork+0x21/0x40 ┤
│May 4 17:16:20 <host> kernel: [96906.142083] schedule+0x28/0x80 ┤
│May 4 17:16:20 <host> kernel: [96906.142091] spl_panic+0xf9/0xfb [spl] ┤
│May 4 17:16:20 <host> kernel: [96906.142192] ? dbuf_rele_and_unlock+0x30f/0x660 [zfs] ┤
│May 4 17:16:20 <host> kernel: [96906.142239] ? dbuf_read+0x1ca/0x520 [zfs] │
│May 4 17:16:20 <host> kernel: [96906.142289] ? dnode_hold_impl+0x350/0xc20 [zfs] ┤
│May 4 17:16:20 <host> kernel: [96906.142295] ? spl_kmem_cache_free+0xec/0x1c0 [spl] ┤
│May 4 17:16:20 <host> kernel: [96906.142343] receive_object+0x923/0xc70 [zfs] ┤
│May 4 17:16:20 <host> kernel: [96906.142394] receive_writer_thread+0x279/0xa00 [zfs] ┤
│May 4 17:16:20 <host> kernel: [96906.142399] ? set_curr_task_fair+0x26/0x50 ┤
│May 4 17:16:20 <host> kernel: [96906.142448] ? receive_process_write_record+0x190/0x190 [zfs] ┤
│May 4 17:16:20 <host> kernel: [96906.142455] ? __thread_exit+0x20/0x20 [spl] ┤
│May 4 17:16:20 <host> kernel: [96906.142460] ? thread_generic_wrapper+0x6f/0x80 [spl] ┤
│May 4 17:16:20 <host> kernel: [96906.142465] thread_generic_wrapper+0x6f/0x80 [spl] ┤
│May 4 17:16:20 <host> kernel: [96906.142469] kthread+0x112/0x130 ┤
│May 4 17:16:20 <host> kernel: [96906.142471] ? kthread_bind+0x30/0x30 ┤
│May 4 17:16:20 <host> kernel: [96906.142473] ret_from_fork+0x22/0x40

EDIT May 11th 2021: downgrading ZFS to 0.8.4-1~bpo10+1 resolved the issue.
EDIT June 14th 2021: the issue has still been resolved on the downgraded ZFS version.
EDIT December 20th 2021: the issue is still there. Also with zfs-2.0.3-9 on 5.10.0-8-amd64

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions