Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tiering issue, hitting the bucket_alloc_failed tracepoint #768

Open
g2p opened this issue Oct 17, 2024 · 7 comments
Open

Tiering issue, hitting the bucket_alloc_failed tracepoint #768

g2p opened this issue Oct 17, 2024 · 7 comments

Comments

@g2p
Copy link
Contributor

g2p commented Oct 17, 2024

Here is the tracepoint output, showing issues writing to the promote device

(from perf trace -e 'bcachefs:*')

163193.447 kworker/u49:12/210640 bcachefs:read_promote(dev: 265289742, sector: 12958336168, nr_sector: 24, rwbs: "R")
163193.475 kworker/u49:12/210640 bcachefs:bucket_alloc_fail(dev: 265289742, str: "dev                    dm-15 (2)                                                            
watermark              stripe
data type              user
blocking               0
free                   3447
avail                  0             
copygc_wait            0/-591065977248
seen                   0                                                                                                                                                      
open                   0                                                                                                                                                      
need journal commit    0     
nocow                  0   
nouse                  0
mi_btree_bitmap        0   
err                    freelist_empty
")                                    
163193.495 kworker/u49:12/210640 bcachefs:bucket_alloc_fail(dev: 265289742, str: "dev                    dm-15 (2)                                                            
watermark              stripe                                                                                                                                                 
data type              user  
blocking               0   
free                   3447
avail                  0   
copygc_wait            0/-591065977248
seen                   0              
open                   0                                                                                                                                                      
need journal commit    0
nocow                  0
nouse                  0
mi_btree_bitmap        0
err                    freelist_empty
")
163193.514 kworker/u49:12/210640 bcachefs:bucket_alloc_fail(dev: 265289742, str: "dev                    dm-15 (2)
watermark              stripe
data type              user
blocking               1
free                   3447
avail                  0
copygc_wait            0/-591065977248
seen                   0
open                   0
need journal commit    0
nocow                  0
nouse                  0
mi_btree_bitmap        0
err                    freelist_empty
")
163193.564 kworker/u49:12/210640 bcachefs:read_promote(dev: 265289742, sector: 12958336192, nr_sector: 56, rwbs: "R")

(perf top -g is confirming this is when trying to cache)

worker_thread
process_one_work
__bch2_read_endio
bch2_data_update_read_done
bch2_write
__bch2_write
bch2_alloc_sectors_start_trans
__open_bucket_add_buckets
bch2_bucket_alloc_set_trans
bch2_bucket_alloc_trans
trace_bucket_alloc2
trace_bucket_alloc2

bcachefs show-super:

Device:                                     (unknown device)
External UUID:                             9986b9d8-9f64-43c3-9a1d-cd397dcf3643
Internal UUID:                             8c62e0f5-da29-4f90-b500-a08bd80af1c5
Magic number:                              c68573f6-66ce-90a9-d96a-60cf803df7ef
Device index:                              0
Label:                                     Down
Version:                                   1.13: inode_has_child_snapshots
Version upgrade complete:                  1.13: inode_has_child_snapshots
Oldest version on disk:                    0.27: fragmentation_lru
Created:                                   Sun Mar  5 10:42:51 2023
Sequence number:                           832
Time of last write:                        Tue Oct 15 16:08:15 2024
Superblock size:                           5.08 KiB/1.00 MiB
Clean:                                     0
Devices:                                   2
Sections:                                  members_v1,crypt,replicas_v0,disk_groups,clean,journal_seq_blacklist,journal_v2,counters,members_v2,errors,ext,downgrade
Features:                                  lz4,zstd,journal_seq_blacklist_v3,reflink,new_siphash,inline_data,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,reflink_inline_data,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features:                           alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done

Options:
  block_size:                              4.00 KiB
  btree_node_size:                         256 KiB
  errors:                                  continue [fix_safe] panic ro 
  metadata_replicas:                       1
  data_replicas:                           1
  metadata_replicas_required:              1
  data_replicas_required:                  1
  encoded_extent_max:                      64.0 KiB
  metadata_checksum:                       none [crc32c] crc64 xxhash 
  data_checksum:                           none [crc32c] crc64 xxhash 
  compression:                             lz4
  background_compression:                  zstd
  str_hash:                                crc32c crc64 [siphash] 
  metadata_target:                         none
  foreground_target:                       ssd
  background_target:                       hdd
  promote_target:                          ssd
  erasure_code:                            0
  inodes_32bit:                            1
  shard_inode_numbers:                     1
  inodes_use_key_cache:                    1
  gc_reserve_percent:                      5
  gc_reserve_bytes:                        0 B
  root_reserve_percent:                    0
  wide_macs:                               0
  promote_whole_extents:                   1
  acl:                                     1
  usrquota:                                0
  grpquota:                                0
  prjquota:                                0
  journal_flush_delay:                     1000
  journal_flush_disabled:                  0
  journal_reclaim_delay:                   100
  journal_transaction_names:               1
  allocator_stuck_timeout:                 30
  version_upgrade:                         [compatible] incompatible none 
  nocow:                                   0

members_v2 (size 448):
Device:                                    0
  Label:                                   hdd (0)
  UUID:                                    c75e08b9-b199-4daa-a17a-ac61aafee6cc
  Size:                                    11.6 TiB
  read errors:                             0
  write errors:                            0
  checksum errors:                         0
  seqread iops:                            0
  seqwrite iops:                           0
  randread iops:                           0
  randwrite iops:                          0
  Bucket size:                             512 KiB
  First bucket:                            0
  Buckets:                                 24225784
  Last mount:                              Tue Oct 15 16:08:14 2024
  Last superblock write:                   832
  State:                                   rw
  Data allowed:                            journal,btree,user
  Has data:                                btree,user
  Btree allocated bitmap blocksize:        512 MiB
  Btree allocated bitmap:                  0000000000000000000000001111111111111111111111111111111111111111
  Durability:                              1
  Discard:                                 0
  Freespace initialized:                   1
Device:                                    2
  Label:                                   ssd (1)
  UUID:                                    94af33ae-cd3b-4be1-babf-22f3eeac2530
  Size:                                    128 GiB
  read errors:                             0
  write errors:                            0
  checksum errors:                         0
  seqread iops:                            0
  seqwrite iops:                           0
  randread iops:                           0
  randwrite iops:                          0
  Bucket size:                             512 KiB
  First bucket:                            0
  Buckets:                                 262144
  Last mount:                              Tue Oct 15 16:08:14 2024
  Last superblock write:                   832
  State:                                   rw
  Data allowed:                            journal,btree,user
  Has data:                                journal,btree,user
  Btree allocated bitmap blocksize:        4.00 MiB
  Btree allocated bitmap:                  1111111111011101111111111110010110111011011111111001110111011111
  Durability:                              1
  Discard:                                 1
  Freespace initialized:                   1

errors (size 88):
accounting_mismatch                         25              Mon Sep 30 13:34:13 2024
alloc_key_fragmentation_lru_wrong           20              Fri Jul 12 10:35:17 2024
accounting_key_junk_at_end                  72              Sun Aug 11 10:32:56 2024
accounting_key_replicas_nr_devs_0           28              Sun Aug 11 10:32:56 2024
accounting_key_version_0                    183             Tue Oct 15 16:08:15 2024

Counters (this mount):

write_buffer_flush_sync            1.00B
write_super                        1.00B
btree_node_split                   2.00B
trans_restart_relock_path          2.00B
trans_restart_would_deadlock       3.00B
btree_cache_cannibalize_lock_fail  7.00B
btree_reserve_get_fail             13.0B
trans_restart_btree_node_split     13.0B
trans_restart_btree_node_reused    20.0B
trans_restart_mem_realloced        24.0B
trans_restart_relock_path_intent   72.0B
move_extent_fail                   148B
trans_restart_relock               284B
trans_restart_upgrade              322B
btree_path_upgrade_fail            519B
copygc                             1.03KiB
btree_path_relock_fail             1.71KiB
btree_cache_scan                   2.53KiB
btree_node_set_root                3.09KiB
btree_node_rewrite                 12.4KiB
bucket_alloc                       14.2KiB
bucket_discard                     14.2KiB
trans_traverse_all                 14.5KiB
btree_node_compact                 15.2KiB
btree_cache_cannibalize_lock       27.1KiB
btree_cache_cannibalize_unlock     27.1KiB
btree_node_alloc                   27.6KiB
btree_node_free                    45.5KiB
journal_write                      90.1KiB
move_extent_finish                 93.1KiB
io_move                            107KiB
move_extent_read                   107KiB
btree_cache_reap                   323KiB
btree_node_read                    339KiB
btree_node_write                   690KiB
transaction_commit                 865KiB
journal_reclaim_finish             1.68MiB
journal_reclaim_start              1.68MiB
read_split                         8.30MiB
read_bounce                        16.0MiB
read_promote                       16.0MiB
bucket_alloc_fail                  47.9MiB
io_write                           473MiB
io_read                            493MiB

Counters (since creation):

# (cd /sys/fs/bcachefs/99*/counters; grep ^ -- *) |sed -ne 's#:since filesystem creation: *#\t#p' |sed 's# ##' |grep -v $'\t0B$' |LC_ALL=C sort -hk2 |column -t
trans_restart_mark_replicas           1.00B
trans_restart_split_race              3.00B
trans_restart_would_deadlock_write    18.0B
gc_gens_end                           28.0B
gc_gens_start                         30.0B
trans_restart_relock_parent_for_fill  42.0B
write_buffer_flush_sync               59.0B
write_buffer_flush_slowpath           73.0B
trans_restart_too_many_iters          362B
trans_restart_journal_reclaim         627B
write_super                           832B
trans_restart_relock_after_fill       1.17KiB
trans_blocked_journal_reclaim         3.62KiB
trans_restart_key_cache_upgrade       5.02KiB
trans_restart_mem_realloced           5.02KiB
read_reuse_race                       13.9KiB
trans_restart_relock_next_node        13.9KiB
trans_restart_relock_path             28.5KiB
trans_restart_relock_key_cache_fill   29.9KiB
btree_reserve_get_fail                32.1KiB
trans_restart_journal_res_get         49.2KiB
btree_cache_cannibalize_lock_fail     58.3KiB
btree_node_merge                      110KiB
btree_node_set_root                   153KiB
copygc                                201KiB
trans_restart_btree_node_reused       231KiB
journal_full                          255KiB
btree_node_split                      434KiB
trans_restart_btree_node_split        514KiB
trans_restart_relock_path_intent      745KiB
trans_restart_upgrade                 773KiB
btree_path_upgrade_fail               1.07MiB
trans_restart_write_buffer_flush      1.23MiB
move_extent_fail                      1.53MiB
btree_node_rewrite                    2.49MiB
trans_restart_would_deadlock          3.73MiB
btree_node_compact                    12.7MiB
btree_cache_cannibalize_lock          15.7MiB
btree_cache_cannibalize_unlock        15.7MiB
btree_node_alloc                      16.1MiB
btree_node_free                       28.8MiB
journal_entry_full                    51.3MiB
journal_write                         69.1MiB
btree_cache_scan                      110MiB
trans_restart_relock                  141MiB
btree_path_relock_fail                194MiB
trans_traverse_all                    249MiB
bucket_invalidate                     306MiB
bucket_discard                        335MiB
bucket_alloc                          341MiB
btree_node_write                      343MiB
journal_reclaim_finish                1.85GiB
journal_reclaim_start                 1.85GiB
btree_cache_reap                      5.10GiB
btree_node_read                       5.10GiB
io_move                               8.66GiB
move_extent_read                      8.66GiB
read_split                            8.79GiB
read_promote                          10.4GiB
transaction_commit                    10.8GiB
read_bounce                           11.1GiB
bucket_alloc_fail                     11.8GiB
copygc_wait                           58.4GiB
move_extent_finish                    178GiB
io_write                              399GiB
io_read                               668GiB
@koverstreet
Copy link
Owner

bcachefs fs usage output?

@koverstreet
Copy link
Owner

I think I need more context - is copygc making progress?

@g2p
Copy link
Contributor Author

g2p commented Oct 18, 2024

Here is fs usage:

Filesystem: 9986b9d8-9f64-43c3-9a1d-cd397dcf3643
Size:                       11.1 TiB
Used:                       10.4 TiB
Online reserved:                 0 B

Data type       Required/total  Durability    Devices
reserved:       1/1                [] 43.8 GiB
btree:          1/1             1             [dm-14]             27.7 GiB
btree:          1/1             1             [dm-15]             92.0 GiB
user:           1/1             1             [dm-14]             10.2 TiB
user:           1/1             1             [dm-15]             33.3 GiB

Compression:
type              compressed    uncompressed     average extent size
lz4                  858 MiB        1010 MiB                60.6 KiB
zstd                 142 GiB         180 GiB                59.4 KiB
incompressible      10.7 TiB        10.7 TiB                23.5 KiB

Btree usage:
extents:            48.1 GiB
inodes:              114 MiB
dirents:            73.3 MiB
xattrs:              256 KiB
alloc:              2.84 GiB
quotas:              256 KiB
stripes:             256 KiB
reflink:             895 MiB
subvolumes:          256 KiB
snapshots:           256 KiB
lru:                27.3 GiB
freespace:          6.25 MiB
need_discard:        512 KiB
backpointers:       40.4 GiB
bucket_gens:        45.3 MiB
snapshot_trees:      256 KiB
deleted_inodes:      256 KiB
logged_ops:          256 KiB
rebalance_work:      512 KiB
subvolume_children:  256 KiB
accounting:         18.5 MiB

Pending rebalance work:
34.1 GiB

hdd (device 0):                dm-14              rw
                                data         buckets    fragmented
  free:                     1.34 TiB         2821351
  sb:                       3.00 MiB               7       508 KiB
  journal:                  4.00 GiB            8192
  btree:                    27.7 GiB           56721
  user:                     10.2 TiB        21339513       804 KiB
  cached:                        0 B               0
  parity:                        0 B               0
  stripe:                        0 B               0
  need_gc_gens:                  0 B               0
  need_discard:                  0 B               0
  unstriped:                     0 B               0
  capacity:                 11.6 TiB        24225784

ssd (device 2):                dm-15              rw
                                data         buckets    fragmented
  free:                     1.68 GiB            3441
  sb:                       3.00 MiB               7       508 KiB
  journal:                  1.00 GiB            2048
  btree:                    92.0 GiB          188527      16.5 MiB
  user:                     33.3 GiB           68121
  cached:                        0 B               0
  parity:                        0 B               0
  stripe:                        0 B               0
  need_gc_gens:                  0 B               0
  need_discard:                  0 B               0
  unstriped:                     0 B               0
  capacity:                  128 GiB          262144

I think copygc is stuck, "Pending rebalance work" has been stable for a long time and across reboots.

It's also strange that most of the SSD data is showing up as user and not cached.

Here is the rebalance_work btree; keys and bfloat-failed appear empty:

cat /sys/kernel/debug/bcachefs/99*/btrees/rebalance_work/formats
l 0 POS_MIN - SPOS_MAX:
    ptrs: seq 377ff76598042bc7 written 8 min_key POS_MIN durability: 1 ptr: 2:252942:512 gen 82
    format: u64s 3 fields 64:0, 64:0, 32:0, 0:0, 0:0, 0:0    unpack fn len: 0
    bytes used 0/261984 (0% full)
    sib u64s: 0, 0 (merge threshold 10916)
    nr packed keys 0
    nr unpacked keys 0
    floats 0
    failed unpacked 0
l 1 POS_MIN - SPOS_MAX:
    ptrs: seq c9d8b7bc10f603e7 written 184 min_key POS_MIN durability: 1 ptr: 2:131965:512 gen 98
    format: u64s 3 fields 64:0, 64:0, 32:0, 0:0, 0:0, 0:0    unpack fn len: 0
    bytes used 72/261984 (0% full)
    sib u64s: 65535, 65535 (merge threshold 10916)
    nr packed keys 1
    nr unpacked keys 0
    floats 0
    failed unpacked 0

@g2p
Copy link
Contributor Author

g2p commented Oct 18, 2024

Or is it just the accounting that's stuck? But I don't think pending_rebalance has been reset the last time I ran fsck. Also I have broken keys in that accounting btree (#756)

@g2p
Copy link
Contributor Author

g2p commented Oct 18, 2024

copygc does trigger on the same device:

tools/perf/perf trace -e bcachefs:copygc
     0.000 bch-copygc/998/6974 bcachefs:copygc(dev: 265289742, sectors_moved: 4712)
    26.189 bch-copygc/998/6974 bcachefs:copygc(dev: 265289742, sectors_moved: 4792)
    53.465 bch-copygc/998/6974 bcachefs:copygc(dev: 265289742, sectors_moved: 4896)
    61.943 bch-copygc/998/6974 bcachefs:copygc(dev: 265289742, sectors_moved: 3272)

@g2p
Copy link
Contributor Author

g2p commented Oct 18, 2024

bcachefs:rebalance_extent never triggers

grep ^ /sys/fs/bcachefs/99*/internal/rebalance_*
/sys/fs/bcachefs/99…/internal/rebalance_enabled:1
/sys/fs/bcachefs/99…/internal/rebalance_rate:1.00 KiB
/sys/fs/bcachefs/99…/internal/rebalance_rate_bytes:1024
/sys/fs/bcachefs/99…/internal/rebalance_rate_debug:rate:              1.00 KiB
/sys/fs/bcachefs/99…/internal/rebalance_rate_debug:target:            0 B
/sys/fs/bcachefs/99…/internal/rebalance_rate_debug:actual:            0 B
/sys/fs/bcachefs/99…/internal/rebalance_rate_debug:proportional:      0 B
/sys/fs/bcachefs/99…/internal/rebalance_rate_debug:derivative:        0 B
/sys/fs/bcachefs/99…/internal/rebalance_rate_debug:change:            0 B
/sys/fs/bcachefs/99…/internal/rebalance_rate_debug:next io:           -251785307ms
/sys/fs/bcachefs/99…/internal/rebalance_rate_d_term:30
/sys/fs/bcachefs/99…/internal/rebalance_rate_p_term_inverse:6000

@g2p g2p changed the title copygc issues, hitting the bucket_alloc_failed tracepoint Tiering issue, hitting the bucket_alloc_failed tracepoint Oct 18, 2024
@g2p
Copy link
Contributor Author

g2p commented Oct 26, 2024

I have worked around this by running bcachefs device evacuate and bcachefs device set-state rw again (I needed the filesystem).

Tiering would need to self-heal on its own but the immediate issue is gone for me. It should be reproducible again by filling a foreground target with data before adding the background target, although here the imbalance happened organically.

fs usage:

Filesystem: 9986b9d8-9f64-43c3-9a1d-cd397dcf3643
Size:                       11.1 TiB
Used:                       10.4 TiB
Online reserved:            1.91 MiB

Data type       Required/total  Durability    Devices
reserved:       1/1                [] 43.7 GiB
btree:          1/1             1             [dm-4]               111 GiB
btree:          1/1             1             [dm-5]              8.97 GiB
user:           1/1             1             [dm-4]              10.2 TiB
user:           1/1             1             [dm-5]              7.04 MiB
cached:         1/1             1             [dm-5]               107 GiB

Compression:
type              compressed    uncompressed     average extent size
lz4                 1.10 GiB        1.32 GiB                58.4 KiB
zstd                 149 GiB         192 GiB                63.0 KiB
incompressible      12.3 TiB        12.3 TiB                26.8 KiB

Btree usage:
extents:            48.3 GiB
inodes:              114 MiB
dirents:            73.3 MiB
xattrs:              256 KiB
alloc:              2.86 GiB
quotas:              256 KiB
stripes:             256 KiB
reflink:             904 MiB
subvolumes:          256 KiB
snapshots:           256 KiB
lru:                27.3 GiB
freespace:          5.50 MiB
need_discard:        512 KiB
backpointers:       40.5 GiB
bucket_gens:        45.3 MiB
snapshot_trees:      256 KiB
deleted_inodes:      256 KiB
logged_ops:          256 KiB
rebalance_work:      512 KiB
subvolume_children:  256 KiB
accounting:         18.5 MiB

Pending rebalance work:
29.5 MiB

hdd (device 0):                 dm-4              rw
                                data         buckets    fragmented
  free:                     1.19 TiB         2495812
  sb:                       3.00 MiB               7       508 KiB
  journal:                  4.00 GiB            8192
  btree:                     111 GiB          237973      5.05 GiB
  user:                     10.2 TiB        21483800      83.0 MiB
  cached:                        0 B               0
  parity:                        0 B               0
  stripe:                        0 B               0
  need_gc_gens:                  0 B               0
  need_discard:                  0 B               0
  unstriped:                     0 B               0
  capacity:                 11.6 TiB        24225784

ssd (device 2):                 dm-5              rw
                                data         buckets    fragmented
  free:                     5.01 GiB           10267
  sb:                       3.00 MiB               7       508 KiB
  journal:                  1.00 GiB            2048
  btree:                    8.97 GiB           30910      6.12 GiB
  user:                     7.04 MiB              16       982 KiB
  cached:                    107 GiB          218896      96.0 MiB
  parity:                        0 B               0
  stripe:                        0 B               0
  need_gc_gens:                  0 B               0
  need_discard:                  0 B               0
  unstriped:                     0 B               0
  capacity:                  128 GiB          262144

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants