
[6.12] Track btrfs patches #36

Draft · wants to merge 6 commits into base-6.12

Conversation

@kakra (Owner) commented Nov 23, 2024

Export patch series: https://github.com/kakra/linux/pull/36.patch

Here's a good guide by @Forza-tng: https://wiki.tnonline.net/w/Btrfs/Allocator_Hints. Please leave them a nice comment. Thanks. :-)

  • Allocator hint patches: allow preferring SSDs for meta-data allocations while excluding HDDs from meta-data allocation, which greatly improves btrfs responsiveness. The file system remains compatible with non-patched systems but won't honor allocation preferences then (a re-balance is needed to fix that after going back to a patched kernel)

To make use of the allocator hints, apply these patches to your kernel. Then run btrfs device usage /path/to/btrfs and take note of which device IDs are SSDs and which are HDDs.

Go to /sys/fs/btrfs/BTRFS-UUID/devinfo and run:

  • echo 0 | sudo tee HDD-ID/type to prefer writing data to this device (btrfs will then prefer allocating data chunks from this device before considering other devices) - recommended for HDDs, set by default
  • echo 1 | sudo tee SSD-ID/type to prefer writing meta-data to this device (btrfs will then prefer allocating meta-data chunks from this device before considering other devices) - recommended for SSDs
  • There are also types 2 and 3, which write meta-data only (2) or data only (3) to the specified device - not recommended, as they can result in early no-space situations
  • Added 2024-06-27: Type 4 can be used to avoid allocating new chunks from a device, useful if you plan on removing the device from the pool in the future: echo 4 | sudo tee LEGACY-ID/type
  • Added 2024-12-06: Type 5 can be used to prevent allocating any chunks from a device, useful if you plan on removing multiple devices from the pool in parallel: echo 5 | sudo tee LEGACY-ID/type
  • NEVER EVER use type 2 or 3 if you only have one type of device unless you know what you are doing and why
  • The default "preferred" heuristics (0 and 1) are good enough because btrfs will always allocate from the devices with the most unallocated space first (respecting the "preferred" type with this patch)
  • After changing the values, a one-time meta-data and/or data balance (optionally filtered to the affected device IDs) is needed - see the sketch right after this list
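A minimal sketch of the whole procedure (assumed, not taken from the patches): a pool mounted at /mnt/pool, its UUID stored in $UUID, devid 1 being an HDD and devid 4 being an SSD - adjust the IDs to your btrfs device usage output:

# inspect the current hints (0x00000000 = prefer data, the default)
grep . /sys/fs/btrfs/$UUID/devinfo/*/type

# HDD (devid 1): prefer data chunks
echo 0 | sudo tee /sys/fs/btrfs/$UUID/devinfo/1/type
# SSD (devid 4): prefer meta-data chunks
echo 1 | sudo tee /sys/fs/btrfs/$UUID/devinfo/4/type

# one-time balance of the meta-data chunks still sitting on the HDD;
# with the hints in place they get re-allocated to the SSD
sudo btrfs balance start -mdevid=1 /mnt/pool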

Important note: It is recommended to use at least two independent SSDs so that the btrfs meta-data raid1 requirement is still satisfied. You can, however, create two partitions on the same SSD, but then the meta-data is no longer protected against hardware faults - it is essentially dup-quality meta-data then, not raid1. Before sizing the partitions, look at btrfs device usage to find the amount of meta-data, and use at least double that size for your meta-data partitions.
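As a rough sizing aid (a sketch, assuming the pool is mounted at /mnt/pool): check how much meta-data is currently allocated per device and size each meta-data partition to at least twice the largest figure:

sudo btrfs device usage /mnt/pool | grep -E 'ID:|Metadata'
# e.g. ~27 GiB of raid1 meta-data per device -> plan for 64-128 GiB partitions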

This can be combined with bcache by using the meta-data partitions directly as native SSD partitions for btrfs, and only routing the data partitions through bcache. This also takes a lot of meta-data pressure off bcache, making it more efficient and less write-wearing as a result.
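A sketch of how such a layout could be assembled (device names and partition numbers are hypothetical and only mirror the example below; bcache-tools must be installed, and the allocator hints from the list above still need to be set afterwards):

# SSD: sde2 becomes the bcache cache set, sde4/sde5 stay native btrfs partitions
sudo make-bcache -C /dev/sde2
# each HDD data partition becomes a bcache backing device (repeat for sdb2, sdd2 -> /dev/bcache0..2)
sudo make-bcache -B /dev/sda2
# attach the backing device to the cache set (cset UUID from: bcache-super-show /dev/sde2)
echo CSET-UUID | sudo tee /sys/block/bcache0/bcache/attach
# btrfs then spans the bcache devices (data) plus the native SSD partitions (meta-data)
sudo mkfs.btrfs -d single -m raid1 /dev/bcache0 /dev/bcache1 /dev/bcache2 /dev/sde4 /dev/sde5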

Real-world example

In this example, sde is a 1 TB SSD with two meta-data partitions (2x 128 GB) and the remaining space dedicated to a single bcache partition that caches my btrfs pool devices:

# btrfs device usage /
/dev/bcache2, ID: 1
   Device size:             3.63TiB
   Device slack:            3.50KiB
   Data,single:             1.66TiB
   Unallocated:             1.97TiB

/dev/bcache0, ID: 2
   Device size:             3.63TiB
   Device slack:            3.50KiB
   Data,single:             1.66TiB
   Unallocated:             1.97TiB

/dev/bcache1, ID: 3
   Device size:             2.70TiB
   Device slack:            3.50KiB
   Data,single:           752.00GiB
   Unallocated:             1.96TiB

/dev/sde4, ID: 4
   Device size:           128.00GiB
   Device slack:              0.00B
   Metadata,RAID1:         27.00GiB
   System,RAID1:           32.00MiB
   Unallocated:           100.97GiB

/dev/sde5, ID: 5
   Device size:           128.01GiB
   Device slack:              0.00B
   Metadata,RAID1:         27.00GiB
   System,RAID1:           32.00MiB
   Unallocated:           100.98GiB

# bcache show
Name            Type            State                   Bname           AttachToDev
/dev/sdd2       1 (data)        dirty(running)          bcache1         /dev/sde2
/dev/sdb2       1 (data)        dirty(running)          bcache2         /dev/sde2
/dev/sde2       3 (cache)       active                  N/A             N/A
/dev/sdc2       1 (data)        clean(running)          bcache3         /dev/sde2
/dev/sda2       1 (data)        dirty(running)          bcache0         /dev/sde2

A curious reader may notice that sde1 and sde3 are missing: these are my EFI boot partition (sde1) and swap space (sde3).
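For completeness, the allocator hints matching this layout would mark the bcache-backed devices (IDs 1-3) as "prefer data" and the SSD partitions (IDs 4 and 5) as "prefer meta-data"; checking them would look roughly like this (output assumed, UUID abbreviated):

grep . /sys/fs/btrfs/BTRFS-UUID/devinfo/*/type
# 1/type:0x00000000   prefer data       (/dev/bcache2)
# 2/type:0x00000000   prefer data       (/dev/bcache0)
# 3/type:0x00000000   prefer data       (/dev/bcache1)
# 4/type:0x00000001   prefer meta-data  (/dev/sde4)
# 5/type:0x00000001   prefer meta-data  (/dev/sde5)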

kreijack and others added 5 commits November 18, 2024 15:26
Add the following flags to give a hint about which chunk should be
allocated on which disk.
The following flags are created:

- BTRFS_DEV_ALLOCATION_PREFERRED_DATA
  preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
  preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
  only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY
  only data chunk allowed

Signed-off-by: Goffredo Baroncelli <[email protected]>
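For reference, these flags appear to correspond to the numeric values written to devinfo/<devid>/type in the usage notes above (my reading of the series; the numbers 0-3 are the ones documented there):

# 0  BTRFS_DEV_ALLOCATION_PREFERRED_DATA      (default)
# 1  BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
# 2  BTRFS_DEV_ALLOCATION_METADATA_ONLY
# 3  BTRFS_DEV_ALLOCATION_DATA_ONLY
echo 1 | sudo tee /sys/fs/btrfs/BTRFS-UUID/devinfo/DEVID/type   # e.g. prefer meta-data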
When this mode is enabled, the chunk allocation policy is modified as
follows.

Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)

Where:
- ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
the X chunk type (the other type may be allowed when the space is low)
- ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
This also means that it is a preferred choice.

Each time the allocator allocates a chunk of type X, it first takes the
disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
is not enough, it also uses the disks tagged as ALLOCATION_METADATA_ONLY;
if the space is still not enough, it also uses the other disks, with the
exception of the ones marked as ALLOCATION_PREFERRED_Y, where Y is the
other type of chunk (i.e. not X).

Signed-off-by: Goffredo Baroncelli <[email protected]>
This is useful where you want to prevent new allocations of chunks on a
disk which is going to be removed from the pool anyway, e.g. due to bad
blocks or because it's slow.

Signed-off-by: Kai Krakow <[email protected]>
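A hedged example of how this could be used before a single-device removal (devid 6 and the mount point /mnt/pool are hypothetical):

# stop allocating new chunks on devid 6; existing chunks stay where they are
echo 4 | sudo tee /sys/fs/btrfs/BTRFS-UUID/devinfo/6/type
# later, remove the device; its chunks get migrated to the remaining disks
sudo btrfs device remove 6 /mnt/pool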
This is useful where you want to prevent new allocations of chunks on a
set of multiple disks which are going to be removed from the pool. It
acts like `btrfs dev remove` on steroids: multiple disks can be removed
in parallel without moving data to disks which would be removed in the
next round. In such cases, it avoids moving the same data multiple
times, and thus avoids placing it on potentially bad disks.

Thanks to @Zygo for the explanation and suggestion.

Link: kdave/btrfs-progs#907 (comment)
Signed-off-by: Kai Krakow <[email protected]>
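A sketch of the parallel-removal workflow described above (devids 6 and 7 and the mount point are hypothetical):

# exclude both doomed devices from any new allocations first ...
for id in 6 7; do
  echo 5 | sudo tee /sys/fs/btrfs/BTRFS-UUID/devinfo/$id/type
done
# ... so that removing one never migrates data onto the other
sudo btrfs device remove 6 /mnt/pool
sudo btrfs device remove 7 /mnt/pool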
@tanriol commented Feb 6, 2025

Hi. What's the status of these patches? Are they something that's going to be upstream in a reasonable amount of time, or a long-term external patch series?

@kakra (Owner, Author) commented Feb 7, 2025

These won't go into the kernel as-is and may be replaced by a different implementation in the kernel sooner or later. But I keep them safe to use - i.e. they don't create incompatibilities with future kernels and can simply be dropped from your kernel without posing any danger to your btrfs.

@Forza-tng has some explanations why those patches won't go into the kernel: https://wiki.tnonline.net/w/Btrfs/Allocator_Hints

@Forza-tng

Hi.

I noticed that type 0 (data preferred) has higher priority than type 3 (data only). This can lead to interesting cases. For example, on one server I had replaced two disks but forgot to set type 3, so they were left as type 0, while the other disks were type 3.

The result for this RAID10 was that new data was stored only on the new disks (IDs 15 and 16) as a 2-stripe RAID10 instead of the expected 10-stripe RAID10.
[screenshot: Skärmbild 2025-02-14]

While this was unintended, perhaps this effect could be used to tier data chunks? To test this, I created a 3-device btrfs:

1 = ssd
2 = hdd
3 = nvme

❯ grep . /sys/fs/btrfs/238f21dc-8199-4eaa-b503-ecd2983456d6/devinfo/*/type
1/type:0x00000000 # data preferred
2/type:0x00000003 # data only
3/type:0x00000002 # metadata only

❯ btrfs fi us -T .
Overall:
    Device size:		 115.00GiB
    Device allocated:		   6.03GiB
    Device unallocated:		 108.97GiB
    Device missing:		     0.00B
    Device slack:		     0.00B
    Used:			   3.00GiB
    Free (estimated):		 110.97GiB	(min: 110.97GiB)
    Free (statfs, df):		 110.97GiB
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		   5.50MiB	(used: 16.00KiB)
    Multiple profiles:		        no

              Data    Metadata System                              
Id Path       single  single   single   Unallocated Total     Slack
-- ---------- ------- -------- -------- ----------- --------- -----
 1 /dev/loop0 5.00GiB        -        -     5.00GiB  10.00GiB     -
 2 /dev/loop1       -        -        -   100.00GiB 100.00GiB     -
 3 /dev/loop2       -  1.00GiB 32.00MiB     3.97GiB   5.00GiB     -
-- ---------- ------- -------- -------- ----------- --------- -----
   Total      5.00GiB  1.00GiB 32.00MiB   108.97GiB 115.00GiB 0.00B
   Used       3.00GiB  3.17MiB 16.00KiB                            



❯ dd if=/dev/zero of=file6.data count=10000000

❯ btrfs fi us -T .
Overall:
    Device size:                 115.00GiB
    Device allocated:             10.03GiB
    Device unallocated:          104.97GiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                          8.78GiB
    Free (estimated):            105.20GiB      (min: 105.20GiB)
    Free (statfs, df):           105.20GiB
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:                7.67MiB      (used: 0.00B)
    Multiple profiles:                  no
              Data    Metadata System
Id Path       single  single   single   Unallocated Total     Slack
-- ---------- ------- -------- -------- ----------- --------- -----
 1 /dev/loop0 9.00GiB        -        -     1.00GiB  10.00GiB     -
 2 /dev/loop1       -        -        -   100.00GiB 100.00GiB     -
 3 /dev/loop2       -  1.00GiB 32.00MiB     3.97GiB   5.00GiB     -
-- ---------- ------- -------- -------- ----------- --------- -----
   Total      9.00GiB  1.00GiB 32.00MiB   104.97GiB 115.00GiB 0.00B
   Used       8.77GiB  9.08MiB 16.00KiB
   
   
❯ dd if=/dev/zero of=file7.data count=10000000

❯ btrfs fi us -T .
Overall:
    Device size:                 115.00GiB
    Device allocated:             15.03GiB
    Device unallocated:           99.97GiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                         13.03GiB
    Free (estimated):            100.95GiB      (min: 100.95GiB)
    Free (statfs, df):           100.95GiB
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:               12.81MiB      (used: 0.00B)
    Multiple profiles:                  no
              Data     Metadata System
Id Path       single   single   single   Unallocated Total     Slack
-- ---------- -------- -------- -------- ----------- --------- -----
 1 /dev/loop0 10.00GiB        -        -     1.00MiB  10.00GiB     -
 2 /dev/loop1  4.00GiB        -        -    96.00GiB 100.00GiB     -
 3 /dev/loop2        -  1.00GiB 32.00MiB     3.97GiB   5.00GiB     -
-- ---------- -------- -------- -------- ----------- --------- -----
   Total      14.00GiB  1.00GiB 32.00MiB    99.97GiB 115.00GiB 0.00B
   Used       13.02GiB 13.41MiB 16.00KiB

We can see that the smaller SSD fills up before data spills over onto the HDD.

OK, perhaps not very useful since we cannot easily move hot data back onto the SSD. However, it could suffice as an emergency overflow to avoid ENOSPC if your workload can survive the reduced IOPS... :)

@kakra (Owner, Author) commented Feb 14, 2025

Yes, I think this is intentional behavior of the initial version of the patches: the type numbers are generally used as a priority sort, with the *-only types acting as a kind of exception. My newly added types follow a similar exception rule.

I'm not sure if it would be useful to put data on data-only disks first.

I think my idea of using chunk size classes for tiering may be more useful than this side-effect (which I mentioned in a report over at btrfs-todo).

But in theory, type 0 and type 3 should be treated equally as soon as the remaining unallocated space is identical... Did you reach that point? (looks like your loop dev example did exactly that if I followed correctly)

But in the end: Well, "preferred" means "preferred", doesn't it? ;-)

@Forza-tng

But in theory, type 0 and type 3 should be treated equally as soon as the remaining unallocated space is identical... Did you reach that point? (looks like your loop dev example did exactly that if I followed correctly)

The loop test did the opposite of this: the type 0 device was filled before the type 3 one, even though it was smaller and had less unallocated space. I had expected types 3 and 0 to be treated equally, but we see that this isn't the case?

It isn't wrong or bad, just something I hadn't thought would happen.

But in the end: Well, "preferred" means "preferred", doesn't it? ;-)

Indeed 😁
