Rework FDT dedup log sync #17038
Conversation
I don't see any surface-level issues. However, I'm not super familiar with the ddt.c code, so hopefully we can get a DDT developer to look into that.
This all looks reasonable to me, although I haven't tested it locally. Being able to apply a limit to the log size will definitely make this more usable for those of us using it in a fail-over configuration. Just a few trivial nits.
@pcd1193182 this should be good to go - would you mind doing a squash/rebase? That should make the Fedora runners happy again.
Fedora 41 had a bunch of test failures, but I suspect they're unrelated to this PR. I've kicked off a re-run of the failed tests.
@pcd1193182 It seems the dedup_prune test failed on a number of systems.
This PR condenses the FDT dedup log syncing into a single sync pass. This reduces the overhead of modifying indirect blocks for the dedup table multiple times per txg. In addition, changes were made to the formula for how much to sync per txg. We now also consider the backlog we have to clear, to prevent it from growing too large, or remaining large on an idle system.
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Authored-by: Don Brady [email protected]
Authored-by: Paul Dagnelie [email protected]
Signed-off-by: Paul Dagnelie [email protected]
Motivation and Context
Currently, flushing the DDT log takes place over multiple sync passes. This means that the same indirect blocks can be updated several times during one sync, which partly defeats the purpose the DDT log was meant to serve in the first place. In addition, there is no mechanism in place to reduce the size of the DDT log; we try to keep up with the ingest rate, but that's it. If the log ever does grow large, we may never make progress in reducing its size, which can result in increased import times.
Description
There are two main changes included in this patch. The first is condensing all the syncing into a single sync pass. We do this by removing the code that divided the flush targets by the number of passes, and generally not doing any work beyond the first sync pass.
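As a rough illustration of the shape of that first change (a fragment only, not the actual ddt.c diff), later sync passes now simply skip the DDT log flush entirely:

```c
/*
 * Illustrative fragment, not the actual patch.  spa_sync_pass() is the
 * existing helper that reports the current sync pass; all DDT log
 * flushing now happens in pass 1, and the per-txg target is no longer
 * divided by an expected number of passes.
 */
if (spa_sync_pass(spa) > 1)
	return;
```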
The second is the modification to the flush targets for each txg. The basic algorithm has changed: rather than directly targeting the ingest rate, the primary mechanism for determining how much to flush is to look at the size of the backlog and divide it by a target turnover rate (measured in TXGs). The idea is that this will smooth out the noise in the ingest rate, and over time the flush rate will match the ingest rate. This is the result of the differential equation

`dbacklog/dt = ingest_rate - backlog/C`

which describes the change in backlog over time. It results in the backlog tending towards `C * ingest_rate`, where `C` is the turnover rate. The flush rate is then `(C * ingest_rate) / C`, which is just the ingest rate.
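To make the dynamics concrete, here is a small standalone model of that formula (plain userland C; the constants and names are illustrative, not taken from the patch), flushing `backlog / C` entries per txg against a constant ingest rate:

```c
/*
 * Standalone model of the flush-target formula described above, not the
 * actual ddt.c code: each txg we flush backlog/C entries, where C is a
 * turnover constant measured in txgs.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	const uint64_t C = 20;		/* target turnover rate, in txgs */
	const uint64_t ingest = 2000;	/* entries ingested per txg */
	uint64_t backlog = 0;

	for (int txg = 1; txg <= 100; txg++) {
		uint64_t flush = backlog / C;	/* per-txg flush target */
		backlog += ingest - flush;
		if (txg % 20 == 0)
			printf("txg %3d: flush %6llu backlog %6llu\n", txg,
			    (unsigned long long)flush,
			    (unsigned long long)backlog);
	}
	return (0);
}
```

The backlog levels off near `C * ingest_rate` (40000 here) while the per-txg flush converges to the ingest rate, which is the steady state derived above.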
However, one potential issue with this algorithm is that the backlog size is now proportional to the ingest rate. Whenever we import the pool, we have to read through the whole DDT log to build up the in-memory state, so if a user has hard requirements on import time, a large DDT log backlog can cause problems for them. As a result, there is a separate pressure-based system that keeps the backlog from rising above a cap, when that cap is set.

The way the pressure system works is that every txg, pressure is added if the backlog is above the cap and increasing; the amount added is proportional to the backlog divided by the cap, which helps us catch up to rapid spikes. If the backlog is above the cap but not increasing, we maintain the pressure; either it was a brief spike, or we've added enough pressure to bring the size down. Finally, if the backlog is below the cap, we release some of the pressure. The amount released is based on how far below the cap we are; that way, we quickly release pressure once an increased ingest rate abates, and we return to normal behavior. (A rough sketch of this pressure update follows the charts below.) Here are a few charts to help demonstrate the behavior of this cap system:
In this example, we start with an ingest rate of 2k entries per second. We have a cap of 50k set, and the target turnover rate is 20 (chosen to make the changes happen more quickly and be easier to see). At txg 10, the ingest rate increases by a factor of 3, and then at txg 100 it decreases to the baseline. As you can see, the un-capped backlog quickly grows as the flush rate slowly rises to match the new ingest rate. Meanwhile, the capped backlog's flush rate climbs quickly to bring the backlog down near the cap, and then stabilizes to keep it there. Similarly, when the ingest rate drops, the un-capped backlog quickly starts falling as the flush rate slowly drops to the new baseline. Meanwhile, the cap-based system starts to flush below the cap size and then corrects, levelling off quickly near the previous baseline.

Finally, in addition to these changes, I added a new test to the ZTS to verify that pacing works as expected.
How Has This Been Tested?
In addition to the zfs test suite, I ran several tests where I simulated various ingestion patterns into the DDT, and verified that the backlog behaved as expected with and without the cap set.