Rework FDT dedup log sync #17038
Conversation
I don't see any surface-level issues. However, I'm not super familiar with the ddt.c code, so hopefully we can get a DDT developer to look into that.
This all looks reasonable to me, although I haven't tested it locally. Being able to apply a limit to the log size will definitely make this more usable for those of us using it in a fail-over configuration. Just a few trivial nits.
@pcd1193182 this should be good to go - would you mind doing a squash/rebase? That should make the Fedora runners happy again.
Fedora 41 had a bunch of test failures, but I suspect they're unrelated to this PR. I've kicked off a re-run of the failed tests.
@pcd1193182 It seems the dedup_prune test failed on a number of systems.
This PR condenses the FDT dedup log syncing into a single sync pass. This reduces the overhead of modifying indirect blocks for the dedup table multiple times per txg. In addition, changes were made to the formula for how much to sync per txg. We now also consider the backlog we have to clear, to prevent it from growing too large, or remaining large on an idle system.
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Authored-by: Don Brady [email protected]
Authored-by: Paul Dagnelie [email protected]
Signed-off-by: Paul Dagnelie [email protected]
Motivation and Context
Currently, flushing the DDT log takes place over multiple sync passes. This means that the same indirect blocks can be updated several times during one sync, which partly defeats the purpose the DDT log was meant to serve in the first place. In addition, there is no mechanism in place to reduce the size of the DDT log; we try to keep up with the ingest rate, but that's it. If the log ever does grow large, we may never make progress in reducing its size, which can result in increased import times.
Description
There are two main changes included in this patch. The first is condensing all the syncing into a single sync pass. We do this by removing the code that divided the flush targets by the number of passes, and generally not doing any work beyond the first sync pass.
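As a rough illustration of the shape of that first change (a fragment only, not the actual ddt.c diff), later sync passes now simply skip the DDT log flush entirely:

```c
/*
 * Illustrative fragment, not the actual patch.  spa_sync_pass() is the
 * existing helper that reports the current sync pass; all DDT log
 * flushing now happens in pass 1, and the per-txg target is no longer
 * divided by an expected number of passes.
 */
if (spa_sync_pass(spa) > 1)
	return;
```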
The second is the modification to the flush targets for each txg. The basic algorithm has changed: rather than directly targeting the ingest rate, the primary mechanism for determining how much to flush is to look at the size of the backlog and divide it by a target turnover rate (measured in TXGs). The idea is that this will smooth out the noise in the ingest rate, and over time the flush rate will match the ingest rate. This is the result of the differential equation

`dbacklog/dt = ingest_rate - backlog/C`

which describes the change in backlog over time. It results in the backlog tending towards `C * ingest_rate`, where `C` is the turnover rate. The flush rate is then `(C * ingest_rate) / C`, which is just the ingest rate.
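To make the dynamics concrete, here is a small standalone model of that formula (plain userland C; the constants and names are illustrative, not taken from the patch), flushing `backlog / C` entries per txg against a constant ingest rate:

```c
/*
 * Standalone model of the flush-target formula described above, not the
 * actual ddt.c code: each txg we flush backlog/C entries, where C is a
 * turnover constant measured in txgs.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	const uint64_t C = 20;		/* target turnover rate, in txgs */
	const uint64_t ingest = 2000;	/* entries ingested per txg */
	uint64_t backlog = 0;

	for (int txg = 1; txg <= 100; txg++) {
		uint64_t flush = backlog / C;	/* per-txg flush target */
		backlog += ingest - flush;
		if (txg % 20 == 0)
			printf("txg %3d: flush %6llu backlog %6llu\n", txg,
			    (unsigned long long)flush,
			    (unsigned long long)backlog);
	}
	return (0);
}
```

The backlog levels off near `C * ingest_rate` (40000 here) while the per-txg flush converges to the ingest rate, which is the steady state derived above.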
However, one potential issue with this algorithm is that the backlog size is now proportional to the ingest rate. Whenever we import the pool, we have to read through the whole DDT log to build up the in-memory state, so if a user has hard requirements on import time, a large DDT log backlog can cause problems for them. As a result, there is a separate pressure-based system that keeps the backlog from rising above a cap, when that cap is set.

The way the pressure system works is that every txg, pressure is added if the backlog is above the cap and increasing; the amount added is proportional to the backlog divided by the cap, which helps us catch up to rapid spikes. If the backlog is above the cap but not increasing, we maintain the pressure; either it was a brief spike, or we've added enough pressure to bring the size down. Finally, if the backlog is below the cap, we release some of the pressure. The amount released is based on how far below the cap we are; that way, we quickly release pressure once an increased ingest rate abates, and we return to normal behavior. (A rough sketch of this pressure update follows the charts below.) Here are a few charts to help demonstrate the behavior of this cap system:
In this example, we start with an ingest rate of 2k entries per second. We have a cap of 50k set, and the target turnover rate is 20 (chosen to make the changes happen more quickly and be easier to see). At txg 10, the ingest rate increases by a factor of 3, and then at txg 100 it decreases to the baseline. As you can see, the un-capped backlog quickly grows as the flush rate slowly rises to match the new ingest rate. Meanwhile, the capped backlog's flush rate climbs quickly to bring the backlog down near the cap, and then stabilizes to keep it there. Similarly, when the ingest rate drops, the un-capped backlog quickly starts falling as the flush rate slowly drops to the new baseline. Meanwhile, the cap-based system starts to flush below the cap size and then corrects, levelling off quickly near the previous baseline.

Finally, in addition to these changes, I added a new test to the ZTS to verify that pacing works as expected.
How Has This Been Tested?
In addition to the zfs test suite, I ran several tests where I simulated various ingestion patterns into the DDT, and verified that the backlog behaved as expected with and without the cap set.