
fair_queue: make the fair_group token grabbing discipline more fair #2616

Merged
merged 4 commits into scylladb:master from iosched_token_batching
Jan 27, 2025

Conversation

michoecho
Contributor

@michoecho michoecho commented Jan 14, 2025

The current design of fair_group isn't fair enough to shards.
During contention, the group will be -- approximately -- taking requests
from shards one-by-one, in round robin.

This guarantees that each contender will dispatch an equal number of
requests. This is some kind of fairness, but it's not the kind we want,
probably ever.

A better kind of fairness is that under contention each shard should
be guaranteed 1/nr_shards of the disk's IOPS and/or 1/nr_shards of
byte-bandwidth, whichever dimension it pressures more.
This is needed so that each shard can be relied on to sustain a certain
rate of requests -- the lower bound of the slowest shard's throughput
usually dictates the throughput of the entire cluster.

But those two kinds of fairness are only the same if all IO requests have
the same size and direction. Otherwise they can be drastically different.

With the current design it's easy to create a situation where a shard receives
an arbitrarily small fraction of both IOPS and bandwidth, despite being
IO-bound. (Example: a node with X shards, where one shard spams only very small
requests and the other shards spam only big requests).
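
(An illustrative calculation, assuming 8 shards and the 16x cost ratio between 128 kiB and 1 kiB requests used below: under request-count round robin, each rotation dispatches one request per shard, so the small-request shard's share of the disk's token throughput is 1 / (1 + 16*7) = 1/113 ≈ 0.9%, against a fair share of 1/8 = 12.5%. The gap grows without bound as the shard count or the request-size ratio grows.)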

This is a problem in practice. In ScyllaDB, we observed IO starvation of
some shards during realistic workloads. While such situations require some workload
asymmetry to occur, even small asymmetries can cause serious unfairness.
(For example, a shard which receives 6% more database queries than the other
shards can be starved to less than 50% of its fair share of IOPS and/or
bandwidth -- because each of those surplus 1 kiB queries is "fairly" matched with
16x costlier 128 kiB low-priority batch IO requests on other shards).

To improve this, fair_group needs a different queueing discipline.
There are many possible ways, but this patch chooses the one which is relatively
the most similar to the current one. The main idea is that we still rely on the
"approximate round robin" of the token bucket as the basis for fairness, but we reserve
a fixed-size batch of tokens at a time, rather than a fixed-size (i.e. 1) batch
of requests at a time. This turns the discipline from request-fair to
token-fair, which is what we want.

The implementation details are non-trivial, though, and should be carefully reviewed.
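
(To illustrate the shape of the change only, here is a deliberately minimal sketch with hypothetical names; it is not the actual fair_group/fair_queue code, which also has to deal with pending reservations, replenishment and token recycling:)

```cpp
#include <cstdint>
#include <algorithm>

using capacity_t = uint64_t;

// Shared token source; replenished over time elsewhere.
struct shared_token_bucket {
    capacity_t available = 0;
    capacity_t try_grab(capacity_t want) {
        capacity_t got = std::min(want, available);
        available -= got;
        return got;
    }
};

constexpr capacity_t per_tick_grab_threshold = 20;

// Old discipline (request-fair): each turn a shard asks for exactly the cost of
// its head request, so a shard with cheap requests and a shard with expensive
// requests both dispatch one request per round-robin rotation.
capacity_t request_fair_grab(shared_token_bucket& bucket, capacity_t head_request_cost) {
    return bucket.try_grab(head_request_cost);
}

// New discipline (token-fair): each turn a shard asks for a fixed batch of tokens,
// independent of its request sizes, and dispatches as many queued requests as its
// accumulated tokens can pay for.
capacity_t token_fair_grab(shared_token_bucket& bucket) {
    return bucket.try_grab(per_tick_grab_threshold);
}
```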

This change doesn't result in full fairness. The degree of fairness depends on how often
a given shard polls. A continuously-polling shard should get its fair share of disk bandwidth,
but a rarely-polling shard might only get a fraction of its fair share. But with the current default values
for IO scheduler parameters, a shard should be guaranteed at least 60% of its fair share
in the worst case, which is decent, and better than the current "arbitrarily-low%".

Also note that by "fair share" we mean 1 / nr_shards_in_group, not 1/nr_actively_contending_shards_in_group. If there are 30 shards in a group but only 2 are contending for IO — one polling rarely and one polling continuously — the rarely-polling contender will only get ~1/30 of the disk's bandwidth, not ~1/2. This is not a problem for applications where the interactive workloads are mostly symmetric (because then it's the "normal" case that all shards are contending and none can expect more than 1/30), but it's unsatisfying.

Different scheduling algorithms could improve on both of the above. We have some ideas for that, but they would be a bigger departure from the status quo, and they would be harder to implement well.

@michoecho michoecho requested a review from xemul January 14, 2025 03:01
@michoecho
Contributor Author

Refs (to some degree: fixes) #2615

src/core/fair_queue.cc (review thread on an outdated diff)
// It shouldn't matter in practice.
grab_amount = std::min<capacity_t>(grab_amount, _group.maximum_capacity());
_group.refund_tokens(recycled);
grab_capacity(grab_amount);
@xemul xemul Jan 15, 2025

There's no fair_queue::grab_capacity(capacity_t) method here, only the fair_queue::grab_capacity(const fair_queue_entry&) one, how does it compile?

@michoecho (Contributor Author)

grab_amount is a local variable, though? Not a function.

@xemul (Contributor)

My misprint :( updated the comment

@michoecho (Contributor Author)

Huh, you're right. Apparently it works because there's an implicit conversion from capacity_t to fair_queue_entry:

fair_queue_entry(capacity_t c) noexcept
: _capacity(c) {}

But it does the expected thing, so... whatever, I guess?
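
(For anyone following along, a minimal self-contained illustration of that conversion; this is not the actual Seastar code:)

```cpp
#include <cstdint>

using capacity_t = uint64_t;

struct fair_queue_entry {
    capacity_t _capacity;
    // Non-explicit single-argument constructor: this is what enables the
    // implicit conversion from capacity_t.
    fair_queue_entry(capacity_t c) noexcept : _capacity(c) {}
};

struct fair_queue {
    capacity_t grabbed = 0;
    // Only this overload exists, yet grab_capacity(grab_amount) compiles,
    // because grab_amount converts into a temporary fair_queue_entry.
    void grab_capacity(const fair_queue_entry& ent) { grabbed += ent._capacity; }
};

int main() {
    fair_queue fq;
    capacity_t grab_amount = 42;
    fq.grab_capacity(grab_amount); // implicit conversion happens here
    // With `explicit` on the constructor, this line would fail to compile,
    // forcing something like fq.grab_capacity(fair_queue_entry{grab_amount}).
}
```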

@xemul (Contributor)

Please fix it with the assumption that fair_queue_entry constructor is explicit

(Further review threads on include/seastar/core/fair_queue.hh and src/core/fair_queue.cc were resolved.)
@michoecho michoecho force-pushed the iosched_token_batching branch 2 times, most recently from 264a61f to d62436f Compare January 20, 2025 01:02
@michoecho
Contributor Author

v2:
Functional changes:

  • Removed the maybe_replenish_capacity() call from reap_pending_capacity(). Now maybe_replenish_capacity() is only called from dispatch_requests().
  • Fixed an incompatibility between the patch and fair_queue_test.

Cosmetic changes:
  • Addressed review comments:
    • reap_pending_capacity now returns our_turn_has_come via a return value rather than via an output parameter.
    • The addition of _queued_capacity is now in a separate commit.
    • Several typos were fixed.
  • Removed the rewrite of the doc comment of fair_group; it didn't really belong in this PR.
  • Some misleading comments were rewritten.

@xemul Please re-review.

But I was thinking: is it really necessary to have a "timeline" of dispatches, where each shard only performs the dispatch after the bucket head reaches its _pending.head?

If this isn't necessary, then I see an alternative which is both simpler and has better guarantees. See michoecho@d5a559a. The idea is: each shard has a _balance variable. While _balance is non-negative and there are queued requests, we dispatch a request and subtract its cost from _balance. While _balance is negative, we grab (reserve) a fixed batch of tokens, and add it to _balance after the reservation is fulfilled by the bucket head.

This means that we dispatch first, then we grab tokens to pay for that, and only then we dispatch again. There is no attempt to use the _pending.head as an ideal dispatch timepoint; grabs are only used to limit the throughput. Is there a problem with that?
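
(A rough sketch of one possible reading of that idea, with hypothetical names and without the shared-bucket plumbing, just to make the proposed loop concrete:)

```cpp
#include <cstdint>
#include <deque>

using capacity_t = int64_t; // signed on purpose: _balance is allowed to go negative

struct request { capacity_t cost; };

constexpr capacity_t batch_size = 20; // e.g. per_tick_grab_threshold

struct shard_state {
    std::deque<request> queued;
    capacity_t _balance = 0;           // tokens we may spend right now
    capacity_t reserved_in_flight = 0; // batches grabbed but not yet fulfilled by the bucket head

    // Called on every poll. `fulfilled` is how many previously reserved tokens the
    // bucket head has released since the last poll; `grab_batch` reserves a batch
    // in the shared bucket; `dispatch` sends a request to disk.
    template <typename GrabBatch, typename Dispatch>
    void poll(capacity_t fulfilled, GrabBatch grab_batch, Dispatch dispatch) {
        _balance += fulfilled;          // credit the fulfilled reservations
        reserved_in_flight -= fulfilled;

        // Dispatch first: while the balance is non-negative, send requests and pay
        // for them, possibly driving the balance negative on the last one.
        while (_balance >= 0 && !queued.empty()) {
            request r = queued.front();
            queued.pop_front();
            dispatch(r);
            _balance -= r.cost;
        }
        // Then reserve fixed-size batches until the outstanding reservations cover
        // the debt; grabs only throttle throughput, they are not dispatch timepoints.
        while (_balance + reserved_in_flight < 0) {
            grab_batch(batch_size);
            reserved_in_flight += batch_size;
        }
    }
};
```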

@xemul
Contributor

xemul commented Jan 20, 2025

But I was thinking: is it really necessary to have a "timeline" of dispatches, where each shard only performs the dispatch after the bucket head reaches its _pending.head?

If we want to make shards dispatch independently, then the timeline can be avoided, but presumably I'm misunderstanding what you call "a timeline" (see below)

While _balance is negative, we grab (reserve) a fixed batch of tokens, and add it to _balance after the reservation is fulfilled by bucket head.

This "after the reservation is fulfilled by bucket head", how does it differ from "each shard only performs the dispatch after the bucket head reaches its _pending.head"?

@@ -0,0 +1,39 @@
#!/usr/bin/env bash
Contributor

Please, move all new tests into tests/manual/

@michoecho
Contributor Author

michoecho commented Jan 20, 2025

But I was thinking: is it really necessary to have a "timeline" of dispatches, where each shard only performs the dispatch after the bucket head reaches its _pending.head?

If we want to make shards dispatch independently, then the timeline can be avoided, but presumably I'm misunderstanding what you call "a timeline" (see below)

While _balance is negative, we grab (reserve) a fixed batch of tokens, and add it to _balance after the reservation is fulfilled by bucket head.

This "after the reservation is fulfilled by bucket head", how does it differ from "each shard only performs the dispatch after the bucket head reaches its _pending.head"?

@xemul Imagine that per_tick_grab_threshold is 20 tokens, while all requests cost 100 tokens, and there are 7 shards.
When we are sticking to the "timeline" idea, we want one shard to dispatch at timepoint 100, another at 200, ..., last at 700. And then another round at 800, 900, etc.
If we ignore the "timeline", then we are fine with all 7 shards dispatching together at timepoint 0, then eating tokens (in 5 rotations of round robin) to pay for their requests, and then all dispatching again at ~700.

The main benefit of that is that the total amount of pending reservations has a hard limit at nr_contenders * per_tick_grab_threshold. With the design currently used in this PR, where we put big requests on the "timeline" in their entirety, the total amount of pending reservations can reach nr_contenders * (per_tick_grab_threshold + max_request_size).

The "fairness" of the entire batching idea is that each shard can grab a batch once per "round robin", so the shorter the round robin, the better the fairness. In this case, for our default reactor parameters, the improvement from nr_contenders * (per_tick_grab_threshold + max_request_size) to nr_contenders * per_tick_grab_threshold should result in an improvement in worst-case fairness guarantees from "each shard gets at least ~30% of its fair share" to "each shard gets at least ~60% of its fair share".

@xemul
Contributor

xemul commented Jan 20, 2025

If we ignore the "timeline", then we are fine with all 7 shards dispatching together at timepoint 0, then eating tokens (in 5 rotations of round robin) to pay for their requests, and then all dispatching again at ~700.

OK, but isn't it another "timeline"?

Imagine that per_tick_grab_threshold is 20 tokens, while all requests cost 100 tokens, and there are 7 shards.
When we are sticking to the "timeline" idea, we want one shard to dispatch at timepoint 100, another at 200, ..., last at 700. And then another round at 800, 900, etc.
If we ignore the "timeline", then we are fine with all 7 shards dispatching together at timepoint 0, then eating tokens (in 5 rotations of round robin) to pay for their requests, and then all dispatching again at ~700.

It can work both ways. We overload the disk in the hope that it will handle the load within the upcoming idling time (corresponding to 700 tokens). On one hand, we increase the in-disk concurrency and it can do a better job than serving requests coming one-by-one; but on the other, in-disk load->latency doesn't scale linearly, so the disk may want time equivalent to, say, 1400 tokens, and once we continue dispatching after 700 the disk will still be busy.

@michoecho
Contributor Author

michoecho commented Jan 20, 2025

If we ignore the "timeline", then we are fine with all 7 shards dispatching together at timepoint 0, then eating tokens (in 5 rotations of round robin) to pay for their requests, and then all dispatching again at ~700.

OK, but isn't it another "timeline"?

It is. Nomenclature is hard. What I meant is that in the "timeline" version, we explicitly give each request its own designated "slot" in the timeline. Conceptually, if the disk had max concurrency of 1, that slot would match the time span when the disk is handling that request.
In the "non-timeline" version, we aren't trying to order requests so strictly, we are just trying to dispatch them slowly enough, without caring about "local" overlaps.

in-disk load->latency doesn't scale linearly, so disk may want time equivalent to, say, 1400 tokens

@xemul Is that so? And did we ever observe that? Do we have some past experiments?

In the model of reality in my head, sending several requests immediately should never result in worse disk utilization/speed than spacing them out manually. If that's not true, what's the mechanism which makes the "spaced out" version better? If we send too many requests at once, does it result in some extra delay in the kernel or something?

If the "spaced-out timeline" is important, then I have yet another idea, which combines the "old" and the "new" version — have two separate shared_token_buckets, one for governing fair throughput (this bucket would be replenished by the passage of time, and each shard would grab fixed batches of tokens from it), and one for governing the "spacing out". When a shard was ready to dispatch the request based on the amount of tokens it grabbed from the first bucket, it would make a grab (for the full capacity of the current request) in the second bucket, and it would wait with the actual dispatch to disk until the second bucket head fulfills this grab. (The second bucket would always be replenished with exactly the same amount of tokens as the first bucket). But this is extra complex.

@michoecho michoecho force-pushed the iosched_token_batching branch from d62436f to da60cc0 Compare January 21, 2025 02:29
Add some test cases useful for presenting improvement ideas for the IO scheduler.
3 of 4 tests added in this patch illustrate some goals which weren't met before
but are met after this series. The `tau_nemesis` test illustrates a problem
which is present both before and after this series.
…test_env

fair_queue_test implicitly assumes that `fair_group::clock_type::now()` is always
smaller than `_fg.replenished_ts() + std::chrono::microseconds(1)`.
This ensures that the amount of replenished tokens does not exceed what
the `tick()` calls are supposed to replenish.

If this assumption is violated by the point of some `replenish_capacity()`
call in `tick()`, then `tick()` will not replenish tokens, and instead
tokens will be replenished by the `maybe_replenish_capacity()` call
done by `dispatch_requests()`, which will replenish more than 1us
worth of tokens, breaking the assumptions of the test.

To prevent this, and ensure that the test doesn't rely on timing,
we can initialize `_fg.replenished_ts()` to some point in the future,
which will ensure that `now()` won't catch up to it.
Adds a member variable which tracks the summed capacity of
all requests waiting in the queue.
This is a piece of data which might be valuable to the IO scheduler.
We make use of it in later patches in the series.
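
(The bookkeeping itself is trivial; an illustrative sketch with made-up types, not the actual fair_queue code:)

```cpp
#include <cstdint>
#include <deque>

using capacity_t = uint64_t;

struct entry { capacity_t capacity; };

struct queue_with_total {
    std::deque<entry> _queue;
    capacity_t _queued_capacity = 0; // summed capacity of all waiting requests

    void push(entry e) {
        _queued_capacity += e.capacity;
        _queue.push_back(e);
    }
    entry pop() {
        entry e = _queue.front();
        _queue.pop_front();
        _queued_capacity -= e.capacity;
        return e;
    }
};
```
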
The current design of `fair_group` isn't fair enough to shards.
During contention, the group will be -- approximately -- taking requests
from shards one-by-one, in round robin.

This guarantees that each contender will dispatch an equal *number* of
requests. This is some kind of fairness, but it's not the kind we want,
probably ever.

A better kind of fairness is that under contention, each shard should
be guaranteed `1/nr_shards` of the disk's IOPS and/or `1/nr_shards` of
byte-bandwidth, whichever dimension it pressures more.
This is needed so that each shard can be relied on to sustain a certain
rate of requests -- the lower bound of the slowest shard's throughput
usually dictates the throughput of the entire cluster.

But those two kinds of fairness are only the same if all IO requests have
the same size and direction. Otherwise they can be drastically different.

With the current design it's easy to create a situation where a shard receives
an arbitrarily small fraction of both IOPS and bandwidth, despite being
IO-bound. (Example: a node with X shards, where one shard spams only very small
requests and the other shards spam only big requests).

This is a problem in practice. In ScyllaDB, we observed IO starvation of
some shards during realistic workloads. While such situations require some workload
asymmetry to occur, even small asymmetries can cause serious unfairness.
(For example, a shard which receives 6% more database queries than the other
shards can be starved to less than 50% of its fair share of IOPS and/or
bandwidth -- because each of those 1 kiB queries is "fairly" matched with
16x costlier 128 kiB low-priority batch IO requests on other shards).

To improve this, `fair_group` needs a different queueing discipline.
There are many possible ways, but this patch chooses the one which is relatively
the most similar to the current one. The main idea is that we still rely on the
"approximate round robin" of token queue as the basis for fairness, but we reserve
a fixed-size batch of tokens at a time, rather than a fixed-size (i.e. 1) batch
of _requests_ at a time. This turns the discipline from
approximately-request-fair to approximately-token-fair, which is what we want.

The implementation details are non-trivial, though, and should be carefully reviewed.
@michoecho michoecho force-pushed the iosched_token_batching branch from da60cc0 to 2330929 Compare January 21, 2025 03:20
@michoecho
Contributor Author

in-disk load->latency doesn't scale linearly, so disk may want time equivalent to, say, 1400 tokens

@xemul Is that so? And did we ever observe that? Do we have some past experiments?

In the model of reality in my head, sending several requests immediately should never result in worse disk utilization/speed than spacing them out manually. If that's not true, what's the mechanism which makes the "spaced out" version better? If we send too many requests at once, does it result in some extra delay in the kernel or something?

If the "spaced-out timeline" is important, then I have yet another idea, which combines the "old" and the "new" version — have two separate shared_token_buckets, one for governing fair throughput (this bucket would be replenished by the passage of time, and each shard would grab fixed batches of tokens from it), and one for governing the "spacing out". When a shard was ready to dispatch the request based on the amount of tokens it grabbed from the first bucket, it would make a grab (for the full capacity of the current request) in the second bucket, and it would wait with the actual dispatch to disk until the second bucket head fulfills this grab. (The second bucket would always be replenished with exactly the same amount of tokens as the first bucket). But this is extra complex.

@xemul Ping.

#!/usr/bin/env bash

# Test scenario:
# A single CPU-starved shard has a batch IO job.
Member

It's not really starved (IIUC). It's saturated but the I/O fiber will receive CPU every task quota, which is as much as it can expect.

@michoecho michoecho Jan 21, 2025

"CPU-starved" here means that there is always more work for the CPU to do. The important part is that the shard only polls once per ~500us, not ~1us.

@michoecho (Contributor Author)

I know that "starved" usually has another meaning, but I don't have a better word here. Saturated?

Member

The I/O fiber would be starved if it needed CPU but got it less than once per task quota.

The reactor (not the I/O fiber) is saturated: it doesn't have any cycles to spare, so it isn't able to provide better-than-expected service (CPU every 1usec). But it's not starving the I/O fiber, merely sticking to the expectations.

I'd read "reactor starved" to mean that something outside the reactor is eating the CPU.

shards: [0]
shard_info:
parallelism: 1
execution_time: 550us
Member

Well, almost, since this is above the task quota, but not by much.

@avikivity
Member

While I couldn't follow the code, I think I can follow the reasoning in the detailed patch changelogs and they make sense to me.

@mykaul

mykaul commented Jan 22, 2025

@xemul - can you give this another round of review, and let's see how far off it is from inclusion?

@bhalevy
Member

bhalevy commented Jan 26, 2025

@xemul - can you give this another round of review, and let's see how far off it is from inclusion?

ping @xemul

@xemul
Contributor

xemul commented Jan 27, 2025

@xemul Is that so? And did we ever observe that? Do we have some past experiments?

Sorry. Yes, we did. The whole scheduler math is built around the assumption that the disk must not be overloaded. We have scylladb/diskplorer plots showing that "too many requests" results in larger latency for each. Also, I have metrics from runs against two distinct io-properties.yaml files, one "correct" and another one "bursty" (i.e., with write IOPS set 2x larger than they should be), and the "bursty" one shows higher in-disk latency.

@xemul
Contributor

xemul commented Jan 27, 2025

In the model of reality in my head, sending several requests immediately should never result in worse disk utilization/speed than spacing them out manually. If that's not true, what's the mechanism which makes the "spaced out" version better? If we send too many requests at once, does it result in some extra delay in the kernel or something?

If that were true, we'd be better off not queueing requests in seastar and sending all of them to the disk. You're right that, say, 2 requests in the disk are better than 1, but at the same time something like, say, 200 is worse than 100. The goal is to find the balance point where the increase in concurrency doesn't result in worse latency than we need.

@xemul xemul closed this in 71036eb Jan 27, 2025
@xemul xemul merged commit 71036eb into scylladb:master Jan 27, 2025
15 checks passed