
fair_queue: make the fair_group token grabbing discipline more fair #2616

Merged
merged 4 commits into scylladb:master from iosched_token_batching
Jan 27, 2025

Conversation

michoecho
Contributor

@michoecho michoecho commented Jan 14, 2025

The current design of fair_group isn't fair enough to shards.
During contention, the group will be -- approximately -- taking requests
from shards one-by-one, in round robin.

This guarantees that each contender will dispatch an equal number of
requests. This is some kind of fairness, but it's not the kind we want,
probably ever.

A better kind of fairness is that under contention each shard should
be guaranteed 1/nr_shards of the disk's IOPS and/or 1/nr_shards of
byte-bandwidth, whichever dimension it pressures more.
This is needed so that each shard can be relied on to sustain a certain
rate of requests -- the lower bound of the slowest shard's throughput
usually dictates the throughput of the entire cluster.

But those two kinds of fairness are only the same if all IO requests have
the same size and direction. Otherwise they can be drastically different.

With the current design it's easy to create a situation where a shard receives
an arbitrarily small fraction of both IOPS and bandwidth, despite being
IO-bound. (Example: a node with X shards, where one shard spams only very small
requests and the other shards spam only big requests).
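
(An illustrative calculation, assuming 8 shards and the 16x cost ratio between 128 kiB and 1 kiB requests used below: under request-count round robin, each rotation dispatches one request per shard, so the small-request shard's share of the disk's token throughput is 1 / (1 + 16*7) = 1/113 ≈ 0.9%, against a fair share of 1/8 = 12.5%. The gap grows without bound as the shard count or the request-size ratio grows.)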

This is a problem in practice. In ScyllaDB, we observed IO starvation of
some shards during realistic workloads. While such situations require some workload
asymmetry to occur, even small asymmetries can cause serious unfairness.
(For example, a shard which receives 6% more database queries than the other
shards can be starved to less than 50% of its fair share of IOPS and/or
bandwidth -- because each of those surplus 1 kiB queries is "fairly" matched with
16x costlier 128 kiB low-priority batch IO requests on other shards).

To improve this, fair_group needs a different queueing discipline.
There are many possible ways, but this patch chooses the one which is relatively
the most similar to the current one. The main idea is that we still rely on the
"approximate round robin" of the token bucket as the basis for fairness, but we reserve
a fixed-size batch of tokens at a time, rather than a fixed-size (i.e. 1) batch
of requests at a time. This turns the discipline from request-fair to
token-fair, which is what we want.

The implementation details are non-trivial, though, and should be carefully reviewed.
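
(To illustrate the shape of the change only, here is a deliberately minimal sketch with hypothetical names; it is not the actual fair_group/fair_queue code, which also has to deal with pending reservations, replenishment and token recycling:)

```cpp
#include <cstdint>
#include <algorithm>

using capacity_t = uint64_t;

// Shared token source; replenished over time elsewhere.
struct shared_token_bucket {
    capacity_t available = 0;
    capacity_t try_grab(capacity_t want) {
        capacity_t got = std::min(want, available);
        available -= got;
        return got;
    }
};

constexpr capacity_t per_tick_grab_threshold = 20;

// Old discipline (request-fair): each turn a shard asks for exactly the cost of
// its head request, so a shard with cheap requests and a shard with expensive
// requests both dispatch one request per round-robin rotation.
capacity_t request_fair_grab(shared_token_bucket& bucket, capacity_t head_request_cost) {
    return bucket.try_grab(head_request_cost);
}

// New discipline (token-fair): each turn a shard asks for a fixed batch of tokens,
// independent of its request sizes, and dispatches as many queued requests as its
// accumulated tokens can pay for.
capacity_t token_fair_grab(shared_token_bucket& bucket) {
    return bucket.try_grab(per_tick_grab_threshold);
}
```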

This change doesn't result in full fairness. The degree of fairness depends on how often
a given shard polls. A continuously-polling shard should get its fair share of disk bandwidth,
but a rarely-polling shard might only get a fraction of its fair share. But with the current default values
for IO scheduler parameters, a shard should be guaranteed at least 60% of its fair share
in the worst case, which is decent, and better than the current "arbitrarily-low%".

Also note that by "fair share" we mean 1 / nr_shards_in_group, not 1/nr_actively_contending_shards_in_group. If there are 30 shards in a group but only 2 are contending for IO — one polling rarely and one polling continuously — the rarely-polling contender will only get ~1/30 of the disk's bandwidth, not ~1/2. This is not a problem for applications where the interactive workloads are mostly symmetric (because then it's the "normal" case that all shards are contending and none can expect more than 1/30), but it's unsatisfying.

Different scheduling algorithms could improve on both of the above. We have some ideas for that, but they would be a bigger departure from the status quo, and they would be harder to implement well.

@michoecho michoecho requested a review from xemul January 14, 2025 03:01
@michoecho
Contributor Author

Refs (to some degree: fixes) #2615

src/core/fair_queue.cc (review thread on an outdated diff)
// It shouldn't matter in practice.
grab_amount = std::min<capacity_t>(grab_amount, _group.maximum_capacity());
_group.refund_tokens(recycled);
grab_capacity(grab_amount);
@xemul xemul Jan 15, 2025

There's no fair_queue::grab_capacity(capacity_t) method here, only the fair_queue::grab_capacity(const fair_queue_entry&) one, how does it compile?

@michoecho (Contributor Author)

grab_amount is a local variable, though? Not a function.

@xemul (Contributor)

My misprint :( updated the comment

@michoecho (Contributor Author)

Huh, you're right. Apparently it works because there's an implicit conversion from capacity_t to fair_queue_entry:

fair_queue_entry(capacity_t c) noexcept
: _capacity(c) {}

But it does the expected thing, so... whatever, I guess?
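
(For anyone following along, a minimal self-contained illustration of that conversion; this is not the actual Seastar code:)

```cpp
#include <cstdint>

using capacity_t = uint64_t;

struct fair_queue_entry {
    capacity_t _capacity;
    // Non-explicit single-argument constructor: this is what enables the
    // implicit conversion from capacity_t.
    fair_queue_entry(capacity_t c) noexcept : _capacity(c) {}
};

struct fair_queue {
    capacity_t grabbed = 0;
    // Only this overload exists, yet grab_capacity(grab_amount) compiles,
    // because grab_amount converts into a temporary fair_queue_entry.
    void grab_capacity(const fair_queue_entry& ent) { grabbed += ent._capacity; }
};

int main() {
    fair_queue fq;
    capacity_t grab_amount = 42;
    fq.grab_capacity(grab_amount); // implicit conversion happens here
    // With `explicit` on the constructor, this line would fail to compile,
    // forcing something like fq.grab_capacity(fair_queue_entry{grab_amount}).
}
```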

@xemul (Contributor)

Please fix it with the assumption that fair_queue_entry constructor is explicit

(Further review threads on include/seastar/core/fair_queue.hh and src/core/fair_queue.cc were resolved.)
@michoecho michoecho force-pushed the iosched_token_batching branch 2 times, most recently from 264a61f to d62436f Compare January 20, 2025 01:02
@michoecho
Contributor Author

v2:
Functional changes:

  • Removed the maybe_replenish_capacity() call from reap_pending_capacity(). Now maybe_replenish_capacity() is only called from dispatch_requests().
  • Fixed an incompatibility between the patch and fair_queue_test.

Cosmetic changes:
  • Addressed review comments:
    • reap_pending_capacity now returns our_turn_has_come via a return value rather than via an output parameter.
    • The addition of _queued_capacity is now in a separate commit.
    • Several typos were fixed.
  • Removed the rewrite of the doc comment of fair_group; it didn't really belong in this PR.
  • Some misleading comments were rewritten.

@xemul Please re-review.

But I was thinking: is it really necessary to have a "timeline" of dispatches, where each shard only performs the dispatch after the bucket head reaches its _pending.head?

If this isn't necessary, then I see an alternative which is both simpler and has better guarantees. See michoecho@d5a559a. The idea is: each shard has a _balance variable. While _balance is non-negative and there are queued requests, we dispatch a request and subtract its cost from _balance. While _balance is negative, we grab (reserve) a fixed batch of tokens, and add it to _balance after the reservation is fulfilled by the bucket head.

This means that we dispatch first, then we grab tokens to pay for that, and only then we dispatch again. There is no attempt to use the _pending.head as an ideal dispatch timepoint; grabs are only used to limit the throughput. Is there a problem with that?
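
(A rough sketch of one possible reading of that idea, with hypothetical names and without the shared-bucket plumbing, just to make the proposed loop concrete:)

```cpp
#include <cstdint>
#include <deque>

using capacity_t = int64_t; // signed on purpose: _balance is allowed to go negative

struct request { capacity_t cost; };

constexpr capacity_t batch_size = 20; // e.g. per_tick_grab_threshold

struct shard_state {
    std::deque<request> queued;
    capacity_t _balance = 0;           // tokens we may spend right now
    capacity_t reserved_in_flight = 0; // batches grabbed but not yet fulfilled by the bucket head

    // Called on every poll. `fulfilled` is how many previously reserved tokens the
    // bucket head has released since the last poll; `grab_batch` reserves a batch
    // in the shared bucket; `dispatch` sends a request to disk.
    template <typename GrabBatch, typename Dispatch>
    void poll(capacity_t fulfilled, GrabBatch grab_batch, Dispatch dispatch) {
        _balance += fulfilled;          // credit the fulfilled reservations
        reserved_in_flight -= fulfilled;

        // Dispatch first: while the balance is non-negative, send requests and pay
        // for them, possibly driving the balance negative on the last one.
        while (_balance >= 0 && !queued.empty()) {
            request r = queued.front();
            queued.pop_front();
            dispatch(r);
            _balance -= r.cost;
        }
        // Then reserve fixed-size batches until the outstanding reservations cover
        // the debt; grabs only throttle throughput, they are not dispatch timepoints.
        while (_balance + reserved_in_flight < 0) {
            grab_batch(batch_size);
            reserved_in_flight += batch_size;
        }
    }
};
```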

@xemul
Contributor

xemul commented Jan 20, 2025

But I was thinking: is it really necessary to have a "timeline" of dispatches, where each shard only performs the dispatch after the bucket head reaches its _pending.head?

If we want to make shards dispatch independently, then the timeline can be avoided, but presumably I'm misunderstanding what you call "a timeline" (see below)

While _balance is negative, we grab (reserve) a fixed batch of tokens, and add it to _balance after the reservation is fulfilled by bucket head.

This "after the reservation is fulfilled by bucket head", how does it differ from "each shard only performs the dispatch after the bucket head reaches its _pending.head"?

@@ -0,0 +1,39 @@
#!/usr/bin/env bash
Contributor

Please, move all new tests into tests/manual/

@michoecho
Contributor Author

michoecho commented Jan 20, 2025

But I was thinking: is it really necessary to have a "timeline" of dispatches, where each shard only performs the dispatch after the bucket head reaches its _pending.head?

If we want to make shards dispatch independently, then the timeline can be avoided, but presumably I'm misunderstanding what you call "a timeline" (see below)

While _balance is negative, we grab (reserve) a fixed batch of tokens, and add it to _balance after the reservation is fulfilled by bucket head.

This "after the reservation is fulfilled by bucket head", how does it differ from "each shard only performs the dispatch after the bucket head reaches its _pending.head"?

@xemul Imagine that per_tick_grab_threshold is 20 tokens, while all requests cost 100 tokens, and there are 7 shards.
When we are sticking to the "timeline" idea, we want one shard to dispatch at timepoint 100, another at 200, ..., last at 700. And then another round at 800, 900, etc.
If we ignore the "timeline", then we are fine with all 7 shards dispatching together at timepoint 0, then eating tokens (in 5 rotations of round robin) to pay for their requests, and then all dispatching again at ~700.

The main benefit of that is that the total amount of pending reservations has a hard limit at nr_contenders * per_tick_grab_threshold. With the design currently used in this PR, where we put big requests on the "timeline" in their entirety, the total amount of pending reservations can reach nr_contenders * (per_tick_grab_threshold + max_request_size).

The "fairness" of the entire batching idea is that each shard can grab a batch once per "round robin", so the shorter the round robin, the better the fairness. In this case, for our default reactor parameters, the improvement from nr_contenders * (per_tick_grab_threshold + max_request_size) to nr_contenders * per_tick_grab_threshold should result in an improvement in worst-case fairness guarantees from "each shard gets at least ~30% of its fair share" to "each shard gets at least ~60% of its fair share".

@xemul
Contributor

xemul commented Jan 20, 2025

If we ignore the "timeline", then we are fine with all 7 shards dispatching together at timepoint 0, then eating tokens (in 5 rotations of round robin) to pay for their requests, and then all dispatching again at ~700.

OK, but isn't it another "timeline"?

Imagine that per_tick_grab_threshold is 20 tokens, while all requests cost 100 tokens, and there are 7 shards.
When we are sticking to the "timeline" idea, we want one shard to dispatch at timepoint 100, another at 200, ..., last at 700. And then another round at 800, 900, etc.
If we ignore the "timeline", then we are fine with all 7 shards dispatching together at timepoint 0, then eating tokens (in 5 rotations of round robin) to pay for their requests, and then all dispatching again at ~700.

It can work both ways. We overload the disk in the hope that it will handle the load within the upcoming idling time (corresponding to 700 tokens). On one hand, we increase the in-disk concurrency and it can do a better job than serving requests coming one-by-one; but on the other, in-disk load->latency doesn't scale linearly, so the disk may want time equivalent to, say, 1400 tokens, and once we continue dispatching after 700 the disk will still be busy.

@michoecho
Contributor Author

michoecho commented Jan 20, 2025

If we ignore the "timeline", then we are fine with all 7 shards dispatching together at timepoint 0, then eating tokens (in 5 rotations of round robin) to pay for their requests, and then all dispatching again at ~700.

OK, but isn't it another "timeline"?

It is. Nomenclature is hard. What I meant is that in the "timeline" version, we explicitly give each request its own designated "slot" in the timeline. Conceptually, if the disk had max concurrency of 1, that slot would match the time span when the disk is handling that request.
In the "non-timeline" version, we aren't trying to order requests so strictly, we are just trying to dispatch them slowly enough, without caring about "local" overlaps.

in-disk load->latency doesn't scale linearly, so disk may want time equivalent to, say, 1400 tokens

@xemul Is that so? And did we ever observe that? Do we have some past experiments?

In the model of reality in my head, sending several requests immediately should never result in worse disk utilization/speed than spacing them out manually. If that's not true, what's the mechanism which makes the "spaced out" version better? If we send too many requests at once, does it result in some extra delay in the kernel or something?

If the "spaced-out timeline" is important, then I have yet another idea, which combines the "old" and the "new" version — have two separate shared_token_buckets, one for governing fair throughput (this bucket would be replenished by the passage of time, and each shard would grab fixed batches of tokens from it), and one for governing the "spacing out". When a shard was ready to dispatch the request based on the amount of tokens it grabbed from the first bucket, it would make a grab (for the full capacity of the current request) in the second bucket, and it would wait with the actual dispatch to disk until the second bucket head fulfills this grab. (The second bucket would always be replenished with exactly the same amount of tokens as the first bucket). But this is extra complex.

@michoecho michoecho force-pushed the iosched_token_batching branch from d62436f to da60cc0 Compare January 21, 2025 02:29
Add some test cases useful for presenting improvement ideas for the IO scheduler.
3 of 4 tests added in this patch illustrate some goals which weren't met before
but are met after this series. The `tau_nemesis` test illustrates a problem
which is present both before and after this series.
…test_env

fair_queue_test implicitly assumes that `fair_group::clock_type::now()` is always
smaller than `_fg.replenished_ts() + std::chrono::microseconds(1)`.
This ensures that the amount of replenished tokens does not exceed what
the `tick()` calls are supposed to replenish.

If this assumption is violated by the point of some `replenish_capacity()`
call in `tick()`, then `tick()` will not replenish tokens, and instead
tokens will be replenished by the `maybe_replenish_capacity()` call
done by `dispatch_requests()`, which will replenish more than 1us
worth of tokens, breaking the assumptions of the test.

To prevent this, and ensure that the test doesn't rely on timing,
we can initialize `_fg.replenished_ts()` to some point in the future,
which will ensure that `now()` won't catch up to it.
Adds a member variable which tracks the summed capacity of
all requests waiting in the queue.
This is a piece of data which might be valuable to the IO scheduler.
We make use of it in later patches in the series.
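
(The bookkeeping itself is trivial; an illustrative sketch with made-up types, not the actual fair_queue code:)

```cpp
#include <cstdint>
#include <deque>

using capacity_t = uint64_t;

struct entry { capacity_t capacity; };

struct queue_with_total {
    std::deque<entry> _queue;
    capacity_t _queued_capacity = 0; // summed capacity of all waiting requests

    void push(entry e) {
        _queued_capacity += e.capacity;
        _queue.push_back(e);
    }
    entry pop() {
        entry e = _queue.front();
        _queue.pop_front();
        _queued_capacity -= e.capacity;
        return e;
    }
};
```
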
The current design of `fair_group` isn't fair enough to shards.
During contention, the group will be -- approximately -- taking requests
from shards one-by-one, in round robin.

This guarantees that each contender will dispatch an equal *number* of
requests. This is some kind of fairness, but it's not the kind we want,
probably ever.

A better kind of fairness is that under contention, each shard should
be guaranteed `1/nr_shards` of the disk's IOPS and/or `1/nr_shards` of
byte-bandwidth, whichever dimension it pressures more.
This is needed so that each shard can be relied on to sustain a certain
rate of requests -- the lower bound of the slowest shard's throughput
usually dictates the throughput of the entire cluster.

But those two kinds of fairness are only the same if all IO requests have
the same size and direction. Otherwise they can be drastically different.

With the current design it's easy to create a situation where a shard receives
an arbitrarily small fraction of both IOPS and bandwidth, despite being
IO-bound. (Example: a node with X shards, where one shard spams only very small
requests and the other shards spam only big requests).

This is a problem in practice. In ScyllaDB, we observed IO starvation of
some shards during realistic workloads. While such situations require some workload
asymmetry to occur, even small asymmetries can cause serious unfairness.
(For example, a shard which receives 6% more database queries than the other
shards can be starved to less than 50% of its fair share of IOPS and/or
bandwidth -- because each of those 1 kiB queries is "fairly" matched with
16x costlier 128 kiB low-priority batch IO requests on other shards).

To improve this, `fair_group` needs a different queueing discipline.
There are many possible ways, but this patch chooses the one which is relatively
the most similar to the current one. The main idea is that we still rely on the
"approximate round robin" of token queue as the basis for fairness, but we reserve
a fixed-size batch of tokens at a time, rather than a fixed-size (i.e. 1) batch
of _requests_ at a time. This turns the discipline from
approximately-request-fair to approximately-token-fair, which is what we want.

The implementation details are non-trivial, though, and should be carefully reviewed.
@michoecho michoecho force-pushed the iosched_token_batching branch from da60cc0 to 2330929 Compare January 21, 2025 03:20
@michoecho
Contributor Author

in-disk load->latency doesn't scale linearly, so disk may want time equivalent to, say, 1400 tokens

@xemul Is that so? And did we ever observe that? Do we have some past experiments?

In the model of reality in my head, sending several requests immediately should never result in worse disk utilization/speed than spacing them out manually. If that's not true, what's the mechanism which makes the "spaced out" version better? If we send too many requests at once, does it result in some extra delay in the kernel or something?

If the "spaced-out timeline" is important, then I have yet another idea, which combines the "old" and the "new" version — have two separate shared_token_buckets, one for governing fair throughput (this bucket would be replenished by the passage of time, and each shard would grab fixed batches of tokens from it), and one for governing the "spacing out". When a shard was ready to dispatch the request based on the amount of tokens it grabbed from the first bucket, it would make a grab (for the full capacity of the current request) in the second bucket, and it would wait with the actual dispatch to disk until the second bucket head fulfills this grab. (The second bucket would always be replenished with exactly the same amount of tokens as the first bucket). But this is extra complex.

@xemul Ping.

#!/usr/bin/env bash

# Test scenario:
# A single CPU-starved shard has a batch IO job.
Member

It's not really starved (IIUC). It's saturated but the I/O fiber will receive CPU every task quota, which is as much as it can expect.

@michoecho michoecho Jan 21, 2025

"CPU-starved" here means that there is always more work for the CPU to do. The important part is that the shard only polls once per ~500us, not ~1us.

@michoecho (Contributor Author)

I know that "starved" usually has another meaning, but I don't have a better word here. Saturated?

Member

The I/O fiber would be starved if it needed CPU but got it less than once per task quota.

The reactor (not the I/O fiber) is saturated: it doesn't have any cycles to spare, so it isn't able to provide better-than-expected service (CPU every 1usec). But it's not starving the I/O fiber, merely sticking to the expectations.

I'd read "reactor starved" to mean that something outside the reactor is eating the CPU.

shards: [0]
shard_info:
parallelism: 1
execution_time: 550us
Member

Well, almost, since this is above the task quota, but not by much.

@avikivity
Member

While I couldn't follow the code, I think I can follow the reasoning in the detailed patch changelogs and they make sense to me.

@mykaul

mykaul commented Jan 22, 2025

@xemul - can you give this another round of review, and let's see how far off it is from inclusion?

@bhalevy
Member

bhalevy commented Jan 26, 2025

@xemul - can you give this another round of review, and let's see how far off it is from inclusion?

ping @xemul

@xemul
Contributor

xemul commented Jan 27, 2025

@xemul Is that so? And did we ever observe that? Do we have some past experiments?

Sorry. Yes, we did. The whole scheduler math is built around the assumption that the disk must not be overloaded. We have scylladb/diskplorer plots showing that "too many requests" results in larger latency for each. Also, I have metrics from runs against two distinct io-properties.yaml files, one "correct" and another one "bursty" (i.e., with write IOPS set 2x larger than they should be), and the "bursty" one shows higher in-disk latency.

@xemul
Contributor

xemul commented Jan 27, 2025

In the model of reality in my head, sending several requests immediately should never result in worse disk utilization/speed than spacing them out manually. If that's not true, what's the mechanism which makes the "spaced out" version better? If we send too many requests at once, does it result in some extra delay in the kernel or something?

If that were true, we'd be better off not queueing requests in seastar and sending all of them to the disk. You're right that, say, 2 requests in the disk are better than 1, but at the same time something like, say, 200 is worse than 100. The goal is to find the balance point where the increase in concurrency doesn't result in worse latency than we need.

@xemul xemul closed this in 71036eb Jan 27, 2025
@xemul xemul merged commit 71036eb into scylladb:master Jan 27, 2025
15 checks passed