fair_queue: make the fair_group token grabbing discipline more fair #2616
Conversation
Refs (to some degree: fixes) #2615
Force-pushed from 2070713 to ba7df37
// It shouldn't matter in practice.
grab_amount = std::min<capacity_t>(grab_amount, _group.maximum_capacity());
_group.refund_tokens(recycled);
grab_capacity(grab_amount);
There's no `fair_queue::grab_capacity(capacity_t)` method here, only the `fair_queue::grab_capacity(const fair_queue_entry&)` one; how does it compile?
`grab_amount` is a local variable, though? Not a function.
My misprint :( updated the comment
Huh, you're right. Apparently it works because there's an implicit conversion from `capacity_t` to `fair_queue_entry`:

seastar/include/seastar/core/fair_queue.hh, lines 118 to 119 in faf6a5d:

fair_queue_entry(capacity_t c) noexcept
    : _capacity(c) {}
But it does the expected thing, so... whatever, I guess?
Please fix it with the assumption that the `fair_queue_entry` constructor is explicit.
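For illustration only, here is a minimal sketch of how that could look; `capacity_t`'s definition, the simplified class bodies, and the extra `capacity_t` overload are assumptions made for the example, not the actual Seastar code:

```cpp
#include <cstdint>

// Simplified stand-ins for the real fair_queue types (sketch only).
using capacity_t = uint64_t;

struct fair_queue_entry {
    capacity_t _capacity;
    // With `explicit`, the accidental capacity_t -> fair_queue_entry conversion
    // at the grab_capacity(grab_amount) call site no longer compiles.
    explicit fair_queue_entry(capacity_t c) noexcept : _capacity(c) {}
};

struct fair_queue_sketch {
    // Overload resembling the existing entry-based one.
    void grab_capacity(const fair_queue_entry& ent) noexcept {
        grab_capacity(ent._capacity);
    }
    // Hypothetical capacity_t overload, so the dispatch loop can pass a raw
    // token amount without constructing a throwaway entry.
    void grab_capacity(capacity_t cap) noexcept {
        (void)cap; // reserve `cap` tokens from the group here
    }
};
```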
Force-pushed from 264a61f to d62436f
v2:
@xemul Please re-review. But I was thinking: is it really necessary to have a "timeline" of dispatches, where each shard only performs the dispatch after the bucket head reaches its `_pending.head`? If this isn't necessary, then I see an alternative which is both simpler and has better guarantees. See michoecho@d5a559a. The idea is: each shard has a […]. This means that we dispatch first, then we grab tokens to pay for that, and only then we dispatch again. There is no attempt to use the […]
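To make the alternative easier to picture, here is a rough sketch of the "dispatch first, pay the tokens afterwards" idea; all names (`try_grab`, `per_poll_quota`, etc.) are invented for the example, and this is not the code from michoecho@d5a559a:

```cpp
#include <cstdint>

// Hedged sketch of a per-shard "token debt" discipline (hypothetical names).
struct shard_io_state {
    uint64_t debt = 0;   // tokens owed for requests we have already dispatched

    template <typename Group, typename Queue>
    void poll(Group& group, Queue& queue) {
        // First try to pay off the outstanding debt from the shared token bucket.
        debt -= group.try_grab(debt);   // returns how many tokens were granted, <= debt
        if (debt > 0) {
            return;                     // still in debt: no new dispatches on this poll
        }
        // Debt fully paid: dispatch the next batch and record the new debt.
        uint64_t dispatched = 0;
        while (auto* req = queue.peek()) {
            debt += req->capacity();
            dispatched += req->capacity();
            queue.dispatch(req);
            if (dispatched >= group.per_poll_quota()) {
                break;                  // bound how far a single poll can overdraw
            }
        }
    }
};
```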
If we want to make shards dispatch independently, then the timeline can be avoided, but presumably I'm misunderstanding what you call "a timeline" (see below).
This "after the reservation is fulfilled by bucket head", how does it differ from "each shard only performs the dispatch after the bucket head reaches its _pending.head"? |
@@ -0,0 +1,39 @@
#!/usr/bin/env bash
Please, move all new tests into tests/manual/
@xemul Imagine that […]. The main benefit of that is that the total amount of pending reservations has a hard limit at […]. The "fairness" of the entire batching idea is that each shard can grab a batch once per "round robin", so the shorter the round robin, the better the fairness. In this case, for our default reactor parameters, the improvement from […]
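For a back-of-the-envelope version of the round-robin argument (taking the 700-token figure and the 30-shard group that come up elsewhere in this discussion as assumed values):

```cpp
#include <cstdint>

// Illustrative arithmetic only: with N contending shards each grabbing a fixed
// batch of B tokens per turn, one full round robin hands out N * B tokens, so a
// shard that grabs once per round is guaranteed B / (N * B) = 1/N of the tokens,
// regardless of how large its individual requests are. The smaller the batch,
// the shorter the round, and the less a slowly-polling shard falls behind while
// waiting for its next turn.
constexpr unsigned N = 30;                    // shards in the group (assumed)
constexpr uint64_t B = 700;                   // tokens reserved per batch (assumed)
constexpr uint64_t tokens_per_round = N * B;  // 21000 tokens per full round robin
static_assert(tokens_per_round / B == N);     // each contender gets 1/N of a round
```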
OK, but isn't it another "timeline"?
It can work both ways. We overload the disk in the hope that it will handle the load within the upcoming idling time (corresponding to 700 tokens). On one hand, we increase the in-disk concurrency and it can do a better job than serving requests coming one-by-one; but on the other, the in-disk load->latency relation doesn't scale linearly, so the disk may want time equivalent to, say, 1400 tokens, and once we continue dispatching after 700, the disk will still be busy.
It is. Nomenclature is hard. What I meant is that in the "timeline" version, we explicitly give each request its own designated "slot" in the timeline. Conceptually, if the disk had max concurrency of 1, that slot would match the time span when the disk is handling that request.
@xemul Is that so? And did we ever observe that? Do we have some past experiments? In the model of reality in my head, sending several requests immediately should never result in worse disk utilization/speed than spacing them out manually. If that's not true, what's the mechanism which makes the "spaced out" version better? If we send too many requests at once, does it result in some extra delay in the kernel or something? If the "spaced-out timeline" is important, then I have yet another idea, which combines the "old" and the "new" version: have two separate […]
Force-pushed from d62436f to da60cc0
Add some test cases useful for presenting improvement ideas for the IO scheduler. 3 of 4 tests added in this patch illustrate some goals which weren't met before but are met after this series. The `tau_nemesis` test illustrates a problem which is present both before and after this series.
…test_env fair_queue_test implicitly assumes that `fair_group::clock_type::now()` is always smaller than `_fg.replenished_ts() + std::chrono::microseconds(1)`. This ensures that the amount of replenished tokens does not exceed what the `tick()` calls are supposed to replenish. If this assumption is violated by the point of some `replenish_capacity()` call in `tick()`, then `tick()` will not replenish tokens, and instead tokens will be replenished by the `maybe_replenish_capacity()` call done by `dispatch_requests()`, which will replenish more than 1us worth of tokens, breaking the assumptions of the test. To prevent this, and ensure that the test doesn't rely on timing, we can initialize `_fg.replenished_ts()` to some point in the future, which will ensure that `now()` won't catch up to it.
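A minimal illustration of that idea; the setup helper below is hypothetical and only shows the intent, since the real test presumably configures this through the test environment's own interface:

```cpp
#include <chrono>

// Hypothetical sketch: start the group's replenish timestamp well ahead of the
// wall clock, so fair_group::clock_type::now() never catches up to
// replenished_ts() + 1us during the test. Tokens are then replenished only by
// the explicit tick() calls, never by dispatch_requests()'s
// maybe_replenish_capacity(), keeping the test independent of real timing.
auto initial_replenished_ts = fair_group::clock_type::now() + std::chrono::hours(1);
auto fg = make_fair_group_for_test(cfg, initial_replenished_ts);   // hypothetical factory
```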
Adds a member variable which tracks the summed capacity of all requests waiting in the queue. This is a piece of data which might be valuable to the IO scheduler. We make use of it in later patches in the series.
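For illustration, a sketch of what such bookkeeping could look like; the member and hook names are assumptions, not the actual patch:

```cpp
// Hedged sketch: running sum of the capacity of all requests currently waiting
// in the queue, updated when a request is queued and when it leaves the queue.
capacity_t _queued_capacity = 0;

void on_enqueue(const fair_queue_entry& ent) noexcept {
    _queued_capacity += ent.capacity();
}

void on_remove(const fair_queue_entry& ent) noexcept {
    _queued_capacity -= ent.capacity();
}
```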
The current design of `fair_group` isn't fair enough to shards. During contention, the group will be -- approximately -- taking requests from shards one-by-one, in round robin. This guarantees that each contender will dispatch an equal *number* of requests. This is some kind of fairness, but it's not the kind we want, probably ever.

A better kind of fairness is that under contention, each shard should be guaranteed `1/nr_shards` of the disk's IOPS and/or `1/nr_shards` of byte-bandwidth, whichever dimension it pressures more. This is needed so that each shard can be relied on to sustain a certain rate of requests -- the lower bound of the slowest shard's throughput usually dictates the throughput of the entire cluster.

But those two kinds of fairness are only the same if all IO requests have the same size and direction. Otherwise they can be drastically different. With the current design it's easy to create a situation where a shard receives an arbitrarily small fraction of both IOPS and bandwidth, despite being IO-bound. (Example: a node with X shards, where one shard spams only very small requests and the other shards spam only big requests.)

This is a problem in practice. In ScyllaDB, we observed IO starvation of some shards during realistic workloads. While such situations require some workload asymmetry, even small asymmetries can cause serious unfairness. (For example, a shard which receives 6% more database queries than other shards can be starved to less than 50% of its fair share of IOPS and/or bandwidth -- because each of those surplus 1 kiB queries is "fairly" matched with 16x costlier 128 kiB low-priority batch IO requests on other shards.)

To improve this, `fair_group` needs a different queueing discipline. There are many possible ways, but this patch chooses the one which is relatively the most similar to the current one. The main idea is that we still rely on the "approximate round robin" of the token queue as the basis for fairness, but we reserve a fixed-size batch of tokens at a time, rather than a fixed-size (i.e. 1) batch of *requests* at a time. This turns the discipline from approximately-request-fair to approximately-token-fair, which is what we want.

The implementation details are non-trivial, though, and should be carefully reviewed.
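To summarize the new discipline in code form, here is a rough sketch; all names and the batch size are illustrative, and this is not the patch itself. Each shard reserves a fixed-size batch of tokens through the group's round robin and pays dispatched requests out of that batch, so its per-round share of tokens no longer depends on its request sizes:

```cpp
#include <cstdint>

// Hedged sketch of a token-fair grabbing discipline (not the actual patch).
struct token_fair_grabber {
    static constexpr uint64_t batch = 700;   // illustrative fixed batch of tokens
    uint64_t reserved = 0;                   // tokens this shard has reserved but not yet spent

    template <typename Group, typename Queue>
    void dispatch(Group& group, Queue& queue) {
        while (auto* req = queue.peek()) {
            while (req->capacity() > reserved) {
                // Out of tokens: reserve one more fixed-size batch through the
                // group's round robin, independently of request sizes.
                if (!group.try_reserve(batch)) {
                    return;              // our reservation hasn't reached the bucket head yet
                }
                reserved += batch;
            }
            reserved -= req->capacity();
            queue.dispatch(req);
        }
    }
};
```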
Force-pushed from da60cc0 to 2330929
@xemul Ping.
#!/usr/bin/env bash

# Test scenario:
# A single CPU-starved shard has a batch IO job.
It's not really starved (IIUC). It's saturated but the I/O fiber will receive CPU every task quota, which is as much as it can expect.
"CPU-starved" here means that there is always more work for the CPU to do. The important part is that the shard only polls once per ~500us, not ~1us.
I know that "starved" usually has another meaning, but I don't have a better word here. Saturated?
The I/O fiber would be starved if it needed CPU but got it less than once per task quota.
The reactor (not the I/O fiber) is saturated: it doesn't have any cycles to spare, so it isn't able to provide better-than-expected service (CPU every 1usec). But it's not starving the I/O fiber, merely sticking to the expectations.
I'd read "reactor starved" to mean that something outside the reactor is eating the CPU.
shards: [0]
shard_info:
  parallelism: 1
  execution_time: 550us
Well, almost, since this is above the task quota, but not by much.
While I couldn't follow the code, I think I can follow the reasoning in the detailed patch changelogs and they make sense to me.

@xemul - can you give this another round of review, and let's see how far off it is from inclusion?
Sorry. Yes, we did. The whole scheduler math is built around the assumption that the disk cannot be overloaded. We have scylladb/diskplorer plots showing that "too many requests" render larger latency for each. Also, I have metrics from runs against two distinct io-properties.yaml files, one "correct" and another one "bursty" (i.e. with write iops set 2x larger than they should be), and the "bursty" one shows higher in-disk latency.
If that were true, we'd better not queue requests in seastar at all and just send all of them to the disk. You're right that, say, 2 requests in the disk are better than 1, but at the same time something like, say, 200 is worse than 100. The goal is to find the balance point where the increase in concurrency doesn't result in worse latency than we need.
The current design of `fair_group` isn't fair enough to shards. During contention, the group will be -- approximately -- taking requests from shards one-by-one, in round robin. This guarantees that each contender will dispatch an equal number of requests. This is some kind of fairness, but it's not the kind we want, probably ever.

A better kind of fairness is that under contention each shard should be guaranteed `1/nr_shards` of the disk's IOPS and/or `1/nr_shards` of byte-bandwidth, whichever dimension it pressures more. This is needed so that each shard can be relied on to sustain a certain rate of requests -- the lower bound of the slowest shard's throughput usually dictates the throughput of the entire cluster.

But those two kinds of fairness are only the same if all IO requests have the same size and direction. Otherwise they can be drastically different. With the current design it's easy to create a situation where a shard receives an arbitrarily small fraction of both IOPS and bandwidth, despite being IO-bound. (Example: a node with X shards, where one shard spams only very small requests and the other shards spam only big requests.)

This is a problem in practice. In ScyllaDB, we observed IO starvation of some shards during realistic workloads. While such situations require some workload asymmetry, even small asymmetries can cause serious unfairness. (For example, a shard which receives 6% more database queries than other shards can be starved to less than 50% of its fair share of IOPS and/or bandwidth -- because each of those surplus 1 kiB queries is "fairly" matched with 16x costlier 128 kiB low-priority batch IO requests on other shards.)

To improve this, `fair_group` needs a different queueing discipline. There are many possible ways, but this patch chooses the one which is relatively the most similar to the current one. The main idea is that we still rely on the "approximate round robin" of the token bucket as the basis for fairness, but we reserve a fixed-size batch of tokens at a time, rather than a fixed-size (i.e. 1) batch of requests at a time. This turns the discipline from request-fair to token-fair, which is what we want.

The implementation details are non-trivial, though, and should be carefully reviewed.
This change doesn't result in full fairness. The degree of fairness depends on how often a given shard polls. A continuously-polling shard should get its fair share of disk bandwidth, but a rarely-polling shard might only get a fraction of its fair share. But with the current default values of the IO scheduler parameters, a shard should be guaranteed at least 60% of its fair share in the worst case, which is decent, and better than the current "arbitrarily-low%".

Also note that by "fair share" we mean 1/`nr_shards_in_group`, not 1/`nr_actively_contending_shards_in_group`. If there are 30 shards in a group but only 2 are contending for IO (one polling rarely and one polling continuously), the rarely-polling contender will only get ~1/30 of the disk's bandwidth, not ~1/2. This is not a problem for applications where the interactive workloads are mostly symmetric (because then it's the "normal" case that all shards are contending and none can expect more than 1/30), but it's unsatisfying.

Different scheduling algorithms could improve on both of the above. We have some ideas for that. But they would be a bigger departure from the status quo, and harder to implement well.