Spawn several threads when we fail to enqueue work in the blocki… #181
Conversation
This implementation will be ~2x slower (assuming 4 cores) due to thread congestion and batch allocation under pressure in every scenario (long bursts, short bursts, no bursts, long-running tasks, and so on), which is not desirable for most systems, including embedded systems if this library is also targeted at that domain in the future. Missing parts: this PR doesn't solve the system boundary problems, which are tested and fixed in #108.
I've run the benchmarks; here are the results for this implementation:
@vertexclique with my measurements it behaves the same as #108, and sometimes with less jitter, because it's not dependent on the single thread that only adjusts every 200ms. The main thing #108 improves over master is that it spawns more than one thread at a time. However, as we discussed, #108 is more complex than necessary, and in any case would require significant clean-up before it reaches a mergeable state.

Blocking 10k bench, measuring the spawn latency distribution in microseconds per spawn, on a 2018 i7 MacBook Pro with no foreground processes running other than tmux:

master:
108:
this:
In general with performance-related code, please test using a proper histogram library like historian to get a sense of tail latency for actual requests; I think it will make conversations around performance more productive. Statements like "this is 2x slower" are not specific or actionable. From my measurements, I'm getting roughly the same performance, with a small fraction of the additional code and almost none of the math complexity or bespoke thread operational considerations.
Curious to see your results on more hardware, here's the general recipe:

```diff
diff --git a/Cargo.toml b/Cargo.toml
index ca0f3fc..f9f40b3 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -25,6 +25,7 @@ unstable = []
[dependencies]
async-task = "1.0.0"
+historian = "*"
cfg-if = "0.1.9"
crossbeam-channel = "0.3.9"
futures-core-preview = "0.3.0-alpha.18"
diff --git a/src/task/blocking.rs b/src/task/blocking.rs
index cb2b891..6cc5a3b 100644
--- a/src/task/blocking.rs
+++ b/src/task/blocking.rs
@@ -45,6 +45,7 @@ lazy_static! {
let (sender, receiver) = bounded(0);
Pool { sender, receiver }
};
+ pub static ref SPAWN_TIME: historian::Histo = historian::Histo::default();
}
// Create up to MAX_THREADS dynamic blocking task worker threads.
@@ -108,7 +109,9 @@ where
R: Send + 'static,
{
let (task, handle) = async_task::spawn(future, schedule, ());
+ let now = std::time::Instant::now();
task.schedule();
+ SPAWN_TIME.measure(now.elapsed().as_micros() as f64);
JoinHandle(handle)
}
diff --git a/src/task/mod.rs b/src/task/mod.rs
index eef7284..2c5d8db 100644
--- a/src/task/mod.rs
+++ b/src/task/mod.rs
@@ -36,4 +36,4 @@ mod pool;
mod sleep;
mod task;
-pub(crate) mod blocking;
+pub mod blocking;
```
oh, actual bench code (`src/bin/blocking_histos.rs`):

```rust
use async_std::task;
use std::thread;
use std::time::Duration;

fn main() {
    let handles = (0..10_000)
        .map(|_| {
            task::blocking::spawn(async {
                let duration = Duration::from_millis(1);
                thread::sleep(duration);
            })
        })
        .collect::<Vec<_>>();

    task::blocking::SPAWN_TIME.print_percentiles();
}
```
I think what you are measuring is the spawn time onto the pool. That will always be approximately the same in every implementation. This histogram will also give the exact same numbers in every implementation that spawns onto the thread pool in the error case; the only difference will come from how many threads you spawn dynamically, and that only shifts the percentile values toward the 0th percentile. Comparing that time to the EMA PR is comparing apples to nothing, because the spawn doesn't happen at that point there.

Total job in/out time is one thing; job-in rate overhead is another. You are measuring the latter. Since the EMA PR wasn't a good fit (and I don't think it was read thoroughly), I am going to close it. It is up to you what you want to use.
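(For contrast with the job-in measurement above, here is a minimal sketch of what measuring total job in/out time could look like: the same bench, but blocking on every handle so the clock covers completion. Draining the handles with `task::block_on` is an assumption for illustration, not code from either PR.)

```rust
use async_std::task;
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    let start = Instant::now();
    let handles = (0..10_000)
        .map(|_| {
            task::blocking::spawn(async {
                thread::sleep(Duration::from_millis(1));
            })
        })
        .collect::<Vec<_>>();

    // Wait for every task to finish so the measurement covers completion,
    // not just enqueueing onto the pool.
    for handle in handles {
        task::block_on(handle);
    }
    println!("total job in/out time: {:?}", start.elapsed());
}
```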
Has the work on #108, the exponential moving average thread pool design, been closed and rejected? I'm not seeing an implementation of it here in this PR.
@naftulikay this PR does not use an exponential moving average; it just spins up more than one thread at a time to get pretty much the same effect with far less complexity.
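(For readers following along, a self-contained toy sketch of that strategy as described in this thread: hand work to an idle worker over a zero-capacity channel, and when nobody takes it, spawn a small batch of threads at once. All names and the `BATCH` constant are illustrative, not the PR's actual `blocking.rs`.)

```rust
use crossbeam_channel::{bounded, Receiver, Sender};
use std::thread;
use std::time::Duration;

type Job = Box<dyn FnOnce() + Send + 'static>;

// Illustrative batch size; a real pool would also cap total threads.
const BATCH: usize = 4;

fn worker(recv: Receiver<Job>) {
    // Each worker runs jobs until the channel disconnects.
    while let Ok(job) = recv.recv() {
        job();
    }
}

fn schedule(job: Job, send: &Sender<Job>, recv: &Receiver<Job>) {
    // On a zero-capacity channel, try_send succeeds only if a worker is
    // already parked in recv(), i.e. idle.
    if let Err(err) = send.try_send(job) {
        // No idle worker: spawn several threads at once so bursts are
        // absorbed quickly, then hand the job over again.
        for _ in 0..BATCH {
            let recv = recv.clone();
            thread::spawn(move || worker(recv));
        }
        send.send(err.into_inner()).unwrap();
    }
}

fn main() {
    let (send, recv) = bounded::<Job>(0);
    for i in 0..16 {
        schedule(Box::new(move || println!("job {}", i)), &send, &recv);
    }
    // Give the detached workers a moment to drain before exiting.
    thread::sleep(Duration::from_millis(100));
}
```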
Is there a benchmark comparing this with EMA somewhere?
@gnzlbg Yes. The EMA branch contains a benchmark comparing the total execution time for any implementation of the blocking pool; you can run it with this branch. I ran it and wrote the results in the second comment above. The EMA benchmark results are in that PR's description.
Force-pushed from 726ad2c to ee50156.
Rebased onto master. Merge if this passes CI :)
Switched to unbounded channels because that improves our benchmark numbers here by an order of magnitude: https://github.com/jebrosen/async-file-benchmark

Reading a 256K file:
This makes spin-up on the blocking threadpool far more efficient when encountering bursts.
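(For reference, a minimal sketch of what that change looks like in terms of the `Pool` shown in the diff above, assuming only the channel constructor changes; the surrounding spawn logic is elided.)

```rust
use crossbeam_channel::{unbounded, Receiver, Sender};

struct Pool<T> {
    sender: Sender<T>,
    receiver: Receiver<T>,
}

fn make_pool<T>() -> Pool<T> {
    // Before: let (sender, receiver) = bounded(0);
    // A zero-capacity channel forces every send to rendezvous with an
    // idle worker, so producers stall while new threads spin up. An
    // unbounded channel lets bursts queue immediately instead.
    let (sender, receiver) = unbounded();
    Pool { sender, receiver }
}

fn main() {
    let pool: Pool<u32> = make_pool();
    pool.sender.send(42).unwrap();
    assert_eq!(pool.receiver.recv().unwrap(), 42);
}
```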