Rayon slower than single threaded performance when mixing iter and par_iter #1136
Generally speaking, parallelization does have a non-negligible overhead, and hence whether it actually results in speed-ups depends on the amount of work done per item. In nested loops, it is often the case that not parallelizing all levels is optimal. From past experience, I would recommend starting with parallelizing the outermost loop and then moving inwards, i.e. starting with the largest work items and repeatedly reducing their size while increasing their count, as long as that pays off. In your case, I see a speed-up from parallelizing the two outer levels, but it is minimal compared to just the outermost level, which I would probably stick to here:

```
> hyperfine ./serial ./parallel-outermost ./parallel-two-levels
Benchmark 1: ./serial
Time (mean ± σ): 2.6 ms ± 0.2 ms [User: 1.7 ms, System: 1.3 ms]
Range (min … max): 2.3 ms … 3.6 ms 773 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: ./parallel-outermost
Time (mean ± σ): 1.4 ms ± 0.1 ms [User: 3.1 ms, System: 4.6 ms]
Range (min … max): 1.2 ms … 2.1 ms 919 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Benchmark 3: ./parallel-two-levels
Time (mean ± σ): 1.3 ms ± 0.1 ms [User: 3.2 ms, System: 4.3 ms]
Range (min … max): 1.1 ms … 1.9 ms 1093 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Summary
./parallel-two-levels ran
1.01 ± 0.10 times faster than ./parallel-outermost
1.91 ± 0.19 times faster than ./serial
```

i.e.

```rust
use std::hint::black_box;
use rayon::prelude::*;

fn main() {
    let result = (0..512)
        .into_par_iter()
        .map(|x| {
            (0..512)
                .map(|y| (0..16).map(|z| black_box(x * y * z)).sum::<usize>())
                .sum::<usize>()
        })
        .sum::<usize>();
    black_box(result);
}
```

Finally, maybe you could expand on what you would consider a helpful change here? Did you find the documentation lacking somewhere?

P.S.: As an aside, as hyperfine warns, it would probably be nicer to time multiple iterations of the above, as the runtime of a single one is below 5 ms.
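
For reference, a sketch of what the `./parallel-two-levels` variant from the benchmark above presumably looks like (only the outermost-level listing is shown here):

```rust
use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    // Both outer levels parallelized; the tiny innermost loop stays sequential.
    let result = (0..512usize)
        .into_par_iter()
        .map(|x| {
            (0..512usize)
                .into_par_iter()
                .map(|y| (0..16).map(|z| black_box(x * y * z)).sum::<usize>())
                .sum::<usize>()
        })
        .sum::<usize>();
    black_box(result);
}
```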
My point is that this slowdown is surprising in the first place. Imagine that the inner scope is within another function used in a lot of places in a code-base, initially sequential, but later made parallel. As you slowly start introducing rayon into the code-base, you might see big performance improvements in some places, while unexpectedly also getting 100x slowdowns in other parts of your code.
The problem is that the work you're starting in parallel is tiny, not that there are iterators on the outside:

```rust
(0..16).into_par_iter().map(|z| black_box(x*y*z)).sum::<usize>()
```

Even when forced to a single thread, it's still sending that to the thread pool and waiting for the result, and I'm not surprised at all if that thread synchronization is more costly than the actual work.
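
One way to make that synchronization cost visible in isolation (a sketch; the iteration counts are arbitrary) is to time the tiny sum with and without the thread pool:

```rust
use rayon::prelude::*;
use std::hint::black_box;
use std::time::Instant;

fn main() {
    let (x, y) = (2usize, 3usize);

    // Sequential: 16 multiplications, essentially free.
    let t0 = Instant::now();
    for _ in 0..10_000 {
        black_box((0..16).map(|z| black_box(x * y * z)).sum::<usize>());
    }
    println!("sequential: {:?}", t0.elapsed());

    // Parallel: every call hands the job to the thread pool and blocks
    // on the result; that round trip dwarfs the 16 multiplications.
    let t0 = Instant::now();
    for _ in 0..10_000 {
        black_box(
            (0..16)
                .into_par_iter()
                .map(|z| black_box(x * y * z))
                .sum::<usize>(),
        );
    }
    println!("parallel:   {:?}", t0.elapsed());
}
```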
Yes, I intentionally crafted the example to have very little compute to make the overhead as visible as possible. Why is it that parallel iterators in the outer scopes remove the overhead of the par_iter in the inner scope?
I guess the outer parallel iterators produce tasks which keep the thread pool occupied and prevent the inner tiny iterators from being split at all, hence getting closer to the sequential version. In your example of some utility function encapsulating some parallel processing, you would probably need to avoid excessive splitting using something like `with_min_len`.
Also, the inner iterators will hit the fast path of already being within the pool, so they can start executing directly.
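
That fast path can be made visible with `rayon::current_thread_index()`, which returns `Some(_)` only when called from a pool thread (a sketch):

```rust
use rayon::prelude::*;

fn main() {
    // Outside the pool: not a worker thread, so no index.
    assert_eq!(rayon::current_thread_index(), None);

    (0..4).into_par_iter().for_each(|i| {
        // Inside a parallel iterator we are already on a pool thread,
        // so a nested par_iter here can start executing directly.
        println!("item {i} on worker {:?}", rayon::current_thread_index());
    });
}
```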
Oh, I didn't know about `with_min_len`!

```rust
use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let result = (0..512)
        .map(|x| (0..512).map(|y| my_sum_function(x, y, 16)).sum::<usize>())
        .sum::<usize>();
    println!("elapsed: {} us", t0.elapsed().as_micros());
    println!("result: {result}");
}

fn my_sum_function(x: usize, y: usize, n: usize) -> usize {
    (0..n)
        .into_par_iter()
        .with_min_len(512)
        .map(|z| black_box(x * y * z))
        .sum::<usize>()
}
```

Timings (notice I'm using microseconds, not milliseconds, here):

```
$ cargo run --release
elapsed: 6562 us
result: 2053539102720
```

That said, it seems like the following still has less overhead:

```rust
use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let result = (0..512)
        .map(|x| (0..512).map(|y| my_sum_function(x, y, 16)).sum::<usize>())
        .sum::<usize>();
    println!("elapsed: {} us", t0.elapsed().as_micros());
    println!("result: {result}");
}

fn my_sum_function(x: usize, y: usize, n: usize) -> usize {
    if n < 512 {
        (0..n).map(|z| black_box(x * y * z)).sum::<usize>()
    } else {
        (0..n).into_par_iter().map(|z| black_box(x * y * z)).sum::<usize>()
    }
}
```

Timings:

```
$ cargo run --release
elapsed: 1890 us
result: 2053539102720
```
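
A plausible reading, given the earlier point about synchronization costs: `with_min_len` only stops the job from being split further, while the manual size check keeps small inputs from being handed to the thread pool at all. A sketch of the distinction:

```rust
use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let (x, y) = (2usize, 3usize);

    // with_min_len(512): the 16 items are never split, but the job is
    // still dispatched to the thread pool and the caller blocks on it.
    let a = (0..16)
        .into_par_iter()
        .with_min_len(512)
        .map(|z| black_box(x * y * z))
        .sum::<usize>();

    // Manual branch for small n: never touches the pool, so the only
    // cost is the loop itself.
    let b = (0..16).map(|z| black_box(x * y * z)).sum::<usize>();

    assert_eq!(a, b);
}
```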
This seems to be right, when observing the following code (where we simply wrap the entire computation in `rayon::join`):

```rust
use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let (result, _) = rayon::join(
        || {
            (0..512)
                .map(|x| {
                    (0..512)
                        .map(|y| {
                            (0..16)
                                .into_par_iter()
                                .map(|z| black_box(x * y * z))
                                .sum::<usize>()
                        })
                        .sum::<usize>()
                })
                .sum::<usize>()
        },
        || {},
    );
    println!("elapsed: {} ms", t0.elapsed().as_millis());
    println!("result: {result}");
}
```

And we can see that here:

```
$ cargo run --release
elapsed: 1764 ms
result: 2053539102720

$ RAYON_NUM_THREADS=1 cargo run --release
elapsed: 13 ms
result: 2053539102720
```

I'm guessing the reason the multi-threaded example is still somewhat slow is because the work is moved around in the thread pool a lot.
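
For completeness, the `RAYON_NUM_THREADS=1` run can also be reproduced programmatically with a dedicated one-thread pool (a sketch using `ThreadPoolBuilder`):

```rust
use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    // A one-thread pool, equivalent in effect to RAYON_NUM_THREADS=1,
    // but scoped to this pool rather than set process-wide.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(1)
        .build()
        .unwrap();

    let t0 = std::time::Instant::now();
    let result: usize = pool.install(|| {
        (0..512usize)
            .map(|x| {
                (0..512usize)
                    .map(|y| {
                        (0..16usize)
                            .into_par_iter()
                            .map(|z| black_box(x * y * z))
                            .sum::<usize>()
                    })
                    .sum::<usize>()
            })
            .sum::<usize>()
    });
    println!("elapsed: {} ms", t0.elapsed().as_millis());
    println!("result: {result}");
}
```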
It's probably comparable to choosing between dispatching a bulldozer and using a scoop: of course, the scoop is much faster for tiny tasks.
Consider the following "mixed iterator" example:
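
(A sketch of the example, matching the snippet quoted in the replies above: two sequential outer loops around a tiny `into_par_iter` innermost loop.)

```rust
use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    // Two sequential outer loops around a tiny parallel innermost loop.
    let result = (0..512)
        .map(|x| {
            (0..512)
                .map(|y| {
                    (0..16)
                        .into_par_iter()
                        .map(|z| black_box(x * y * z))
                        .sum::<usize>()
                })
                .sum::<usize>()
        })
        .sum::<usize>();
    println!("elapsed: {} ms", t0.elapsed().as_millis());
    println!("result: {result}");
}
```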
Timings:

```
$ cargo run --release
elapsed: 4173 ms
result: 2053539102720

# Notice how this is significantly faster than the multithreaded example above:
$ RAYON_NUM_THREADS=1 cargo run --release
elapsed: 1275 ms
result: 2053539102720
```
Using perf on Linux:
into_par_iter on all 3 nesting levels (400x faster)
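
(A sketch, assuming all three levels switch to `into_par_iter`:)

```rust
use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let result = (0..512)
        .into_par_iter()
        .map(|x| {
            (0..512)
                .into_par_iter()
                .map(|y| {
                    (0..16)
                        .into_par_iter()
                        .map(|z| black_box(x * y * z))
                        .sum::<usize>()
                })
                .sum::<usize>()
        })
        .sum::<usize>();
    println!("elapsed: {} ms", t0.elapsed().as_millis());
    println!("result: {result}");
}
```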
Timings:
Using only serial iterators
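
(A sketch of the fully sequential version:)

```rust
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let result = (0..512usize)
        .map(|x| {
            (0..512usize)
                .map(|y| (0..16).map(|z| black_box(x * y * z)).sum::<usize>())
                .sum::<usize>()
        })
        .sum::<usize>();
    println!("elapsed: {} ms", t0.elapsed().as_millis());
    println!("result: {result}");
}
```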
Timings:

This turns out to actually be the fastest in this case, so using rayon at all was not advantageous for this particular problem.