Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelism without Rayon #171

Open
Shnatsel opened this issue Nov 8, 2020 · 6 comments
Open

Parallelism without Rayon #171

Shnatsel opened this issue Nov 8, 2020 · 6 comments

Comments

@Shnatsel
Copy link
Contributor

Shnatsel commented Nov 8, 2020

Rayon introduces a lot of unsafe code to the dependency graph - at about 2000 unsafe expressions according to cargo-geiger, accounting for over 80% of all unsafe code. This undermines the value proposition of the safe Rust decoder: the entire point of it is implementing a decoder in safe code; if you're OK with depending on unsafe code that's probably fine, you should just use libjpeg-turbo.

It would be nice to experiment and see if we can match the performance of Rayon without pulling in all of its dependency tree.

Right now jpeg-decoder uses Rayon for just one line of code:

image.par_chunks_mut(line_size)

It should be fairly trivial to replicate what Rayon does here using manually spawned threads and an MPMC channel to distribute work items to the threads. flume is a 100% safe code implementation of an MPMC channel that's on par with crossbeam in the use cases that are relevant here. I'm not sure if that's the right approach, but it's probably worth a shot.

@Shnatsel
Copy link
Contributor Author

I've taken an experimental stab at this. It seems we'll still need crossbeam-utils for scoped threads, but we could still get rid of crossbeam-deque and rayon dependencies which are ~500 unsafe expressions each.

@Shnatsel
Copy link
Contributor Author

Which might be a very good thing, since crossbeam-deque is pulling some dubiously legal tricks: rayon-rs/rayon#812

and all major single-threaded deque implementations in Rust had serious memory bugs, and crossbeam-deque has even less chance to be correct than the much simpler single-threaded impls...

@Shnatsel
Copy link
Contributor Author

My initial implementation - full of unwrap(), but probably functional - can be found here: https://github.com/Shnatsel/jpeg-decoder/tree/rayonless

So far it's decoding https://commons.wikimedia.org/wiki/File:Sun_over_Lake_Hawea,_New_Zealand.jpg in 332ms as opposed to 308ms with Rayon, but the code is quite naive so far.

Profile on the same image: https://share.firefox.dev/36T6adS

This shows that the worker threads are under-utilized. Dividing the work into larger chunks should do the trick; I believe that's what rayon does as well.

@Shnatsel
Copy link
Contributor Author

Shnatsel commented Nov 20, 2020

The addition of some naive batching has put the custom impl in line with Rayon, and even made it faster by a tiny bit - 305ms vs 308ms with Rayon. That means the approach is viable!

Code: https://github.com/image-rs/jpeg-decoder/compare/master...Shnatsel:rayonless?expand=1

@fintelia
Copy link
Contributor

This is exciting to see! It looks like it will be possible to avoid the rayon dependency without too much added complexity and without incurring a performance regression.

@Shnatsel
Copy link
Contributor Author

I've tried adding some heuristics for batching and choosing the number of CPUs, but so far all it did was regress my benchmarks across the board

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants