jpeg-decoder is slower than libjpeg-turbo #155

Shnatsel · 2020-06-20T16:39:18Z

jpeg_decoder::decoder::Decoder::decode_internal seems to take 50% of the decoding time, or over 75% if using Rayon because this part is not parallelized. This part alone takes more time than libjpeg-turbo takes to decode the entire image.

It appears that jpeg-decoder reads one byte at a time from the input stream and executes some complex logic for every byte, e.g. in HuffmanDecoder::read_bits and a number of other functions called from decode_internal. I suspect performing a single large read (a few Kb in size), then using something that lowers to memchr calls to find marker boundaries would be much faster.

Profiled using this file: https://commons.wikimedia.org/wiki/File:Sun_over_Lake_Hawea,_New_Zealand.jpg via image crate, jpeg-decoder v0.1.19

Single-treaded profile: https://share.firefox.dev/30ZTmks
Parallel profile: https://share.firefox.dev/3dqzE49

The text was updated successfully, but these errors were encountered:

lovasoa · 2020-06-20T17:45:04Z

did you use a BufReader for this test?

Shnatsel · 2020-06-20T17:52:30Z

Yes. Here's the code used for testing:

fn main() -> std::io::Result<()> {
    let path = std::env::args().skip(1).next().unwrap();
    let _ = image::io::Reader::open(path)?
        .with_guessed_format()
        .unwrap()
        .decode()
        .unwrap();
    Ok(())
}

image::io::Reader::open does require BufRead: https://github.com/image-rs/image/blob/0b21ce8bc8d0b697964820e649fd40127ef404fa/src/io/reader.rs#L124

Shnatsel · 2020-06-24T10:45:52Z

Initial experiments with buffering are available in the buffered-reads branch but do not demonstrate significantly better results so far.

Shnatsel · 2020-06-24T19:07:06Z

jpeg_decoder::huffman::HuffmanDecoder::read_bits accounts for 23% of all time spent, does byte-by-byte reads and spends most of its time calling std::io::Read::read_exact. Plus has additional complex logic because of its inability to return a byte it has already read to the reader. So that's probably where buffered reads would actually make a difference.

thomcc · 2020-06-26T16:42:30Z

Came across the link to this in zulip, but... For what it's worth there's a very good series on how to do bitwise io performantly in compressors on Fabien Giesen's blog, if you haven't seen it before:

Sorry if this is old news.

Shnatsel · 2020-10-17T20:54:14Z

I've done some more profiling and tinkering, and I believe my earlier assumptions are incorrect. In parallel mode most of the time is spent in jpeg_decoder::idct::dequantize_and_idct_block_8x8_inner. Here's a finer-grained profile to back that up.

I've also verified this experimentally by speeding up that function and seeing it reflected in end-to-end performance gain.

This is really good news because the function is self-contained and takes up 75% of the end-to-end execution time, so any optimizations we can make to it will translate to large gains in end-to-end decoding performance. The function can be found here.

lovasoa · 2020-10-18T08:03:30Z

see my pull request that uses simd for this function: #146

Shnatsel · 2020-10-18T11:36:01Z

After looking at it some more I don't think we can do much here without parallelization and/or SIMD, since the IDCT algorithm appears to be identical to the fallback one in libjpeg-turbo (which normally uses hand-written assembly with SIMD instructions).

Shnatsel · 2020-10-22T01:34:14Z

After looking at IDCT some more, particularly the threaded worker, there's really no reason why it cannot be made multi-threaded by component. They are already decoded independently and 95% of the infrastructure is already there. https://github.com/image-rs/jpeg-decoder/blob/master/src/worker/threaded.rs already does most of the heavy lifting, but doesn't split the image by component. This should be a nearly flat 3x speedup except for grayscale images.

Shnatsel · 2020-10-22T03:18:59Z

I've opened #168 for parallelizing IDCT. We can combine it with SIMD later to hopefully outperform libjpeg-turbo in the future.

Sadly it doesn't do all that much for performance because we get bottlenecked by the reader thread instead, as described in the original post. Most of the time is now spent in jpeg_decoder::decoder::decode_block.

~~It's time to dust off those BufReader optimizations that didn't seem to do anything!~~ Nope, the branch buffered-reads still makes no difference. It's slightly worse, if anything.

Profile after IDCT parallelization

lovasoa · 2020-10-22T11:51:22Z

Is that the profile for a release build ? It contains function calls for things like core::num::wrapping::::sub, that I would have expected to be inlined in a production build.

Shnatsel · 2020-10-22T12:15:07Z

They're inlined! perf is just that good. I'm using this in Cargo.toml:

[profile.release]
debug=true

and profiling with perf record --call-graph=dwarf so that it uses debug info to see into inlined functions.

willcrichton · 2021-02-25T16:47:40Z

Just another data point. I'm using jpeg-decoder via the image crate in a WASM project. I've noticed that loading JPEGs is very slow, roughly 200ms to decode a 2048 x 2048 image. Here's a screenshot of the Chrome profile of a single load, along with the most common functions calls at the bottom.

It seems like most of the time is spent in color_convert_line_ycbcr. I don't see that mentioned on the thread, so a different kind of bottleneck for WASM perhaps?

lovasoa · 2021-02-25T16:58:28Z

In what situation would you want to decode a JPEG in wasm ? You would have to ship a large wasm jpeg decoder to your users, that is always going to run slower than the native jpeg decoder in their browser. If you have a project that handles images in wasm, I would suggest handling the image loading and decoding with native browser APIs, and passing only a UInt8Array containing the pixels to your wasm.

willcrichton · 2021-02-25T17:24:05Z

@lovasoa yes I could implement all that. It's just significantly more convenient to use image, since it works cross-platform and my app also targets native. If the JPEG decoder were fast enough then I wouldn't bother with platform-specific code.

HeroicKatora · 2021-02-25T17:34:39Z

@willcrichton This would be a more useful data point if you submited traces, not screenshots. Spending 30% of time in memset and memcpy is surely not optimal either and anyone debugging would surely want to know where in the callgraph they occur.

willcrichton · 2021-02-25T18:28:46Z

Sure thing, here's the trace. wasm-jpeg-decoder.json.zip

Shnatsel · 2021-02-25T18:47:30Z

I'm afraid that JPEG decoding will always be significantly slower in WASM than it is in native code. It's very computationally expensive and relies on SIMD and/or parallelization to perform well, and WASM allows neither.

willcrichton · 2021-02-28T19:10:46Z

For the record, I implemented a web image loader: https://github.com/willcrichton/learn-opengl-rust/blob/88c0282be6bc855dd52d61e5395c3fa1df2c3fc4/src/io.rs#L54-L107

I haven't done a rigorous benchmark, but based on my observations from the traces:

Overall load times improved ~50%.
Time spent in the decoder went from max 1000ms per image to 200ms per image.
In the web loader, after decoding an image, I spend about 150ms in getImageData.
Then there's a mysterious ~50-100ms of work done by the GPU?
So a whole decode task takes ~400ms max.

Traces for the interested. traces.zip

lovasoa · 2021-02-28T19:18:45Z

@willcrichton : 😎 cool, this looks very useful, you should publish this as a small crate on crates.io ! One small remark: maybe I read too quickly, but it looks like you are waiting for the image to have fully loaded to start creating your canvas and creating a context. So your CPU will idle while the image is being downloaded, then it will be busy exclusively decoding the image (probably on a single core), then creating the canvas.

edit : Here is a small demo: http://jsbin.com/xunatebovu/edit

Shnatsel · 2022-05-13T22:48:00Z

As of version 0.2.6, on a 6200x8200 CMYK image, jpeg-decoder is actually faster than libjpeg-turbo on my 4-core machine!

Without the rayon feature it's 700ms for jpeg-decoder vs 800ms for libjpeg-turbo. And according to perf it's only utilizing 1.38 CPU cores, not all 4, so similar gains should be seen on dual-core machines as well.

The rayon feature is not currently usable due to #245, but once it is fixed I expect the decoding time to drop to 600ms.

Even without parallelism jpeg-decoder is within striking distance of libjpeg-turbo: 850ms as opposed to 800ms.

Shnatsel · 2022-05-14T15:01:32Z

Oops. I fear the celebration has been premature.

Now that I've tested it on a selection of photos, it appears that jpeg-decoder is still considerably slower than libjpeg-turbo even with parallelism: it takes 6 seconds to decode a corpus of photos with libjpeg-turbo and 10 seconds with jpeg-decoder. (measuring without rayon so far because of #245).

Huffman decoding continues bottlenecking decoding. In fact, on 3000x4000 photos Huffman decoding alone takes about as much time as libjpeg-turbo's entire decoding process.

Shnatsel mentioned this issue Jun 21, 2020

Be careful! image-rs is so slow nicolashahn/diffimg-rs#4

Closed

Shnatsel mentioned this issue Jun 24, 2020

Faster color conversions #157

Merged

Shnatsel changed the title ~~Image reading and marker extraction is slow~~ Decoder::decode_internal is slow Jun 25, 2020

Shnatsel changed the title ~~Decoder::decode_internal is slow~~ jpeg-decoder is slower than libjpeg-turbo Oct 17, 2020

HeroicKatora mentioned this issue Oct 23, 2021

Speed is half of normal jpeg decoders (on ARM) #202

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

jpeg-decoder is slower than libjpeg-turbo #155

jpeg-decoder is slower than libjpeg-turbo #155

Shnatsel commented Jun 20, 2020 •

edited

Loading

lovasoa commented Jun 20, 2020

Shnatsel commented Jun 20, 2020 •

edited

Loading

Shnatsel commented Jun 24, 2020

Shnatsel commented Jun 24, 2020

thomcc commented Jun 26, 2020

Shnatsel commented Oct 17, 2020

lovasoa commented Oct 18, 2020

Shnatsel commented Oct 18, 2020 •

edited

Loading

Shnatsel commented Oct 22, 2020

Shnatsel commented Oct 22, 2020 •

edited

Loading

lovasoa commented Oct 22, 2020

Shnatsel commented Oct 22, 2020

willcrichton commented Feb 25, 2021

lovasoa commented Feb 25, 2021

willcrichton commented Feb 25, 2021

HeroicKatora commented Feb 25, 2021

willcrichton commented Feb 25, 2021

Shnatsel commented Feb 25, 2021

willcrichton commented Feb 28, 2021 •

edited

Loading

lovasoa commented Feb 28, 2021 •

edited

Loading

Shnatsel commented May 13, 2022

Shnatsel commented May 14, 2022

jpeg-decoder is slower than libjpeg-turbo #155

jpeg-decoder is slower than libjpeg-turbo #155

Comments

Shnatsel commented Jun 20, 2020 • edited Loading

lovasoa commented Jun 20, 2020

Shnatsel commented Jun 20, 2020 • edited Loading

Shnatsel commented Jun 24, 2020

Shnatsel commented Jun 24, 2020

thomcc commented Jun 26, 2020

Shnatsel commented Oct 17, 2020

lovasoa commented Oct 18, 2020

Shnatsel commented Oct 18, 2020 • edited Loading

Shnatsel commented Oct 22, 2020

Shnatsel commented Oct 22, 2020 • edited Loading

lovasoa commented Oct 22, 2020

Shnatsel commented Oct 22, 2020

willcrichton commented Feb 25, 2021

lovasoa commented Feb 25, 2021

willcrichton commented Feb 25, 2021

HeroicKatora commented Feb 25, 2021

willcrichton commented Feb 25, 2021

Shnatsel commented Feb 25, 2021

willcrichton commented Feb 28, 2021 • edited Loading

lovasoa commented Feb 28, 2021 • edited Loading

Shnatsel commented May 13, 2022

Shnatsel commented May 14, 2022

Shnatsel commented Jun 20, 2020 •

edited

Loading

Shnatsel commented Jun 20, 2020 •

edited

Loading

Shnatsel commented Oct 18, 2020 •

edited

Loading

Shnatsel commented Oct 22, 2020 •

edited

Loading

willcrichton commented Feb 28, 2021 •

edited

Loading

lovasoa commented Feb 28, 2021 •

edited

Loading