
Decoding animated WebP is 4x slower than libwebp-sys + webp-animation #119

Open
Shnatsel opened this issue Oct 22, 2024 · 15 comments

@Shnatsel
Contributor

In image v0.25.4 and image-webp v0.2.0, decoding the attached animated WebP is 4x slower than using libwebp-sys + webp-animation: sample.zip

image

```rust
use std::error::Error;
use image::{codecs::webp::WebPDecoder, AnimationDecoder, ImageReader};

fn main() -> Result<(), Box<dyn Error>> {
    let input = std::env::args().nth(1).unwrap();
    let reader = ImageReader::open(input)?.into_inner();
    let decoder = WebPDecoder::new(reader)?;
    let mut iter = decoder.into_frames();
    while let Some(_frame) = iter.next() {}
    Ok(())
}
```

hyperfine results:

```
  Time (mean ± σ):     411.9 ms ±   3.9 ms    [User: 372.2 ms, System: 38.8 ms]
  Range (min … max):   407.6 ms … 416.8 ms    10 runs
```

libwebp-sys + webp-animation

```rust
use std::error::Error;
use webp_animation::prelude::*;

fn main() -> Result<(), Box<dyn Error>> {
    let input = std::env::args().nth(1).unwrap();

    let buffer = std::fs::read(input).unwrap();
    let decoder = Decoder::new(&buffer).unwrap();
    let mut iter = decoder.into_iter();
    while let Some(_frame) = iter.next() {}
    Ok(())
}
```

hyperfine results:

```
  Time (mean ± σ):      95.9 ms ±   0.4 ms    [User: 128.7 ms, System: 7.4 ms]
  Range (min … max):    95.3 ms …  96.7 ms    30 runs
```

Analysis

webp-animation shows a bit of multi-threading on the profile, with user time exceeding total wall-clock time, but even accounting for that, image-webp is 3x slower.

Breakdown of where the time is spent in image, recorded by samply: https://share.firefox.dev/4fc3utg

The greatest contributors seem to be image_webp::vp8::Vp8Decoder::decode_frame (48%), image_webp::extended::do_alpha_blending (20%), image_webp::vp8::Frame::fill_rgba (16%).

Within decode_frame the biggest contributor is image_webp::vp8::Vp8Decoder::read_coefficients (12% self time, 32% total time), and the code of that function looks like it could be optimized further to reduce bounds checks, etc. #71 is also relevant, but only accounts for 20% of the total time.

@Shnatsel
Contributor Author

I looked at the profile and the associated code a bit. The two low-hanging optimization opportunities are:

  1. Applying the YUV->RGB optimization from "Refactor fill_rgb to fill 2 pixels at a time" (#13) to the RGBA codepath as well
  2. The alpha blending currently uses an f64 to blend every u8. No wonder that's slow: it kills vectorization. We should make it operate on integers (u16 might be enough?) and use chunks_exact() so the compiler can vectorize it.
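The second item could look roughly like this. A minimal sketch, assuming a straight src-over-dst blend of a single channel; `blend_channel`, `blend_rows`, and the rounding helper are illustrative names, not image-webp's actual code:

```rust
/// Round-to-nearest division by 255 in plain integer arithmetic.
/// The maximum input is 255 * 255, and 255 * 255 + 127 fits in u16,
/// so no widening is needed.
fn div_by_255_rounding(v: u16) -> u8 {
    ((v + 127) / 255) as u8
}

/// Blend one channel of `src` over `dst` with the given alpha,
/// entirely in u16 arithmetic instead of f64.
fn blend_channel(src: u8, dst: u8, alpha: u8) -> u8 {
    let a = alpha as u16;
    div_by_255_rounding(src as u16 * a + dst as u16 * (255 - a))
}

/// Iterating over fixed-size chunks gives the compiler a chance to
/// auto-vectorize the inner loop.
fn blend_rows(src: &[u8], dst: &mut [u8], alpha: u8) {
    for (s, d) in src.chunks_exact(8).zip(dst.chunks_exact_mut(8)) {
        for i in 0..8 {
            d[i] = blend_channel(s[i], d[i], alpha);
        }
    }
}
```

The `(v + 127) / 255` form still contains a division, so this only sketches the integer data flow; the division-free tricks discussed later in the thread would replace it.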

@Shnatsel
Contributor Author

Paper on fast alpha blending without divisions: https://arxiv.org/pdf/2202.02864

@awxkee

awxkee commented Dec 15, 2024

Paper on fast alpha blending without divisions: https://arxiv.org/pdf/2202.02864

Here is a ready-made method if needed: https://github.com/awxkee/pic-scale-safe/blob/7f1066e006aac5eea4c1474f5bbe81cae837b2b7/src/alpha.rs#L43

Division by alpha, if it is required, is much less straightforward.

@Shnatsel
Contributor Author

I've attempted to optimize alpha blending by performing it in u16 instead of f64. I got the primitives working (rounding integer division both by 255 and by an arbitrary u8), but when I assemble the entire thing from them, it falls apart. Pretty early, too: calculating the resulting alpha is already wrong, and we don't even get to the interesting part of blending the RGB values.

This is what I've got so far: https://github.com/image-rs/image-webp/compare/main...Shnatsel:image-webp:faster-alpha-blending?expand=1

By the way, @awxkee, in your div_by_255 the .min(255) part is completely unnecessary. Instead of outputting 256 it already outputs 0, so you don't need any clamping. My branch includes an exhaustive correctness test for that function; you can try it for yourself. I kept the .min in my branch out of an abundance of caution, but I am convinced that it does nothing.
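For reference, the two primitives look roughly like this. This is my formulation of the general idea, not the exact code from the linked branch:

```rust
/// Round-to-nearest division by an arbitrary nonzero denominator:
/// add half the denominator before dividing.
fn rounding_div(n: u32, d: u32) -> u32 {
    (n + d / 2) / d
}

/// Division-free round(n / 255), exact for n <= 255 * 255, which covers
/// the whole domain alpha blending produces -- hence no .min(255) clamp.
fn div_by_255(n: u16) -> u8 {
    let n = n as u32 + 128;
    ((n + (n >> 8)) >> 8) as u8
}

fn main() {
    // An exhaustive correctness check in the spirit of the one mentioned
    // above: compare against the f64 reference over every u8 product.
    for a in 0u16..=255 {
        for b in 0u16..=255 {
            let n = a * b;
            let reference = (n as f64 / 255.0).round() as u8;
            assert_eq!(div_by_255(n), reference);
        }
    }
}
```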

@awxkee

awxkee commented Dec 18, 2024

min was accidentally added when I was trying to force auto-vectorization. You're correct that the division is exact, so clamping is unnecessary.

I think your rounding_div will be incredibly slow, since it performs two integer divisions at a time.
Maybe it's worth generating a reciprocal table, since the denominator range is small ([1, 255]), and performing Montgomery multiplication instead?
This is what LLVM does when dividing by a constant (https://godbolt.org/z/qnE5E8Eo3): it replaces the division with a multiplication of the form b = (a * R) >> k.
Additionally, both divisions here share the same denominator, so each step would require only one read from the reciprocal table.
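The reciprocal-table idea could be sketched like this. This is my own illustration of multiply-and-shift division by a table of precomputed reciprocals; the names and table layout are assumptions, not the code from the gist:

```rust
const SHIFT: u32 = 24;

/// ceil(2^24 / d). For d in 1..=255 the approximation error e = m*d - 2^24
/// is at most 254, and n * e < 2^24 for every n < 2^16, so
/// (n * m) >> 24 equals n / d exactly over the whole input range.
fn reciprocal(d: u64) -> u64 {
    ((1u64 << SHIFT) + d - 1) / d
}

/// One multiply and one shift replace an integer division.
fn div_by(n: u32, m: u64) -> u32 {
    ((n as u64 * m) >> SHIFT) as u32
}

fn main() {
    // Build the 255-entry table once; every subsequent division by the
    // same denominator reuses a single table read.
    let table: Vec<u64> = (1..=255u64).map(reciprocal).collect();
    for d in 1u32..=255 {
        let m = table[(d - 1) as usize];
        for n in [0u32, 1, 254, 255, 256, 65534, 65535] {
            assert_eq!(div_by(n, m), n / d);
        }
    }
}
```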

@awxkee

awxkee commented Dec 18, 2024

It is also possible to add a special branch for power-of-two denominators, for even faster division:

```rust
fn is_power_of_two(n: u32) -> bool {
    n != 0 && (n & (n - 1)) == 0
}

fn power_of_two(n: u32) -> u32 {
    n.trailing_zeros()
}
```

And then perform `value >> power_of_two(n)`.

@awxkee

awxkee commented Dec 18, 2024

Here is a table to try out: https://gist.github.com/awxkee/b8df92e4f97346fb56ae91dc4dcca779
You can benchmark whether it's worth it on your CPU; I'm actually not sure that it'll be better.

@Shnatsel
Contributor Author

Thank you! I'll benchmark that and dig deeper into the performance of these things once we actually have a working alpha blending routine. Right now I'm not even sure that u16 is enough for alpha blending to work, so I don't want to over-invest in optimizing something I might throw away later.

@Shnatsel
Contributor Author

Okay, I checked how libwebp does it, and they actually do it in u32 rather than u16:

https://github.com/webmproject/libwebp/blob/e4f7a9f0c7c9fbfae1568bc7fa5c94b989b50872/src/demux/anim_decode.c#L215-L267

We should probably just port that.
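In rough outline, the linked routine composites non-premultiplied RGBA in u32 arithmetic. The following is my restatement of that approach for illustration, with my own names, using plain exact division rather than libwebp's approximations; it is not the actual port:

```rust
/// Composite a non-premultiplied RGBA `src` pixel over `dst`.
fn blend_pixel(src: [u8; 4], dst: [u8; 4]) -> [u8; 4] {
    let src_a = src[3] as u32;
    if src_a == 255 {
        return src; // fully opaque source replaces the destination
    }
    let dst_a = dst[3] as u32;
    // Weight of the destination after the source is composited over it.
    let dst_factor = dst_a * (255 - src_a) / 255;
    let out_a = src_a + dst_factor;
    if out_a == 0 {
        return [0, 0, 0, 0];
    }
    let mut out = [0u8; 4];
    for i in 0..3 {
        // Per-channel weighted average; the trailing division by out_a is
        // the expensive step the reciprocal tricks in this thread target.
        out[i] = ((src[i] as u32 * src_a + dst[i] as u32 * dst_factor) / out_a) as u8;
    }
    out[3] = out_a as u8;
    out
}
```

Since out_a can be as small as 1, the final division cannot use a fixed constant, which is why the accuracy-versus-speed trade-offs discussed in this thread arise.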

@Shnatsel
Contributor Author

Shnatsel commented Dec 18, 2024

I've ported the libwebp algorithm. It is really inaccurate at low alpha levels, but nobody is going to notice that anyway. It gives an 8% end-to-end performance boost on this sample.

https://github.com/Shnatsel/image-webp/blob/e9a8a6710db5d4af193312ad4983681068bc5534/src/alpha_blending.rs#L5-L52

My u16 conversion of the f64 algorithm is more accurate than the libwebp blending routine, but it is also a lot slower, even slower than the f64 version. However, the loop it sits in is really not conducive to vectorization. If I can get the compiler to vectorize it and process 8 values at once with vector instructions, that might change.

Realistically though, the libwebp blending is still going to be faster, if we're willing to take the accuracy hit at very low alpha values.

@awxkee

awxkee commented Dec 18, 2024

You could perhaps replace their approximate division by 255 with an exact division? I think that is the main source of inaccuracy.
Division via a reciprocal table should also work nicely here, because you can perform 3 exact divisions with a single table read. I'm not sure about the speed, though; that really needs to be measured.

@Shnatsel
Contributor Author

I turned an assert! into a debug_assert!, and that must have unlocked some huge optimizations, because decoding is now 16% faster end-to-end; the alpha blending function itself must be ~5x faster.

https://github.com/Shnatsel/image-webp/blob/c0cfcb1d2afbb581e8a5b25614d6f9cecfb497d5/src/alpha_blending.rs#L6-L59

I also re-added support for big-endian targets. C doesn't have a [u8; 4] type, so libwebp has to do cursed endianness-dependent bit shifts instead.

@Shnatsel
Contributor Author

@awxkee I've replaced libwebp's division approximation with your div_by_255 and got improved precision without sacrificing performance!

Combined with the image_webp::vp8::Frame::fill_rgba optimization in #122, we're now 27% faster end-to-end on this image compared to where we started.

@awxkee

awxkee commented Dec 19, 2024

Actually, if you're using u32 instead of u16, there is a shorter route for division by 255: (a * 32897) >> 23.
The u16 method makes more sense for SIMD vectors with u16 lanes.

@Shnatsel
Contributor Author

That method results in a less precise approximation of the floating-point division; I'm seeing a greater divergence from the floating-point reference. I believe the trick with the other div_by_255 is that it also rounds to nearest instead of always rounding down.
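To make the floor-versus-round difference concrete, here is my sketch (not code from either branch): the u32 constant computes an exact floor, while biasing by 128 and using the add-and-shift form rounds to nearest, matching the f64 reference.

```rust
/// Exact floor(a / 255) for every a < 2^16: 32897 = ceil(2^23 / 255),
/// and the accumulated error stays below one over this input range.
fn div255_floor(a: u32) -> u32 {
    (a * 32897) >> 23
}

/// Round-to-nearest variant: bias by 128 first, then apply the
/// add-and-shift form; exact for a <= 255 * 255.
fn div255_round(a: u32) -> u32 {
    let a = a + 128;
    (a + (a >> 8)) >> 8
}

fn main() {
    for a in 0u32..65536 {
        assert_eq!(div255_floor(a), a / 255);
    }
    for a in 0u32..=65025 {
        assert_eq!(div255_round(a), ((a as f64) / 255.0).round() as u32);
    }
    // The two diverge whenever the true quotient's fraction is 0.5 or more:
    assert_eq!(div255_floor(128), 0);
    assert_eq!(div255_round(128), 1);
}
```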
